Skip to content

Lookup, Join, Merge – When to use

January 20, 2012

There are three different stages in DataStage that can be used to combine two different data sets.  They are lookups, joins and merges.  Each one has it’s benefits and drawbacks.  Below is a synopsis on when to use each stage.

Lookup
  • Does not need to be partitioned or sorted
  • Only one primary input but can have many reference tables
  • All done from memory, thus faster
  • Can have one reject link
  • Use lookups with small amounts of data ( >5 million)
Merge
  • Must be pre-hashed and sorted on merge keys
  • Only one master but can have many updates
  • Can have multiple reject links
  • Use merges with large amounts of data
Join
  • Must be pre-hashed and sorted on merge keys
  • Must have two inputs
  • Cannot have reject links
  • Use joins with large amounts of data
As you can see, each stage has it’s advantages.  Use Lookups when you are dealing with small amounts of data.  The data is cached, so it is stored in memory, which makes it faster.  However, if you are combining large amounts of data, you will see a decline in performance.  When this happens, switch out the Lookup for a join or merge.  Both Join and Merge pretty much do the same thing, except merges can have a reject link.  So if you do no need a reject link, go with a Join.
When you know when to use each stage, it will help your job performance.  When in doubt, just run the same job twice.  Once with a lookup and once with a join/merge and see which stage is better.

From → Stages

Leave a Comment

Leave a comment