Skip to content

Lookup, Join, Merge – When to use

There are three different stages in DataStage that can be used to combine two different data sets.  They are lookups, joins and merges.  Each one has it’s benefits and drawbacks.  Below is a synopsis on when to use each stage.

Lookup
  • Does not need to be partitioned or sorted
  • Only one primary input but can have many reference tables
  • All done from memory, thus faster
  • Can have one reject link
  • Use lookups with small amounts of data ( >5 million)
Merge
  • Must be pre-hashed and sorted on merge keys
  • Only one master but can have many updates
  • Can have multiple reject links
  • Use merges with large amounts of data
Join
  • Must be pre-hashed and sorted on merge keys
  • Must have two inputs
  • Cannot have reject links
  • Use joins with large amounts of data
As you can see, each stage has it’s advantages.  Use Lookups when you are dealing with small amounts of data.  The data is cached, so it is stored in memory, which makes it faster.  However, if you are combining large amounts of data, you will see a decline in performance.  When this happens, switch out the Lookup for a join or merge.  Both Join and Merge pretty much do the same thing, except merges can have a reject link.  So if you do no need a reject link, go with a Join.
When you know when to use each stage, it will help your job performance.  When in doubt, just run the same job twice.  Once with a lookup and once with a join/merge and see which stage is better.

Transformer Trick

A quick time-saving trick I learned a whole back has to do with the transformer stage.  When you passing five or 500 columns, this will be helpful.

When you need to apply a function or formula to many or all the columns you are passing through a transformer, there is a tool called ‘Derivation Substitution’  To get here, select all of the columns you need to transform.  Right-click on a selected column derivation inside the transformer and a sub-menu will pop up.  Click on ‘Derivation Substitution…’

Here you can insert any function or formula you need to apply to your selected columns.  For instance, it is good practice in most occasions to trim and handle any nulls before you insert into your target.  Say all my columns are strings, so I would want to apply the Trim and NullToEmpty functions to all of my selected columns.  But if I have 100 columns, this could take a long time doing this one by one.  Instead, select all my desired columns, and go to the ‘Substitute Derivation’ menu.

Once there, I type in  Trim(NullToEmpty($1))  in the text box.  The $1 will replace with whatever the column you are passing in your transformer.

Take a look at the before and after images below.  The first image is before I applied the functions.  The second image is after I applied them.  You can see that the $1 replaces with the link_name.column_name.

 

 

Hopefully this little tick will help you save some time the next time you are developing.

Welcome to datastagedude’s Blog About DataStage

Hi all.  This is the first entry of this blog.  This blog is intended to give my experiences on working with IBM’s tool, Infosphere DataStage.  It can range from all sorts of topics ranging from my personal experiences, debugging errors, tips, tricks and everything under the sun.

Let me first begin by giving a little background about myself.  I have been working with DataStage since August 2010  (hence, I am no expertby any means).  I am working for an IT consulting firm in Ohio and have been working there since July 2010, pretty much straight out of graduating college.  My first gig was a developer for a major medical company.  I was not doing data warehousing, just custom jobs.  From there, I went to being a Senior Consultant, where another senior and I were in charge of a team of three developers on a data warehouse project for a school district.  I am currently still on this project, but everyone besides myself have rolled off since October 2011.  I became DataStage v8.0 certified in December 2010, but I am currently working with DataStage v8.5 on this project.

I used multiple sites for my DataStage information, for example, http://www.dsxchange.com, http://www.dstagebox.com, and http://www.ittoolbox.com.  All these websites are fantastic sources of DataStage related info and I highly recommend them all.

I will try and post as frequently as possible, so if you read the blog and like it, just leave a quick comment.  Even if you don’t, I’d appreciate the feedback either way.

Follow

Get every new post delivered to your Inbox.