Link to home
Start Free TrialLog in
Avatar of Daftwillie
Daftwillie

asked on

SSIS OuterJoin

Heya,

I'm still a bit new to SSIS and am looking for some advice. I have a datawarehouse that requires data to be moved around. Obviously my first thought is "SSIS is perfect for this" but there are maybe some techniques I need to refine a bit.

What I want to be able to do is load data from one table to the next, but only if that data does not already exist in the target table. I've done something using a lookup transformation and a conditional split, but I wonder if there is a better, more efficient, way of doing it. For my way to work I have to override the errors produced by the lookup when it doesn't find anything.

As an example of what I am doing Imagine the following two tables.

RAW_PEOPLE
FirstName, LastName, Gender, DOB
Bob, Fleming, M, 01/01/1978
Rose, Cutting, F, 31/12/1985
Hugo, First, M, 12/12/2000

STG_PEOPLE
PeopleID, FirstName, LastName, Gender, DOB
1001, Bob, Fleming, M, 01/01/1978

As you can see in the stage table, Bob Fleming is already there, so I don't want to load him, I do want to load the others though. In SQL I would do this like this

INSERT INTO STG_PEOPLE
SELECT rp.FirstName, rp.LastName, rp.Gender, rp.DOB
FROM RAW_PEOPLE rp LEFT OUTER JOIN STG_PEOPLE sp
ON rp.FirstName = sp.FirstName AND rp.LastName = sp.LastName AND rp.DOB = sp.DOB
WHERE sp.PeopleID IS NULL

Is there a nice elegant way of achiving this using SSIS?
ASKER CERTIFIED SOLUTION
Avatar of Guy Hengel [angelIII / a3]
Guy Hengel [angelIII / a3]
Flag of Luxembourg image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Hi,

You can improve the performance if you can sub-select only the changed rows and then join.

Thanks,
Sam
Using the lookup component and use the red output line to link to the destination component in order to insert only new data.
helped?
regards,
Pedro
Well, you can always cheat and do:

INSERT INTO STG_PEOPLE
SELECT rp.FirstName, rp.LastName, rp.Gender, rp.DOB
FROM RAW_PEOPLE rp
WHERE rp.FirstName + rp.LastName NOT IN
(
  SELECT FirstName + LastName
  from STG_PEOPLE
)

But I can't vouch for performance.

However, even in a Data Warehouse, it's acceptable to store unique key values for just this occasion: you want to insert unique rows that do not already exist.  It would be even easier if you could match by PeopleID.
Avatar of Daftwillie
Daftwillie

ASKER

I ended up cheating myself.

The PeopleID is assigned to the data when it moves from RAW to STG. I added a "Metadata" column to the raw table of PeopleID. I then try to update this column first and just load in anything with a NULL PeopleID from RAW. Easier, I just wanted to make sure there wasn't an easy snazzy way of doing it in SSIS.
If you're looking for snazzy and performance, try this:

http://www.codeplex.com/kimballscd

No solution given but pretty much confirmed my suspisions that SSIS doesn;t really deal with this very well. Ended up giving SSIS the elbow and just wrote a large migration script in SQL, was much easier.