Link to home
Start Free TrialLog in
Avatar of Marcus Aurelius
Marcus AureliusFlag for United States of America

asked on

SSIS how to COMPARE a data column from different servers?

Experts,

I have a SSIS requirement to VERIFY MemberIDs from one Server with MemberIDs on another Server. I need to create a matched list and an unmatched list.

I need to compare data from these 2 tables like this:

select memberID from server1.dbo.table1 t1

select memberID from server2.dbo.table2 t2

Compare SQL:

select t2.memberID from server2.dbo.table2 t2
where t2.memberID in
(select t1.memberID from server1.dbo.table1 t1)

---------

I believe I'll need to use a VARIABLE to capture the memberIDs from Table1 and use this variable in the compare SQL in order to do the comparison.

Is this the best way or are there better ways to handle this within SSIS?

Thanks
M
ASKER CERTIFIED SOLUTION
Avatar of Nakul Vachhrajani
Nakul Vachhrajani
Flag of India image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of Marcus Aurelius

ASKER

I'm attempting to use LOOKUP transform. But the CACHE is never populated with lookup data??

What could be the issue?

When I run my simple query in SSMS it pulls in 800K rows of data. However when I try to use this same query in the LOOKUP transform I don't get any data and so all records show as being NOT MATCHED...which I know is not true.

(Note: I even tried using the actual TABLE source to pull in the lookup data, still no data populated in cache...???)
How are you checking whether data has been cached or not? Can you post a screenshot of how you have configured the LookUp transformation task (especially, the match keys)?
I now have the missing data corrected,..it was an ACCESS ISSUE!

HOWEVER, I still need assistance:

Can I place LOGIC within the LOOKUP?

I need to match on the MEMBERID from table1 but also need to compare a DATE field from Server1.Table1 to 2 Date fields in Server2.Table1.

I need to compare like this:

Server1.Table1.MemberID = Server2.Table1.MemberID
AND
Server1.Table1.EncounterDate between Server2.Table1.StartDate and Server2.Table1.EndDate

If above is correct the this is a MATCH else NOT MATCHED

What is best way to accomplish this logic within SSIS..??
I am evaluating how to use a LookUp tranform for range validations. Typically, LookUps are for getting the values of an enumeration or quickly validating whether an Id exists in a cached source or not.

My knee-jerk reaction would be to use the fastest way, which in this case would be the option #3 I proposed earlier. In this option, you will need to get the data from both sources into temporary tables and then use a simple T-SQL LEFT OUTER JOIN. If you get a not-null value in the columns for Server2.Table1, it's a match else it's not.

If you do not want to use T-SQL, the other (purely SSIS) way is option #2 (using sort and Merge operators). The bottleneck in this is the sort operator which will stop the data pipeline untill it gets all records from the source so that it can sort the data. Hence, this method can work well, but only when you have sorted inputs. If you do not have sorted inputs, this actually may not perform as expected.
Here's how to achieve the requirement using the Look Up transformation:

Option 01:
1. Perform the Look up only on MemberId - Those not matching on the MemberId can straight-away go to the no-match output whereas those matching on the MemberId should "go down" the data pipeline
2. Connect a "Derived Column" transformation to the LookUp Match Output
3. In the Derived Column, add two columns:
    a. Column - StartDateValidation: Server1.Table1.EncounterDate >= Server2.Table1.StartDate
    b. Column - EndDateValidation: Server1.Table1.EncounterDate <= Server2.Table1.EndDate
4. Connect a Conditional Split to the Derived Column transformation
    a. Condition to check: StartDateValidation == TRUE && EndDateValidation == TRUE
    (You can name this output as "Match Output")
5. The "Match Output" from the Conditional Split can now go to your subsequent destination for matched records
6. The conditional split default output can go to your "No-match" output logic

Option 02 (Not recommended due to extremely poor performance):
1. You can use the "Partial Cache" or "No Cache" modes of the LookUp transformation.
2. Map both the MemberId to the LookUp source and EncounterDate to the StartDate from the lookup source
3. In the "Advanced" tab of the LookUp transform, a query can be used as under:
    SELECT *
    FROM (SELECT MemberId,
                               StartDate,
                               EndDate
                FROM Server2.Table1) AS [refTable]
                WHERE [refTable].MemberId = ?
                       AND ? BETWEEN [refTable].StartDate AND [refTable].EndDate
4. Click on the Parameters button and map Parameter0 to the MemberId from the input and Parameter1 to the EncounterDate
5. Continue with processing of match & no-match output as required

The drawback here is that because we need to use Partial Cache or No Cache modes, a lot of round-trips to the lookup cache are required and a large number of buffers are also used, degrading performance terribly (I could visually see a huge performance hit in option #2 over option #1 with a data-set as small as 121,000 rows and a lookup of 500 rows).