Avatar of sifuhall
sifuhall

asked on

I need an efficient way to compare 4 large tables.

There really has to be a better way to do this.  Any help is greatly appreciated.

Currently we receive two very large text files from our client every week.  This is an automated procedure on our client's side and unfortunately they are not willing at all to modify this process and include a unique identifier on either table.

Currently we import these two tables, then drop the previous tables (assuming the import went properly), then index the new tables.  We then perform this at our 6 other offices around the world.

I want to modify this so we import the new tables and then compare them to the old and only send the changes (either an add or a delete) to the clients.

One table contains approx. 11 million records and the other has approx. 24 million records.

There are no unique identifiers on either table.

I have tried 3rd party tools such as Red Gate SQL Data Compare and it requires a unique identifier (and if I had this I could do it myself).

I have also tried the undocumented getchecksum(); however, when I compare the old and new tables with getchecksum(), it shows about 99% of the records have a difference, and this should only be around 10 - 15%.

Any thoughts on this are greatly appreciated!
Avatar of ShogunWade
ShogunWade

You can't rely 100% on a checksum; by its very nature as a hashing algorithm, it is an approximate measure of equality.

Ultimately I think you will need to join the records together and compare.
Avatar of Jay Toops
Can you build a key from some of the data fields?
I.e., is there any combination of data elements that doesn't get changed that would enable you to do this?
Jay
Avatar of sifuhall

ASKER

Here is an example of the smaller table (approx. 11 million records):

Column      Type     Length
EE          varchar  9
EmpGrpNo    varchar  7
EmpLast     varchar  20
EmpFirst    varchar  12
EmpAddr1    varchar  32
EmpAddr2    varchar  32
EmpCity     varchar  21
EmpState    varchar  2
EmpZip1     varchar  5
EmpZip2     varchar  4
EmpDOB      varchar  8
ID          int      4

EE is the SSN of the primary member (which will be repeated for each family member).
EmpGrpNo will be repeated for each person in an employee group.
EmpLast is the last name of the primary member (which will be repeated for each family member).
EmpFirst is the first name of the primary member (which will be repeated for each family member).
EmpAddr1 is the address of the primary member (which will be repeated for each family member).
EmpAddr2 is the second address line of the primary member (which will be repeated for each family member).
EmpCity is the city of the primary member (which will be repeated for each family member).
EmpState is the state of the primary member (which will be repeated for each family member).
EmpZip1 is the ZIP of the primary member (which will be repeated for each family member).
EmpZip2 is the ZIP+4 extension of the primary member (which will be repeated for each family member).
EmpDOB is the date of birth of the primary member (which will be repeated for each family member).
ID is the person identifying number of this member (0 for primary member, 1 for spouse, 2 for a child, etc.).
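
For reference, expressed as DDL that layout would look roughly like this (just a sketch; the table name is made up):

create table EmployeeFeed (
    EE        varchar(9),
    EmpGrpNo  varchar(7),
    EmpLast   varchar(20),
    EmpFirst  varchar(12),
    EmpAddr1  varchar(32),
    EmpAddr2  varchar(32),
    EmpCity   varchar(21),
    EmpState  varchar(2),
    EmpZip1   varchar(5),
    EmpZip2   varchar(4),
    EmpDOB    varchar(8),
    ID        int
)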

So how can I create a statement that identifies if any of the fields have changed?

I really wish we could modify the data sent from the client to indicate when it was last updated, etc., but that is not possible.
OK, this is not too tough, but it will require some CPU crunching.
Your unique KEY IS (drumroll please):
EE + ID

So you can do this:

select t1.* from
table1 t1, table2 t2
where
t1.EE = t2.EE and t1.ID = t2.ID and
t1.EmpGrpNo+t1.EmpLast+t1.EmpFirst+t1.EmpAddr1+t1.EmpAddr2+t1.EmpCity+t1.EmpState+t1.EmpZip1+t1.EmpZip2+t1.EmpDOB <> t2.EmpGrpNo+t2.EmpLast+t2.EmpFirst+t2.EmpAddr1+t2.EmpAddr2+t2.EmpCity+t2.EmpState+t2.EmpZip1+t2.EmpZip2+t2.EmpDOB

This will give you a list of records changed.
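
Along the same lines, the adds and deletes could be pulled with a couple of NOT EXISTS checks on the same key (just a sketch; it assumes table1 is the old load, table2 is the new one, and that EE + ID really is unique):

-- Adds: rows in the new load with no matching key in the old load
select t2.*
  from table2 t2
 where not exists (select 1 from table1 t1
                    where t1.EE = t2.EE and t1.ID = t2.ID)

-- Deletes: rows in the old load with no matching key in the new load
select t1.*
  from table1 t1
 where not exists (select 1 from table2 t2
                    where t2.EE = t1.EE and t2.ID = t1.ID)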

Jay
O.K.

I have faced this problem with feeds of more than 100 million rows per day and developed a solution that should work for you as well.

BTW, you have a unique identifier; it is simply the compound of EE and ID, in that order.

You were also kind of on the right track with getchecksum() but by the time it gets into the database, it is too late.

What you need to do is create a program to pre-process the flat files, adding the CRC-32 value to the end of each record before they are loaded into the database.  If you like, you can have this same program make a connection to the database with DBlib and use the Bulk Copy access directly.  It's certainly not necessary, but it takes an extra step out of the process.

So you have run the files through the pre-processor and loaded them into working tables in the database.  (The permanent tables must also retain the CRC32 field for this to work.)  Next, build a covering index on the working table with the EE, ID, and CRC32 fields.  Then perform a full outer join between the permanent and working tables, stuffing the results into another working table, or better yet, a temporary table.  The select list should include the EE, ID, and CRC32 from the permanent table as well as all the fields from the working table.

Now it gets easy.  Insert from the temporary table into the permanent table (INSERT/SELECT) all of the rows where the permanent.EE field is NULL in the temporary table.  This handles new records.

Next, update (with a correlated subquery) the permanent table with the rows from the temporary table everywhere the CRC32 fields in the temp table are NOT EQUAL.

Finally, delete from the permanent table (with a correlated subquery) everywhere the working.EE IS NULL in the temporary table.
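
In T-SQL, those three steps might look roughly like this (just a sketch: perm stands for the permanent table, #delta for the result of the full outer join with the permanent-side columns prefixed perm_, and joined UPDATE/DELETE is used here in place of the correlated subqueries):

-- 1) New records: nothing on the permanent side of the join
INSERT INTO perm (EE, ID, EmpGrpNo, EmpLast, EmpFirst, EmpAddr1, EmpAddr2,
                  EmpCity, EmpState, EmpZip1, EmpZip2, EmpDOB, CRC32)
SELECT EE, ID, EmpGrpNo, EmpLast, EmpFirst, EmpAddr1, EmpAddr2,
       EmpCity, EmpState, EmpZip1, EmpZip2, EmpDOB, CRC32
  FROM #delta
 WHERE perm_EE IS NULL

-- 2) Changed records: keys matched but the CRC32 values differ
UPDATE p
   SET EmpGrpNo = d.EmpGrpNo, EmpLast = d.EmpLast, EmpFirst = d.EmpFirst,
       EmpAddr1 = d.EmpAddr1, EmpAddr2 = d.EmpAddr2, EmpCity = d.EmpCity,
       EmpState = d.EmpState, EmpZip1 = d.EmpZip1, EmpZip2 = d.EmpZip2,
       EmpDOB = d.EmpDOB, CRC32 = d.CRC32
  FROM perm p
  JOIN #delta d ON d.perm_EE = p.EE AND d.perm_ID = p.ID
 WHERE d.CRC32 <> d.perm_CRC32

-- 3) Deleted records: nothing on the working side of the join
DELETE p
  FROM perm p
  JOIN #delta d ON d.perm_EE = p.EE AND d.perm_ID = p.ID
 WHERE d.EE IS NULL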

You are done.  In fact, you can send the contents of the temporary table to the remote sites and they can do the same processing.

If there is a large percentage of deletes, you may want to look at sending out just the key fields for that operation separately so you aren't transmitting all of the name/address info when you don't need it.  Personally, I would just ZIP the file and ship it as is.  That way you only have one set of logic to deal with at all the sites.

Why CRC32?  Well, it is a 32-bit hash of the record, which means that there are 4 billion possible values.  I believe getchecksum is only 16 bits, which is 65K.  CRC32 (Cyclic Redundancy Check) is used widely in telecommunications and network protocols to assure nothing has been corrupted in transmission.  Also, the formation of the CRC32 is specifically designed to detect relatively small changes in the input string.  Given that skew, you can safely say you will miss fewer than one in 4 billion updates to an existing row.  In your case, you have 35 million rows per week, with 15% changes, and assuming 50% of those are updates as opposed to inserts or deletes, you will miss an update on average once every 29 years!!!

BTW, search the web and you will find various implementations of CRC32 you can scarf.

Of course, if you want to do it brute force, you can skip the CRC32 and do the comparison of each of the fields in the full outer join, but that is really ugly.

Hope that helps,
Bill
Create a table that uses an identity primary key, and a unique composite index on SSN + ID

Key    SSN                  ID
1       555 12 3456      0
2       555 12 3456      1
3       555 12 3456      2
4       999 12 3456      0
5       999 12 3456      1
etc.


Create another table that has the data that is the same for everyone

Address,Group,Addr, etc...
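
A rough sketch of that split (the names are hypothetical, and it assumes SSN + ID can in fact be made unique, which is questioned further down):

-- Key table: surrogate identity key plus the composite business key
create table MemberKey (
    MemberKey  int identity(1,1) not null primary key,
    SSN        varchar(9) not null,
    ID         int        not null
)
create unique index UX_MemberKey on MemberKey (SSN, ID)

-- Data that repeats for the whole family, keyed by the surrogate
create table MemberData (
    MemberKey  int not null references MemberKey (MemberKey),
    EmpGrpNo   varchar(7),
    EmpAddr1   varchar(32),
    EmpAddr2   varchar(32),
    EmpCity    varchar(21),
    EmpState   varchar(2),
    EmpZip1    varchar(5),
    EmpZip2    varchar(4)
)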


Then to find differences, take the entire data set and hash it using md5

When you receive the data, dump it into a temporary table and link based upon the unique key of (SSN + ID) using an inner join;
hash the rest of the row values on the temp table, and if the hashes do not match, then update the record.

Then do an outer join to catch the data that is *new* and just insert those records, computing a new MD5.


MD5 is simple and has been ported to many languages.

To get MD5 for SQL Server as an extended stored procedure, go here:
http://www.codeproject.com/database/xp_md5.asp
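
(If you happen to be on SQL Server 2005 or later, HASHBYTES can compute the MD5 without an extended proc.  A rough sketch of the idea, with hypothetical names old_data / new_stage and a RowHash varbinary(16) column added to both tables:)

-- Compute the row hash on the staged load; the '|' separator keeps
-- adjacent fields from blurring together when concatenated
UPDATE s
   SET RowHash = HASHBYTES('MD5',
         ISNULL(EmpGrpNo,'') + '|' + ISNULL(EmpLast,'') + '|' + ISNULL(EmpFirst,'') + '|' +
         ISNULL(EmpAddr1,'') + '|' + ISNULL(EmpAddr2,'') + '|' + ISNULL(EmpCity,'')  + '|' +
         ISNULL(EmpState,'') + '|' + ISNULL(EmpZip1,'')  + '|' + ISNULL(EmpZip2,'')  + '|' +
         ISNULL(EmpDOB,''))
  FROM new_stage s

-- Changed rows: same key, different hash (inner join)
SELECT s.*
  FROM new_stage s
  JOIN old_data o ON o.EE = s.EE AND o.ID = s.ID
 WHERE o.RowHash <> s.RowHash

-- New rows: key not present in the old data (outer join)
SELECT s.*
  FROM new_stage s
  LEFT JOIN old_data o ON o.EE = s.EE AND o.ID = s.ID
 WHERE o.EE IS NULL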





For large sets, CRC32 may be fast enough for you.

For algorithm comparison, CRC32 is faster to compute but has a higher chance of two values hashing to the same value, since there are only 2^32 possibilities as compared to MD5's 2^128.
I am really thankful for all the replies and am still reading them.  It looks very promising; however, let me state that EE + ID is not unique (but it is close).  Does this change anything?

For example, EE is the number of the primary subscriber, and ID is the person indicator.  ID will be 0 for the primary subscriber and 1 for the spouse, but 2 for every child the person has, etc.

Does this change anything?
O.K., I'll bite.  Why is that not unique?  Everybody has to belong to a primary subscriber and each dependent has a different sequence number.  Assuming the files you are getting are just straight extracts (and that is a pretty good bet), how could EE + ID not be unique?

You could implement the CRC32 algorithm as an XP, but I would do a quick check to see what the overhead is.  I am willing to bet that it is high enough that you might not want to do it that way.

Bill

BTW, g0rath, I addressed that issue in the body of my first message.  I think one missed update every 29 years is an acceptable rate of "error".  I am quite certain that there are other problems with the source and target systems that will introduce errors at a much higher rate than that.  ;-)
Dependents do not have a different sequence number.  The number will be 2 for any and all children.

For example:

EE            ID
12345       0          - Primary subscriber
12345       1          - Spouse
12345       2          - child
12345       2          - a different child
12345       2          - yet another child


etc.
Then you just include FIRST NAME or date of birth with it
EE+ID+FIRSTNAME

jay
EmpLast is the last name of the primary member (which will be repeated for each family member).
EmpFirst is the first name of the primary member (which will be repeated for each family member).
EmpAddr1 is the address of the primary member (which will be repeated for each family member).
EmpAddr2 is the second address line of the primary member (which will be repeated for each family member).
EmpCity is the city of the primary member (which will be repeated for each family member).
EmpState is the state of the primary member (which will be repeated for each family member).
EmpZip1 is the ZIP of the primary member (which will be repeated for each family member).
EmpZip2 is the ZIP+4 extension of the primary member (which will be repeated for each family member).
EmpDOB is the date of birth of the primary member (which will be repeated for each family member).
ID is a very misleading name.  It is actually a TYPE indicator.

Well, in that case, you had better put both the first name and the date of birth on the key.  "Changes" to either of those two fields will wind up being delete/inserts, but the end result will be the same.  This will work out well unless your customer happens to insure George Foreman and his wife has twins!!! ;-)

(OK,  for those of you who don't know, George Foreman, a famous boxer, has six children, all of whom are named George, including his daughters.)
What about twins?  Same DOB.
I don't think I am being very clear.

First name, last name, DOB, etc. all belong to the primary subscriber, and the ID is not incremented per family member.

There really and truly is no unique identifier for this table.  

Really, there isn't.

If we could modify the records sent from the client to have a unique identifier I would not have this problem.

But we cannot.

Does this affect the MD5 suggestion?  I have researched this a bit and it seems like a viable option (if I base the MD5 hash on every field).  I can easily implement this during the import and then simply compare the MD5 hash results.

So if I compare the MD5 hash between the old and new tables and it matches, this should indicate nothing has changed in the record, correct?  If the hash is in the new table and not the old, the record is an add, correct?  And if the hash is in the old table and not in the new, the record has been updated (or deleted outright) and should be deleted (since any updated version will already be included in the adds), correct?
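
Put another way, I am picturing something like this (just a sketch with hypothetical names old_data / new_stage, assuming both tables carry the full-row hash in a RowHash column):

-- Adds (and the new image of updated rows): hashes in the new load but not in the old
select n.*
  from new_stage n
 where not exists (select 1 from old_data o where o.RowHash = n.RowHash)

-- Deletes (and the old image of updated rows): hashes in the old data but not in the new load
select o.*
  from old_data o
 where not exists (select 1 from new_stage n where n.RowHash = o.RowHash)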

Many thanks for all the help with this brainteaser.
That's why you need the NAME AND the DOB.
You should never rely solely on hash algorithms for comparison.  Hashing by its nature is "lossy"; thus two non-identical records can potentially have the same hash value.
If that is truly the case, they are sending you a bucketload of garbage.  Most of the time, you are going to have just a bunch of redundant information that they were too lazy/stupid to cleanse.  How in the heck are you using this stuff?  With no primary key, it is all but worthless for any downstream application.

If I were you, I would try to eliminate the duplicates.  I am willing to bet that at least half the rows are dupes.

If you were on UNIX/LINUX you could just brute force it and do a SORT (on everything but the ID)  with the Unique flag set and you would be done.  Unfortunately, you will have to load these into the database and then eliminate the dupes there.

INSERT INTO working_table2
SELECT EE, EmpLast, .................EmpDOB, MIN(ID)
  FROM working_table
GROUP BY EE, EmpLast, .................EmpDOB

With this, you will always get the employee's address, the spouse's address if it is different, and the children's addresses if they are different from both of the parents.

With no primary key, it is just a dumb list of stuff and there is nothing you can do to make it smarter aside from eliminating the redundant rows.

BTW, does this stuff come in fixed width or delimited?  I bet fixed width given what we can guess about the source for this data.  In that case, after you import the data and eliminate the dupes, BCP it out in delimited form before you ZIP it.  It can't hurt to get rid of the extra spaces.

Bill
Agreeing ... ignore the duplicate records,
because
an update keyed on SSN-ID-DOB will update ALL matching SSN-ID-DOB records,
which, given your table layout, would all be the same anyway.

Jay
Grant300, you are 100% correct that this is crap.  Even more than I mentioned.  The method we use to retrieve these files is horrible as well.  Both tables are sent in a single file, and our developers have a routine that reads each record in and places it in the correct table based on record length.

I estimate only about 15% of the records change weekly, however the entire set is sent each week.

If we could control the data sent from the client it would be so much easier, but we can't (and that is probably why they are our client and not someone else's).

Currently we receive this file and split it into the new tables.  And repeat this process in each of our other sites around the world.

My goal is to only send the records needed to our other sites (and reduce the transmission time as well as processing time).
ASKER CERTIFIED SOLUTION
Avatar of grant300
grant300

I find it extremely difficult to believe that 15% of that data changes on a weekly basis.
It's probably a much lower figure.