Solved

I need an efficient way to compare 4 large tables.

Posted on 2004-09-07
22
201 Views
Last Modified: 2008-02-01
There really has to be a better way to do this.  Any help is greatly appreciated.

Currently we receive two very large text files from our client every week.  This is an automated procedure on our client's side and unfortunately they are not willing at all to modify this process and include a unique identifier on either table.

Currently we import these two tables, then drop the previous tables (assuming the import went properly), then index the new tables.  We then repeat this at our 6 other offices around the world.

I want to modify this so we import the new tables and then compare them to the old and only send the changes (either an add or a delete) to the clients.

One table contains approx. 11 million records and the other has approx. 24 million records.

There are no unique identifiers on either table.

I have tried 3rd-party tools such as Red Gate SQL Data Compare, but it requires a unique identifier (and if I had that I could do it myself).

I have also tried the undocumented getchecksum(), but when I compare the old and new tables with it, about 99% of the records show a difference, when it should only be around 10 - 15%.

Any thoughts on this are greatly appreciated!
0
Comment
Question by:sifuhall
22 Comments
 
LVL 18

Expert Comment

by:ShogunWade
ID: 11997520
You can't rely 100% on a checksum; by its very nature as a hashing algorithm, it is an approximate measure of equality.

Ultimately I think you will need to join the records together and compare them.
0
 
LVL 10

Expert Comment

by:Jay Toops
ID: 11997572
Can you build a key from some of the data fields?
I.e., is there any combination of data elements that doesn't change that would enable you to perform this?
Jay
0
 
LVL 1

Author Comment

by:sifuhall
ID: 11997726
Here is an example of the smaller table (approx. 11 million records):

EE          varchar(9)
EmpGrpNo    varchar(7)
EmpLast     varchar(20)
EmpFirst    varchar(12)
EmpAddr1    varchar(32)
EmpAddr2    varchar(32)
EmpCity     varchar(21)
EmpState    varchar(2)
EmpZip1     varchar(5)
EmpZip2     varchar(4)
EmpDOB      varchar(8)
ID          int

EE is the SSN of the primary member (which will be repeated for each family member).
EmpGrpNo will be repeated for each person in an employee group
EmpLast is the last name of the primary member (which will be repeated for each family member)
EmpFirst, EmpAddr1, EmpAddr2, EmpCity, EmpState, EmpZip1, EmpZip2, and EmpDOB are likewise the primary member's first name, address lines, city, state, ZIP, and date of birth (all repeated for each family member)
ID is the person Identifying number of this member (0 for primary member, 1 for spouse, 2 for a child, etc.)

So how can I create a statement that identifies if any of the fields have changed?

I really wish we could modify the data sent from the client to indicate when it was last updated, etc., but that is not possible.
0
 
LVL 10

Expert Comment

by:Jay Toops
ID: 11998028
OK, this is not too tough, but it will require some CPU crunching.
Your unique KEY is (drumroll please):
EE + ID

So you can do this:

select t1.*
from table1 t1, table2 t2
where t1.EE = t2.EE and t1.ID = t2.ID
  -- wrap each column in isnull(col, '') if any of them can be NULL,
  -- otherwise the concatenations below go NULL and those rows are missed
  and t1.EmpGrpNo+t1.EmpLast+t1.EmpFirst+t1.EmpAddr1+t1.EmpAddr2+t1.EmpCity+t1.EmpState+t1.EmpZip1+t1.EmpZip2+t1.EmpDOB
   <> t2.EmpGrpNo+t2.EmpLast+t2.EmpFirst+t2.EmpAddr1+t2.EmpAddr2+t2.EmpCity+t2.EmpState+t2.EmpZip1+t2.EmpZip2+t2.EmpDOB

This will give you a list of records changed.
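
To pick up adds and deletes as well (not just changed rows), something along these lines should work too.  This is just a sketch, assuming EE + ID really is unique, and that table1 is last week's data and table2 is the new load:

-- rows in the new load whose key is not in the old table (adds)
select t2.*
from table2 t2
where not exists (select 1 from table1 t1
                  where t1.EE = t2.EE and t1.ID = t2.ID)

-- rows in the old table whose key is not in the new load (deletes)
select t1.*
from table1 t1
where not exists (select 1 from table2 t2
                  where t2.EE = t1.EE and t2.ID = t1.ID)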

Jay
0
 
LVL 19

Expert Comment

by:grant300
ID: 11998210
O.K.

I have faced this problem with feeds of more than 100 million rows per day and developed a solution that should work for you as well.

BTW, you have a unique identifier; it is simply the compound of EE and ID, in that order.

You were also kind of on the right track with getchecksum() but by the time it gets into the database, it is too late.

What you need to do is create a program to pre-process the flat files that adds the CRC-32 value to the end of the record before they are loaded into the database.  If you like, you can have this same program make a connection to the database with DBlib and use the Bulk Copy access directly.  Certainly not necessary but it takes an extra step out of the process.

So you have run the files through the pre-processor and loaded them into working tables in the database.  (The permanent tables also must retain the CRC32 field for this to work.)  Next, build a covering index on the working table with the EE, ID, and CRC32 fields.  Then perform a full outer join between the permanent and working tables, stuffing the results into another working table, or better yet, a temporary table.  The select list should include the EE, ID, and CRC32 from the permanent table as well as all the fields from the working table.

Now it gets easy.  Insert from the temporary table into the permanent table (INSERT/SELECT) all of the rows where the permanent EE field is NULL in the temporary table.  This handles new records.

Next, update (with a correlated subquery) the permanent table with the rows from the temporary table everywhere the CRC32 fields in the temp table are NOT EQUAL.

Finally, delete from the permanent table (with a correlated subquery) everywhere the working EE IS NULL in the temporary table.
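
A rough T-SQL sketch of those steps, assuming EE + ID really is unique and using placeholder names throughout (working_table, permanent_table, a pre-computed crc32 column, and a #delta temp table are all assumptions, not your actual schema):

-- 1) full outer join old vs. new on the key, keeping both CRC values
select n.EE as new_EE, n.ID as new_ID, n.crc32 as new_crc,
       p.EE as old_EE, p.ID as old_ID, p.crc32 as old_crc,
       n.EmpGrpNo, n.EmpLast, n.EmpFirst, n.EmpAddr1, n.EmpAddr2,
       n.EmpCity, n.EmpState, n.EmpZip1, n.EmpZip2, n.EmpDOB
into   #delta
from   working_table n
full outer join permanent_table p
       on p.EE = n.EE and p.ID = n.ID

-- 2) inserts: keys present only in the new load
insert into permanent_table
       (EE, ID, EmpGrpNo, EmpLast, EmpFirst, EmpAddr1, EmpAddr2,
        EmpCity, EmpState, EmpZip1, EmpZip2, EmpDOB, crc32)
select new_EE, new_ID, EmpGrpNo, EmpLast, EmpFirst, EmpAddr1, EmpAddr2,
       EmpCity, EmpState, EmpZip1, EmpZip2, EmpDOB, new_crc
from   #delta
where  old_EE is null

-- 3) updates: key in both tables but the CRC differs
update p
set    EmpGrpNo = d.EmpGrpNo, EmpLast = d.EmpLast, EmpFirst = d.EmpFirst,
       EmpAddr1 = d.EmpAddr1, EmpAddr2 = d.EmpAddr2, EmpCity = d.EmpCity,
       EmpState = d.EmpState, EmpZip1 = d.EmpZip1, EmpZip2 = d.EmpZip2,
       EmpDOB = d.EmpDOB, crc32 = d.new_crc
from   permanent_table p
join   #delta d on d.new_EE = p.EE and d.new_ID = p.ID
where  d.old_EE is not null
  and  d.new_crc <> d.old_crc

-- 4) deletes: keys present only in the old table
delete p
from   permanent_table p
join   #delta d on d.old_EE = p.EE and d.old_ID = p.ID
where  d.new_EE is null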

You are done.  In fact, you can send the contents of the temporary table to the remote sites and they can do the same processing.

If there is a large percentage of deletes, you may want to look at sending out just the key fields for that operation separately so you aren't transmitting all of the name/address info when you don't need it.  Personally, I would just ZIP the file and ship it as is.  That way you only have one set of logic to deal with at all the sites.

Why CRC32?  Well, it is a 32-bit hash of the record, which means there are 4 billion possible values.  I believe getchecksum is only 16 bits, which is 65K.  CRC32 (Cyclic Redundancy Check) is used widely in telecommunications and network protocols to assure nothing has been corrupted in transmission.  Also, the formation of the CRC32 is specifically designed to detect relatively small changes in the input string.  Given those properties, you can safely say you will miss fewer than one in 4 billion updates to an existing row.  In your case, you have 35 million rows per week; with 15% changes, and assuming 50% of those are updates as opposed to inserts or deletes, you will miss an update on average once every 29 years!!!
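
As a rough sanity check on that figure (back-of-the-envelope, using the numbers above):

    35,000,000 rows x 15% changed x 50% updates  =  ~2.6 million updates per week
    2.6 million / 2^32                           =  ~0.0006 missed updates per week
    1 / 0.0006                                   =  ~1,600 weeks, i.e. roughly 30 years between misses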

BTW, search the web and you will find various implementations of CRC32 you can scarf.

Of course, if you want to do it brute force, you can skip the CRC32 and do the comparison of each of the fields in the full outer join, but that is really ugly.

Hope that helps,
Bill
0
 
LVL 5

Expert Comment

by:g0rath
ID: 11998246
Create a table that uses an identity primary key, and a unique composite index on SSN + ID

Key    SSN                  ID
1       555 12 3456      0
2       555 12 3456      1
3       555 12 3456      2
4       999 12 3456      0
5       999 12 3456      1
etc.


Create another table that has the data that is the same for everyone

Address,Group,Addr, etc...


Then to find differences, take the entire data set and hash it using md5

When you receive the data, dump it into a temporary table, link based upon the unique key of (ssn+id) using an inner join,
hash the rest of the row values on the temp table, and if the hashes do not match then update the record.

Then do an outer join to catch the data that is *new* and just insert those records, computing a new md5.


md5 is simple and is ported to many languages.

To get md5 for sql server as an extend stored procedure go here
http://www.codeproject.com/database/xp_md5.asp
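
A simplified sketch of just those two comparison joins, assuming the per-row MD5 is computed at load time into a column called row_md5 (#new_load, prod, and row_md5 are placeholder names for illustration):

-- keys that match but whose row hash changed (candidates for update)
select t.SSN, t.ID
from   #new_load t
join   prod p on p.SSN = t.SSN and p.ID = t.ID
where  p.row_md5 <> t.row_md5

-- keys present only in the new load (brand-new rows to insert)
select t.SSN, t.ID
from   #new_load t
left join prod p on p.SSN = t.SSN and p.ID = t.ID
where  p.SSN is null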





0
 
LVL 5

Expert Comment

by:g0rath
ID: 11998283
For large sets, CRC32 may be fast enough for you.

For an algorithm comparison: CRC32 is faster to compute but has a higher chance of two values hashing to the same value, since it has only 2^32 possible values compared to MD5's 2^128.
0
 
LVL 1

Author Comment

by:sifuhall
ID: 11998359
I am really thankful for all the replies and am still reading them.  It looks very promising; however, let me state that EE + ID is not unique (but it is close).  Does this change anything?

For example, EE is the number of the primary subscriber, and ID is the person indicator.  ID will be 0 for the primary subscriber and 1 for the spouse, but 2 for every child the person has, etc.

Does this change anything?
0
 
LVL 19

Expert Comment

by:grant300
ID: 11998493
O.K., I'll bite.  Why is that not unique?  Everybody has to belong to a primary subscriber and each dependent has a different sequence number.  Assuming the files you are getting are just straight extracts (and that is a pretty good bet), how could the EE + ID not be unique?

You could implement the CRC32 algorithm as an XP, but I would do a quick check to see what the overhead is.  I am willing to bet that it is high enough that you might not want to do it that way.

Bill

BTW, g0rath, I addressed that issue in the body of my first message.  I think one missed update every 29 years is an acceptable rate of "error".  I am quite certain that there are other problems with the source and target systems that will introduce errors at a much higher rate than that.  ;-)
0
 
LVL 1

Author Comment

by:sifuhall
ID: 11998570
Dependents do not have a different sequence number.  The number will be 2 for any and all children.

For example:

EE            ID
12345       0          - Primary subscriber
12345       1          - Spouse
12345       2          - child
12345       2          - a different child
12345       2          - yet another child


etc.
0
 
LVL 10

Expert Comment

by:Jay Toops
ID: 11998615
Then you just include FIRST NAME or date of birth with it
EE+ID+FIRSTNAME

jay
0
 
LVL 1

Author Comment

by:sifuhall
ID: 11998658
EmpLast is the last name of the primary member (which will be repeated for each family member).
EmpFirst, EmpAddr1, EmpAddr2, EmpCity, EmpState, EmpZip1, EmpZip2, and EmpDOB likewise all belong to the primary member and are repeated for each family member.
0
 
LVL 19

Expert Comment

by:grant300
ID: 11998873
ID is a very misleading name.  It is actually a TYPE indicator.

Well, in that case, you had better put both the first name and the date of birth in the key.  "Changes" to either of those two fields will wind up being delete/inserts, but the end result will be the same.  This will work out well unless your customer happens to insure George Foreman and his wife has twins!!! ;-)

(OK,  for those of you who don't know, George Foreman, a famous boxer, has six children, all of whom are named George, including his daughters.)
0
 
LVL 10

Expert Comment

by:Jay Toops
ID: 11998909
What about twins? Same DOB.
0
 
LVL 1

Author Comment

by:sifuhall
ID: 11998964
I don't think I am being very clear.

First name, last name, DOB, etc. all belong to the primary subscriber, and the ID is not incremented per family member.

There really and truly is no unique identifier for this table.  

Really, there isn't.

If we could modify the records sent from the client to have a unique identifier I would not have this problem.

But we cannot.

Does this affect the md5 suggestion?  I have researched this a bit and it seems like a viable option (if I base the md5 hash on every field).  I can easily implement this during the import and then simply compare the md5 hash results.

So if I compare the md5 hash with the old and new tables and it matches this should indicate nothing has changed in the record, correct?  If the hash is in the new table and not the old this record is an add, correct?  And, if the hash is in the old table and not in the new it means this record has been updated and should also be deleted (since the updated version will be already included in the adds), correct?
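
In SQL terms I'm picturing something like this (just a sketch; new_table, old_table, and row_md5 are placeholders for the new load, the old table, and the hash column computed at import):

-- hashes in the new table but not in the old: adds
select n.*
from   new_table n
where  not exists (select 1 from old_table o where o.row_md5 = n.row_md5)

-- hashes in the old table but not in the new: deletes (or updated rows)
select o.*
from   old_table o
where  not exists (select 1 from new_table n where n.row_md5 = o.row_md5)

-- caveat: exact duplicate rows share the same hash, so a change in the
-- number of copies of a duplicated row will not show up here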

Many thanks for all the help with this brainteaser.
0
 
LVL 19

Expert Comment

by:grant300
ID: 11998975
That's why you need the NAME AND the DOB.
0
 
LVL 18

Expert Comment

by:ShogunWade
ID: 11999032
You should never rely solely on hash algorithms for comparison.  Hashing by its nature is "lossy"; thus two non-identical records can potentially have the same hash value.
0
 
LVL 19

Expert Comment

by:grant300
ID: 11999150
If that is truly the case, they are sending you a bucketload of garbage.  Most of the time, you are going to have just a bunch of redundant information that they were too lazy/stupid to cleanse.  How in the heck are you using this stuff?  With no primary key, it is all but worthless for any downstream application.

If I were you, I would try to eliminate the duplicates.  I am willing to bet that at least half the rows are dupes.

If you were on UNIX/LINUX you could just brute force it and do a SORT (on everything but the ID)  with the Unique flag set and you would be done.  Unfortunately, you will have to load these into the database and then eliminate the dupes there.

INSERT INTO working_table2
SELECT EE, EmpLast, .................EmpDOB, MIN(ID)
  FROM working_table
GROUP BY EE, EmpLast, .................EmpDOB

With this, you will always get the employee's address, the spouse's address if it is different, and the children's addresses if they are different from both of the parents'.

With no primary key, it is just a dumb list of stuff and there is nothing you can do to make it smarter aside from eliminating the redundant rows.

BTW, does this stuff come in fixed width or delimited?  I bet fixed width given what we can guess about the source for this data.  In that case, after you import the data and eliminate the dupes, BCP it out in delimited form before you ZIP it.  It can't hurt to get rid of the extra spaces.

Bill
0
 
LVL 10

Expert Comment

by:Jay Toops
ID: 11999585
Agreeing .. to IGNORE duplicate records,
because an update keyed on SSN-ID-DOB will update ALL SSN-ID-DOB records,
which given your table layout would all be the same anyway.

Jay
0
 
LVL 1

Author Comment

by:sifuhall
ID: 11999646
Grant300, you are 100% correct that this is crap.  Even more than I mentioned.  The method we use to retrieve these files is horrible as well.  Both tables are sent in a single file, and our developers have a routine that reads each record in and places it in the correct table based on record length.

I estimate only about 15% of the records change weekly, however the entire set is sent each week.

If we could control the data sent from the client it would be so much easier, but we can't (and that is probably why they are our client and not someone else's).

Currently we receive this file and split it into the new tables.  And repeat this process in each of our other sites around the world.

My goal is to only send the records needed to our other sites (and reduce the transmission time as well as processing time).
0
 
LVL 19

Accepted Solution

by:
grant300 earned 500 total points
ID: 11999995
You can try using the MD5 "hash" as the unique record identifier but, of course, you will be limited to Inserts and Deletes.

The process would be:
1) pre-process the file with the feed splitter program which has been modified to calculate and append the MD5 hash for each row
2) load that file into a working table
3) eliminate the duplicates with something like
       SELECT MAX(EE), MAX(EmpLast), ......MAX(EmpDOB), MIN(ID), MD5_VALUE
         FROM workingtable
        GROUP BY MD5_VALUE
4) do a full outer join between the working table and the production table, stuffing the rows that differ into another working table, with something like
       INSERT INTO workingtable2
       SELECT W.EE, W.EmpLast, ......W.EmpDOB, W.ID,
              W.MD5_VALUE AS W_MD5_VALUE,
              P.MD5_VALUE AS P_MD5_VALUE
         FROM workingtable W
         FULL OUTER JOIN productiontable P
              ON W.MD5_VALUE = P.MD5_VALUE
        WHERE W.MD5_VALUE IS NULL      -- in production only (a delete)
           OR P.MD5_VALUE IS NULL      -- in the new load only (an insert)

5) do the DELETES with something like
       DELETE P
         FROM productiontable P, workingtable2 W
        WHERE P.MD5_VALUE = W.P_MD5_VALUE
          AND W.W_MD5_VALUE IS NULL
6) do the INSERTS with
       INSERT INTO productiontable
       SELECT W2.EE, W2.EmpLast, ......W2.EmpDOB, W2.ID, W2.W_MD5_VALUE
         FROM workingtable2 W2
        WHERE W2.P_MD5_VALUE IS NULL

Now, instead of shipping all of the flat file data, you need only extract (using BCP) workingtable2 to a delimited flat file, ZIP it up, and send it to each of your sites around the world.  They can then load it and do steps 5 and 6.

That is the least amount of work you can do given what you have.

Bill
0
 
LVL 10

Expert Comment

by:Jay Toops
ID: 12000604
I find it extremely difficult to believe that 15% of that data changes on a weekly basis; it's probably a much lower figure.
0
