Get ONLY updated fields when comparing a HISTORY table to CURRENT Table.

brad2575
brad2575 used Ask the Experts™
on
I have MANY tables in my database and then the corresponding tables in a HISTORY database (that contains records EVERY time a record is updated, not just the most recent update).  I want to be able to run a query that will compare the current table to the HISTORY table and return ONLY the fields/values that have changed, it can return a records set with ALL the fields in it, but only the changed records would have values in them, the rest would be NULL.  The two table structures are almost exactly the same (history table has a few different/more for history tracking purposes but all data fields are the same.

I would like this to be dynamic if possible so I can pass the function/stored procedure a table name and it will return the changed fields ONLY based off the table I pass it.  

I am dealing with MILLIONS of records in the CURRENT table and many MILLIONS in the history table.  I have a few ways I know I can do this but none of them are very efficient when dealing with the number of records (and number of tables) I need to set this up for.

If you need more details on what I am trying to do please let me know.
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Senior Software Developer
Commented:
I use 2 tools by Red Gate (http://www.red-gate.com/) called SQL Compare and SQL Data Compare which does just that

I work on a live, a test and a dev SQL server. When schemas change in dev, I need to get that moved as part of the rolling out of the apps. I also use the tool for data comparison.

It works really well.

Author

Commented:
That looks like it would work great but I can not install anything on the server, it all needs to be done with existing capabilities on the server (SQL, Stored Procedures, and even SSIS packages are fine).

Richard QuadlingSenior Software Developer
Commented:
You don't need to install anything on the server. I use this via my desktops.

Any SPs/etc. WOULD need to be installed on the server - or you use another server to do the work.


But.


SQL Server has the capability of generating a checksum for the row.

So, _something_ like this ...

select table_left.*, table_right.*
from table_left full outer join table_right on table_left.uniqueid = table_right.uniqueid
where checksum(table_left.*) <> checksum(table_right.*) or table_left.uniqueid is null or table_right.uniqueid is null

(UNTESTED)

that MAY work at the row level.

If the row is different, then you could have a row checker by using ...

CHECKUM(table_left.column).


You could use dynamic SQL to build all of this by parsing the schema info db to get all the tables and columns.

If the structure isn't changing, then I would do this task manually (or at least programmatically) to generate the SP.


Success in ‘20 With a Profitable Pricing Strategy

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden using our free interactive tool and use it to determine the right price for your IT services. Start calculating Now!

Author

Commented:
so to clarify, this will give me the list of ROWS that changed by account number.
select table_left.*, table_right.*
from table_left full outer join table_right on table_left.uniqueid = table_right.uniqueid
where checksum(table_left.*) <> checksum(table_right.*) or table_left.uniqueid is null or table_right.uniqueid is null

and then I would need to run this one AFTER the above to get the values that changed for the above rows?
CHECKUM(table_left.column).

Or is there a way to combine the two so that I get all the rows that changed but ONLY return the values that changed (I can return the full data set but all the values that did not change would be null).
Richard QuadlingSenior Software Developer
Commented:
I don't know all the names of the columns in all the tables.

So, you could write a SQL sp, which built a massive sql statement which did all of that for you.

Essentially you have to know which rows are different. If not, you would be testing every single column/row across 2 tables. That's a MASSIVE hit.

I would probably put the row differences into a temp table (RowsDifferent) and then process the columns in pairs (left and right). Each difference would result in a second temp table (ColumnsDifferent).

I'm not fully upto speed on getting table and column names, so I would probably use my scripting language (PHP) to build the SQL statement. I can use PHP to get the table and column names.


So, the query you have at the moment will identify the rows that are different.

One of the issues is not knowing the column names.

Using select table.* is fine for a select, but ...

select table1.*, table2.*....

isn't that good as the result set will have duplicated column names.

You would have to name every column you are interested in ...

select table1.name as [T1_Name], table2.name as [T2_Name],...

sort of thing.

So, immediately you need to hard code the list of names.

So, assuming you've done that and now have a temp table containing all the different rows, your next query will be along the lines of ...

select isnull(T1_UniqueID, T2_UniqueID) as [ID], 'xxx' as [Column], T1_xxx as [Before], T2_xxx as [After]
from #temp_RowsDifferent
where checksum(T1_xxx) <> checksum(T2_xxx) or T1_xxx is null or T2_xxx is null

The xxx is the name of the column from the left/right tables.

You would need one of these statements for every column.

The ID column is the left/rights primary key value - this is how you know which row is being mentioned when you have a difference in a column.

The ColumnsDifferent table could/should/would look like ...

UniqueID INT IDENTITY(1,1) Primary Key Unique
ID -- for me this would always be an integer, but if your primary keys on the left/right tables is not an integer, then this would also need to be amended appropriately.
ColumnName varchar(50) -- the name of the column that is different.
Before varchar(50) -- the value before/left
After varchar(50) -- the value after/right

You could extend this to deal with an insert or a delete (i.e. in left but not right or in right but not left) ...

Action char(1) -- 'I'nsert, 'D'elete, 'U'pdate

and ...

select isnull(T1_UniqueID, T2_UniqueID) as [ID], 'xxx' as [Column], T1_xxx as [Before], T2_xxx as [After], case when T1_xxx is not null and T2_xxx is not null then 'U' when T1_xxx is null then 'D' else 'I' end as [Action]
from #temp_RowsDifferent
where checksum(T1_xxx) <> checksum(T2_xxx) or T1_xxx is null or T2_xxx is null


sort of thing.

All untested and may be buggy, so please check/test.

If you can create some code from this, I'll be happy to help debug, I just don't have the time at the moment to write it. Sorry.

Author

Commented:
This was very helpful.

I will be dealing with MULTI MILLIONS of records to check if there are changes, then MILLIONS of them have had changes.  How would this be on efficiency?

One of the ways I was looking at doing it but seemed very inefficient was to take the current table, join to history table since I last checked (only one row in history table, the oldest one so I can get ALL changes since my last check).  Then just do a case statement on each field something like this:
Case(When NewTableValue <> HistoryTableValue Then NewTableValue else NULL END)

I know the above example would be way more efficient to just get the set of rows that changed but then I would have to do two steps to get the rows AND the colums that changes when the above case would get me both at once.

Do you know what would be more efficient?

Author

Commented:
I also found a query like this:
SELECT SMAILADDR1, SMAILADDR2, SMAILCITY, SMAILPostalCode
FROM MAIN..Debtor
WHere ndebtorID = '1'
EXCEPT
SELECT SMAILADDR1, SMAILADDR2, SMAILCITY, SMAILPostalCode
FROM HIST..DebtorHIST
where ndebtorID = '1'

This gives me the rows and fields that changed all at once but I can only get it to work on 1 account at a time, so if I used this I would have to loop through every account to get the values I need and I KNOW this is not efficiant unless there is a way to do the above on ALL accounts at once?
Richard QuadlingSenior Software Developer
Commented:
http://msdn.microsoft.com/en-us/library/ms188055.aspx

What happens if you just drop the where clauses?

I don' think this will help as it is only the rows that are returned when there is a difference. You will still need to process the result set to know what is actually different.

http://weblogs.sqlteam.com/jeffs/archive/2007/05/02/60194.aspx is a worked example of an SP to compare tables using the EXCEPT facility.

Richard QuadlingSenior Software Developer

Commented:
Important note ....

"Of course, both tables must have primary keys in place; duplicate values in these tables will not make logical sense when trying to determine which rows match or not."

Richard QuadlingSenior Software Developer

Commented:
I think a Grade B PAQ is appropriate here.

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial