Link to home
Create AccountLog in
Avatar of hmra
hmra

asked on

Compare data between two tables

I have the same 10 tables in two different databases. How can I make sure that the data within the tables is the same for each table? Some of the tables have more than 10 million records.

Thank you
Avatar of Éric Moreau
Éric Moreau
Flag of Canada image

Avatar of hmra
hmra

ASKER

Yes the Data compare from Redgate can help but the that tool is extremely slow when handling millions of records.

I need something more efficient
ASKER CERTIFIED SOLUTION
Avatar of sventhan
sventhan
Flag of United States of America image

Link to home
membership
Create an account to see this answer
Signing up is free. No credit card required.
Create Account
Are these updatable tables?

Can you create views of these tables?

How much detail do you need for unequal tables?

We know that some of the tables have 10M rows.  How wide (columns) are these tables?

What is the data type of key columns?

What are the data types of non-key columns?

===========
Do I understand the problem correctly?
two databases: D1, D2
ten tables in each of these databases: A,B,C,D,E,F,G,H,I,J

You need to know that:
D1.A=D2.A
D1.B=D2.B
D1.C=D2.C
D1.D=D2.D
D1.E=D2.E
D1.F=D2.F
D1.G=D2.G
D1.H=D2.H
D1.I=D2.I
D1.J=D2.J

* Same number of rows
* Same key values
* Same non-key values for matched rows
Another question...
Can you partition these two sets of tables into their own table space?
Avatar of hmra

ASKER

I cannot update the table unless I move them into another database.

Yes I can create Views

I just need to know if the data being loaded into the tables are the same and if there is different I need to be able where the difference is.

Some of the tables have more than 50 columns, the data types are int, numeric or char

Key column is int

Data type on none key columns are numeric or char

this assumption is correct:

Do I understand the problem correctly?
two databases: D1, D2
ten tables in each of these databases: A,B,C,D,E,F,G,H,I,J

You need to know that:
D1.A=D2.A
D1.B=D2.B
D1.C=D2.C
D1.D=D2.D
D1.E=D2.E
D1.F=D2.F
D1.G=D2.G
D1.H=D2.H
D1.I=D2.I
D1.J=D2.J

* Same number of rows
* Same key values
* Same non-key values for matched rows


--This question I donot know.

Can you partition these two sets of tables into their own table space?
@hmra

Did you try my solution in the above post? You can compare 2 tables at a time. Did it help in performance?
Avatar of hmra

ASKER

Svethan;

I did test it and seems to work but could you please explain to me what the query is doing?

Thank you
Avatar of hmra

ASKER

sventhan:

It is there any way that I can get only the columns with different records?

Okay. This SQL has 3 parts.

Part 1
--------

This will list all the DATA exists in tableA but not in tableB

SELECT * FROM
(
SELECT * FROM tablea
WHERE EXISTS ( SELECT 1 FROM TABLEb WHERE TABLEA.primarykey = TABLEB.primarykey)
EXCEPT
SELECT * FROM tableb
WHERE EXISTS ( SELECT 1 FROM TABLEa WHERE TABLEA.primarykey = TABLEB.primarykey)
) temp1

Part 2
--------

This does just the opposite of Part 1 which is list all the DATA exists in tableB but not in tableA

SELECT * FROM
(
SELECT * FROM tableb
WHERE EXISTS ( SELECT 1 FROM TABLEa WHERE TABLEA.primarykey = TABLEB.primarykey)
EXCEPT
SELECT * FROM tablea
WHERE EXISTS ( SELECT 1 FROM TABLEb WHERE TABLEA.primarykey = TABLEB.primarykey)
)temp2


Part 3
--------

combine(UNION) both part1 and part2 and return the result.

The original author of this SQL is rrjegan (an expert here in EE).


check out the comments from matthewspatrick

https://www.experts-exchange.com/questions/26459011/Difference-in-all-fields-between-2-tables.html?sfQueryTermInfo=1+10+123456+30+a.col2+b.col2+coalesc+where
CREATE TABLE TABLEA (primarykey int, Col2 int, Col3 int, Col4 int, Col5 varchar(50), Col6 varchar(50), Col7 varchar(50), Col8 datetime, Col9 datetime, Col10 datetime)
CREATE TABLE TABLEB (primarykey int, Col2 int, Col3 int, Col4 int, Col5 varchar(50), Col6 varchar(50), Col7 varchar(50), Col8 datetime, Col9 datetime, Col10 datetime)

INSERT INTO TABLEA VALUES (1, 1, 4, 3, 'a', 'b', 'c', '2010-05-05', '2010-02-13', '2010-07-21')
INSERT INTO TABLEA VALUES (2, 4, 5, 1, 'a', 'b', 'c', '2010-07-25', '2010-07-19', NULL)
INSERT INTO TABLEA VALUES (3, NULL, NULL, 2, 'a', 'b', 'c', NULL, '2010-03-03', '2010-04-25')
INSERT INTO TABLEA VALUES (4, 3, 3, 2, 'a', 'b', 'c', '2010-01-01', '2010-07-10', '2010-01-03')
INSERT INTO TABLEA VALUES (5, 2, 4, 5, 'a', 'b', 'c', '2010-05-27', '2010-02-25', '2010-06-05')
INSERT INTO TABLEA VALUES (6, 5, 3, 1, 'a', 'b', 'c', '2010-05-12', '2010-01-30', '2010-08-03')
INSERT INTO TABLEA VALUES (7, 2, 5, 2, 'a', 'b', 'c', '2010-08-04', '2010-06-19', '2010-06-20')
INSERT INTO TABLEA VALUES (8, 1, 2, 2, 'a', 'b', 'c', '2010-07-26', '2010-03-16', '2010-02-15')
INSERT INTO TABLEA VALUES (9, 4, 3, 4, 'a', 'b', 'c', '2010-03-05', '2010-03-26', '2010-08-15')
INSERT INTO TABLEA VALUES (10, 2, 3, 2, 'a', 'b', 'c', '2010-02-27', '2010-06-17', '2010-08-28')
INSERT INTO TABLEA VALUES (11, 1, 2, 2, 'a', 'b', 'c', '2010-06-11', '2010-04-22', '2010-03-11')
INSERT INTO TABLEA VALUES (12, 2, 3, 1, 'a', 'b', 'c', '2010-02-01', '2010-09-04', '2010-02-15')
INSERT INTO TABLEA VALUES (13, 4, 5, 5, 'a', 'b', 'c', '2010-06-02', '2010-05-04', '2010-03-25')
INSERT INTO TABLEA VALUES (14, 4, 2, 1, 'a', 'b', 'c', '2010-07-14', '2010-06-19', '2010-07-17')
INSERT INTO TABLEA VALUES (15, 2, 1, 5, 'a', 'b', 'c', '2010-07-05', '2010-08-16', '2010-07-10')

INSERT INTO TABLEB VALUES (1, 1, 4, 3, 'a', 'b', 'c', '2010-05-05', '2010-02-13', '2010-07-21')
INSERT INTO TABLEB VALUES (2, 4, 5, 1, 'a', 'b', 'c', '2010-07-25', '2010-07-19', NULL)
INSERT INTO TABLEB VALUES (3, 2, 4, 2, 'a', 'b', 'c', '2010-05-11', '2010-03-03', '2010-04-25')
INSERT INTO TABLEB VALUES (4, 3, 3, 2, 'a', 'b', 'c', '2010-05-30', '2010-07-10', '2010-01-03')
INSERT INTO TABLEB VALUES (5, 2, 4, 5, 'a', 'b', NULL, '2010-05-27', '2010-02-25', '2010-06-05')
INSERT INTO TABLEB VALUES (6, 5, 3, 1, 'a', 'b', 'c', '2010-05-12', '2010-01-30', '2010-08-03')
INSERT INTO TABLEB VALUES (7, 2, 5, 2, 'a', 'b', 'c', '2010-08-04', '2010-06-19', '2010-06-20')
INSERT INTO TABLEB VALUES (8, 1, 2, 2, 'a', 'b', 'c', '2010-07-26', '2010-03-16', '2010-02-15')
INSERT INTO TABLEB VALUES (9, 4, 3, 4, 'a', 'b', 'c', '2010-03-05', '2010-03-26', '2010-08-15')
INSERT INTO TABLEB VALUES (10, 2, 3, 2, 'a', 'b', 'c', '2010-02-27', '2010-06-17', '2010-08-28')
INSERT INTO TABLEB VALUES (11, 1, 2, 2, 'a', 'b', 'c', '2010-06-11', '2010-04-22', '2010-03-11')
INSERT INTO TABLEB VALUES (12, 2, 3, 1, 'a', 'b', 'c', '2010-02-01', '2010-09-04', '2010-02-15')
INSERT INTO TABLEB VALUES (13, 4, 5, 5, 'a', 'b', 'c', '2010-06-02', '2010-05-04', '2010-03-25')
INSERT INTO TABLEB VALUES (14, 4, 2, 1, 'a', 'b', 'c', '2010-07-14', '2010-06-19', '2010-07-17')
INSERT INTO TABLEB VALUES (15, 2, 1, 5, 'a', 'b', 'c', '2010-07-05', '2010-08-16', '2010-07-10')

/* First query, finding mismatches (ignoring where both are NULL) */

SELECT 'TABLEA' AS Source, a.primarykey, a.Col2, a.Col3, a.Col4, a.Col5, a.Col6, a.Col7, 
    a.Col8, a.Col9, a.Col10
FROM TABLEA a INNER JOIN
    TABLEB b ON a.primarykey = b.primarykey
WHERE COALESCE(a.Col2, -123456) <> COALESCE(b.Col2, -123456) OR
    COALESCE(a.Col3, -123456) <> COALESCE(b.Col3, -123456) OR
    COALESCE(a.Col4, -123456) <> COALESCE(b.Col4, -123456) OR
    COALESCE(a.Col5, 'as3hdv7p@oip$') <> COALESCE(b.Col5, 'as3hdv7p@oip$') OR
    COALESCE(a.Col6, 'as3hdv7p@oip$') <> COALESCE(b.Col6, 'as3hdv7p@oip$') OR
    COALESCE(a.Col7, 'as3hdv7p@oip$') <> COALESCE(b.Col7, 'as3hdv7p@oip$') OR
    COALESCE(a.Col8, '1850-01-01 03:17:42') <> COALESCE(b.Col8, '1850-01-01 03:17:42') OR
    COALESCE(a.Col9, '1850-01-01 03:17:42') <> COALESCE(b.Col9, '1850-01-01 03:17:42') OR
    COALESCE(a.Col10, '1850-01-01 03:17:42') <> COALESCE(b.Col10, '1850-01-01 03:17:42')
UNION ALL
SELECT 'TABLEB' AS Source, b.primarykey, b.Col2, b.Col3, b.Col4, b.Col5, b.Col6, b.Col7, 
    b.Col8, b.Col9, b.Col10
FROM TABLEA a INNER JOIN
    TABLEB b ON a.primarykey = b.primarykey
WHERE COALESCE(a.Col2, -123456) <> COALESCE(b.Col2, -123456) OR
    COALESCE(a.Col3, -123456) <> COALESCE(b.Col3, -123456) OR
    COALESCE(a.Col4, -123456) <> COALESCE(b.Col4, -123456) OR
    COALESCE(a.Col5, 'as3hdv7p@oip$') <> COALESCE(b.Col5, 'as3hdv7p@oip$') OR
    COALESCE(a.Col6, 'as3hdv7p@oip$') <> COALESCE(b.Col6, 'as3hdv7p@oip$') OR
    COALESCE(a.Col7, 'as3hdv7p@oip$') <> COALESCE(b.Col7, 'as3hdv7p@oip$') OR
    COALESCE(a.Col8, '1850-01-01 03:17:42') <> COALESCE(b.Col8, '1850-01-01 03:17:42') OR
    COALESCE(a.Col9, '1850-01-01 03:17:42') <> COALESCE(b.Col9, '1850-01-01 03:17:42') OR
    COALESCE(a.Col10, '1850-01-01 03:17:42') <> COALESCE(b.Col10, '1850-01-01 03:17:42')
ORDER BY 2, 1

/* returns:

Source primarykey Col2 Col3 Col4 Col5 Col6 Col7 Col8                    Col9                    Col10
-----------------------------------------------------------------------------------------------------------------------
TABLEA 3          NULL NULL 2    a    b    c    NULL                    2010-03-03 00:00:00.000 2010-04-25 00:00:00.000
TABLEB 3          2    4    2    a    b    c    2010-05-11 00:00:00.000 2010-03-03 00:00:00.000 2010-04-25 00:00:00.000
TABLEA 4          3    3    2    a    b    c    2010-01-01 00:00:00.000 2010-07-10 00:00:00.000 2010-01-03 00:00:00.000
TABLEB 4          3    3    2    a    b    c    2010-05-30 00:00:00.000 2010-07-10 00:00:00.000 2010-01-03 00:00:00.000
TABLEA 5          2    4    5    a    b    c    2010-05-27 00:00:00.000 2010-02-25 00:00:00.000 2010-06-05 00:00:00.000
TABLEB 5          2    4    5    a    b    NULL 2010-05-27 00:00:00.000 2010-02-25 00:00:00.000 2010-06-05 00:00:00.000

*/


/* 2nd query, counting diffs (ignoring where both are null) by column */

SELECT SUM(CASE WHEN COALESCE(a.Col2, -123456) <> COALESCE(b.Col2, -123456) THEN 1 ELSE 0 END) AS Col2Count,
    SUM(CASE WHEN COALESCE(a.Col3, -123456) <> COALESCE(b.Col3, -123456) THEN 1 ELSE 0 END) AS Col3Count,
    SUM(CASE WHEN COALESCE(a.Col4, -123456) <> COALESCE(b.Col4, -123456) THEN 1 ELSE 0 END) AS Col4Count,
    SUM(CASE WHEN COALESCE(a.Col5, 'as3hdv7p@oip$') <> COALESCE(b.Col5, 'as3hdv7p@oip$') THEN 1 ELSE 0 END) AS Col5Count,
    SUM(CASE WHEN COALESCE(a.Col6, 'as3hdv7p@oip$') <> COALESCE(b.Col6, 'as3hdv7p@oip$') THEN 1 ELSE 0 END) AS Col6Count,
    SUM(CASE WHEN COALESCE(a.Col7, 'as3hdv7p@oip$') <> COALESCE(b.Col7, 'as3hdv7p@oip$') THEN 1 ELSE 0 END) AS Col7Count,
    SUM(CASE WHEN COALESCE(a.Col8, '1850-01-01 03:17:42') <> COALESCE(b.Col8, '1850-01-01 03:17:42') THEN 1 ELSE 0 END) AS Col8Count,
    SUM(CASE WHEN COALESCE(a.Col9, '1850-01-01 03:17:42') <> COALESCE(b.Col9, '1850-01-01 03:17:42') THEN 1 ELSE 0 END) AS Col9Count,
    SUM(CASE WHEN COALESCE(a.Col10, '1850-01-01 03:17:42') <> COALESCE(b.Col10, '1850-01-01 03:17:42') THEN 1 ELSE 0 END) AS Col10Count
FROM TABLEA a INNER JOIN
    TABLEB b ON a.primarykey = b.primarykey

/* returns:

Col2Count Col3Count Col4Count Col5Count Col6Count Col7Count Col8Count Col9Count Col10Count
------------------------------------------------------------------------------------------
1         1         0         0         0         1         2         0         0

*/




SELECT
    SUM(CASE WHEN (a.first_name IS NULL AND b.first_name IS NOT NULL) OR (a.first_name IS NOT NULL AND b.first_name IS NULL) THEN 1 ELSE 0 END) AS first_name,
    SUM(CASE WHEN (a.middle_name IS NULL AND b.middle_name IS NOT NULL) OR (a.middle_name IS NOT NULL AND b.middle_name IS NULL) THEN 1 ELSE 0 END) AS middle_name,
    SUM(CASE WHEN (a.last_name IS NULL AND b.last_name IS NOT NULL) OR (a.last_name IS NOT NULL AND b.last_name IS NULL) THEN 1 ELSE 0 END) AS last_name,
    SUM(CASE WHEN (a.user_country_id IS NULL AND b.user_country_id IS NOT NULL) OR (a.user_country_id IS NOT NULL AND b.user_country_id IS NULL) THEN 1 ELSE 0 END) AS user_country_id,
    SUM(CASE WHEN (a.user_manager_id IS NULL AND b.user_manager_id IS NOT NULL) OR (a.user_manager_id IS NOT NULL AND b.user_manager_id IS NULL) THEN 1 ELSE 0 END) AS user_manager_id,
    SUM(CASE WHEN (a.user_created_date IS NULL AND b.user_created_date IS NOT NULL) OR (a.user_created_date IS NOT NULL AND b.user_created_date IS NULL) THEN 1 ELSE 0 END) AS user_created_date,
    SUM(CASE WHEN (a.offsite_name IS NULL AND b.offsite_name IS NOT NULL) OR (a.offsite_name IS NOT NULL AND b.offsite_name IS NULL) THEN 1 ELSE 0 END) AS offsite_name,
    SUM(CASE WHEN (a.offline_hours IS NULL AND b.offline_hours IS NOT NULL) OR (a.offline_hours IS NOT NULL AND b.offline_hours IS NULL) THEN 1 ELSE 0 END) AS offline_hours
FROM user_sample1 a INNER JOIN
    user_sample2 b ON a.user_id = b.user_id

Open in new window

Have you tried the tablediff utility?
http://weblogs.sqlteam.com/mladenp/archive/2007/03/03/60125.aspx

How about SQL Delta (third party app)?
http://www.sqldelta.com/

Do you need to synchronize these tables?
 
========
I think you will help your cause if you add a timestamp column in these tables that indicates the last time the row was changed.  This will really improve the performance of finding row differences for matched keys.