Scott Fell
asked on
MS SQL SERVER: prevent duplicates from being inserted
I have a temporary table with a clustered index on three fields, used to prevent duplicates when grabbing data from an outside source.
The next step is copying this data to an active table. Most rows in the live table use the fields from that clustered index, but not all of them do, so I can't duplicate the clustered index on the live table.
I have scripts that auto-populate the temp table several times a day, as well as copy the data over.
TEMP_TABLE (clustered index on key1, key2, key3; Ignore Duplicate Values = true)
key1
key2
key3
field1
field2
LIVE_TABLE
ID Identity_column
field1
field2
field3
key1
key2
key3
What is the best option to copy the data from the temp table to the live table and avoid duplicates on key1, key2, and key3, given that some rows will have NULL key fields because those fields are not needed?
When the key fields are not needed, the data is entered as a one-off. When the key fields are used, the data is brought in via a SQL script.
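For reference, a minimal sketch of the two tables as described; the column types are assumptions (the question does not give them), and "Ignore Duplicate Values = true" corresponds to the IGNORE_DUP_KEY index option in T-SQL:

-- Column types below are assumptions for illustration only.
CREATE TABLE TEMP_TABLE (
    key1   int          NOT NULL,
    key2   varchar(50)  NOT NULL,
    key3   varchar(50)  NOT NULL,
    field1 varchar(100) NULL,
    field2 varchar(100) NULL
);

-- IGNORE_DUP_KEY = ON silently discards duplicate key rows on insert
-- instead of raising an error.
CREATE UNIQUE CLUSTERED INDEX IX_TEMP_TABLE_keys
    ON TEMP_TABLE (key1, key2, key3)
    WITH (IGNORE_DUP_KEY = ON);

CREATE TABLE LIVE_TABLE (
    ID     int IDENTITY(1,1) PRIMARY KEY,
    field1 varchar(100) NULL,
    field2 varchar(100) NULL,
    field3 varchar(100) NULL,
    key1   int          NULL,   -- NULL on the rows that do not use the keys
    key2   varchar(50)  NULL,
    key3   varchar(50)  NULL
);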
ASKER
Thanks Jim. The issue is that goo, foo, and boo combined make up a unique row.
goo = abc, foo = def, boo = xyz
goo = abc, foo = def, boo = qrs
goo = abc, foo = abc, boo = xyz
The only thing that would be unique is the combination abc+def+xyz.
INSERT INTO LIVE_TABLE (goo, foo, boo)
SELECT tt.goo, tt.foo, tt.boo
FROM TEMP_TABLE tt
-- LEFT means include all rows in TEMP_TABLE...
LEFT JOIN LIVE_TABLE lt ON tt.goo = lt.goo AND tt.foo = lt.foo AND tt.boo = lt.boo
-- ...and keep ONLY those that are not already in the live table.
-- lt.id is NULL exactly when the join found no match; id is left out of
-- the INSERT list because LIVE_TABLE generates its own identity values.
WHERE lt.id IS NULL
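An equivalent way to express the same anti-join, if it reads more clearly, is NOT EXISTS; a sketch using the same illustrative goo/foo/boo columns:

INSERT INTO LIVE_TABLE (goo, foo, boo)
SELECT tt.goo, tt.foo, tt.boo
FROM TEMP_TABLE tt
-- Insert only rows whose key combination is not already in the live table
WHERE NOT EXISTS (
    SELECT 1
    FROM LIVE_TABLE lt
    WHERE lt.goo = tt.goo
      AND lt.foo = tt.foo
      AND lt.boo = tt.boo
);

The SQL Server optimizer typically produces the same anti-semi-join plan for both forms.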
ASKER
This is what the data looks like from the temp table (sample below). There is an ID identity field, but for this purpose it is not used for anything, nor is it imported to the live table.
The probability that key3 on its own is unique is pretty good. However, we can't control that data and can't assume it is unique. The only thing that can be considered unique is the combination of key1+key2+key3. Key3 is supposed to be unique within key2, and there will be multiple rows of key2 related to each key1.
There are 2 types of data that are inserted/updated into the live table. One type does not use the key1, key2, or key3 fields, and therefore 20% of the rows will have those fields blank or NULL. In this case we are importing these fields, and I bring up the 20% of rows without key data because I'm not sure of the best way to bring this data in and make sure it is unique.
I thought I could create a view with just the rows where these fields are filled in, and use the same process I use for the temp table import: a clustered index on the 3 key fields that skips duplicates (an alternative is sketched after the sample data below).
Each import could be a few hundred to a few thousand rows of data for now.
ID    key1  key2  key3       field1
6834  2     abc   box1       some data 1
6835  2     abc   box2       some data 2
6836  2     abc   box3       some data 3
6837  2     abc   box4       some data 4
6838  2     abc   box5       some data 5
6839  2     abc   box6       some data 6
6840  2     def   circle1    some data 7
6841  2     def   circle2    some data 8
6842  2     def   circle3    some data 9
6843  2     def   circle4    some data 10
6844  2     def   circle5    some data 11
6845  2     def   circle6    some data 12
6846  2     def   circle7    some data 13
6847  3     aef   triangle1  some data 14
6848  3     aef   triangle2  some data 15
6849  3     aef   triangle3  some data 16
6850  3     aef   triangle4  some data 17
6851  3     zxy   cylinder1  some data 18
6852  3     zxy   cylinder2  some data 19
6853  3     zxy   cylinder3  some data 20
6854  3     zxy   cylinder4  some data 21
6855  3     zxy   cylinder5  some data 22
6856  3     zxy   cylinder6  some data 23
6857  3     zxy   cylinder7  some data 24
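One way to get the same "skip duplicates" behavior on the live table without a view (a plain view cannot enforce uniqueness, and an indexed view comes with heavy restrictions) is a filtered unique index; a minimal sketch, assuming the schema above:

-- Enforce uniqueness of key1+key2+key3 only for rows where the keys
-- are present; the one-off rows with NULL keys are left out of the
-- index entirely and so are unaffected.
CREATE UNIQUE NONCLUSTERED INDEX IX_LIVE_TABLE_keys
    ON LIVE_TABLE (key1, key2, key3)
    WHERE key1 IS NOT NULL AND key2 IS NOT NULL AND key3 IS NOT NULL;

Adding WITH (IGNORE_DUP_KEY = ON) to that index would turn a duplicate insert into a warning rather than an error, mirroring the temp table's behavior.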
There are 2 types of data that are inserted/updated to the live table.
ok
One type does not utilize the key1, key2 or key3 fields and therefor, 20% of the rows will have those fields blank or null.
so:
type 1 has key1, key2, key3
type 2 does not have values for key1, key2, key3
In this case, we are importing these fields
for type 2, you are importing the non-null fields (i.e. NOT key1, key2, key3)
and I bring up the 20% of the data not containing data because I'm not sure the best way to bring this data in and make sure it is unique.
I do not understand this bit. What does "bring up the data" mean?
When you import those type 2 rows, do key1, key2, key3 simply remain NULL forever?
(I assume there is some auto-incrementing unique identifier on the LIVE_TABLE.)
ASKER
Both tables have an auto-incrementing ID field. Type 1 data is imported in bulk or entered as a one-off. Type 2 data is entered only as a one-off.
During the import, it is very possible that the same combination of key1+key2+key3 has already been added to the live table, and I do not want to add it again if it is there. I could just do a lookup for each row, but I was wondering whether there is something more efficient that I had not thought of.
Have you considered/tried MERGE?
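For reference, a minimal MERGE sketch for this case, assuming the column names from the question; the WITH (HOLDLOCK) hint guards against a race between MERGE's match check and its insert when two imports run at the same time:

MERGE LIVE_TABLE WITH (HOLDLOCK) AS lt
USING (
    -- Only the rows that actually carry the key fields
    SELECT key1, key2, key3, field1, field2
    FROM TEMP_TABLE
) AS tt
    ON  lt.key1 = tt.key1
    AND lt.key2 = tt.key2
    AND lt.key3 = tt.key3
-- Insert only the key combinations not already in the live table
WHEN NOT MATCHED BY TARGET THEN
    INSERT (key1, key2, key3, field1, field2)
    VALUES (tt.key1, tt.key2, tt.key3, tt.field1, tt.field2);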
ASKER
Thank you both! I am going with the Merge on this one.
ASKER
Thank you Jim and Paul. The MERGE worked easily for this one.
Pleased to hear it. Cheers, Paul
By the way, if this is an SSIS solution you could also use a Lookup Transform between source and destination, connected to the live table, which would mean a staging 'TEMP_TABLE' is no longer required, although there are plenty of reasons why keeping one is a good idea anyway.