cpeters5 asked:

Identify duplicate record using random key - possible?

We are loading data from files into a SQL Server database using SSIS. Before inserting into the destination table, we need to remove duplicates.
1) What is the best way to do this in SSIS?
2) I am thinking of storing a random key in the destination table, where the key is generated using the entire record as input. Does SQL Server have a random generator that accepts a parameter of type varchar?
Thanks,
pax
Steve Hogg:

Let me suggest you import into a "staging" table and then use a separate Data Flow task to move the data from the staging table to the final destination. There is no way to run SQL against a file before the insert.
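A minimal sketch of that setup, assuming hypothetical table names:

-- Clone the destination table's structure, with no rows, to serve as staging
SELECT *
INTO MyStagingTable
FROM MyFinalDestinationTable
WHERE 1 = 0;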
cpeters5 (Asker):
Thanks HoggZilla,
How do we identify duplicate rows in the staging table? My thought (item 2) was to create a random key to identify each row, using the entire row as the seed. But there must be a better way to do this in SSIS. (I don't know SSIS much; actually, not at all...)
pax
It depends on what you consider a duplicate row. All values are the same? If you provide an example of a duplicate vs. a non-duplicate, that would help. Also, I assume you want to insert this into a table without duplicating rows there either; is that correct? So if you had a row in the database already, you wouldn't want to insert another just like it?
But just so you know, yes, you can set an IDENTITY column on a table and it will auto-increment by whatever value you choose. Unfortunately this is not a varchar; it is an int. There is also the GUID value, but I am not sure this is what you are after, so I am trying to verify your need before I suggest either of these as a solution.
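For illustration, the two options might look like this (a sketch; table and column names are hypothetical):

CREATE TABLE MyStagingTable (
    RowId   INT IDENTITY(1,1),                  -- auto-incrementing int key
    RowGuid UNIQUEIDENTIFIER DEFAULT NEWID(),   -- GUID alternative
    Column1 VARCHAR(100),
    Column2 VARCHAR(100)
);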
By duplicate, I mean an exactly identical string, usually created by accident at the data source. (We have specific logic to deal with other, partial duplicates.)

Here is an example:
1|This is|a|3-rd|test|
1|This is|a|3-rd|test|
2|This is|a|3-rd|test|

Rows 1 and 2 are duplicates.
Rows 1 and 3 are not duplicates.
pax
OK, so here is my suggestion.
1. Insert the data from the file into a table. It should be a copy of the same table you use currently; this will be your staging table.
2. After the file loads, insert the distinct records from the staging table into your final destination table. All you need for this insert is an Execute SQL Task in SSIS; if there are more than 50K rows, I would use a Data Flow Task instead. The SQL would be:
INSERT INTO MyFinalDestinationTable
SELECT DISTINCT * FROM MyStagingTable
This would work except for the possibility that duplicates appear in different files. Unfortunately, we do not have any control over the input extract files. Is there a way to remedy this case (two files with some overlap)?

To expand on our loading process: we receive a number of extract files on a daily basis. (Data could come in as a stream, but we plan to dump it into files before loading.) This means duplicate rows could occur across multiple passes; there is no guarantee.
pax
ASKER CERTIFIED SOLUTION
Steve Hogg
Sorry, my bad. Recheck that above...

INSERT INTO MyFinalDestinationTable
SELECT DISTINCT * FROM MyStagingTable a
WHERE NOT EXISTS
(SELECT 1 FROM MyFinalDestinationTable b
WHERE b.Column1 = a.Column1
AND b.Column2 = a.Column2
AND .....)
So, is this scenario true? If so, then the above should solve the problem.
You have a final destination table where all rows should be unique.
You receive multiple files and want to insert the data from those files into the final destination table, but you don't want to insert a row that already exists.

Thanks HoggZilla (I have an urge to call you Hogg :-) )
Your scenario is almost true. My only concern is duplicates that occur across two different days. This means I have to maintain the staging table over multiple days and trim it regularly, perhaps weekly. I think it is safe to assume that duplicates, if they exist, will not be more than one week apart.
This sounds better than maintaining a unique identifier key in the destination table, as I had been thinking.
What do you think?

But I am curious whether SQL Server has a random function that accepts a string parameter instead of an int. I will post this as a separate question.

Thanks,
pax

 
Hogg is good, that is my last name. :-)
You don't have to keep the data in the staging table. Once you load the data from the .csv into the staging table, you insert that data into the destination table using my "second" statement above. The insertion of non-duplicate data is based on the destination table. After you move the unique data to the destination table, truncate the staging table; you don't need that data any longer.
Now, to your question of a random string: you really need something unique, like Oracle's ROWID. You can create a new GUID using NEWID() and save it to a varchar; that is a uniqueidentifier. Look up NEWID() and GUID; there is lots of information here at EE.
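For example (the table name is hypothetical; NEWID() and CAST are standard T-SQL):

-- A GUID generated per row, stored as a string
SELECT CAST(NEWID() AS VARCHAR(36)) AS RowGuid;

-- After the unique rows are moved to the destination, clear staging
TRUNCATE TABLE MyStagingTable;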
Thanks Hogg,
I didn't read your statement carefully. Yes, it would work. I am not sure about the performance, though; the destination tables have over 40 columns and may grow to 20-30 million rows.

As for the random key, it was not meant to be a unique ID for the staging table. It is a way to identify duplicate strings. Say, if string1 = string2, then this function, say f, will generate the same value for both strings:
f(string1) = f(string2). Just like the rand(n) function, rand(5) = rand(5).
But that's another story.
Thanks again. I will accept your solution.
pax

     
Yes, performance would be tough at those levels. We can talk through the key if you want; I just don't understand it yet. :-) I am sure it makes sense; I am just missing it. Sorry. If you want to try explaining it to me like I'm a third grader, with some examples, I will try to help you figure out the best approach.
Are you saying a function that would look at the data and create a value that represents its uniqueness? Then you could run this function against the data as it comes in and search the destination table on this key alone, not all of the potential 50 columns?

Exactly. The simplest example would be:
- Each input record has only one value, and it is of type integer (no pipes in each row).

- My destination table is tab(id integer, col1 integer), where id is uniquely generated as a GUID via NEWID().

- To capture duplicates, I would add a new column, say uniq, to the destination table:
tab(id integer, col1 integer, uniq integer), and create a unique index on uniq.

- During the load, I would apply rand() to each value, then insert using the statement
insert into tab values (uniq(), value, rand(value)) for each value in the file,
since rand(n) always returns the same value for a given seed n.

For example, if I am loading a file containing
123
456
123
789

applying rand() to each value yields
0.715865215706424
0.722069995135404
0.715865215706424
0.728274774564385

As you can see, the first and third values are the same, so the unique index on the uniq column will reject the third row.

So if only we had a rand function that accepts a string value, we could seed the function with the entire row as we did above.
Does this make sense?

pax
Sorry, typo:

- During the load, I would apply rand() to each value, then insert using the statement
insert into tab values (uniq(), value, rand(value)) for each value in the file

should read

- During the load, I would apply rand() to each value, then insert using the statement
insert into tab values (newid(), value, rand(value)) for each value in the file



Let me clean up and simplify my example post to make it easier to read. You see, English is not my first language; I tend to write backwards, which can make things confusing for readers.

-------------------
- Consider an input file containing 4 rows. Each row has one field of type integer (no pipes in each row). Example:
123
456
123
789

- The original destination table has two columns: tab(id integer, col1 integer), where id is uniquely generated as a GUID via NEWID().

- To detect duplicates, I add a new column, say uniq, of type numeric:
tab(id integer, col1 integer, uniq numeric), and create a unique index on uniq.

- To load the above file, I first apply the rand() function to each value, which yields
0.715865215706424
0.722069995135404
0.715865215706424
0.728274774564385

- Then insert into tab values (newid(), 123, rand(123)), etc.

The third row will be rejected by the unique index because its rand value matches the first row's.

Back to the real world: my records are not integers, so I cannot use the rand() function. But if there were one that takes string values, we could use it the same way as in the integer example above.
I experimented with CAST('allthecolumnstogether' AS VARBINARY(255)), and it could work if you didn't have 50 columns. Sorry, I can't think of anything else.
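For reference, a sketch along those lines using HASHBYTES (available since SQL Server 2005) in place of the CAST; hashing condenses the concatenated columns into a fixed-width value, so row width matters much less. All table and column names are hypothetical, and older SQL Server versions limit HASHBYTES input to 8,000 bytes:

-- uniq holds a SHA1 hash of the whole row text; the unique index rejects exact duplicates
CREATE TABLE tab (
    id   UNIQUEIDENTIFIER DEFAULT NEWID(),
    col1 VARCHAR(4000),
    uniq VARBINARY(20)
);
CREATE UNIQUE INDEX ux_tab_uniq ON tab (uniq);

INSERT INTO tab (id, col1, uniq)
VALUES (NEWID(), '1|This is|a|3-rd|test|',
        HASHBYTES('SHA1', '1|This is|a|3-rd|test|'));  -- same row text => same hash

As with any hash, two different rows could in theory collide, but it avoids comparing all 40+ columns on every insert.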
Thanks Hogg,
I accept your original solution using the staging table, since it has potential. I will remove the condition that compares all columns if it proves too costly. It is reasonable to assume that duplicates, if any, are no more than a week apart, so I will just trim the staging table and keep only one week of data in it.
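A sketch of that weekly trim, assuming a LoadDate column is added to the staging table (names are hypothetical):

DELETE FROM MyStagingTable
WHERE LoadDate < DATEADD(DAY, -7, GETDATE());  -- keep only the last week of loads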
Thanks again,
pax