
Solved

Identify duplicate record using random key - possible?

Posted on 2008-11-19
Medium Priority
489 Views
Last Modified: 2013-11-10
We are loading data from files into a SQL Server database using SSIS. Before inserting into the destination table, we need to remove duplicates.
1)  What is the best way to do this in SSIS?
2)  I am thinking of storing a random key in the destination table, where the random key is generated using the entire record as input. Does SQL Server have a random generator that accepts a parameter of type varchar?
Thanks,
pax
Question by:cpeters5
20 Comments
 
LVL 17

Expert Comment

by:HoggZilla
ID: 22995796
Let me suggest you import into a "staging" table and then use a separate Data Flow task to move the data from the staging table to the final destination. There is no way to run SQL against a file before the insert.
 

Author Comment

by:cpeters5
ID: 22995880
Thanks HoggZilla,
How do we identify duplicate rows in the staging table? My thought (item 2) was to create a random key to identify each row, using the entire row as the seed. But there must be a better way to do this in SSIS. (I don't know SSIS much - actually, at all...)
pax
 
LVL 17

Expert Comment

by:HoggZilla
ID: 22995929
It depends on what you consider a duplicate row. All values are the same? If you provide me an example of a duplicate vs. a non-duplicate, that would help. Also, I assume you want to insert this into a table without duplicating there either - is that correct? So if you had a row in the database already, you wouldn't want to insert another just like it?
 
LVL 17

Expert Comment

by:HoggZilla
ID: 22995965
But just so you know: yes, you can set an IDENTITY column on a table and it will auto-increment by whatever value you choose. Unfortunately this is not a varchar, it is an int. There is also the GUID value, but I am not sure this is what you are after, so I am trying to verify your need before I suggest either of these as a solution.
 

Author Comment

by:cpeters5
ID: 22996010
By duplicate, I mean an exactly identical string, usually created by accident at the data source. (We have specific logic to deal with other, partial duplicates.)

Here is an example:
1|This is|a|3-rd|test|
1|This is|a|3-rd|test|
2|This is|a|3-rd|test|

Rows 1 and 2 are duplicates.
Rows 1 and 3 are not duplicates.
pax
 
LVL 17

Expert Comment

by:HoggZilla
ID: 22996072
OK, so here is my suggestion.
1. Insert the data from the file into a table. It should be a copy of the same table you use currently - this will be your staging table.
2. After the file loads, insert the distinct records from the staging table into your final destination table. All you need for this insert is an Execute SQL Task in SSIS, unless there are more than 50K rows, in which case I would use a Data Flow Task. The SQL would be this:
INSERT INTO MyFinalDestinationTable
SELECT DISTINCT * FROM MyStagingTable
 

Author Comment

by:cpeters5
ID: 22996163
This would work except for the possibility that duplicates are in different files. Unfortunately, we do not have any control over the input extract file. Is there a way to remedy this case (two files with some overlap)?
 

Author Comment

by:cpeters5
ID: 22996226

To expand on our loading process: we receive a number of extract files on a daily basis. (Data could come in as a stream, but we plan to dump it into files before loading.) This means duplicate rows could occur across multiple passes. There is no guarantee.
pax
 
LVL 17

Accepted Solution

by:
HoggZilla earned 1500 total points
ID: 22996357
If I had multiple files a day, possibly containing duplicate information that I did not want to duplicate in my destination table, I would only insert into the destination table where it did not exist. That would always keep my final destination table duplicate free.

INSERT INTO MyFinalDestinationTable
SELECT DISTINCT * FROM MyStagingTable a
WHERE NOT EXISTS
(SELECT 1 FROM MyStagingTable b
WHERE b.Column1 = a.Column1
AND b.Column2 = a.Column2
AND .....)
including every column in my WHERE clause.
 
LVL 17

Expert Comment

by:HoggZilla
ID: 22996372
Sorry, my bad. Recheck that above.....

INSERT INTO MyFinalDestinationTable
SELECT DISTINCT * FROM MyStagingTable a
WHERE NOT EXISTS
(SELECT 1 FROM MyFinalDestinationTable b
WHERE b.Column1 = a.Column1
AND b.Column2 = a.Column2
AND .....)
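For what it's worth, the NOT EXISTS logic above can also be sketched outside SQL. A minimal Python illustration, with rows modeled as tuples of column values and all names hypothetical (standing in for MyStagingTable and MyFinalDestinationTable):

```python
def load_batch(destination, staging):
    """Insert each distinct staging row that does not already exist
    in the destination - the NOT EXISTS check, in miniature."""
    existing = set(destination)            # rows already in the destination
    for row in dict.fromkeys(staging):     # SELECT DISTINCT, order preserved
        if row not in existing:
            destination.append(row)
            existing.add(row)
    return destination

dest = [("1", "This is", "a", "3-rd", "test")]
stage = [
    ("1", "This is", "a", "3-rd", "test"),  # already in the destination
    ("2", "This is", "a", "3-rd", "test"),  # new row
    ("2", "This is", "a", "3-rd", "test"),  # duplicate within the same file
]
load_batch(dest, stage)
assert len(dest) == 2  # only the genuinely new row was inserted
```

The set lookup mirrors what the database does with the correlated subquery: every column of the incoming row must match an existing row for it to be skipped.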
 
LVL 17

Expert Comment

by:HoggZilla
ID: 22996387
So, is this scenario true? If so, then the above should solve the problem.
You have a final destination table where all rows should be unique.
You receive multiple files and want to insert the data from those files into the final destination table, but you don't want to insert a row that already exists?
 

Author Comment

by:cpeters5
ID: 22996530

Thanks HoggZilla (I have an urge to call you Hogg :-) )
Your scenario is almost true. My only concern is the duplicates that occur between two different days. This means I have to maintain the staging table over multiple days and trim it regularly, perhaps weekly. I think it is safe to assume that duplicates, if they exist, will not be more than one week apart.
This sounds better than maintaining a unique identifier key in the destination table, as I had thought of doing.
What do you think?

But I am curious whether SQL Server has a random function that accepts a string parameter instead of an int. I will post this as a separate question.

Thanks,
pax
 
LVL 17

Expert Comment

by:HoggZilla
ID: 22996628
Hogg is good, that is my last name. :-)
You don't have to keep the data in the staging table. Once you load the data from the .csv into the staging table, you insert it into the destination table using my "second" statement above. The insertion of non-duplicate data is based on the destination table. After you move the unique data across, truncate the staging table - you don't need that data any longer.
Now, to your question of a random string: you really need something unique, like Oracle's ROWID. You can create a new GUID using NEWID() and save it to a varchar; that is a uniqueidentifier. Look up NEWID() and GUID - lots of information here at EE.
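As an illustration, Python's uuid module generates the same kind of GUID that NEWID() does (uuid4() stands in for NEWID() here). Note the limitation for this question: a GUID identifies a row, but since every call yields a different value, it cannot detect duplicates.

```python
import uuid

row_id = uuid.uuid4()        # a new 128-bit GUID, like SQL Server's NEWID()
as_varchar = str(row_id)     # canonical 36-character string form

# Identical rows would still receive different GUIDs -
# good as a row id, useless for duplicate detection.
assert uuid.uuid4() != uuid.uuid4()
assert len(as_varchar) == 36
```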
 

Author Comment

by:cpeters5
ID: 22996711
Thanks Hogg,
I didn't read your statement carefully. Yes, it would work. Not sure about the performance, though. The destination table(s) have over 40 columns and may grow to over 20-30 million rows.

As for the random key, it was not meant to be a unique id for the staging table. It is a way to identify duplicate strings. Say, if string1 = string2, then this function, say f, will generate the same value for both strings: f(string1) = f(string2). Just like the rand(n) function: rand(5) = rand(5).
But that's another story.
Thanks again. I will accept your solution.
pax
 
LVL 17

Expert Comment

by:HoggZilla
ID: 22996792
Yes, performance would be tough at those levels. We can talk through the key if you want; I just don't understand. :-) I am sure it makes sense, I am just missing it. Sorry. If you want to try explaining it to me like I'm a third grader, with some examples, I will try to help you figure out the best approach.
Are you saying a function that would look at the data and create a value that represents its uniqueness? Then you could run this function against the data as it comes in and search the destination table based on this key, not all of the potential 50 columns?
 

Author Comment

by:cpeters5
ID: 22997091

Exactly. The simplest example would be:
- each input record has only 1 value, and it is of type integer (no pipes in each row).

-  my destination table is  tab(id integer, col1 integer).  Where id is uniquely generated using either guid or NewID()

-  To capture duplicates, I would add a new column, say  uniq in the destination table    
tab(id integer, col1 integer, uniq integer)  and create a unique index on uniq.

-  During teh load, I would apply rand() to each value.  Then insert using the insert statement
insert into tab values (uniq(), value,  rand(value))  for each value in the file.

For example, if I am loading a file containing
123
456
123
789

Apply rand() to each value yields
0.715865215706424
0.722069995135404
0.715865215706424
0.728274774564385

as you can see, the first and third values are the same.  So the unique index on uniq column will reject the third row.

So if only we had a rand function that accepts a string value, we could seed the function with the entire row as we did above.
Does this make sense?

pax
 

Author Comment

by:cpeters5
ID: 22997163
Sorry, typo

- During teh load, I would apply rand() to each value. Then insert using the insert statement
insert into tab values (uniq(), value, rand(value)) for each value in the file.

should read

- During the load, I would apply rand() to each value. Then insert using the insert statement
insert into tab values (newid(), value, rand(value)) for each value in the file.

 

Author Comment

by:cpeters5
ID: 22997646


Let me clean up/simplify my example post to make it easier to read. You see, English is not my first language; I tend to write backwards, which makes it confusing to readers.

-------------------
- Consider an input file containing 4 rows. Each row has one field of type integer. There are no pipes in each row.  Example
123
456
123
789

- The original destination table has two columns  tab(id integer, col1 integer). Where id is uniquely generated using either guid or NewID()

- To detect duplicates, I add a new column, say uniq of type numeric
tab(id integer, col1 integer, uniq numeric)  and create a unique index on uniq.

- To load the above file, I first apply rand() function to each value
Apply rand() to each value yields
0.715865215706424
0.722069995135404
0.715865215706424
0.728274774564385

-  Then   insert into tab values (newid(), 123, rand(123))   etc

The third row will be rejected by the unique index because its rand value is the same as the first row's.

Back to the real world: my records are not integers, so I cannot use the rand() function. But if there were one that takes string values, we could use it the same way as in the integer example above.
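A deterministic function that maps equal strings to equal values is exactly what a hash function provides (SQL Server has CHECKSUM() and HASHBYTES() for this). A minimal Python sketch of the idea, run against the integer example above; the function name is illustrative:

```python
import hashlib

def row_key(row: str) -> str:
    """Deterministic: identical rows always produce the identical key."""
    return hashlib.md5(row.encode("utf-8")).hexdigest()

rows = ["123", "456", "123", "789"]
keys = [row_key(r) for r in rows]

# The first and third keys match, so a unique index on the key column
# would reject the third row - the same effect as the rand() example.
assert keys[0] == keys[2]
assert keys[0] != keys[1]
```

Unlike rand(value), a hash is defined for any string input, so the entire pipe-delimited record can be fed in directly.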
 
LVL 17

Expert Comment

by:HoggZilla
ID: 23000777
I experimented with CAST('allthecolumnstogether' AS VARBINARY(255)), and it could work if you didn't have 50 columns. Sorry, can't think of anything else.
 

Author Comment

by:cpeters5
ID: 23003005
Thanks Hogg,
I accept your original solution using the staging table since it has potential. I will remove the condition that compares all columns if it proves too costly. It is reasonable to assume no duplicates arrive more than a week apart, so I will just trim the staging table and keep only one week's data in it.
Thanks again,
pax
