Solved

MSSQL Unique identifier based on row content

Posted on 2011-03-10
10
510 Views
Last Modified: 2012-05-11
Hi

I'm writing a program which contains an import process (of csv data) before the data is committed to the main table I want to ensure that it is not duplicate data.  

I think the best way to do this is to create a unique Identifier based on the row content a bit like an MP5 hash, but I don’t think this is supported in MSSQL (I’m using 2008 R2)

Can any one recommend a way to achieve this?

Thanks

Brasso
0
Comment
Question by:brasso_42
  • 4
  • 3
  • 2
  • +1
10 Comments
 
LVL 9

Expert Comment

by:kaminda
Comment Utility
You can write a .NET Assembly to create the hash and then use it in a CLR UDF. So you can use that function to create the unique identifier while inserting data.
0
 
LVL 1

Author Comment

by:brasso_42
Comment Utility
Hi

Do you have an example of how to do that?

Thanks

Brasso
0
 
LVL 9

Expert Comment

by:mayank_joshi
Comment Utility
You can use ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) to Delete Duplicate Rows in SQL Table

e.g.,
http://www.kodyaz.com/articles/delete-duplicate-rows-using-row-number-partition-by-order-by.aspx
0
 
LVL 1

Author Comment

by:brasso_42
Comment Utility
Hi

Necessarily want to delete them, I just want to be made aware of them.

Thanks

Brasso
0
 
LVL 9

Expert Comment

by:mayank_joshi
Comment Utility
so instead of deleting you can insert them into another table of same structure.
0
Free Trending Threat Insights Every Day

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

 
LVL 9

Expert Comment

by:mayank_joshi
Comment Utility
eg.,
WITH [CTE DUPLICATE] AS
(
SELECT
  RN = ROW_NUMBER() OVER (PARTITION BY CompanyTitle ORDER BY LastContactDate DESC),
  Id, CompanyTitle, ContactName, LastContactDate
FROM Suppliers
)
Insert into Suppliers_Duplicates select *  FROM [CTE DUPLICATE] WHERE RN > 1

Open in new window



here Suppliers and Suppliers_Duplicates should have same structure.
0
 
LVL 9

Expert Comment

by:mayank_joshi
Comment Utility
Better:-

WITH [CTE DUPLICATE] AS
(
SELECT
  RN = ROW_NUMBER() OVER (PARTITION BY CompanyTitle ORDER BY LastContactDate DESC),
  Id, CompanyTitle, ContactName, LastContactDate
FROM Suppliers
)
Insert into Suppliers_Duplicates ( Id, CompanyTitle, ContactName, LastContactDate )
select  Id, CompanyTitle, ContactName, LastContactDate  FROM [CTE DUPLICATE] WHERE RN > 1
0
 
LVL 75

Expert Comment

by:Anthony Perkins
Comment Utility
The correct way to do this is always, always import into a staging table, validate your data prior to inserting into the final production table(s).  Remember the old axiom: Garbage in, garbage out.

>>I want to ensure that it is not duplicate data. <<
Before we can help you here, we need to know your definition of "duplicate data", is it:
1. Duplicate imported data or
2. Duplicate data when comparing imported data to existing data or
3. Something else.
0
 
LVL 1

Author Comment

by:brasso_42
Comment Utility
Hi Acperkins

This is exactly what I want to do.   I want to validate on both point one and two.  

Thanks

Brasso

0
 
LVL 75

Accepted Solution

by:
Anthony Perkins earned 500 total points
Comment Utility
The way you shoud do it is:
1.  Import into staging table.
2.  Use a query with a GROUP BY clause based on the key columns to determine if any duplicates in the staging table if not
3.  Use a query with an INNER JOIN between the staging table and the production table based on the same key columns to determine if any duplicates in the staging table if not
4.  Import the now validated data into production
0

Featured Post

Get up to 2TB FREE CLOUD per backup license!

An exclusive Black Friday offer just for Expert Exchange audience! Buy any of our top-rated backup solutions & get up to 2TB free cloud per system! Perform local & cloud backup in the same step, and restore instantly—anytime, anywhere. Grab this deal now before it disappears!

Join & Write a Comment

Slowly Changing Dimension Transformation component in data task flow is very useful for us to manage and control how data changes in SSIS.
The Delta outage: 650 cancelled flights, more than 1200 delayed flights, thousands of frustrated customers, tens of millions of dollars in damages – plus untold reputational damage to one of the world’s most trusted airlines. All due to a catastroph…
Using examples as well as descriptions, and references to Books Online, show the documentation available for date manipulation functions and by using a select few of these functions, show how date based data can be manipulated with these functions.
Using examples as well as descriptions, and references to Books Online, show the documentation available for datatypes, explain the available data types and show how data can be passed into and out of variables.

772 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now