brasso_42
asked on
MSSQL Unique identifier based on row content
Hi
I'm writing a program which contains an import process (of csv data) before the data is committed to the main table I want to ensure that it is not duplicate data.
I think the best way to do this is to create a unique Identifier based on the row content a bit like an MP5 hash, but I don’t think this is supported in MSSQL (I’m using 2008 R2)
Can any one recommend a way to achieve this?
Thanks
Brasso
I'm writing a program which contains an import process (of csv data) before the data is committed to the main table I want to ensure that it is not duplicate data.
I think the best way to do this is to create a unique Identifier based on the row content a bit like an MP5 hash, but I don’t think this is supported in MSSQL (I’m using 2008 R2)
Can any one recommend a way to achieve this?
Thanks
Brasso
You can write a .NET Assembly to create the hash and then use it in a CLR UDF. So you can use that function to create the unique identifier while inserting data.
ASKER
Hi
Do you have an example of how to do that?
Thanks
Brasso
Do you have an example of how to do that?
Thanks
Brasso
You can use ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) to Delete Duplicate Rows in SQL Table
e.g.,
http://www.kodyaz.com/arti cles/delet e-duplicat e-rows-usi ng-row-num ber-partit ion-by-ord er-by.aspx
e.g.,
http://www.kodyaz.com/arti
ASKER
Hi
Necessarily want to delete them, I just want to be made aware of them.
Thanks
Brasso
Necessarily want to delete them, I just want to be made aware of them.
Thanks
Brasso
so instead of deleting you can insert them into another table of same structure.
eg.,
here Suppliers and Suppliers_Duplicates should have same structure.
WITH [CTE DUPLICATE] AS
(
SELECT
RN = ROW_NUMBER() OVER (PARTITION BY CompanyTitle ORDER BY LastContactDate DESC),
Id, CompanyTitle, ContactName, LastContactDate
FROM Suppliers
)
Insert into Suppliers_Duplicates select * FROM [CTE DUPLICATE] WHERE RN > 1
here Suppliers and Suppliers_Duplicates should have same structure.
Better:-
WITH [CTE DUPLICATE] AS
(
SELECT
RN = ROW_NUMBER() OVER (PARTITION BY CompanyTitle ORDER BY LastContactDate DESC),
Id, CompanyTitle, ContactName, LastContactDate
FROM Suppliers
)
Insert into Suppliers_Duplicates ( Id, CompanyTitle, ContactName, LastContactDate )
select Id, CompanyTitle, ContactName, LastContactDate FROM [CTE DUPLICATE] WHERE RN > 1
WITH [CTE DUPLICATE] AS
(
SELECT
RN = ROW_NUMBER() OVER (PARTITION BY CompanyTitle ORDER BY LastContactDate DESC),
Id, CompanyTitle, ContactName, LastContactDate
FROM Suppliers
)
Insert into Suppliers_Duplicates ( Id, CompanyTitle, ContactName, LastContactDate )
select Id, CompanyTitle, ContactName, LastContactDate FROM [CTE DUPLICATE] WHERE RN > 1
The correct way to do this is always, always import into a staging table, validate your data prior to inserting into the final production table(s). Remember the old axiom: Garbage in, garbage out.
>>I want to ensure that it is not duplicate data. <<
Before we can help you here, we need to know your definition of "duplicate data", is it:
1. Duplicate imported data or
2. Duplicate data when comparing imported data to existing data or
3. Something else.
>>I want to ensure that it is not duplicate data. <<
Before we can help you here, we need to know your definition of "duplicate data", is it:
1. Duplicate imported data or
2. Duplicate data when comparing imported data to existing data or
3. Something else.
ASKER
Hi Acperkins
This is exactly what I want to do. I want to validate on both point one and two.
Thanks
Brasso
This is exactly what I want to do. I want to validate on both point one and two.
Thanks
Brasso
ASKER CERTIFIED SOLUTION
membership
Create a free account to see this answer
Signing up is free and takes 30 seconds. No credit card required.