• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 529
  • Last Modified:

MSSQL Unique identifier based on row content

Hi

I'm writing a program which contains an import process (of csv data) before the data is committed to the main table I want to ensure that it is not duplicate data.  

I think the best way to do this is to create a unique Identifier based on the row content a bit like an MP5 hash, but I don’t think this is supported in MSSQL (I’m using 2008 R2)

Can any one recommend a way to achieve this?

Thanks

Brasso
0
brasso_42
Asked:
brasso_42
  • 4
  • 3
  • 2
  • +1
1 Solution
 
kamindaCommented:
You can write a .NET Assembly to create the hash and then use it in a CLR UDF. So you can use that function to create the unique identifier while inserting data.
0
 
brasso_42Author Commented:
Hi

Do you have an example of how to do that?

Thanks

Brasso
0
 
mayank_joshiCommented:
You can use ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) to Delete Duplicate Rows in SQL Table

e.g.,
http://www.kodyaz.com/articles/delete-duplicate-rows-using-row-number-partition-by-order-by.aspx
0
Cloud Class® Course: Microsoft Azure 2017

Azure has a changed a lot since it was originally introduce by adding new services and features. Do you know everything you need to about Azure? This course will teach you about the Azure App Service, monitoring and application insights, DevOps, and Team Services.

 
brasso_42Author Commented:
Hi

Necessarily want to delete them, I just want to be made aware of them.

Thanks

Brasso
0
 
mayank_joshiCommented:
so instead of deleting you can insert them into another table of same structure.
0
 
mayank_joshiCommented:
eg.,
WITH [CTE DUPLICATE] AS
(
SELECT
  RN = ROW_NUMBER() OVER (PARTITION BY CompanyTitle ORDER BY LastContactDate DESC),
  Id, CompanyTitle, ContactName, LastContactDate
FROM Suppliers
)
Insert into Suppliers_Duplicates select *  FROM [CTE DUPLICATE] WHERE RN > 1

Open in new window



here Suppliers and Suppliers_Duplicates should have same structure.
0
 
mayank_joshiCommented:
Better:-

WITH [CTE DUPLICATE] AS
(
SELECT
  RN = ROW_NUMBER() OVER (PARTITION BY CompanyTitle ORDER BY LastContactDate DESC),
  Id, CompanyTitle, ContactName, LastContactDate
FROM Suppliers
)
Insert into Suppliers_Duplicates ( Id, CompanyTitle, ContactName, LastContactDate )
select  Id, CompanyTitle, ContactName, LastContactDate  FROM [CTE DUPLICATE] WHERE RN > 1
0
 
Anthony PerkinsCommented:
The correct way to do this is always, always import into a staging table, validate your data prior to inserting into the final production table(s).  Remember the old axiom: Garbage in, garbage out.

>>I want to ensure that it is not duplicate data. <<
Before we can help you here, we need to know your definition of "duplicate data", is it:
1. Duplicate imported data or
2. Duplicate data when comparing imported data to existing data or
3. Something else.
0
 
brasso_42Author Commented:
Hi Acperkins

This is exactly what I want to do.   I want to validate on both point one and two.  

Thanks

Brasso

0
 
Anthony PerkinsCommented:
The way you shoud do it is:
1.  Import into staging table.
2.  Use a query with a GROUP BY clause based on the key columns to determine if any duplicates in the staging table if not
3.  Use a query with an INNER JOIN between the staging table and the production table based on the same key columns to determine if any duplicates in the staging table if not
4.  Import the now validated data into production
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

Join & Write a Comment

Featured Post

Cloud Class® Course: Ruby Fundamentals

This course will introduce you to Ruby, as well as teach you about classes, methods, variables, data structures, loops, enumerable methods, and finishing touches.

  • 4
  • 3
  • 2
  • +1
Tackle projects and never again get stuck behind a technical roadblock.
Join Now