asked on

sql duplicates

On a daily basis I am extracting records from a table and storing the results in a storage table, then based on certain criteria I am updating two columns with a flag.
So my table may be duplicating data on each extraction.

So my question is how do I remove duplicate data, bearing in mind that the following scenario will occur. I want to keep the latest copy but remove the older one.
Row1
Column1      Column2      Column3      Column4
Fred            flagN            flagN            date

Then if for that day they match the update criteria it will be
Row2
Column1      Column2      Column3      Column4
Fred            flagY            flagY            date

Please note my table contains roughly 30 columns.

You see the problem is I cannot simply do a select distinct or having count > 1.
Also there are no primary keys.

I’ve added a column to my table called extraction date, example column4, so I am thinking remove duplicate data based on the latest extraction date.

But this is were I am having a problem, because in effect row1 and row 2 are not appear not to be duplicates. So I cannot do a select distinct.

raysonlee

SELECT * FROM table GROUP BY Column1 ORDER BY Column4 DESC

ASKER CERTIFIED SOLUTION

Paul_Harris_Fusion

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

dhruvpratap

I think you can achieve this using inner queries

SELECT * from table t, (SELECT Column1, max(Column4) from table GROUP BY Column1) temp
where
t.Column1 = temp.Column1and
t.Column4 = temp.Column4

Hope this helps!

Paul_Harris_Fusion

Sorry forgot the FROM part

Select * FROM
(
Select
T.*,
ROW_NUMBER() OVER (PARTITION BY Column1 ORDER BY LastUpdated Desc) as ROW_NUM
FROM
MyTable T
) Q
WHERE ROW_NUM =1

aneilg

ASKER

Thanks guys the issue with

Select * FROM
(
Select
T.*,
ROW_NUMBER() OVER (PARTITION BY Column1 ORDER BY LastUpdated Desc) as ROW_NUM
FROM
MyTable T
) Q
WHERE ROW_NUM =1

is that it only displays one record. As an example my table might contain a 100 records. And out of those 100 records 20 might have been updated with a flag ‘flagY’ which means the previous 20 records which had the flag set to ‘flagN’ need to be deleted.

i have also tried SELECT * from table t, (SELECT Column1, max(Column4) from table GROUP BY Column1) temp
where
t.Column1 = temp.Column1and
t.Column4 = temp.Column4

but get an error.

Msg 156, Level 15, State 1, Line 2
Incorrect syntax near the keyword 'table'.
Msg 156, Level 15, State 1, Line 2
Incorrect syntax near the keyword 'table'.

dhruvpratap

Sorry for that, try this:

SELECT * 
from table outer, 
(SELECT inner.Column1 inName, max(inner.Column4) maxDate from table inner GROUP BY inner.Column1) as temp
where
outer.Column1 = temp.inName and
outer.Column4 = temp.maxDate

Open in new window

Paul_Harris_Fusion

Hi aneilg,

My query should return 1 record for each value of Column1 - so if you are getting 1 returned from a table of 100, do all the rows have the same value in Column1?

I don't really understand how you are defining duplicates? From your data example, it looked like Column1 contained an identifier for an entity and the other columns contain current values for the attributes?

So my current assumption is:

You create a table called MyTable by extracting from some main table.
Column1 = identifier for an entity/record from the main table
Column4= date of extraction from the main table

If this is true, and you want to remove old versions from the storage table, then try

Delete from MyTable T1
Where Column4 < ( Select max(Column4) from MyTable T2 where T1.Column1=T2.Column1 )

If you want to preview the final results
Select * From MyTable T1
Where Column4 = ( Select max(Column4) from MyTable T2 where T1.Column1=T2.Column1 )

aneilg

ASKER

hi paul thanks for that.

my mistake, i misunderstood what you were getting at.

thanks for your help.

aneilg

ASKER

Fan'dabi'dozi, thanks a lot guys.

awking00

I'm still not sure of how you define duplicates. Can you post some sample data and how it should look after the query?

aneilg

ASKER

just a question.

i am having trouble with the delete.
DELETE FROM [ELTO_Storage] T1
      WHERE [EXTRACTION_DATE] <
            (SELECT max([EXTRACTION_DATE])FROM [ELTO_Storage] T2 WHERE T1.[BROKER_REF] = T2.[BROKER_REF])

Msg 102, Level 15, State 1, Line 1
Incorrect syntax near 'T1'.

its because of the alias. any ideas.

aneilg

ASKER

Row1
Column1 Column2 Column3 Column4 Column5 Column6
Fred flagN flagN date Add1 Add2

Then if for that day they match the update criteria it will be
Row2
Column1 Column2 Column3 Column4 Column5 Column6
Fred flagY flagY date Add1 Add2

in the above example this would be classed as a duplicate because the columns with the flags in have been updated. but the rest of the data is the same. please note my table has 30 columns, and only the columns with the flags in have been updated, that day to say they meet the updated Criteria .

awking00

I think you may need the keyword "AS" T1

SOLUTION

Paul_Harris_Fusion

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

aneilg

ASKER

once again thanks paul.

aneilg

ASKER

good