sql duplicates

On a daily basis I am extracting records from a table and storing the results in a storage table, then based on certain criteria I am updating two columns with a flag.
So my table may be duplicating data on each extraction.

So my question is how do I remove duplicate data, bearing in mind that the following scenario will occur. I want to keep the latest copy but remove the older one.
Row1
Column1      Column2      Column3      Column4
Fred            flagN            flagN            date

Then if for that day they match the update criteria it will be
Row2
Column1      Column2      Column3      Column4
Fred            flagY            flagY            date

Please note my table contains roughly 30 columns.

You see the problem is I cannot simply do a select distinct or having count > 1.
Also there are no primary keys.

I’ve added a column to my table called extraction date, example column4, so I am thinking remove duplicate data based on the latest extraction date.

But this is were I am having a problem, because in effect row1 and row 2 are not appear not to be duplicates. So I cannot do a select distinct.


aneilgAsked:
Who is Participating?
 
Paul_Harris_FusionConnect With a Mentor Commented:
The ROW_NUMBER function is very useful for selecting 1 from a number of records

e.g.

Select * FROM
(
Select T.*,  ROW_NUMBER() OVER (PARTITION BY Column1 ORDER BY LastUpdated Desc) as ROW_NUM
) Q
WHERE ROW_NUM =1


In this example, the inner query Q assigns a row number to each row, resetting the row number every time Column1 changes and sorting within the Column1 by a column called LastUpdated.    
The outer query just selects records that have been assigned a row number of 1

Try specifying your own values for the PARTITION BY  clause to control the resetting of the row number.
0
 
raysonleeCommented:
SELECT  * FROM table GROUP BY Column1 ORDER BY Column4 DESC
0
 
dhruvpratapCommented:
I think you can achieve this using inner queries

SELECT * from table t, (SELECT Column1, max(Column4) from table GROUP BY Column1) temp
where
t.Column1 = temp.Column1and
t.Column4 = temp.Column4

Hope this helps!
0
Get expert help—faster!

Need expert help—fast? Use the Help Bell for personalized assistance getting answers to your important questions.

 
Paul_Harris_FusionCommented:
Sorry forgot the FROM part

Select * FROM
(
  Select
     T.*,
     ROW_NUMBER() OVER (PARTITION BY Column1 ORDER BY LastUpdated Desc) as ROW_NUM
  FROM
    MyTable T
) Q
WHERE ROW_NUM =1
0
 
aneilgAuthor Commented:
Thanks guys the issue with

Select * FROM
(
  Select
     T.*,
     ROW_NUMBER() OVER (PARTITION BY Column1 ORDER BY LastUpdated Desc) as ROW_NUM
  FROM
    MyTable T
) Q
WHERE ROW_NUM =1

is that it only displays one record. As an example my table might contain a 100 records. And out of those 100 records 20 might have been updated with a flag ‘flagY’ which means the previous 20 records which had the flag set to ‘flagN’ need to be deleted.

i have also tried SELECT * from table t, (SELECT Column1, max(Column4) from table GROUP BY Column1) temp
where
t.Column1 = temp.Column1and
t.Column4 = temp.Column4

but get an error.

Msg 156, Level 15, State 1, Line 2
Incorrect syntax near the keyword 'table'.
Msg 156, Level 15, State 1, Line 2
Incorrect syntax near the keyword 'table'.
0
 
dhruvpratapCommented:
Sorry for that, try this:

SELECT * 
from table outer, 
(SELECT inner.Column1 inName, max(inner.Column4) maxDate from table inner GROUP BY inner.Column1) as temp
where
outer.Column1 = temp.inName and
outer.Column4 = temp.maxDate 

Open in new window

0
 
Paul_Harris_FusionCommented:
Hi aneilg,

My query should return 1 record for each value of Column1 - so if you are getting 1 returned from a table of 100, do all the rows have the same value in Column1?

I don't really understand how you are defining duplicates?   From your data example, it looked like Column1 contained an identifier for an entity and the other columns contain current values for the attributes?

So my current assumption is:

You create a table called MyTable by extracting from some main table.
Column1 = identifier for an entity/record from the main table
Column4= date of extraction from the main table

If this is true, and you want to remove old versions from the storage table, then try

Delete from MyTable T1
Where Column4 <    ( Select max(Column4) from MyTable T2 where T1.Column1=T2.Column1  )

If you want to preview the final results
Select * From MyTable T1
Where Column4  =    ( Select max(Column4) from MyTable T2 where T1.Column1=T2.Column1  )
0
 
aneilgAuthor Commented:
hi paul thanks for that.

my mistake, i misunderstood what you were getting at.

thanks for your help.

0
 
aneilgAuthor Commented:
Fan'dabi'dozi, thanks a lot guys.
0
 
awking00Commented:
I'm still not sure of how you define duplicates. Can you post some sample data and how it should look after the query?
0
 
aneilgAuthor Commented:
just a question.

i am having trouble with the delete.
DELETE FROM [ELTO_Storage] T1
      WHERE [EXTRACTION_DATE] < 
            (SELECT max([EXTRACTION_DATE])FROM [ELTO_Storage] T2 WHERE T1.[BROKER_REF] = T2.[BROKER_REF])
            
Msg 102, Level 15, State 1, Line 1
Incorrect syntax near 'T1'.

its because of the alias. any ideas.


0
 
aneilgAuthor Commented:
Row1
Column1      Column2      Column3      Column4 Column5 Column6
Fred            flagN            flagN            date       Add1      Add2

Then if for that day they match the update criteria it will be
Row2
Column1      Column2      Column3      Column4 Column5 Column6
Fred            flagY            flagY            date      Add1     Add2

in the above example this would be classed as a duplicate because the columns with the flags in have been updated. but the rest of the data is the same. please note my table has 30 columns, and only the columns with the flags in have been updated, that day to say they meet the updated Criteria .

0
 
awking00Commented:
I think you may need the keyword "AS" T1
0
 
Paul_Harris_FusionConnect With a Mentor Commented:
If this is SQL Server,  you cannot use an alias when updating or deleting a table.

Try

DELETE FROM [ELTO_Storage]
      WHERE [EXTRACTION_DATE] < 
            (SELECT max([EXTRACTION_DATE])FROM [ELTO_Storage] T2
              WHERE [ELTO_Storage].[BROKER_REF] = T2.[BROKER_REF]
           )
           
0
 
aneilgAuthor Commented:
once again thanks paul.
0
 
aneilgAuthor Commented:
good
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.