Solved

SQL Statement to delete rows where a column value is duplicated.

Posted on 2013-12-08
20
229 Views
Last Modified: 2013-12-08
I have the following example of duplicate rows:

30175130      3      1808      AA      2013-11-19 00:00:00.000      9.13
30176132      3      1808      AA      2013-11-19 00:00:00.000      9.13
30175454      3      4450      AAN      2013-11-19 00:00:00.000      29.80
30175787      3      8384      AAN      2013-11-19 00:00:00.000      29.80
30175788      3      8385      AAN      2013-11-19 00:00:00.000      29.80
30177864      3      4450      AAN      2013-11-19 00:00:00.000      29.80
30178679      3      8384      AAN      2013-11-19 00:00:00.000      29.80
30178680      3      8385      AAN      2013-11-19 00:00:00.000      29.80
30175131      3      1810      AAP      2013-11-19 00:00:00.000      97.06
30176133      3      1810      AAP      2013-11-19 00:00:00.000      97.06

How can I delete duplicates?

This should be the result:

30175130      3      1808      AA      2013-11-19 00:00:00.000      9.13
30175454      3      4450      AAN      2013-11-19 00:00:00.000      29.80
30175131      3      1810      AAP      2013-11-19 00:00:00.000      97.06

Thank you.
0
Comment
Question by:Dovberman
  • 10
  • 10
20 Comments
 
LVL 9

Expert Comment

by:guswebb
ID: 39704431
What are your column names?

Are you classifying a duplicate as any row that matches another row based on the value in column 4 irrespective of what is in column 3?

Can you also confirm how you wish to decide which row to keep? Is it simply the one with the lowest value in column 3?
0
 
LVL 9

Expert Comment

by:guswebb
ID: 39704444
Assuming I have understood correctly, and pre-empting your reply to the above questions, this may be the code you need. Replace the column names (col1, col2 etc) to match the column names in your table. @col1, @col2 etc are just variables for the purpose of holding values returned by the initial select statement, creating a record set to loop through, so no need to change them.

DECLARE @col1 int, @col3 varchar(max), @col4 varchar(max), @count int

DECLARE c1 CURSOR FOR
SELECT col1, col3, col4 FROM MyTable GROUP BY col4 ORDER BY col4, col3

OPEN c1
FETCH NEXT FROM c1 INTO @col1, @col3, @col4
WHILE @@FETCH_STATUS = 0
BEGIN

SET @count = (SELECT count(*) FROM MyTable WHERE col4 = @col4)
IF @count > 1 BEGIN

DELETE FROM MyTable WHERE col4 = @col4 AND NOT col1 = @col1

END

FETCH NEXT FROM c1 INTO @col1, @col3, @col4
END
CLOSE c1
DEALLOCATE c1
0
 

Author Comment

by:Dovberman
ID: 39704624
Col1            Col2    Col3      Col4     Col5                                     Col6
30175130      3      1808      AA      2013-11-19 00:00:00.000      9.13
30176132      3      1808      AA      2013-11-19 00:00:00.000      9.13

Col1 is an identity primary key.  It is not important. Keep either the highest or lowest Col1 value.

Col2 is the MarketID

FROM TableName

Delete all except the highest(or lowest) value of Col1 row where Col3 and Col4 are equal and Col5 =  '2013-11-19'    and Col2 = 3.

Doe this help clarify the requirement?

Thank you
0
 

Author Comment

by:Dovberman
ID: 39704636
I can add a Boolean Col named ToBeDeleted.

Perhaps the ToBeDeleted Col could be set to true WHERE Col2 = 3 and Col5 =  '2013-11-19' and Col1 is the lowest value and (Col3 = Col4 and Col5 =  '2013-11-19')
0
 
LVL 9

Expert Comment

by:guswebb
ID: 39704662
Ok, so use this code instead, which will flag the ToBeDeleted column with a value of 1 for all rows highlighted for deletion. You can then run a separate delete command when you are happy this has tagged all records correctly.

DECLARE @col1 int, @col3 varchar(max), @col4 varchar(max), @count int

DECLARE c1 CURSOR FOR
SELECT col1, col3, col4 FROM MyTable WHERE col2 = 3 and col5 = '19/11/2013' GROUP BY col4 ORDER BY col4, col3

OPEN c1
FETCH NEXT FROM c1 INTO @col1, @col3, @col4
WHILE @@FETCH_STATUS = 0
BEGIN

SET @count = (SELECT count(*) FROM MyTable WHERE col4 = @col4)
IF @count > 1 BEGIN

UPDATE MyTable SET ToBeDeleted = 1 WHERE col4 = @col4 and NOT col1 = @col1

END

FETCH NEXT FROM c1 INTO @col1, @col3, @col4
END
CLOSE c1
DEALLOCATE c1
0
 

Author Comment

by:Dovberman
ID: 39704751
Here is my stored procedure:

USE [stockprosql]
GO

/****** Object:  StoredProcedure [dbo].[usp_delDuplicates]    Script Date: 12/08/2013 14:30:57 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

-- =============================================
-- Author:            <Author,,Name>
-- Create date: <Create Date,,>
-- Description:      Marks duplicate StockHist rows foe deletion.
-- The ToBeDeleted Col could be set to true
-- WHERE Col2 = 3 and Col5 =  '2013-11-19'
-- and Col1 is the lowest value and (Col3 = Col4 and Col5 =  '2013-11-19')
-- =============================================

CREATE PROCEDURE [dbo].[usp_delDuplicates]

DECLARE @StockHistID int, @SymbolID int, @SymbolName nvarchar(8),
      @DupCount int

--Msg 156, Level 15, State 1, Procedure usp_delDuplicates, Line 13
--Incorrect syntax near the keyword 'DECLARE'. Compile error
      
      DECLARE DeleteItems CURSOR FOR
    SELECT StockHistID, SymbolID, SymbolName FROM StockHist
    GROUP BY SymbolName ORDER BY SymbolName, SymbolID

BEGIN
      -- SET NOCOUNT ON added to prevent extra result sets from
      -- interfering with SELECT statements.
      SET NOCOUNT ON;

    -- Insert statements for procedure here
OPEN DeleteItems
FETCH NEXT FROM DeleteItems INTO @StockHistID, @SymbolID, @SymbolName
WHILE @@FETCH_STATUS = 0
BEGIN

SET @DupCount = (SELECT count(*) FROM StockHist WHERE SymbolName = @SymbolName)

IF @DupCount > 1 BEGIN

UPDATE StockHist SET ToBeDeleted = 1 WHERE SymbolName = @SymbolName and NOT StockHistID = @StockHistID
END

FETCH NEXT FROM DeleteItems INTO @StockHistID, @SymbolID, @SymbolName
END
CLOSE DeleteItems
DEALLOCATE DeleteItems


Does not compile.

How can I fix this?

Thanks,
0
 
LVL 9

Expert Comment

by:guswebb
ID: 39704783
Change the variable declarations to this...

DECLARE @StockHistID int
DECLARE @SymbolID int
DECLARE @SymbolName nvarchar(8)
DECLARE @DupCount int

...and if you get an error post it so we can pinpoint what is not compiling properly.
0
 

Author Comment

by:Dovberman
ID: 39704819
This is the parameter error:

DECLARE @StockHistID int -- (PK,int,not null)

I am not properly specifying the data type
0
 
LVL 9

Expert Comment

by:guswebb
ID: 39704826
What are the data types for each of the columns; StockHistID, SymbolID, SymbolName ?
0
 

Author Comment

by:Dovberman
ID: 39704832
The datatype for StockHistID is  (PK,int,not null)

StockHistID is the only issue.

SymbolID is int
SymbolName is nvarchr(8)
0
Get up to 2TB FREE CLOUD per backup license!

An exclusive Black Friday offer just for Expert Exchange audience! Buy any of our top-rated backup solutions & get up to 2TB free cloud per system! Perform local & cloud backup in the same step, and restore instantly—anytime, anywhere. Grab this deal now before it disappears!

 
LVL 9

Expert Comment

by:guswebb
ID: 39704840
This doesn't make sense. If StockHistID is defined as int in your table then fetching values from that field in to @StockHistID which is also defined as int, should work just fine.
0
 

Author Comment

by:Dovberman
ID: 39704841
StockHistiD is large: 30175130
0
 
LVL 9

Expert Comment

by:guswebb
ID: 39704849
Change this line...

CREATE PROCEDURE [dbo].[usp_delDuplicates]

to

CREATE PROCEDURE [dbo].[usp_delDuplicates] AS
0
 

Author Comment

by:Dovberman
ID: 39704862
DECLARE is OK Now.

New Compile Error:

Msg 8120, Level 16, State 1, Procedure usp_delDuplicates, Line 20
Column 'StockHist.StockHistID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
0
 
LVL 9

Expert Comment

by:guswebb
ID: 39704868
Just remove the GROUP BY clause and it will work fine. So your select statement will look like this...

SELECT StockHistID, SymbolID, SymbolName FROM StockHist ORDER BY SymbolName, SymbolID

Out of interest, how many rows are in the StockHist table?
0
 

Author Comment

by:Dovberman
ID: 39704873
This change fixed it.

GROUP BY SymbolName,SymbolID,StockHistID ORDER BY SymbolName, SymbolID

Thanks,
0
 
LVL 9

Accepted Solution

by:
guswebb earned 400 total points
ID: 39704879
There's no point grouping by StockhistID as it's a PK so your record set will return the same number of rows as running the query without the GROUP BY.
0
 

Author Comment

by:Dovberman
ID: 39704883
I removed the GROUP BY clause. It compiles without error.

There are about 4000 rows in the StockHist table for a specific MarketID and QuoteDate.

This query will always filter for a specific MarketID and QuoteDate.

Thanks,
0
 

Author Closing Comment

by:Dovberman
ID: 39704886
Thanks for all your help.
0
 
LVL 9

Expert Comment

by:guswebb
ID: 39704889
No problem!
0

Featured Post

What Should I Do With This Threat Intelligence?

Are you wondering if you actually need threat intelligence? The answer is yes. We explain the basics for creating useful threat intelligence.

Join & Write a Comment

In this article I will describe the Detach & Attach method as one possible migration process and I will add the extra tasks needed for an upgrade when and where is applied so it will cover all.
The Delta outage: 650 cancelled flights, more than 1200 delayed flights, thousands of frustrated customers, tens of millions of dollars in damages – plus untold reputational damage to one of the world’s most trusted airlines. All due to a catastroph…
This videos aims to give the viewer a basic demonstration of how a user can query current session information by using the SYS_CONTEXT function
Viewers will learn how to use the SELECT statement in SQL and will be exposed to the many uses the SELECT statement has.

705 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

14 Experts available now in Live!

Get 1:1 Help Now