Want to win a PS4? Go Premium and enter to win our High-Tech Treats giveaway. Enter to Win

x
?
Solved

SQL Statement to delete rows where a column value is duplicated.

Posted on 2013-12-08
20
Medium Priority
?
238 Views
Last Modified: 2013-12-08
I have the following example of duplicate rows:

30175130      3      1808      AA      2013-11-19 00:00:00.000      9.13
30176132      3      1808      AA      2013-11-19 00:00:00.000      9.13
30175454      3      4450      AAN      2013-11-19 00:00:00.000      29.80
30175787      3      8384      AAN      2013-11-19 00:00:00.000      29.80
30175788      3      8385      AAN      2013-11-19 00:00:00.000      29.80
30177864      3      4450      AAN      2013-11-19 00:00:00.000      29.80
30178679      3      8384      AAN      2013-11-19 00:00:00.000      29.80
30178680      3      8385      AAN      2013-11-19 00:00:00.000      29.80
30175131      3      1810      AAP      2013-11-19 00:00:00.000      97.06
30176133      3      1810      AAP      2013-11-19 00:00:00.000      97.06

How can I delete duplicates?

This should be the result:

30175130      3      1808      AA      2013-11-19 00:00:00.000      9.13
30175454      3      4450      AAN      2013-11-19 00:00:00.000      29.80
30175131      3      1810      AAP      2013-11-19 00:00:00.000      97.06

Thank you.
0
Comment
Question by:Dovberman
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 10
  • 10
20 Comments
 
LVL 9

Expert Comment

by:guswebb
ID: 39704431
What are your column names?

Are you classifying a duplicate as any row that matches another row based on the value in column 4 irrespective of what is in column 3?

Can you also confirm how you wish to decide which row to keep? Is it simply the one with the lowest value in column 3?
0
 
LVL 9

Expert Comment

by:guswebb
ID: 39704444
Assuming I have understood correctly, and pre-empting your reply to the above questions, this may be the code you need. Replace the column names (col1, col2 etc) to match the column names in your table. @col1, @col2 etc are just variables for the purpose of holding values returned by the initial select statement, creating a record set to loop through, so no need to change them.

DECLARE @col1 int, @col3 varchar(max), @col4 varchar(max), @count int

DECLARE c1 CURSOR FOR
SELECT col1, col3, col4 FROM MyTable GROUP BY col4 ORDER BY col4, col3

OPEN c1
FETCH NEXT FROM c1 INTO @col1, @col3, @col4
WHILE @@FETCH_STATUS = 0
BEGIN

SET @count = (SELECT count(*) FROM MyTable WHERE col4 = @col4)
IF @count > 1 BEGIN

DELETE FROM MyTable WHERE col4 = @col4 AND NOT col1 = @col1

END

FETCH NEXT FROM c1 INTO @col1, @col3, @col4
END
CLOSE c1
DEALLOCATE c1
0
 

Author Comment

by:Dovberman
ID: 39704624
Col1            Col2    Col3      Col4     Col5                                     Col6
30175130      3      1808      AA      2013-11-19 00:00:00.000      9.13
30176132      3      1808      AA      2013-11-19 00:00:00.000      9.13

Col1 is an identity primary key.  It is not important. Keep either the highest or lowest Col1 value.

Col2 is the MarketID

FROM TableName

Delete all except the highest(or lowest) value of Col1 row where Col3 and Col4 are equal and Col5 =  '2013-11-19'    and Col2 = 3.

Doe this help clarify the requirement?

Thank you
0
Get free NFR key for Veeam Availability Suite 9.5

Veeam is happy to provide a free NFR license (1 year, 2 sockets) to all certified IT Pros. The license allows for the non-production use of Veeam Availability Suite v9.5 in your home lab, without any feature limitations. It works for both VMware and Hyper-V environments

 

Author Comment

by:Dovberman
ID: 39704636
I can add a Boolean Col named ToBeDeleted.

Perhaps the ToBeDeleted Col could be set to true WHERE Col2 = 3 and Col5 =  '2013-11-19' and Col1 is the lowest value and (Col3 = Col4 and Col5 =  '2013-11-19')
0
 
LVL 9

Expert Comment

by:guswebb
ID: 39704662
Ok, so use this code instead, which will flag the ToBeDeleted column with a value of 1 for all rows highlighted for deletion. You can then run a separate delete command when you are happy this has tagged all records correctly.

DECLARE @col1 int, @col3 varchar(max), @col4 varchar(max), @count int

DECLARE c1 CURSOR FOR
SELECT col1, col3, col4 FROM MyTable WHERE col2 = 3 and col5 = '19/11/2013' GROUP BY col4 ORDER BY col4, col3

OPEN c1
FETCH NEXT FROM c1 INTO @col1, @col3, @col4
WHILE @@FETCH_STATUS = 0
BEGIN

SET @count = (SELECT count(*) FROM MyTable WHERE col4 = @col4)
IF @count > 1 BEGIN

UPDATE MyTable SET ToBeDeleted = 1 WHERE col4 = @col4 and NOT col1 = @col1

END

FETCH NEXT FROM c1 INTO @col1, @col3, @col4
END
CLOSE c1
DEALLOCATE c1
0
 

Author Comment

by:Dovberman
ID: 39704751
Here is my stored procedure:

USE [stockprosql]
GO

/****** Object:  StoredProcedure [dbo].[usp_delDuplicates]    Script Date: 12/08/2013 14:30:57 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

-- =============================================
-- Author:            <Author,,Name>
-- Create date: <Create Date,,>
-- Description:      Marks duplicate StockHist rows foe deletion.
-- The ToBeDeleted Col could be set to true
-- WHERE Col2 = 3 and Col5 =  '2013-11-19'
-- and Col1 is the lowest value and (Col3 = Col4 and Col5 =  '2013-11-19')
-- =============================================

CREATE PROCEDURE [dbo].[usp_delDuplicates]

DECLARE @StockHistID int, @SymbolID int, @SymbolName nvarchar(8),
      @DupCount int

--Msg 156, Level 15, State 1, Procedure usp_delDuplicates, Line 13
--Incorrect syntax near the keyword 'DECLARE'. Compile error
      
      DECLARE DeleteItems CURSOR FOR
    SELECT StockHistID, SymbolID, SymbolName FROM StockHist
    GROUP BY SymbolName ORDER BY SymbolName, SymbolID

BEGIN
      -- SET NOCOUNT ON added to prevent extra result sets from
      -- interfering with SELECT statements.
      SET NOCOUNT ON;

    -- Insert statements for procedure here
OPEN DeleteItems
FETCH NEXT FROM DeleteItems INTO @StockHistID, @SymbolID, @SymbolName
WHILE @@FETCH_STATUS = 0
BEGIN

SET @DupCount = (SELECT count(*) FROM StockHist WHERE SymbolName = @SymbolName)

IF @DupCount > 1 BEGIN

UPDATE StockHist SET ToBeDeleted = 1 WHERE SymbolName = @SymbolName and NOT StockHistID = @StockHistID
END

FETCH NEXT FROM DeleteItems INTO @StockHistID, @SymbolID, @SymbolName
END
CLOSE DeleteItems
DEALLOCATE DeleteItems


Does not compile.

How can I fix this?

Thanks,
0
 
LVL 9

Expert Comment

by:guswebb
ID: 39704783
Change the variable declarations to this...

DECLARE @StockHistID int
DECLARE @SymbolID int
DECLARE @SymbolName nvarchar(8)
DECLARE @DupCount int

...and if you get an error post it so we can pinpoint what is not compiling properly.
0
 

Author Comment

by:Dovberman
ID: 39704819
This is the parameter error:

DECLARE @StockHistID int -- (PK,int,not null)

I am not properly specifying the data type
0
 
LVL 9

Expert Comment

by:guswebb
ID: 39704826
What are the data types for each of the columns; StockHistID, SymbolID, SymbolName ?
0
 

Author Comment

by:Dovberman
ID: 39704832
The datatype for StockHistID is  (PK,int,not null)

StockHistID is the only issue.

SymbolID is int
SymbolName is nvarchr(8)
0
 
LVL 9

Expert Comment

by:guswebb
ID: 39704840
This doesn't make sense. If StockHistID is defined as int in your table then fetching values from that field in to @StockHistID which is also defined as int, should work just fine.
0
 

Author Comment

by:Dovberman
ID: 39704841
StockHistiD is large: 30175130
0
 
LVL 9

Expert Comment

by:guswebb
ID: 39704849
Change this line...

CREATE PROCEDURE [dbo].[usp_delDuplicates]

to

CREATE PROCEDURE [dbo].[usp_delDuplicates] AS
0
 

Author Comment

by:Dovberman
ID: 39704862
DECLARE is OK Now.

New Compile Error:

Msg 8120, Level 16, State 1, Procedure usp_delDuplicates, Line 20
Column 'StockHist.StockHistID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
0
 
LVL 9

Expert Comment

by:guswebb
ID: 39704868
Just remove the GROUP BY clause and it will work fine. So your select statement will look like this...

SELECT StockHistID, SymbolID, SymbolName FROM StockHist ORDER BY SymbolName, SymbolID

Out of interest, how many rows are in the StockHist table?
0
 

Author Comment

by:Dovberman
ID: 39704873
This change fixed it.

GROUP BY SymbolName,SymbolID,StockHistID ORDER BY SymbolName, SymbolID

Thanks,
0
 
LVL 9

Accepted Solution

by:
guswebb earned 1600 total points
ID: 39704879
There's no point grouping by StockhistID as it's a PK so your record set will return the same number of rows as running the query without the GROUP BY.
0
 

Author Comment

by:Dovberman
ID: 39704883
I removed the GROUP BY clause. It compiles without error.

There are about 4000 rows in the StockHist table for a specific MarketID and QuoteDate.

This query will always filter for a specific MarketID and QuoteDate.

Thanks,
0
 

Author Closing Comment

by:Dovberman
ID: 39704886
Thanks for all your help.
0
 
LVL 9

Expert Comment

by:guswebb
ID: 39704889
No problem!
0

Featured Post

Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

This article shows gives you an overview on SQL Server 2016 row level security. You will also get to know the usages of row-level-security and how it works
In the first part of this tutorial we will cover the prerequisites for installing SQL Server vNext on Linux.
Viewers will learn how the fundamental information of how to create a table.
Viewers will learn how to use the SELECT statement in SQL and will be exposed to the many uses the SELECT statement has.

609 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question