• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 240
  • Last Modified:

SQL Statement to delete rows where a column value is duplicated.

I have the following example of duplicate rows:

30175130      3      1808      AA      2013-11-19 00:00:00.000      9.13
30176132      3      1808      AA      2013-11-19 00:00:00.000      9.13
30175454      3      4450      AAN      2013-11-19 00:00:00.000      29.80
30175787      3      8384      AAN      2013-11-19 00:00:00.000      29.80
30175788      3      8385      AAN      2013-11-19 00:00:00.000      29.80
30177864      3      4450      AAN      2013-11-19 00:00:00.000      29.80
30178679      3      8384      AAN      2013-11-19 00:00:00.000      29.80
30178680      3      8385      AAN      2013-11-19 00:00:00.000      29.80
30175131      3      1810      AAP      2013-11-19 00:00:00.000      97.06
30176133      3      1810      AAP      2013-11-19 00:00:00.000      97.06

How can I delete duplicates?

This should be the result:

30175130      3      1808      AA      2013-11-19 00:00:00.000      9.13
30175454      3      4450      AAN      2013-11-19 00:00:00.000      29.80
30175131      3      1810      AAP      2013-11-19 00:00:00.000      97.06

Thank you.
0
Dovberman
Asked:
Dovberman
  • 10
  • 10
1 Solution
 
guswebbCommented:
What are your column names?

Are you classifying a duplicate as any row that matches another row based on the value in column 4 irrespective of what is in column 3?

Can you also confirm how you wish to decide which row to keep? Is it simply the one with the lowest value in column 3?
0
 
guswebbCommented:
Assuming I have understood correctly, and pre-empting your reply to the above questions, this may be the code you need. Replace the column names (col1, col2 etc) to match the column names in your table. @col1, @col2 etc are just variables for the purpose of holding values returned by the initial select statement, creating a record set to loop through, so no need to change them.

DECLARE @col1 int, @col3 varchar(max), @col4 varchar(max), @count int

DECLARE c1 CURSOR FOR
SELECT col1, col3, col4 FROM MyTable GROUP BY col4 ORDER BY col4, col3

OPEN c1
FETCH NEXT FROM c1 INTO @col1, @col3, @col4
WHILE @@FETCH_STATUS = 0
BEGIN

SET @count = (SELECT count(*) FROM MyTable WHERE col4 = @col4)
IF @count > 1 BEGIN

DELETE FROM MyTable WHERE col4 = @col4 AND NOT col1 = @col1

END

FETCH NEXT FROM c1 INTO @col1, @col3, @col4
END
CLOSE c1
DEALLOCATE c1
0
 
DovbermanAuthor Commented:
Col1            Col2    Col3      Col4     Col5                                     Col6
30175130      3      1808      AA      2013-11-19 00:00:00.000      9.13
30176132      3      1808      AA      2013-11-19 00:00:00.000      9.13

Col1 is an identity primary key.  It is not important. Keep either the highest or lowest Col1 value.

Col2 is the MarketID

FROM TableName

Delete all except the highest(or lowest) value of Col1 row where Col3 and Col4 are equal and Col5 =  '2013-11-19'    and Col2 = 3.

Doe this help clarify the requirement?

Thank you
0
Restore individual SQL databases with ease

Veeam Explorer for Microsoft SQL Server delivers an easy-to-use, wizard-driven interface for restoring your databases from a backup. No expert SQL background required. Web interface provides a complete view of all available SQL databases to simplify the recovery of lost database

 
DovbermanAuthor Commented:
I can add a Boolean Col named ToBeDeleted.

Perhaps the ToBeDeleted Col could be set to true WHERE Col2 = 3 and Col5 =  '2013-11-19' and Col1 is the lowest value and (Col3 = Col4 and Col5 =  '2013-11-19')
0
 
guswebbCommented:
Ok, so use this code instead, which will flag the ToBeDeleted column with a value of 1 for all rows highlighted for deletion. You can then run a separate delete command when you are happy this has tagged all records correctly.

DECLARE @col1 int, @col3 varchar(max), @col4 varchar(max), @count int

DECLARE c1 CURSOR FOR
SELECT col1, col3, col4 FROM MyTable WHERE col2 = 3 and col5 = '19/11/2013' GROUP BY col4 ORDER BY col4, col3

OPEN c1
FETCH NEXT FROM c1 INTO @col1, @col3, @col4
WHILE @@FETCH_STATUS = 0
BEGIN

SET @count = (SELECT count(*) FROM MyTable WHERE col4 = @col4)
IF @count > 1 BEGIN

UPDATE MyTable SET ToBeDeleted = 1 WHERE col4 = @col4 and NOT col1 = @col1

END

FETCH NEXT FROM c1 INTO @col1, @col3, @col4
END
CLOSE c1
DEALLOCATE c1
0
 
DovbermanAuthor Commented:
Here is my stored procedure:

USE [stockprosql]
GO

/****** Object:  StoredProcedure [dbo].[usp_delDuplicates]    Script Date: 12/08/2013 14:30:57 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

-- =============================================
-- Author:            <Author,,Name>
-- Create date: <Create Date,,>
-- Description:      Marks duplicate StockHist rows foe deletion.
-- The ToBeDeleted Col could be set to true
-- WHERE Col2 = 3 and Col5 =  '2013-11-19'
-- and Col1 is the lowest value and (Col3 = Col4 and Col5 =  '2013-11-19')
-- =============================================

CREATE PROCEDURE [dbo].[usp_delDuplicates]

DECLARE @StockHistID int, @SymbolID int, @SymbolName nvarchar(8),
      @DupCount int

--Msg 156, Level 15, State 1, Procedure usp_delDuplicates, Line 13
--Incorrect syntax near the keyword 'DECLARE'. Compile error
      
      DECLARE DeleteItems CURSOR FOR
    SELECT StockHistID, SymbolID, SymbolName FROM StockHist
    GROUP BY SymbolName ORDER BY SymbolName, SymbolID

BEGIN
      -- SET NOCOUNT ON added to prevent extra result sets from
      -- interfering with SELECT statements.
      SET NOCOUNT ON;

    -- Insert statements for procedure here
OPEN DeleteItems
FETCH NEXT FROM DeleteItems INTO @StockHistID, @SymbolID, @SymbolName
WHILE @@FETCH_STATUS = 0
BEGIN

SET @DupCount = (SELECT count(*) FROM StockHist WHERE SymbolName = @SymbolName)

IF @DupCount > 1 BEGIN

UPDATE StockHist SET ToBeDeleted = 1 WHERE SymbolName = @SymbolName and NOT StockHistID = @StockHistID
END

FETCH NEXT FROM DeleteItems INTO @StockHistID, @SymbolID, @SymbolName
END
CLOSE DeleteItems
DEALLOCATE DeleteItems


Does not compile.

How can I fix this?

Thanks,
0
 
guswebbCommented:
Change the variable declarations to this...

DECLARE @StockHistID int
DECLARE @SymbolID int
DECLARE @SymbolName nvarchar(8)
DECLARE @DupCount int

...and if you get an error post it so we can pinpoint what is not compiling properly.
0
 
DovbermanAuthor Commented:
This is the parameter error:

DECLARE @StockHistID int -- (PK,int,not null)

I am not properly specifying the data type
0
 
guswebbCommented:
What are the data types for each of the columns; StockHistID, SymbolID, SymbolName ?
0
 
DovbermanAuthor Commented:
The datatype for StockHistID is  (PK,int,not null)

StockHistID is the only issue.

SymbolID is int
SymbolName is nvarchr(8)
0
 
guswebbCommented:
This doesn't make sense. If StockHistID is defined as int in your table then fetching values from that field in to @StockHistID which is also defined as int, should work just fine.
0
 
DovbermanAuthor Commented:
StockHistiD is large: 30175130
0
 
guswebbCommented:
Change this line...

CREATE PROCEDURE [dbo].[usp_delDuplicates]

to

CREATE PROCEDURE [dbo].[usp_delDuplicates] AS
0
 
DovbermanAuthor Commented:
DECLARE is OK Now.

New Compile Error:

Msg 8120, Level 16, State 1, Procedure usp_delDuplicates, Line 20
Column 'StockHist.StockHistID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
0
 
guswebbCommented:
Just remove the GROUP BY clause and it will work fine. So your select statement will look like this...

SELECT StockHistID, SymbolID, SymbolName FROM StockHist ORDER BY SymbolName, SymbolID

Out of interest, how many rows are in the StockHist table?
0
 
DovbermanAuthor Commented:
This change fixed it.

GROUP BY SymbolName,SymbolID,StockHistID ORDER BY SymbolName, SymbolID

Thanks,
0
 
guswebbCommented:
There's no point grouping by StockhistID as it's a PK so your record set will return the same number of rows as running the query without the GROUP BY.
0
 
DovbermanAuthor Commented:
I removed the GROUP BY clause. It compiles without error.

There are about 4000 rows in the StockHist table for a specific MarketID and QuoteDate.

This query will always filter for a specific MarketID and QuoteDate.

Thanks,
0
 
DovbermanAuthor Commented:
Thanks for all your help.
0
 
guswebbCommented:
No problem!
0

Featured Post

NEW Veeam Backup for Microsoft Office 365 1.5

With Office 365, it’s your data and your responsibility to protect it. NEW Veeam Backup for Microsoft Office 365 eliminates the risk of losing access to your Office 365 data.

  • 10
  • 10
Tackle projects and never again get stuck behind a technical roadblock.
Join Now