Solved

SQL remove duplicates, keeping 1

Posted on 2014-07-21
9
242 Views
Last Modified: 2014-07-21
Experts,

This question has come up a LOT, and I apologize for asking.  I've done some extensive searching on this topic and have some of the fundamentals, but I need a nudge in TSQL to be able to accomplish what I'm trying.  https://www.simple-talk.com/sql/t-sql-programming/removing-duplicates-from-a-table-in-sql-server/ has the best write-up I've seen so far, but I'm having trouble trying to sponge over the statement to meld it into our environment.

In a table that has approx 1.9 mil rows, there are MANY duplicates.  I've been able to identify duplicate records via the following:

[script]
SELECT displayName, ExchangeGuid, Store_Id, firstseen, lastseen, COUNT(*) As RecordCount
FROM     Mailboxes
GROUP BY displayName,
            ExchangeGuid,
            Store_Id,
            firstseen,
            lastseen
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
[/script]

Now, I need to take this to the next step of actually removing ALL BUT ONE on each of those duplicate returns.  Each of the search pages I visited before looks like they're using some sort of temp-table to accomplish this???

Apologies for the confusion, but any point in the right direction would be very appreciated.
0
Comment
Question by:usslindstrom
  • 5
  • 4
9 Comments
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 40209483
Is there a unique column, such as an identity, in the table?

Can you temporarily stop use of the table while you do the clean-up, or does the table have to remain "live"?
0
 
LVL 5

Author Comment

by:usslindstrom
ID: 40209601
Thanks for jumping in to help - let me see if I can't answer your questions:


Is there a unique column, such as an identity, in the table?

Yes.  It's not in my above statement, but it's identified as "MailboxId".



Can you temporarily stop use of the table while you do the clean-up, or does the table have to remain "live"?

Our techs use this table throughout the day when needed - but we can shut it down and put a moritorium on its usage for whenever.
0
 
LVL 5

Author Comment

by:usslindstrom
ID: 40209609
On the MailboxId column though, it's "unique" - as it's an auto-gen field, and combining it with my group would break its ability to find duplicates, correct?  Apologies - SQL is "fun" to try to figure out all of its "ins and outs".  Why can't it be easier!?!  :)
0
 
LVL 5

Author Comment

by:usslindstrom
ID: 40209613
For note, here are all of the data fields within that table:

      [MailboxID] [int] IDENTITY(1,1) NOT NULL,
      [ExchangeGuid] [uniqueidentifier] NOT NULL,
      [Iden_ID] [int] NOT NULL,
      [LDN_ID] [int] NOT NULL,
      [DN_ID] [int] NULL,
      [Store_Id] [int] NOT NULL,
      [SamAccountName] [varchar](255) NOT NULL,
      [UserPrincipalName] [varchar](255) NOT NULL,
      [Alias] [varchar](255) NOT NULL,
      [LegacyExchangeDN] [varchar](255) NOT NULL,
      [PrimarySmtpAddress] [varchar](255) NOT NULL,
      [DistinguishedName] [varchar](255) NULL,
      [Identity] [varchar](255) NOT NULL,
      [Guid] [uniqueidentifier] NOT NULL,
      [IsMailboxEnabled] [bit] NOT NULL,
      [ProhibitSendQuota] [bigint] NULL,
      [UseDatabaseQuotaDefaults] [bit] NULL,
      [IssueWarningQuota] [bigint] NULL,
      [DisplayName] [varchar](255) NOT NULL,
      [IsValid] [bit] NOT NULL,
      [ExchangeVersion] [varchar](50) NULL,
      [Name] [varchar](255) NOT NULL,
      [StorageLimitInfo] [varchar](20) NULL,
      [TotalItemSize] [bigint] NULL,
      [firstseen] [datetime] NOT NULL,
      [lastseen] [datetime] NOT NULL,
0
What is SQL Server and how does it work?

The purpose of this paper is to provide you background on SQL Server. It’s your self-study guide for learning fundamentals. It includes both the history of SQL and its technical basics. Concepts and definitions will form the solid foundation of your future DBA expertise.

 
LVL 69

Accepted Solution

by:
ScottPletcher earned 500 total points
ID: 40209651
>> and combining it with my group would break its ability to find duplicates, correct? <<

If we GROUPed by it, but not if we just use MIN and/or MAX.  That way we can delete all but the min or max MailBoxID :-).

If there are more dups than good rows, you could just select one distinct row from each set of dups, truncate the table, and re-insert only the non-dup rows.

If MailBoxID is the clustering key -- idents usually are, even though they most often shouldn't be -- I think we can "batch" the DELETEs as well, like so:


DELETE TOP (10000) FROM m
FROM Mailboxes m
INNER JOIN (
    SELECT displayName, ExchangeGuid, Store_Id, firstseen, lastseen, MIN(MailBoxID) AS MailBoxID_MIN, MAX(MailBoxID) AS MailBoxID_MAX, COUNT(*) As RecordCount
     FROM     Mailboxes WITH (NOLOCK)
     GROUP BY displayName,
                 ExchangeGuid,
                 Store_Id,
                 firstseen,
                 lastseen
     HAVING COUNT(*) > 1
) AS mailbox_dups ON
    m.MailBoxID > mailbox_dups.MailBoxID_MIN AND --NOTE: gt NOT ge
    m.MailBoxID <= mailbox_dups.MailBoxID_MAX AND
    mailbox_dups.displayName = m.displayName AND
    mailbox_dups.ExchangeGuid = m.ExchangeGuid AND
    mailbox_dups.Store_Id = m.Store_Id AND
    mailbox_dups.firstseen = m.firstseen AND
    mailbox_dups.lastseen = m.lastseen
0
 
LVL 5

Author Comment

by:usslindstrom
ID: 40209711
If you have patients and time, can you please help me understand what that statement is doing?

DELETE TOP (10000) FROM m . . . .


"m" is a temporary table at this point, correct?


One of the worries I have about deleting duplicates, is that there really are some risks of removing rows from this database.  Let me hit you with the scenario on what this table is used for:  It's a table that tracks mailbox migration moves accross out enterprise (thousands upon thousands of users).

UserA's mailbox is currently on ServerA/InformationStore1
UserA's mailbox gets migrated from ServerA/InformationStore1-->ServerK/InformationStore13

This database tracks all of the time the mailbox spent in the first store (ServerA/InformationStore1), and then has another entry once it's discovered that it's in the second store (ServerK/InformationStore13)

This continues indefinately.  So, a user could basically contain the following information (for example):

UserA
Server     Information Store    Start Date     End Date
ServerA   IS1                         09-11-06       01-13-12
ServerK   IS13                       01-14-12       11-1-12
ServerA   IS1                         11-2-12         07-21-14



So, in that example, you can see that the mailbox was moved from ServerA --> over to ServerK --> and then back to ServerA for one reason or another.  Where I'm going with this long example, is that even though ServerA is listed twice, it's really not a duplicate.  (Hense the grouping of start and end date in the table).

Apologies - just trying to understand everything more, than blindly cutting/pasting everything into play.  The above TSQL statement you have above, It's VERY impressive, and I just want to make sure that scenarios like I just described are accounted for.

*Thanks for all the help so far.
0
 
LVL 69

Assisted Solution

by:ScottPletcher
ScottPletcher earned 500 total points
ID: 40209781
>> "m" is a temporary table at this point, correct? <<

No, "m" is an alias for the Mailboxes table itself; check out the FROM in the second line:

DELETE TOP (10000) FROM m
FROM Mailboxes m

Note that before the DELETE a join is done on all the GROUP BY columns, to restrict the rows DELETED to only full dups.  That's also why it's vital that the MailBoxID part of that join uses > and <= (to keep the min ID) or >= and < (to keep the max ID), and not >= and <=.  The code uses the MailboxID min and max to guarantee only one row is left and to likely reduce the SQL join time, since I'm assuming that MailboxID is the clustering key (we can correct that later if you want :-) ).


>> One of the worries I have about deleting duplicates, is that there really are some risks of removing rows from this database. <<

Agreed!  You should definitely back up this table before beginning the DELETEs.
0
 
LVL 5

Author Closing Comment

by:usslindstrom
ID: 40209897
Absolutely perfect!  - Great explinations as well, thank you very much for helping others come to a better understanding of the "fun" technologies.
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 40209959
You're very welcome!

As I said, if you want to review the indexes on that table too, just let me know :-) .
0

Featured Post

Free Gift Card with Acronis Backup Purchase!

Backup any data in any location: local and remote systems, physical and virtual servers, private and public clouds, Macs and PCs, tablets and mobile devices, & more! For limited time only, buy any Acronis backup products and get a FREE Amazon/Best Buy gift card worth up to $200!

Join & Write a Comment

Suggested Solutions

Slowly Changing Dimension Transformation component in data task flow is very useful for us to manage and control how data changes in SSIS.
The Delta outage: 650 cancelled flights, more than 1200 delayed flights, thousands of frustrated customers, tens of millions of dollars in damages – plus untold reputational damage to one of the world’s most trusted airlines. All due to a catastroph…
Via a live example, show how to set up a backup for SQL Server using a Maintenance Plan and how to schedule the job into SQL Server Agent.
Via a live example, show how to setup several different housekeeping processes for a SQL Server.

706 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

16 Experts available now in Live!

Get 1:1 Help Now