Solved

SQL remove duplicates, keeping 1

Posted on 2014-07-21
9
243 Views
Last Modified: 2014-07-21
Experts,

This question has come up a LOT, and I apologize for asking.  I've done some extensive searching on this topic and have some of the fundamentals, but I need a nudge in TSQL to be able to accomplish what I'm trying.  https://www.simple-talk.com/sql/t-sql-programming/removing-duplicates-from-a-table-in-sql-server/ has the best write-up I've seen so far, but I'm having trouble trying to sponge over the statement to meld it into our environment.

In a table that has approx 1.9 mil rows, there are MANY duplicates.  I've been able to identify duplicate records via the following:

[script]
SELECT displayName, ExchangeGuid, Store_Id, firstseen, lastseen, COUNT(*) As RecordCount
FROM     Mailboxes
GROUP BY displayName,
            ExchangeGuid,
            Store_Id,
            firstseen,
            lastseen
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
[/script]

Now, I need to take this to the next step of actually removing ALL BUT ONE on each of those duplicate returns.  Each of the search pages I visited before looks like they're using some sort of temp-table to accomplish this???

Apologies for the confusion, but any point in the right direction would be very appreciated.
0
Comment
Question by:usslindstrom
  • 5
  • 4
9 Comments
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 40209483
Is there a unique column, such as an identity, in the table?

Can you temporarily stop use of the table while you do the clean-up, or does the table have to remain "live"?
0
 
LVL 5

Author Comment

by:usslindstrom
ID: 40209601
Thanks for jumping in to help - let me see if I can't answer your questions:


Is there a unique column, such as an identity, in the table?

Yes.  It's not in my above statement, but it's identified as "MailboxId".



Can you temporarily stop use of the table while you do the clean-up, or does the table have to remain "live"?

Our techs use this table throughout the day when needed - but we can shut it down and put a moritorium on its usage for whenever.
0
 
LVL 5

Author Comment

by:usslindstrom
ID: 40209609
On the MailboxId column though, it's "unique" - as it's an auto-gen field, and combining it with my group would break its ability to find duplicates, correct?  Apologies - SQL is "fun" to try to figure out all of its "ins and outs".  Why can't it be easier!?!  :)
0
 
LVL 5

Author Comment

by:usslindstrom
ID: 40209613
For note, here are all of the data fields within that table:

      [MailboxID] [int] IDENTITY(1,1) NOT NULL,
      [ExchangeGuid] [uniqueidentifier] NOT NULL,
      [Iden_ID] [int] NOT NULL,
      [LDN_ID] [int] NOT NULL,
      [DN_ID] [int] NULL,
      [Store_Id] [int] NOT NULL,
      [SamAccountName] [varchar](255) NOT NULL,
      [UserPrincipalName] [varchar](255) NOT NULL,
      [Alias] [varchar](255) NOT NULL,
      [LegacyExchangeDN] [varchar](255) NOT NULL,
      [PrimarySmtpAddress] [varchar](255) NOT NULL,
      [DistinguishedName] [varchar](255) NULL,
      [Identity] [varchar](255) NOT NULL,
      [Guid] [uniqueidentifier] NOT NULL,
      [IsMailboxEnabled] [bit] NOT NULL,
      [ProhibitSendQuota] [bigint] NULL,
      [UseDatabaseQuotaDefaults] [bit] NULL,
      [IssueWarningQuota] [bigint] NULL,
      [DisplayName] [varchar](255) NOT NULL,
      [IsValid] [bit] NOT NULL,
      [ExchangeVersion] [varchar](50) NULL,
      [Name] [varchar](255) NOT NULL,
      [StorageLimitInfo] [varchar](20) NULL,
      [TotalItemSize] [bigint] NULL,
      [firstseen] [datetime] NOT NULL,
      [lastseen] [datetime] NOT NULL,
0
Microsoft Certification Exam 74-409

Veeam® is happy to provide the Microsoft community with a study guide prepared by MVP and MCT, Orin Thomas. This guide will take you through each of the exam objectives, helping you to prepare for and pass the examination.

 
LVL 69

Accepted Solution

by:
ScottPletcher earned 500 total points
ID: 40209651
>> and combining it with my group would break its ability to find duplicates, correct? <<

If we GROUPed by it, but not if we just use MIN and/or MAX.  That way we can delete all but the min or max MailBoxID :-).

If there are more dups than good rows, you could just select one distinct row from each set of dups, truncate the table, and re-insert only the non-dup rows.

If MailBoxID is the clustering key -- idents usually are, even though they most often shouldn't be -- I think we can "batch" the DELETEs as well, like so:


DELETE TOP (10000) FROM m
FROM Mailboxes m
INNER JOIN (
    SELECT displayName, ExchangeGuid, Store_Id, firstseen, lastseen, MIN(MailBoxID) AS MailBoxID_MIN, MAX(MailBoxID) AS MailBoxID_MAX, COUNT(*) As RecordCount
     FROM     Mailboxes WITH (NOLOCK)
     GROUP BY displayName,
                 ExchangeGuid,
                 Store_Id,
                 firstseen,
                 lastseen
     HAVING COUNT(*) > 1
) AS mailbox_dups ON
    m.MailBoxID > mailbox_dups.MailBoxID_MIN AND --NOTE: gt NOT ge
    m.MailBoxID <= mailbox_dups.MailBoxID_MAX AND
    mailbox_dups.displayName = m.displayName AND
    mailbox_dups.ExchangeGuid = m.ExchangeGuid AND
    mailbox_dups.Store_Id = m.Store_Id AND
    mailbox_dups.firstseen = m.firstseen AND
    mailbox_dups.lastseen = m.lastseen
0
 
LVL 5

Author Comment

by:usslindstrom
ID: 40209711
If you have patients and time, can you please help me understand what that statement is doing?

DELETE TOP (10000) FROM m . . . .


"m" is a temporary table at this point, correct?


One of the worries I have about deleting duplicates, is that there really are some risks of removing rows from this database.  Let me hit you with the scenario on what this table is used for:  It's a table that tracks mailbox migration moves accross out enterprise (thousands upon thousands of users).

UserA's mailbox is currently on ServerA/InformationStore1
UserA's mailbox gets migrated from ServerA/InformationStore1-->ServerK/InformationStore13

This database tracks all of the time the mailbox spent in the first store (ServerA/InformationStore1), and then has another entry once it's discovered that it's in the second store (ServerK/InformationStore13)

This continues indefinately.  So, a user could basically contain the following information (for example):

UserA
Server     Information Store    Start Date     End Date
ServerA   IS1                         09-11-06       01-13-12
ServerK   IS13                       01-14-12       11-1-12
ServerA   IS1                         11-2-12         07-21-14



So, in that example, you can see that the mailbox was moved from ServerA --> over to ServerK --> and then back to ServerA for one reason or another.  Where I'm going with this long example, is that even though ServerA is listed twice, it's really not a duplicate.  (Hense the grouping of start and end date in the table).

Apologies - just trying to understand everything more, than blindly cutting/pasting everything into play.  The above TSQL statement you have above, It's VERY impressive, and I just want to make sure that scenarios like I just described are accounted for.

*Thanks for all the help so far.
0
 
LVL 69

Assisted Solution

by:ScottPletcher
ScottPletcher earned 500 total points
ID: 40209781
>> "m" is a temporary table at this point, correct? <<

No, "m" is an alias for the Mailboxes table itself; check out the FROM in the second line:

DELETE TOP (10000) FROM m
FROM Mailboxes m

Note that before the DELETE a join is done on all the GROUP BY columns, to restrict the rows DELETED to only full dups.  That's also why it's vital that the MailBoxID part of that join uses > and <= (to keep the min ID) or >= and < (to keep the max ID), and not >= and <=.  The code uses the MailboxID min and max to guarantee only one row is left and to likely reduce the SQL join time, since I'm assuming that MailboxID is the clustering key (we can correct that later if you want :-) ).


>> One of the worries I have about deleting duplicates, is that there really are some risks of removing rows from this database. <<

Agreed!  You should definitely back up this table before beginning the DELETEs.
0
 
LVL 5

Author Closing Comment

by:usslindstrom
ID: 40209897
Absolutely perfect!  - Great explinations as well, thank you very much for helping others come to a better understanding of the "fun" technologies.
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 40209959
You're very welcome!

As I said, if you want to review the indexes on that table too, just let me know :-) .
0

Featured Post

Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In this article I will describe the Detach & Attach method as one possible migration process and I will add the extra tasks needed for an upgrade when and where is applied so it will cover all.
Load balancing is the method of dividing the total amount of work performed by one computer between two or more computers. Its aim is to get more work done in the same amount of time, ensuring that all the users get served faster.
Via a live example, show how to backup a database, simulate a failure backup the tail of the database transaction log and perform the restore.
Via a live example, show how to setup several different housekeeping processes for a SQL Server.

867 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

22 Experts available now in Live!

Get 1:1 Help Now