
Solved

SQL remove duplicates, keeping 1

Posted on 2014-07-21
Medium Priority
253 Views
Last Modified: 2014-07-21
Experts,

This question has come up a LOT, and I apologize for asking.  I've done some extensive searching on this topic and have some of the fundamentals down, but I need a nudge in T-SQL to be able to accomplish what I'm trying.  https://www.simple-talk.com/sql/t-sql-programming/removing-duplicates-from-a-table-in-sql-server/ has the best write-up I've seen so far, but I'm having trouble adapting that statement to our environment.

In a table that has approx 1.9 mil rows, there are MANY duplicates.  I've been able to identify duplicate records via the following:

[script]
SELECT displayName, ExchangeGuid, Store_Id, firstseen, lastseen, COUNT(*) AS RecordCount
FROM Mailboxes
GROUP BY displayName,
         ExchangeGuid,
         Store_Id,
         firstseen,
         lastseen
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
[/script]

Now I need to take this to the next step of actually removing ALL BUT ONE row from each of those duplicate sets.  Each of the search pages I visited looks like it's using some sort of temp table to accomplish this?
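For what it's worth, the pattern those write-ups seem to describe uses ROW_NUMBER() over the duplicate-defining columns rather than a true temp table.  My rough (untested) reading of it, using the same columns as my query above:

[script]
-- Rough sketch only (untested): number the rows within each duplicate
-- group, then delete everything except the first row of each group.
;WITH numbered AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY displayName, ExchangeGuid, Store_Id, firstseen, lastseen
               ORDER BY (SELECT NULL)  -- arbitrary; use a real key to control which row survives
           ) AS rn
    FROM Mailboxes
)
DELETE FROM numbered
WHERE rn > 1;
[/script]

I'm not confident that's right for our table, which is why I'm asking.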

Apologies for the confusion, but any point in the right direction would be very appreciated.
Question by:usslindstrom
9 Comments
 
LVL 70

Expert Comment

by:Scott Pletcher
ID: 40209483
Is there a unique column, such as an identity, in the table?

Can you temporarily stop use of the table while you do the clean-up, or does the table have to remain "live"?
 
LVL 5

Author Comment

by:usslindstrom
ID: 40209601
Thanks for jumping in to help - let me see if I can't answer your questions:


Is there a unique column, such as an identity, in the table?

Yes.  It's not in my above statement, but it's identified as "MailboxId".



Can you temporarily stop use of the table while you do the clean-up, or does the table have to remain "live"?

Our techs use this table throughout the day when needed - but we can shut it down and put a moratorium on its usage whenever necessary.
 
LVL 5

Author Comment

by:usslindstrom
ID: 40209609
On the MailboxId column though, it's "unique" - as it's an auto-generated field, and including it in my GROUP BY would break its ability to find duplicates, correct?  Apologies - SQL is "fun" to try to figure out, with all of its "ins and outs".  Why can't it be easier!?!  :)
 
LVL 5

Author Comment

by:usslindstrom
ID: 40209613
For reference, here are all of the data fields within that table:

      [MailboxID] [int] IDENTITY(1,1) NOT NULL,
      [ExchangeGuid] [uniqueidentifier] NOT NULL,
      [Iden_ID] [int] NOT NULL,
      [LDN_ID] [int] NOT NULL,
      [DN_ID] [int] NULL,
      [Store_Id] [int] NOT NULL,
      [SamAccountName] [varchar](255) NOT NULL,
      [UserPrincipalName] [varchar](255) NOT NULL,
      [Alias] [varchar](255) NOT NULL,
      [LegacyExchangeDN] [varchar](255) NOT NULL,
      [PrimarySmtpAddress] [varchar](255) NOT NULL,
      [DistinguishedName] [varchar](255) NULL,
      [Identity] [varchar](255) NOT NULL,
      [Guid] [uniqueidentifier] NOT NULL,
      [IsMailboxEnabled] [bit] NOT NULL,
      [ProhibitSendQuota] [bigint] NULL,
      [UseDatabaseQuotaDefaults] [bit] NULL,
      [IssueWarningQuota] [bigint] NULL,
      [DisplayName] [varchar](255) NOT NULL,
      [IsValid] [bit] NOT NULL,
      [ExchangeVersion] [varchar](50) NULL,
      [Name] [varchar](255) NOT NULL,
      [StorageLimitInfo] [varchar](20) NULL,
      [TotalItemSize] [bigint] NULL,
      [firstseen] [datetime] NOT NULL,
      [lastseen] [datetime] NOT NULL,
 
LVL 70

Accepted Solution

by:
Scott Pletcher earned 2000 total points
ID: 40209651
>> and combining it with my group would break its ability to find duplicates, correct? <<

If we GROUPed by it, but not if we just use MIN and/or MAX.  That way we can delete all but the min or max MailBoxID :-).

If there are more dups than good rows, you could just select one distinct row from each set of dups, truncate the table, and re-insert only the non-dup rows.

If MailBoxID is the clustering key -- idents usually are, even though they most often shouldn't be -- I think we can "batch" the DELETEs as well, like so:


DELETE TOP (10000) FROM m
FROM Mailboxes m
INNER JOIN (
    SELECT displayName, ExchangeGuid, Store_Id, firstseen, lastseen,
           MIN(MailBoxID) AS MailBoxID_MIN, MAX(MailBoxID) AS MailBoxID_MAX,
           COUNT(*) AS RecordCount
    FROM Mailboxes WITH (NOLOCK)
    GROUP BY displayName,
             ExchangeGuid,
             Store_Id,
             firstseen,
             lastseen
    HAVING COUNT(*) > 1
) AS mailbox_dups ON
    m.MailBoxID > mailbox_dups.MailBoxID_MIN AND --NOTE: gt NOT ge
    m.MailBoxID <= mailbox_dups.MailBoxID_MAX AND
    mailbox_dups.displayName = m.displayName AND
    mailbox_dups.ExchangeGuid = m.ExchangeGuid AND
    mailbox_dups.Store_Id = m.Store_Id AND
    mailbox_dups.firstseen = m.firstseen AND
    mailbox_dups.lastseen = m.lastseen
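
If you do go the truncate-and-reload route instead, it would look roughly like this -- a sketch only, assuming no foreign keys reference the table and using Mailboxes_keep as a throwaway staging name:

-- Sketch of the truncate-and-reload alternative: keep one row (the lowest
-- MailboxID) per duplicate group, empty the table, then put those rows back.
SELECT m.*
INTO Mailboxes_keep
FROM Mailboxes m
WHERE m.MailboxID IN (
    SELECT MIN(MailboxID)
    FROM Mailboxes
    GROUP BY displayName, ExchangeGuid, Store_Id, firstseen, lastseen
);

TRUNCATE TABLE Mailboxes;  -- fails if any foreign key references the table

SET IDENTITY_INSERT Mailboxes ON;  -- preserve the original MailboxID values

INSERT INTO Mailboxes
    (MailboxID, ExchangeGuid, Iden_ID, LDN_ID, DN_ID, Store_Id, SamAccountName,
     UserPrincipalName, Alias, LegacyExchangeDN, PrimarySmtpAddress, DistinguishedName,
     [Identity], [Guid], IsMailboxEnabled, ProhibitSendQuota, UseDatabaseQuotaDefaults,
     IssueWarningQuota, DisplayName, IsValid, ExchangeVersion, [Name], StorageLimitInfo,
     TotalItemSize, firstseen, lastseen)
SELECT
     MailboxID, ExchangeGuid, Iden_ID, LDN_ID, DN_ID, Store_Id, SamAccountName,
     UserPrincipalName, Alias, LegacyExchangeDN, PrimarySmtpAddress, DistinguishedName,
     [Identity], [Guid], IsMailboxEnabled, ProhibitSendQuota, UseDatabaseQuotaDefaults,
     IssueWarningQuota, DisplayName, IsValid, ExchangeVersion, [Name], StorageLimitInfo,
     TotalItemSize, firstseen, lastseen
FROM Mailboxes_keep;

SET IDENTITY_INSERT Mailboxes OFF;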
 
LVL 5

Author Comment

by:usslindstrom
ID: 40209711
If you have the patience and time, can you please help me understand what that statement is doing?

DELETE TOP (10000) FROM m . . . .


"m" is a temporary table at this point, correct?


One of the worries I have about deleting duplicates is that there really are some risks in removing rows from this database.  Let me hit you with the scenario on what this table is used for: it's a table that tracks mailbox migration moves across our enterprise (thousands upon thousands of users).

UserA's mailbox is currently on ServerA/InformationStore1
UserA's mailbox gets migrated from ServerA/InformationStore1-->ServerK/InformationStore13

This database tracks all of the time the mailbox spent in the first store (ServerA/InformationStore1), and then has another entry once it's discovered that it's in the second store (ServerK/InformationStore13)

This continues indefinitely.  So, a user could basically have the following information (for example):

UserA
Server    Information Store    Start Date    End Date
ServerA   IS1                  09-11-06      01-13-12
ServerK   IS13                 01-14-12      11-1-12
ServerA   IS1                  11-2-12       07-21-14



So, in that example, you can see that the mailbox was moved from ServerA --> over to ServerK --> and then back to ServerA for one reason or another.  Where I'm going with this long example is that even though ServerA is listed twice, it's really not a duplicate.  (Hence the inclusion of the start and end dates in the grouping.)

Apologies - just trying to understand everything rather than blindly cutting/pasting it into play.  The T-SQL statement you have above is VERY impressive, and I just want to make sure that scenarios like the one I just described are accounted for.

*Thanks for all the help so far.
 
LVL 70

Assisted Solution

by:Scott Pletcher
Scott Pletcher earned 2000 total points
ID: 40209781
>> "m" is a temporary table at this point, correct? <<

No, "m" is an alias for the Mailboxes table itself; check out the FROM in the second line:

DELETE TOP (10000) FROM m
FROM Mailboxes m

Note that before the DELETE a join is done on all the GROUP BY columns, to restrict the rows DELETED to only full dups.  That's also why it's vital that the MailBoxID part of that join uses > and <= (to keep the min ID) or >= and < (to keep the max ID), and not >= and <=.  The code uses the MailboxID min and max to guarantee only one row is left and to likely reduce the SQL join time, since I'm assuming that MailboxID is the clustering key (we can correct that later if you want :-) ).


>> One of the worries I have about deleting duplicates, is that there really are some risks of removing rows from this database. <<

Agreed!  You should definitely back up this table before beginning the DELETEs.
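
Also, since the DELETE only removes up to 10,000 rows per pass, you'd normally run it in a loop until a pass deletes nothing -- roughly like this (same statement as above, just wrapped):

DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    -- Same batched DELETE as in the earlier comment; repeat until a pass
    -- removes no more duplicate rows.
    DELETE TOP (10000) FROM m
    FROM Mailboxes m
    INNER JOIN (
        SELECT displayName, ExchangeGuid, Store_Id, firstseen, lastseen,
               MIN(MailBoxID) AS MailBoxID_MIN, MAX(MailBoxID) AS MailBoxID_MAX
        FROM Mailboxes WITH (NOLOCK)
        GROUP BY displayName, ExchangeGuid, Store_Id, firstseen, lastseen
        HAVING COUNT(*) > 1
    ) AS mailbox_dups ON
        m.MailBoxID > mailbox_dups.MailBoxID_MIN AND  -- > (not >=) keeps the MIN row
        m.MailBoxID <= mailbox_dups.MailBoxID_MAX AND
        mailbox_dups.displayName = m.displayName AND
        mailbox_dups.ExchangeGuid = m.ExchangeGuid AND
        mailbox_dups.Store_Id = m.Store_Id AND
        mailbox_dups.firstseen = m.firstseen AND
        mailbox_dups.lastseen = m.lastseen;

    SET @rows = @@ROWCOUNT;
END;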
 
LVL 5

Author Closing Comment

by:usslindstrom
ID: 40209897
Absolutely perfect!  - Great explanations as well; thank you very much for helping others come to a better understanding of the "fun" technologies.
 
LVL 70

Expert Comment

by:Scott Pletcher
ID: 40209959
You're very welcome!

As I said, if you want to review the indexes on that table too, just let me know :-) .
