Link to home
Start Free TrialLog in
Avatar of usslindstrom
usslindstromFlag for Japan

asked on

SQL remove duplicates, keeping 1

Experts,

This question has come up a LOT, and I apologize for asking.  I've done some extensive searching on this topic and have some of the fundamentals, but I need a nudge in TSQL to be able to accomplish what I'm trying.  https://www.simple-talk.com/sql/t-sql-programming/removing-duplicates-from-a-table-in-sql-server/ has the best write-up I've seen so far, but I'm having trouble trying to sponge over the statement to meld it into our environment.

In a table that has approx 1.9 mil rows, there are MANY duplicates.  I've been able to identify duplicate records via the following:

[script]
SELECT displayName, ExchangeGuid, Store_Id, firstseen, lastseen, COUNT(*) As RecordCount
FROM     Mailboxes
GROUP BY displayName,
            ExchangeGuid,
            Store_Id,
            firstseen,
            lastseen
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
[/script]

Now, I need to take this to the next step of actually removing ALL BUT ONE on each of those duplicate returns.  Each of the search pages I visited before looks like they're using some sort of temp-table to accomplish this???

Apologies for the confusion, but any point in the right direction would be very appreciated.
Avatar of Scott Pletcher
Scott Pletcher
Flag of United States of America image

Is there a unique column, such as an identity, in the table?

Can you temporarily stop use of the table while you do the clean-up, or does the table have to remain "live"?
Avatar of usslindstrom

ASKER

Thanks for jumping in to help - let me see if I can't answer your questions:


Is there a unique column, such as an identity, in the table?

Yes.  It's not in my above statement, but it's identified as "MailboxId".



Can you temporarily stop use of the table while you do the clean-up, or does the table have to remain "live"?

Our techs use this table throughout the day when needed - but we can shut it down and put a moritorium on its usage for whenever.
On the MailboxId column though, it's "unique" - as it's an auto-gen field, and combining it with my group would break its ability to find duplicates, correct?  Apologies - SQL is "fun" to try to figure out all of its "ins and outs".  Why can't it be easier!?!  :)
For note, here are all of the data fields within that table:

      [MailboxID] [int] IDENTITY(1,1) NOT NULL,
      [ExchangeGuid] [uniqueidentifier] NOT NULL,
      [Iden_ID] [int] NOT NULL,
      [LDN_ID] [int] NOT NULL,
      [DN_ID] [int] NULL,
      [Store_Id] [int] NOT NULL,
      [SamAccountName] [varchar](255) NOT NULL,
      [UserPrincipalName] [varchar](255) NOT NULL,
      [Alias] [varchar](255) NOT NULL,
      [LegacyExchangeDN] [varchar](255) NOT NULL,
      [PrimarySmtpAddress] [varchar](255) NOT NULL,
      [DistinguishedName] [varchar](255) NULL,
      [Identity] [varchar](255) NOT NULL,
      [Guid] [uniqueidentifier] NOT NULL,
      [IsMailboxEnabled] [bit] NOT NULL,
      [ProhibitSendQuota] [bigint] NULL,
      [UseDatabaseQuotaDefaults] [bit] NULL,
      [IssueWarningQuota] [bigint] NULL,
      [DisplayName] [varchar](255) NOT NULL,
      [IsValid] [bit] NOT NULL,
      [ExchangeVersion] [varchar](50) NULL,
      [Name] [varchar](255) NOT NULL,
      [StorageLimitInfo] [varchar](20) NULL,
      [TotalItemSize] [bigint] NULL,
      [firstseen] [datetime] NOT NULL,
      [lastseen] [datetime] NOT NULL,
ASKER CERTIFIED SOLUTION
Avatar of Scott Pletcher
Scott Pletcher
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
If you have patients and time, can you please help me understand what that statement is doing?

DELETE TOP (10000) FROM m . . . .


"m" is a temporary table at this point, correct?


One of the worries I have about deleting duplicates, is that there really are some risks of removing rows from this database.  Let me hit you with the scenario on what this table is used for:  It's a table that tracks mailbox migration moves accross out enterprise (thousands upon thousands of users).

UserA's mailbox is currently on ServerA/InformationStore1
UserA's mailbox gets migrated from ServerA/InformationStore1-->ServerK/InformationStore13

This database tracks all of the time the mailbox spent in the first store (ServerA/InformationStore1), and then has another entry once it's discovered that it's in the second store (ServerK/InformationStore13)

This continues indefinately.  So, a user could basically contain the following information (for example):

UserA
Server     Information Store    Start Date     End Date
ServerA   IS1                         09-11-06       01-13-12
ServerK   IS13                       01-14-12       11-1-12
ServerA   IS1                         11-2-12         07-21-14



So, in that example, you can see that the mailbox was moved from ServerA --> over to ServerK --> and then back to ServerA for one reason or another.  Where I'm going with this long example, is that even though ServerA is listed twice, it's really not a duplicate.  (Hense the grouping of start and end date in the table).

Apologies - just trying to understand everything more, than blindly cutting/pasting everything into play.  The above TSQL statement you have above, It's VERY impressive, and I just want to make sure that scenarios like I just described are accounted for.

*Thanks for all the help so far.
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Absolutely perfect!  - Great explinations as well, thank you very much for helping others come to a better understanding of the "fun" technologies.
You're very welcome!

As I said, if you want to review the indexes on that table too, just let me know :-) .