  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 202

Best way to delete a massive number of rows?

Hello Experts

Somehow a table has grown into a monster.  It's over 330k rows when it should probably be about 100k.  I have repeating rows; it looks like certain distinct rows repeat over 2200 times.  How can I safely and efficiently remove the duplicates?  Should I index the control column first and then run delete from mytable where ID in (select top N from mytable ...) - something like that?
Asked by jay-are
1 Solution
 
HoggZilla commented:
C4
Just kidding.
Delete from is SO SLOW! If you are deleting more than you are keeping, you might consider inserting the keepers into a new table and then blowing away the old one. How many rows are we talking about, and at what kind of frequency? One time?
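The insert-into-a-new-table approach can be sketched like this. It is a minimal illustration using SQLite, not the asker's actual schema: the `paycheck`/`chknum` names are borrowed from later in the thread, and keeping `MAX(id)` per group is one possible survivor rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A toy "paycheck" table where each chknum is duplicated several times.
cur.execute("CREATE TABLE paycheck (id INTEGER PRIMARY KEY, chknum INTEGER, amount REAL)")
rows = [(chknum, 100.0 * chknum) for chknum in (1, 2, 3) for _ in range(4)]
cur.executemany("INSERT INTO paycheck (chknum, amount) VALUES (?, ?)", rows)

# Copy one surviving row per chknum into a fresh table...
cur.execute("""
    CREATE TABLE paycheck_clean AS
    SELECT * FROM paycheck
    WHERE id IN (SELECT MAX(id) FROM paycheck GROUP BY chknum)
""")

# ...then drop the bloated original and rename the clean copy into place.
cur.execute("DROP TABLE paycheck")
cur.execute("ALTER TABLE paycheck_clean RENAME TO paycheck")

print(cur.execute("SELECT COUNT(*) FROM paycheck").fetchone()[0])  # prints 3
```

The point of doing it this way is that the bulk write is an insert of the small keeper set rather than a logged delete of the large duplicate set.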
 
jay-are (Author) commented:
Yeah, I wish I could just truncate it and start over, but this is historical data that can't be reproduced...

I was reading about the slowness of delete from, so I was worried about using that.  I tried it already and it sat processing for about 10 minutes before I got nervous and stopped it.

This SHOULD be a one-time deal.  I'll have to do this based on groups of control numbers.  It looks to me like each group of control numbers has 1 real record I need to keep.  I have no objection to writing the real data out to a separate table, then wiping the original and putting the real rows back into the old one.  Does that make sense?
 
funnymanmike commented:
What about building a query to import only the data you need into a new table, then just renaming the two tables?
 
jay-are (Author) commented:
Yeah that would be fine.  I just need to do whatever it takes to get the real data into the original table.  I have to get the duplicates out one way or another.
 
HoggZilla commented:
Since it is just a one-time cleanup, divide and conquer. Pick a grouping that gives you smaller amounts to select and delete, and go get 'em. Watch your log.
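The divide-and-conquer idea, deleting in capped batches so each transaction (and the log) stays small, might look like this sketch. Again this uses SQLite with hypothetical names; the same loop shape applies to SQL Server with `DELETE TOP (N)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE paycheck (id INTEGER PRIMARY KEY, chknum INTEGER)")
cur.executemany("INSERT INTO paycheck (chknum) VALUES (?)",
                [(c,) for c in range(10) for _ in range(50)])  # 500 rows, 10 distinct

BATCH = 100  # cap each pass so no single transaction is huge
while True:
    # Delete up to BATCH rows that are not the keeper (max id) for their chknum.
    cur.execute("""
        DELETE FROM paycheck WHERE id IN (
            SELECT id FROM paycheck
            WHERE id NOT IN (SELECT MAX(id) FROM paycheck GROUP BY chknum)
            LIMIT ?
        )
    """, (BATCH,))
    conn.commit()  # commit per batch so the log can be reclaimed between passes
    if cur.rowcount == 0:
        break

print(cur.execute("SELECT COUNT(*) FROM paycheck").fetchone()[0])  # prints 10
```

Each iteration removes at most `BATCH` duplicates and the loop stops once a pass deletes nothing.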
 
jay-are (Author) commented:
So you're saying just go for the delete from? Should I try indexing the table first?
 
HoggZilla commented:
Yes, delete from with criteria that break it into smaller delete blocks. I would not add indexing for this.
 
jay-are (Author) commented:
How would I use the delete from without deleting every row? I only want to delete where the row count for each group is > 1.
 
HoggZilla commented:
If you can write a select query, you can add the delete from. Without knowing your table structure, can you post a select that defines what you want to delete?
SELECT * FROM myTable WHERE Col1 = 'YourValue' and ...
DELETE FROM myTable Where Col1 = 'YourValue' and ...
 
jay-are (Author) commented:
Something along these lines I think....

delete from paycheck where [ids.id] in (select top 2000 * from paycheck group by chknum)

[ids.id] is my unique identifier in that table.  I went as high as 2000 because there are some checks in this table that are duplicated about 2300 times...
 
HoggZilla commented:
OK, that makes sense. Make sure your select returns the rows you expect to delete before you run it.  Also, it might require a little finesse.
>delete from paycheck where [ids.id] in (select top 2000 * from paycheck group by chknum)
I don't think that select will run; the group by will throw an error. This might be a case of defining what you want to keep vs. what to delete.
select chknum, max([ids.id]) from paycheck group by chknum
Does that select grab what you want to keep? How many rows?
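A quick way to sanity-check that keep-query: on a sample table, the GROUP BY should return exactly one row per distinct chknum, carrying the highest id in each group. A small SQLite sketch with made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE paycheck (id INTEGER PRIMARY KEY, chknum INTEGER)")
# chknum 101 appears three times (ids 1-3), chknum 202 twice (ids 4-5)
cur.executemany("INSERT INTO paycheck (chknum) VALUES (?)",
                [(101,), (101,), (101,), (202,), (202,)])

# One row per distinct chknum; MAX(id) picks a single survivor per group.
keepers = cur.execute(
    "SELECT chknum, MAX(id) FROM paycheck GROUP BY chknum ORDER BY chknum"
).fetchall()
print(keepers)  # [(101, 3), (202, 5)]
```

If the row count of this select matches the number of distinct check numbers you expect (~100k in the asker's case), it is safe to use as the keep-set.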
 
HoggZilla commented:
If it does, then you do something like this.
STEP 1. Back up your data before you do anything else. Make a copy of the table, whatever you normally do.
STEP 2. Confirm your delete SELECT stmt
select [ids.id] from paycheck where not exists (select chknum, max([ids.id]) from paycheck group by chknum)
STEP 3. If numbers 1 and 2 are a go, then run it.
delete from paycheck where [ids.id] in (select [ids.id] from paycheck where not exists (select chknum, max([ids.id]) from paycheck group by chknum))
 
jay-are (Author) commented:
I altered your query a bit to this:

delete from paycheck where [ids.id] in (select [ids.id] from paycheck where not exists (select max([ids.id]) from paycheck group by [ids.id]))

0 rows affected.


 
HoggZilla commented:
Sorry, I'm back.
Can you write just a select statement, no delete, that selects what you want to delete? It should be this simple:
SELECT * FROM PAYCHECK WHERE ...
This is Step 2. Confirm this first. Then post the select here and show how many rows it returns.
 
jay-are (Author) commented:
The select query from step 2 doesn't return any results for some reason.  I'm trying to delete all the duplicate chknums or [ids.id]s.  I guess the ID would be best to use.  So I'd like to keep the max([ids.id]) of each group of ids and just delete the rest.
 
HoggZilla commented:
So in this case, you need a select to choose what you want to keep. Something like this:
select chknum, max([ids.id]) from paycheck group by chknum
This will give you each distinct chknum and the max id. Can you confirm this is what you want to keep?
 
jay-are (Author) commented:
select chknum, max([ids.id]) as [ids.id] from paycheck group by [ids.id], chknum

Using that gets exactly what I need to keep.
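For reference, one reason the earlier NOT EXISTS attempts matched 0 rows is that the subquery was not correlated to the outer row, so it always returned rows and NOT EXISTS was always false. A self-contained pattern for the final step is to delete every row whose id is not a per-chknum keeper; a SQLite sketch with hypothetical data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE paycheck (id INTEGER PRIMARY KEY, chknum INTEGER)")
# ids 1-6; chknum 1 triplicated, chknum 2 duplicated, chknum 3 unique
cur.executemany("INSERT INTO paycheck (chknum) VALUES (?)",
                [(c,) for c in (1, 1, 1, 2, 2, 3)])

# Keep only MAX(id) per chknum; everything else is a duplicate and goes.
cur.execute("""
    DELETE FROM paycheck
    WHERE id NOT IN (SELECT MAX(id) FROM paycheck GROUP BY chknum)
""")
conn.commit()

print(cur.execute("SELECT id, chknum FROM paycheck ORDER BY chknum").fetchall())
# [(3, 1), (5, 2), (6, 3)]
```

As the thread stresses: run the inner select first, confirm its row count matches the expected keep-set, and back up the table before running the delete.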