Solved

How to rewrite a SELECT statement so it does not hang on a table with millions of records

Posted on 2006-06-20
Last Modified: 2011-10-03
Hi!

I have a SQL Server table with 3 million records. I need to deduplicate it based on a certain group of fields.

I've already written an application in VB to do so. The way I do it is:

1. Open a recordset with this statement:

SELECT TOP 1000 *
FROM NACIMIENTO AS TABLA1
WHERE EXISTS (
    SELECT MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP,
           NOMBRE_MADRE, PRIMER_AP_MADRE, SEGUNDO_AP_MADRE,
           NOMBRE_ABAM, PRIMER_AP_ABAM, SEGUNDO_AP_ABAM, COUNT(*)
    FROM NACIMIENTO AS TABLA2
    WHERE [TABLA1].[MUN_OFI]=[TABLA2].[MUN_OFI]
      AND [TABLA1].[NOMBRE]=[TABLA2].[NOMBRE]
      AND [TABLA1].[PRIMER_AP]=[TABLA2].[PRIMER_AP]
      AND [TABLA1].[SEGUNDO_AP]=[TABLA2].[SEGUNDO_AP]
      AND [TABLA1].[NOMBRE_MADRE]=[TABLA2].[NOMBRE_MADRE]
      AND [TABLA1].[PRIMER_AP_MADRE]=[TABLA2].[PRIMER_AP_MADRE]
      AND [TABLA1].[SEGUNDO_AP_MADRE]=[TABLA2].[SEGUNDO_AP_MADRE]
      AND [TABLA1].[NOMBRE_ABAM]=[TABLA2].[NOMBRE_ABAM]
      AND [TABLA1].[PRIMER_AP_ABAM]=[TABLA2].[PRIMER_AP_ABAM]
      AND [TABLA1].[SEGUNDO_AP_ABAM]=[TABLA2].[SEGUNDO_AP_ABAM]
    GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP,
             NOMBRE_MADRE, PRIMER_AP_MADRE, SEGUNDO_AP_MADRE,
             NOMBRE_ABAM, PRIMER_AP_ABAM, SEGUNDO_AP_ABAM
    HAVING COUNT(*) > 1)
ORDER BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP,
         NOMBRE_MADRE, PRIMER_AP_MADRE, SEGUNDO_AP_MADRE,
         NOMBRE_ABAM, PRIMER_AP_ABAM, SEGUNDO_AP_ABAM**

** I used the TOP clause to get the query to run at all, because without it the application simply hung. But it still hangs! I also tried running this query in Query Analyzer, but it doesn't work there either.

2. When I can get the recordset opened (with tables of fewer than 300K records), I move through the records looking for the duplicates (there can be 2, 3 or more duplicated records for each one). I apply a set of criteria to select the record to keep.

3. Then I create a DeletedTable table and copy the duplicated records into it.

4. When this has been done I run DELETE FROM Table WHERE EXISTS (SELECT * FROM DELETEDTABLE WHERE DELETEDTABLE.KEY1 = TABLE.KEY1 ...). I have around 6 key fields, but usually only one of them is used to deduplicate on.
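
A minimal sketch of steps 2 to 4 done in SQL instead of a recordset loop. It uses Python's sqlite3 module as a runnable stand-in for SQL Server, only three of the ten matching columns, and two hypothetical columns that are not in the original post: ID (a unique key) and FECHA_REG (the criterion for choosing the record to keep, here "newest wins"). On SQL Server 2005 or later the same idea would probably use SELECT ... INTO in place of CREATE TABLE ... AS.

```python
import sqlite3

# SQLite stands in for SQL Server here; only three of the ten matching
# columns are used, and ID and FECHA_REG are hypothetical columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NACIMIENTO (
    ID INTEGER PRIMARY KEY,
    MUN_OFI TEXT, NOMBRE TEXT, PRIMER_AP TEXT,
    FECHA_REG TEXT
);
INSERT INTO NACIMIENTO VALUES
    (1, '001', 'ANA',  'LOPEZ', '2001-05-01'),
    (2, '001', 'ANA',  'LOPEZ', '2003-02-10'),
    (3, '001', 'ANA',  'LOPEZ', '2002-07-15'),
    (4, '002', 'LUIS', 'PEREZ', '2000-01-01'),
    (5, '003', 'EVA',  'RUIZ',  '1999-03-03'),
    (6, '003', 'EVA',  'RUIZ',  '1999-03-04');

-- Steps 2 and 3: rank the rows inside each group (newest first) and copy
-- every row except the first one of its group into DeletedTable.
CREATE TABLE DeletedTable AS
SELECT ID, MUN_OFI, NOMBRE, PRIMER_AP, FECHA_REG
FROM (
    SELECT N.*,
           ROW_NUMBER() OVER (
               PARTITION BY MUN_OFI, NOMBRE, PRIMER_AP
               ORDER BY FECHA_REG DESC, ID
           ) AS RN
    FROM NACIMIENTO AS N
) AS RANKED
WHERE RN > 1;

-- Step 4: delete exactly the rows that were copied.
DELETE FROM NACIMIENTO WHERE ID IN (SELECT ID FROM DeletedTable);
""")

kept = [r[0] for r in conn.execute("SELECT ID FROM NACIMIENTO ORDER BY ID")]
removed = [r[0] for r in conn.execute("SELECT ID FROM DeletedTable ORDER BY ID")]
print("kept:", kept)
print("removed:", removed)
```

Changing the ORDER BY inside the OVER clause changes which record of each group is kept, so the selection criteria live in one place and the table is scanned once rather than once per row.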

The question is: is there any way to optimize the SQL shown in step 1?

Or is there some other way to detect duplicated records with this kind of data?

The fields I'm deduplicating on are all text fields.

Any help will be greatly appreciated.
Question by:bethzycb

Expert Comment

by:imran_fast
ID: 16949113
/*
 This is a direct way to find the duplicate records and delete them.
 You don't need to put the records in a temporary table to delete them.
 keyfield stands for the table's unique key column.
 The subquery returns the lowest keyfield of every group, groups of one
 row included, so only the extra copies are deleted. It must not be
 filtered with HAVING COUNT(*) > 1: rows without duplicates would then be
 missing from the list and NOT IN would delete them too.
 GROUP BY puts NULLs in the same group, so rows with NULL in the same
 columns count as duplicates of each other.  */

delete from NACIMIENTO
where keyfield not in
(
select min(keyfield) from NACIMIENTO
GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP, NOMBRE_MADRE,
PRIMER_AP_MADRE, SEGUNDO_AP_MADRE, NOMBRE_ABAM, PRIMER_AP_ABAM,
SEGUNDO_AP_ABAM
)
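
A minimal sketch of the keep-the-lowest-key idea, runnable with Python's sqlite3 module as a stand-in for SQL Server. It assumes keyfield is a unique, non-null key and uses only three of the ten columns and made-up sample rows. The subquery returns one key for every group, including groups of a single row, so rows without duplicates survive.

```python
import sqlite3

# SQLite stands in for SQL Server; three of the ten columns are shown and
# keyfield is assumed to be a unique, non-null key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NACIMIENTO (
    keyfield INTEGER PRIMARY KEY,
    MUN_OFI TEXT, NOMBRE TEXT, PRIMER_AP TEXT
);
INSERT INTO NACIMIENTO VALUES
    (1, '001', 'ANA',  'LOPEZ'),
    (2, '001', 'ANA',  'LOPEZ'),
    (3, '002', 'LUIS', 'PEREZ'),
    (4, '001', 'ANA',  'LOPEZ'),
    (5, '003', 'EVA',  'RUIZ'),
    (6, '003', 'EVA',  'RUIZ');
""")

# Delete every row whose key is not the lowest key of its group.
delete_sql = """
DELETE FROM NACIMIENTO
WHERE keyfield NOT IN (
    SELECT MIN(keyfield)
    FROM NACIMIENTO
    GROUP BY MUN_OFI, NOMBRE, PRIMER_AP
)
"""
cur = conn.execute(delete_sql)
conn.commit()

deleted_count = cur.rowcount
remaining = [
    r[0] for r in conn.execute("SELECT keyfield FROM NACIMIENTO ORDER BY keyfield")
]
print("deleted:", deleted_count)
print("remaining keys:", remaining)
```

Row 3 has no duplicate and is still present afterwards, which is the behaviour to check for before running a delete like this on the real table.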

Author Comment

by:bethzycb
ID: 16954609
The problem is that I have to keep a copy of the deleted records.

And what happens to the recordset if I delete some of the records while I'm looping through it? Does their AbsolutePosition change?

Accepted Solution

by:imran_fast (earned 500 total points)
ID: 16977591
>>what happen to the recordset if while I'm looping it I delete some of the records?
In the statement above there is no looping and no recordset; it is a direct DELETE statement.

If you want to keep a copy, then you have to execute two statements:


First, to record the duplicate rows
===================
-- keyfield stands for the table's unique key column.
-- The subquery returns the lowest keyfield of every group (groups of one
-- row included), so only the extra copies are copied to the backup table.
select * Into YourbackupTable from NACIMIENTO
where keyfield not in
(
select min(keyfield) from NACIMIENTO
GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP, NOMBRE_MADRE,
PRIMER_AP_MADRE, SEGUNDO_AP_MADRE, NOMBRE_ABAM, PRIMER_AP_ABAM,
SEGUNDO_AP_ABAM
)


Then, to delete them
==============
-- Same condition as the backup: every row except the lowest keyfield of
-- its group is deleted; rows without duplicates are kept.
delete from NACIMIENTO
where keyfield not in
(
select min(keyfield) from NACIMIENTO
GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP, NOMBRE_MADRE,
PRIMER_AP_MADRE, SEGUNDO_AP_MADRE, NOMBRE_ABAM, PRIMER_AP_ABAM,
SEGUNDO_AP_ABAM
)
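
A hedged sketch of running the two statements as one unit, so that the backup and the delete always cover the same rows. It uses Python's sqlite3 module as a stand-in for SQL Server, three of the ten columns, and made-up sample rows; backup_and_delete is a hypothetical helper name. Two things differ from the statements above and are assumptions of this sketch: the delete is keyed on the rows already copied to YourbackupTable, and both statements run inside an explicit transaction. On SQL Server the equivalent would presumably use SELECT ... INTO with BEGIN TRANSACTION and COMMIT.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# transaction below is controlled explicitly with BEGIN / COMMIT / ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
CREATE TABLE NACIMIENTO (
    keyfield INTEGER PRIMARY KEY,
    MUN_OFI TEXT, NOMBRE TEXT, PRIMER_AP TEXT
);
INSERT INTO NACIMIENTO VALUES
    (1, '001', 'ANA',  'LOPEZ'),
    (2, '001', 'ANA',  'LOPEZ'),
    (3, '002', 'LUIS', 'PEREZ'),
    (4, '001', 'ANA',  'LOPEZ'),
    (5, '003', 'EVA',  'RUIZ'),
    (6, '003', 'EVA',  'RUIZ');
""")


def backup_and_delete(conn):
    """Copy the surplus rows to YourbackupTable, then delete those same rows."""
    conn.execute("BEGIN")
    try:
        # Statement 1: record every row that is not the lowest key of its group.
        conn.execute("""
            CREATE TABLE YourbackupTable AS
            SELECT * FROM NACIMIENTO
            WHERE keyfield NOT IN (
                SELECT MIN(keyfield) FROM NACIMIENTO
                GROUP BY MUN_OFI, NOMBRE, PRIMER_AP
            )
        """)
        # Statement 2: delete exactly the rows that were recorded.
        cur = conn.execute("""
            DELETE FROM NACIMIENTO
            WHERE keyfield IN (SELECT keyfield FROM YourbackupTable)
        """)
        backed_up = conn.execute(
            "SELECT COUNT(*) FROM YourbackupTable"
        ).fetchone()[0]
        if cur.rowcount != backed_up:
            raise RuntimeError("backup and delete row counts disagree")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # undoes both the copy and the delete
        raise
    return backed_up


moved = backup_and_delete(conn)
remaining = [
    r[0] for r in conn.execute("SELECT keyfield FROM NACIMIENTO ORDER BY keyfield")
]
backup_keys = [
    r[0] for r in conn.execute("SELECT keyfield FROM YourbackupTable ORDER BY keyfield")
]
print(moved, remaining, backup_keys)
```

Keying the delete on the backup table means a row is only removed if a copy of it exists, even if new rows arrive between the two statements.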