Solved

How to rewrite a SELECT statement so it does not stall on a multi-million-record database

Posted on 2006-06-20
Last Modified: 2011-10-03
Hi!

I have a SQL Server table with 3 million records. I need to deduplicate it based on a certain group of fields.

I've already written an application in VB to do so. The way I do it is:

1. Open a recordset with this statement:

SELECT TOP 1000 *
FROM NACIMIENTO AS TABLA1
WHERE EXISTS (
    SELECT MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP,
           NOMBRE_MADRE, PRIMER_AP_MADRE, SEGUNDO_AP_MADRE,
           NOMBRE_ABAM, PRIMER_AP_ABAM, SEGUNDO_AP_ABAM, COUNT(*)
    FROM NACIMIENTO AS TABLA2
    WHERE [TABLA1].[MUN_OFI]=[TABLA2].[MUN_OFI]
      AND [TABLA1].[NOMBRE]=[TABLA2].[NOMBRE]
      AND [TABLA1].[PRIMER_AP]=[TABLA2].[PRIMER_AP]
      AND [TABLA1].[SEGUNDO_AP]=[TABLA2].[SEGUNDO_AP]
      AND [TABLA1].[NOMBRE_MADRE]=[TABLA2].[NOMBRE_MADRE]
      AND [TABLA1].[PRIMER_AP_MADRE]=[TABLA2].[PRIMER_AP_MADRE]
      AND [TABLA1].[SEGUNDO_AP_MADRE]=[TABLA2].[SEGUNDO_AP_MADRE]
      AND [TABLA1].[NOMBRE_ABAM]=[TABLA2].[NOMBRE_ABAM]
      AND [TABLA1].[PRIMER_AP_ABAM]=[TABLA2].[PRIMER_AP_ABAM]
      AND [TABLA1].[SEGUNDO_AP_ABAM]=[TABLA2].[SEGUNDO_AP_ABAM]
    GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP,
             NOMBRE_MADRE, PRIMER_AP_MADRE, SEGUNDO_AP_MADRE,
             NOMBRE_ABAM, PRIMER_AP_ABAM, SEGUNDO_AP_ABAM
    HAVING COUNT(*) > 1)
ORDER BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP,
         NOMBRE_MADRE, PRIMER_AP_MADRE, SEGUNDO_AP_MADRE,
         NOMBRE_ABAM, PRIMER_AP_ABAM, SEGUNDO_AP_ABAM **

** I used the TOP clause to get it to run at all, because without it the application just hung. But it still hangs! I also tried to run this query in Query Analyzer, but it doesn't work there either.

2. When I can get the recordset opened (with tables of fewer than 300K records), I move through the records looking for the duplicates (there can be 2, 3 or more duplicated records for each one). I apply a set of criteria to select the record to be kept.

3. Then I create a DeletedTable table and copy the duplicated records into it.

4. When this has been done, I run a DELETE FROM Table WHERE EXISTS (SELECT * FROM DELETEDTABLE WHERE DELETEDTABLE.KEY1 = TABLE.KEY1 ...). I have around 6 key fields, but usually only one of them is used to deduplicate on.

The question is: is there any way to optimize the SQL mentioned in 1)?

Or is there some other way to detect duplicated records in this kind of data?

The fields I'm deduplicating on are all text fields.

Any help will be greatly appreciated.
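
For the detection part of the question, one commonly used alternative to the correlated EXISTS is to aggregate the table once with GROUP BY ... HAVING COUNT(*) > 1 and join that result back to the table. The sketch below only illustrates the shape of such a query. It makes several assumptions that are not in the post: it runs on SQLite through Python's sqlite3 module as a stand-in for SQL Server, it uses only three of the ten columns, and it invents a single key column named keyfield.

```python
import sqlite3

# Hypothetical miniature of NACIMIENTO: three of the ten text columns plus an
# assumed single-column key called keyfield (not part of the original post).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE NACIMIENTO (keyfield INTEGER PRIMARY KEY, "
    "MUN_OFI TEXT, NOMBRE TEXT, PRIMER_AP TEXT)"
)
conn.executemany(
    "INSERT INTO NACIMIENTO VALUES (?, ?, ?, ?)",
    [
        (1, "001", "ANA", "LOPEZ"),
        (2, "001", "ANA", "LOPEZ"),
        (3, "002", "LUIS", "PEREZ"),
        (4, "001", "ANA", "LOPEZ"),
        (5, "003", "EVA", "RUIZ"),
    ],
)

# Find the duplicated combinations in one aggregation, then join back to the
# table to list every row that belongs to one of those combinations.
duplicate_rows = conn.execute(
    """
    SELECT T.keyfield, T.MUN_OFI, T.NOMBRE, T.PRIMER_AP
    FROM NACIMIENTO AS T
    INNER JOIN (
        SELECT MUN_OFI, NOMBRE, PRIMER_AP
        FROM NACIMIENTO
        GROUP BY MUN_OFI, NOMBRE, PRIMER_AP
        HAVING COUNT(*) > 1
    ) AS D
      ON T.MUN_OFI = D.MUN_OFI
     AND T.NOMBRE = D.NOMBRE
     AND T.PRIMER_AP = D.PRIMER_AP
    ORDER BY T.MUN_OFI, T.NOMBRE, T.PRIMER_AP, T.keyfield
    """
).fetchall()

for row in duplicate_rows:
    print(row)
```

On a table of this size, an index covering the grouped columns would probably matter as much as the query shape, but that is a guess that would need checking against the real execution plan.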
Question by:bethzycb
 
LVL 28

Expert Comment

by:imran_fast
ID: 16949113
/*
 This is a direct way to find the duplicate records and delete them.
 You don't need to put the records in a temporary table to delete them.
 keyfield is a placeholder for a column that uniquely identifies each row.
 The subquery returns the lowest keyfield of every group, including groups
 of one row. Do not filter it with HAVING COUNT(*) > 1: the rows that have
 no duplicate would then be missing from the list and would be deleted too.
*/

delete from NACIMIENTO
where keyfield not in
(
select min(keyfield)
FROM NACIMIENTO
GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP, NOMBRE_MADRE,
PRIMER_AP_MADRE, SEGUNDO_AP_MADRE, NOMBRE_ABAM, PRIMER_AP_ABAM,
SEGUNDO_AP_ABAM
)
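
To see the keep-the-lowest-key idea in action, here is a minimal sketch on a toy table. The assumptions are loud ones: SQLite (via Python's sqlite3) stands in for SQL Server, only three of the ten columns are used, and keyfield is a hypothetical single-column key. The point it illustrates is that the list of keys to keep has one entry per group, including groups that contain a single row, so rows without duplicates survive the delete.

```python
import sqlite3

# Hypothetical miniature of NACIMIENTO with an assumed key column keyfield.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE NACIMIENTO (keyfield INTEGER PRIMARY KEY, "
    "MUN_OFI TEXT, NOMBRE TEXT, PRIMER_AP TEXT)"
)
conn.executemany(
    "INSERT INTO NACIMIENTO VALUES (?, ?, ?, ?)",
    [
        (1, "001", "ANA", "LOPEZ"),
        (2, "001", "ANA", "LOPEZ"),   # duplicate of row 1
        (3, "002", "LUIS", "PEREZ"),  # unique row, must be kept
        (4, "001", "ANA", "LOPEZ"),   # duplicate of row 1
        (5, "003", "EVA", "RUIZ"),    # unique row, must be kept
    ],
)

# Keep the row with the lowest keyfield in every group and delete the rest.
conn.execute(
    """
    DELETE FROM NACIMIENTO
    WHERE keyfield NOT IN (
        SELECT MIN(keyfield)
        FROM NACIMIENTO
        GROUP BY MUN_OFI, NOMBRE, PRIMER_AP
    )
    """
)

remaining = [
    row[0]
    for row in conn.execute("SELECT keyfield FROM NACIMIENTO ORDER BY keyfield")
]
print(remaining)
```

Rows 2 and 4 are removed and rows 1, 3 and 5 remain. Keeping the lowest key is just one possible rule; the criteria described in the question for choosing which record to keep would need a different selection.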
 

Author Comment

by:bethzycb
ID: 16954609
The problem is that I have to keep a copy of the deleted records.

And what happens to the recordset if I delete some of the records while I'm looping through it? Does their AbsolutePosition change?
 
LVL 28

Accepted Solution

by:
imran_fast earned 500 total points
ID: 16977591
>>what happen to the recordset if while I'm looping it I delete some of the records?
There is no looping in the statement above. It is not a recordset; it is a direct DELETE statement.

If you want to keep a copy, then you have to execute two statements.


First, to save a copy of the duplicate rows
============================================
/* keyfield is a placeholder for a column that uniquely identifies each row.
   The subquery lists the row to keep in every group (no HAVING COUNT(*) > 1,
   otherwise rows without duplicates would be copied as well). */
select * Into YourbackupTable from NACIMIENTO
where keyfield not in
(
select min(keyfield)
FROM NACIMIENTO
GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP, NOMBRE_MADRE,
PRIMER_AP_MADRE, SEGUNDO_AP_MADRE, NOMBRE_ABAM, PRIMER_AP_ABAM,
SEGUNDO_AP_ABAM
)


Then, to delete them
====================
/* Same list of rows to keep as in the copy step. Without a HAVING COUNT(*) > 1
   filter it also contains the rows that have no duplicate, so they are kept. */
delete from NACIMIENTO
where keyfield not in
(
select min(keyfield)
FROM NACIMIENTO
GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP, NOMBRE_MADRE,
PRIMER_AP_MADRE, SEGUNDO_AP_MADRE, NOMBRE_ABAM, PRIMER_AP_ABAM,
SEGUNDO_AP_ABAM
)
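
The copy-then-delete sequence can be tried end to end on a toy table. This is only a sketch under stated assumptions: SQLite through Python's sqlite3 stands in for SQL Server (so CREATE TABLE ... AS SELECT plays the role of SELECT ... INTO), only three of the ten columns are present, and keyfield is a hypothetical single-column key. As a variation, the delete here removes rows by the keys stored in the backup table, so exactly the rows that were copied are the ones removed, and both statements run inside one explicit transaction.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# transaction below is controlled only by the explicit BEGIN/COMMIT/ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE NACIMIENTO (keyfield INTEGER PRIMARY KEY, "
    "MUN_OFI TEXT, NOMBRE TEXT, PRIMER_AP TEXT)"
)
conn.executemany(
    "INSERT INTO NACIMIENTO VALUES (?, ?, ?, ?)",
    [
        (1, "001", "ANA", "LOPEZ"),
        (2, "001", "ANA", "LOPEZ"),
        (3, "002", "LUIS", "PEREZ"),
        (4, "001", "ANA", "LOPEZ"),
        (5, "003", "EVA", "RUIZ"),
    ],
)

conn.execute("BEGIN")
try:
    # Step 1: copy every row that is not the lowest-keyed row of its group.
    conn.execute(
        """
        CREATE TABLE YourbackupTable AS
        SELECT * FROM NACIMIENTO
        WHERE keyfield NOT IN (
            SELECT MIN(keyfield)
            FROM NACIMIENTO
            GROUP BY MUN_OFI, NOMBRE, PRIMER_AP
        )
        """
    )
    # Step 2: delete exactly the rows that were copied in step 1.
    conn.execute(
        "DELETE FROM NACIMIENTO "
        "WHERE keyfield IN (SELECT keyfield FROM YourbackupTable)"
    )
    conn.execute("COMMIT")
except sqlite3.Error:
    conn.execute("ROLLBACK")
    raise

backed_up = [
    r[0] for r in conn.execute("SELECT keyfield FROM YourbackupTable ORDER BY keyfield")
]
remaining = [
    r[0] for r in conn.execute("SELECT keyfield FROM NACIMIENTO ORDER BY keyfield")
]
print("backed up:", backed_up)
print("remaining:", remaining)
```

Rows 2 and 4 end up in the backup table, and rows 1, 3 and 5 stay in NACIMIENTO.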