Solved

How to rewrite a SELECT statement so it does not stall on a table with millions of records

Posted on 2006-06-20
Last Modified: 2011-10-03
Hi!

I have a SQL Server table with 3 million records. I need to deduplicate it based on a certain group of fields.

I've already written an application in VB to do so. The way I do it is:

1. Open a recordset with this statement:

SELECT TOP 1000 *
FROM NACIMIENTO AS TABLA1
WHERE EXISTS (
    SELECT MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP, NOMBRE_MADRE,
           PRIMER_AP_MADRE, SEGUNDO_AP_MADRE, NOMBRE_ABAM, PRIMER_AP_ABAM,
           SEGUNDO_AP_ABAM, COUNT(*)
    FROM NACIMIENTO AS TABLA2
    WHERE [TABLA1].[MUN_OFI]=[TABLA2].[MUN_OFI]
      AND [TABLA1].[NOMBRE]=[TABLA2].[NOMBRE]
      AND [TABLA1].[PRIMER_AP]=[TABLA2].[PRIMER_AP]
      AND [TABLA1].[SEGUNDO_AP]=[TABLA2].[SEGUNDO_AP]
      AND [TABLA1].[NOMBRE_MADRE]=[TABLA2].[NOMBRE_MADRE]
      AND [TABLA1].[PRIMER_AP_MADRE]=[TABLA2].[PRIMER_AP_MADRE]
      AND [TABLA1].[SEGUNDO_AP_MADRE]=[TABLA2].[SEGUNDO_AP_MADRE]
      AND [TABLA1].[NOMBRE_ABAM]=[TABLA2].[NOMBRE_ABAM]
      AND [TABLA1].[PRIMER_AP_ABAM]=[TABLA2].[PRIMER_AP_ABAM]
      AND [TABLA1].[SEGUNDO_AP_ABAM]=[TABLA2].[SEGUNDO_AP_ABAM]
    GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP, NOMBRE_MADRE,
             PRIMER_AP_MADRE, SEGUNDO_AP_MADRE, NOMBRE_ABAM, PRIMER_AP_ABAM,
             SEGUNDO_AP_ABAM
    HAVING COUNT(*) > 1)
ORDER BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP, NOMBRE_MADRE,
         PRIMER_AP_MADRE, SEGUNDO_AP_MADRE, NOMBRE_ABAM, PRIMER_AP_ABAM,
         SEGUNDO_AP_ABAM **

** I used the TOP clause to get the query to run, since without it the application just hung. But it still hangs! I also tried running this query in Query Analyzer, but it doesn't work.

2. When I can get the recordset opened (with tables of fewer than 300K records), I move through the records looking for the duplicates (there can be 2, 3 or more duplicated records for each one). I apply a certain set of criteria to select the record to keep.

3. Then I create a DeletedTable table and copy the duplicated records into it.

4. When this is done, I run a DELETE FROM Table WHERE EXISTS (SELECT * FROM DELETEDTABLE WHERE DELETEDTABLE.KEY1 = TABLE.KEY1 ...). I have around 6 key fields, but usually only one of them is used to deduplicate on.
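
Steps 3 and 4 can be sketched on toy data as follows. This is only a minimal illustration, not the application's code: it uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for SQL Server, and the names main_table, deleted_table, key1 and nombre are hypothetical.

```python
import sqlite3

# SQLite stands in for SQL Server here; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (key1 INTEGER PRIMARY KEY, nombre TEXT);
    INSERT INTO main_table VALUES (1, 'ANA'), (2, 'ANA'), (3, 'LUIS');
    -- Step 3: an empty table with the same columns, to hold the rejected rows.
    CREATE TABLE deleted_table AS SELECT * FROM main_table WHERE 0;
""")

# Step 3 (continued): assume the application decided that row 2 is the duplicate to drop.
conn.execute("INSERT INTO deleted_table SELECT * FROM main_table WHERE key1 = ?", (2,))

# Step 4: remove every row of main_table that has a match in deleted_table.
conn.execute("""
    DELETE FROM main_table
    WHERE EXISTS (SELECT 1 FROM deleted_table d WHERE d.key1 = main_table.key1)
""")
conn.commit()

remaining = conn.execute("SELECT key1, nombre FROM main_table ORDER BY key1").fetchall()
archived = conn.execute("SELECT key1, nombre FROM deleted_table").fetchall()
print(remaining)  # [(1, 'ANA'), (3, 'LUIS')]
print(archived)   # [(2, 'ANA')]
```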

The question is: is there any way to optimize the SQL in step 1?

Or is there some other way to detect duplicated records with this kind of data?

The fields I'm deduplicating on are all text fields.

Any help will be greatly appreciated.
Question by:bethzycb

Expert Comment

by:imran_fast
ID: 16949113
/*
 This finds the duplicate records and deletes them in one statement.
 You don't need to put the records in a temporary table to delete them.
 keyfield is a placeholder for a column that uniquely identifies each row.
 The subquery returns the lowest key of every group, including groups of
 one, so rows that have no duplicates are kept. (If the subquery were
 limited to groups with COUNT(*) > 1, NOT IN would delete those rows too.)
*/

delete from NACIMIENTO
where keyfield not in
(
select min(keyfield) from NACIMIENTO
GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP, NOMBRE_MADRE,
PRIMER_AP_MADRE, SEGUNDO_AP_MADRE, NOMBRE_ABAM, PRIMER_AP_ABAM,
SEGUNDO_AP_ABAM
)
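
The keep-the-lowest-key delete can be checked on toy data before running it against the real table. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for SQL Server, groups on just two of the ten columns, and treats keyfield as a hypothetical unique integer key. It compares a subquery that returns one key for every group with one that is restricted to groups having COUNT(*) > 1.

```python
import sqlite3

def fresh_db():
    """Build a toy table: three copies of one person plus one unique row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE nacimiento (keyfield INTEGER PRIMARY KEY, mun_ofi TEXT, nombre TEXT);
        INSERT INTO nacimiento VALUES
            (1, '001', 'ANA'), (2, '001', 'ANA'), (3, '001', 'ANA'),
            (4, '002', 'LUIS');
    """)
    return conn

# Keeps the lowest key of every group, including groups of one.
KEEP_ALL_GROUPS = """
    DELETE FROM nacimiento
    WHERE keyfield NOT IN (
        SELECT MIN(keyfield) FROM nacimiento GROUP BY mun_ofi, nombre)
"""

# Keeps the lowest key of duplicated groups only.
ONLY_DUP_GROUPS = """
    DELETE FROM nacimiento
    WHERE keyfield NOT IN (
        SELECT MIN(keyfield) FROM nacimiento
        GROUP BY mun_ofi, nombre HAVING COUNT(*) > 1)
"""

def survivors(delete_sql):
    """Run one DELETE on a fresh copy of the data and return the keys left."""
    conn = fresh_db()
    conn.execute(delete_sql)
    rows = conn.execute("SELECT keyfield FROM nacimiento ORDER BY keyfield")
    return [row[0] for row in rows]

safe = survivors(KEEP_ALL_GROUPS)
unsafe = survivors(ONLY_DUP_GROUPS)
print(safe)    # [1, 4]
print(unsafe)  # [1]
```

With the HAVING filter in the subquery, row 4 is lost even though it has no duplicate, so the subquery should cover every group.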

Author Comment

by:bethzycb
ID: 16954609
The problem is that I have to keep a copy of the deleted records.

And what happens to the recordset if I delete some of the records while I'm looping through it? Does their AbsolutePosition change?

Accepted Solution

by:
imran_fast earned 2000 total points
ID: 16977591
>>what happen to the recordset if while I'm looping it I delete some of the records?
In the statement above there is no looping and no recordset; it is a single direct DELETE statement.

If you want to keep a copy, then you have to execute two statements:


First, to record the duplicate rows
===================
-- keyfield is a placeholder for a column that uniquely identifies each row.
-- The subquery returns the lowest key of every group (including groups of
-- one), so only the extra copies are written to the backup table.
select * Into YourbackupTable from NACIMIENTO
where keyfield not in
(
select min(keyfield) from NACIMIENTO
GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP, NOMBRE_MADRE,
PRIMER_AP_MADRE, SEGUNDO_AP_MADRE, NOMBRE_ABAM, PRIMER_AP_ABAM,
SEGUNDO_AP_ABAM
)


Then, to delete them
==============
-- Run this only after the backup statement. keyfield is a placeholder for
-- a column that uniquely identifies each row; the lowest key of every
-- group is kept, so rows without duplicates are not deleted.
delete from NACIMIENTO
where keyfield not in
(
select min(keyfield) from NACIMIENTO
GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP, NOMBRE_MADRE,
PRIMER_AP_MADRE, SEGUNDO_AP_MADRE, NOMBRE_ABAM, PRIMER_AP_ABAM,
SEGUNDO_AP_ABAM
)
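
The backup-then-delete sequence can be tried on toy data first. This is a minimal sketch under stated assumptions, not a tested SQL Server script: it uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in, groups on two of the ten columns, and the names backup_dupes and keyfield are hypothetical. Both statements run in one transaction so the copy and the delete act on the same rows.

```python
import sqlite3

# SQLite stands in for SQL Server here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nacimiento (keyfield INTEGER PRIMARY KEY, mun_ofi TEXT, nombre TEXT);
    INSERT INTO nacimiento VALUES
        (1, '001', 'ANA'), (2, '001', 'ANA'), (3, '002', 'LUIS');
""")

# One key to keep per group; shared by both statements so they select the same rows.
KEEPERS = "SELECT MIN(keyfield) FROM nacimiento GROUP BY mun_ofi, nombre"

# Empty backup table with the same columns as the source table.
conn.execute("CREATE TABLE backup_dupes AS SELECT * FROM nacimiento WHERE 0")

with conn:  # the INSERT and the DELETE are committed together, or rolled back together
    conn.execute(
        f"INSERT INTO backup_dupes SELECT * FROM nacimiento WHERE keyfield NOT IN ({KEEPERS})"
    )
    conn.execute(f"DELETE FROM nacimiento WHERE keyfield NOT IN ({KEEPERS})")

backed_up = conn.execute("SELECT keyfield FROM backup_dupes ORDER BY keyfield").fetchall()
kept = conn.execute("SELECT keyfield FROM nacimiento ORDER BY keyfield").fetchall()
print(backed_up)  # [(2,)]
print(kept)       # [(1,), (3,)]
```

The order matters: once the delete has run, the same subquery no longer finds any extra copies, so the backup has to be taken first.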