Solved

How to rewrite a SELECT statement so it does not stall on a multi-million-record table

Posted on 2006-06-20
Last Modified: 2011-10-03
Hi!

I have a SQL Server table with 3 million records. I need to deduplicate it based on a certain group of fields.

I've already written an application in VB to do so. This is how it works:

1. Open a recordset with this statement:

SELECT TOP 1000 *
FROM NACIMIENTO AS TABLA1
WHERE EXISTS (
    SELECT MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP,
           NOMBRE_MADRE, PRIMER_AP_MADRE, SEGUNDO_AP_MADRE,
           NOMBRE_ABAM, PRIMER_AP_ABAM, SEGUNDO_AP_ABAM, COUNT(*)
    FROM NACIMIENTO AS TABLA2
    WHERE [TABLA1].[MUN_OFI] = [TABLA2].[MUN_OFI]
      AND [TABLA1].[NOMBRE] = [TABLA2].[NOMBRE]
      AND [TABLA1].[PRIMER_AP] = [TABLA2].[PRIMER_AP]
      AND [TABLA1].[SEGUNDO_AP] = [TABLA2].[SEGUNDO_AP]
      AND [TABLA1].[NOMBRE_MADRE] = [TABLA2].[NOMBRE_MADRE]
      AND [TABLA1].[PRIMER_AP_MADRE] = [TABLA2].[PRIMER_AP_MADRE]
      AND [TABLA1].[SEGUNDO_AP_MADRE] = [TABLA2].[SEGUNDO_AP_MADRE]
      AND [TABLA1].[NOMBRE_ABAM] = [TABLA2].[NOMBRE_ABAM]
      AND [TABLA1].[PRIMER_AP_ABAM] = [TABLA2].[PRIMER_AP_ABAM]
      AND [TABLA1].[SEGUNDO_AP_ABAM] = [TABLA2].[SEGUNDO_AP_ABAM]
    GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP,
             NOMBRE_MADRE, PRIMER_AP_MADRE, SEGUNDO_AP_MADRE,
             NOMBRE_ABAM, PRIMER_AP_ABAM, SEGUNDO_AP_ABAM
    HAVING COUNT(*) > 1
)
ORDER BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP,
         NOMBRE_MADRE, PRIMER_AP_MADRE, SEGUNDO_AP_MADRE,
         NOMBRE_ABAM, PRIMER_AP_ABAM, SEGUNDO_AP_ABAM **

** I used the TOP clause to get the query to run, because without it the application just hung. But it still hangs! I also tried running this query in Query Analyzer, but it doesn't work there either.

2. When I can get the recordset opened (with tables of fewer than 300K records), I move through the records looking for the duplicates (there can be 2, 3 or more duplicated records for each one). I apply a certain group of criteria to select the record to keep.

3. Then I create a DeletedTable table and copy the duplicated records into it.

4. When this has been done, I run a DELETE FROM Table WHERE EXISTS (SELECT * FROM DELETEDTABLE WHERE DELETEDTABLE.KEY1 = TABLE.KEY1 ...). I have around 6 key fields, but usually only one of them is used to deduplicate on.

The question is: is there any way to optimize the SQL in step 1?

Or is there some other way to detect duplicated records with this kind of data?

The fields I'm deduplicating on are all text fields.

Any help will be greatly appreciated.
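
For reference, the detection step on its own can be written as a plain GROUP BY ... HAVING query with no correlated subquery. The sketch below is only an illustration under stated assumptions: it uses Python's sqlite3 module as a stand-in so it can be run anywhere (SQL Server syntax and performance will differ), the keyfield column is hypothetical, and only three of the ten fields from the question are used to keep the sample short.

```python
import sqlite3

# Hypothetical, shortened subset of the ten fields named in the question.
DEDUP_COLS = "MUN_OFI, NOMBRE, PRIMER_AP"


def duplicate_groups(conn):
    """Return one row per repeated combination, with the number of copies."""
    sql = (
        f"SELECT {DEDUP_COLS}, COUNT(*) AS copies "
        f"FROM NACIMIENTO "
        f"GROUP BY {DEDUP_COLS} "
        f"HAVING COUNT(*) > 1 "
        f"ORDER BY {DEDUP_COLS}"
    )
    return conn.execute(sql).fetchall()


def build_sample():
    """Create an in-memory table with a few duplicated rows (sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE NACIMIENTO ("
        "keyfield INTEGER PRIMARY KEY, "  # assumed unique key column
        "MUN_OFI TEXT, NOMBRE TEXT, PRIMER_AP TEXT)"
    )
    conn.executemany(
        "INSERT INTO NACIMIENTO (MUN_OFI, NOMBRE, PRIMER_AP) VALUES (?, ?, ?)",
        [
            ("01", "ANA", "LOPEZ"),
            ("01", "ANA", "LOPEZ"),
            ("01", "ANA", "LOPEZ"),
            ("02", "LUIS", "PEREZ"),
            ("03", "EVA", "RUIZ"),
            ("03", "EVA", "RUIZ"),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in duplicate_groups(build_sample()):
        print(row)
```

Each returned row identifies one group of duplicates, so the application only has to look at the groups that actually repeat instead of scanning every record.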
Question by:bethzycb
Expert Comment

by:imran_fast
ID: 16949113
/*
 This is the best way to find duplicate records and delete them.
 You don't need to put the records in a temporary table to delete them.
 keyfield must be a column that uniquely identifies each row.
 The subquery returns the lowest keyfield of every group, so one row per
 group is kept, including rows that have no duplicate at all.  */

delete from NACIMIENTO
where keyfield not in
(
select min(keyfield) from NACIMIENTO
GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP, NOMBRE_MADRE,
PRIMER_AP_MADRE, SEGUNDO_AP_MADRE, NOMBRE_ABAM, PRIMER_AP_ABAM,
SEGUNDO_AP_ABAM
)
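
A small runnable sketch of the keep-the-lowest-key idea, to show which rows survive. It is an illustration only: it uses Python's sqlite3 module as a stand-in for SQL Server, a hypothetical integer keyfield, and three of the ten fields. Note that the subquery groups all rows, not only the groups with COUNT(*) > 1, so a row with no duplicate is its own minimum and is kept.

```python
import sqlite3

# Hypothetical, shortened subset of the ten fields named in the question.
DEDUP_COLS = "MUN_OFI, NOMBRE, PRIMER_AP"


def build_sample():
    """Create an in-memory table with keys 1..6 (sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE NACIMIENTO ("
        "keyfield INTEGER PRIMARY KEY, "  # assumed unique key column
        "MUN_OFI TEXT, NOMBRE TEXT, PRIMER_AP TEXT)"
    )
    conn.executemany(
        "INSERT INTO NACIMIENTO (MUN_OFI, NOMBRE, PRIMER_AP) VALUES (?, ?, ?)",
        [
            ("01", "ANA", "LOPEZ"),   # key 1, kept (lowest key of its group)
            ("01", "ANA", "LOPEZ"),   # key 2, duplicate
            ("01", "ANA", "LOPEZ"),   # key 3, duplicate
            ("02", "LUIS", "PEREZ"),  # key 4, unique row, kept
            ("03", "EVA", "RUIZ"),    # key 5, kept
            ("03", "EVA", "RUIZ"),    # key 6, duplicate
        ],
    )
    return conn


def delete_duplicates(conn):
    """Delete every row that is not the lowest-keyed row of its group."""
    cur = conn.execute(
        "DELETE FROM NACIMIENTO WHERE keyfield NOT IN ("
        f"SELECT MIN(keyfield) FROM NACIMIENTO GROUP BY {DEDUP_COLS})"
    )
    return cur.rowcount  # number of rows removed


def remaining_keys(conn):
    rows = conn.execute("SELECT keyfield FROM NACIMIENTO ORDER BY keyfield")
    return [r[0] for r in rows]


if __name__ == "__main__":
    conn = build_sample()
    print("deleted:", delete_duplicates(conn))
    print("remaining keys:", remaining_keys(conn))
```

Keeping the lowest key is an arbitrary rule; if other criteria decide which record to keep, the MIN(keyfield) expression is the part that would need to change.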

Author Comment

by:bethzycb
ID: 16954609
The problem is that I have to keep a copy of the deleted records.

And what happens to the recordset if I delete some of the records while I'm looping through it? Does their AbsolutePosition change?

Accepted Solution

by:imran_fast (earned 2000 total points)
ID: 16977591
>>What happens to the recordset if I delete some of the records while I'm looping through it?
There is no looping in the statement above. It is not a recordset; it is a direct DELETE statement.

If you want to keep a copy, then you have to execute two statements.


First, to record the duplicate rows
===================================
-- Copies every row that is not the lowest-keyed row of its group.
-- keyfield must be a column that uniquely identifies each row.
select * Into YourbackupTable from NACIMIENTO
where keyfield not in
(
select min(keyfield) from NACIMIENTO
GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP, NOMBRE_MADRE,
PRIMER_AP_MADRE, SEGUNDO_AP_MADRE, NOMBRE_ABAM, PRIMER_AP_ABAM,
SEGUNDO_AP_ABAM
)


Then, to delete them
====================
-- Deletes every row that is not the lowest-keyed row of its group.
-- Rows that have no duplicate are their own minimum, so they are kept.
delete from NACIMIENTO
where keyfield not in
(
select min(keyfield) from NACIMIENTO
GROUP BY MUN_OFI, NOMBRE, PRIMER_AP, SEGUNDO_AP, NOMBRE_MADRE,
PRIMER_AP_MADRE, SEGUNDO_AP_MADRE, NOMBRE_ABAM, PRIMER_AP_ABAM,
SEGUNDO_AP_ABAM
)
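
The copy and the delete can also be run as one unit, so that no row is deleted without first being copied. The sketch below is an assumption-laden illustration, not a SQL Server script: it uses Python's sqlite3 module as a stand-in, a hypothetical integer keyfield, three of the ten fields, and a pre-created YourbackupTable filled with INSERT ... SELECT instead of SELECT ... INTO. The delete is driven by the backup table, so the rows removed are exactly the rows that were copied.

```python
import sqlite3

# Hypothetical, shortened subset of the ten fields named in the question.
DEDUP_COLS = "MUN_OFI, NOMBRE, PRIMER_AP"
COLUMNS = "keyfield INTEGER, MUN_OFI TEXT, NOMBRE TEXT, PRIMER_AP TEXT"


def build_sample():
    """Create the source table (keys 1..6) and an empty backup table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE NACIMIENTO ({COLUMNS})")
    conn.execute(f"CREATE TABLE YourbackupTable ({COLUMNS})")
    conn.executemany(
        "INSERT INTO NACIMIENTO VALUES (?, ?, ?, ?)",
        [
            (1, "01", "ANA", "LOPEZ"),
            (2, "01", "ANA", "LOPEZ"),
            (3, "01", "ANA", "LOPEZ"),
            (4, "02", "LUIS", "PEREZ"),
            (5, "03", "EVA", "RUIZ"),
            (6, "03", "EVA", "RUIZ"),
        ],
    )
    conn.commit()
    return conn


def backup_and_delete(conn):
    """Copy the surplus rows, then delete them, inside one transaction."""
    with conn:  # commits both statements together, rolls back on an error
        conn.execute(
            "INSERT INTO YourbackupTable "
            "SELECT * FROM NACIMIENTO WHERE keyfield NOT IN ("
            f"SELECT MIN(keyfield) FROM NACIMIENTO GROUP BY {DEDUP_COLS})"
        )
        # Delete only what was copied, so the two sets cannot drift apart.
        conn.execute(
            "DELETE FROM NACIMIENTO WHERE keyfield IN ("
            "SELECT keyfield FROM YourbackupTable)"
        )


def keys(conn, table):
    rows = conn.execute(f"SELECT keyfield FROM {table} ORDER BY keyfield")
    return [r[0] for r in rows]


if __name__ == "__main__":
    conn = build_sample()
    backup_and_delete(conn)
    print("kept:", keys(conn, "NACIMIENTO"))
    print("backed up:", keys(conn, "YourbackupTable"))
```

This assumes the backup table starts empty; if it already holds rows from an earlier run, the delete condition would need to be restricted to the rows copied in the current run.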