Solved

Finding Duplicates Using Local SQL?

Posted on 1998-04-15
21
194 Views
Last Modified: 2010-04-06
What is the best way to locate duplicate records (in one column on one DBase table) using local SQL?  The table column to search on is a real number. I would like to delete any duplicate records, except the original (first one in the natural record order).    Thanks.
0
Comment
Question by:martin_g
  • 7
  • 7
  • 6
  • +1
21 Comments
 
LVL 5

Expert Comment

by:julio011597
ID: 1361557
If you have no knowledge of what there could be in the field to check, i'd just make a complete scan of the table; here is a sample, assuming that "FieldToCheck" holds positive values only:

--//--
function ScanForDups(Table: TTable): Longint;
var
  val: Integer;
begin
  val := -1;
  with Table do begin
    First;
    while not EOF do begin
      if FieldByName('FieldToCheck').AsInteger = val then begin
        Result := RecNo;
        Exit;
      end
      else val := FieldByName('FieldToCheck').AsInteger;
      Next;
    end;
  end;
  Result := 1;
end;
--//--

This returns a RecNo of the first duplicate found; -1 otherwise.

Anyway there's a lot to know to give you a full solution:
- haven't you any knowledge on your data to avoid the full scan?
- what is the program expected to do when a duplicate is found? should it instead store all duplicates somewhere?
0
 
LVL 5

Expert Comment

by:julio011597
ID: 1361558
Oops, should be "Result := -1" down there.
0
 
LVL 1

Author Comment

by:martin_g
ID: 1361559
Edited text of question
0
 
LVL 3

Expert Comment

by:KE
ID: 1361560
Try this:

SELECT COLUMN.X, COLUMN.Y FROM "TABLENAME" X, "TABLENAME" Y WHERE COLUMN.X=COLUMN.Y

0
 
LVL 3

Expert Comment

by:KE
ID: 1361561
Do you have a comment ?

BTW. You should change the query to:

SELECT X.COLUMN Y.COLUMN FROM "TABLENAME" X, "TABLENAME" Y WHERE X.COLUMN=Y.COLUMN

EX:
SELECT X.REALNUM, Y.REALNUM
FROM "C:\TEST.DB" X, "C:\TEST.DB" Y
WHERE X.REALNUM=Y.REALNUM

You might want to include a field with a record identifier (unique field), so that you can actually delete the records afterwards.

EX:
SELECT X.ID, X.REALNUM, Y.ID, Y.REALNUM
FROM "C:\TEST.DB" X, "C:\TEST.DB" Y
WHERE X.REALNUM=Y.REALNUM

0
 
LVL 1

Author Comment

by:martin_g
ID: 1361562
KE thanks for your comments!  The problem is that the duplicates are within a single table. I don't have another table to compare them too.  For instance:

TABLE X

REAL_NO  SUBJECT
1.000          something
2.000          anotherthing
1.000          something
3.000          atotallydifferentthing

I need to find and delete the second dupe.  Any ideas?  Column REAL_NO is indexed.
0
 
LVL 3

Accepted Solution

by:
KE earned 20 total points
ID: 1361563
It's not a problem - the query I described uses the SAME table, just provide the SAME tablename for X and Y.
It works !!! - I've used it on one of my projects !!!

PS. If it is different tables, you wouldn't need to use the table aliases X and Y.

0
 
LVL 1

Author Comment

by:martin_g
ID: 1361564
KE
Whoops you're right about the table. I'll check it out.  Thanks!
0
 
LVL 5

Expert Comment

by:julio011597
ID: 1361565
Does that query returns a Live Result Set?
Will you be able to delete records that way?

...sorry, still learning.
0
 
LVL 3

Expert Comment

by:KE
ID: 1361566
Actually I'm not sure if the query provides a live result, but it's anyway often a handy thing to work with if you want to compare records - the results could however be used in a simple first-while not eof-next loop, and it's easy to compare multiple fields with different kinf of rules etc.
I will also mention that your funtion is not suitable for larger tables with many records, since it's starts over from the begining when called, so it could turn out to be very time consuming - however it will only require a few changes to overcome this - and it isn't ment as critic.

I was using this kind of query on a 80MB table with 45.000 records, and yes, it was also time consuming - it took around 2 minutes to complete it, but it was doing a task that would have taken days/weeks to implement in code - and I think a code version would have executed in at least the same time.

0
IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

 
LVL 5

Expert Comment

by:julio011597
ID: 1361567
You've submitted an answer, and you are not sure?

AFAIK, there's no query (neither yours) which can return a live result set for this kind of task; and if a query does not return a live result set, any loop is useless because you are not allowed to modify - so delete - data anyway.

Therefore, unless - as i've already pointed out - martin_g has to offer some additional knowledge on its data space, all methods have to deal with the table as a whole.

Yes, what i've proposed can be time consuming, but what else to do? Oh, well, just one thing: ask martin_g!

Hey, martin_g, is this procedure needed for a "one-time-only" mainenance task? If so there's nothing to add.
Is it needed for continuous user's usage instead? If so, couldn't we keep track of records the user enters from a "deletion" to the next, so that the loop has to be run on new records only?
Any other kind of knowledge to reduce the amount of records to scan?
And, *very* import (and you didn't bother ask!), how many records are there in the table?!! Is yours an 80M table with 45.000 records?

... and so on.

But, i'll leave this kind of investigation to someone else, since 20 points are already too few for so much effort, and i'm paying a modem connection.

Regards.
0
 
LVL 1

Author Comment

by:martin_g
ID: 1361568
Julio & KE:
Thanks all for your suggestions.  AS for information about the table, id is a dbase table (therefore there is no AutoIncrement field, this is done in code).  It has about 3 megs worth of information ( about  3,000 records).  I only need to find dupes for maintenenance puposes, not an ongoing process.  The field in question IS the unique record identification. (By the way, I found that setting  the UNIQUE check box in DBExplorer works only if the user uses a control to enter in the data, not if the data is enetered programtically, i.e. a batch import).

I tried KE's suggestion and the SQL resulted in two identical columns, probably due to the fact that that it compared the table to itself and there is no way to compare the result to another ID field (i.e. say  X.ID<>Y.ID) since all the info in the duplicates' columns are identical.  

I have used a variation of JULIO's  code process.  It works, but it is pretty slow, even when using on an indexed field.
0
 
LVL 3

Expert Comment

by:d4jaj1
ID: 1361569
Play nice guys ;o)  So, did the query return a live answer or would you have to do something extra to actually delete the records?
0
 
LVL 1

Author Comment

by:martin_g
ID: 1361570
Yes the SQL KE suggested returns a live answer, but the answer is not right. It returns two identical columns  with ALL the records in the Table  listed.  If I deleted at this point (Say DELETE insead of SELECT) I would delete all the records!  Not a happy situation!
0
 
LVL 3

Expert Comment

by:KE
ID: 1361571
I'll think I will follow d4jaj1's suggestion and don't reply on julio's "nice" comment.
However I still don't know WHY julio's code works. I don't see it mention that the field your searching for should be ordered. Also the improvement to this code would be to STAY in the loop like this:

function RemoveDups(Table: TTable): Longint;
var
  lastVal: Real;
begin
  with Table do begin
    First;
    lastval := FieldByName('FieldToCheck').AsReal;
    while not EOF do begin
      Next;
      if FieldByName('FieldToCheck').AsReal = lastVal then Delete
      else
        lastval := FieldByName('FieldToCheck').AsReal;
    end;
  end;
end;

PS. I haven't checked the code but I'm sure that YOU understand what I mean - and also the table should be indexed on the field you are searching on.

BTW. If one wish to operate on a Table and want to use a Query which doesn't produce a live result set, which is very handy at many implementations - the only thing to do is to make sure that your are having an unique field in your query/table. Then use this field to operate on a standard TTable to make your updates.
VERY usefull, and most experienced programmers will always make a table unique indexed, since this field is very important ! As an example - if you work with SQL servers, and you want to delete a record, you must be sure that it's an unique record, else you wouldn't know which record was going to be deleted - even that your cursor is currently placed on the record.
Always use unique indexes on a SQL table - the PDOX/dBASE tables uses the RecNo to handle this, but for a scalable solution one might want to make these unique indexed to.
0
 
LVL 1

Author Comment

by:martin_g
ID: 1361572
KE:
Thanks for the function! It's pretty close to the actual procedure I was using.  Other than the I marked the records to delete with a -1 and then deleted them separately ( mainly for checking purposes to comment out the delte portion and view what was to be deleted first).  It does work. ( Iwas hoping there would be a speedier way in SQL to get the job done).  Do you think the calls to Post are too redundant (I'm not very familiar with BDE and how the implied Post works, i.e. - moving to the next record programmatically in a loop).  Maybe they're not necessary.

PROCEDURE TForm1.DeleteDupRec;
VAR
  RecNo : REAL;
BEGIN
  Try
  WITH MainDB DO
       BEGIN
       RecNo := 0;
       Edit;
       Post;
       IndexName := 'RECIPE_NO';
       first;
       MainDB.refresh;
       RecNo := MainDBRECIPE_NO.value;
       WHILE NOT MainDB.EOF DO
             BEGIN
             Edit;
             IF  MainDBRECIPE_NO.value = RecNo THEN
                MainDBRECIPE_NO.value := - 1;
             RecNo := MainDBRECIPE_NO.value;
             Next;
             END;
       END;
  Finally
  MainDB.Post;
  WHILE MainDB.FindKey ( [ - 1]) = TRUE DO
        MainDB.DELETE;
END;
END;

0
 
LVL 3

Expert Comment

by:KE
ID: 1361573
This should be a little quicker:

PROCEDURE TForm1.DeleteDupRec;
VAR
  SearchValue : REAL;
BEGIN
  Try
    WITH MainDB DO BEGIN
      IndexName := 'RECIPE_NO';
      first;
      SearchValue := MainDBRECIPE_NO.value;
      WHILE NOT MainDB.EOF DO BEGIN
        IF  MainDBRECIPE_NO.value = SearchValue THEN begin
        edit;
        MainDBRECIPE_NO.value := - 1;
        post;
      end else SearchValue := MainDBRECIPE_NO.value;
      Next;
    END;
  END;
  WHILE MainDB.FindKey ( [ - 1]) DO MainDB.DELETE;
END;

That's it, if you want more help please increase the points, else I'm the only one left here ;-)

PS. I changed your RecNo variable - it's not a good practise to declare symbols which is used by the VCL - believe ME !!!

PPS. If it's of no importance to check the values, I'll suggest that you use the previous function, since it quicker.
0
 
LVL 5

Expert Comment

by:julio011597
ID: 1361574
No, my comment was not meant to be "nice".

What's nice is that i can no more see a Query anywhere.
What's nice is that your code looks quite similar to mine - no, no, just a joke, i guess yours does not even compile
What's nice is that you keep the customer's Edit-Modify-Post scheme.
What's nice is that your code - made to work - would always delete the table's first row.

AND... this is amazing... you end up with asking more points!!

Great.

I don't want to be nice.
We are supposed to be experts, not people just playing a game.
0
 
LVL 1

Author Comment

by:martin_g
ID: 1361575
KE & JULIO:

This is the final code (although I don't think I've ever written any really final code!) I'll be using.  Julio, your were right, again. My previous code did delete the first record every time.  KE, thanks for the simplified code to make it run a bit faster.
PROCEDURE TForm1.DeleteDupRec;
VAR
  SearchRec : REAL;
  BEGIN
  Try
  WITH MainDB DO
       BEGIN
       SearchRec := 0;
       IndexName := 'RECIPE_NO';
       first;
       WHILE NOT MainDB.EOF DO
             BEGIN
             Edit;
             IF  MainDBRECIPE_NO.value = SearchRec THEN
             MainDBRECIPE_NO.value := - 1;
             SearchRec := MainDBRECIPE_NO.value;
             MainDB.Post;
             Next;
             END;
       END;
  Finally
  WHILE MainDB.FindKey ( [ - 1]) = TRUE DO
        BEGIN
        MainDB.Edit;
        MainDB.DELETE;
        END;
END;
END;


0
 
LVL 3

Expert Comment

by:KE
ID: 1361576
Sorry to everyone, I just can't help it...

>No, my comment was not meant to be "nice".
OK - if you aren't able to conform to the regular behaviour of this site - please don't comment anything. The only thing that you have actually contributed with to this question (even the few points) is complaints about other people trying to help.
Also you contributed with an "almost" useless function, which you try to defend as the only possible solution - good work !!!

>What's nice is that i can no more see a Query anywhere.
This is only because there's no unique indexed field involved here - I could have started asking what kind of system, server, OS etc. marting_g was using, but....

>What's nice is that your code looks quite similar to mine - no, no, just a joke, i guess yours does not even compile
Speaking of compiling a code snip written direct on a WEB server - I'll think it's up to the answerer to guess where the ; etc. is missing.

>What's nice is that you keep the customer's Edit-Modify-Post scheme.
>What's nice is that your code - made to work - would always delete the table's first row.
I dont see what you mean,, martin_g was only asking to simplify his function, and I took the most obvious things out.
Take yourself some time to read all the comments - my procedure was already contributed earlier (and if you can't figure it out - it doesn't do either of your NEW claims).

>AND... this is amazing... you end up with asking more points!!
Yeah, but what you didn't understood that it was meant as keeping you to work. I've read your record, and it's very obvious to me that your'e only visiting this site to be a member of the hi-score, what about trying to help people instead ?
Is the points all that makes your life a success - what a pitty ?

>I don't want to be nice.
>We are supposed to be experts, not people just playing a game.
Wrong again - we're supposed to help other people, not playing your kind of expert game for some "useless" points - will an expert score ever prove, that your'e a true expert (keep on dreaming) !!!
Regarding the expert - try to behave first, then someone migth find you an expert some day - do you even have a job, I would never employ such an ignorant.

I'm not writing this because I want to defend my honor (who cares about you anyway ?), but only to "protect" the next victim of your ignorance - someday, when you grow up, you will understand what I mean.

0
 
LVL 5

Expert Comment

by:julio011597
ID: 1361577
> The only thing that you have actually contributed with to this question (even the few points) is complaints about other people trying to help.

I've just complained about you.

> Also you contributed with an "almost" useless function, which you try to defend as the only possible solution - good work !!!

Reread.

Mine was a commen saying: "Anyway there's a lot to know to give you a full solution".

Just after that came your *answer*:

"Try this:
SELECT COLUMN.X, COLUMN.Y
FROM "TABLENAME" X, "TABLENAME" Y
WHERE COLUMN.X=COLUMN.Y"

And, i'm not sure at all mine was the best way to go - so your final solution -, because there were indeed things to know that you didn't bother ask and martin_g couldn't know were important.

In general, while you've taken it personally (yes, this makes things much easier), most of my complaints came from what you've said and done under a _technical_ point of view. That is, i'm sure you are a wonderful guy, you just don't seem a programmer or a software consultant.

Good luck.
0

Featured Post

What Security Threats Are You Missing?

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

Join & Write a Comment

Suggested Solutions

A lot of questions regard threads in Delphi.   One of the more specific questions is how to show progress of the thread.   Updating a progressbar from inside a thread is a mistake. A solution to this would be to send a synchronized message to the…
Objective: - This article will help user in how to convert their numeric value become words. How to use 1. You can copy this code in your Unit as function 2. than you can perform your function by type this code The Code   (CODE) The Im…
Get a first impression of how PRTG looks and learn how it works.   This video is a short introduction to PRTG, as an initial overview or as a quick start for new PRTG users.
This tutorial demonstrates a quick way of adding group price to multiple Magento products.

744 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now