Comparing textfiles.

Hi all,
I've made an application that compares two similar textfiles line for line, then write an output of both the lines if they are different;


aaaaaaaaaaaaaa       <->        aaaaaaaaaaaaaa
bbbbbbbbbbbbbb       <->        bbbbbbbbbbbbbb
cccccccccccccccc       <->        cccccccccccccccx   -> ccccccccccccccccc
                                                                            ccccccccccccccccx
.....
.....
zzzzzzzzzzzzzzzz        <->        zzzzzzzzzzzzzzzz


The thing is that this won't work if somewhere in one of the files a line has been added, in which case I need only this line to be written, and then the process resynchronized on the next similar line. There can be several instances of added lines after eachother, in numerous instances of the file, but mainly the two files will contain large chunks where the lines are equal. How do I solve this problem?

A.




LVL 2
AqueathAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Wim ten BrinkSelf-employed developerCommented:
This is not as simple as you think. Besides, there's already a tool that's doing this which is called WinDiff. There are probably some sources on the Internet somewhere that show you where to start.

And how would you handle this:

File 1                  File 2
AAAAAA                  AAAAAA
BBBBBB                  CCCCCC
CCCCCC                  BBBBBB
DDDDDD                  DDDDDD

Now, you have two lines that are exchanged... :-)
0
BlackTigerXCommented:
it's quite a chalenge

one approach you could take is make one of your files the "master"

then you would go checking the lines of file 1 (master) in file2, if there is a mismatch, you keep searching in file2 for the same line in file1, if you don't find it, you advance a line in file1 and so on...

so in the example that Workshop_Alex showed, it would go like this:

AAAA AAAA == same - both "pointers" advance
BBBB CCCC == diff - stay in same line in file 1, advance in file 2
BBBB BBBB == same - both pinters advance
CCCC DDDD == diff - stay in same line in file 1, advance in file 2

...that's just an idea, haven't tried it... there's many things that could be done... maybe even using multiple threads
0
BlackTigerXCommented:
the ideal would probably be that if you find a difference, you start looking in the "other" file at the same time, whichever finds a match in the "other" file stops first and consider that a match (and so both pointers keep advancing from there)

so, the same concept of my previous answer, except that is applied to both files, when there is a difference:
- you have a pointer static in file 1 looking at the next lines of file 2
- you also have a static pointer in file 2, looking at the next lines of file 1
0
Get expert help—faster!

Need expert help—fast? Use the Help Bell for personalized assistance getting answers to your important questions.

AqueathAuthor Commented:
Hi again,
Thanks for the comments so far, and it seems that this was a bigger problem than I first expected. So I've raised the reward ;-)

-Workshop_Alex:
Aren't there components that could do this? I've downloaded something called TDiff, which compares two files, and lists the difference in two separate windows. The thing is it merges the two files, deletes some lines, and I'm too stupid to figure out a way to integrate it into my program. What I would be looking for is a component that is for instance able to compare two StringLists or Listboxes, and expand one or both in the same way that graphical tools do this. Ie if an extra line has been inserted into StringList one there would be a blank one in Stringlist two etc. Then I could compare each line in the StringLists(now containing the same number of lines) with eachother. Something like ;

StrLstDiff(StringList1,StringList2);

Maybe this can be done with WinDiff or the component I downloaded:
TDiff
http://www.users.on.net/johnson/delphi/


-BlackTigerX
I started out down the same path you described above prior to posting my initial message, but it turned out to be quite messy, and I probably did not find a good solution as it turned out to be pretty slow.



So I figured that someone has probably done this thing before, and made a component for this sort of thing that makes this easy. I mean, why reinvent the wheel? ;-)


Anyone know a component that could do this, and know how I could implement it?

A.
0
BlackTigerXCommented:
I started writing it, is almost done... but if you want a component, here you go

http://www.users.on.net/johnson/delphi/
0
Wim ten BrinkSelf-employed developerCommented:
There are probably many components that will compare files. It's just that the algorithm isn't easy. You might also want to strip whitespaces and ignore empty lines, or have some other special fuunctionality.

There could be an interesting trick by using a hash table. You calculate a hash over each line in your file and store it in an array. Then you do the same with every line in the second file and check if it's in the hash table or not. Now, the lines in the second file that can be found in the hash table are duplicates. If you do it too in opposite direction too, creating a hash table for the secoond file and check every line of the first file with this hash table, you will know exactly which lines exist in both files. This would allow you to show all lines that don't exist in both files.
Now, about those lines that do exist in both files, all you would have to do next is check if they exist in the same order.

But it's one of the more complex algorithms and while I'm working on a hashtable right now, I just don't have the time to write a solution for this. Am curious of BlackTigerX will provide a good solution, though. :-)
0
gwalkeriqCommented:
winmerge http://winmerge.sourceforge.net/ is an open source version that is IMO better than Windiff. Full soure, written in C.

Simpler, but more powerful diff source is, http://ftp.gnu.org/pub/gnu/diffutils/ is a place where you find the GNU diff source

If you want to understand the algorithms, try here -- http://c2.com/cgi/wiki?DiffAlgorithm

If you want a commercial delphi component, try http://www.brothersoft.com/Software_Developer_Delphi_Pass_Diff_Pro_24080.html

Ad-hoc solutions are not likely to be very fast or robust -- check the papers referenced in the 3rd link if you are serious about learning how to roll you own diff.


0
geobulCommented:
Another interesting solution: http://www.ascotti.org/programming/lcs/lcs.htm with source.
0
BlackTigerXCommented:
well... here's my shot at it... I stopped writing yesterday since this guy wants a component, but finished this morning... just because...

hasn't been tested thoughtfully, but is a good start... it uses the concept that I explained, and which is also kinda the same concept WorkShop_Alex has, only without using hash tables...

I wrote a function that takes 2 sources (F1 and F2), it outputs the lines that are equal and the ones that are different (missing from either file)

procedure FindDifferences(F1, F2:TStrings; Diff:TStrings);
var
  P1, P2:Integer; //"pointers", or current row for each file
  tmpP1, tmpP2:Integer; //temp "pointers" to search the other file
  Diffs1, Diffs2:TStringList;
  Match1, Match2:Boolean;
begin
  Diff.Clear;
  Diffs1:=TStringList.Create;
  Diffs2:=TStringList.Create;
  try
    if (F1.Count=0) or (F2.Count=0) then
      Exit; //nothing to do, one of the files is empty
    P1:=0;
    P2:=0;
    repeat
      if (P1>=F1.Count) or (P2>=F2.Count) then
        Break; //we're done
      if (F1[P1]=F2[P2]) then //there's a match, increment both pointers and go to next line
      begin
        Diff.Add(F1[P1]+ ' == '+F2[P2]);
        Inc(P1);
        Inc(P2);
        Continue;
      end
      else //the fun begins here, there's a mismatch
      begin
        //Diff.Add(F1[P1]+' <-> '+F2[P2]); //"print" the difference
        tmpP1:=P1+1; //go to the next line in file 1
        tmpP2:=P2+1; //go to the next line in file 2
        Diffs1.Clear;
        Diffs2.Clear;
        Diffs1.Add('>>'+F1[P1]);
        Diffs2.Add('<<'+F2[P2]);
        Match1:=False;
        Match2:=False;
        repeat
          if (tmpP1>=F1.Count) or (tmpP2>=F2.Count) then
            Break; //one of the files reached the end, exit
          if (F1[P1]=F2[tmpP2]) or (F2[P2]=F1[tmpP1]) then
          begin
            if (F1[P1]=F2[tmpP2]) then
            begin
              Match1:=True;
              Diff.AddStrings(Diffs2)  //add the lines that are missing from the other file
            end
            else
            begin
              Match2:=True;
              Diff.AddStrings(Diffs1);
            end;
            Break; //we found a match, of the same line in file 1, in another line of file 2
          end;
          Diffs1.Add('>>'+F1[tmpP1]); //missing from file 2
          Diffs2.Add('<<'+F2[tmpP2]); //missing from file 1
          Inc(tmpP1);
          Inc(tmpP2)
        until (False);
        //*** only advance in the other file
        if (Match1) then
          P2:=tmpP2
        else if (Match2) then
          P1:=tmpP1;
        if (Match1) or (Match2) then
          Diff.Add(F1[P1]+ ' == '+F2[P2])
        else
          Diff.Add(F1[P1]+ ' <-> '+F2[P2]);
        //we found a match, increment both pointers
        Inc(P1);
        Inc(P2)
      end
    until (False)
  finally
    Diffs1.Free;
    Diffs2.Free
  end
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  Memo1.Lines.LoadFromFile('test.txt');
  Memo2.Lines.LoadFromFile('test2.txt');
  FindDifferences(Memo1.Lines, Memo2.Lines, Memo3.Lines);
end;

I used 2 memos to test this, load 2 text files in the first 2, and output to the third

best regards
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
aikimarkCommented:
take a look at http://DLSuperC.com

it is very fast and powerful and has several variants that can do comparisons of directory trees and comparisons of files in non-line chunks (words, bytes, etc.)
0
AqueathAuthor Commented:
Hi All,
The points go to BlackTigerX for a solution that deliveres just what I needed. A big thanks to you, and all that participated.

A.
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Delphi

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.