Schdeffon

Differences between two sets of texts

Hi, if I have two sets of text (stored in arrays, not that that should matter) and I need to mark all the differences in one of them, is there a function that automatically compares the entire texts and returns all the positions where they begin and stop differing?
More specifically: I have a program that grabs HTML from a site, and then grabs it again a few seconds later to check whether it has changed. If the texts differ, the more recent text has to be printed to stdout, with the parts that differ from the older text highlighted (font changed to red).
If there isn't already such a function, what would be a good way to go about this?
ASKER CERTIFIED SOLUTION
GaryFx

sunnycoder

If you are on a *nix machine, you can use the diff command.
If you want to implement it on Windows, you can look at the source of diff. Anyway, the source for a simple character-by-character comparison should be easy to write.

Suppose your old text is in the char array old[] and the new text is in the char array new[]:

const char *p, *q;

p = old;
q = new;

/* walk both strings in step until either one reaches its terminating '\0' */
while ( *p != '\0' && *q != '\0' )
{
        if ( *p != *q )
        {
                 /* this char differs between the two texts ... handle it */
        }
        p++;
        q++;
}

if ( *p == '\0' && *q != '\0' )
{
    /* q (new) has extra text at the end ... handle it */
}

if ( *q == '\0' && *p != '\0' )
{
    /* p (old) has extra text at the end ... handle it */
}
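
For the "print the newer text with the changes in red" part of the question, the loop above could be filled in roughly as follows. This is only a sketch: the function name highlight_positional and the helper append are made up for illustration, and it assumes the output goes to a terminal that understands ANSI colour escape codes (for HTML output you would emit a tag instead).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RED   "\x1b[31m"  /* assumption: the terminal honours ANSI colour codes */
#define RESET "\x1b[0m"

/* Append s to the string in out if it fits; 0 on success, -1 if out is too small. */
static int append(char *out, size_t out_size, const char *s)
{
    size_t used = strlen(out), need = strlen(s);
    if (used + need + 1 > out_size)
        return -1;
    memcpy(out + used, s, need + 1);
    return 0;
}

/*
 * Build a copy of new_text in out in which every character that differs from
 * the character at the same position in old_text is wrapped in RED ... RESET.
 * Text in new_text past the end of old_text counts as different.
 * Returns the number of differing characters, or -1 if out is too small.
 */
int highlight_positional(const char *old_text, const char *new_text,
                         char *out, size_t out_size)
{
    const char *p = old_text, *q = new_text;
    int ndiff = 0, in_red = 0;
    char one[2] = { 0, 0 };

    assert(old_text != NULL && new_text != NULL && out != NULL);
    if (out_size == 0)
        return -1;
    out[0] = '\0';

    for (; *q != '\0'; q++) {
        int differs = (*p == '\0' || *p != *q);
        if (differs != in_red) {              /* switch colour only at boundaries */
            if (append(out, out_size, differs ? RED : RESET) != 0)
                return -1;
            in_red = differs;
        }
        one[0] = *q;
        if (append(out, out_size, one) != 0)
            return -1;
        ndiff += differs;
        if (*p != '\0')
            p++;
    }
    if (in_red && append(out, out_size, RESET) != 0)
        return -1;
    return ndiff;
}
```

Calling highlight_positional("count: 9 ok", "count: 7 ok", buf, sizeof buf) and then puts(buf) prints the new text with only the 7 in red. The repeated strlen in append keeps the sketch short but makes it quadratic, so a real version would track the write position instead.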
migoEX

sunnycoder, I think your algorithm won't serve the goal Schdeffon wants it for.

For example, assume there's some counter in the body of the HTML, which changes from time to time.

If you compare 2 pages where the counter is 9 in one and 10 in the other, the entire file (starting from that point) will be reported as different, because all the HTML text after the counter is shifted by one character in the second page. But logically you wanted to find only that 9 became 10.

I think that more complex algorithms should be used, which find the minimal differences between the files (as I believe diff will do).
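
To make the 9 versus 10 example concrete, here is a rough sketch of one such "minimal difference" approach: a character-level longest common subsequence (LCS). This is a swapped-in textbook technique, not necessarily how any particular diff is implemented, and the function name mark_changes is invented for the example. It uses a full (n+1) x (m+1) table, so it is only sensible for short texts; for whole pages you would more likely work on lines than on characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Set changed[j] to 1 for every character of new_text that is NOT part of a
 * longest common subsequence with old_text, and to 0 otherwise.
 * changed must have room for strlen(new_text) entries.
 * Returns the number of marked characters, or -1 if allocation fails.
 */
int mark_changes(const char *old_text, const char *new_text,
                 unsigned char *changed)
{
    size_t n, m, cols, i, j;
    size_t *lcs;
    int count = 0;

    assert(old_text != NULL && new_text != NULL && changed != NULL);
    n = strlen(old_text);
    m = strlen(new_text);
    cols = m + 1;

    /* lcs[i*cols + j] = LCS length of old_text[i..] and new_text[j..] */
    lcs = calloc((n + 1) * cols, sizeof *lcs);
    if (lcs == NULL)
        return -1;

    /* Fill the table from the ends of both strings towards the start. */
    for (i = n; i-- > 0; ) {
        for (j = m; j-- > 0; ) {
            if (old_text[i] == new_text[j]) {
                lcs[i * cols + j] = lcs[(i + 1) * cols + j + 1] + 1;
            } else {
                size_t down  = lcs[(i + 1) * cols + j];
                size_t right = lcs[i * cols + j + 1];
                lcs[i * cols + j] = down > right ? down : right;
            }
        }
    }

    /* Walk the table from the start to recover one optimal alignment. */
    i = 0;
    j = 0;
    while (j < m) {
        if (i < n && old_text[i] == new_text[j]) {
            changed[j++] = 0;                 /* character kept */
            i++;
        } else if (i < n &&
                   lcs[(i + 1) * cols + j] >= lcs[i * cols + j + 1]) {
            i++;                              /* character dropped from old */
        } else {
            changed[j++] = 1;                 /* character added in new */
            count++;
        }
    }

    free(lcs);
    return count;
}
```

With old text "counter: 9 visits" and new text "counter: 10 visits", only the two characters of "10" are marked, and everything after the counter lines up again instead of being reported as different.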
That was what the asker seemed to be asking for:
"grabs it again a few seconds later to check whether it has changed.  If the texts differ, then the more recent text has to be printed to stdout, highlighting (changing the font to red) where the more recent text is different."

In case diff-style output is required, I guess this should do:

int i = 0;
oldf = fopen(...);
newf = fopen(...);

/* read both files one line at a time until either of them runs out */
while ( fgets(oldbuffer...) != NULL && fgets(newbuffer...) != NULL )
{
         i++;
         if ( strcmp (oldbuffer, newbuffer)  != 0 )
         {
               printf ( "line no. %d is different\n", i );
         }
}

A comparison routine can be added in place of the printf to get the character/column number where the text differs.
EOF tests can be performed after the while loop to check whether the two files have different numbers of lines.
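
A possible shape for that comparison routine, plus the "different number of lines" check, is sketched below. It works on arrays of lines already held in memory (as in the original question) rather than on files, and the names first_diff_column and diff_lines are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the 1-based column of the first character at which a and b differ,
 * or 0 if the two lines are identical. If one line is a prefix of the other,
 * the column just past the end of the shorter line is returned.
 */
size_t first_diff_column(const char *a, const char *b)
{
    size_t col = 1;

    assert(a != NULL && b != NULL);
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
        col++;
    }
    return (*a == *b) ? 0 : col;
}

/*
 * Compare two arrays of lines position by position and store the 1-based
 * numbers of the differing lines in out. A line that exists in only one of
 * the arrays counts as differing. Returns the total number of differing
 * lines; at most max_out of them are stored.
 */
size_t diff_lines(const char *const *old_lines, size_t old_count,
                  const char *const *new_lines, size_t new_count,
                  size_t *out, size_t max_out)
{
    size_t longest = old_count > new_count ? old_count : new_count;
    size_t i, found = 0;

    for (i = 0; i < longest; i++) {
        int differs = i >= old_count || i >= new_count ||
                      strcmp(old_lines[i], new_lines[i]) != 0;
        if (differs) {
            if (found < max_out)
                out[found] = i + 1;
            found++;
        }
    }
    return found;
}
```

For each line number returned by diff_lines that exists in both arrays, first_diff_column gives the column to start highlighting from. Like the loop above, this compares line N only with line N, so an inserted or deleted line still makes everything after it look changed.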
A decent line by line diff is a lot more work.  It's not simply a matter of identifying the differences, but you also need to find the places where things match up again.  In the case of HTML, it's also useful to ignore differences in whitespace.  

That's why I recommended just using the system diff, getting a free one (e.g. GNU diff) for MS Windows if necessary.  
There's no point in reinventing the wheel.

Gary
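
On the point about ignoring whitespace differences in HTML: if you do end up comparing lines yourself, the strcmp could be swapped for something like the sketch below. The function name equal_ignoring_whitespace is made up, and the rule it applies (any run of whitespace equals any other run, leading and trailing whitespace is ignored) is just one reasonable choice, not a standard.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Compare two strings while treating every run of whitespace as equivalent
 * to any other run and ignoring leading and trailing whitespace.
 * Whitespace on one side with none on the other still counts as a difference.
 * Returns 1 if the strings are equal under that rule, 0 otherwise.
 */
int equal_ignoring_whitespace(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    /* skip leading whitespace on both sides */
    while (isspace((unsigned char)*a)) a++;
    while (isspace((unsigned char)*b)) b++;

    while (*a != '\0' && *b != '\0') {
        int a_space = isspace((unsigned char)*a);
        int b_space = isspace((unsigned char)*b);

        if (a_space || b_space) {
            /* consume the whole run on each side */
            while (isspace((unsigned char)*a)) a++;
            while (isspace((unsigned char)*b)) b++;
            if (*a == '\0' || *b == '\0')
                break;                  /* may be trailing; checked below */
            if (!a_space || !b_space)
                return 0;               /* a gap on one side only */
        } else {
            if (*a != *b)
                return 0;
            a++;
            b++;
        }
    }

    /* only trailing whitespace may remain on either side */
    while (isspace((unsigned char)*a)) a++;
    while (isspace((unsigned char)*b)) b++;
    return *a == '\0' && *b == '\0';
}
```

With this, "  <p>\n  Hi  </p>" and "<p> Hi </p>" compare equal, while "ab" and "a b" do not.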
One approach listed is what I've used in the past...it's tough but possible.

Take an HTML page and break it into a parse tree where all elements are branches and all values are leaves. Then write a tree compare function; if you need both case-sensitive and case-insensitive comparison, it may be better to add a flag for that.

You get the whitespace, indentation, newline issues solved, but the complexity does go up.
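
A minimal sketch of what such a tree compare might look like, assuming the HTML has already been parsed into a tree. The struct node layout and the names labels_equal and trees_equal are hypothetical, and the parser itself (the hard part) is not shown. The ignore_case flag applies to both tag names and leaf text here, which may or may not be what you want.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical parse-tree node: elements have children, value leaves have none. */
struct node {
    const char *label;                    /* tag name for an element, text for a leaf */
    const struct node *const *children;   /* NULL when child_count is 0 */
    size_t child_count;
};

/* Compare two labels, optionally ignoring ASCII case. Returns 1 if equal. */
static int labels_equal(const char *a, const char *b, int ignore_case)
{
    assert(a != NULL && b != NULL);
    for (; *a != '\0' && *b != '\0'; a++, b++) {
        int ca = (unsigned char)*a;
        int cb = (unsigned char)*b;
        if (ignore_case) {
            ca = tolower(ca);
            cb = tolower(cb);
        }
        if (ca != cb)
            return 0;
    }
    return *a == *b;   /* equal only if both strings ended together */
}

/*
 * Recursively compare two trees. Returns 1 if they have the same shape and
 * equal labels at every node, 0 otherwise.
 */
int trees_equal(const struct node *x, const struct node *y, int ignore_case)
{
    size_t i;

    if (x == NULL || y == NULL)
        return x == y;
    if (x->child_count != y->child_count)
        return 0;
    if (!labels_equal(x->label, y->label, ignore_case))
        return 0;
    for (i = 0; i < x->child_count; i++) {
        if (!trees_equal(x->children[i], y->children[i], ignore_case))
            return 0;
    }
    return 1;
}
```

This only answers "same or different"; to highlight the changes you would have the function record the first pair of nodes that fails to match instead of just returning 0.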