Solved

Differences between two sets of texts

Posted on 2003-11-09
Last Modified: 2010-04-15
Hi, if I have 2 sets of texts (stored in arrays, not that that should matter)
and I need to mark all the differences in one of them, is there a function that automatically compares the entire texts and returns all the positions (perhaps) where the texts begin to differ and where they stop differing?
More specifically: I have a program that grabs html from a site, and then grabs it again a few seconds later to check whether it has changed.  If the texts differ, then the more recent text has to be printed to stdout, highlighting (changing the font to red) where the more recent text is different.
If there isn't already a function, what would be a good method to go about this?
Question by:Schdeffon
10 Comments
 
LVL 6

Accepted Solution

by:
GaryFx earned 63 total points
ID: 9709640
You can just pipe them through diff and capture the output and perhaps status code.  For example

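   rc = system("diff oldversion newversion > diffs.txt");
   if (rc != 0) {
      /* diff exits with 0 when the files match, 1 when they differ
         and 2 on error, so nonzero here normally means a difference */
      fprintf(stderr, "The files are different.\n");
   }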

Gary
0
 
LVL 45

Expert Comment

by:sunnycoder
ID: 9709645
If you are on a *nix machine, then you can use the diff command...
If you want to implement it on windows, you can look into the source of diff ... anyway, a simple character-by-character comparison is easy to write

suppose your old text is in char array old[] and new text is in char array new[]

char *p, *q;

p = old;
q = new;

/* walk both strings until either one reaches its terminating '\0' */
while ( *p != '\0' && *q != '\0' )
{
        if ( *p != *q )
        {
                 /* this char is different in both ... handle it */
        }
        p++;
        q++;
}

if ( *p == '\0' && *q != '\0' )
{
    /* q (new) has some extra text ... handle it */
}

if ( *q == '\0' && *p != '\0' )
{
    /* p (old) has some extra text ... handle it */
}
0
 
LVL 5

Expert Comment

by:migoEX
ID: 9709946
sunnycoder, I think your algorithm won't serve the goal Schdeffon wants it for.

For example, assume there's some counter in the body of the HTML, which changes from time to time.

If you compare 2 pages, where the counter is 9 and 10 - the entire file (starting from that point) will be different, as all the HTML text is shifted in the second case. But logically you wanted to find that 9 became 10.

I think that more complex algorithms should be used, which find the minimal differences between the files (as I believe diff will do).
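
To illustrate what "minimal differences" could mean here, below is a rough sketch based on the classic longest common subsequence (LCS) table, applied character by character. The function name mark_changes and its interface are made up for this example, and this is not a claim about how diff works internally (diff compares lines and uses a more memory-efficient method). The table needs (n+1)*(m+1) entries, so this is only practical for small texts.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: marks each character of new_text that is not part
   of a longest common subsequence with old_text. changed must have room
   for strlen(new_text) entries. Returns the number of marked characters,
   or -1 if memory allocation fails. */
long mark_changes(const char *old_text, const char *new_text,
                  unsigned char *changed)
{
    size_t n, m, i, j;
    size_t *len;            /* (n+1) x (m+1) table stored row by row */
    long count = 0;

    assert(old_text != NULL && new_text != NULL && changed != NULL);
    n = strlen(old_text);
    m = strlen(new_text);

    len = calloc((n + 1) * (m + 1), sizeof *len);
    if (len == NULL)
        return -1;
#define AT(r, c) len[(r) * (m + 1) + (c)]

    /* AT(i, j) = LCS length of old_text[i..] and new_text[j..] */
    for (i = n; i-- > 0; ) {
        for (j = m; j-- > 0; ) {
            if (old_text[i] == new_text[j])
                AT(i, j) = AT(i + 1, j + 1) + 1;
            else if (AT(i + 1, j) >= AT(i, j + 1))
                AT(i, j) = AT(i + 1, j);
            else
                AT(i, j) = AT(i, j + 1);
        }
    }

    /* Walk forward through both texts, following the table. */
    i = 0;
    j = 0;
    while (i < n && j < m) {
        if (old_text[i] == new_text[j]) {
            changed[j] = 0;             /* common character */
            i++;
            j++;
        } else if (AT(i + 1, j) >= AT(i, j + 1)) {
            i++;                        /* character only in old text */
        } else {
            changed[j] = 1;             /* character only in new text */
            count++;
            j++;
        }
    }
    while (j < m) {                     /* leftover tail of new text */
        changed[j++] = 1;
        count++;
    }
#undef AT
    free(len);
    return count;
}
```

With the counter example, comparing "count=9 visits" against "count=10 visits" marks only the "1" and the "0" in the new text, and the rest of the line is recognised as unchanged even though it has shifted by one position.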
0
 
LVL 45

Expert Comment

by:sunnycoder
ID: 9709960
that was what the asker seemed to be asking
"grabs it again a few seconds later to check whether it has changed.  If the texts differ, then the more recent text has to be printed to stdout, highlighting (changing the font to red) where the more recent text is different."

If diff-style output is required, I guess this should do:

int i = 0;
oldf = fopen(...);
newf = fopen(...);

while ( fgets(oldbuffer...) != NULL && fgets(newbuffer...) != NULL )
{
         i++;
         if ( strcmp (oldbuffer, newbuffer)  != 0 )
         {
               printf ( "line no. %d is different\n", i );
         }
}

A comparison routine can be added in place of printf to get the char/column number where the text differs.
EOF tests can be performed after the while loop to check whether the two files have different numbers of lines.
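
A comparison routine of that kind might look roughly like the following. The name first_difference is invented for this sketch, and it reports a zero-based character index rather than a screen column (tabs are not expanded).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the zero-based index of the first
   character at which a and b differ, or -1 if the strings are identical.
   If one string is a prefix of the other, the result is the length of
   the shorter string. */
long first_difference(const char *a, const char *b)
{
    size_t i = 0;

    assert(a != NULL && b != NULL);

    /* advance while both strings agree and a has not ended */
    while (a[i] != '\0' && a[i] == b[i])
        i++;

    if (a[i] == '\0' && b[i] == '\0')
        return -1;          /* reached the end of both: no difference */
    return (long)i;
}
```

Inside the loop above it could be called as first_difference(oldbuffer, newbuffer) once strcmp has reported that the two lines differ.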
0
 
LVL 6

Expert Comment

by:GaryFx
ID: 9710689
A decent line by line diff is a lot more work.  It's not simply a matter of identifying the differences, but you also need to find the places where things match up again.  In the case of HTML, it's also useful to ignore differences in whitespace.  

That's why I recommended just using the system diff, getting a free one (e.g. GNU diff) for MS Windows if necessary.  
There's no point in reinventing the wheel.

Gary
0
 
LVL 45

Assisted Solution

by:Kdo
Kdo earned 62 total points
ID: 9712052

Hi Schdeffon,

A couple of pieces of good advice have been posted, but it gets even more complicated than that.  Changing a single value in an html document can result in line breaks changing, which will change the number of lines and may change how the html is indented or otherwise formatted.  This can greatly increase the amount of work needed to test for the differences.

If this were my task, I'd read the two files into separate buffers and walk through them byte by byte.  At each byte I would then make certain value judgements.

If the bytes are the same, loop to the next byte.
If the bytes are both whitespace, skip to the next non-whitespace in each line.
If the bytes are both characters or digits, check to see if the WORD that starts at the current locations match.  (If both characters being pointed to are "A", assemble a word from each file and see if the words are different.  If they are different, you don't really care that a 1 in the first file became a 2 in the second as much as you'll care that the word "Address1" became "Address2" or that the value 101 became the value 102.)  You may want to disregard case.
If the bytes are both quote marks, build the corresponding strings and test them.
Finally, test the two bytes.  If they are not the same, mark a difference.
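
The rules above can be sketched as a small tokenizer plus a token-by-token comparison. This is only an illustration under some assumptions: the names next_token and first_token_difference are invented here, the comparison is case sensitive, whitespace between tokens is ignored but whitespace inside a quoted string is kept, and no resyncing is attempted after the first mismatch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: skips whitespace at *pos, stores the start of the
   next token in *start, advances *pos past it and returns its length.
   A token is a run of letters/digits, a double-quoted string, or any
   other single byte. A length of 0 means end of text. */
static size_t next_token(const char **pos, const char **start)
{
    const char *p = *pos;
    const char *s;

    while (isspace((unsigned char)*p))
        p++;
    s = p;
    if (isalnum((unsigned char)*p)) {
        while (isalnum((unsigned char)*p))
            p++;                        /* a word or a number */
    } else if (*p == '"') {
        p++;
        while (*p != '\0' && *p != '"')
            p++;                        /* body of a quoted string */
        if (*p == '"')
            p++;
    } else if (*p != '\0') {
        p++;                            /* single punctuation byte */
    }
    *start = s;
    *pos = p;
    return (size_t)(p - s);
}

/* Returns the zero-based index of the first token at which a and b
   differ, or -1 if every token matches. */
long first_token_difference(const char *a, const char *b)
{
    long index = 0;

    assert(a != NULL && b != NULL);
    for (;;) {
        const char *ta, *tb;
        size_t la = next_token(&a, &ta);
        size_t lb = next_token(&b, &tb);

        if (la == 0 && lb == 0)
            return -1;                  /* both texts ended together */
        if (la != lb || memcmp(ta, tb, la) != 0)
            return index;               /* whole tokens differ */
        index++;
    }
}
```

Because whole tokens are compared, a change from 101 to 102 is reported as one differing value rather than one differing digit, and reflowed or re-indented html compares as equal.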

Resyncing after finding a difference can be tough.  That's a science unto itself.


Good Luck,
Kent
0
 
LVL 5

Expert Comment

by:g0rath
ID: 9714947
One approach listed is what I've used in the past...it's tough but possible.

Take an html page and break it into a parse tree where all elements are branches and all values are leaves. Then write a tree compare function. If you need both case sensitive and case insensitive comparison, it may be better to add a flag for that.

You get the whitespace, indentation, newline issues solved, but the complexity does go up.
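
As a rough idea of the compare step only (the parsing is the hard part and is not shown), here is a sketch. The struct node layout, the function names and the ignore_case flag are all assumptions made up for this example, not taken from any particular html parser.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical parse tree node: elements are branches, values are leaves. */
struct node {
    const char *text;            /* tag name or leaf value */
    struct node *first_child;    /* NULL for a leaf */
    struct node *next_sibling;   /* NULL for the last sibling */
};

/* Compares two strings, optionally ignoring case. Returns 1 if equal. */
static int text_equal(const char *a, const char *b, int ignore_case)
{
    assert(a != NULL && b != NULL);
    for (;; a++, b++) {
        int ca = (unsigned char)*a;
        int cb = (unsigned char)*b;

        if (ignore_case) {
            ca = tolower(ca);
            cb = tolower(cb);
        }
        if (ca != cb)
            return 0;
        if (ca == '\0')
            return 1;            /* both strings ended together */
    }
}

/* Returns 1 if the two sibling lists (and everything below them) match
   node for node, 0 otherwise. */
int tree_equal(const struct node *a, const struct node *b, int ignore_case)
{
    while (a != NULL && b != NULL) {
        if (!text_equal(a->text, b->text, ignore_case))
            return 0;
        if (!tree_equal(a->first_child, b->first_child, ignore_case))
            return 0;
        a = a->next_sibling;
        b = b->next_sibling;
    }
    return a == NULL && b == NULL;   /* same number of siblings */
}
```

To report where the trees differ instead of a yes/no answer, the function could return the first mismatching node rather than 0.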
0
