Solved

Please Comment This Perl Code Snippet

Posted on 2007-04-05
Last Modified: 2010-03-05
I'd appreciate if a Perl expert who really understands exactly what is happening in every part of this code, could add comments line by line. I need to know how it works line by line.  500 points for detailed comments. Thanks.

$DiffArgs=''; # set to -w to ignore white spaces

(@OldLines,@NewLines,$ii,$iii,$iiii,$is);
open(DIFF,"diff -f $DiffArgs $OldFile $NewFile |");
open(OLD,$OldFile);  
push(@OldLines,'');
push(@NewLines,'');

while(<OLD>){
  chomp;
  push(@OldLines,$_);
  push(@NewLines,$_);
  warn"unclosed or unopened tag detected, malfunction warning: $_" if(/<[^>]*$|^[^<]*>/);
}

close(OLD);

$iii=0;
$TmpFile="$ENV{HOME}/.HtmlDiff";
$FH;
$FH=\*STDOUT;

while(<DIFF>){
  if(/^d(\d+)(\s+(\d+))?$/){
    $iiii= defined $2 ? $2 : $1;
    for($ii=$iii;$ii<$1;$ii++){PrintLine('Equal',$NewLines[$ii])}
    for($ii=$1;$ii<=$iiii;$ii++){PrintLine('Old',$NewLines[$ii])}
    $iii=$iiii+1;
  }

  elsif(/^c(\d+)(\s+(\d+))?$/){
    $iiii= defined $2 ? $2 : $1;
    for($ii=$iii;$ii<$1;$ii++){PrintLine('Equal',$NewLines[$ii])}
    for($ii=$1;$ii<=$iiii;$ii++){PrintLine('Old',$NewLines[$ii])}
    while(defined($is=<DIFF>)){
      chomp;
      if($is=~/^\.$/){last}
      else{PrintLine('New',$is)}
    }
    $iii=$iiii+1;
  }
  elsif(/^a(\d+)$/){
    $iiii=$1;
    for($ii=$iii;$ii<=$1;$ii++){PrintLine('Equal',$NewLines[$ii])}
    while(defined($is=<DIFF>)){
      if($is=~/^\.$/){last}
      else{PrintLine('New',$is)}
    }
    $iii=$1+1;
  }
}

for($ii=$iii;$ii<=$#NewLines;$ii++){PrintLine('Equal',$NewLines[$ii])}
unlink($OldFile,$NewFile);

print $output;

Thanks!  (By the way, I have another question asking for comments on another part of the script, at: http:Q_22495367.html (for another 500 points)).
Question by:Randall-B
22 Comments
 
LVL 84

Expert Comment

by:ozo
perldoc -f open
perldoc -f push
perldoc -f chomp
perldoc -f warn
perldoc -f close
etc.
explain what each of the functions does.
If you want to know why the code is calling them, it's hard for us to say without knowing why the code was written.
 

Author Comment

by:Randall-B
The script uses Unix DIFF to compare html documents. Then it outputs a web page that shows underlines where additions were made, and strikeouts where deletions were made.  The entire script is visible here: http://216.92.61.99/htmldiffcgi.htm .  Maybe that context would help. And you can see the results of it here: http://216.92.61.99/cgi-bin/htmldiff.cgi?doc=2B . But what I really need is an explanation of each line of code in my Question. Thanks.
 
LVL 12

Expert Comment

by:Jeff Darling
I started documenting the code, and then I saw your comment showing the link to the entire script.

If anyone else wants to take a stab at this, I would definitely recommend looking at http://216.92.61.99/htmldiffcgi.htm and http://216.92.61.99/cgi-bin/htmldiff.cgi?doc=2B
 

Author Comment

by:Randall-B
jeffld,
    I posted the script because ozo asked for the context of the code snippet. Although the script is rather long, I'm asking for an explanation of only a portion.  I understand the other parts OK and have already made a lot of modifications, but I'm trying to work out some bugs in the output formatting; so I wanted to understand a couple of sections in better detail.
    As for the sample page, I thought the collection of opinions found on this site would make a nice demo filler text, but now it is plain "lorem ipsum."
    You are welcome to continue assisting with my question if you would like.  Thanks.
 
LVL 84

Expert Comment

by:ozo
Do the perldoc pages not adequately describe what the lines do?
If there are specific things you don't understand, we may be better able to help you if you can ask specific questions.
 

Author Comment

by:Randall-B
ozo,
   Yes, the excellent documentation at perldoc.perl.org and my Perl books are helpful for functions like chomp, push, etc., but I'm trying to understand exactly what is going on in algorithms like this:

while(<DIFF>){
  if(/^d(\d+)(\s+(\d+))?$/){
    $iiii= defined $2 ? $2 : $1;
    for($ii=$iii;$ii<$1;$ii++){PrintLine('Equal',$NewLines[$ii])}
    for($ii=$1;$ii<=$iiii;$ii++){PrintLine('Old',$NewLines[$ii])}
    $iii=$iiii+1;
  }

and the other two similar loops below that.  

I'm trying to track down what is causing some bugs in the output. I would be grateful for a plain English explanation of each line in those portions of the code.  Thanks.
 
LVL 84

Assisted Solution

by:ozo
ozo earned 80 total points
That does seem strangely written
Would it be any clearer as
PrintLine('Equal',$_) for @NewLines[$iii..$1-1];
PrintLine('Old',$_) for @NewLines[$1..$iiii];
where $1 is what was matched in the first pair of parentheses: (\d+)
and $iiii is what was matched in the second pair of parentheses: (\s+(\d+)) if the second pair of parentheses matched, otherwise what was matched in the first pair of parentheses.
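
To make the capture groups and the slice form concrete, here is a minimal sketch. The command line "d3 5", the contents of @NewLines, the starting value of $iii, and the stand-in PrintLine are all made up for illustration; the real PrintLine is not shown in the question. The sketch uses $3 (digits only) instead of $2; the original's $2 includes the leading whitespace, which Perl ignores when the value is used as a number.

```perl
use strict;
use warnings;

# Hypothetical data: index 0 is a placeholder so indices match line numbers.
my @NewLines = ('', 'one', 'two', 'three', 'four', 'five', 'six');
my $iii = 1;    # assumed: first line that has not been printed yet
my @out;

# Stand-in for the script's PrintLine: just records the tag and the line.
sub PrintLine { my ($tag, $line) = @_; push @out, "$tag:$line" }

$_ = "d3 5";    # example command: delete lines 3 through 5
if (/^d(\d+)(\s+(\d+))?$/) {
    # $1 = "3", $2 = " 5" (whitespace included), $3 = "5"
    my ($first, $last) = ($1, defined $3 ? $3 : $1);
    PrintLine('Equal', $_) for @NewLines[$iii .. $first - 1];
    PrintLine('Old',   $_) for @NewLines[$first .. $last];
}
print "$_\n" for @out;
```

Lines 1 and 2 come out tagged Equal and lines 3 through 5 tagged Old, which is what the two for loops in the original do one index at a time.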
 

Author Comment

by:Randall-B
Yes, that's a little clearer.  

And what does this do, near the first line of the snippet:

   (@OldLines,@NewLines,$ii,$iii,$iiii,$is);

And this:

elsif(/^c(\d+)(\s+(\d+))?$/){
    $iiii= defined $2 ? $2 : $1;
    for($ii=$iii;$ii<$1;$ii++){PrintLine('Equal',$NewLines[$ii])}
    for($ii=$1;$ii<=$iiii;$ii++){PrintLine('Old',$NewLines[$ii])}
    while(defined($is=<DIFF>)){
      chomp;
      if($is=~/^\.$/){last}
      else{PrintLine('New',$is)}
    }
    $iii=$iiii+1;

and the one below that?

By the way, if anybody wonders, the original script (which I have modified a lot) came from CPAN at
http://www.cpan.org/authors/id/B/BW/BWEILER/HtmlDiff-2.1
   Thanks.
 
LVL 84

Expert Comment

by:ozo
(@OldLines,@NewLines,$ii,$iii,$iiii,$is);
by itself does not seem to serve a useful purpose.
if it was
my(@OldLines,@NewLines,$ii,$iii,$iiii,$is);
it would be declaring those as lexical (my) variables.
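
A short sketch of the difference, assuming the script were run under 'use strict' (the original does not appear to use it). The variable $undeclared is hypothetical and exists only to trigger the error.

```perl
use strict;
use warnings;

# With 'my', the list really does declare the variables (as lexicals).
my (@OldLines, @NewLines, $ii, $iii, $iiii, $is);

push @OldLines, '';    # fine: @OldLines is declared
$iii = 0;              # fine: $iii is declared

# Without a declaration, 'use strict' rejects the variable at compile time.
# The string eval lets this sketch catch the error instead of dying.
my $ok = eval q{ use strict; $undeclared = 1; 1 };
print "strict complained: $@" unless $ok;
```

Without 'use strict', the bare list in the original is simply evaluated and thrown away, and the variables spring into existence as package globals on first use.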
 

Author Comment

by:Randall-B
Now you see why I've asked for help understanding it.  Although it's from CPAN, the coding seems a bit unusual.

Please explain this one, too:

elsif(/^c(\d+)(\s+(\d+))?$/){
    $iiii= defined $2 ? $2 : $1;
    for($ii=$iii;$ii<$1;$ii++){PrintLine('Equal',$NewLines[$ii])}
    for($ii=$1;$ii<=$iiii;$ii++){PrintLine('Old',$NewLines[$ii])}
    while(defined($is=<DIFF>)){
      chomp;
      if($is=~/^\.$/){last}
      else{PrintLine('New',$is)}
    }
    $iii=$iiii+1;

and the one below that. Thanks.

 
LVL 84

Expert Comment

by:ozo
The elsif seems to do the same thing for lines starting with c,
then calls PrintLine('New', ...) for lines read from <DIFF>, stopping either when the end of file is reached or when it sees a line containing only a single .
 

Author Comment

by:Randall-B
In the context of the purpose of the script, do you see why it would be looking for a line beginning with " c "?

And when finished here, please consider commenting in detail on http:Q_22495367.html for another 500 points. Thanks.
 
LVL 84

Assisted Solution

by:ozo
ozo earned 80 total points
probably because it is parsing the output of diff, which indicates added, changed and deleted lines
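
As a rough illustration of what the loop is reading, here is a sketch that classifies commands in the style of 'diff -f' output. The sample lines in @diff are hand-written for this example, not captured from the poster's files, and the names %name and @summary are made up. A 'd' command stands alone, while 'a' and 'c' are followed by text lines that end with a line containing only a '.'.

```perl
use strict;
use warnings;

# Sample output in the style of 'diff -f' (hand-written for illustration).
my @diff = (
    "d1 2\n",
    "c4\n",
    "replacement text\n",
    ".\n",
    "a11\n",
    "first added line\n",
    "second added line\n",
    ".\n",
);

my %name = (a => 'add after', c => 'change', d => 'delete');
my @summary;

while (defined(my $line = shift @diff)) {
    next unless $line =~ /^([acd])(\d+)(?:\s+(\d+))?$/;
    my ($cmd, $from, $to) = ($1, $2, defined $3 ? $3 : $2);
    my $added = 0;
    if ($cmd ne 'd') {
        # 'a' and 'c' are followed by text lines, ended by a lone '.'
        while (defined(my $text = shift @diff)) {
            last if $text =~ /^\.$/;
            $added++;
        }
    }
    push @summary, "$name{$cmd} $from-$to (+$added)";
}
print "$_\n" for @summary;
```

The three branches in the original script (d, c, a) correspond to these three command letters.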
 
LVL 7

Accepted Solution

by:
mzalfres earned 420 total points
Randall,

OK, so here is the second part:

# assign empty string
$DiffArgs=''; # set to -w to ignore white spaces

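# this bare list does nothing by itself; it was probably meant as a 'declaration', which would need
# 'my' in front of it (and declarations are only required if you have 'use strict')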
(@OldLines,@NewLines,$ii,$iii,$iiii,$is);
# open a 'pipe' from a UNIX process: the DIFF filehandle reads the command's output like a file.
# in this case, DIFF yields the output of: diff -f $DiffArgs $OldFile $NewFile
open(DIFF,"diff -f $DiffArgs $OldFile $NewFile |");
# open $OldFile
open(OLD,$OldFile);  
# push appends an element to the array. Here it puts an empty string at index 0 of each array,
# most likely so that the first line of the file lands at index 1 and matches diff's line numbers
push(@OldLines,'');
push(@NewLines,'');

# for each line in file $OldFile...
while(<OLD>){
# cut ending newline if any
  chomp;
# append the chomped line to both the @OldLines and @NewLines arrays.
  push(@OldLines,$_);
  push(@NewLines,$_);
# this line issues a warning when the line has a '<' with no '>' anywhere after it
# (a tag left open at the end of the line), or a '>' with no '<' anywhere before it
# (a tag closed without being opened on this line), i.e. when a tag spans a line break.
  warn"unclosed or unopened tag detected, malfunction warning: $_" if(/<[^>]*$|^[^<]*>/);
}
# close OLD file handle ($OldFile)
close(OLD);
# assign zero to $iii - please, never name variables like this !!!
$iii=0;
# assign the value of your HOME environment variable followed by '/.HtmlDiff' to $TmpFile
# the hash %ENV keeps all your environment variables
$TmpFile="$ENV{HOME}/.HtmlDiff";
# nothing, kind of 'declaration'
$FH;
# here $FH becomes a 'reference' to standard output file handle (however not used in this part of
# your code)
$FH=\*STDOUT;

# for each line of diff command output...
while(<DIFF>){
# this matches lines starting with 'd', then having 1 or more digits, followed (optionally) by some
# whitespace and more digits (in general - a deletion line from diff)
  if(/^d(\d+)(\s+(\d+))?$/){
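# this one is crazy :) $2 is the optional group (whitespace plus the second number); if it matched, assign it
# (the leading whitespace is ignored when it is used as a number), otherwise assign the first number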
    $iiii= defined $2 ? $2 : $1;
# Now for all lines of the @NewLines array, starting at index $iii and
# stopping just before the first number matched on the diff 'd' line, call
# PrintLine (it probably just displays the line without modification, as these are the lines that are
# common to both files being compared)
    for($ii=$iii;$ii<$1;$ii++){PrintLine('Equal',$NewLines[$ii])}
# Here print all lines from the first matched number up to and including $iiii as 'Old' (deleted)
    for($ii=$1;$ii<=$iiii;$ii++){PrintLine('Old',$NewLines[$ii])}
# then assign $iiii+1 to $iii
    $iii=$iiii+1;
  }
# now do similar operations for 'c' lines (changed) - match the 'c' line first
  elsif(/^c(\d+)(\s+(\d+))?$/){
# then assign first or second number depending if second exists or not
    $iiii= defined $2 ? $2 : $1;
# and print output - equal lines here
    for($ii=$iii;$ii<$1;$ii++){PrintLine('Equal',$NewLines[$ii])}
# then old lines here
    for($ii=$1;$ii<=$iiii;$ii++){PrintLine('Old',$NewLines[$ii])}
# for lines read from 'diff' command...
    while(defined($is=<DIFF>)){
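# intended to cut the newline, but chomp without an argument works on $_ (the 'c' line),
# not on $is, so the line just read into $is keeps its newline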
      chomp;
# if the line contains only a '.' character, stop reading new lines (jump out of the inner while loop)
      if($is=~/^\.$/){last}
# otherwise print line as "new"
      else{PrintLine('New',$is)}
    }
# assign the value $iiii+1 to the $iii variable...
    $iii=$iiii+1;
  }
# here we match 'a' lines from diff output
  elsif(/^a(\d+)$/){
# then assign number which comes after 'a' character in diff to the $iiii variable
    $iiii=$1;
# print all lines from $iii up to and including this number as 'Equal'
    for($ii=$iii;$ii<=$1;$ii++){PrintLine('Equal',$NewLines[$ii])}
# and again print all lines from diff as 'new' until you find line with single '.' only
    while(defined($is=<DIFF>)){
      if($is=~/^\.$/){last}
      else{PrintLine('New',$is)}
    }
# assign the number matched earlier on the 'a' line, incremented by one, to the $iii variable
# (inconsistent with the previous parts of the code - $iiii could have been used here too,
# but in general - the naming of variables here is terrible!!!)
    $iii=$1+1;
  }
}
# print all remaining lines of the @NewLines array, from index $iii to the end,
# as 'equal'
for($ii=$iii;$ii<=$#NewLines;$ii++){PrintLine('Equal',$NewLines[$ii])}
# remove $OldFile and $NewFile from disk
unlink($OldFile,$NewFile);
# print $output variable (seems to be defined somewhere else, otherwise does nothing)
print $output;
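
One detail in the 'c' branch above may be worth checking when tracking down output bugs: chomp is called without an argument, so it acts on $_ rather than on $is. A minimal sketch with made-up values ($after_plain and $after_explicit are names invented here); whether the leftover newline actually matters depends on PrintLine, which is not shown in the question.

```perl
use strict;
use warnings;

$_ = "c4\n";              # the command line already held in $_
my $is = "new text\n";    # a line read into $is, as in the inner loop

chomp;                    # no argument: strips the newline from $_ only
my $after_plain = $is;    # $is still ends with "\n"

chomp($is);               # explicit argument: strips the newline from $is
my $after_explicit = $is;

printf "plain chomp left newline: %s\n", ($after_plain    =~ /\n\z/ ? 'yes' : 'no');
printf "chomp(\$is) left newline: %s\n", ($after_explicit =~ /\n\z/ ? 'yes' : 'no');
```

Note that the 'a' branch has no chomp at all, so new lines from both branches reach PrintLine with their trailing newline.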
 
LVL 7

Assisted Solution

by:mzalfres
mzalfres earned 420 total points
See my suggestions from the other part, but in general - give the variables more meaningful names and put the
repeating parts of the code into subroutines. That would make this code much clearer and easier to
develop further.

Regards,

Marek ZJ.
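
As one possible reading of that advice, here is a sketch of the same walk with descriptive names and the repeated loops pulled into helpers. It is not the CPAN author's code: walk_forward_diff, $emit, $emit_range, $emit_new, and the sample data are all hypothetical, and $emit stands in for PrintLine. Unlike the original, it starts at line 1 instead of index 0 and chomps each new line explicitly.

```perl
use strict;
use warnings;

# Sketch only: apply 'diff -f' style commands to the old file's lines,
# calling $emit->($tag, $text) in place of the script's PrintLine.
sub walk_forward_diff {
    my ($old_lines, $diff_lines, $emit) = @_;    # two array refs, one code ref
    my @old  = ('', @$old_lines);                # index 0 unused: 1-based lines
    my @diff = @$diff_lines;
    my $next = 1;                                # next old line not yet emitted

    # Emit old lines $from..$to with the given tag and advance $next.
    my $emit_range = sub {
        my ($tag, $from, $to) = @_;
        $emit->($tag, $old[$_]) for $from .. $to;
        $next = $to + 1;
    };
    # Emit the text that follows an 'a' or 'c' command, up to a lone '.'.
    my $emit_new = sub {
        while (defined(my $text = shift @diff)) {
            chomp $text;
            last if $text eq '.';
            $emit->('New', $text);
        }
    };

    while (defined(my $cmd = shift @diff)) {
        next unless $cmd =~ /^([acd])(\d+)(?:\s+(\d+))?$/;
        my ($op, $first) = ($1, $2 + 0);
        my $last = defined $3 ? $3 + 0 : $first;
        if ($op eq 'a') {
            $emit_range->('Equal', $next, $first);
            $emit_new->();
        }
        else {
            $emit_range->('Equal', $next, $first - 1);
            $emit_range->('Old', $first, $last);
            $emit_new->() if $op eq 'c';
        }
    }
    $emit_range->('Equal', $next, $#old);        # whatever is left is unchanged
}

my @result;
walk_forward_diff(
    [ 'alpha', 'beta', 'gamma', 'delta' ],
    [ "d1\n", "c3\n", "GAMMA\n", ".\n", "a4\n", "epsilon\n", ".\n" ],
    sub { push @result, "$_[0]:$_[1]" },
);
print "$_\n" for @result;
```

Passing the output routine in as a code reference keeps the walk separate from the HTML formatting, which may make it easier to test each half on its own.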


 

Author Comment

by:Randall-B
mzalfres,
    Excellent. Once again, that is exactly what I was looking for.  I may split some points here because ozo also provided some good info, but please leave a note at http://Q_22495548.html to receive those points also. Thanks.
 

Author Comment

by:Randall-B
mzalfres,
   Correction: the other link is: http:Q_22495548.html where you can collect the rest of the points. Thanks.
 

Author Comment

by:Randall-B
OK. Thanks.
 
LVL 12

Expert Comment

by:Jeff Darling
Randall-B.

I wanted to thank you for posting this code.  I agree that the code is a bit obscure the way it is, but I enjoyed looking into it anyway.  I'm glad the other experts were able to assist.  I had a few things come up at work that prevented me from finishing what I started.  I didn't want to post incomplete work.  

The one thing that I did notice, is that they are using the ed script option of diff.  I'm not really familiar with ed scripts so that is probably something that you might want to investigate.  I also noticed that none of the experts commented on that.

Maybe the version of ed on your machine isn't producing the output in the format that this script is expecting?  




 

Author Comment

by:Randall-B
jeffld,
   That's a good question. I'll have to look into that.  Thanks for your comment.
 
LVL 12

Expert Comment

by:Jeff Darling
CORRECTION to my last sentence:

Maybe the version of diff on your machine isn't producing the simulated ed output in the format that this script is expecting?  

I had incorrectly implied that ed was used in the script.  Actually what is happening is that, because of the -f parameter, diff is producing output in the style of an ed script, but I don't see anything indicating that ed is actually used.