Solved

How to open a file from HTML using Perl/CGI?

Posted on 2001-07-02
338 Views
Last Modified: 2008-02-01
How can I open a file from an HTML page by calling a perl script?
For example, a web page with a button: clicking the button should trigger a Perl script that opens the text file.

the text file looks like this

File - words.txt

hello
world
text
new
a
of
the
etc..
Question by:sdesar
9 Comments
 
LVL 1

Accepted Solution

by:
mehdi earned 10 total points
ID: 6245676
-HTML

<form action="myscript.pl" method="post">

<input type="submit">

</form>

-PERL

#!/usr/local/bin/perl -w
# point to where YOUR perl is installed


# point to where your file is.. this example is a uni* path
$path = '/home/myname/myfile.txt';

# the all important header.. print it before anything else
print "Content-type: text/html\n\n";

# open the file, failing loudly instead of silently
open (FH, $path) or die "Cannot open $path: $!";

# loop through file, print each line
while ($line=<FH>) {
  print $line;
}

# close the file
close (FH);
exit(0);
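
(Note: the form's action URL must point at wherever your web server actually executes CGI scripts, typically a cgi-bin directory, and $path must name a file, not a directory.)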

~~~~~~~~~~~~~~~~~~~~~~~~~~~
The secret lemonade drinker
 

Author Comment

by:sdesar
ID: 6246272
I am getting a 500 Internal server error.

Here's the code-
#!/usr/local/bin/perl -w  
# point to where YOUR perl is installed


# point to where your file is.. this example is a uni* path
$path = 'stop_words.txt';

#open the file
open (FH, "$path");

# the all important header..
print "Content-type: text/html\n\n";

#loop through file, print each line
while ($line=<FH>) {
 print $line;
}

#close the file
close (FH);
exit(1);

(and again with an absolute path:)
#!/usr/local/bin/perl -w  
# point to where YOUR perl is installed


# point to where your file is.. this example is a uni* path
$path = '/home/textseem/www/cgi-bin/stop_words.txt';

#open the file
open (FH, "$path");

# the all important header..
print "Content-type: text/html\n\n";

#loop through file, print each line
while ($line=<FH>) {
 print $line;
}

#close the file
close (FH);
exit(1);


Here's the website: http://208.56.56.72/test2/hello.html

Thanks for your help!
Awaiting a response soon.
 
LVL 1

Expert Comment

by:elminster
ID: 6246809
Check that the Perl file is readable and executable for all.
( chmod 755 <perl_file> )

It's the first thing I check when I see the '500 Internal' error.
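
For illustration only, here is a throwaway CGI script (paths assumed from this thread) that makes the same check visible in the browser using Perl's -r and -x file tests:

#!/usr/local/bin/perl -w
# assumed paths; adjust to your own script and words file
my $script = '/home/textseem/www/cgi-bin/myscript.pl';
my $file   = '/home/textseem/www/cgi-bin/stop_words.txt';

print "Content-type: text/plain\n\n";

# -r and -x test readability/executability as the web server user,
# which is what the '500 Internal' error depends on
print "$script readable=", ((-r $script) ? "yes" : "no"),
      " executable=", ((-x $script) ? "yes" : "no"), "\n";
print "$file readable=", ((-r $file) ? "yes" : "no"), "\n";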
 
LVL 1

Expert Comment

by:mehdi
ID: 6247778
That error could mean anything. Have a look at your error log (find out where this is from your ISP) and find the exact cause of the problem.

What platform is this on? My example assumed Uni*.

But first, try adding the following line just *beneath* the line #!/usr/local/bin/perl -w

The line is:

use CGI::Carp qw(fatalsToBrowser);

This should give more detail about the error. Like our friend says, check the text file permissions, and check that the script has execute permissions.
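
Putting it together, a minimal sketch with the Carp line in place and an explicit check on the open (the path is assumed from your earlier post):

#!/usr/local/bin/perl -w
# report fatal runtime errors to the browser instead of a bare 500 page
use CGI::Carp qw(fatalsToBrowser);

# the all important header, printed before anything can fail
print "Content-type: text/html\n\n";

# assumed path; adjust to wherever the file actually lives
my $path = '/home/textseem/www/cgi-bin/stop_words.txt';

# fail loudly if the file cannot be opened
open(FH, $path) or die "Cannot open $path: $!";
print while <FH>;
close(FH);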

~~~~~~~~~~~~~~~~~~~~~~~~~~~
The secret lemonade drinker

 

Author Comment

by:sdesar
ID: 6249595
I changed permissions and also added the above suggestion!
I don't get an error.

But I still see no data when the submit button is hit!

What could be the reason for this behavior?

Thanks!
 

Author Comment

by:sdesar
ID: 6257799
Awaiting a response.. please help!
 
LVL 1

Expert Comment

by:elminster
ID: 6258953
I've tried the above on my linux box and it works OK.

However, if the 'words' file is not readable, nothing is displayed - not even an error, so check that the permissions on the 'words' file are correct.
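
As a quick guard (filename assumed from your script), Perl's file tests can turn that silent case into a visible error before the open:

# fail loudly instead of printing nothing
my $path = 'stop_words.txt';   # same name and location as the real file
-e $path or die "No such file: $path";
-r $path or die "File exists but is not readable: $path";
open(FH, $path) or die "Cannot open $path: $!";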

Also - I've noticed that in your first post you called the file 'words.txt', but in your script you call 'stop_words.txt'. Check that you're using the right filename (and path) in the script.

What OS are you running on?
 

Author Comment

by:sdesar
ID: 6302955
I am very sorry for the late response... I never got an email notification, and I forgot to check back.

Thanks a million! It works now!

However, I cannot understand why it's not working when I implement this on my site, which actually opens stop_words.txt, deletes all those words, and then does its analysis of the best words, etc. To see what I mean, please check out:
http://208.56.56.72/test2/index.html

Also, is there a way to find out what versions of the modules are installed on this server, i.e., what version of Lingua::EN::Fathom?
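
(One common way to check, sketched here with everything but the module name assumed: upload a throwaway CGI script that prints the module's $VERSION and request it in the browser.)

#!/usr/local/bin/perl -w
use CGI::Carp qw(fatalsToBrowser);
print "Content-type: text/plain\n\n";

# load the module at runtime and report its version, or the failure
eval { require Lingua::EN::Fathom };
if ($@) {
      print "Lingua::EN::Fathom not installed: $@";
}
else {
      print "Lingua::EN::Fathom version: $Lingua::EN::Fathom::VERSION\n";
}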

I am using Windows NT to upload my site to the servers at ehost4u.com... I am guessing they use UNIX, but I don't know.

The code that does the chopping of the stop_words is in
PageParser.pm

Here's the code:

package PageParser;

use strict;
use FindBin qw($Bin);

# Simple extension of Treebuilder to parse through a file, and add
# in the capability to skip through the text

use Lingua::EN::Fathom;

use HTML::TreeBuilder;
use HTML::Element;
use HTML::FormatText;
use URI::URL;

use Vector;
use Lingua::Stem;


my %defaults = (
      
      HTML => undef,      # The HTML as a text string
      KEYWORDS => undef,      # Keywords as a hash, count of keys in analyze
      URL => undef,      # The path to wherever
);

sub new {
    my $class = shift;

    my %extra;
    my $self = {};
    bless($self, $class);

    @$self{keys %defaults} = values %defaults ;
    if(@_) {
        my %stuff = @_;
        @$self{keys %stuff} = values %stuff;
    }
    $self->init();

    return $self;
}


sub init {
    my $self = shift;

    if($self->{HTML}) {
      $self->{TEXT} = $self->as_Text();
      $self->txtAnalyze();
      $self->{VECTOR} = Vector->new( $self->getStemParagraphs() );
    }
}

sub as_Text {
      my $self = shift;
      return($self->dumpText2());
}

# Use lynx to get it into a more reasonable ascii format
sub dumpText2 {
      my $self = shift;
      my $ret = "";
      my $url = $self->{URL};

      my @text = split /\r?\n/, `lynx -dump -width=4000 $url`;
      
      # Pass 1 -- clear out blank lines, [image] lines, and
      #  meaningless junk
      my @text1 = map {
            my $line = $_;
            # remove long meaningless --- and ___ strings
            $line =~ s/[!@#$%^&*()\-_+=][!@#$%^&*()\-_+=]+//;
            if( $line =~ m/^\s*\[.*\]\s*$/ ) {
                  $line = undef;
            }
            $line
      } @text;

      # Pass 2 -- Look for short lines that are probably header lines to
      #  paragraphs.  Combine those into the next paragraph.  Look for
      #  long lines that may have been broken up.  Should have a 7 space
      #  indent on them.  Join those lines together.

      my $line;
      while (@text1) {
            $line = shift(@text1);
            if($line =~ m/^\s*$/) {
                  next;
            }
            if( $line =~ m/^\s*References\s*$/) {
                  $self->references(\@text1);
                  last;
            }
            
            # Arbitrarily assume that a header line is less than 50 chars
            if(length($line) < 50 ) {
                  my $line2 = shift(@text1);
                  if($line2) {
                        if(length($line2) > 50)  {
                              $line .= "\n$line2";
                        }
                        else {
                              unshift @text1, $line2;
                        }
                  }
                  else {
                        unshift @text1, $line2;
                  }
            }
            while(length($line) > 60) {
                  my $line2 = shift(@text1);
                  if(!$line2) {
                        last;
                  }
                  if($line2 =~ m/^       \w+/) {
                        $line .= $line2;
                  }
                  else {
                        unshift @text1, $line2;
                        last;
                  }
            }

            # trim off useless space
            $line =~ s/^\s*(.*)\s*$/$1/;

            $ret .= "<P>\n$line\n</P>\n";
      }
      return $ret;
}
# This get the lynx reference set
# found at the end of the lynx output
sub references {
      my $self = shift;
      my $text = shift;

      my %ref = ();

      for( @$text) {
            if( m@(\d+).\s*([^\s#]+)\s*@ ) {
                  my $num = $1;
                  my $rf = $2;
                  $rf =~ s@/$@@;
                  if($rf !~ m/mailto/i) {
                        $ref{$num} = $rf;
                  }
            }
      }
      $self->{REFERENCES} = \%ref;

      # Now go and get the interesting references from the html
      my $html = $self->{HTML};
      my @str = ();
      my %hstr = ();

      $html =~ s{<A\s.*?HREF="?([^\s">]*)[^>]*?>(.*?)</a>}{ push @str, ($1, $2); "$&" }iges;
      while(scalar(@str)) {
            my $href = shift @str;
            my $val = shift @str;
            $href = url($href, url($self->{URL}))->abs->as_string;
            $href =~ s@/$@@;
            $val =~ s/<[^>]*>//gs;
            $val =~ s/^\s*(.*)\s*$/$1/;
            $hstr{$href} = $val if $val ;
      }
      $self->{HTMLREFERENCES} = \%hstr;
}

# Pass through the paragraph, reconstructing links that were in the original
# HTML except now they go through the Javascript stuff.
sub hideLinks {
      my $self = shift;
      my $par = shift;
      my $str = $self->{HTMLREFERENCES};
      my $rf = $self->{REFERENCES};

      return $par unless $rf;
      return $par unless $str;

      my @num = ( $par =~ m/\[(\d+)\]/gs );

      for (@num) {
            if($rf->{$_}) {
                  my $re = "\\[$_\\]\\s*".$str->{$rf->{$_}}.'\s*';
                  unless ($par =~ s{$re}{\[$_\]}) {
                        $par =~ s/\[$_\]//;
                  }
            }
      }
      return $par;
}

# Recover the links that lynx inconveniently put at the end of the file
# and put them in their rightful place in the document.
sub restoreLinks {
      my $self = shift;
      my $par = shift;
      my $str = $self->{HTMLREFERENCES};
      my $rf = $self->{REFERENCES};

      return $par unless $rf;
      return $par unless $str;

      my @num = ( $par =~ m/\[(\d+)\]/gs );

      for (@num) {
            if($rf->{$_}) {
                  my $re = "\\[$_\\]\\s*";
                  $par =~ s{$re}["<A HREF=\"javascript:linkto('".$rf->{$_}."')\">".$str->{$rf->{$_}}."</A>"]gei;
            }
      }
      return $par;
}


############################################################
# The following methods are adaptations from url_fathom,
# or my interpretation of what should happen.

# txt_analyze --
#  1. create a text version of the html
#  2. strip it of "stop words", and send it to fathomAnalyze
#  3. create a list of the words on a total basis
#  4. take the top N of these words as Keywords
#  5. create a html document which analyzes each paragraph
#     highlights each keyword as a link to the previous keyword
#     for each paragraph that contains 1 or more keywords, print a list of
#     the keyword frequencies, and a skip anchor which sends it from
#     one paragraph to the next.

# Strip out "stop_words" (common prepositions, etc.)

sub txtAnalyze {
      my $self = shift;
      my $text = $self->{TEXT};

      my $stopwords = $self->{STOPWORDS};
      unless( $stopwords ){
            $stopwords = $self->stop_words();
            $self->{STOPWORDS} = $stopwords;
      }

      # remove all the stopwords
      my $re =  "(\\b" . join ("\\b|\\b", @$stopwords) . "\\b)";
      my $txt = $text;
      $txt =~ s/$re//gis;
      $self->{REDUCEDTEXT} = $txt;
      
      # Fathom analyze to get keywords
      my $keywords = $self->fathomAnalyze($txt);

      # nip the rest down to their stems ...
      my $stemmer = Lingua::Stem->new( -locale => 'EN');
      $stemmer->stem_caching({-level => 1});
      my $stems = $stemmer->stem(@$keywords);
      my %stemhash = ();
      for (0..$#$keywords) {
            $stemhash{$keywords->[$_]} = $stems->[$_];
      }
      $self->{STEMS} = \%stemhash;

      # Now, do a re-count based on stemmed words
      my $fathom = $self->{FATHOM};
      my %uniq_words = $fathom->unique_words;
      my %keycount;
      for (keys %uniq_words) {
            $keycount{$stemhash{$_}} += $uniq_words{$_};
      }
      $self->{STEMCOUNT} = \%keycount;

      # Now, get the top 10 keywords

      ($self->{STEMKEYWORDS}, $self->{KEYWORDS}) = $self->getStemKeywords(10);
}

# dump out the parsed text, putting in some html tags to make the
# display more useful.
sub dumpParseText {
      my $self = shift;
      my $keywords = $self->{KEYWORDS};
      my $ret = "";
      my $vector = $self->{VECTOR};

      # Create the HTML document
      my $re =  "(\\b" . join("\\b|\\b", @$keywords ) . "\\b)";
      my %kws = ();
      map { $kws{$_} = 0; } @$keywords;

      $ret = $self->dumpTopKeywords();
      my $txt = $self->{TEXT};
      my $parno = 0;      # paragraph number
      my $lastparno = 0;
      my $curpar = 0;
      my @klist = ();
      # Scan by paragraph to set the links
      $ret .= "<TABLE cellpadding=0 cellspacing=0>\n";
      while( $txt =~ m@<P>\n(.*?)\n</P>@gis ) {
            my $par = $self->hideLinks($1);
            my %kw = ();
#            if( $par =~ s{$re}[ $kw{lc($&)} = 1; "<A HREF=#" . lc($&) . $kws{lc($&)}++ . " NAME=" . lc($&) . $kws{lc($&)} . " >$&</A>" ]gei ) {
            $ret .= "<TR class=Bg1><TD><A NAME=p$curpar HREF=\"javascript:pswitch($curpar)\">$curpar.</A>&nbsp;&nbsp;";
            if( $par =~ s{$re}[$kw{lc($&)} = 1; "<B>$&</B>"]gei ) {
                  # Wow, there are keywords in this paragraph!
                  push @klist, ("\"par$curpar\"");
                  $parno += 1;
                  $ret .= "<A NAME=par$curpar>KEYWORDS: " . join(" &nbsp;", keys %kw) . "</A>";
            }
            $ret .= "</TD></TR>\n";
            $par = $self->restoreLinks($par);

            $par =~ s/\n/<br>\n/gis;

            $ret .= "<TR class=Bg2><TD>" . $par . "</TD></TR>\n";
            $curpar += 1;
      }
      $ret .= "</TABLE>\n";

      if($self->{MXDISPLAY}) {
            $ret .= "<P>" . $vector->mxdisplay();
      }
      if($self->{VFDISPLAY}) {
            $ret .= "<P>" . $vector->vfdisplay();
      }

my $js = <<"EOT";
<STYLE>
BODY
{
   BACKGROUND-COLOR: white;
   FONT-FAMILY: Verdana, Arial, Helvetica;
   FONT-SIZE: 8pt;
   MARGIN: 4px
}
TD
{
   FONT-FAMILY: Verdana, Arial, Helvetica;
   FONT-SIZE: 8pt
}
.Bg1
{
   BACKGROUND-COLOR: #dddddd
}
.Bg2
{
   BACKGROUND-COLOR: #eeeecc;
   FONT-SIZE: 10pt;
}
.Bg3
{
   BACKGROUND-COLOR: #dddddd;
   FONT-SIZE: 8pt;
}
</STYLE>
<script language=javascript>
EOT
$js .= "var klist = new Array(" . join( ",", (@klist)) . ");\n";
$js .= <<"EOT";
parent.menu.setParNo(klist.length);
function kvalue(n) {
      return klist[n];
}
EOT
$js .= $vector->mxscript();
$js .= <<"EOT";
function pswitch(par) {
   // alert("pswitch -- "+par);
   parent.menu.setParLink(par, plist[par].length);
}
function setLocation(par, n) {
      // alert("setLocation -- "+par+" "+n);
      // alert("Hash set " + plist[par][n]);
      parent.menu.setParValue(vlist[par][n], plist[par][n]);
      // document.hash.location = '#' + plist[par][n];
      // alert("Hash Location -- "+document.hash.location);
}
function linkto(url) {
      parent.nav.newurl(url);
}

</script>
EOT
return ($js , $ret);
}
      
# Read in the stopwords.
sub stop_words {
      my $self = shift;
      # fail loudly here: a silent open failure shows up as an empty page
      open KW, 'stop_words.txt' or die "Cannot open stop_words.txt: $!";
      my @kw = map { chomp; $_ } <KW>;
      close KW;

      return \@kw;
}

########## CREATE A BEST WORDS LIST

sub fathomAnalyze {
      my $self = shift;
      my $txt = shift;
      my $fathom = $self->{FATHOM};

      if(!$fathom) {
            $fathom = new Lingua::EN::Fathom;
            $self->{FATHOM} = $fathom;
            $txt =~ s@</?P>@@gis;
            $fathom->analyse_block($txt);
      }
      my %uniq_words = $fathom->unique_words();

      my @kws = keys %uniq_words;
      return \@kws;
}

# dump the top 10 keywords
sub dumpTopKeywords {
      my $self = shift;
      return $self->dumpKeywords('num', 10);
}

sub dumpKeywords {
      my $self = shift;
      my $dir = shift;   # sort by either alpha, or num
      $dir = "alpha" unless $dir;
      my $len = shift;
      $len = 0 unless $len;

      my %uniq_words = %{$self->{STEMCOUNT}};
      my $word;
      my $ret;

      my @list = sort keys %uniq_words;
      if($dir eq 'num') {
            @list = sort { $uniq_words{$b} <=> $uniq_words{$a} }  keys %uniq_words;
      }

      if($len) {
            splice @list, $len;
      }

      $ret = "<TABLE>\n";
      foreach $word ( @list )
      {
            $ret .= "<TR><TD ALIGN=right>" . $uniq_words{$word}. "</TD><TD>$word</TD></TR>\n"; # outputs the word and frequency.
      ##                  print OUT ("$word\n"); # prints just the words

      }
      $ret .= "</TABLE>\n";
      return $ret;
}

# Get the top n Stem Keywords.  Also generate the equivalent array
# of real keywords (which will have more than n keys, and display unstemmed)
sub getStemKeywords {
      my $self = shift;
      my $len = shift;
      my $stems = $self->{STEMS};
      my $stemcount = $self->{STEMCOUNT};
      my @list = sort { $stemcount->{$b} <=> $stemcount->{$a} }  keys %$stemcount;
      splice @list, $len;

      # now find all the words in the other list
      my @klist = ();
      for (keys %$stems) {
            my $w = $_;
            for (@list) {
                  if($stems->{$w} eq $_) {
                        push @klist, $w;
                        last;
                  }
            }
      }

      return( \@list, \@klist );
}

# replace words with stemmed words in the paragraphs.
sub getStemParagraphs {
      my $self = shift;
      my $text = $self->{REDUCEDTEXT};

      # Get list of multi-word stems
      my $stems = $self->{STEMS};
      my %seen = ();
      map {  $seen{$_} ++ } values %$stems;
      my @list = grep { $seen{$stems->{$_}} > 1} keys %$stems;

      # replace those words in the text
      my $re =  "(\\b" . join("\\b|\\b", (@list) ) . "\\b)";
      $text =~ s{$re}[ $stems->{lc($&)} ]gise;

      # break into paragraphs
      my @plist = ( $text =~ m@<P>\n(.*?)\n</P>\n@gis );

      # create a key list
      my %kseen = ();
      my @klist = grep { ! $kseen{$_} ++ } map { $seen{$stems->{$_}} > 1 ? $stems->{$_} : $_  } keys %$stems;

      return (\@plist, \@klist);
}


sub getKeywords {
      my $self = shift;
      my $fathom = $self->{FATHOM};
      my %uniq_words = $fathom->unique_words;
      # my @list = grep { $uniq_words{$_} > 1 } keys %uniq_words;
      my @list = keys %uniq_words;

      return \@list;
}

sub getParagraphs {
      my $self = shift;
      my $text = $self->{TEXT};
      my @list = ( $text =~ m@<P>\n(.*?)\n</P>\n@gis );

      return \@list;
}

1;




Thanks,
Awaiting a response soon...
 

Author Comment

by:sdesar
ID: 6303001
Thanks for all the help!

It all works now!
