parse input name-value pairs in html

Posted on 2003-02-22
Last Modified: 2008-02-01
Can someone show me how to extract ALL the name-value pairs from an HTML document, looping through each input, and write the results to a tab-delimited file, taking into account differences in whitespace and quoting (", ' or no quotes)?
For example:
<input type=hidden name="user" value="JohnGreen">
<input type=hidden name="state" value="NewYork">

<input type=hidden name= user value=JohnGreen>
<input type=hidden name= state value=NewYork>

<input type=hidden name='user' value='JohnGreen'>
<input type=hidden name='state' value='NewYork'>


user=JohnGreen state=NewYork
Question by:malkie
21 Comments
 
LVL 10

Expert Comment

by:rj2
ID: 7998703
#!/usr/bin/perl
use strict;
use HTML::TokeParser::Simple;
open(OUTFILE,">out.txt") || die("can not open file because $!");

my $p = HTML::TokeParser::Simple->new('valuepairs.html');
while ( my $token = $p->get_token ) {
     if ( $token->is_start_tag('input')) {
          my $attr=$token->return_attr();
          print OUTFILE $attr->{ name },"\t",$attr->{ value },"\n";
     }
}    
close(OUTFILE);
 
LVL 10

Expert Comment

by:rj2
ID: 7998873
Or you could use HTML::Parser

#!/usr/bin/perl
use strict;
use HTML::Parser();
open(OUTFILE,">out2.txt") || die("can not open file because $!");

sub start_handler {
     if($_[0] eq 'input') {
          print OUTFILE $_[1]{name},"\t",$_[1]{value},"\n";
     }
}    

my $p = HTML::Parser->new(api_version => 3);
$p->handler( start => \&start_handler, 'tag, attr' );
$p->parse_file('valuepairs.html') || die $!;
close(OUTFILE);
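
As a rough check of how the parser copes with the quoting variants from the question, here is a minimal sketch that parses an in-memory string instead of a file. The sample markup, the added <input type=reset> line, and the @rows array are assumptions made for illustration; they are not part of the scripts above. Missing name or value attributes are turned into empty strings so the output stays tab-delimited.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Sample markup based on the question, plus an <input> with no name or value.
my $html = <<'END_HTML';
<input type=hidden name="user" value="JohnGreen">
<input type=hidden name= state value=NewYork>
<input type=hidden name='user' value='JohnGreen'>
<input type=reset>
END_HTML

my @rows;    # one "name<TAB>value" string per <input> tag

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tagname, $attr) = @_;
            return unless $tagname eq 'input';
            # Missing attributes become empty strings instead of undef.
            my $name  = defined $attr->{name}  ? $attr->{name}  : '';
            my $value = defined $attr->{value} ? $attr->{value} : '';
            push @rows, "$name\t$value";
        },
        'tagname, attr',
    ],
);

$p->parse($html);
$p->eof;    # signal end of input so buffered text is flushed

print "$_\n" for @rows;
```

An input without a name or value still produces a row (just a tab), so a blank-looking line in the output file does not necessarily mean a tag was skipped.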
 

Author Comment

by:malkie
ID: 7998943
I tried.............(having installed)
use HTML::TokeParser::Simple;
and got...........

CGI Error
The specified CGI application misbehaved by not returning a complete set of HTTP headers. The headers it did return are:


Can't call method "get_token" on an undefined value at C:\Inetpub\wwwroot\cgi-bin\extracthtml\thecode.cgi line 7.

can you please help...(not very experienced)
 
LVL 10

Expert Comment

by:rj2
ID: 8002010
Have you installed the modules HTML::TokeParser::Simple, HTML::Parser and HTML::TokeParser?

Does the file valuepairs.html given as parameter to new exist?
 
LVL 10

Expert Comment

by:rj2
ID: 8002096
I tested this now, and you get this error if the file given as a parameter to new does not exist.
Try giving a full path to the file, as shown below:
my $p = HTML::TokeParser::Simple->new('/usr/home/malkie/valuepairs.html');
If the HTML you want to parse is located on another web server, you must download it first:

use LWP::Simple;
getstore('http://www.mysite.com','/usr/home/malkie/valuepairs.html');
 

Author Comment

by:malkie
ID: 8008015
#I have all the modules installed as far as I know. I have looked at the files and the HTML::Parser is version 3.25.
#I have tried numerous ways, I keep getting the errors:-
# I am using a microsoft personal web server.
#I tried with the valuepairs.html in the same folder as the script or a full path.
#Can't locate auto/HTML/Parser//Inetpub/wwwroot/valuepairs.html.al in @INC (@INC contains: C:/Perl/lib C:/Perl/site/lib .) at C:/Perl/site/lib/HTML/Parser.pm line 82
#Even if I type in a NON-EXISTENT FILE it comes up with:-
#Can't locate auto/HTML/Parser//Inetpub/wwwroot/nofile.html.al in @INC (@INC contains: C:/Perl/lib C:/Perl/site/lib .) at C:/Perl/site/lib/HTML/Parser.pm line 82
#Can't locate auto/HTML/Parser/valuepairs.html.al in @INC (@INC contains: C:/Perl/lib C:/Perl/site/lib .) at C:/Perl/site/lib/HTML/Parser.pm line 82
#I notice that there is a double forward slash in the error path " Parser//Inetpub " but not if I put in just the file valuepairs.htm
# I don't know why it is adding the .al on the end of the file

#!/usr/bin/perl
use strict;
use HTML::Parser();
open(OUTFILE,">out2.txt") || die("can not open file because $!");

sub start_handler {
    if($_[0] eq 'input') {
         print OUTFILE $_[1]{name},"\t",$_[1]{value},"\n";
    }
}    

#my $p = HTML::Parser->new('/Inetpub/wwwroot/nofile.html');
#my $p = HTML::Parser->new('/Inetpub/wwwroot/valuepairs.html');
my $p = HTML::Parser->new('valuepairs.html');

$p->handler( start => \&start_handler, 'tag, attr' );
$p->parse_file('valuepairs.html') || die $!;
close(OUTFILE);
 
LVL 10

Expert Comment

by:rj2
ID: 8008313
On Windows, a full path starts with a drive letter.
Try something like the line below, changing the path to the correct one on your system:

my $p = HTML::TokeParser::Simple->new('c:/inetpub/wwwroot/valuepairs.html');
 

Author Comment

by:malkie
ID: 8011207
I have tried that; it came up with:

Can't locate auto/HTML/Parser/c:/inetpub/wwwroot/valuepairs.html.al in @INC (@INC contains: C:/Perl/lib C:/Perl/site/lib .) at C:/Perl/site/lib/HTML/Parser.pm line 82

I also tried the script on a Unix server (using the appropriate paths) and it came up with an error. Perhaps you can let me see the source code you tested.
 
LVL 10

Expert Comment

by:rj2
ID: 8011632
Weird.
I posted the source I tested in my first comment, it works for me (Apache on WinXP).
Did you get the same error message on the Unix server?
Could it be some problem with your Perl installation?
Try to uninstall and reinstall Perl.
http://www.activestate.com/activeperl
 
LVL 10

Expert Comment

by:rj2
ID: 8011650
Try to run it from the shell instead of as a CGI script, i.e. save the first script as parse.pl and type "perl parse.pl" on the command line.
Do you get any errors if you try that?
 

Author Comment

by:malkie
ID: 8028720
Just to let you know I have been checking to see if HTML::Parser is installed properly and did as follows.

I ran the PPM> verify --force which upgraded about four packages so I assume everything is okay.

I ran the script,
*********
use strict;
use HTML::Parser();
open(OUTFILE,">out2.txt") || die("can not open file because $!");

sub start_handler {
    if($_[0] eq 'input') {
         print OUTFILE $_[1]{name},"\t",$_[1]{value},"\n";
    }
}    

my $p = HTML::Parser->new('c:/Inetpub/wwwroot/valuepairs.htm');
$p->handler( start => \&start_handler, 'tag, attr' );
$p->parse_file('valuepairs.html') || die $!;
close(OUTFILE);
*****************
WHICH CAME UP WITH:-

CGI Error
The specified CGI application misbehaved by not returning a complete set of HTTP headers. The headers it did return are:
Can't locate auto/HTML/Parser/c:/Inetpub/wwwroot/valuepairs.htm.al in @INC (@INC contains: C:/Perl/lib C:/Perl/site/lib .) at C:/Perl/site/lib/HTML/Parser.pm line 82

I did the following,

C:\WINDOWS>ppm
PPM interactive shell (2.1.5) - type 'help' for available commands.
PPM>
PPM> verify HTML::Parser
PPM>
PPM> verify HTML-Parser
PPM> install HTML::Parser
Version 3.25 of 'HTML-Parser' is already installed.
Remove it, or use 'verify --upgrade HTML-Parser'.
PPM> verify --upgrade HTML-Parser
Upgrade package 'HTML-Parser'? (y/N): y
PPM>
PPM> verify HTML-Parser
PPM>
PPM> verify DBI
Package 'DBI' is up to date.
PPM> remove HTML::Parser
Remove package 'HTML-Parser?' (y/N): y
Error removing HTML-Parser: Package 'HTML-Parser' is required by PPM and
cannot be removed
PPM>
PPM> verify --force HTML::Parser
PPM>
PPM> verify HTML::Parser
PPM>
PPM> verify HTML-Parser
PPM>

The only thing that I can see is that when I run verify HTML::Parser it doesn't come back with
Package 'HTML-Parser' is up to date.
it always goes back to the PPM> prompt.

I will try what you suggested and get back to you.
" Try to run it from the shell instead as CGI script, ie type "perl parse.pl" on command line (save first script as parse.pl)
Do you get any errors if try try that? "

Thank You

 
LVL 10

Expert Comment

by:rj2
ID: 8035327
Add the two lines below right after the line "use HTML::Parser" to make it work as a CGI script. You did not say anything about CGI in your original question.

use CGI qw/:standard/;
print header;
 

Author Comment

by:malkie
ID: 8043527
Thank you.
I put in the "use CGI qw/:standard/;" and "print header;" lines as you said and it has stopped the error.

I ran the script (as CGI) and it is definitely doing something, but it doesn't print anything to the out2.txt file. Below is the script, followed by the HTML it is reading, which is the source code for valuepairs.html in the Inetpub/wwwroot/ folder.

#!/usr/bin/perl
use strict;
use HTML::Parser();
use CGI qw/:standard/;
print header;

open(OUTFILE,">out2.txt") || die("can not open file because $!");

sub start_handler {
    if($_[0] eq 'input') {
         print OUTFILE $_[1]{name},"\t",$_[1]{value},"\n";
    }
}    

my $p = HTML::Parser->new('/Inetpub/wwwroot/valuepairs.html');
$p->handler( start => \&start_handler, 'tag, attr' );
$p->parse_file('valuepairs.html') || die $!;
close(OUTFILE);
*********
<html><head><title>test</title>
</head>

<center><h1>test</h1></center>
<form method=POST action="http://yes/cgi-bin/ascript.pl">
<p>
<input type=hidden name="userdir" value="lnks">
<input type=hidden name="lnkuser" value="okay">
EMail: <input type=text name="emailid" size=30>&nbsp;<br>
Title: <input type=text name="title" size=40><br>
URL: <input type=text name="url" size=55><br>
Section to be placed in: <select name="section">
<option> Business <option> Computers <option> Education  <option> Entertainment
<option> Government <option> Personal  <option selected> Miscellaneous </select><br>
<input type=submit value="Add"> * <input type=reset></p>
</form>

 
LVL 10

Expert Comment

by:rj2
ID: 8044116
Replace the line
my $p = HTML::Parser->new('/Inetpub/wwwroot/valuepairs.html');
with
my $p = HTML::Parser->new(api_version => 3);
as posted in my second comment.

Replace
open(OUTFILE,">out2.txt") || die("can not open file because $!");
with
open(OUTFILE,">e:/inetpub/wwwroot/out2.txt") || die("can not open file because $!");
(replace path with correct path on your system)
 

Author Comment

by:malkie
ID: 8047103
my $p = HTML::Parser->new(api_version => 3);
as posted in my second comment.
# I tried this and it doesn't work. Having looked at other example code, I think this needs to be a file.

open(OUTFILE,">e:/inetpub/wwwroot/out2.txt") || die("can not open file because $!");
# Yes, with the full path it writes a file (it works okay without the e: too; I have used this on other CGI scripts), but it doesn't write anything to it???

THE CODE I AM USING.....
#!/usr/bin/perl
use strict;
use HTML::Parser();
use CGI qw/:standard/;
print header;

# Absolute path to out2.txt file:
my $outfile = "/Inetpub/wwwroot/cgi-bin/extracthtml/out2.txt";

# Absolute path to valuepairs.html file:
my $infile = "/Inetpub/wwwroot/valuepairs.html";

open(OUTFILE,">$outfile") || die("can not open file because $!");
sub start_handler {
    if($_[0] eq 'input') {
         print OUTFILE $_[1]{name},"\t",$_[1]{value},"\n";
    }
}    

my $p = HTML::Parser->new('$infile');
$p->handler( start => \&start_handler, 'tag, attr' );
$p->parse_file('$infile') || die $!;
close(OUTFILE);
 
LVL 10

Expert Comment

by:rj2
ID: 8047555
Absolute paths start with a drive letter in Windows.
 

Author Comment

by:malkie
ID: 8047681
Okay, but it still isn't working. The script is writing an out2.txt file but it isn't printing anything to it.
 
LVL 10

Expert Comment

by:rj2
ID: 8047752
Well, it works on my system.
Is it the same thing if you run it from the command line instead of as a CGI script?
 
LVL 10

Accepted Solution

by:
rj2 earned 750 total points
ID: 8047851
I tested your script now, and I changed a couple of things. Try the script below.

#!e:/perl/bin/perl.exe
use strict;
use HTML::Parser();
use CGI qw/:standard/;
print header;

# Absolute path to out2.txt file:
my $outfile = "e:/temp/out2.txt";

# Absolute path to valuepairs.html file:
my $infile = "e:/temp/valuepairs.html";

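# Open the output file before parsing so the handler can write to it.
open(OUTFILE,">$outfile") || die("can not open file because $!");

# Called for every start tag; prints name<TAB>value for each <input>.
sub start_handler {
   if(lc($_[0]) eq 'input') {
        print OUTFILE $_[1]{name},"\t",$_[1]{value},"\n";
   }
}    

# new() takes options, not a file name; the file goes to parse_file().
my $p = HTML::Parser->new(api_version => 3);
$p->handler( start => \&start_handler, 'tag, attr' );
# $infile is passed unquoted; '$infile' in single quotes would be the literal text.
$p->parse_file($infile) || die $!;
close(OUTFILE);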
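
One of the changes is easy to miss: the earlier attempt passed '$infile' in single quotes, which Perl does not interpolate, so the parser was handed the literal text $infile rather than the path. A small sketch of the difference, using the same path as the script above (the variable names $single, $double and $bare are only for illustration):

```perl
use strict;
use warnings;

my $infile = "e:/temp/valuepairs.html";

my $single = '$infile';   # single quotes: no interpolation, literal text
my $double = "$infile";   # double quotes: the variable is interpolated
my $bare   = $infile;     # no quotes needed when passing a variable

print "single: $single\n";
print "double: $double\n";
print "bare:   $bare\n";
```

Only the first line of output shows the literal text $infile; the other two show the path.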
 

Author Comment

by:malkie
ID: 8047931
Try to run it from the shell instead as CGI script, ie type "perl parse.pl" on command line (save first script as parse.pl)
Do you get any errors if try try that?
**************
Good news: it works with parse.pl (almost). This is what I got:-

userdir lnks
lnkuser okay
emailid
title
url
    Add

**************
from parse.pl

#!/usr/bin/perl
use strict;
use HTML::TokeParser::Simple;
open(OUTFILE,">out.txt") || die("can not open file because $!");

my $p = HTML::TokeParser::Simple->new('valuepairs.html');
while ( my $token = $p->get_token ) {
if ( $token->is_start_tag('input')) {
my $attr=$token->return_attr();
print OUTFILE $attr->{ name },"\t",$attr->{ value },"\n";
}
}
close(OUTFILE);
***************
I am not sure why it did not pick up the <input type=reset> maybe it is
because it is on the same line as <input type=submit value="Add">.

However it was my hope to have it run from the browser.

I will now try your last comment posted. THANKS

 

Author Comment

by:malkie
ID: 8048470
OK, that now works from the browser, but what I wanted to do was read a URL, i.e. http://www.mysite.com, and not a file as we have been doing with c:/inetpub/wwwroot/valuepairs.html
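
The thread ends without an answer to this last point. A minimal sketch of one way it might be done, assuming LWP::Simple is installed (it was suggested earlier for downloading the page): fetch the page into a string with get() and feed it to the parser with parse() and eof() instead of parse_file(). The input_pairs helper and the command-line URL argument are hypothetical names chosen for this sketch, not something posted in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return a list of [name, value] pairs,
# one for every <input> start tag found in the given HTML string.
sub input_pairs {
    my ($html) = @_;
    my @pairs;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tagname, $attr) = @_;
                push @pairs, [ $attr->{name}, $attr->{value} ]
                    if $tagname eq 'input';
            },
            'tagname, attr',
        ],
    );
    $p->parse($html);
    $p->eof;    # tell the parser there is no more input
    return @pairs;
}

# Fetch only when a URL is given on the command line (assumed usage).
if (my $url = shift @ARGV) {
    require LWP::Simple;
    my $html = LWP::Simple::get($url);    # undef if the fetch fails
    die "could not fetch $url\n" unless defined $html;
    for my $pair (input_pairs($html)) {
        # Replace missing attributes with empty strings before printing.
        my ($name, $value) = map { defined $_ ? $_ : '' } @$pair;
        print "$name\t$value\n";
    }
}
```

Run from a shell it would be called as "perl parse_url.pl http://www.mysite.com" (the script name is made up). As a CGI script it would also need the "print header;" line discussed above before any other output.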