crest
asked on
rip table again, refer to roylam's question
i found it very useful to print certain tables instead of printing the whole lot. here is what i want to do:
myperlprogram 1 - print all tables without nested table in them.
i.e. <table></table>
myperlprogram 2 - print all tables with exactly one table in them.
i.e. <table><table></table></ta ble>
here "1" and "2" are the arguments passed to the program.
i spent ages trying to figure out the algorithm but no luck. any idea?
myperlprogram 1 - print all tables without nested table in them.
i.e. <table></table>
myperlprogram 2 - print all tables with exactly one table in them.
i.e. <table><table></table></ta
here "1" and "2" are the arguments passed to the program.
i spent ages trying to figure out the algorithm but no luck. any idea?
ASKER
i've use the source code from a PAQ "rip tables apart in html" - Computers/Programming/Lang uages/Perl /Q_1025467 4.html.
In which it print all the tables in a recursive manner. I don't want that, I want to print just certain tables that have a specific number of nested table(s) in them as I stated in my question. But my problem is that I couldn't get it working - use wrong algorithm. Anyway here is the original source code:
package myParser;
use HTML::Parser;
@ISA = qw(HTML::Parser);
sub start {
my ($self, $tag, $attr, $attrseq, $origtext) = @_;
if ($tag eq "table") {
push @tables, $origtext;
}
elsif ($#tables >= 0) {
$tables[$#tables] .= $origtext;
}
}
sub text {
my ($self, $text) = @_;
$tables[$#tables] .= $text unless ($#tables < 0);
}
sub end {
my ($self, $tag, $origtext) = @_;
if ($tag eq "table") {
if (!@tables) {
print "Parse Error: Mismatched table tags-- too many closing tags\n";
}
else {
my $currenttable = pop @tables;
$currenttable .= $origtext;
$totaltablecount++;
open (OUTPUT, ">${filebase}_${totaltable count}.txt ") || die "Can't open
${filebase}_${totaltableco unt}.txt: $!\n";
print OUTPUT "$currenttable\n";
close OUTPUT;
print "$currenttable\n\n";
$tables[$#tables] .= $currenttable unless ($#tables < 0)
}
}
elsif ($#tables >= 0) {
$tables[$#tables] .= $origtext;
}
}
sub startParse {
(my $self, my $page, $filebase) = @_;
$totaltablecount = 0;
@tables = ();
$self->parse($page);
}
# END OF MYPARSER SUBCLASS
package main;
$page = qq(
<html>
<table>
1 belongs here
<table>
here is 2
</table>
<table>
here is 3
</table>
</table>
</html>
);
# use LWP::Simple;
# $page = LWP::Simple::get('http://www.experts-exchange.com');
$p = new myParser;
$p->startParse($page, "MyOutputFileName");
Comment
From: roylam
Date: Wednesday, January 05 2000 - 01:30PM PST
i tried to install HTML:Parser but unfortunately i don't have the permission to
do so. i remember there was a way to tell the perl program where to look for
the module (which is compiled in my home directory, i'm using sun
solaris). can anyone tell me how to do that? thanks.
Comment
From: clockwatcher
Date: Wednesday, January 05 2000 - 04:07PM PST
I'm not a unix user, so there may be a lot better way to do this, but the
following works for me.
After untarring,
perl Makefile.pl PREFIX=/myhomedir/perl
make
make test
make install
that should install the module within /myhomedir/perl.
On my system, the above does the following:
Places documentation in:
~/perl/lib/perl5/man
Places the modules in:
~/perl/lib/site-perl/5.005 /i386-linu x
Your exact paths will be different. Then you simply add the path to @INC
and change the "use" to "require".
push (@INC, '/home/mark/perl/lib/site- perl/5.005 /i386-linu x');
package myParser;
require HTML::Parser;
@ISA = qw(HTML::Parser);
sub start {
my ($self, $tag, $attr, $attrseq, $origtext) = @_;
if ($tag eq "table") {
push @tables, $origtext;
}
elsif ($#tables >= 0) {
$tables[$#tables] .= $origtext;
}
}
sub text {
my ($self, $text) = @_;
$tables[$#tables] .= $text unless ($#tables < 0);
}
sub end {
my ($self, $tag, $origtext) = @_;
if ($tag eq "table") {
if (!@tables) {
print "Parse Error: Mismatched table tags-- too many closing tags\n";
}
else {
my $currenttable = pop @tables;
$currenttable .= $origtext;
$totaltablecount++;
open (OUTPUT, ">${filebase}_${totaltable count}.txt ") || die "Can't open
${filebase}_${totaltableco unt}.txt: $!\n";
print OUTPUT "$currenttable\n";
close OUTPUT;
print "$currenttable\n\n";
$tables[$#tables] .= $currenttable unless ($#tables < 0)
}
}
elsif ($#tables >= 0) {
$tables[$#tables] .= $origtext;
}
}
sub startParse {
(my $self, my $page, $filebase) = @_;
$totaltablecount = 0;
@tables = ();
$self->parse($page);
}
# END OF MYPARSER SUBCLASS
package main;
$page = qq(
<html>
<table>
1 belongs here
<table>
here is 2
</table>
<table>
here is 3
</table>
</table>
</html>
);
# use LWP::Simple;
# $page = LWP::Simple::get('http://www.experts-exchange.com');
$p = new myParser;
$p->startParse($page, "MyOutputFileName");
In which it print all the tables in a recursive manner. I don't want that, I want to print just certain tables that have a specific number of nested table(s) in them as I stated in my question. But my problem is that I couldn't get it working - use wrong algorithm. Anyway here is the original source code:
package myParser;
use HTML::Parser;
@ISA = qw(HTML::Parser);
sub start {
my ($self, $tag, $attr, $attrseq, $origtext) = @_;
if ($tag eq "table") {
push @tables, $origtext;
}
elsif ($#tables >= 0) {
$tables[$#tables] .= $origtext;
}
}
sub text {
my ($self, $text) = @_;
$tables[$#tables] .= $text unless ($#tables < 0);
}
sub end {
my ($self, $tag, $origtext) = @_;
if ($tag eq "table") {
if (!@tables) {
print "Parse Error: Mismatched table tags-- too many closing tags\n";
}
else {
my $currenttable = pop @tables;
$currenttable .= $origtext;
$totaltablecount++;
open (OUTPUT, ">${filebase}_${totaltable
${filebase}_${totaltableco
print OUTPUT "$currenttable\n";
close OUTPUT;
print "$currenttable\n\n";
$tables[$#tables] .= $currenttable unless ($#tables < 0)
}
}
elsif ($#tables >= 0) {
$tables[$#tables] .= $origtext;
}
}
sub startParse {
(my $self, my $page, $filebase) = @_;
$totaltablecount = 0;
@tables = ();
$self->parse($page);
}
# END OF MYPARSER SUBCLASS
package main;
$page = qq(
<html>
<table>
1 belongs here
<table>
here is 2
</table>
<table>
here is 3
</table>
</table>
</html>
);
# use LWP::Simple;
# $page = LWP::Simple::get('http://www.experts-exchange.com');
$p = new myParser;
$p->startParse($page, "MyOutputFileName");
Comment
From: roylam
Date: Wednesday, January 05 2000 - 01:30PM PST
i tried to install HTML:Parser but unfortunately i don't have the permission to
do so. i remember there was a way to tell the perl program where to look for
the module (which is compiled in my home directory, i'm using sun
solaris). can anyone tell me how to do that? thanks.
Comment
From: clockwatcher
Date: Wednesday, January 05 2000 - 04:07PM PST
I'm not a unix user, so there may be a lot better way to do this, but the
following works for me.
After untarring,
perl Makefile.pl PREFIX=/myhomedir/perl
make
make test
make install
that should install the module within /myhomedir/perl.
On my system, the above does the following:
Places documentation in:
~/perl/lib/perl5/man
Places the modules in:
~/perl/lib/site-perl/5.005
Your exact paths will be different. Then you simply add the path to @INC
and change the "use" to "require".
push (@INC, '/home/mark/perl/lib/site-
package myParser;
require HTML::Parser;
@ISA = qw(HTML::Parser);
sub start {
my ($self, $tag, $attr, $attrseq, $origtext) = @_;
if ($tag eq "table") {
push @tables, $origtext;
}
elsif ($#tables >= 0) {
$tables[$#tables] .= $origtext;
}
}
sub text {
my ($self, $text) = @_;
$tables[$#tables] .= $text unless ($#tables < 0);
}
sub end {
my ($self, $tag, $origtext) = @_;
if ($tag eq "table") {
if (!@tables) {
print "Parse Error: Mismatched table tags-- too many closing tags\n";
}
else {
my $currenttable = pop @tables;
$currenttable .= $origtext;
$totaltablecount++;
open (OUTPUT, ">${filebase}_${totaltable
${filebase}_${totaltableco
print OUTPUT "$currenttable\n";
close OUTPUT;
print "$currenttable\n\n";
$tables[$#tables] .= $currenttable unless ($#tables < 0)
}
}
elsif ($#tables >= 0) {
$tables[$#tables] .= $origtext;
}
}
sub startParse {
(my $self, my $page, $filebase) = @_;
$totaltablecount = 0;
@tables = ();
$self->parse($page);
}
# END OF MYPARSER SUBCLASS
package main;
$page = qq(
<html>
<table>
1 belongs here
<table>
here is 2
</table>
<table>
here is 3
</table>
</table>
</html>
);
# use LWP::Simple;
# $page = LWP::Simple::get('http://www.experts-exchange.com');
$p = new myParser;
$p->startParse($page, "MyOutputFileName");
ASKER
oops, please ignore the bit after "COMMENT" , my fault -- copy-and-paste problem.
I'm not 100% sure what you want, but I believe this is what you're asking for:
# SUBCLASS HTML::PARSER to pull the links and info
package myParser;
use HTML::Parser;
@ISA = qw(HTML::Parser);
sub start {
my ($self, $tag, $attr, $attrseq, $origtext) = @_;
if ($tag eq "table") {
$withincount = -1 unless (@tables);
push @tables, $origtext;
$withincount++;
}
elsif ($#tables >= 0) {
$tables[$#tables] .= $origtext;
}
}
sub text {
my ($self, $text) = @_;
$tables[$#tables] .= $text unless ($#tables < 0);
}
sub end {
my ($self, $tag, $origtext) = @_;
if ($tag eq "table") {
if (!@tables) {
print "Parse Error: Mismatched table tags-- too many closing tags\n";
}
else {
my $currenttable = pop @tables;
$currenttable .= $origtext;
if (!@tables && ($withincount <= $maxnest)) {
$totaltablecount++;
open (OUTPUT, ">${filebase}_${totaltable count}.txt ") || die "Can't open ${filebase}_${totaltableco unt}.txt: $!\n";
print OUTPUT "$currenttable\n";
close OUTPUT;
print "$currenttable\n\n";
}
$tables[$#tables] .= $currenttable unless ($#tables < 0)
}
}
elsif ($#tables >= 0) {
$tables[$#tables] .= $origtext;
}
}
sub startParse {
(my $self, my $page, $filebase, $maxnest) = @_;
$totaltablecount = 0;
@tables = ();
$self->parse($page);
}
# END OF MYPARSER SUBCLASS
package main;
$maxinnertables = (shift) - 1;
$maxinnertables = 0 unless ($maxinnertables > 0);
$page = qq(
<html>
<table>
No nested
</table>
<table>
One nested
<table>
nest
</table>
</table>
<table>
Two nested
<table>
nest 1
</table>
<table>
nest 2
</table>
</table>
<table>
One nest with a nest
<table>
nest containing a nest
<table>
nest within a nest
</table>
</table>
</table>
</html>
);
# use LWP::Simple;
# $page = LWP::Simple::get('http://www.experts-exchange.com');
$p = new myParser;
$p->startParse($page, "MyOutputFileName", $maxinnertables);
# SUBCLASS HTML::PARSER to pull the links and info
package myParser;
use HTML::Parser;
@ISA = qw(HTML::Parser);
sub start {
my ($self, $tag, $attr, $attrseq, $origtext) = @_;
if ($tag eq "table") {
$withincount = -1 unless (@tables);
push @tables, $origtext;
$withincount++;
}
elsif ($#tables >= 0) {
$tables[$#tables] .= $origtext;
}
}
sub text {
my ($self, $text) = @_;
$tables[$#tables] .= $text unless ($#tables < 0);
}
sub end {
my ($self, $tag, $origtext) = @_;
if ($tag eq "table") {
if (!@tables) {
print "Parse Error: Mismatched table tags-- too many closing tags\n";
}
else {
my $currenttable = pop @tables;
$currenttable .= $origtext;
if (!@tables && ($withincount <= $maxnest)) {
$totaltablecount++;
open (OUTPUT, ">${filebase}_${totaltable
print OUTPUT "$currenttable\n";
close OUTPUT;
print "$currenttable\n\n";
}
$tables[$#tables] .= $currenttable unless ($#tables < 0)
}
}
elsif ($#tables >= 0) {
$tables[$#tables] .= $origtext;
}
}
sub startParse {
(my $self, my $page, $filebase, $maxnest) = @_;
$totaltablecount = 0;
@tables = ();
$self->parse($page);
}
# END OF MYPARSER SUBCLASS
package main;
$maxinnertables = (shift) - 1;
$maxinnertables = 0 unless ($maxinnertables > 0);
$page = qq(
<html>
<table>
No nested
</table>
<table>
One nested
<table>
nest
</table>
</table>
<table>
Two nested
<table>
nest 1
</table>
<table>
nest 2
</table>
</table>
<table>
One nest with a nest
<table>
nest containing a nest
<table>
nest within a nest
</table>
</table>
</table>
</html>
);
# use LWP::Simple;
# $page = LWP::Simple::get('http://www.experts-exchange.com');
$p = new myParser;
$p->startParse($page, "MyOutputFileName", $maxinnertables);
ASKER
thanks for your quick response, clockwatcher.
but the output isn't quite what i want. let me use your example to illustrate:
ie (
<table>
No nested
</table>
<table>
One nested
<table>
nest
</table>
</table>
<table>
Two nested
<table>
nest 1
</table>
<table>
nest 2
</table>
</table>
<table>
One nest with a nest
<table>
nest containing a nest
<table>
nest within a nest
</table>
</table>
</table>
)
here is what i expect the program to output with the following commandline:
myperlprogram 0:
<table>
No nested
</table>
<table>
nest
</table>
<table>
nest 1
</table>
<table>
nest 2
</table>
<table>
nest within a nest
</table>
myperlprogram 1:
<table>
One nested
<table>
nest
</table>
</table>
<table>
Two nested
<table>
nest 1
</table>
<table>
nest 2
</table>
</table>
<table>
nest containing a nest
<table>
nest within a nest
</table>
</table>
and myperlprogram 2 will give this ouput:
<table>
One nest with a nest
<table>
nest containing a nest
<table>
nest within a nest
</table>
</table>
</table>
i'll try to clarify a bit more if this is not clear.
but the output isn't quite what i want. let me use your example to illustrate:
ie (
<table>
No nested
</table>
<table>
One nested
<table>
nest
</table>
</table>
<table>
Two nested
<table>
nest 1
</table>
<table>
nest 2
</table>
</table>
<table>
One nest with a nest
<table>
nest containing a nest
<table>
nest within a nest
</table>
</table>
</table>
)
here is what i expect the program to output with the following commandline:
myperlprogram 0:
<table>
No nested
</table>
<table>
nest
</table>
<table>
nest 1
</table>
<table>
nest 2
</table>
<table>
nest within a nest
</table>
myperlprogram 1:
<table>
One nested
<table>
nest
</table>
</table>
<table>
Two nested
<table>
nest 1
</table>
<table>
nest 2
</table>
</table>
<table>
nest containing a nest
<table>
nest within a nest
</table>
</table>
and myperlprogram 2 will give this ouput:
<table>
One nest with a nest
<table>
nest containing a nest
<table>
nest within a nest
</table>
</table>
</table>
i'll try to clarify a bit more if this is not clear.
ASKER
actually, i think how many "nested" table is not quite right. it should be how many "levels" of table.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
clockwatcher, you're champion!
thanks heaps for your quick response.
thanks heaps for your quick response.
i m not sure what you want to do..
are you not able to pass the parameter or what..?