Link to home
Start Free TrialLog in
Avatar of crest
crest

asked on

rip table again, refer to roylam's question

i found it very useful to print certain tables instead of printing the whole lot.  here is what i want to do:

myperlprogram 1 - print all tables without nested table in them.
i.e. <table></table>

myperlprogram 2 - print all tables with exactly one table in them.
i.e. <table><table></table></table>

here "1" and "2" are the arguments passed to the program.

i spent ages trying to figure out the algorithm but no luck.  any idea?
Avatar of khacharn
khacharn

Could you be more specific..
i m not sure what you want to do..
are you not able to pass the parameter or what..?
Avatar of crest

ASKER

i've use the source code from  a PAQ "rip tables apart in html" - Computers/Programming/Languages/Perl/Q_10254674.html.  
In which it print all the tables in a recursive manner.  I don't want that, I want to print just certain tables that have a specific number of nested table(s) in them as I stated in my question.  But my problem is that I couldn't get it working - use wrong algorithm.  Anyway here is the original source code:



package myParser;
use HTML::Parser;
@ISA = qw(HTML::Parser);

sub start {
my ($self, $tag, $attr, $attrseq, $origtext) = @_;
if ($tag eq "table") {
   push @tables, $origtext;
}
elsif ($#tables >= 0) {
   $tables[$#tables] .= $origtext;
}
}
sub text {
my ($self, $text) = @_;
$tables[$#tables] .= $text unless ($#tables < 0);

}
sub end {
my ($self, $tag, $origtext) = @_;
if ($tag eq "table") {
   if (!@tables) {
print "Parse Error: Mismatched table tags-- too many closing tags\n";
   }
   else {
my $currenttable = pop @tables;
      $currenttable .= $origtext;
$totaltablecount++;
      open (OUTPUT, ">${filebase}_${totaltablecount}.txt") || die "Can't open
${filebase}_${totaltablecount}.txt: $!\n";
print OUTPUT "$currenttable\n";
      close OUTPUT;
print "$currenttable\n\n";
      $tables[$#tables] .= $currenttable unless ($#tables < 0)
   }
}
elsif ($#tables >= 0) {
   $tables[$#tables] .= $origtext;
}
}

sub startParse {
(my $self, my $page, $filebase) = @_;
$totaltablecount = 0;
@tables = ();
$self->parse($page);
}
# END OF MYPARSER SUBCLASS


package main;

$page = qq(

<html>

<table>
1 belongs here
<table>
  here is 2
</table>
<table>
  here is 3
</table>
</table>

</html>
);

# use LWP::Simple;
# $page = LWP::Simple::get('http://www.experts-exchange.com');

$p = new myParser;
$p->startParse($page, "MyOutputFileName");

Comment

From: roylam
                  Date: Wednesday, January 05 2000 - 01:30PM PST

i tried to install HTML:Parser but unfortunately i don't have the permission to
do so.  i remember there was a way to tell the perl program where to look for
the module (which is compiled in my home directory, i'm using sun
solaris).  can anyone tell me how to do that? thanks.


Comment

From: clockwatcher
                  Date: Wednesday, January 05 2000 - 04:07PM PST

I'm not a unix user, so there may be a lot better way to do this, but the
following works for me.

After untarring,

perl Makefile.pl PREFIX=/myhomedir/perl
make
make test
make install

that should install the module within /myhomedir/perl.

On my system, the above does the following:

Places documentation in:

~/perl/lib/perl5/man

Places the modules in:

~/perl/lib/site-perl/5.005/i386-linux

Your exact paths will be different.  Then you simply add the path to @INC
and change the "use" to "require".

push (@INC, '/home/mark/perl/lib/site-perl/5.005/i386-linux');

package myParser;
require HTML::Parser;
@ISA = qw(HTML::Parser);

sub start {
my ($self, $tag, $attr, $attrseq, $origtext) = @_;
if ($tag eq "table") {
   push @tables, $origtext;
}
elsif ($#tables >= 0) {
   $tables[$#tables] .= $origtext;
}
}
sub text {
my ($self, $text) = @_;
$tables[$#tables] .= $text unless ($#tables < 0);

}
sub end {
my ($self, $tag, $origtext) = @_;
if ($tag eq "table") {
   if (!@tables) {
print "Parse Error: Mismatched table tags-- too many closing tags\n";
   }
   else {
my $currenttable = pop @tables;
      $currenttable .= $origtext;
$totaltablecount++;
      open (OUTPUT, ">${filebase}_${totaltablecount}.txt") || die "Can't open
${filebase}_${totaltablecount}.txt: $!\n";
print OUTPUT "$currenttable\n";
      close OUTPUT;
print "$currenttable\n\n";
      $tables[$#tables] .= $currenttable unless ($#tables < 0)
   }
}
elsif ($#tables >= 0) {
   $tables[$#tables] .= $origtext;
}
}

sub startParse {
(my $self, my $page, $filebase) = @_;
$totaltablecount = 0;
@tables = ();
$self->parse($page);
}
# END OF MYPARSER SUBCLASS


package main;

$page = qq(

<html>

<table>
1 belongs here
<table>
  here is 2
</table>
<table>
  here is 3
</table>
</table>

</html>
);

# use LWP::Simple;
# $page = LWP::Simple::get('http://www.experts-exchange.com');

$p = new myParser;
$p->startParse($page, "MyOutputFileName");


Avatar of crest

ASKER

oops, please ignore the bit after "COMMENT" , my fault -- copy-and-paste problem.
I'm not 100% sure what you want, but I believe this is what you're asking for:


# SUBCLASS HTML::PARSER to pull the links and info
package myParser;
use HTML::Parser;
@ISA = qw(HTML::Parser);

sub start {
  my ($self, $tag, $attr, $attrseq, $origtext) = @_;
  if ($tag eq "table") {
     $withincount = -1 unless (@tables);
     push @tables, $origtext;
     $withincount++;
  }
  elsif ($#tables >= 0) {
     $tables[$#tables] .= $origtext;
  }
}
sub text {
  my ($self, $text) = @_;
  $tables[$#tables] .= $text unless ($#tables < 0);

}
sub end {
  my ($self, $tag, $origtext) = @_;
  if ($tag eq "table") {
     if (!@tables) {
      print "Parse Error: Mismatched table tags-- too many closing tags\n";
     }
     else {
      my $currenttable = pop @tables;
        $currenttable .= $origtext;
        if (!@tables && ($withincount <= $maxnest)) {
           $totaltablecount++;
           open (OUTPUT, ">${filebase}_${totaltablecount}.txt") || die "Can't open ${filebase}_${totaltablecount}.txt: $!\n";
         print OUTPUT "$currenttable\n";
           close OUTPUT;
         print "$currenttable\n\n";
        }
        $tables[$#tables] .= $currenttable unless ($#tables < 0)
     }
  }
  elsif ($#tables >= 0) {
     $tables[$#tables] .= $origtext;
  }
}

sub startParse {
  (my $self, my $page, $filebase, $maxnest) = @_;
  $totaltablecount = 0;
  @tables = ();
  $self->parse($page);
}
# END OF MYPARSER SUBCLASS


package main;

$maxinnertables = (shift) - 1;
$maxinnertables = 0 unless ($maxinnertables > 0);

$page = qq(

<html>

<table>
   No nested
</table>

<table>
   One nested
  <table>
    nest
  </table>
</table>

<table>
   Two nested
   <table>
      nest 1
   </table>
   <table>
      nest 2
   </table>
</table>

<table>
   One nest with a nest
   <table>
      nest containing a nest
        <table>
           nest within a nest
        </table>
   </table>
</table>

</html>
);

# use LWP::Simple;
# $page = LWP::Simple::get('http://www.experts-exchange.com');

$p = new myParser;
$p->startParse($page, "MyOutputFileName", $maxinnertables);
Avatar of crest

ASKER

thanks for your quick response, clockwatcher.
but the output isn't quite what i want.  let me use your example to illustrate:
ie (
<table>
   No nested
</table>

<table>
   One nested
  <table>
    nest
  </table>
</table>

<table>
   Two nested
   <table>
      nest 1
   </table>
   <table>
      nest 2
   </table>
</table>

<table>
   One nest with a nest
   <table>
      nest containing a nest
        <table>
           nest within a nest
        </table>
   </table>
</table>
)


here is what i expect the program to output with the following commandline:

myperlprogram 0:

<table>
   No nested
</table>

  <table>
    nest
  </table>

   <table>
      nest 1
   </table>

   <table>
      nest 2
   </table>

        <table>
           nest within a nest
        </table>


myperlprogram 1:
<table>
   One nested
  <table>
    nest
  </table>
</table>

<table>
   Two nested
   <table>
      nest 1
   </table>
   <table>
      nest 2
   </table>
</table>

   <table>
      nest containing a nest
        <table>
           nest within a nest
        </table>
   </table>




and myperlprogram 2 will give this ouput:

<table>
   One nest with a nest
   <table>
      nest containing a nest
        <table>
           nest within a nest
        </table>
   </table>
</table>

i'll try to clarify a bit more if this is not clear.
Avatar of crest

ASKER

actually, i think how many "nested" table is not quite right.  it should be how many "levels" of table.
Avatar of ozo
 
ASKER CERTIFIED SOLUTION
Avatar of clockwatcher
clockwatcher

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of crest

ASKER

clockwatcher, you're champion!
thanks heaps for your quick response.