rip tables apart in html

i would like to cut chunks of html code and write them to different files.
e.g.

<html>

<table>
  1 belongs here
  <table>
    here is 2
  </table>
</table>

</html>


more specifically, i want the program to output the following:
  <table>
    here is 2
  </table>

<table>
  1 belongs here
  <table>
    here is 2
  </table>
</table>

in effect, it recursively outputs the smallest table, then the table one level out, then the next level out, and so on.

your help would be appreciated
Asked by: roylam

Kim Ryan (IT Consultant) commented:
There is a CPAN module that covers this, HTML::Parser. You can parse HTML files and build a tree of tokens or sections.
Kenny (IT Application Executive) commented:
I don't quite understand what you want to accomplish. Perhaps you can give a better example?
roylam (Author) commented:
sorry about the unclear question.

let me try to explain what i want.
say there are nested tables in an html file.
i would like to be able to get certain tables out, e.g. the innermost table(s), i.e. tables that don't have any other table inside them.

if this can be done, i would like to extend it a bit, i.e. i would call the program with a parameter, say

MyProgram "2"

here 2 specifies that i only want to print tables that have one nested table inside them: "2" because there are two tables altogether.

i hope this is clearer
Kenny (IT Application Executive) commented:
For now, I shall assume you are quite familiar with Perl. This is one way of doing it:

In the HTML file, place each <table> and </table> tag on a line by itself. Definitely do not put 2 table tags on the same line, or else it will "break" the code.

In your Perl script, first capture the value passed in to the program ($Num).
Then:

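 $Num = shift @ARGV;               # nesting depth to print, e.g. 2
 open (HTML, "htmlfile.htm") || die "Can't open htmlfile.htm: $!\n";
 $Count=0;                         # current nesting depth
 while (<HTML>)
   {
   $Line=$_;
   if ($Line =~ /<table\b/i)       # opening tag, with or without attributes
     {
     $Count++;
     }
   if ($Num == $Count)             # numeric comparison, so "2" and "02" both work
     {
     print $Line;
     }

   if ($Line =~ /<\/table>/i)
     {
     $Count=$Count-1;
     }
   }
 close (HTML);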

I hope this gives you an idea of what to do. If you need me to give you more, I can only do so a little later as I am a bit busy now. Hope it helps.
ozo commented:
The right way to do it would be to use HTML::Parser;
although if your tables are as simple as those in the example above,
with nothing tricky in an ALT attribute or in comments or in a <SCRIPT>, no attributes in your <table> tags, etc.
then you might be able to get by with some simple-minded regular expressions
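
For markup that is simple but not laid out one tag per line, a regular-expression pass might look like the sketch below. It is only an illustration: extract_tables is a hypothetical helper name, it does tolerate attributes in the <table> tag, but it will still be fooled by table tags hidden in comments, <SCRIPT> bodies or attribute values.

```perl
use strict;
use warnings;

# Return every <table>...</table> block in $html, innermost first.
# Hypothetical helper: tags are found with a plain regex, not a real parser.
sub extract_tables {
    my ($html) = @_;
    my (@open, @done);    # start offsets of open tables; finished blocks

    while ($html =~ m{<(/?)table\b[^>]*>}gi) {
        if ($1 eq '') {
            push @open, $-[0];            # offset where this <table ...> starts
        }
        elsif (@open) {
            my $start = pop @open;        # matching opening tag
            push @done, substr($html, $start, $+[0] - $start);
        }
        # a stray </table> with no open table is silently ignored
    }
    return @done;
}

my $html = '<table border="1">1 belongs here<table>here is 2</table></table>';
print "$_\n\n" for extract_tables($html);
```

Because a block is recorded when its closing tag is seen, the inner table is printed before the outer one, which is the order asked for in the question.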
roylam (Author) commented:
zxr250, thanks for your input.

i was hoping i could get away with this, but i guess teraplane and ozo are right.  most of the html out there is too messy to expect one "table" tag per line.

can you show me some sample code using the HTML::Parser module?
clockwatcher commented:
Here you go:

package myParser;
use HTML::Parser;
@ISA = qw(HTML::Parser);

sub start {
  my ($self, $tag, $attr, $attrseq, $origtext) = @_;
  if ($tag eq "table") {
     push @tables, $origtext;
  }
  elsif ($#tables >= 0) {
     $tables[$#tables] .= $origtext;
  }
}
sub text {
  my ($self, $text) = @_;
  $tables[$#tables] .= $text unless ($#tables < 0);

}
sub end {
  my ($self, $tag, $origtext) = @_;
  if ($tag eq "table") {
     if (!@tables) {
        print "Parse Error: Mismatched table tags-- too many closing tags\n";
     }
     else {
        my $currenttable = pop @tables;
        $currenttable .= $origtext;
        $totaltablecount++;
        open (OUTPUT, ">${filebase}_${totaltablecount}.txt") || die "Can't open ${filebase}_${totaltablecount}.txt: $!\n";
        print OUTPUT "$currenttable\n";
        close OUTPUT;
        print "$currenttable\n\n";
        $tables[$#tables] .= $currenttable unless ($#tables < 0);
     }
  }
  elsif ($#tables >= 0) {
     $tables[$#tables] .= $origtext;
  }
}

sub startParse {
  (my $self, my $page, $filebase) = @_;
  $totaltablecount = 0;
  @tables = ();
  $self->parse($page);
  $self->eof;   # signal end of input so any buffered text is flushed
}
# END OF MYPARSER SUBCLASS


package main;

$page = qq(

<html>

<table>
  1 belongs here
  <table>
    here is 2
  </table>
  <table>
    here is 3
  </table>
</table>

</html>
);

# use LWP::Simple;
# $page = LWP::Simple::get('http://www.experts-exchange.com');

$p = new myParser;
$p->startParse($page, "MyOutputFileName");
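
The parameter asked for earlier in the thread (print only tables that contain a given number of tables, counting the table itself) could be layered on the same idea. The sketch below is one possible way, not part of the code above: it uses the handler-style interface of HTML::Parser version 3 rather than a subclass, and collect_tables, the html and count hash keys, and $want are all hypothetical names.

```perl
use strict;
use warnings;
use HTML::Parser;

# Return one hash per table, in closing order (innermost first):
#   html  => the table's full markup
#   count => number of tables it contains, itself included
# collect_tables and the hash keys are hypothetical names.
sub collect_tables {
    my ($html) = @_;
    my (@stack, @found);

    # Append a piece of source text to every table that is currently open.
    my $append = sub {
        my ($text) = @_;
        $_->{html} .= $text for @stack;
    };

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $text) = @_;
            push @stack, { html => '', count => 0 } if $tag eq 'table';
            $append->($text);
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            $append->($text);
            return unless $tag eq 'table' && @stack;
            my $table = pop @stack;
            $table->{count}++;                                # count the table itself
            $stack[-1]{count} += $table->{count} if @stack;   # pass the total up to the parent
            push @found, $table;
        }, 'tagname, text' ],
        default_h => [ sub { $append->($_[0]) }, 'text' ],    # text, comments, etc.
    );
    $parser->parse($html);
    $parser->eof;
    return @found;
}

my $page = '<table>1 belongs here<table>here is 2</table></table>';
my $want = 2;    # in a real script this would come from $ARGV[0]
print "$_->{html}\n" for grep { $_->{count} == $want } collect_tables($page);
```

With $want set to 2 only the outer table is printed; with 1 only the innermost tables would be.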
roylam (Author) commented:
i tried to install HTML::Parser but unfortunately i don't have permission to do so.  i remember there is a way to tell the perl program where to look for a module (which is compiled in my home directory; i'm using sun solaris).  can anyone tell me how to do that? thanks.

clockwatcher commented:
I'm not a Unix user, so there may be a much better way to do this, but the following works for me.

After untarring,

perl Makefile.PL PREFIX=/myhomedir/perl
make
make test
make install

That should install the module under /myhomedir/perl.

On my system, the above does the following:

Places documentation in:

  ~/perl/lib/perl5/man

Places the modules in:

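  ~/perl/lib/site_perl/5.005/i386-linux

Your exact paths will be different.  Then you simply add the path to @INC and change the "use" to "require". The change is needed because "use" is processed at compile time, before the push has run, while "require" is processed at run time.

push (@INC, '/home/mark/perl/lib/site_perl/5.005/i386-linux');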

package myParser;
require HTML::Parser;
@ISA = qw(HTML::Parser);

sub start {
  my ($self, $tag, $attr, $attrseq, $origtext) = @_;
  if ($tag eq "table") {
     push @tables, $origtext;
  }
  elsif ($#tables >= 0) {
     $tables[$#tables] .= $origtext;
  }
}
sub text {
  my ($self, $text) = @_;
  $tables[$#tables] .= $text unless ($#tables < 0);

}
sub end {
  my ($self, $tag, $origtext) = @_;
  if ($tag eq "table") {
     if (!@tables) {
        print "Parse Error: Mismatched table tags-- too many closing tags\n";
     }
     else {
        my $currenttable = pop @tables;
        $currenttable .= $origtext;
        $totaltablecount++;
        open (OUTPUT, ">${filebase}_${totaltablecount}.txt") || die "Can't open ${filebase}_${totaltablecount}.txt: $!\n";
        print OUTPUT "$currenttable\n";
        close OUTPUT;
        print "$currenttable\n\n";
        $tables[$#tables] .= $currenttable unless ($#tables < 0);
     }
  }
  elsif ($#tables >= 0) {
     $tables[$#tables] .= $origtext;
  }
}

sub startParse {
  (my $self, my $page, $filebase) = @_;
  $totaltablecount = 0;
  @tables = ();
  $self->parse($page);
  $self->eof;   # signal end of input so any buffered text is flushed
}
# END OF MYPARSER SUBCLASS


package main;

$page = qq(

<html>

<table>
  1 belongs here
  <table>
    here is 2
  </table>
  <table>
    here is 3
  </table>
</table>

</html>
);

# use LWP::Simple;
# $page = LWP::Simple::get('http://www.experts-exchange.com');

$p = new myParser;
$p->startParse($page, "MyOutputFileName");
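
As a possible alternative to pushing onto @INC by hand, the standard lib pragma adds a directory to the front of @INC at compile time, so the original "use HTML::Parser;" line could stay as it was. This is only a sketch: the directory is a hypothetical path modelled on the example above and has to be replaced with wherever "make install" actually put the module. Setting the PERL5LIB environment variable to the same directory is another option that needs no change to the script.

```perl
use strict;
use warnings;

# Hypothetical private install directory; substitute your own path.
use lib '/home/mark/perl/lib/site_perl/5.005/i386-linux';

# "use lib" has already run by this point, so a "use HTML::Parser;"
# placed after it would search the private directory first.
print "$_\n" for @INC;    # the private directory is now on the search path
```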
roylam (Author) commented:
thnx heaps clockwatcher