
rip tables apart in html

I would like to cut chunks of HTML code out of a page and write them to different files,
e.g.

<html>

<table>
1 belongs here
<table>
here is 2
</table>
</table>

</html>

More specifically, I want the program to output the following:
<table>
here is 2
</table>

<table>
1 belongs here
<table>
here is 2
</table>
</table>

In other words, it should recursively output the innermost table first, then the table one level out, then the next level out, and so on.

your help would be appreciated

Kim Ryan

There is a CPAN module that covers this, HTML::Parser. You can parse HTML files and build a tree of tokens or sections.

Kenny

I don't quite understand what you want to accomplish. Perhaps you can give a better example?

roylam

ASKER

sorry about the unclear question.

Let me try to explain what I want.
Say there are nested tables in an HTML file.
I would like to be able to get certain tables out, e.g. the innermost table(s), i.e. any table that doesn't have another table within itself.

if this could be done, i would like it to be expanded a bit. i.e. i would call the program with a parameter say

MyProgram "2"

Here 2 specifies that I only want to print tables that have one nested table inside them. It is "2" because there are two tables altogether.

i hope this is clearer

Kenny

For now, I shall assume you are quite familiar with Perl. This is a way of doing it:

In the HTML file, try to place the <table> and </table> tags on a line by themselves. Definitely do not have 2 table tags on the same line, or else it will "break" the code.

In your Perl script, first you trap the value passed in to the program ($Num).
Then,

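open (HTML, "htmlfile.htm") or die "Can't open htmlfile.htm: $!\n";
$Count=0;
while (<HTML>)
{
    $Line=$_;
    # entering a table: one nesting level deeper
    if ($Line =~ /<table>/i)
    {
        $Count++;
    }
    # print only the lines at the requested nesting depth
    if ($Num == $Count)
    {
        print $Line;
    }
    # leaving a table: one nesting level back up
    if ($Line =~ /<\/table>/i)
    {
        $Count=$Count-1;
    }
}
close (HTML);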

I hope this gives you an idea of what to do. If you need me to give you more, I can only do so a little later as I am a bit busy now. Hope it helps.
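Building on the line-by-line idea above, and keeping its assumption that every table tag sits on its own line, here is a rough sketch of how the "2 tables altogether" request could be handled with a stack. The sub name tables_with_count and the sample data are made up for illustration, and $want is hard-coded where a real script would read it from the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of every table whose total table
# count (itself plus everything nested inside it) equals $want.
# Assumes each <table> / </table> tag is on a line by itself.
sub tables_with_count {
    my ($want, @lines) = @_;
    my (@stack, @found);
    for my $line (@lines) {
        # an opening tag starts a new entry on the stack
        if ($line =~ /<table\b[^>]*>/i) {
            push @stack, { text => '', count => 1 };
        }
        # every table that is currently open receives this line
        $_->{text} .= $line for @stack;
        # a closing tag finishes the innermost open table
        if ($line =~ m{</table\s*>}i && @stack) {
            my $done = pop @stack;
            push @found, $done->{text} if $done->{count} == $want;
            # the enclosing table inherits the nested table count
            $stack[-1]{count} += $done->{count} if @stack;
        }
    }
    return @found;
}

# Sample input taken from the question, one tag per line.
my @html = map { "$_\n" } (
    '<html>', '<table>', '1 belongs here',
    '<table>', 'here is 2', '</table>', '</table>', '</html>',
);

my $want = 2;    # would normally come from the command line
print for tables_with_count($want, @html);
```

With $want set to 1 the same sub returns only the innermost tables. It still breaks on real-world HTML for the same reason the loop above does, so treat it as a way to see the bookkeeping rather than a finished tool.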

ozo

The right way to do it would be to use HTML::Parser,
although if your tables are as simple as those in the example above,
with nothing tricky in an ALT attribute, in comments or in a <SCRIPT>, no attributes in your <table> tags, etc.,
then you might be able to get by with some simple-minded regular expressions.
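To make those caveats concrete, here is a small check of the kind of pattern used in the line-based approach against a few inputs. The sub name naive_is_table_open and the sample strings are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the simple-minded pattern.
sub naive_is_table_open {
    my ($s) = @_;
    return $s =~ /<table>/i ? 1 : 0;
}

my @samples = (
    '<table>',              # plain tag: found
    '<TABLE>',              # upper case: found thanks to /i
    '<table border="1">',   # tag with an attribute: missed
    '<!-- <table> -->',     # inside a comment: false positive
    '<img alt="<table>">',  # inside an ALT attribute: false positive
);

for my $s (@samples) {
    printf "%-22s %s\n", $s, naive_is_table_open($s) ? 'match' : 'no match';
}
```

Loosening the pattern to allow attributes fixes the third case but does nothing for the last two, which is why a real parser is the safer choice.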

roylam

ASKER

zxr250, thanks for your input.

I was hoping that I could get away with this, but I guess teraplane and ozo are right. Most of the HTML out there is too messy to expect one "table" tag per line.

Can you show me some sample code using the HTML::Parser module?

ASKER CERTIFIED SOLUTION

clockwatcher


roylam

ASKER

I tried to install HTML::Parser but unfortunately I don't have permission to do so. I remember there was a way to tell a Perl program where to look for a module (in my case it is compiled in my home directory, and I'm using Sun Solaris). Can anyone tell me how to do that? Thanks.

clockwatcher

I'm not a Unix user, so there may be a much better way to do this, but the following works for me.

After untarring,

perl Makefile.PL PREFIX=/myhomedir/perl
make
make test
make install

that should install the module within /myhomedir/perl.

On my system, the above does the following:

Places documentation in:

~/perl/lib/perl5/man

Places the modules in:

~/perl/lib/site_perl/5.005/i386-linux

Your exact paths will be different. Then you simply add the path to @INC and change the "use" to "require" ("use" is resolved at compile time, before the push has run, while "require" is resolved at run time).

push (@INC, '/home/mark/perl/lib/site_perl/5.005/i386-linux');

package myParser;
require HTML::Parser;
@ISA = qw(HTML::Parser);

sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    if ($tag eq "table") {
        push @tables, $origtext;
    }
    elsif ($#tables >= 0) {
        $tables[$#tables] .= $origtext;
    }
}
sub text {
    my ($self, $text) = @_;
    $tables[$#tables] .= $text unless ($#tables < 0);
}
sub end {
    my ($self, $tag, $origtext) = @_;
    if ($tag eq "table") {
        if (!@tables) {
            print "Parse Error: Mismatched table tags-- too many closing tags\n";
        }
        else {
            my $currenttable = pop @tables;
            $currenttable .= $origtext;
            $totaltablecount++;
            open (OUTPUT, ">${filebase}_${totaltablecount}.txt") || die "Can't open ${filebase}_${totaltablecount}.txt: $!\n";
            print OUTPUT "$currenttable\n";
            close OUTPUT;
            print "$currenttable\n\n";
            $tables[$#tables] .= $currenttable unless ($#tables < 0);
        }
    }
    elsif ($#tables >= 0) {
        $tables[$#tables] .= $origtext;
    }
}

sub startParse {
    (my $self, my $page, $filebase) = @_;
    $totaltablecount = 0;
    @tables = ();
    $self->parse($page);
}
# END OF MYPARSER SUBCLASS

package main;

$page = qq(

<html>

<table>
1 belongs here
<table>
here is 2
</table>
<table>
here is 3
</table>
</table>

</html>
);

# use LWP::Simple;
# $page = LWP::Simple::get('http://www.experts-exchange.com');

$p = new myParser;
$p->startParse($page, "MyOutputFileName");
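A possible alternative to pushing onto @INC at run time is the lib pragma, which adds the directory at compile time so that a plain "use" of the module keeps working. This is only a sketch: the path below is a placeholder, so substitute wherever "make install" actually put the module on your system.

```perl
use strict;
use warnings;

# Placeholder path (an assumption): replace it with your own install location.
use lib '/home/mark/perl/lib/site_perl/5.005/i386-linux';

# "use lib" puts the directory at the front of @INC, so it is searched first.
print "Perl will search these directories for modules:\n";
print "  $_\n" for @INC;
```

Setting the PERL5LIB environment variable to the same directory is another way to get the same effect without editing the script.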

roylam

ASKER

thnx heaps clockwatcher