Extract data from an HTML table

I have some HTML files that each contain a table, delimited in the code with

<tbody>.....</tbody>

I would like to extract the values in each column and each row. The rows are delimited with <tr>.... </tr>

and my pattern is the following:

<td class="img"><a href="/page/abcd/"><img src="/static/asjsjs.png" alt="abcd123" /></a></td><td class="name"><a href="/gsgs/asdatacot/">abcddd</a></td><td class="date">Mar 30, 2008</td><td>dhhfdf</td><td class="pages">104</td></tr>

from each line like the one above I need to extract:

1) the 1st href link: /page/abcd/
2) the 2nd href link: /gsgs/asdatacot/ and its name "abcddd"
3) the date: Mar 30, 2008
4) the column after the date: dhhfdf
5) the number of "pages": 104

What is the best way to do that with Perl?

thanks!!!!

Adam314

If you can count on the HTML being fairly consistent (e.g. the first td always has the first href, the second td has the second href, the third td has the date, the fourth has the column after the date, and the fifth has the pages), then you could do it with a regex. If the data could be more varied, you should use an HTML parser instead.

catalini

ASKER

It's always very consistent. What would be the correct regex? Thanks.

Adam314

$_='<td class="img"><a href="/page/abcd/"><img src="/static/asjsjs.png" alt="abcd123" /></a></td><td class="name"><a href="/gsgs/asdatacot/">abcddd</a></td><td class="date">Mar 30, 2008</td><td>dhhfdf</td><td class="pages">104</td></tr>';
 
my ($h1, $h2, $date, $next, $page) =
 /href="(.*?)".*?href="(.*?)".*?date">(.*?)<\/td><td>(.*?)<\/td>.*?>(.*?)</;
 
print "h1=$h1\nh2=$h2\ndate=$date\nnext=$next\npage=$page\n";
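For readers who want to see what each part of that pattern matches, here is a sketch of the same regex written with the /x flag and named captures (Perl 5.10 or later). The sub name parse_row and the capture names are made up for illustration. The extra "name" capture for the link text is an addition: the question asks for it, but the regex above does not return it.

```perl
use strict;
use warnings;

# Same idea as the one-line regex above, split up with /x so each piece
# can carry a comment. Assumes one table row per input string.
my $row_re = qr{
    href="(?<h1>.*?)"                      # 1st link target
    .*?
    href="(?<h2>.*?)">(?<name>.*?)</a>     # 2nd link target and its text
    .*?
    date">(?<date>.*?)</td>                # date cell
    <td>(?<next>.*?)</td>                  # plain cell right after the date
    .*?>(?<pages>.*?)<                     # text of the following cell
}x;

# Hypothetical helper: returns a hash ref of the captured fields,
# or nothing if the row does not match.
sub parse_row {
    my ($row) = @_;
    return unless $row =~ $row_re;
    return +{ %+ };    # copy the named captures into a plain hash
}

my $sample = '<td class="img"><a href="/page/abcd/"><img src="/static/asjsjs.png" alt="abcd123" /></a></td><td class="name"><a href="/gsgs/asdatacot/">abcddd</a></td><td class="date">Mar 30, 2008</td><td>dhhfdf</td><td class="pages">104</td></tr>';

my $fields = parse_row($sample) or die "row did not match\n";
print "$_=$fields->{$_}\n" for qw(h1 h2 name date next pages);
```

Like the original, this only holds up while every row keeps the same cell order and attributes.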

catalini

ASKER

Thanks Adam! And how would I go through all the files in a directory and save, for each of them, a file containing only the cleaned output?

catalini

ASKER

it should only extract the pattern after a ...

Adam314

local $/;
foreach <*> {
    open(my $in, "<$_") or warn "could not open $_: $!\n",next;
    open(my $out, ">$_.results") or warn "could not open $_.results: $!\n",next;
    my $data=<$in>;
    $data =~ s/.*?<tbody>(.*?)</tbody>.*/$1/s;
    while($data =~ /href="(.*?)".*?href="(.*?)".*?date">(.*?)<\/td><td>(.*?)<\/td>.*?>(.*?)</g) {
        print $out "h1=$1\nh2=$2\ndate=$3\nnext=$4\npage=$5\n";
    }
    close($in);
    close($out);
}

catalini

ASKER

I receive this error

Scalar found where operator expected at cleaner.pl line 6, near "s/.*?(.*?).*/$1"
syntax error at believers.pl line 2, near "foreach <*>"
syntax error at believers.pl line 6, near "s/.*?(.*?).*/$1"
Execution of cleaner.pl aborted due to compilation errors.
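Both syntax errors trace back to the script as posted: foreach needs its list in parentheses, and the unescaped "/" in </tbody> ends the s/// pattern early. A quieter third problem is that in `warn "...", next` the `next` is evaluated as one of warn's arguments, so the loop moves on before any warning is printed. Below is a minimal corrected sketch, not the accepted solution (which is not shown in this thread). The sub name clean_dir is made up for illustration, and skipping *.results files, directories, and files without a <tbody> are assumptions added here.

```perl
use strict;
use warnings;

# Hypothetical helper: process every regular file in $dir and write
# "<file>.results" next to it. Returns the number of result files written.
# Note: glob splits on whitespace, so $dir should not contain spaces.
sub clean_dir {
    my ($dir) = @_;
    my $written = 0;
    for my $file (glob "$dir/*") {
        next unless -f $file;              # skip directories
        next if $file =~ /\.results\z/;    # skip output of earlier runs

        open(my $in, '<', $file)
            or do { warn "could not open $file: $!\n"; next };
        my $data = do { local $/; <$in> }; # slurp the whole file
        close($in);

        # Keep only what sits between <tbody> and </tbody>.
        # m{...} avoids having to escape the "/" in </tbody>.
        next unless $data =~ m{<tbody>(.*?)</tbody>}s;
        my $body = $1;

        open(my $out, '>', "$file.results")
            or do { warn "could not open $file.results: $!\n"; next };
        # Same row regex as before; each row must sit on a single line.
        while ($body =~ /href="(.*?)".*?href="(.*?)".*?date">(.*?)<\/td><td>(.*?)<\/td>.*?>(.*?)</g) {
            print $out "h1=$1\nh2=$2\ndate=$3\nnext=$4\npage=$5\n";
        }
        close($out);
        $written++;
    }
    return $written;
}

# Run as: perl cleaner.pl some_directory
clean_dir($ARGV[0]) if @ARGV;
```

Limiting `local $/` to the `do` block keeps slurp mode from leaking into the rest of the script, and the three-argument form of open avoids surprises with file names that begin with "<" or ">".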

ASKER CERTIFIED SOLUTION

Adam314

catalini

ASKER

PERFECT!