catalini


extract data from a html table

I have some HTML files that each contain a table, delimited in the code with

<tbody>.....</tbody>

I would like to extract the values in each column and each row. The rows are delimited with <tr>.... </tr>

and my pattern is the following:

<td class="img"><a href="/page/abcd/"><img src="/static/asjsjs.png" alt="abcd123" /></a></td><td class="name"><a href="/gsgs/asdatacot/">abcddd</a></td><td class="date">Mar 30, 2008</td><td>dhhfdf</td><td class="pages">104</td></tr>

from each line like the one above I need to extract:

1) the 1st href link: /page/abcd/
2) the 2nd href link: /gsgs/asdatacot/    and its name  "abcddd"
3) the date: Mar 30, 2008
4) the column after the date: dhhfdf
5) the number of "pages": 104

What is the best way to do that with Perl?

thanks!!!!
Adam314

If you can count on the HTML being consistent (e.g. the first td always holds the first href, the second td the second href, the third the date, the fourth the column after the date, and the fifth the pages), then you could do it with a regex. If the data could be more varied, you would be better off with an HTML parser.
catalini

ASKER

it's always very consistent, what would be the correct regex? thanks

$_='<td class="img"><a href="/page/abcd/"><img src="/static/asjsjs.png" alt="abcd123" /></a></td><td class="name"><a href="/gsgs/asdatacot/">abcddd</a></td><td class="date">Mar 30, 2008</td><td>dhhfdf</td><td class="pages">104</td></tr>';
 
my ($h1, $h2, $date, $next, $page) =
 /href="(.*?)".*?href="(.*?)".*?date">(.*?)<\/td><td>(.*?)<\/td>.*?>(.*?)</;
 
print "h1=$h1\nh2=$h2\ndate=$date\nnext=$next\npage=$page\n";
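
The one-line regex above is dense and skips the link text ("abcddd") that item 2 of the question asks for. Below is a minimal sketch of the same idea, written with the /x flag so each piece can be commented, and with a check for rows that do not match. It assumes every row has exactly the layout shown in the question. The names $row, $row_re and $name, and the use of [^"]* and [^<]* in place of .*?, are my own choices and not from the thread.

```perl
use strict;
use warnings;

# Sample row copied from the question.
my $row = '<td class="img"><a href="/page/abcd/"><img src="/static/asjsjs.png" alt="abcd123" /></a></td><td class="name"><a href="/gsgs/asdatacot/">abcddd</a></td><td class="date">Mar 30, 2008</td><td>dhhfdf</td><td class="pages">104</td></tr>';

# Same approach as the one-liner, split up with /x and anchored on the
# class attributes. Assumes the row layout never changes.
my $row_re = qr{
    href="([^"]*)"                   # 1: first link
    .*?
    href="([^"]*)"\s*>([^<]*)</a>    # 2: second link, 3: its text
    .*?
    class="date">([^<]*)</td>        # 4: date
    \s*<td>([^<]*)</td>              # 5: column after the date
    .*?
    class="pages">([^<]*)</td>       # 6: page count
}xs;

my ($h1, $h2, $name, $date, $next, $pages);
if ( ($h1, $h2, $name, $date, $next, $pages) = $row =~ $row_re ) {
    print "h1=$h1\nh2=$h2\nname=$name\ndate=$date\nnext=$next\npages=$pages\n";
}
else {
    # The list assignment yields 0 captures when the row does not match.
    warn "row did not match the expected layout\n";
}
```

Using m{...} or qr{...} as the delimiter means the "/" in closing tags does not need escaping.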


thanks adam! and to go through all the files in a directory and save a file for each of them only with the cleaned output?
it should only extract the pattern after a ...

local $/;
foreach <*> {
    open(my $in, "<$_") or warn "could not open $_: $!\n",next;
    open(my $out, ">$_.results") or warn "could not open $_.results: $!\n",next;
    my $data=<$in>;
    $data =~ s/.*?<tbody>(.*?)</tbody>.*/$1/s;
    while($data =~ /href="(.*?)".*?href="(.*?)".*?date">(.*?)<\/td><td>(.*?)<\/td>.*?>(.*?)</g) {
        print $out "h1=$1\nh2=$2\ndate=$3\nnext=$4\npage=$5\n";
    }
    close($in);
    close($out);
}


I receive this error

Scalar found where operator expected at cleaner.pl line 6, near "s/.*?(
.*?).*/$1"
syntax error at believers.pl line 2, near "foreach <*>"
syntax error at believers.pl line 6, near "s/.*?(.*?).*/$1"
Execution of cleaner.pl aborted due to compilation errors.
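
Both syntax errors point at the script as posted. foreach needs parentheses around its list, so "foreach <*>" does not compile. The unescaped "/" in the closing tbody tag also ends the s/// pattern early. A third problem would only show up at run time: in "warn ..., next" the next runs before warn gets a chance to print. The following is a hedged sketch of one way to fix these. The helper name extract_rows and the '*.html' glob are my assumptions; the .results suffix and the output format come from the script above. Like the original, it assumes each row sits on a single line.

```perl
use strict;
use warnings;

# extract_rows is a hypothetical helper name, not from the thread.
# It returns one formatted record per table row found inside <tbody>.
sub extract_rows {
    my ($html) = @_;

    # Keep only the table body; m{...} avoids escaping the "/" in </tbody>.
    my ($body) = $html =~ m{<tbody>(.*?)</tbody>}s or return;

    my @records;
    while ( $body =~ m{href="(.*?)".*?href="(.*?)".*?date">(.*?)</td><td>(.*?)</td>.*?>(.*?)<}g ) {
        push @records, "h1=$1\nh2=$2\ndate=$3\nnext=$4\npage=$5\n";
    }
    return @records;
}

# Assumption: the input files end in .html. A bare <*> would also pick up
# the script itself and any .results files left by an earlier run.
foreach my $file ( glob '*.html' ) {
    open( my $in, '<', $file )
        or do { warn "could not open $file: $!\n"; next };
    my $data = do { local $/; <$in> };    # slurp the whole file
    close $in;

    open( my $out, '>', "$file.results" )
        or do { warn "could not open $file.results: $!\n"; next };
    print {$out} extract_rows($data);
    close $out;
}
```

Keeping the extraction in its own sub makes it easy to try on a single string before running it over a whole directory.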
ASKER CERTIFIED SOLUTION
Adam314

PERFECT!