catalini
asked on
Extract data from an HTML table
I have some html files that have at some point a table, delimited in the code with
<tbody>.....</tbody>
I would like to extract the values in each column and each row. The rows are delimited with <tr>.... </tr>
and my pattern is the following:
<td class="img"><a href="/page/abcd/"><img src="/static/asjsjs.png" alt="abcd123" /></a></td><td class="name"><a href="/gsgs/asdatacot/">abcddd</a></td><td class="date">Mar 30, 2008</td><td>dhhfdf</td><td class="pages">104</td></tr>
from each line like the one above I need to extract:
1) the 1st href link: /page/abcd/
2) the 2nd href link: /gsgs/asdatacot/ and its name "abcddd"
3) the date: Mar 30, 2008
4) the column after the date: dhhfdf
5) the number of "pages": 104
what is the best way to do that with perl?
thanks!!!!
If you can count on the HTML being fairly consistent (e.g. the first td always has the first href, the second td has the second href, the third td has the date, the fourth has the column after the date, and the fifth has the pages), then you could do it with a regex. If the data could be more varied, then you should use an HTML parser.
ASKER
it's always very consistent, what would be the correct regex? thanks
$_='<td class="img"><a href="/page/abcd/"><img src="/static/asjsjs.png" alt="abcd123" /></a></td><td class="name"><a href="/gsgs/asdatacot/">abcddd</a></td><td class="date">Mar 30, 2008</td><td>dhhfdf</td><td class="pages">104</td></tr>';
my ($h1, $h2, $date, $next, $page) =
/href="(.*?)".*?href="(.*?)".*?date">(.*?)<\/td><td>(.*?)<\/td>.*?>(.*?)</;
print "h1=$h1\nh2=$h2\ndate=$date\nnext=$next\npage=$page\n";
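The same idea can be written more readably with the /x modifier and named captures. This is only a sketch, assuming Perl 5.10 or later and rows laid out exactly like the sample; the sub name parse_row and the capture names are made up for illustration, and it also picks up the link text that the question asked for.

```perl
use strict;
use warnings;

# Sample row taken from the question.
my $row = '<td class="img"><a href="/page/abcd/"><img src="/static/asjsjs.png" alt="abcd123" /></a></td><td class="name"><a href="/gsgs/asdatacot/">abcddd</a></td><td class="date">Mar 30, 2008</td><td>dhhfdf</td><td class="pages">104</td></tr>';

# /x lets the pattern be spread over several lines with comments.
my $row_re = qr{
    href="(?<link1>[^"]*)"                # 1) first link
    .*?
    href="(?<link2>[^"]*)">               # 2) second link
    (?<name>[^<]*)</a>                    #    and its text
    .*?
    class="date">(?<date>[^<]*)</td>      # 3) date
    <td>(?<next>[^<]*)</td>               # 4) column after the date
    .*?
    class="pages">(?<pages>\d+)           # 5) page count
}x;

# Hypothetical helper: returns a hash ref of fields, or nothing if the row does not match.
sub parse_row {
    my ($line) = @_;
    return unless $line =~ $row_re;
    return +{ %+ };    # copy the named captures
}

if (my $f = parse_row($row)) {
    print "$_=$f->{$_}\n" for qw(link1 link2 name date next pages);
}
```

Using [^"]* and [^<]* instead of .*? keeps each capture inside its own attribute or cell, which makes a mismatch fail quickly rather than swallow the next cell.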
ASKER
Thanks, Adam! How would I go through all the files in a directory and, for each one, save a file containing only the cleaned output?
ASKER
it should only extract the pattern after a ...
local $/;
foreach <*> {
open(my $in, "<$_") or warn "could not open $_: $!\n",next;
open(my $out, ">$_.results") or warn "could not open $_.results: $!\n",next;
my $data=<$in>;
$data =~ s/.*?<tbody>(.*?)</tbody>.*/$1/s;
while($data =~ /href="(.*?)".*?href="(.*?)".*?date">(.*?)<\/td><td>(.*?)<\/td>.*?>(.*?)</g) {
print $out "h1=$1\nh2=$2\ndate=$3\nnext=$4\npage=$5\n";
}
close($in);
close($out);
}
ASKER
I receive this error
Scalar found where operator expected at cleaner.pl line 6, near "s/.*?(
.*?).*/$1"
syntax error at believers.pl line 2, near "foreach <*>"
syntax error at believers.pl line 6, near "s/.*?(.*?).*/$1"
Execution of cleaner.pl aborted due to compilation errors.
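Both syntax errors appear to come from the script itself: foreach needs its list in parentheses, and the unescaped / in </tbody> ends the s/// pattern early. The warn "...", next form also runs next before warn gets a chance to print. Below is a minimal sketch of one way to fix these, not necessarily what the accepted answer did. The sub name clean_dir is made up for illustration, and unlike the original, files without a <tbody> are skipped rather than scanned whole.

```perl
use strict;
use warnings;

# Hypothetical helper: writes "<file>.results" for every plain file in $dir.
# Assumes $dir contains no whitespace, since glob splits its pattern on spaces.
sub clean_dir {
    my ($dir) = @_;
    local $/;    # slurp mode: read each file in one go

    for my $file (glob "$dir/*") {
        # Skip directories and output left over from an earlier run.
        next if !-f $file or $file =~ /\.results\z/;

        open(my $in, '<', $file)
            or do { warn "could not open $file: $!\n"; next };
        my $data = <$in>;
        close($in);

        # Keep only the table body; m{...} avoids escaping the / in </tbody>.
        next unless defined $data and $data =~ m{<tbody>(.*?)</tbody>}s;
        my $body = $1;

        open(my $out, '>', "$file.results")
            or do { warn "could not open $file.results: $!\n"; next };
        # Same row pattern as in the earlier answer, applied once per row.
        while ($body =~ /href="(.*?)".*?href="(.*?)".*?date">(.*?)<\/td><td>(.*?)<\/td>.*?>(.*?)</g) {
            print $out "h1=$1\nh2=$2\ndate=$3\nnext=$4\npage=$5\n";
        }
        close($out);
    }
    return;
}

# Process each directory named on the command line, e.g.: perl cleaner.pl .
clean_dir($_) for @ARGV;
```

Run without arguments it does nothing, so it will not touch the current directory by accident.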
ASKER CERTIFIED SOLUTION
This solution is only available to members of Experts Exchange.
ASKER
PERFECT!