trevor1940
asked on
perl: REGEX on img alt tag
Hi
Given the code and HTML snippet below, I'm trying to extract the name from the img alt attribute.
I can isolate the alt attribute but can't get just the name!
I want $name = "trevor OBT tumblr_ozxualLdb1who6_540.jpg"
As this is HTML, I've no idea whether each line of the alt ends in "\n", and my split isn't working.
use strict; use warnings;
use HTML::TreeBuilder;
use HTML::Element;
my $body = HTML::TreeBuilder->new_from_file(*DATA);
my @A = $body->look_down('_tag', 'a');
for my $a (@A) {
    my $url = $a->attr('href');
    if ((defined($url)) && ($url =~ m/attachment/)) {
        print $url . "\n";
        my $img = $a->look_down('_tag', 'img');
        my $alt = $img->attr('alt');
        print "alt [" . $alt . "]\n"; ## works to here
        my @altBits = split(/nbsp/, $alt);
        foreach my $line (@altBits) {
            if ($line =~ m/Name:\s.*(.*)\&/i) {
                my $name = $1;
                print "name [$name]\n";
            }
        }
    }
    else {
        print $url . "\n";
    }
} # end for $A
print "Finished \n";
__DATA__
<div class="postbody">
<div class="postrow">
<div class="content">
<div id="post_message_180">
<blockquote class="postcontent restore ">
Trevor <br>
<a href="http://www.example.com/vboard/attachment.php?s=b31c60a8e6f7c723&attachmentid=104&d=623527"
id="attachment1040762" rel="Lightbox_1804154">
<img src="http://www.example.com/vboard/attachment.php?s=b31c60a8e6f7c723&attachmentid=1042&d=623527&thumb=1"
alt="Click image for larger version.
Name: trevor OBT tumblr_ozxualLdb1who6_540.jpg
Views: 287
Size: 109.2 KB
ID: 1040762" class="thumbnail" style="float:CONFIG" title="Click image for larger version.
Name: trevor OBT tumblr_ozxualLdb1who6_540.jpg
Views: 287
Size: 109.2 KB
ID: 1040762" border="0"></a>
</blockquote>
</div>
</div>
</div>
ASKER
I figured it out by doing a data dump on $alt, which revealed
"Click image for larger version.\xA0\n\nName:\ttrevor OBT tumblr_ozxualLdb1who6_540.jpg\xA0\nViews:\t287\xA0\nSize:\t109.2 KB\xA0\nID:\t1040762"
so my REGEX became
if ($alt =~ m/(?<=Name:\t)(.*)(?=\xA0)/i) {
    my $name = $1;
    print "name [$name]\n";
}
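For anyone who wants to reproduce this without the HTML, here is a minimal sketch using only core Perl. The string in $alt is copied from the dump above and stands in for what $img->attr('alt') returned; extract_name is a hypothetical helper name, not something from the original script. It uses a slightly different pattern than the lookaround version: anchor on the "Name:" line and capture everything up to the first \xA0 or newline.

```perl
use strict;
use warnings;
use Data::Dumper;

# Ask Data::Dumper for double-quoted output so that tabs, newlines and
# other non-printing characters are shown as escapes instead of raw bytes.
$Data::Dumper::Useqq = 1;

# Stand-in for $img->attr('alt'), copied from the dump in this comment.
my $alt = "Click image for larger version.\xA0\n\nName:\ttrevor OBT tumblr_ozxualLdb1who6_540.jpg\xA0\nViews:\t287\xA0\nSize:\t109.2 KB\xA0\nID:\t1040762";

# Inspect the string before writing a regex against it.
print Dumper($alt);

# Hypothetical helper: return the value of the "Name:" line, or undef.
sub extract_name {
    my ($text) = @_;
    # /m lets ^ match at the start of each line; the character class stops
    # at the first no-break space (\xA0) or newline after the name.
    return $text =~ /^Name:[ \t]*([^\xA0\n]*)/m ? $1 : undef;
}

my $name = extract_name($alt);
print "name [$name]\n" if defined $name;
```

The explicit [^\xA0\n] class avoids relying on \s, whose treatment of \xA0 depends on how the string is encoded internally.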
ASKER
Hi,
Please see last comment for solution
I forgot that &nbsp; becomes the hexadecimal \xA0 when using HTML::Element
Thanks for your help
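Since the parser hands back the decoded character rather than the literal text "nbsp", the original split(/nbsp/, $alt) never finds anything to split on. A rough sketch of one alternative, assuming the alt text looks like the dump in the earlier comment: split on newlines, strip the \xA0 characters, and collect the "Key: value" lines into a hash. parse_alt_fields is a made-up helper name for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the decoded alt text into a hash of fields.
sub parse_alt_fields {
    my ($text) = @_;
    my %field;
    for my $line (split /\n/, $text) {
        $line =~ s/\xA0//g;    # drop the decoded no-break spaces
        # Keep only "Key: value" lines; the key is everything before the colon.
        next unless $line =~ /^([^:]+):[ \t]*(.*)$/;
        $field{$1} = $2;
    }
    return \%field;
}

# Stand-in for $img->attr('alt'), copied from the dump in this thread.
my $alt = "Click image for larger version.\xA0\n\nName:\ttrevor OBT tumblr_ozxualLdb1who6_540.jpg\xA0\nViews:\t287\xA0\nSize:\t109.2 KB\xA0\nID:\t1040762";

my $fields = parse_alt_fields($alt);
print "name [$fields->{Name}]\n";
print "size [$fields->{Size}]\n";
```

This also gives you Views, Size and ID for free if you need them later.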
ASKER
Your regex works on plain text, but I think the lines below create objects, so you can't run a regex on an object
Doing this
Gives this error