trevor1940

perl: REGEX on img alt tag

Hi

Given the code and HTML snippet below, I'm trying to extract the name from the img alt attribute.

I can isolate the alt attribute but can't get just the name!

I want $name = "trevor OBT tumblr_ozxualLdb1who6_540.jpg"

As this is HTML, I have no idea whether each line of the alt text ends in "\n", and my split isn't working.

use strict; use warnings;
use HTML::TreeBuilder;
use HTML::Element;


my $body =HTML::TreeBuilder->new_from_file(*DATA);
  my @A = $body -> look_down('_tag', 'a');
  for my $a (@A){
    my $url = $a->attr('href'); 
    if((defined($url)) && ($url=~m/attachment/)  ){
        print  $url ."\n";
        my $img = $a -> look_down('_tag', 'img');
        my $alt = $img->attr('alt'); 
        print "alt [" . $alt . "]\n";  ##  works to here
        my @altBits = split(/nbsp/,$alt);
        foreach my $line (@altBits){
            if ($line =~ m/Name:\s.*(.*)\&/i){
                my $name =$1;
                print "name [$name]\n";                
                }
            }

         
        }
     else   {
                    print $url ."\n";
        }
    }# end for $A
print "Finished \n";

__DATA__
<div class="postbody">
			<div class="postrow">
				<div class="content">
					<div id="post_message_180">
						<blockquote class="postcontent restore ">
							Trevor <br>
<a href="http://www.example.com/vboard/attachment.php?s=b31c60a8e6f7c723&amp;attachmentid=104&amp;d=623527" 
id="attachment1040762" rel="Lightbox_1804154">
<img src="http://www.example.com/vboard/attachment.php?s=b31c60a8e6f7c723&amp;attachmentid=1042&amp;d=623527&amp;thumb=1" 
alt="Click image for larger version.&nbsp;

Name:	trevor OBT tumblr_ozxualLdb1who6_540.jpg&nbsp;
Views:	287&nbsp;
Size:	109.2 KB&nbsp;
ID:	1040762" class="thumbnail" style="float:CONFIG" title="Click image for larger version.&nbsp;

Name:	trevor OBT tumblr_ozxualLdb1who6_540.jpg&nbsp;
Views:	287&nbsp;
Size:	109.2 KB&nbsp;
ID:	1040762" border="0"></a>
						</blockquote>
					</div>

					
				</div>
			</div>
		


ASKER CERTIFIED SOLUTION
Shaun Vermaak
This solution is only available to members.
trevor1940

ASKER

Hi,

Your regex works on plain text, but I think the lines below create objects, so you can't run a regex on an object.

        my $img = $a -> look_down('_tag', 'img');
        my $alt = $img->attr('alt'); 



Doing this

        my $img = $a -> look_down('_tag', 'img');
        my $alt = $img->attr('alt')->as_text;



Gives this error

http://www.example.com/vboard/attachment.php?s=b31c60a8e6f7c723&attachmentid=104&d=623527
Can't locate object method "as_text" via package "Click image for larger version.á

Name:   trevor OBT tumblr_ozxualLdb1who6_540.jpgá
Views:  287á
Size:   109.2 KBá
ID:     1040762" (perhaps you forgot to load "Click image for larger version.á

Name:   trevor OBT tumblr_ozxualLdb1who6_540.jpgá
Views:  287á
Size:   109.2 KBá
ID:     1040762"?) at D:\PerlScripts\GetAlt.pl line 15.


I figured it out by doing a data dump on $alt to reveal

"Click image for larger version.\xA0\n\nName:\ttrevor OBT tumblr_ozxualLdb1who6_540.jpg\xA0\nViews:\t287\xA0\nSize:\t109.2 KB\xA0\nID:\t1040762"

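The dump is what exposes the invisible character: by the time attr('alt') returns, the &nbsp; entity from the markup has already been decoded to the single character \xA0, so a split on the literal text "nbsp" finds nothing to split on. A minimal sketch of a core-only way to make such characters visible follows. The helper name show_hidden and the shortened sample string are assumptions for illustration, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character outside printable ASCII
# (tabs, newlines, no-break spaces, ...) with a visible \x{..} escape.
sub show_hidden {
    my ($s) = @_;
    $s =~ s/([^\x20-\x7E])/sprintf('\\x{%X}', ord $1)/ge;
    return $s;
}

# Shortened sample value (an assumption, modelled on the alt text above).
my $alt = "Name:\ttrevor.jpg\xA0\nViews:\t287";

print show_hidden($alt), "\n";
# prints: Name:\x{9}trevor.jpg\x{A0}\x{A}Views:\x{9}287
```

Printing the escaped form before writing a regex shows exactly which separators are present, here a tab after "Name:" and \xA0 followed by a newline at the end of each field.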


so my REGEX became

            if ($alt =~ m/(?<=Name:\t)(.*)(?=\xA0)/i){
                my $name =$1;
                print "name [$name]\n";                
                }

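For comparison, the first attempt, m/Name:\s.*(.*)\&/i, could not succeed for two reasons: the greedy .* swallows the rest of the line and leaves the capture group empty, and \& demands a literal ampersand that no longer exists once the entity has been decoded. The lookaround version above works because . does not cross a newline. An alternative sketch, assuming a hypothetical helper named extract_name, captures everything up to the first \xA0 or line break, so it would also cope with a field that has no trailing no-break space:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): pull the file name out of
# the decoded alt text. The value ends at the first no-break space (\xA0) or
# line break, whichever comes first.
sub extract_name {
    my ($alt) = @_;
    return unless defined $alt;
    if ($alt =~ /Name:[ \t]*([^\xA0\r\n]+)/) {
        my $name = $1;
        $name =~ s/[ \t]+\z//;    # drop trailing spaces or tabs, if any
        return $name;
    }
    return;
}

# Sample value copied from the data dump in this thread.
my $alt = "Click image for larger version.\xA0\n\nName:\ttrevor OBT tumblr_ozxualLdb1who6_540.jpg\xA0\nViews:\t287\xA0\nSize:\t109.2 KB\xA0\nID:\t1040762";

my $name = extract_name($alt);
print defined $name ? "name [$name]\n" : "no name found\n";
# prints: name [trevor OBT tumblr_ozxualLdb1who6_540.jpg]
```

In the original loop this would be called as extract_name($img->attr('alt')). The error message in this thread shows that attr('alt') returns a plain string rather than an object, so it can be matched directly.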

Hi,
Please see my last comment for the solution.
I forgot that &nbsp; becomes the character \xA0 (hexadecimal A0) when using HTML::Element.

Thanks for your help