• Status: Solved

Perl: regex on the img alt attribute

Hi

Given the code and HTML snippet below, I'm trying to extract the name from the alt attribute of the img tag.

I can isolate the alt text, but I can't get just the name!

I want $name = "trevor OBT tumblr_ozxualLdb1who6_540.jpg"

As this is HTML, I have no idea whether each line of the alt text ends in "\n", and my split isn't working.

use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

# Parse the HTML held in the __DATA__ section below.
my $body = HTML::TreeBuilder->new_from_file(*DATA);

# Collect every <a> element in the document.
my @A = $body->look_down('_tag', 'a');
for my $a (@A) {
    my $url = $a->attr('href');
    if (defined($url) && $url =~ m/attachment/) {
        print $url . "\n";
        my $img = $a->look_down('_tag', 'img');
        my $alt = $img->attr('alt');
        print "alt [" . $alt . "]\n";    ## works to here
        # The split and match below are the part that does not work.
        my @altBits = split(/nbsp/, $alt);
        foreach my $line (@altBits) {
            if ($line =~ m/Name:\s.*(.*)\&/i) {
                my $name = $1;
                print "name [$name]\n";
            }
        }
    }
    else {
        print $url . "\n";
    }
}    # end for @A
print "Finished \n";

__DATA__
<div class="postbody">
			<div class="postrow">
				<div class="content">
					<div id="post_message_180">
						<blockquote class="postcontent restore ">
							Trevor <br>
<a href="http://www.example.com/vboard/attachment.php?s=b31c60a8e6f7c723&amp;attachmentid=104&amp;d=623527" 
id="attachment1040762" rel="Lightbox_1804154">
<img src="http://www.example.com/vboard/attachment.php?s=b31c60a8e6f7c723&amp;attachmentid=1042&amp;d=623527&amp;thumb=1" 
alt="Click image for larger version.&nbsp;

Name:	trevor OBT tumblr_ozxualLdb1who6_540.jpg&nbsp;
Views:	287&nbsp;
Size:	109.2 KB&nbsp;
ID:	1040762" class="thumbnail" style="float:CONFIG" title="Click image for larger version.&nbsp;

Name:	trevor OBT tumblr_ozxualLdb1who6_540.jpg&nbsp;
Views:	287&nbsp;
Size:	109.2 KB&nbsp;
ID:	1040762" border="0"></a>
						</blockquote>
					</div>

					
				</div>
			</div>
		

Asked by: trevor1940

1 Solution

Shaun Vermaak (Technical Specialist/Developer) Commented:
What about this?
(?<=Name:\t).*(?=&nbsp;)


https://regex101.com/r/KVSFE5/2
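
The lookaround pattern above can be tried from Perl against the raw markup, where the entity is still the literal text `&nbsp;`. This is only a minimal sketch: the `$raw` string is a shortened stand-in for the page source, and the capture group is added here so the match can be assigned to a variable.

```perl
use strict;
use warnings;

# Shortened stand-in for the raw page source, before any HTML parser
# has decoded the entities.
my $raw = "Name:\ttrevor OBT tumblr_ozxualLdb1who6_540.jpg&nbsp;\nViews:\t287&nbsp;\n";

# The lookbehind anchors on "Name:<TAB>"; the lookahead stops before the
# literal entity text. "." does not match a newline, so the match stays
# on the Name line.
my ($name) = $raw =~ /(?<=Name:\t)(.*)(?=&nbsp;)/;

print "name [$name]\n";
```

Note that this relies on the text `&nbsp;` being present, which holds for the raw source but not necessarily for a value returned by an HTML parser.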
trevor1940 (Author) Commented:
Hi,

Your regex works on plain text, but I think the lines below create objects, so you can't run a regex on an object.

        my $img = $a -> look_down('_tag', 'img');
        my $alt = $img->attr('alt'); 

Doing this

        my $img = $a -> look_down('_tag', 'img');
        my $alt = $img->attr('alt')->as_text;

Gives this error

http://www.example.com/vboard/attachment.php?s=b31c60a8e6f7c723&attachmentid=104&d=623527
Can't locate object method "as_text" via package "Click image for larger version.á

Name:   trevor OBT tumblr_ozxualLdb1who6_540.jpgá
Views:  287á
Size:   109.2 KBá
ID:     1040762" (perhaps you forgot to load "Click image for larger version.á

Name:   trevor OBT tumblr_ozxualLdb1who6_540.jpgá
Views:  287á
Size:   109.2 KBá
ID:     1040762"?) at D:\PerlScripts\GetAlt.pl line 15.
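
The error text itself suggests that $alt is already a plain string: Perl reports the whole alt text as the "package" it tried to call as_text on, which is what happens when a method is called on a string rather than an object. A quick way to check is ref(), sketched below. The `$alt` value here is a hand-written stand-in for what `$img->attr('alt')` returned, so the snippet runs without HTML::TreeBuilder.

```perl
use strict;
use warnings;

# Hand-written stand-in for the value returned by $img->attr('alt').
my $alt = "Click image for larger version.\xA0\n\nName:\ttrevor OBT tumblr_ozxualLdb1who6_540.jpg\xA0\n";

# ref() returns an empty string for a plain scalar and the class name
# for a blessed object.
my $kind = ref($alt) ? 'object of class ' . ref($alt) : 'plain string';
print "alt is a $kind\n";

# A plain string can be matched directly; no as_text call is needed.
print "contains a Name line\n" if $alt =~ /^Name:/m;
```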

trevor1940 (Author) Commented:
I figured it out by doing a data dump on $alt, which revealed:

"Click image for larger version.\xA0\n\nName:\ttrevor OBT tumblr_ozxualLdb1who6_540.jpg\xA0\nViews:\t287\xA0\nSize:\t109.2 KB\xA0\nID:\t1040762"

So my regex became:

if ($alt =~ m/(?<=Name:\t)(.*)(?=\xA0)/i) {
    my $name = $1;
    print "name [$name]\n";
}
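
A possible variation, offered only as a sketch: a negated character class avoids the lookarounds and still matches if the trailing non-breaking space is ever missing. The helper name `name_from_alt` is made up for illustration, and `$alt` is the dumped string shown above written out as a double-quoted Perl string.

```perl
use strict;
use warnings;

# The dumped alt text, written out as a double-quoted Perl string.
my $alt = "Click image for larger version.\xA0\n\nName:\ttrevor OBT tumblr_ozxualLdb1who6_540.jpg\xA0\nViews:\t287\xA0\nSize:\t109.2 KB\xA0\nID:\t1040762";

# Hypothetical helper: capture everything after "Name:<TAB>" up to the
# first non-breaking space or newline; returns undef when nothing matches.
sub name_from_alt {
    my ($text) = @_;
    return $text =~ /Name:\t([^\xA0\n]+)/ ? $1 : undef;
}

my $name = name_from_alt($alt);
print defined $name ? "name [$name]\n" : "no name found\n";
```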

trevor1940 (Author) Commented:
Hi,
Please see my last comment for the solution.
I forgot that &nbsp; becomes the character \xA0 (a non-breaking space) when the attribute is read through HTML::Element.

Thanks for your help.
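
For anyone chasing a similar mismatch, the non-breaking space is easy to miss because it prints as a blank or as an odd glyph, depending on the console. The sketch below makes such characters visible without relying on a dump module. `show_hidden` is a made-up helper name, and `$alt` is a shortened stand-in for the real attribute value.

```perl
use strict;
use warnings;

# Shortened stand-in for the alt text; "\xA0" is the non-breaking space.
my $alt = "Name:\ttrevor OBT tumblr_ozxualLdb1who6_540.jpg\xA0\nViews:\t287\xA0\n";

# Hypothetical helper: replace every character outside printable ASCII
# with a hex escape so that it shows up on any console. Tabs and newlines
# are escaped too (as \x09 and \x0A).
sub show_hidden {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
    return $text;
}

print show_hidden($alt), "\n";
```

Once the separator is known to be \xA0 rather than the text "nbsp", both the split and the regex can be written against that character.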