Substituting occurrences of a word only if they are not in a tag

thaimin used Ask the Experts™
I currently have a program that takes an entire HTML document and replaces one word with another. The problem is that if the word appears in the source of an image tag, the image source is no longer right. The translating line is currently:

$d =~ s/$lookfor/$replace/gim;

How could I make it so it won't replace the word if it's inside a tag, or between script, title, or other such tags?

You can use an HTML parsing library to handle this. That would be easier than trying to write one really complex regular expression. I have some code for that somewhere and will post it soon.
I answered something similar in another thread. You can use HTML::TreeBuilder or another HTML library, whichever you feel more comfortable with, and apply regular expressions only to the src attribute.
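
To show why a single regular expression gets awkward, here is a rough regex-only sketch that splits the document into tags and text and substitutes only in the text. It is not from this thread: replace_outside_tags and the sample input are made up for illustration, and it assumes no ">" appears inside attribute values. It also still replaces words inside script and title content, which is exactly the part a real parser handles better.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): replace $lookfor with $replace
# only in the text that sits outside <...> tags.
sub replace_outside_tags {
    my ($html, $lookfor, $replace) = @_;

    # The capturing group makes split return the tags as well.
    # Text lands at even indices, tags at odd indices.
    my @chunks = split /(<[^>]*>)/, $html;

    for my $i (0 .. $#chunks) {
        next if $i % 2;    # a tag: leave it untouched
        # \Q...\E treats $lookfor as a literal word, not a pattern.
        $chunks[$i] =~ s/\Q$lookfor\E/$replace/gi;
    }
    return join '', @chunks;
}

# Made-up sample document.
my $d      = '<p>A cat</p><img src="cat.png">';
my $result = replace_outside_tags($d, 'cat', 'dog');
print "$result\n";
```

The image source keeps its original file name because the whole img tag is one odd-numbered chunk that the loop skips.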

If you want to dump the result to a file, code like ...

open(my $fh, '>', 'output.txt') or die "Cannot open output.txt: $!";

should do the job. But I think that is still easier...
Look at that page on CPAN...

I am pretty sure the look_down method is what you need for your application. I don't fully understand what you have to do, but it seems look_down could do it.

And that works fine for this one example, but unless all thousand of your press releases have "Schreck" in the headline, that's just not a general solution. However, if all the ads-in-"h1"s that you want to exclude involve a link whose URL involves "/dyna/", then you can use that:

 my $real_h1 = $tree->look_down(
   '_tag', 'h1',
   sub {
     my $link = $_[0]->look_down('_tag','a');
     return 1 unless $link;
       # no link means it's fine
     return 0 if $link->attr('href') =~ m{/dyna/};
       # a link to there is bad
     return 1; # otherwise okay
   }
 );

A long time ago I used HTML::LinkExtor; it also seems to be a really good library.

Look at these three links if you are interested.

Anyway, I am happy to answer any questions you may have about the above code.

Hope that helps.



Thanks, it seems to be working now, but could you help me with a few problems that I'm having? It might be that I'm taking the wrong approach. Basically what I want to do is "highlight" all occurrences of the same words in an HTML document. My script at the end is now:

my $tree = HTML::TreeBuilder->new();
$tree->parse($d); # HTML source is in $d
$tree->eof;       # signal end of input so the tree is complete
my $body = $tree->look_down('_tag', 'body');
for (my $i = 0; $i < scalar(@finalValues); $i++) {
     $lookfor = $finalValues[$i];
     $replace = "<b style=\"color:black;background-color:$refrence{$lookfor}\">$lookfor</b>"; # This is the highlight
     foreach my $item_r ($body->content_refs_list) {
             next if ref $$item_r; # skip child elements, only touch plain text
             $$item_r =~ s/$lookfor/$replace/gim;
     }
}
print $tree->as_HTML;
$tree = $tree->delete;

The problems I'm having are that as_HTML turns the <> into &..., so I would either need to make each ~text element a ~literal or split all the text apart and push the <B> tag in. The other problem is that content_refs_list only returns immediate children, not children of children, so I only get the text that's directly under the body.

Thanks for helping so far, and I would really like it if you could answer these too. Thanks again.


Actually, I found a way to turn all the ~text objects into ~literal objects and it worked well:

my @texts = $body->look_down("_tag","~text");
for (my $i = 0; $i < scalar(@texts); $i++) {

But I would still appreciate it if you could help with searching the grandchildren. Thanks a lot.
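
One possible way to reach the grandchildren is to recurse: for each child that is itself an element, call the same routine on it. The sketch below also takes the "split the text apart and push the tag in" route, building real <b> elements so that as_HTML has no markup in the text to escape. The helper highlight, the %color and %skip tables, and the sample input are all made up for illustration; detach_content and push_content are HTML::Element methods. Treat it as a starting point under those assumptions, not a finished solution.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

# Made-up sample input and colour table, for illustration only.
my $d = '<html><head><title>cat</title></head>'
      . '<body><p>A cat <i>and a Cat</i></p><img src="cat.png"></body></html>';
my %color = (cat => 'yellow');

# Elements whose text should never be highlighted (an assumption).
my %skip = map { $_ => 1 } qw(script style title);

# Hypothetical helper: wrap every match of $word in a <b> element,
# descending into children, grandchildren, and so on.
sub highlight {
    my ($elem, $word, $bg) = @_;
    return if $skip{ $elem->tag };

    my @new;
    for my $child ($elem->detach_content) {
        if (ref $child) {    # an element: recurse into it, then keep it
            highlight($child, $word, $bg);
            push @new, $child;
            next;
        }
        # Plain text: the capturing split puts the matches at odd indices.
        my @parts = split /(\Q$word\E)/i, $child;
        for my $i (0 .. $#parts) {
            next if $parts[$i] eq '';
            if ($i % 2) {
                my $mark = HTML::Element->new('b',
                    style => "color:black;background-color:$bg");
                $mark->push_content($parts[$i]);
                push @new, $mark;
            }
            else {
                push @new, $parts[$i];
            }
        }
    }
    $elem->push_content(@new);    # reattach the rebuilt child list
}

my $tree = HTML::TreeBuilder->new();
$tree->parse($d);
$tree->eof;

my $body = $tree->look_down('_tag', 'body');
highlight($body, $_, $color{$_}) for keys %color;

# An empty hash as the third argument asks for all end tags to be written.
my $out = $tree->as_HTML('<>&', undef, {});
print "$out\n";
$tree->delete;
```

Because the word only ever ends up in text nodes, the src attribute and the title are left as they were. With several words, a later pass also descends into the <b> elements created by earlier passes, which matters only if one search word contains another.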


This really worked out, thanks a lot. If you could help me answer one question I have about HTML::Element now, there are more points.
It's at:
