Substituting occurrences of a word only if they are not in a tag

I currently have a program that takes an entire HTML document and replaces one word with another. The problem is that if the word appears in the source of an image tag, the image source is no longer right. The line that does the replacement is currently:

$d =~ s/$lookfor/$replace/gim;

How could I make it skip the word when it is inside a tag (for example in an attribute), or between the opening and closing tags of script, title, or similar elements?

marti23esp

You can actually use an HTML parsing library to parse it. That would be easier than trying to write one really complex regular expression. I have some code for that somewhere. I'll post it for you soon.

ASKER CERTIFIED SOLUTION

marti23esp

thaimin

ASKER

Thanks, it seems to be working now, but I'm having a few problems that I'd like help with; it might be that I'm taking the wrong approach. Basically, what I want to do is "highlight" every occurrence of certain words in an HTML document. My script now ends with:

my $tree = HTML::TreeBuilder->new();
$tree->parse($d); #HTML source is in $d
$tree->eof();
my $body = $tree->look_down('_tag', 'body');
for (my $i = 0; $i < scalar(@finalValues); $i++) {
    $lookfor = $finalValues[$i];
    $replace = "<b style=\"color:black;background-color:$refrence{$lookfor}\">$lookfor</b>"; #This is the highlight
    foreach my $item_r ($body->content_refs_list) {
        next if ref $$item_r;
        $$item_r =~ s/$lookfor/$replace/gim;
    }
}
print $tree->as_HTML;
$tree = $tree->delete;

I'm having two problems. The first is that as_HTML turns the <> into &... entities, so I would either need to make each ~text element a ~literal, or split all the text apart and push the <B> tag in as an element. The second is that content_refs_list only covers immediate children, not children of children, so I only get the text that's directly under the body.

Thanks for helping so far. I would really appreciate it if you could answer these too. Thanks again.
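
A rough sketch of the "split the text apart and push the <B> tag in" idea from the post above, in case it helps. It is not the accepted solution (that one isn't shown in this thread). The helper name highlight_children and the sample document are made up for illustration. It rebuilds the content list of one element, wrapping each match in a real HTML::Element node so that as_HTML has no markup string to escape. Like the script above, it only touches immediate text children.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: wrap every match of $word found in the immediate
# text children of $elem in a real <b> element.
sub highlight_children {
    my ($elem, $word, $color) = @_;
    my @new_content;
    for my $child ($elem->content_list) {
        if (ref $child) {               # child element: keep it untouched
            push @new_content, $child;
            next;
        }
        # The capturing group makes split return the matches too; they
        # end up at the odd positions of @parts.
        my @parts = split /(\Q$word\E)/i, $child;
        for my $i (0 .. $#parts) {
            next if $parts[$i] eq '';
            if ($i % 2) {
                my $mark = HTML::Element->new('b',
                    style => "color:black;background-color:$color");
                $mark->push_content($parts[$i]);
                push @new_content, $mark;
            }
            else {
                push @new_content, $parts[$i];
            }
        }
    }
    $elem->detach_content;              # unlink the old children
    $elem->push_content(@new_content);  # attach the rebuilt list
    return $elem;
}

# Sample input, invented for this sketch.
my $tree = HTML::TreeBuilder->new_from_content(
    '<html><body>Perl and <i>perl</i> and PERL</body></html>');
highlight_children($tree->look_down(_tag => 'body'), 'perl', 'yellow');
print $tree->as_HTML, "\n";
$tree->delete;
```

Because the match is quoted with \Q...\E, the search word is treated as plain text rather than as a pattern. Drop the quoting if the words are meant to be regular expressions.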

thaimin

ASKER

Actually, I found a way to turn all the ~text objects into ~literal objects, and it worked well:

$tree->objectify_text();
my @texts = $body->look_down("_tag","~text");
for (my $i = 0; $i < scalar(@texts); $i++) {
    $texts[$i]->tag('~literal');
}
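
One thing to watch with the ~literal route: the text is written out exactly as stored, so any <, > or & that was in the original text is no longer escaped either. A possible way around that, sketched below, is to encode each piece of text before adding the highlight markup. The helper name highlight_literal and the sample document are invented for this example, and it assumes the colour value is trusted.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Entities qw(encode_entities);

# Hypothetical helper: highlight $word in every text node under $root by
# turning the nodes into ~literal pseudo-elements that hold safe markup.
sub highlight_literal {
    my ($root, $word, $color) = @_;
    $root->objectify_text;              # text strings become ~text nodes
    for my $node ($root->look_down(_tag => '~text')) {
        # Matches land at the odd positions because of the capture group.
        my @parts = split /(\Q$word\E)/i, $node->attr('text');
        my $out = '';
        for my $i (0 .. $#parts) {
            # Escape the raw text first, then add the markup around it.
            my $safe = encode_entities($parts[$i], '<>&');
            $out .= $i % 2
                ? qq{<b style="color:black;background-color:$color">$safe</b>}
                : $safe;
        }
        $node->attr(text => $out);
        $node->tag('~literal');         # as_HTML now emits the text as-is
    }
    return $root;
}

# Sample input, invented for this sketch.
my $tree = HTML::TreeBuilder->new_from_content(
    '<html><body>1 &lt; 2 and cat <i>Cat</i></body></html>');
highlight_literal($tree->look_down(_tag => 'body'), 'cat', 'yellow');
print $tree->as_HTML, "\n";
$tree->delete;
```

Since look_down visits every descendant, this version also reaches text in nested elements, including script or style blocks inside the part of the tree it is given.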

But I would still appreciate it if you could help with searching the grandchildren. Thanks a lot.
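
For the grandchildren part, one option is to call content_refs_list recursively. The sketch below is only an illustration: text_refs and the %skip list are made-up names, and it assumes objectify_text has not been called, so text nodes are still plain strings. It collects references to every text node below an element and leaves out the contents of script, style and title.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Elements whose text should be left alone (an assumption for this sketch).
my %skip = map { $_ => 1 } qw(script style title);

# Hypothetical helper: return references to all text nodes under $elem,
# however deeply they are nested.
sub text_refs {
    my ($elem) = @_;
    my @refs;
    for my $item_r ($elem->content_refs_list) {
        if (ref $$item_r) {             # child element: descend into it
            next if $skip{ $$item_r->tag };
            push @refs, text_refs($$item_r);
        }
        else {                          # plain text node
            push @refs, $item_r;
        }
    }
    return @refs;
}

# Sample input, invented for this sketch.
my $tree = HTML::TreeBuilder->new_from_content(
    '<html><head><title>cat</title></head>'
  . '<body>cat <div>a <i>cat</i> sat</div></body></html>');
for my $item_r (text_refs($tree)) {
    $$item_r =~ s/cat/dog/g;            # edits the tree in place
}
print $tree->as_HTML, "\n";
$tree->delete;
```

The returned references point into the tree itself, so the same substitution loop used with content_refs_list works on them unchanged.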

thaimin

ASKER

This really worked out, thanks a lot. If you could help me answer one question I have about HTML::Element now, there are more points.
It's at: https://www.experts-exchange.com/questions/20403280/A-way-to-use-content-refs-list-to-get-all-children-and-grandchildren.html