RegExp for highlighting keywords, except HTML tag names or tag attributes

Hi,

I want to highlight a given text (needle) within another text (haystack). The issue here is that the haystack may also contain HTML/XML tags with numerous attributes. By "highlight" I mean wrapping each occurrence of "keyword" in a highlighting tag, for example. But this breaks everything when the haystack turns out to be something like

Hello, my name is <a href="mail@keyword.net">mr. keyword</a>.

While substituting the second "keyword" is totally OK, the first one should NOT be matched. It is like highlighting the keyword "IMG", but only when it appears as text, not when it is a tag like <IMG />.

Is it even possible with RegExp? Maybe by counting "<" and ">" somehow? I plan to use it in PHP (preg_replace).

Thanks in advance :)

ljubiccica

Hey there!

If you need to get rid of the tags, you can try the strip_tags function.

http://www.php.net/manual/en/function.strip-tags.php

BUT, as the manual says: "Because strip_tags() does not actually validate the HTML, partial or broken tags can result in the removal of more text/data than expected." Better check twice whether this is what you need.

Greets
Ljubiccica

Jason Minton

This example seems to work:

<?php
$text = 'Hello, my name is <a href="mail@keyword.net">mr. keyword</a>.';
$regex = '/(.*)\b(keyword)\b(.*)/i';
$replace = "$1$2$3";

$newtext = preg_replace($regex, $replace, $text);

print $newtext;
print "\n";

?>

Jason Minton

[root@ip-208-109-106-75 ~]# php test.php
Hello, my name is <a href="mail@keyword.net">mr. keyword</a>.

ddrudik

jasonbytes, matching HTML is problematic, consider the source:
Hello, my name is keyword<a href="mail@keyword.net">keyword mr. keyword</a>keyword
and the results with your pattern:
Hello, my name is keyword<a href="mail@keyword.net">keyword mr. keyword</a>keyword

See this previous solution (not tested):
https://www.experts-exchange.com/questions/21409201/Greedy-regex-in-php.html

Zvonko

I have another trick: I add one leading ">" char and one trailing "<" char to the text. Then I can require every matching keyword to come after a ">" char and before a "<" char, with no such chars in between.
Is that an option for you? The expression is simple then.
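
A minimal sketch of how this trick could look in PHP, not tested against real-world pages. The function name highlight_outside_tags and the <mark> highlight tag are assumptions made for illustration. Only the "before <" half is checked, with a lookahead, because PCRE does not accept an unbounded lookbehind such as (?<=>[^<>]*); for well-formed markup the lookahead alone should be enough.

```php
<?php
// Hypothetical helper (not from the thread): highlight $needle only where
// it appears as text, never inside a tag or an attribute value.
function highlight_outside_tags(string $haystack, string $needle): string
{
    // The trick: make the whole text look as if it sat between two tags.
    $wrapped = '>' . $haystack . '<';

    // A match counts as text when the next angle bracket after it is "<".
    // Inside a tag, the next angle bracket is the closing ">" instead.
    $pattern = '/\b' . preg_quote($needle, '/') . '\b(?=[^<>]*<)/i';

    // <mark> is an assumed highlight tag.
    $result = preg_replace($pattern, '<mark>$0</mark>', $wrapped) ?? $wrapped;

    // Remove the two helper characters again.
    return substr($result, 1, -1);
}

echo highlight_outside_tags(
    'Hello, my name is <a href="mail@keyword.net">mr. keyword</a>.',
    'keyword'
), "\n";
```

Because of the added ">" and "<", text that starts or ends directly with a tag, or contains no tag at all, is handled the same way. Stray "<" or ">" characters that are not part of a tag would confuse this check.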

AmigoJack

ASKER

@ljubiccica:
Of course stripping the HTML tags would make things a lot easier. But this is not an option; they have to be preserved.

@jasonsbytes:
This works nearly perfectly at first sight. The only disadvantage right now is that the whole text might start and/or end directly with an HTML tag, like "Hi, my name is.....".

@ddrudik:
Thanks for the link. I tested Batalf's expression "/([^\w<])(table)([^\w>])/si" and it worked the same way as jasonsbytes' one. How come? I expected Batalf's expression to be superior to jasonsbytes' one, but I don't really understand the "\b" in there. I always expected "(.*)" to be too greedy, so I'm really curious how both turn out to work the same. BTW: this one also fails if the text starts directly with an HTML tag.

The JavaScript solution is worth considering; I hadn't thought about that. This way I could at least relieve the server (i.e. PHP) of the computation, and each client can choose for itself whether it wants highlighted keywords or not (i.e. by turning JavaScript on or off).

@Zvonko:
Yes, I already stumbled over that. In fact it shouldn't be an issue to add a prefix and a suffix to the whole text, as long as every keyword is found, also in nested tags, like "><table><tr><td>Hi, my keywords are:</td><td>mr. keyword</td></tr></table>found all the keywords?<".

ddrudik

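\b matches a word boundary: the position between a \w character and a \W character, or between a \w character and the start or end of the string. It does not consume any characters itself.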
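
A small sketch to make the word boundary behaviour visible. The sample string is an assumption chosen for illustration; it contains "keyword" as a word, inside a longer word, and inside an e-mail address.

```php
<?php
// Assumed sample text, not taken from the thread.
$subject = 'keyword keywords mail@keyword.net';

// PREG_OFFSET_CAPTURE returns each match as [text, byte offset].
preg_match_all('/\bkeyword\b/', $subject, $matches, PREG_OFFSET_CAPTURE);

// Keep only the offsets of the matches.
$offsets = array_column($matches[0], 1);

print_r($offsets);
```

The plural "keywords" is skipped because the trailing "s" is a \w character, so there is no boundary after "keyword". The occurrence inside the address is matched, because "@" and "." are \W characters. That is why \b alone cannot keep a match out of an attribute value.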

ljubiccica

Well, my idea was:
1.) get rid of the tags -> you get all the words that need to be highlighted -> you save them somewhere
2.) you search for these words and
3.) highlight them

Greets

AmigoJack

ASKER

Thanks. I also just figured out that jasonsbytes' solution only matches the LAST occurrence, while Batalf's one matches all of them. So far so good :-)
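
A short sketch of why the "(.*)" pattern only reaches the last occurrence: the leading greedy group swallows as much as it can, so the whole string is consumed by a single match. The sample text and the <mark> tag are assumptions for illustration.

```php
<?php
// Assumed sample text with two occurrences of the keyword.
$text = 'keyword one keyword two';

// Greedy "(.*)" in front: one match covers the entire string, and the
// keyword group ends up on the last occurrence.
$greedy = preg_replace('/(.*)\b(keyword)\b(.*)/i', '$1<mark>$2</mark>$3', $text);

// Without the surrounding groups, every occurrence is its own match.
$each = preg_replace('/\b(keyword)\b/i', '<mark>$1</mark>', $text);

echo $greedy, "\n";
echo $each, "\n";
```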

One final issue would be advanced keywords themselves, like "am*k" as needle and this as haystack: "Hi, my name is AmigoJack. I am ok and do never run amok, never!" With Batalf's expression the match is quite greedy, and as you might imagine, I want all three occurrences to be matched, rather than "ame is AmigoJack. I am ok and do never run amok"...
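
A sketch of the difference between a greedy and a lazy wildcard on this haystack. Anchoring the needle at a word start with \b and using the lazy ".*?" are assumptions about what the wildcard should mean, not something the thread settles.

```php
<?php
$haystack = 'Hi, my name is AmigoJack. I am ok and do never run amok, never!';

// Greedy: starts at the "am" inside "name" and runs to the last "k".
preg_match_all('/am.*k/i', $haystack, $greedy);

// Lazy and anchored at a word start: stops at the first "k" it reaches.
preg_match_all('/\bam.*?k/i', $haystack, $lazy);

print_r($greedy[0]);
print_r($lazy[0]);
```

The lazy variant yields the three occurrences asked for: "AmigoJack", "am ok" and "amok".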

ddrudik

AmigoJack, with preg_replace and a pattern of:
"/(?<=^|[> ])(keyword)(?=$|[< ])/si"

And a replacement pattern of:
$1

And the source text of:
keyword Hello, my name is <a href="mail@keyword.net">keyword mr. keyword</a>keyword

Results in:
keyword Hello, my name is <a href="mail@keyword.net">keyword mr. keyword</a>keyword

ddrudik

AmigoJack, example code for testing:

<?php
error_reporting(E_ALL);
$TXT = <<<EOF
keyword Hello, my name is <a href="mail@keyword.net">keyword mr. keyword</a>keyword
EOF;
$pattern = '/(?<=^|[> ])(keyword)(?=$|[< ])/is';
$repl = '$1';
echo preg_replace($pattern, $repl, $TXT );
?>
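
To see which occurrences the pattern above actually picks, here is a variant of the test with a visible replacement. The <mark> tag and the second test string with an alt attribute are assumptions for illustration; the pattern itself is unchanged.

```php
<?php
$txt = 'keyword Hello, my name is <a href="mail@keyword.net">keyword mr. keyword</a>keyword';
$pattern = '/(?<=^|[> ])(keyword)(?=$|[< ])/is';

// The fifth argument receives the number of replacements made.
$out = preg_replace($pattern, '<mark>$1</mark>', $txt, -1, $count);
echo $out, "\n", $count, "\n";

// Assumed extra case: an attribute value that contains spaces.
$attr = '<img alt="my keyword here"> keyword';
$outAttr = preg_replace($pattern, '<mark>$1</mark>', $attr);
echo $outAttr, "\n";
```

The address in the href is left alone because "keyword" follows "@" there. The second case shows a limit of looking only at the neighbouring characters: a keyword surrounded by spaces inside an attribute value is still matched, so markup ends up inside the tag.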

AmigoJack

ASKER

Sorry for being absent for a while. In my post "07.23.2007 at 05:05PM CEST, ID: 19548104" I of course meant "am.*k" as the keyword, which can lead to a further problem. To be more precise:

The search words are entered by a user browsing a website, like when you do a search on a bulletin board. Of course, most people enter only one keyword, or two or three. I have already solved the quoted-search issue (to also include blank spaces within search terms). What remains is a generic wildcard, the *. So when I search for "am*k", the resulting regex would be something like "am.*k". A good example of what a message could look like is this:

"Hi, my name is <a href="amigojack@none.net">AmigoJack</a>. I am ok and do never run amok, never! Signed: AmigoJack"

All the issues show up here:
- "amigojack" within HTML should not be found (as it breaks the HTML)
- "AmigoJack" at the end of the text should be found
- "AmigoJack" between the HTML tags should be found
- because of the wildcard "am ok" and "amok" should also be found

Greediness is the least of my problems; I'm already happy if no HTML breaks, even if the last occurrence would be "am ok and do never run amok, never! AmigoJack".
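
A sketch that combines the four requirements above, assuming PHP 7.4 or later. The helper names wildcard_to_pattern and highlight_wildcard and the <mark> tag are hypothetical. The "*" is translated to a lazy run of non-bracket characters so that a match cannot cross a tag, and the text is wrapped in ">" and "<" as suggested earlier in the thread.

```php
<?php
// Hypothetical helper: turn a needle with "*" wildcards into a PCRE pattern.
function wildcard_to_pattern(string $needle): string
{
    // Escape the literal parts so user input cannot inject regex syntax.
    $parts = array_map(
        fn(string $part): string => preg_quote($part, '/'),
        explode('*', $needle)
    );

    // "*" becomes a lazy run of characters that are not angle brackets.
    $body = implode('[^<>]*?', $parts);

    // Only accept a match when the next angle bracket after it is "<".
    return '/\b' . $body . '(?=[^<>]*<)/i';
}

// Hypothetical helper: highlight all wildcard matches outside of tags.
function highlight_wildcard(string $haystack, string $needle): string
{
    $wrapped = '>' . $haystack . '<';
    $pattern = wildcard_to_pattern($needle);
    $result = preg_replace($pattern, '<mark>$0</mark>', $wrapped) ?? $wrapped;

    // Strip the helper characters again.
    return substr($result, 1, -1);
}

$message = 'Hi, my name is <a href="amigojack@none.net">AmigoJack</a>. '
    . 'I am ok and do never run amok, never! Signed: AmigoJack';

echo highlight_wildcard($message, 'am*k'), "\n";
```

The leading \b assumes the needle starts with a word character; a needle that begins with punctuation would need a different anchor.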

@ddrudik:
Thank you for your efforts so far; at first sight your last example worked very well for simple keywords. I would like to keep this open, maybe Zvonko shows up again. Adding ">" and "<" around my haystack would be no problem.

ASKER CERTIFIED SOLUTION

ddrudik

This solution is only available to members.

dagesi

Just curious (i.e. throwing in a monkey wrench): are you sure there will be no > or < that is not part of an HTML tag...?
For instance, this text when looking for 'Bob':
If my answer is < 12 and yet Bob has found > 17...

ddrudik

No worries, my pattern works in that case as well. AmigoJack, has the original question been answered?

dagesi

>ddrudik...
Not a pro at regexp, but won't this part:
(?<=^|[> ])
specifically avoid any instance contained within < and >...? Meaning in my example the Bob wouldn't be changed to Bob...?

ddrudik

<?php
$string = <<<EOF
Hi, my name is <a href="bob@none.net">If my answer is < 12 and yet Bob has found > 17...</a>. I am ok and do never run Bob, never! Signed: Bob
EOF;
$pattern = '/(?<=^|[> ])(Bob)(?=$|[^a-z])/is';
$repl = '$1';
echo preg_replace($pattern, $repl, $string );
?>

Result:
Hi, my name is <a href="bob@none.net">If my answer is < 12 and yet Bob has found > 17...</a>. I am ok and do never run Bob, never! Signed: Bob

The reason it found Bob correctly is that Bob followed a space in the text. The lookbehind accepts three cases: Bob starts the string, follows a >, or, as in this case, follows a space.
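
For comparison, a different technique that also copes with stray "<" and ">" in the text: split the haystack into tags and text chunks first, then replace only in the text chunks. This is not the pattern used above; the function name highlight_text_chunks, the <mark> tag and the simplified tag pattern are assumptions for illustration.

```php
<?php
// Hypothetical helper: highlight $needle only in the text between tags.
function highlight_text_chunks(string $html, string $needle): string
{
    // Simplified notion of a tag: "<", optional "/", a letter, then anything
    // up to the next ">". A lone "< 12" does not qualify.
    $tag = '/(<\/?[a-z][^<>]*>)/i';

    // With PREG_SPLIT_DELIM_CAPTURE the result alternates text, tag, text, ...
    // so even indexes are always text chunks.
    $chunks = preg_split($tag, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $word = '/\b' . preg_quote($needle, '/') . '\b/i';
    foreach ($chunks as $i => $chunk) {
        if ($i % 2 === 0) {
            $chunks[$i] = preg_replace($word, '<mark>$0</mark>', $chunk);
        }
    }

    return implode('', $chunks);
}

$string = 'Hi, my name is <a href="bob@none.net">If my answer is < 12 and yet '
    . 'Bob has found > 17...</a>. I am ok and do never run Bob, never! Signed: Bob';

echo highlight_text_chunks($string, 'Bob'), "\n";
```

The simplified tag pattern does not handle comments, or attribute values that themselves contain ">", and a stray "<" directly followed by a letter would still be mistaken for a tag.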

dagesi

>ddrudik...
Oh, that's right... the [> ] is a CHOICE of > or a space... thx for clearing that up...

AmigoJack

ASKER

Thanks to all who contributed their knowledge and hints. I decided to accept ddrudik's as the only answer because his answers were the most specific to my problem. Another plus is that he kept answering my questions in more and more detail, and I understand that this kind of effort cannot be taken for granted ;-) The chosen answer works best for my needs.

Have a nice weekend folks!

ddrudik

Thanks for the question and the points.