AmigoJack
asked on
RegExp for highlighting keywords, except HTML tag names or tag attributes
Hi,
I want to highlight a given text (needle) within another text (haystack). The issue here is that the haystack may also contain HTML/XML tags with numerous attributes. By "highlight" I mean turning "keyword" into "<b>keyword</b>", for example. But this breaks everything when the haystack turns out to be something like
Hello, my name is <a href="mail@keyword.net">mr . keyword</a>.
While substituting the second "keyword" is totally ok, the first one should NOT be recognized. It is like highlighting the keyword "IMG" - but only when it is a text, not when it is a tag like <IMG />.
Is it even possible with RegExp? Maybe by counting "<" and ">" somehow? I plan to use it in PHP (preg_replace).
Thanks in advance :)
This example seems to work:
<?php
$text = 'Hello, my name is <a href="mail@keyword.net">mr . keyword</a>.';
$regex = '/(.*)\b(keyword)\b(.*)/i';
$replace = "$1<b>$2</b>$3";
$newtext = preg_replace($regex, $replace, $text);
print $newtext;
print "\n";
?>
[root@ip-208-109-106-75 ~]# php test.php
Hello, my name is <a href="mail@keyword.net">mr . <b>keyword</b></a>.
jasonbytes, matching HTML is problematic, consider the source:
Hello, my name is keyword<a href="mail@keyword.net">keyword mr. keyword</a>keyword
and the results with your pattern:
Hello, my name is keyword<a href="mail@keyword.net">keyword mr. keyword</a><b>keyword</b>
See this previous solution (not tested):
https://www.experts-exchange.com/questions/21409201/Greedy-regex-in-php.html
I have another trick. I add one leading ">" char and one trailing "<" char to the text. Then I can require every matching keyword to come after a > char and before a < char, with no such chars in between.
Is that an option for you? The expression is simple then.
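A minimal sketch of this sentinel idea, assuming PHP's preg_replace. The helper name highlight_outside_tags is hypothetical and not from the thread. The check is done with a lookahead, for which the trailing "<" alone is sufficient, so the sketch appends only that one sentinel. It assumes the text contains no stray "<" or ">" outside of real tags.

```php
<?php
// Hypothetical helper (not from the thread): highlight $needle only in text
// that lies outside of tags.
function highlight_outside_tags(string $html, string $needle): string
{
    // Trailing sentinel, so that text after the last tag is also followed by "<".
    $wrapped = $html . '<';

    // The lookahead demands that the next angle bracket after the match is "<".
    // Inside a tag the next bracket is ">", so those candidates are rejected.
    $pattern = '/(' . preg_quote($needle, '/') . ')(?=[^<>]*<)/i';
    $result = preg_replace($pattern, '<b>$1</b>', $wrapped);

    // Drop the sentinel again.
    return substr($result, 0, -1);
}

$text = 'Hello, my name is <a href="mail@keyword.net">mr . keyword</a>.';
echo highlight_outside_tags($text, 'keyword'), "\n";
```

With the sample haystack from the question, only the "keyword" between the tags should be wrapped; the one inside the href attribute is left alone. Text that starts or ends with a tag is handled as well, since the check only looks at what follows each match.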
ASKER
@ljubiccica:
Of course, stripping the HTML tags would make things a lot easier. But this is not an option; they have to be preserved.
@jasonsbytes:
This works nearly perfectly at first sight. The only disadvantage right now is that the whole text might start and/or end with an HTML tag, like "<p>Hi, my name is.....</p>".
@ddrudik:
Thanks for the link. I tested Batalf's expression "/([^\w<])(table)([^\w>])/si" and it worked the same way as jasonsbytes' one. How come? I expected Batalf's expression to be superior to jasonsbytes' one, but I don't really understand the "\b" in there. I always expected "(.*)" to be too greedy, so I'm really curious how both turn out to work the same. BTW: this one also fails if the text starts with an HTML tag.
The JavaScript solution is worth considering; I hadn't thought about that. This way I could at least relieve the server (i.e. PHP) of the computation, and each client can choose for itself whether it wants highlighted keywords or not (i.e. by turning JavaScript on or off).
@Zvonko:
Yes, I already stumbled over that. In fact, it shouldn't be an issue to add a prefix and a suffix to the whole text, as long as every keyword is found, including in nested tags, like "><table><tr><td>Hi, my keywords are:</td><td>mr. keyword</td></tr></table>found all the keywords?<".
\b matches a word boundary (between \w and \W characters).
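A small illustration of what that means for this question, using a made-up haystack: "@" and "." are \W characters, so \b also matches around the "keyword" inside an e-mail address. That is why \b on its own cannot protect attribute values.

```php
<?php
// Made-up haystack for illustration: a plain word, a longer word, an address.
$haystack = 'keyword keywords mail@keyword.net';

// PREG_OFFSET_CAPTURE returns each match together with its byte offset.
$count = preg_match_all('/\bkeyword\b/i', $haystack, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as [$text, $offset]) {
    echo $offset, ': ', $text, "\n";
}
```

"keywords" is skipped because "d" and "s" are both \w characters, so there is no boundary after "keyword" there. The occurrence inside the address is reported, since it sits between "@" and ".".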
Well, my idea was:
1.) to get rid of tags -> you get all words that need to be highlighted -> you save it somewhere
2.) you search these words and
3.) highlight them
Greets
ASKER
Thanks. I also just figured out that jasonsbytes' solution only matches the LAST occurrence, while Batalf's one does it for all. So far so good :-)
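The "last occurrence only" behaviour can be reproduced with a made-up sentence. This is only a sketch to show the effect of the greedy "(.*)"; the second pattern is a plain keyword match for comparison and does nothing to protect HTML tags.

```php
<?php
// Made-up haystack with three occurrences.
$text = 'keyword one, keyword two, keyword three';

// Greedy (.*) first swallows the whole string and then backtracks just far
// enough to find one "keyword", so the single match covers everything and
// only the last occurrence ends up wrapped.
$lastOnly = preg_replace('/(.*)\b(keyword)\b(.*)/i', '$1<b>$2</b>$3', $text);

// Matching the keyword alone lets preg_replace visit every occurrence.
$all = preg_replace('/\b(keyword)\b/i', '<b>$1</b>', $text);

echo $lastOnly, "\n", $all, "\n";
```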
One final issue would be advanced keywords themselves, like "am*k" as the needle and this as the haystack: "Hi, my name is AmigoJack. I am ok and do never run amok, never!" With Batalf's expression the match is quite greedy and, as you might imagine, I want all three occurrences to be matched, rather than matching "ame is AmigoJack. I am ok and do never run amok"...
AmigoJack, with preg_replace and a pattern of:
"/(?<=^|[> ])(keyword)(?=$|[< ])/si"
And a replacement pattern of:
<b>$1</b>
And the source text of:
keyword Hello, my name is <a href="mail@keyword.net">keyword mr. keyword</a>keyword
Results in:
<b>keyword</b> Hello, my name is <a href="mail@keyword.net"><b>keyword</b> mr. <b>keyword</b></a><b>keyword</b>
AmigoJack, example code for testing:
<?php
error_reporting(E_ALL);
$TXT = <<<EOF
keyword Hello, my name is <a href="mail@keyword.net">keyword mr. keyword</a>keyword
EOF;
$pattern = '/(?<=^|[> ])(keyword)(?=$|[< ])/is';
$repl = '<b>$1</b>';
echo preg_replace($pattern, $repl, $TXT );
?>
ASKER
Sorry for being absent for a while. In my post "07.23.2007 at 05:05PM CEST, ID: 19548104" I of course meant "am.*k" as the keyword, which leads to a further problem. To be more precise:
The search words are entered by a user browsing a website, like when you do a search on a bulletin board. Of course, most people enter only one keyword, or two or three. I have already dealt with quoted searches (which allow spaces within a search term). What remains is a generic wildcard, the *. So when I search for "am*k", the resulting regex would be something like "am.*k". A good example of what a message could look like is this:
"Hi, my name is <a href="amigojack@none.net">AmigoJack</a>. I am ok and do never run amok, never! Signed: AmigoJack"
All the issues are in here:
- "amigojack" within HTML should not be found (as it breaks the HTML)
- "AmigoJack" at the end of the text should be found
- "AmigoJack" between the HTML tags should be found
- because of the wildcard "am ok" and "amok" should also be found
Greediness is the least of my problems: I would already be happy if no HTML breaks, even if the last match turned out to be "am ok and do never run amok, never! AmigoJack".
@ddrudik:
Thank you for your efforts so far; your last example worked very well for simple keywords at first sight. I would like to keep this open, maybe Zvonko shows up again. Adding ">" and "<" around my haystack would be no problem.
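One possible way to combine the "*" wildcard with the sentinel idea, as a sketch only. The helper name wildcard_highlight is hypothetical. The leading \b and the lazy "[^<>]*?" in place of ".*" are assumptions made here so that a match starts at a word boundary, stays as short as possible and never crosses a tag; none of this is taken from the accepted answer. It also assumes there are no stray "<" or ">" characters in the text.

```php
<?php
// Hypothetical helper: translate a user needle with "*" wildcards into a
// pattern that only matches outside of tags.
function wildcard_highlight(string $html, string $needle): string
{
    // preg_quote() escapes "*" as "\*"; replace that with a lazy run of
    // characters that are not angle brackets, so a match cannot span a tag.
    $core = str_replace('\*', '[^<>]*?', preg_quote($needle, '/'));

    // Lookahead: the next angle bracket after the match must be "<".
    $pattern = '/\b(' . $core . ')(?=[^<>]*<)/i';

    // The trailing "<" is a sentinel for text after the last tag.
    $result = preg_replace($pattern, '<b>$1</b>', $html . '<');
    return substr($result, 0, -1);
}

$message = 'Hi, my name is <a href="amigojack@none.net">AmigoJack</a>. '
         . 'I am ok and do never run amok, never! Signed: AmigoJack';
$highlighted = wildcard_highlight($message, 'am*k');
echo $highlighted, "\n";
```

On the example message this should wrap both visible "AmigoJack" occurrences, "am ok" and "amok", while leaving the address inside the href untouched. Because of the \b, the "am" inside "name" does not start a match.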
Just curious (i.e., throwing in a monkey wrench): are you assured of having no > or < that is not part of an HTML tag...?
For instance this text if looking for 'Bob':
If my answer is < 12 and yet Bob has found > 17...
No worries, my pattern works in that case as well. AmigoJack, has the original question been answered?
>ddrudik...
Not a pro at regexp, but won't this part:
(?<=^|[> ])
specifically avoid any instance contained within < and >...? Meaning in my example the Bob wouldn't be changed to <b>Bob</b>...?
<?php
$string = <<<EOF
Hi, my name is <a href="bob@none.net">If my answer is < 12 and yet Bob has found > 17...</a>. I am ok and do never run Bob, never! Signed: Bob
EOF;
$pattern = '/(?<=^|[> ])(Bob)(?=$|[^a-z])/is';
$repl = '<b>$1</b>';
echo preg_replace($pattern, $repl, $string );
?>
Result:
Hi, my name is <a href="bob@none.net">If my answer is < 12 and yet <b>Bob</b> has found > 17...</a>. I am ok and do never run <b>Bob</b>, never! Signed: <b>Bob</b>
The reason it found Bob correctly is that Bob followed a space in the text. The lookbehind accepts Bob at the start of the string, after a >, or, as in this case, after a space.
>ddrudik...
Oh, that's right... the [> ] is a CHOICE of > or a space... thanks for clearing that up...
ASKER
Thanks to all who contributed their knowledge and hints. I decided to accept ddrudik's as the only answer because he answered most specifically to my problem. Another plus is that he kept answering my questions in more and more detail, and I understand that this is not behaviour that can be guaranteed ;-) The chosen answer works best for my needs.
Have a nice weekend folks!
Thanks for the question and the points.
If you need to get rid of tags, you can try function strip_tags.
http://www.php.net/manual/en/function.strip-tags.php
But as the manual says: "Because strip_tags() does not actually validate the HTML, partial or broken tags can result in the removal of more text/data than expected." Better check twice whether this is what you need.
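For well-formed input the call is a one-liner. The sample string below is made up for illustration:

```php
<?php
// Made-up, well-formed sample input.
$html = '<p>Hi, my name is <a href="mail@keyword.net">mr. keyword</a>.</p>';

// strip_tags() removes the tags and keeps the text between them.
$plain = strip_tags($html);
echo $plain, "\n";
```

Note that the attribute value disappears together with the tag, so only the visible "keyword" remains in the result.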
Greets
Ljubiccica