LeeOMara

asked on

search results highlighting/regexp-voodoo

PROBLEM:
I want to produce short "snippets" for search result pages. The search results are already partially processed (search terms get wrapped in [strong] tags), but I keep hitting dead ends with this step. I've tried a wide range of approaches, spent an embarrassing amount of time, and arrived at no viable solution. Learning regular expressions would probably have been less painful, but there you have it.

REQUIREMENTS:
- the snippet should be made up of five words on either side of the search terms.
- end product cannot have any unclosed/unmatched tags.

SETUP:
- Search terms are (already) marked up in [strong] tags with a class attribute of term1 - term3.
- No other tags exist in the string (other than aforementioned [strong] tags)

EXAMPLE INPUT:
The journalist/critic Edmund Gosse begins championing <strong class="term3">Ibsen</strong> in English periodicals. His article "<strong class="term3">Ibsen</strong> the Norwegian Satirist" (Fortnightly Review, January 1873) is later expanded in his book Studies in the Literature of Northern Europe (1879). In contrast, the conservative dramatic critic Clement W. Scott begins writing columns for <strong class="term1">London</strong>'s Daily Telegraph; he comes to consider the influence of <strong class="term3">Ibsen</strong> the worst thing that ever happened to English drama. His hysterically negative attack on <strong class="term3">Ibsen's</strong> Ghosts in 1891 will be paraphrased (and reduced to ridicule) by <strong class="term2">Shaw</strong> in the first chapter of The Quintessence of <strong class="term3">Ibsen</strong>ism (1891).

EXAMPLE OUTPUT:
... journalist/critic Edmund Gosse begins championing <strong class="term3">Ibsen</strong> in English periodicals. His article ... Scott begins writing columns for <strong class="term1">London</strong>'s Daily Telegraph; he comes to consider ... (and reduced to ridicule) by <strong class="term2">Shaw</strong> in the first chapter of The Quintessence...
ASKER CERTIFIED SOLUTION
shmert

This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
I think this may be in line with what you were trying to accomplish.
shmert has already given you the more elegant solution, but this will work as well:

<?php
// Search for these terms.
  $_terms = array
            (
              "London",
              "Shaw",
              "Ibsen"
            );

// Create a list of tags to replace the $_terms.
  $_tagged_terms = array
            (
              "<strong class=\"term1\">" . $_terms[0] . "</strong>",
              "<strong class=\"term2\">" . $_terms[1] . "</strong>",
              "<strong class=\"term3\">" . $_terms[2] . "</strong>"
            );

// Set up sample input that was supplied.
  $_input  = "The journalist/critic Edmund Gosse begins championing Ibsen in English ";
  $_input .= "periodicals. His article \"Ibsen the Norwegian Satirist\" (Fortnightly ";
  $_input .= "Review, January 1873) is later expanded in his book Studies in the ";
  $_input .= "Literature of Northern Europe (1879). In contrast, the conservative ";
  $_input .= "dramatic critic Clement W. Scott begins writing columns for London's ";
  $_input .= "Daily Telegraph; he comes to consider the influence of Ibsen the worst ";
  $_input .= "thing that ever happened to English drama. His hysterically negative ";
  $_input .= "attack on Ibsen's Ghosts in 1891 will be paraphrased (and reduced to ";
  $_input .= "ridicule) by Shaw in the first chapter of The Quintessence of ";
  $_input .= "Ibsenism (1891).";

// Create an array of all words to be scrutinized.
  $_tokenized_input = explode(" ",$_input);

// Replace the input text with the $_tagged_terms.
  $_tagged_input = str_replace($_terms,$_tagged_terms,$_tokenized_input);

// Iterate through the input text.  
  $_total_words = count($_tagged_input);
  for($_i = 0; $_i < $_total_words; $_i++)
  {
    $_key = null;

// Not the most elegant solution, but it works.
// If the input token was changed by the str_replace command, you have a hit.
    if( str_replace($_tagged_terms,"",$_tagged_input[$_i]) != $_tagged_input[$_i] ) $_key = $_i;

// When you get a hit, set up the output.
// Test against null explicitly; a plain truthiness check would skip a hit at index 0.
    if($_key !== null)
    {
      $_output = "...";

// Grab up to 5 words before and 5 words after the hit, clamped to the array bounds.
      for($_j = max(0, $_key - 5); $_j < min($_total_words, $_key + 6); $_j++)
      {
        $_output .= $_tagged_input[$_j] . " ";
      }

// Clean up the extra space at the end of the line and add the ellipses.
      $_output = substr($_output,0,-1) . "... \n";
      echo $_output;
    }
  }

?>

--brian
LeeOMara

ASKER

Thanks sam,

You've basically got it. I still have one question: is it much more difficult to list each term only once? (I want to make sure the snippet doesn't get too long, but still contains the relevant text.)

I know I left this out of the original question, and I will concede the points regardless.
Thanks, theonlygoodisknowledge:

One of the nasty parts of this problem is that there are many "forms" of terms (synonyms, etc.); this has to do with how the data was originally input into the system.

Without getting into mind-numbing details, suffice it to say that even though the text for each term is wrapped in [strong] tags, the actual string inside can vary quite a bit.
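
Since the text inside the tags varies, one option is to key everything on the class attribute rather than on the term text. Here is a minimal sketch, assuming the markup looks exactly like the example input (only [strong] tags, classes term1 to term3). The name find_marked_terms is made up for illustration and is not from this thread:

```php
<?php
// Hypothetical helper (not from the thread): list each highlighted term
// by its class number, whatever text the tag happens to wrap.
function find_marked_terms(string $html): array
{
    preg_match_all(
        '/<strong class="term([1-3])">([^<]+)<\/strong>/',
        $html,
        $matches,
        PREG_SET_ORDER
    );

    $found = [];
    foreach ($matches as $m) {
        // $m[1] is the class number, $m[2] is the wrapped text.
        $found[] = ['term' => (int) $m[1], 'text' => $m[2]];
    }
    return $found;
}

$sample = 'attack on <strong class="term3">Ibsen\'s</strong> Ghosts, by <strong class="term2">Shaw</strong>';
print_r(find_marked_terms($sample));
```

This way "Ibsen" and "Ibsen's" both come back as term 3, so later steps never need to know the list of search words.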
Oh. I thought you were trying to pull snippets from text based on search criteria, i.e. a defined set of "terms".
I also see that you are actually looking for a single snippet per term, right?
I'm not exactly sure how you would do that, but I'll give it some thought.

--brian
FYI, here is what I ended up using (inspired by advice here and on php-general@lists.php.net):
<?php

$str = 'The journalist/critic Edmund Gosse begins championing <strong class="term3">Ibsen</strong> in English periodicals. His article "<strong class="term3">Ibsen</strong> the Norwegian Satirist" (Fortnightly Review, January 1873) is later expanded in his book Studies in the Literature of Northern Europe (1879). In contrast, the conservative dramatic critic Clement W. Scott begins writing columns for <strong class="term1">London</strong>\'s Daily Telegraph; he comes to consider the influence of <strong class="term3">Ibsen</strong> the worst thing that ever happened to English drama. His hysterically negative attack on <strong class="term3">Ibsen\'s</strong> Ghosts in 1891 will be paraphrased (and reduced to ridicule) by <strong class="term2">Shaw</strong> in the first chapter of The Quintessence of <strong class="term3">Ibsen</strong>ism (1891).';

preg_match_all('/(\S+\s*){0,5}<strong class="term[1-3]">([^<]+)<\/strong>(\s*\S+){0,5}/',$str,$matches);

$done = array(1=>false, 2=>false, 3=>false);

$final = array();

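// array_shift walks the matches in document order, so the first occurrence of
// each term is kept (array_pop would keep the last one and reverse the order).
while($next = array_shift($matches[0])) {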

    if(strpos($next, 'class="term1">')!==false &&  $done[1] == false) {
        $done[1] = true;
        $final[] = $next;
    }
    if(strpos($next, 'class="term2">')!==false &&  $done[2] == false) {
        $done[2] = true;
        $final[] = $next;
    }
    if(strpos($next, 'class="term3">')!==false &&  $done[3] == false) {
        $done[3] = true;
        $final[] = $next;
    }
}
echo 'Summary: ' . implode(' ... ',$final);
?>
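
One caveat with the pattern above: \S+ will also match a bare <strong, so when two terms sit close together a five-word window can, in principle, end in the middle of a tag. Below is a rough sketch of an alternative that treats each [strong] element as a single word, so a snippet can never cut one in half. It assumes [strong] is the only tag in the string, as stated in the question. build_snippets and its $radius parameter are hypothetical names, not part of the code above:

```php
<?php
// Hypothetical helper (not from the thread): build one snippet per term class,
// treating each <strong>...</strong> element as a single indivisible word.
function build_snippets(string $html, int $radius = 5): array
{
    // A token is either a highlighted term (plus any punctuation glued to it)
    // or a plain run of non-whitespace. Assumes <strong> is the only tag present.
    $pattern = '/[^\s<]*<strong class="term([1-3])">[^<]+<\/strong>[^\s<]*|[^\s<]+/';
    preg_match_all($pattern, $html, $tokens, PREG_SET_ORDER);

    $words = array_column($tokens, 0);
    $snippets = [];
    foreach ($tokens as $i => $token) {
        // Group 1 is only set when the token contains a highlighted term.
        if (empty($token[1]) || isset($snippets[$token[1]])) {
            continue;
        }
        $start = max(0, $i - $radius);
        $length = $i - $start + $radius + 1;
        // array_slice stops at the end of the array, so no upper clamp is needed.
        $snippets[$token[1]] = implode(' ', array_slice($words, $start, $length));
    }
    return $snippets;
}

$str = 'columns for <strong class="term1">London</strong>\'s Daily Telegraph; he comes to consider the influence of <strong class="term3">Ibsen</strong> the worst';
echo implode(' ... ', build_snippets($str)), "\n";
```

The result is keyed by class number and keeps the first occurrence of each term in document order. Words are rejoined with single spaces, so the original whitespace is normalized.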