Solved

Using PHP to parse text for tags

Posted on 2011-03-20
Last Modified: 2012-05-11
Hi All,

I'm trying to write some PHP code (about which I admittedly know next to nothing!) that will take in some text and extract the nouns and verbs.

In short, I have a paragraph or two of text that a user has entered. I want to parse it for nouns and verbs in order to generate search tags, rather than ask the user for tags.

Does anyone know of open source code that already does this, or have advice on what I should be thinking about before I start coding?

Alternatively, any thoughts on auto-generating tags would be much appreciated.

Thanks in advance.
Question by:GroganJ

Expert Comment

by:Ray Paseur
ID: 35177336
This is an interesting question, if I understand it correctly.  Can you please post an example of the input test data and the output you want to get?  Then we can talk about the rules that need to be implemented in code to achieve the desired result set.  You may find that certain words, like "set", have context-sensitive characteristics.  Consider when "set" is a noun vs. when it is a verb.

If you do not know very much about PHP you may be working on this project for a very long time without much success.  Consider getting this book to begin to get a foundation in the programming language.
http://www.sitepoint.com/books/phpmysql4/

After you have mastered that, get Elliott White's book, PHP5 in Practice - it's available on Amazon.

Author Comment

by:GroganJ
ID: 35182830
Ray,

Here is a sample paragraph:
"The campaign will focus on the largest potential user group in Shangzhen: local and migrant males, aged 18-26, to identify the key external and internal factors that are driving the adoption of 3G cell phones in China. It will explore the potential raise awareness of our products (in competition with state-sponsored handset manufacturers) and opportunities to sell direct to market."

The kinds of tags I would like to extract are: China, cell phones, 3G, Shangzhen. Obviously, a human would review and filter the results!

I guess I should rephrase my question a little. Does anyone know of on-the-fly parsing algorithms (or code in PHP) that can handle some of this? Alternatively, any ideas on how some websites generate tags for pieces of text? As the volume of data will grow quite large, I want to avoid MySQL queries that search with 'LIKE', as they would grind to a halt. As new text is loaded into the database, I want to parse out the tags and store them in a separate table for quicker reference.

Thanks,
John.

Accepted Solution

by:Ray Paseur (earned 500 total points)
ID: 35184645
Here is a strategy you might be able to adapt into a credible user interface.  It works like this... The client is shown the text and can select words from it.  When the form is submitted, the script stores the text string with a key (the MD5 hash of the text) in one table, and stores each selected word with the same key in another table.  The keys are used to coordinate the words in one table with the associated text strings in the other.

Some thoughts for going forward:  

1. Consider case-sensitivity.  In MySQL "the" == "The" under the default (case-insensitive) collations.  This might or might not be what you want.  You might normalize all words to lower case.
2. Consider time-sensitive relevance.  You might want to add DATETIME columns to these tables.
3. Consider the relevance of adjacent words like cell and phone - perhaps you could have an algorithm that detected sequential keys in $_POST and made these into a single term in addition to individual terms.
4. Consider the UI factors.  I did this in a quick-and-dirty script with checkboxes.  A more reasonable UI would present the text in a way that looked normal, but with each word of the text as simply-styled form inputs.  You might allow the reader to click on any word.  The click would modify an input control in the form to add the clicked word, or it might use an AJAX solution to send the words to a backend script as they are clicked.
5. Consider using Google Search to do all this work for you.  I don't really know the design of your application, but Google knows more about searching than you or I do!
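
Point 3 could be sketched roughly as follows. This is only an illustration, assuming the checkboxes are named with each word's position in the text, as in the script that accompanies this answer. The helper names build_terms() and run_to_terms() are made up for the example and are not part of that script. Words are lower-cased as suggested in point 1.

```php
<?php
// Hypothetical helper: turn one run of adjacent words into search terms.
// Returns each word on its own, plus the joined phrase when the run
// contains more than one word.
function run_to_terms(array $run): array
{
    if (count($run) === 0) {
        return array();
    }
    $terms = $run;
    if (count($run) > 1) {
        $terms[] = implode(' ', $run);
    }
    return $terms;
}

// Hypothetical helper: group the checked words by runs of consecutive
// numeric keys (word positions), so adjacent picks also form one phrase.
function build_terms(array $post): array
{
    // Keep only the numeric keys (the checkbox names), lower-case the words
    $picked = array();
    foreach ($post as $key => $val) {
        if (preg_match('/^\d+$/', (string) $key)) {
            $picked[(int) $key] = strtolower(trim((string) $val));
        }
    }
    ksort($picked);

    $terms = array();
    $run   = array();
    $prev  = null;
    foreach ($picked as $pos => $word) {
        // A gap in the positions closes the current run
        if ($prev !== null && $pos !== $prev + 1) {
            $terms = array_merge($terms, run_to_terms($run));
            $run   = array();
        }
        $run[] = $word;
        $prev  = $pos;
    }
    $terms = array_merge($terms, run_to_terms($run));

    // Save only one entry for each term
    return array_values(array_unique($terms));
}

// Illustrative input: the key numbers here are invented, not the real positions
print_r(build_terms(array('31' => '3G', '32' => 'cell', '33' => 'phones', '35' => 'China')));
```

With that illustrative input, the three adjacent words give the extra term "3g cell phones" alongside the individual words, while "china" stays on its own because position 34 was not checked. Whether you want every adjacent run stored as a phrase is a design choice; long runs of checked words would produce long phrases.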

HTH, ~Ray
<?php // RAY_temp_groganj.php
error_reporting(E_ALL);
echo "<pre>";

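// CONNECT AND SELECT THE DATA BASE
// NOTE: THE mysql_* FUNCTIONS USED BELOW WERE REMOVED IN PHP 7.0 - ON CURRENT PHP USE mysqli OR PDO
require_once('YOUR_DB_CREDENTIALS.php');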

// TEST DATA FROM THE POST AT EE
$txt = "The campaign will focus on the largest potential user group in Shangzhen: local and migrant males, aged 18-26, to identify the key external and internal factors that are driving the adoption of 3G cell phones in China. It will explore the potential raise awareness of our products (in competition with state-sponsored handset manufacturers) and opportunities to sell direct to market.";

// SANITIZE THE STRING - KEEP LETTERS, DIGITS, HYPHENS AND SPACES, THEN COLLAPSE RUNS OF WHITESPACE
$new = preg_replace('#[^A-Z0-9- ]#i', '',  $txt);
$new = preg_replace('#\s\s+#',        ' ', $new);

// MAKE A KEY TO COORDINATE THE TEXT WITH THE CHOSEN SEARCH WORDS
$md5 = md5($txt);

// MAKE AN ARRAY OF THE WORDS
$arr = explode(' ', $new);

if (!empty($_POST))
{
    // GET THE POST DATA AND CREATE THE QUERIES
    $sqls = array();

    // PUT THE THE MD5 KEY AND TEXT INTO THE TEXTS TABLE
    $esc = mysql_real_escape_string($txt);
    $sqls[] = "INSERT INTO searchTexts (m, t) VALUES ('$md5', '$esc')";

    // PUT THE MD5 KEY AND CHOSEN WORDS INTO THE WORDS TABLE
    foreach ($_POST as $key => $val)
    {
        if (in_array($val, $arr))
        {
            $wrd = mysql_real_escape_string($val);
            $sqls[] = "INSERT INTO searchWords (m, w) VALUES ('$md5', '$wrd')";
        }
    }

    // RUN THE QUERIES WE CREATED (SAVE ONLY ONE ROW FOR EACH WORD)
    $sqls = array_unique($sqls);
    foreach ($sqls as $sql)
    {
        echo PHP_EOL . $sql;
        mysql_query($sql);
    }
}

// CREATE THE FORM TO CHECK THE IMPORTANT WORDS
echo '<form method="post">';
echo PHP_EOL . "THE TEXT SAYS: $txt";
echo PHP_EOL . "CHOOSE INDEX WORDS:";
$k = 0;

// EACH WORD IS AN INPUT CONTROL
foreach ($arr as $wrd)
{
    // CREATE THE INPUT CONTROL USING HEREDOC NOTATION
$inp = <<<INP
<input type="checkbox" name="$k" value="$wrd" />$wrd
INP;
    echo PHP_EOL . $inp;
    $k++;
}

echo PHP_EOL . '<input type="submit" />';
echo PHP_EOL . '</form>';
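
Once the words are stored this way, a search can be an exact match on the words table joined back to the texts table, rather than a LIKE scan over the full text. The following is a rough sketch of that lookup, not a tested production design. It assumes the same table and column names as the script above (searchTexts with m and t, searchWords with m and w). To keep it self-contained it uses PDO with an in-memory SQLite database instead of MySQL, and the function names store_text() and find_texts(), the index name, and the sample sentences are invented for the example.

```php
<?php
// ASSUMPTION: in-memory SQLite via PDO so the sketch runs standalone;
// the thread itself uses MySQL, where the same SQL idea applies.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Same two-table layout as the script: m is the MD5 key shared by both tables
$pdo->exec('CREATE TABLE searchTexts (m TEXT PRIMARY KEY, t TEXT)');
$pdo->exec('CREATE TABLE searchWords (m TEXT, w TEXT)');
// An index on the word column lets an exact-match lookup avoid a full scan
$pdo->exec('CREATE INDEX idx_searchWords_w ON searchWords (w)');

// Hypothetical helper: store a text and its chosen words (lower-cased, unique)
function store_text(PDO $pdo, string $text, array $words): void
{
    $key = md5($text);
    $pdo->prepare('INSERT INTO searchTexts (m, t) VALUES (?, ?)')
        ->execute(array($key, $text));

    $ins = $pdo->prepare('INSERT INTO searchWords (m, w) VALUES (?, ?)');
    foreach (array_unique(array_map('strtolower', $words)) as $word) {
        $ins->execute(array($key, $word));
    }
}

// Hypothetical helper: find every text tagged with the given word
function find_texts(PDO $pdo, string $word): array
{
    $stmt = $pdo->prepare(
        'SELECT st.t FROM searchWords sw
           JOIN searchTexts st ON st.m = sw.m
          WHERE sw.w = ?'
    );
    $stmt->execute(array(strtolower($word)));
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Invented sample rows for the demonstration
store_text($pdo, 'Adoption of 3G cell phones in China.', array('3G', 'China'));
store_text($pdo, 'Handset manufacturers sell direct to market.', array('handset'));

print_r(find_texts($pdo, 'China'));
```

The prepared statements also take care of escaping, so the values never need to be concatenated into the SQL string.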

Expert Comment

by:Ray Paseur
ID: 35407920
Thanks for the points - it's an interesting question, ~Ray