Solved

Using PHP to parse text for tags

Posted on 2011-03-20
Last Modified: 2012-05-11
Hi All,

I'm trying to write some PHP code (a language I admittedly know next to nothing about!) that will take in some text and extract the nouns and verbs.

In short, I have a paragraph or two of text that a user has entered. I want to parse it for nouns and verbs in order to generate search tags, rather than ask the user for tags.

Does anyone know of some open source code available that already does this or have some advice on what I should be thinking about before I start coding?

Alternatively any thoughts on auto-generating tags would be much appreciated.

Thanks in advance.
Question by:GroganJ

Expert Comment

by:Ray Paseur
ID: 35177336
This is an interesting question, if I understand it correctly.  Can you please post an example of the input test data and the output result you want to get?  Then we can talk about the rules that need programmatic implementation to achieve the desired result set.  You may find that certain words, like "set" have context-sensitive characteristics.  Consider when "set" is a noun vs when it is a verb.

If you do not know very much about PHP you may be working on this project for a very long time without much success.  Consider getting this book to begin to get a foundation in the programming language.
http://www.sitepoint.com/books/phpmysql4/

After you have mastered that, get Elliott White's book, PHP5 in Practice - it's available on Amazon.

Author Comment

by:GroganJ
ID: 35182830
Ray,

Here is a sample paragraph:
"The campaign will focus on the largest potential user group in Shangzhen: local and migrant males, aged 18-26, to identify the key external and internal factors that are driving the adoption of 3G cell phones in China. It will explore the potential raise awareness of our products (in competition with state-sponsored handset manufacturers) and opportunities to sell direct to market."

The kind of tags I would like to extract are: China, cell phones, 3G, Shangzhen. Obviously, this is with human oversight that can filter!

I guess I should rephrase my question a little. Does anyone know of on-the-fly parsing algorithms (or code in PHP) that can handle some of this? Alternatively, any ideas on how some websites generate tags for pieces of text? As the volume of data will grow quite large, I want to avoid MySQL queries that search with LIKE, as they would grind to a halt. As new text is loaded into the database, I want to parse out the tags and store them in a separate table for quicker reference.
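As a rough illustration of what on-the-fly candidate extraction could look like, here is a minimal sketch. It is not a part-of-speech tagger and it is not taken from any library; the function name suggestTags() is hypothetical. It only assumes that words capitalized in mid-sentence and tokens containing a digit are likely tag candidates, which a human would still need to filter.

```php
<?php
// Naive candidate-tag heuristic (an assumption for illustration only):
// keep words that are capitalized in mid-sentence and tokens containing a digit.
function suggestTags(string $text): array
{
    $tags = array();

    // Split into sentences at ., ! or ? followed by whitespace
    $sentences = preg_split('/(?<=[.!?])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    foreach ($sentences as $sentence) {
        // A word is a run of letters/digits, optionally joined by inner hyphens
        preg_match_all('/[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*/', $sentence, $m);

        foreach ($m[0] as $i => $word) {
            $hasDigit = preg_match('/\d/', $word) === 1;
            // Skip the first word: it is capitalized only because it opens the sentence
            $capitalized = $i > 0 && preg_match('/^[A-Z]/', $word) === 1;

            if ($hasDigit || $capitalized) {
                $tags[] = $word;
            }
        }
    }

    // One entry per distinct candidate, in order of first appearance
    return array_values(array_unique($tags));
}
```

On the sample paragraph this heuristic would pick up Shangzhen, 3G and China, but also the age range 18-26, and it would miss a common-noun phrase like "cell phones" entirely. That gap is why some human selection, or a real part-of-speech tagger, is still needed.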

Thanks,
John.

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 35184645
Here is a strategy you might be able to adapt to a credible user interface.  It works like this: the client is given the text and the ability to select words from it.  When the form is submitted, the script stores the text string with a key in one table, and each selected word with the same key in another table.  The key coordinates the words with their associated text string.

Some thoughts for going forward:  

1. Consider case-sensitivity.  With MySQL's default case-insensitive collations, "the" and "The" compare as equal.  This might or might not be what you want.  You might normalize all words to lower case.
2. Consider time-sensitive relevance.  You might want to add DATETIME columns to these tables.
3. Consider the relevance of adjacent words like cell and phone - perhaps you could have an algorithm that detected sequential keys in $_POST and made these into a single term in addition to individual terms.
4. Consider the UI factors.  I did this in a quick-and-dirty script with checkboxes.  A more reasonable UI would present the text in a way that looked normal, but with each word of the text as simply-styled form inputs.  You might allow the reader to click on any word.  The click would modify an input control in the form to add the clicked word, or it might use an AJAX solution to send the words to a backend script as they are clicked.
5. Consider using Google Search to do all this work for you.  I don't really know the design of your application, but Google knows more about searching than you or I do!
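Points 1 and 3 could be sketched roughly as follows. This assumes the checkbox names are the numeric word positions, as in the script in this post. The helper names collectTerms() and runToTerms() are hypothetical, not part of that script.

```php
<?php
// Turn a run of adjacent words into terms: each word alone,
// plus the joined phrase when the run has two or more words.
function runToTerms(array $run): array
{
    if (count($run) < 2) {
        return $run;
    }
    $terms   = $run;
    $terms[] = implode(' ', $run);
    return $terms;
}

// Collect lower-cased terms from the posted checkboxes, merging words
// whose numeric keys are consecutive into an additional multi-word term.
function collectTerms(array $post): array
{
    // PHP stores numeric-string array keys as integers, so is_int() finds the
    // word positions and skips any other form field
    $picked = array();
    foreach ($post as $key => $val) {
        if (is_int($key) && is_string($val)) {
            $picked[$key] = strtolower(trim($val)); // ASCII-only lowercasing
        }
    }
    ksort($picked);

    $terms = array();
    $run   = array();
    $prev  = null;
    foreach ($picked as $pos => $word) {
        if ($prev !== null && $pos !== $prev + 1) {
            // Gap in the positions: close the current run
            $terms = array_merge($terms, runToTerms($run));
            $run   = array();
        }
        $run[] = $word;
        $prev  = $pos;
    }
    $terms = array_merge($terms, runToTerms($run));

    return array_values(array_unique($terms));
}
```

If the reader ticks "cell" and "phones" at adjacent positions and "China" a little further on, this would yield cell, phones, cell phones and china. Whether you want the single words as well as the phrase is a design choice; drop them in runToTerms() if you do not.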

HTH, ~Ray
<?php // RAY_temp_groganj.php
error_reporting(E_ALL);
echo "<pre>";

// CONNECT AND SELECT THE DATA BASE
// (THE mysql_* FUNCTIONS USED BELOW WERE REMOVED IN PHP 7.0; ON CURRENT PHP USE mysqli OR PDO)
require_once('YOUR_DB_CREDENTIALS.php');

// TEST DATA FROM THE POST AT EE
$txt = "The campaign will focus on the largest potential user group in Shangzhen: local and migrant males, aged 18-26, to identify the key external and internal factors that are driving the adoption of 3G cell phones in China. It will explore the potential raise awareness of our products (in competition with state-sponsored handset manufacturers) and opportunities to sell direct to market.";

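// SANITIZE THE STRING - WORDS ONLY, SINGLE SPACED
// KEEP LETTERS, DIGITS, HYPHENS AND WHITESPACE (AN EMPTY STRING, NOT NULL, IS THE REPLACEMENT)
$new = preg_replace('#[^A-Z0-9\s-]#i', '', $txt);
// COLLAPSE WHITESPACE RUNS AND TRIM SO explode() YIELDS NO EMPTY WORDS
$new = trim(preg_replace('#\s+#', ' ', $new));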

// MAKE A KEY TO COORDINATE THE TEXT WITH THE CHOSEN SEARCH WORDS
$md5 = md5($txt);

// MAKE AN ARRAY OF THE WORDS
$arr = explode(' ', $new);

if (!empty($_POST))
{
    // GET THE POST DATA AND CREATE THE QUERIES
    $sqls = array();

    // PUT THE MD5 KEY AND TEXT INTO THE TEXTS TABLE
    $esc = mysql_real_escape_string($txt);
    $sqls[] = "INSERT INTO searchTexts (m, t) VALUES ('$md5', '$esc')";

    // PUT THE MD5 KEY AND CHOSEN WORDS INTO THE WORDS TABLE
    foreach ($_POST as $key => $val)
    {
        if (in_array($val, $arr))
        {
            $wrd = mysql_real_escape_string($val);
            $sqls[] = "INSERT INTO searchWords (m, w) VALUES ('$md5', '$wrd')";
        }
    }

    // RUN THE QUERIES WE CREATED (SAVE ONLY ONE ROW FOR EACH WORD)
    $sqls = array_unique($sqls);
    foreach ($sqls as $sql)
    {
        echo PHP_EOL . $sql;
        mysql_query($sql);
    }
}

// CREATE THE FORM TO CHECK THE IMPORTANT WORDS
echo '<form method="post">';
echo PHP_EOL . "THE TEXT SAYS: $txt";
echo PHP_EOL . "CHOOSE INDEX WORDS:";
$k = 0;

// EACH WORD IS AN INPUT CONTROL
foreach ($arr as $wrd)
{
    // CREATE THE INPUT CONTROL USING HEREDOC NOTATION
$inp = <<<INP
<input type="checkbox" name="$k" value="$wrd" />$wrd
INP;
    echo PHP_EOL . $inp;
    $k++;
}

echo PHP_EOL . '<input type="submit" />';
echo PHP_EOL . '</form>';
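
The storage step of the script above can also be written with PDO prepared statements, which run on current PHP versions and avoid hand-escaping. This is a minimal sketch, not the original method. It assumes $pdo is an already connected PDO handle in exception mode, and it reuses the table and column names from the script (searchTexts m/t, searchWords m/w). The function names filterChosen() and saveTags() are hypothetical.

```php
<?php
// Keep only posted values that really are words from the text, one row per word.
function filterChosen(array $post, array $allowed): array
{
    $out = array();
    foreach ($post as $val) {
        if (is_string($val) && in_array($val, $allowed, true)) {
            $out[] = $val;
        }
    }
    return array_values(array_unique($out));
}

// Store the text and its chosen words under one shared MD5 key.
function saveTags(PDO $pdo, string $txt, array $words): string
{
    $md5 = md5($txt);

    $pdo->beginTransaction();
    try {
        // Placeholders let the driver handle quoting
        $stmt = $pdo->prepare('INSERT INTO searchTexts (m, t) VALUES (?, ?)');
        $stmt->execute(array($md5, $txt));

        $stmt = $pdo->prepare('INSERT INTO searchWords (m, w) VALUES (?, ?)');
        foreach ($words as $wrd) {
            $stmt->execute(array($md5, $wrd));
        }
        $pdo->commit();
    } catch (Exception $e) {
        // Undo the partial insert and let the caller see the error
        $pdo->rollBack();
        throw $e;
    }

    return $md5;
}
```

In the script above, the call would look something like saveTags($pdo, $txt, filterChosen($_POST, $arr)). An index on searchWords.w would then let you look up texts by tag without resorting to LIKE.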


Expert Comment

by:Ray Paseur
ID: 35407920
Thanks for the points - it's an interesting question, ~Ray