Solved

Using PHP to parse text for tags

Posted on 2011-03-20
Medium Priority
Last Modified: 2012-05-11
Hi All,

I'm trying to create some PHP code (a language I admittedly know next to nothing about!) that will take in some text and extract the nouns and verbs.

In short, I have a paragraph or 2 of text that a user has entered. I want to parse for nouns and verbs in order to generate search tags, rather than ask the user for tags.

Does anyone know of some open source code available that already does this or have some advice on what I should be thinking about before I start coding?

Alternatively any thoughts on auto-generating tags would be much appreciated.

Thanks in advance.
Question by:GroganJ
LVL 111

Expert Comment

by:Ray Paseur
ID: 35177336
This is an interesting question, if I understand it correctly.  Can you please post an example of the input test data and the output result you want to get?  Then we can talk about the rules that need programmatic implementation to achieve the desired result set.  You may find that certain words, like "set" have context-sensitive characteristics.  Consider when "set" is a noun vs when it is a verb.

If you do not know very much about PHP you may be working on this project for a very long time without much success.  Consider getting this book to begin to get a foundation in the programming language.
http://www.sitepoint.com/books/phpmysql4/

After you have mastered that, get Elliott White's book, PHP5 in Practice - it's available on Amazon.

Author Comment

by:GroganJ
ID: 35182830
Ray,

Here is a sample paragraph:
"The campaign will focus on the largest potential user group in Shangzhen: local and migrant males, aged 18-26, to identify the key external and internal factors that are driving the adoption of 3G cell phones in China. It will explore the potential raise awareness of our products (in competition with state-sponsored handset manufacturers) and opportunities to sell direct to market."

The kind of tags I would like to extract are: China, cell phones, 3G, Shangzhen. Obviously, this is with human oversight that can filter!

I guess I should rephrase my question a little. Does anyone know of on-the-fly parsing algorithms (or code in PHP) that can handle some of this? Alternatively, any ideas on how some websites generate tags for pieces of text? As the volume of data will grow quite large, I want to avoid MySQL queries that search with 'LIKE', as they would grind to a halt. As new text is loaded into the database, I want to parse the tags and store them in a separate table for quicker reference.

Thanks,
John.
LVL 111

Accepted Solution

by:
Ray Paseur earned 2000 total points
ID: 35184645
Here is a strategy you might be able to adapt to a credible user interface.  It works like this: the client is given the text and the ability to select words from the text string.  When the form is submitted, the script stores the text string with a key in one table, and each selected word with the same key in another table.  The key coordinates the words with the text strings they came from.
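
As a rough illustration of that key coordination, here is a minimal in-memory sketch in which plain PHP arrays stand in for the two tables. The names $texts, $words, storeSelection() and findTextsByWord() are made up for this example and are not part of the script further down; a real version would use database tables.

```php
<?php
// Hypothetical in-memory stand-ins for the two tables.
$texts = array();  // MD5 key => full text
$words = array();  // rows of array('m' => MD5 key, 'w' => word)

// Store a text and its chosen words under one shared MD5 key.
function storeSelection(array &$texts, array &$words, $text, array $chosen)
{
    $key = md5($text);
    $texts[$key] = $text;
    foreach (array_unique($chosen) as $word) {
        // Words are lower-cased so lookups are case-insensitive.
        $words[] = array('m' => $key, 'w' => strtolower($word));
    }
    return $key;
}

// Find every stored text that was tagged with the given word.
function findTextsByWord(array $texts, array $words, $word)
{
    $found = array();
    foreach ($words as $row) {
        if ($row['w'] === strtolower($word) && isset($texts[$row['m']])) {
            $found[$row['m']] = $texts[$row['m']];
        }
    }
    return array_values($found);
}

storeSelection($texts, $words, 'Adoption of 3G cell phones in China.', array('3G', 'China'));
storeSelection($texts, $words, 'Handset makers in China compete on price.', array('China', 'Handset'));

print_r(findTextsByWord($texts, $words, 'china'));    // both texts
print_r(findTextsByWord($texts, $words, 'handset'));  // second text only
```

The lookup goes from word to key to text, so no text is ever scanned at search time.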

Some thoughts for going forward:  

1. Consider case-sensitivity.  In MySQL "the" == "The" by default.  This might or might not be what you want.  You might normalize all words to lower case.
2. Consider time-sensitive relevance.  You might want to add DATETIME columns to these tables.
3. Consider the relevance of adjacent words like cell and phone - perhaps you could have an algorithm that detected sequential keys in $_POST and made these into a single term in addition to individual terms.
4. Consider the UI factors.  I did this in a quick-and-dirty script with checkboxes.  A more reasonable UI would present the text in a way that looked normal, but with each word rendered as a simply styled form input.  You might allow the reader to click on any word.  The click would modify an input control in the form to add the clicked word, or it might use an AJAX solution to send the words to a backend script as they are clicked.
5. Consider using Google Search to do all this work for you.  I don't really know the design of your application, but Google knows more about searching than you or I do!
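
Points 1 and 3 could be sketched roughly as follows. This is only one possible approach, and buildTerms() is a hypothetical helper, not something the script below contains. It assumes $all holds every word of the text by position and $selected holds the positions the user chose (in a word-per-checkbox form these would be the keys of $_POST).

```php
<?php
// $all:      every word of the text, indexed by position
// $selected: the positions the user ticked
function buildTerms(array $all, array $selected)
{
    $selected = array_unique(array_map('intval', $selected));
    sort($selected);

    $terms = array();
    $run   = array();   // words of the current run of adjacent positions
    $prev  = null;

    foreach ($selected as $pos) {
        if (!isset($all[$pos])) {
            continue;   // ignore positions outside the text
        }
        $word    = strtolower($all[$pos]);   // normalize to lower case
        $terms[] = $word;                    // always keep the single word
        if ($prev !== null && $pos === $prev + 1) {
            $run[] = $word;                  // extends the current run
        } else {
            if (count($run) > 1) {
                $terms[] = implode(' ', $run);   // close the previous run
            }
            $run = array($word);
        }
        $prev = $pos;
    }
    if (count($run) > 1) {
        $terms[] = implode(' ', $run);
    }
    return array_values(array_unique($terms));
}

$all = explode(' ', 'the adoption of 3G cell phones in China');
print_r(buildTerms($all, array(4, 5, 7)));
// cell, phones, china, cell phones
```

Each run of adjacent selections yields one extra multi-word term, while the individual words are still kept.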

HTH, ~Ray
<?php // RAY_temp_groganj.php
error_reporting(E_ALL);
echo "<pre>";

// CONNECT AND SELECT THE DATA BASE
require_once('YOUR_DB_CREDENTIALS.php');

// TEST DATA FROM THE POST AT EE
$txt = "The campaign will focus on the largest potential user group in Shangzhen: local and migrant males, aged 18-26, to identify the key external and internal factors that are driving the adoption of 3G cell phones in China. It will explore the potential raise awareness of our products (in competition with state-sponsored handset manufacturers) and opportunities to sell direct to market.";

// SANITIZE THE STRING - WORDS ONLY, SINGLE SPACED
$new = preg_replace('#[^A-Z0-9- ]#i', '',   $txt);
$new = preg_replace('#\s\s+#',       ' ',  $new);

// MAKE A KEY TO COORDINATE THE TEXT WITH THE CHOSEN SEARCH WORDS
$md5 = md5($txt);

// MAKE AN ARRAY OF THE WORDS
$arr = explode(' ', $new);

if (!empty($_POST))
{
    // ASSUMPTION: YOUR_DB_CREDENTIALS.php SETS $mysqli TO AN OPEN mysqli CONNECTION
    // (THE mysql_* FUNCTIONS WERE REMOVED IN PHP 7)

    // GET THE POST DATA AND CREATE THE QUERIES
    $sqls = array();

    // PUT THE MD5 KEY AND TEXT INTO THE TEXTS TABLE
    $esc = $mysqli->real_escape_string($txt);
    $sqls[] = "INSERT INTO searchTexts (m, t) VALUES ('$md5', '$esc')";

    // PUT THE MD5 KEY AND CHOSEN WORDS INTO THE WORDS TABLE
    foreach ($_POST as $key => $val)
    {
        // ACCEPT ONLY WORDS THAT REALLY OCCUR IN THE TEXT
        if (is_string($val) && in_array($val, $arr))
        {
            $wrd = $mysqli->real_escape_string($val);
            $sqls[] = "INSERT INTO searchWords (m, w) VALUES ('$md5', '$wrd')";
        }
    }

    // RUN THE QUERIES WE CREATED (SAVE ONLY ONE ROW FOR EACH WORD)
    $sqls = array_unique($sqls);
    foreach ($sqls as $sql)
    {
        echo PHP_EOL . $sql;
        if (!$mysqli->query($sql))
        {
            echo PHP_EOL . 'QUERY FAILED: ' . $mysqli->error;
        }
    }
}

// CREATE THE FORM TO CHECK THE IMPORTANT WORDS
echo '<form method="post">';
echo PHP_EOL . "THE TEXT SAYS: $txt";
echo PHP_EOL . "CHOOSE INDEX WORDS:";
$k = 0;

// EACH WORD IS AN INPUT CONTROL
foreach ($arr as $wrd)
{
    // CREATE THE INPUT CONTROL USING HEREDOC NOTATION
$inp = <<<INP
<input type="checkbox" name="$k" value="$wrd" />$wrd
INP;
    echo PHP_EOL . $inp;
    $k++;
}

echo PHP_EOL . '<input type="submit" />';
echo PHP_EOL . '</form>';
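
Since the form above shows a checkbox for every word, one way to shorten the list (and a crude first step toward auto-generated tags) might be to drop common stop words and rank what is left by frequency. The sketch below is an assumption-laden illustration, not part of the accepted script: candidateTags() and the tiny $stopWords list are hypothetical, and a real stop-word list would be much longer.

```php
<?php
// Hypothetical helper: rank candidate tags by frequency after dropping stop words.
function candidateTags($text, array $stopWords, $minLength = 2)
{
    // Split on anything that is not a letter, digit, or hyphen.
    $tokens = preg_split('#[^a-z0-9-]+#i', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $counts = array();
    foreach ($tokens as $token) {
        $token = trim($token, '-');
        if (strlen($token) < $minLength || in_array($token, $stopWords, true)) {
            continue;   // too short, or a stop word
        }
        $counts[$token] = isset($counts[$token]) ? $counts[$token] + 1 : 1;
    }
    arsort($counts);    // most frequent first
    // Numeric tokens become integer keys, so cast everything back to strings.
    return array_map('strval', array_keys($counts));
}

// A deliberately tiny, illustrative stop-word list.
$stopWords = array('the', 'of', 'in', 'on', 'and', 'to', 'will', 'it', 'that', 'are', 'our', 'with');

$tags = candidateTags('The adoption of 3G cell phones in China. Phones in China are cheap.', $stopWords);
print_r($tags);
```

This does no part-of-speech analysis, so a human would still need to review the candidates before they are stored.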

 
LVL 111

Expert Comment

by:Ray Paseur
ID: 35407920
Thanks for the points - it's an interesting question, ~Ray