nainil

asked on

Parse RSS Feeds for words that start with a Capital Letter

Hi,

I have a news site which provides daily RSS feeds (around 15 feeds a day). I am looking for a script that will parse the RSS news items and store the following in a MySQL table:

1. Date & Time of the RSS News
2. News Title
3. News Hyper Link
4. Words that begin with a capital letter, parsed and stored. The parser must be smart enough to group such words the way the example below does.

Here is a sample of the RSS text, followed by the keywords that the script should parse and store in the database.
Officials in Memphis, Tennessee, are bracing for the Mississippi River to crest Tuesday at its highest level in more than 70 years.

The river is expected to reach 48 feet -- just shy of the 48.7-foot record set in 1937 -- shortly after midnight Tuesday, National Weather Service meteorologist Bill Borghoff told CNN Sunday.

By daybreak Sunday, the Mississippi had already reached 47.3 feet.

Local officials have a "high level of confidence" that the prediction is accurate, said Bob Nations, director of preparedness in Shelby County, Tennessee.

Residents in 1,100 trailers and homes in low-lying areas in the county, which includes Memphis and surrounding areas, have already been told to evacuate. As of Saturday, 367 residents had moved into shelters, Borghoff said.



---

Expected Keywords Parsed & Stored:
Officials in Memphis
Tennessee
Mississippi River
Tuesday
National Weather Service
Bill Borghoff
CNN Sunday
Mississippi
Local
Shelby County
Residents
Memphis
Saturday
Borghoff




Note: it should have an ignore-word list so that words I choose are not matched.

Any help is appreciated.
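
For items 1 to 3 of the list above (date and time, title, link), a minimal sketch using PHP's SimpleXML extension might look like the following. It assumes a plain RSS 2.0 feed with `channel/item` elements; the function name `parse_rss_items` and the sample feed are made up for illustration, and the MySQL insert is left out.

```php
<?php
// Sketch only: extract date, title, and link from an RSS 2.0 document.
// parse_rss_items() is a hypothetical helper name, not part of any library.

function parse_rss_items(string $xml): array
{
    // Collect XML errors quietly instead of emitting warnings
    libxml_use_internal_errors(true);
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        return [];
    }

    $items = [];
    foreach ($feed->channel->item as $item) {
        $timestamp = strtotime((string) $item->pubDate);
        $items[] = [
            // UTC, in the 'Y-m-d H:i:s' shape a MySQL DATETIME column accepts;
            // null when pubDate is missing or cannot be parsed
            'published' => $timestamp === false ? null : gmdate('Y-m-d H:i:s', $timestamp),
            'title'     => trim((string) $item->title),
            'link'      => trim((string) $item->link),
        ];
    }
    return $items;
}

// Made-up sample feed; a real feed would be fetched from its URL first.
$sample = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Mississippi River nears record crest</title>
      <link>http://example.com/news/1</link>
      <pubDate>Sun, 08 May 2011 12:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
XML;

print_r(parse_rss_items($sample));
```

Each returned row could then be bound to a prepared INSERT statement; the table layout is up to you.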
ASKER CERTIFIED SOLUTION
Ray Paseur
nainil

ASKER

Thanks. This is a good example. However, here are the rules that need to be implemented in the above script:

Sentence Example:

Officials in Memphis, Tennessee, are bracing for the Mississippi River to crest Tuesday at its highest level in more than 70 years.

1. "Mississippi River" should be treated as a single joint word in the results, alongside the other words.

2. "Tennessee" and "Tuesday" should be treated as separate words.

3. A special character marks the end of a word.

4. "Officials Memphis" should be treated as a single joint word. Here I am assuming that "in" is on the ignored-word list and that "Memphis" is the last word before the special character (a comma in this case).

5. "CNN Sunday" should be treated as a joint word and stored as "CNN Sunday".

6. Numbers are not a part of the word list.

Can you modify the script to incorporate these rules?
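
One possible reading of the six rules, as a hedged sketch rather than a finished parser: split the text into letter-only words and single special characters, skip ignored words without ending a run, and let anything else (a lowercase word, punctuation, a digit) end the current run of capitalized words. The function name `capitalized_phrases` is hypothetical, and the sketch assumes plain ASCII text.

```php
<?php
// Sketch only: group consecutive capitalized words into phrases.
// Assumptions: ASCII text; ignore list is matched case-insensitively.

function capitalized_phrases(string $text, array $ignore = []): array
{
    $ignore = array_map('strtolower', $ignore);

    // Tokens are either runs of letters or one non-letter, non-space character.
    // Digits therefore never become part of a word (rule 6).
    preg_match_all('/[A-Za-z]+|[^A-Za-z\s]/', $text, $m);

    $phrases = [];
    $run = [];
    foreach ($m[0] as $token) {
        // Ignored words are dropped but do not end the run (rule 4)
        if (in_array(strtolower($token), $ignore, true)) {
            continue;
        }
        // A capitalized word extends the current run (rules 1 and 5)
        if (preg_match('/^[A-Z]/', $token) === 1) {
            $run[] = $token;
            continue;
        }
        // Lowercase words, special characters, and digits end the run (rules 2 and 3)
        if ($run !== []) {
            $phrases[] = implode(' ', $run);
            $run = [];
        }
    }
    if ($run !== []) {
        $phrases[] = implode(' ', $run);
    }
    return $phrases;
}

$sentence = 'Officials in Memphis, Tennessee, are bracing for the Mississippi River '
          . 'to crest Tuesday at its highest level in more than 70 years.';
print_r(capitalized_phrases($sentence, ['in']));
// Officials Memphis, Tennessee, Mississippi River, Tuesday
```

Whether an ignored word should end a run instead of being skipped is a design choice; skipping it is what produces "Officials Memphis" in rule 4.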
"Can you modify the script to incorporate these rules?"
No, sorry, I cannot. They are not really "rules" in the programmatic sense: they do not describe a generalized, programmable process that identifies a semantic approach to the problem. Unfortunately, computer programs suffer from semantic aphasia. This is not a question; it is a loosely described requirement for application development, and for that you really want to hire a professional developer who can devote the time necessary to work with you and bring the details of the requirement into focus.

I can help with this part.
6. Numbers are not a part of the word list.

To implement that, change line 30 as shown in the code snippet.

As I said, it is an interesting application with a lot of moving parts.  You might consider full-text searches, or using a Google search appliance where much of this sort of search intelligence is already built in.
. 'A-Z '      // LETTERS AND BLANKS
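
The fragment above is one piece of a concatenated regex. Assuming it sits inside a negated character class with the case-insensitive flag (as in the listing posted later in this thread), the whole expression behaves like the sketch below. The helper name `letters_only` is made up for illustration.

```php
<?php
// Sketch only: keep letters and blanks, drop digits and punctuation.
// letters_only() is a hypothetical helper name.

function letters_only(string $text): string
{
    // [^A-Z ] with the i flag matches anything that is not a letter or a blank
    $clean = preg_replace('/[^A-Z ]/i', '', $text);
    // Removing characters can leave doubled blanks behind; collapse them
    $clean = preg_replace('/ {2,}/', ' ', $clean);
    return trim($clean);
}

echo letters_only('The river is expected to reach 48 feet -- just shy of the 48.7-foot record set in 1937');
// The river is expected to reach feet just shy of the foot record set in
```

Note that the replacement argument is an empty string; passing NULL there is deprecated as of PHP 8.1.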


nainil

ASKER

Thanks. I understand that words don't behave the way we want. I am wondering whether there is at least a way to find words that start with a capital letter and are adjacent to each other.


Officials in Memphis, Tennessee, are bracing for the Mississippi River to crest Tuesday at its highest level in more than 70 years.

In our case it would be "Mississippi River"
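
For this narrower question, a single regular expression may be enough. The sketch below matches a capitalized word followed by one or more further capitalized words separated only by whitespace, so a comma or a lowercase word ends the match. The function name `adjacent_capitalized` is hypothetical, and only ASCII letters are considered.

```php
<?php
// Sketch only: find runs of two or more adjacent capitalized words.
// adjacent_capitalized() is a hypothetical helper name; ASCII letters only.

function adjacent_capitalized(string $text): array
{
    // [A-Z][a-zA-Z]*            one capitalized word
    // (?:\s+[A-Z][a-zA-Z]*)+    one or more further capitalized words,
    //                           separated by whitespace only
    preg_match_all('/\b[A-Z][a-zA-Z]*(?:\s+[A-Z][a-zA-Z]*)+\b/', $text, $m);
    return $m[0];
}

$sentence = 'Officials in Memphis, Tennessee, are bracing for the Mississippi River '
          . 'to crest Tuesday at its highest level in more than 70 years.';
print_r(adjacent_capitalized($sentence));
// Mississippi River
```

Changing the final `+` to `*` would also return single capitalized words such as "Tennessee" and "Tuesday".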
nainil

ASKER

With some additional help I now have the following. I am hoping this code can now be optimized:

<?php // RAY_temp_nainil.php
error_reporting(E_ALL);


// DEMONSTRATE HOW TO FIND CAPITALIZED WORDS AND HOW TO IGNORE CERTAIN WORDS


// THE INPUT DATA
$str = <<<EOD
Officials in Memphis, Tennessee, are bracing for the Mississippi River to crest Tuesday at its highest level in more than 70 years World Wide Web india.
EOD;

// THE IGNORED WORDS
$ign 
= array
( 'Officials'
, 'Foobar'
)
;

// A PLACE TO KEEP THE WORDS WE FOUND
$out = array();

/* 
// REMOVE PUNCTUATION FROM THE INPUT STRING
$rgx 
= '/'         // START REGEX DELIMITER
. '['         // START CHARACTER CLASS
. '^'         // NOT THE FOLLOWING
. 'A-Z 0-9'   // LETTERS NUMBERS AND BLANKS
. ']'         // END CHARACTER CLASS
. '/'         // END REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;
$new = preg_replace($rgx, NULL, $str);
*/


/* Pad special characters with spaces so that each becomes a separate token */
function clean_url2($text2)
{
#### FUNCTION BY WWW.WEBUNE.COM AND WALLPAPERAMA.COM

$code_entities_match2 = array( '&quot;' ,'!' ,'@' ,'#' ,'$' ,'%' ,'^' ,'&' ,'*' ,'(' ,')' ,'+' ,'{' ,'}' ,'|' ,':' ,'"' ,'<' ,'>' ,'?' ,'[' ,']' ,'' ,';' ,"'" ,',' ,'.' ,'_' ,'/' ,'*' ,'+' ,'~' ,'`' ,'=' ,' ' ,'---' ,'--','--');
$code_entities_replace2 = array( ' &quot; ' ,' ! ' ,' @ ' ,' # ' ,' $ ' ,' % ' ,' ^ ' ,' & ' ,' * ' ,' ( ' ,' ) ' ,' + ' ,' { ' ,' } ' ,' | ' ,' : ' ,' " ' ,' < ' ,' > ' ,' ? ' ,' [ ' ,' ] ' ,'' ,' ; ' ," ' " ,' , ' ,' . ' ,' _ ' ,' / ' ,' * ' ,' + ' ,' ~ ' ,' ` ' ,' = ' ,' ' ,' --- ' ,' -- ',' -- '); // surround each special character with spaces (nothing is removed here)
$text2 = str_replace($code_entities_match2, $code_entities_replace2, $text2);
return $text2;
} 


$new = clean_url2($str);

echo "Special Character Space: ".$new."<br>";

// MAKE AN ARRAY FROM THE STRING WITH ONE WORD IN EACH POSITION
$arr = explode(' ', $new);

$word1 = "";
$word2 = "";
$word3 = "";

// ITERATE OVER THE ARRAY OF WORDS
foreach ($arr as $wrd)
{
    // MAN PAGE http://php.net/manual/en/function.in-array.php
    if (in_array($wrd, $ign)) continue;
    

		// REMOVE PUNCTUATION AND DIGITS FROM THE CURRENT WORD
		$rgx 
		= '/'         // START REGEX DELIMITER
		. '['         // START CHARACTER CLASS
		. '^'         // NOT THE FOLLOWING
		. 'A-Z '      // LETTERS AND BLANKS
		. ']'         // END CHARACTER CLASS
		. '/'         // END REGEX DELIMITER
		. 'i'         // CASE-INSENSITIVE
		;

		// EMPTY STRING RATHER THAN NULL: A NULL REPLACEMENT IS DEPRECATED AS OF PHP 8.1
		$wrd = preg_replace($rgx, '', $wrd);

//		echo $new."<br>";

	if ($wrd !="") { // begin if $wrd != ""

    // MAN PAGE http://php.net/manual/en/function.ucfirst.php
    $ucw = ucfirst($wrd);

    if ($ucw != $wrd) {
    
		$word1 = "";
		$word2 = "";
		$word3 = "";

	continue;
	}

	/*

	Check if 1 is empty
		if 1 is empty, add the word to 1

				Check if 2 is empty
					if 2 is empty, add the word to 2

						Check if 3 is empty
							if 3 is empty, add the word to 3
						else
							1=2
							2=3
							3=NEW

	FINAL 1 & 2 = 1 + 2
	Final 2 & 3 = 2 + 3
	Final 1 & 3	= 1 + 3
	FINAL3 = 1 + 2 + 3

	*/

	if ($word1 == "")
	{
		$word1 = $wrd;
	}
	else // if word1
	{
		if($word2 == "")
		{
			$word2 = $wrd;
		}
		else // if word2
		{
			if($word3 == "")
			{
				$word3 = $wrd;
			}
			else // if word 3
			{
				$word1 = $word2;
				$word2 = $word3;
				$word3 = $wrd;
			}
		}
	}

	$word123	= $word1. ' ' .$word2. ' ' .$word3;
	$word12		= $word1. ' ' .$word2;
	$word13		= $word1. ' ' .$word3;

	echo '<BR>$word123: '.$word123;
	// echo '<BR>$word12: '.$word12;
	// echo '<BR>$word13: '.$word13;
	echo '<hr>';

    // SAVE THE UPPERCASE WORD
//    $out[] = $wrd;
} // end if $wrd != ""

} // end foreach

// SHOW THE WORK PRODUCT
echo "<pre>";
echo PHP_EOL . "WE STUDIED $str";
echo PHP_EOL . "WE MADE IT $new";
echo PHP_EOL . "WE IGNORED " . implode(' ', $ign);
echo PHP_EOL . "WE LOCATED " . implode(' ', $out);
echo PHP_EOL . "</pre>"; // CLOSE THE PRE TAG OPENED ABOVE
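
As one direction for tidying the listing above, the three variables `$word1`, `$word2`, and `$word3` can be replaced by a single array used as a sliding window. This is only a sketch of that idea, not a drop-in replacement: the function name `capitalized_windows` is made up, and it assumes the words have already been stripped of punctuation and digits.

```php
<?php
// Sketch only: rolling window over consecutive capitalized words.
// capitalized_windows() is a hypothetical name; input words are assumed
// to contain letters only (already cleaned).

function capitalized_windows(array $words, int $size = 3): array
{
    $window = [];    // the most recent consecutive capitalized words
    $snapshots = []; // window contents after each capitalized word

    foreach ($words as $word) {
        if ($word === '') {
            continue; // empty tokens are skipped, as in the listing
        }
        if (preg_match('/^[A-Z]/', $word) !== 1) {
            $window = []; // a lowercase word resets the window
            continue;
        }
        $window[] = $word;
        if (count($window) > $size) {
            array_shift($window); // drop the oldest word
        }
        $snapshots[] = implode(' ', $window);
    }
    return $snapshots;
}

$words = explode(' ', 'years World Wide Web india Mississippi River Delta Region');
print_r(capitalized_windows($words));
// World, World Wide, World Wide Web, Mississippi, Mississippi River,
// Mississippi River Delta, River Delta Region
```

Because the window is an array, changing `$size` is enough to track longer or shorter phrases.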
