Parse RSS Feeds for words that start with a Capital Letter

Hi,

I have a news site which provides daily RSS feeds (around 15 feeds a day). I am looking for a script that will parse the RSS news and store the following in a MySQL table:

1. Date & Time of the RSS News
2. News Title
3. News Hyper Link
4. Words that begin with a capital letter. The parser must be smart enough to handle cases like the ones in the sample below.

Here is a sample of the RSS text, followed by the keywords that the script should parse and store in the database:
Officials in Memphis, Tennessee, are bracing for the Mississippi River to crest Tuesday at its highest level in more than 70 years.

The river is expected to reach 48 feet -- just shy of the 48.7-foot record set in 1937 -- shortly after midnight Tuesday, National Weather Service meteorologist Bill Borghoff told CNN Sunday.

By daybreak Sunday, the Mississippi had already reached 47.3 feet.

Local officials have a "high level of confidence" that the prediction is accurate, said Bob Nations, director of preparedness in Shelby County, Tennessee.

Residents in 1,100 trailers and homes in low-lying areas in the county, which includes Memphis and surrounding areas, have already been told to evacuate. As of Saturday, 367 residents had moved into shelters, Borghoff said.


---

Expected Keywords Parsed & Stored:
Officials in Memphis
Tennessee
Mississippi River
Tuesday
National Weather Service
Bill Borghoff
CNN Sunday
Mississippi
Local
Shelby County
Residents
Memphis
Saturday
Borghoff




Note: it should have an ignore-word list so that words I choose are not matched.

Any help is appreciated.
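
For items 1 to 3 (date, title, link), a minimal sketch might look like the following. It assumes an RSS 2.0 feed and PHP's SimpleXML extension; the function name `parse_rss_items` and the sample feed are made up for illustration. Storing the rows in MySQL is left out here (a prepared statement per row would be one option).

```php
<?php
// Hypothetical helper: extract date, title and link from an RSS 2.0 document.
// Assumes the standard channel/item layout; real feeds may differ.
function parse_rss_items(string $xml): array
{
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        return array(); // not well-formed XML
    }

    $rows = array();
    foreach ($feed->channel->item as $item) {
        // strtotime() understands the RFC 822 dates that RSS 2.0 uses
        $ts = strtotime((string) $item->pubDate);
        $rows[] = array(
            'published' => ($ts === false) ? null : gmdate('Y-m-d H:i:s', $ts),
            'title'     => trim((string) $item->title),
            'link'      => trim((string) $item->link),
        );
    }
    return $rows;
}

// Made-up sample feed, for illustration only
$sample = <<<XML
<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example</title>
<item>
  <title>River to crest Tuesday</title>
  <link>http://example.com/news/1</link>
  <pubDate>Sun, 08 May 2011 12:00:00 GMT</pubDate>
</item>
</channel></rss>
XML;

$rows = parse_rss_items($sample);
print_r($rows);
```

Each row is a plain array with `published`, `title` and `link` keys, so it can be passed to whatever database layer is in use.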
Asked by: nainil
 
Ray Paseur commented:
This sounds like an interesting application but it also seems to lack some consolidation of thought.  For example, what are the rules that would cause the program to store two words for Bill Borghoff, and to also store Borghoff but not also store Bill?  How would the program know that it was expected to overlook Bob Nations?  And how would the program know to store CNN Sunday but not CNN or Sunday separately?  Those are the kinds of things that you will need to figure out before you can write programming to do this.

I can help with a part of this - finding capitalized words and finding words in an ignore list.  You might want to consider whether you want to keep or discard numbers; this algorithm will consider numbers to be the same as capitalized words.  And since most PHP string functions are case-sensitive, you may want to be on the lookout for issues related to that as you do your programming.  Example: This algorithm would ignore Foobar but would not ignore FOOBAR.
http://www.laprbass.com/RAY_temp_nainil.php

Outputs:
WE STUDIED Officials in Memphis, Tennessee, are bracing for the Mississippi River to crest Tuesday at its highest level in more than 70 years.
WE MADE IT Officials in Memphis Tennessee are bracing for the Mississippi River to crest Tuesday at its highest level in more than 70 years
WE IGNORED Officials Tuesday Foobar
WE LOCATED Memphis Tennessee Mississippi River 70

Best of luck with it, ~Ray
<?php // RAY_temp_nainil.php
error_reporting(E_ALL);


// DEMONSTRATE HOW TO FIND CAPITALIZED WORDS AND HOW TO IGNORE CERTAIN WORDS


// THE INPUT DATA
$str = <<<EOD
Officials in Memphis, Tennessee, are bracing for the Mississippi River to crest Tuesday at its highest level in more than 70 years.
EOD;

// THE IGNORED WORDS
$ign 
= array
( 'Officials'
, 'Tuesday'
, 'Foobar'
)
;

// A PLACE TO KEEP THE WORDS WE FOUND
$out = array();

// REMOVE PUNCTUATION FROM THE INPUT STRING
$rgx 
= '/'         // START REGEX DELIMITER
. '['         // START CHARACTER CLASS
. '^'         // NOT THE FOLLOWING
. 'A-Z 0-9'   // LETTERS NUMBERS AND BLANKS
. ']'         // END CHARACTER CLASS
. '/'         // END REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;
$new = preg_replace($rgx, '', $str); // EMPTY STRING, NOT NULL (NULL IS DEPRECATED HERE AS OF PHP 8.1)

// MAKE AN ARRAY FROM THE STRING WITH ONE WORD IN EACH POSITION
$arr = explode(' ', $new);

// ITERATE OVER THE ARRAY OF WORDS
foreach ($arr as $wrd)
{
    // MAN PAGE http://php.net/manual/en/function.in-array.php
    if (in_array($wrd, $ign)) continue;
    
    // MAN PAGE http://php.net/manual/en/function.ucfirst.php
    $ucw = ucfirst($wrd);
    if ($ucw != $wrd) continue;
    
    // SAVE THE UPPERCASE WORD
    $out[] = $wrd;
}

// SHOW THE WORK PRODUCT
echo "<pre>";
echo PHP_EOL . "WE STUDIED $str";
echo PHP_EOL . "WE MADE IT $new";
echo PHP_EOL . "WE IGNORED " . implode(' ', $ign);
echo PHP_EOL . "WE LOCATED " . implode(' ', $out);
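
The case-sensitivity caveat mentioned above (Foobar is ignored, FOOBAR is not) can be avoided by normalizing both sides before comparing. A small sketch, assuming ASCII words; `is_ignored` is a hypothetical helper name and not part of the script above.

```php
<?php
// Hypothetical helper: case-insensitive membership test for the ignore list.
function is_ignored(string $word, array $ignore): bool
{
    // Lower-case both sides so 'Foobar', 'FOOBAR' and 'foobar' all match
    $needle   = strtolower($word);
    $haystack = array_map('strtolower', $ignore);
    return in_array($needle, $haystack, true);
}

$ign = array('Officials', 'Tuesday', 'Foobar');

var_dump(is_ignored('FOOBAR', $ign));  // bool(true)
var_dump(is_ignored('Memphis', $ign)); // bool(false)
```

In the loop, `if (is_ignored($wrd, $ign)) continue;` would take the place of the `in_array()` test.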

 
nainil (author) commented:
Thanks. This is a good example. However, here are the rules that need to be implemented in the above script:

Sentence Example:

Officials in Memphis, Tennessee, are bracing for the Mississippi River to crest Tuesday at its highest level in more than 70 years.

1. "Mississippi River" should be treated as a single joint word in the result, alongside the other words.

2. "Tennessee" and "Tuesday" should be treated as separate words.

3. A special character marks the end of a word.

4. "Officials Memphis" should be treated as a single joint word (assuming "in" is on the ignored-word list), with "Memphis" being the last word before the special character (a comma in this case).

5. "CNN Sunday" should be treated as a joint word and stored as "CNN Sunday".

6. Numbers are not a part of the word list.

Can you modify the script to incorporate these rules?
 
Ray Paseur commented:
"Can you modify the script to incorporate these rules?"
No, sorry, I cannot.  They are not really "rules" in the programmatic sense -- they do not describe a generalized, programmable process that identifies a semantic approach to the problem.  Unfortunately computer programs suffer from semantic aphasia.  This is not a question; it is a loosely described requirement for application development and for that you really want to hire a professional developer, one who can devote the time necessary to work with you to bring focus to the details of the requirement.

I can help with this part:
"6. Numbers are not a part of the word list."

To implement that, change line 30 of the script as shown in the code snippet below.

As I said, it is an interesting application with a lot of moving parts.  You might consider full-text searches, or using a Google search appliance where much of this sort of search intelligence is already built in.
. 'A-Z '      // LETTERS AND BLANKS
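
One thing to watch with this change: once "70" is stripped, two blanks are left side by side, and `explode(' ', ...)` then yields an empty string, which passes the `ucfirst()` comparison in the original loop. A hedged sketch of one way around it, using `preg_split()` with `PREG_SPLIT_NO_EMPTY`; the shortened input string is for illustration only.

```php
<?php
$str = 'in more than 70 years.';

// Same idea as the changed line 30: keep only letters and blanks
$new = preg_replace('/[^A-Z ]/i', '', $str);

// explode() keeps the empty token left between the two blanks
$with_explode = explode(' ', $new);

// preg_split() with PREG_SPLIT_NO_EMPTY drops empty tokens
$with_split = preg_split('/\s+/', $new, -1, PREG_SPLIT_NO_EMPTY);

var_dump(in_array('', $with_explode, true)); // bool(true)
var_dump(in_array('', $with_split, true));   // bool(false)
```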

 
nainil (author) commented:
Thanks. I understand that words don't behave like we want. I am wondering if there is at least a way to find words that start with a capital letter and are adjacent to each other.


Officials in Memphis, Tennessee, are bracing for the Mississippi River to crest Tuesday at its highest level in more than 70 years.

In our case it would be "Mississippi River"
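
One way to find runs of adjacent capitalized words is a single regular expression. This is only a sketch: it assumes ASCII letters and that only blanks may sit between the words of a run (so a comma ends it). `find_capitalized_runs` is a hypothetical name.

```php
<?php
// Hypothetical helper: return runs of adjacent capitalized words.
function find_capitalized_runs(string $text): array
{
    // One capitalized word, optionally followed by more capitalized words
    // separated only by blanks. Punctuation therefore ends a run.
    $rgx = '/\b[A-Z][A-Za-z]*(?: +[A-Z][A-Za-z]*)*\b/';
    preg_match_all($rgx, $text, $matches);
    return $matches[0];
}

$str = 'Officials in Memphis, Tennessee, are bracing for the Mississippi River '
     . 'to crest Tuesday at its highest level in more than 70 years.';

$runs = find_capitalized_runs($str);
print_r($runs);
```

For the sample sentence this returns Officials, Memphis, Tennessee, Mississippi River and Tuesday. Single words are included too; keeping only entries that contain a blank would leave just "Mississippi River".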
 
nainil (author) commented:
With some additional help I now have the following. I am hoping this code can now be optimized:

<?php // RAY_temp_nainil.php
error_reporting(E_ALL);


// DEMONSTRATE HOW TO FIND CAPITALIZED WORDS AND HOW TO IGNORE CERTAIN WORDS


// THE INPUT DATA
$str = <<<EOD
Officials in Memphis, Tennessee, are bracing for the Mississippi River to crest Tuesday at its highest level in more than 70 years World Wide Web india.
EOD;

// THE IGNORED WORDS
$ign 
= array
( 'Officials'
, 'Foobar'
)
;

// A PLACE TO KEEP THE WORDS WE FOUND
$out = array();

/* 
// REMOVE PUNCTUATION FROM THE INPUT STRING
$rgx 
= '/'         // START REGEX DELIMITER
. '['         // START CHARACTER CLASS
. '^'         // NOT THE FOLLOWING
. 'A-Z 0-9'   // LETTERS NUMBERS AND BLANKS
. ']'         // END CHARACTER CLASS
. '/'         // END REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;
$new = preg_replace($rgx, NULL, $str);
*/


/* Remove Unwanted Characters Function  */
function clean_url2($text2)
{
#### FUNCTION BY WWW.WEBUNE.COM AND WALLPAPERAMA.COM

$code_entities_match2 = array( '&quot;' ,'!' ,'@' ,'#' ,'$' ,'%' ,'^' ,'&' ,'*' ,'(' ,')' ,'+' ,'{' ,'}' ,'|' ,':' ,'"' ,'<' ,'>' ,'?' ,'[' ,']' ,'' ,';' ,"'" ,',' ,'.' ,'_' ,'/' ,'*' ,'+' ,'~' ,'`' ,'=' ,' ' ,'---' ,'--','--');
$code_entities_replace2 = array( ' &quot; ' ,' ! ' ,' @ ' ,' # ' ,' $ ' ,' % ' ,' ^ ' ,' & ' ,' * ' ,' ( ' ,' ) ' ,' + ' ,' { ' ,' } ' ,' | ' ,' : ' ,' " ' ,' < ' ,' > ' ,' ? ' ,' [ ' ,' ] ' ,'' ,' ; ' ," ' " ,' , ' ,' . ' ,' _ ' ,' / ' ,' * ' ,' + ' ,' ~ ' ,' ` ' ,' = ' ,' ' ,' --- ' ,' -- ',' -- '); // pad each special character with blanks so it becomes a separate token
$text2 = str_replace($code_entities_match2, $code_entities_replace2, $text2);
return $text2;
} 

$new = clean_url2($str);

echo "Special Character Space: ".$new."<br>";

// MAKE AN ARRAY FROM THE STRING WITH ONE WORD IN EACH POSITION
$arr = explode(' ', $new);

$word1 = "";
$word2 = "";
$word3 = "";

// ITERATE OVER THE ARRAY OF WORDS
foreach ($arr as $wrd)
{
    // MAN PAGE http://php.net/manual/en/function.in-array.php
    if (in_array($wrd, $ign)) continue;
    

		// REMOVE PUNCTUATION FROM THE INPUT STRING
		$rgx 
		= '/'         // START REGEX DELIMITER
		. '['         // START CHARACTER CLASS
		. '^'         // NOT THE FOLLOWING
		. 'A-Z '      // LETTERS AND BLANKS
		. ']'         // END CHARACTER CLASS
		. '/'         // END REGEX DELIMITER
		. 'i'         // CASE-INSENSITIVE
		;

		$wrd = preg_replace($rgx, '', $wrd); // EMPTY STRING, NOT NULL (NULL IS DEPRECATED HERE AS OF PHP 8.1)

//		echo $new."<br>";

	if ($wrd !="") { // begin if $wrd != ""

    // MAN PAGE http://php.net/manual/en/function.ucfirst.php
    $ucw = ucfirst($wrd);

    if ($ucw != $wrd) {
    
		$word1 = "";
		$word2 = "";
		$word3 = "";

	continue;
	}

	/*

	Check if 1 is empty
		if 1 is empty, add the word to 1

				Check if 2 is empty
					if 2 is empty, add the word to 2

						Check if 3 is empty
							if 3 is empty, add the word to 3
						else
							1=2
							2=3
							3=NEW

	FINAL 1 & 2 = 1 + 2
	Final 2 & 3 = 2 + 3
	Final 1 & 3	= 1 + 3
	FINAL3 = 1 + 2 + 3

	*/

	if ($word1 == "")
	{
		$word1 = $wrd;
	}
	else // if word1
	{
		if($word2 == "")
		{
			$word2 = $wrd;
		}
		else // if word2
		{
			if($word3 == "")
			{
				$word3 = $wrd;
			}
			else // if word 3
			{
				$word1 = $word2;
				$word2 = $word3;
				$word3 = $wrd;
			}
		}
	}

	$word123	= $word1. ' ' .$word2. ' ' .$word3;
	$word12		= $word1. ' ' .$word2;
	$word13		= $word1. ' ' .$word3;

	echo '<BR>$word123: '.$word123;
	// echo '<BR>$word12: '.$word12;
	// echo '<BR>$word13: '.$word13;
	echo '<hr>';

    // SAVE THE UPPERCASE WORD
//    $out[] = $wrd;
} // end if $wrd != ""

} // end foreach

// SHOW THE WORK PRODUCT
echo "<pre>";
echo PHP_EOL . "WE STUDIED $str";
echo PHP_EOL . "WE MADE IT $new";
echo PHP_EOL . "WE IGNORED " . implode(' ', $ign);
echo PHP_EOL . "WE LOCATED " . implode(' ', $out);
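
The three-slot window above could be replaced by a single running buffer that is flushed whenever a run ends. The sketch below is one possible reading of the rules discussed earlier, not a definitive implementation. It assumes that punctuation, lowercase words and numbers end a run, and that words on the ignore list are skipped without ending the run (which is what yields "Officials Memphis" when "in" is ignored). `extract_phrases` is a hypothetical name.

```php
<?php
// Hypothetical helper: collect runs of capitalized words into phrases.
function extract_phrases(string $text, array $ignore): array
{
    // Lower-case the ignore list once so the lookup is case-insensitive
    $skip = array_map('strtolower', $ignore);

    // Tokens are either words (letters/digits) or single punctuation marks
    preg_match_all('/[A-Za-z0-9]+|[^\sA-Za-z0-9]/', $text, $m);

    $out = array();
    $run = array();
    foreach ($m[0] as $tok) {
        if (in_array(strtolower($tok), $skip, true)) {
            continue; // ignored words neither join nor end a run
        }
        if (preg_match('/^[A-Z][A-Za-z]*$/', $tok)) {
            $run[] = $tok; // capitalized word extends the current run
            continue;
        }
        // Anything else (lowercase word, number, punctuation) ends the run
        if ($run) {
            $out[] = implode(' ', $run);
            $run = array();
        }
    }
    if ($run) {
        $out[] = implode(' ', $run); // flush a run that reaches the end
    }
    return $out;
}

$str = 'Officials in Memphis, Tennessee, are bracing for the Mississippi River '
     . 'to crest Tuesday at its highest level in more than 70 years.';

$phrases = extract_phrases($str, array('in'));
print_r($phrases);
```

With "in" on the ignore list, the sample sentence gives Officials Memphis, Tennessee, Mississippi River and Tuesday. Names such as "Bob Nations" are still picked up, so the ignore list remains the only way to suppress them.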
