kingroland


Please advise: how to count the occurrences of each word on a webpage?

To
Mr. Expert

I am so thankful and grateful to you for all your advice in the past. It was a real LIFE-SAVER.
I really need your help again, as I am stuck while trying
to finish my web project. I have done a lot of research online, on Google, and
now I am exhausted. Any kind of help is appreciated. Thanks.
My test page crashes repeatedly with warnings. It does not work.
The test page code (still in progress) is attached in the "Code" section below.

Please advise on:
"How to count the number of occurrences of every individual word on a webpage"?

What I really need is a PHP or JavaScript script that counts each word's occurrences on the given webpage. (Only care about  words longer than 3 characters)
Not just the total word count on the webpage.

I saw on the internet that some programmers did a good job (counting every word) on
their webpages, but they did not want to share the code.
-----------------------------------------------------
For example:
Given a text block on a webpage like the one below:

"About PHP
PHP is a powerful programming language and also widely used language
in the world. PHP version 5 is currently in use version 6 is about to release.
Try PHP language."

---------------------------------

Results on Word Count (displayed on the same webpage):

PHP = 4
language = 3
version = 2
world = 1
powerful = 1
programming = 1
and so on ...




<?php

$url="http://mysite.com/search/product_info.php";

function get_web_page( $url )
{
	$options = array( 'http' => array(
		'user_agent'    => 'spider',        // who am i
		'max_redirects' => 10,              // stop after 10 redirects
		'timeout'       => 120,             // timeout on response
	) );
	$context = stream_context_create( $options );
	$page    = @file_get_contents( $url, false, $context );
	$result  = array( );
	if ( $page != false )
		$result['content'] = $page;
	else if ( !isset( $http_response_header ) )
		return null;    // Bad url, timeout

	// Save the header
	$result['header'] = $http_response_header;

	// Get the *last* HTTP status code
	$nLines = count( $http_response_header );
	for ( $i = $nLines-1; $i >= 0; $i-- )
	{
		$line = $http_response_header[$i];
		if ( strncasecmp( "HTTP", $line, 4 ) == 0 )
		{
			$response = explode( ' ', $line );
			$result['http_code'] = $response[1];
			break;
		}
	}

	return $result;
}


/* Get the content type from CURL */
//$content_type = curl_getinfo( $ch, CURLINFO_CONTENT_TYPE );
 
/* Get the MIME type and character set */
preg_match( '@([\w/+]+)(;\s+charset=(\S+))?@i', $content_type, $matches );
if ( isset( $matches[1] ) )
    $mime = $matches[1];
if ( isset( $matches[3] ) )
    $charset = $matches[3];


/* Get the MIME type and character set */
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $page, $matches );
if ( isset( $matches[1] ) )
    $mime = $matches[1];
if ( isset( $matches[3] ) )
    $charset = $matches[3];

$contentType = $result['content_type'];
preg_match( '@([\w/+]+)(;\s+charset=(\S+))?@i', $contentType, $matches );
$charset = $matches[3];



$text = iconv( $charset, "utf-8", $encodedText );

$text = strip_html_tags( $text);

$text = html_entity_decode( $text, ENT_QUOTES, "utf-8" );

$text = strip_punctuation( $text );

$text = strip_symbols( $text );


$text = strip_numbers( $text );



$text = mb_strtolower( $text, "utf-8" );


mb_regex_encoding( "utf-8" );
$words = mb_split( ' +', $text );


foreach ( $words as $key => $word )
    $words[$key] = PorterStemmer::Stem( $word, true );



$stopText  = file_get_contents( $stopWordsFilename );
$stopWords = mb_split( '[ \n]+', mb_strtolower( $stopText, 'utf-8' ) );
foreach ( $stopWords as $key => $word )
    $stopWords[$key] = PorterStemmer::Stem( $word, true );


$words = array_diff( $words, $stopWords );

$words = array_diff( $words, $unwantedWords );



$keywordCounts = array_count_values( $words );
arsort( $keywordCounts, SORT_NUMERIC );

$uniqueKeywords = array_keys( $keywordCounts );



echo $keywordCounts ;

echo $words;

?>


VanHackman


PHP has a function that will help you a lot: str_word_count()

Here is the info from the PHP Manual:

http://php.net/manual/en/function.str-word-count.php

As you can see, all you have to do is pass the function the text block to be analyzed, and (with format 1) it will return an array containing each word found in the text block.

$Words = str_word_count($Text, 1);

After that you can remove duplicated words with:

$Words = array_unique($Words) ;

http://php.net/manual/en/function.array-unique.php

Finally just make a loop and use: substr_count()

http://www.php.net/manual/en/function.substr-count.php

and you will get the number of occurrences for each word.

something like:

foreach($Words as $Word){

$Occurrences = substr_count($Text,$Word);

echo "The Word ".$Word." is ".$Occurrences." times in the text.";
}

So I think that your job is done.. :)

Bye!

hielo
>>Finally just make a loop and use: substr_count()
No way. substr_count() counts substrings. So if you have:
programming language in the world

when it looks for "in", it will find two because of the "in" inside "programming", which is not what he wants.
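
To make the difference concrete, here is a minimal sketch comparing substr_count() with a whole-word match. The regex approach is only one possible way to count whole words; it is not taken from any of the posts in this thread.

```php
<?php
$text = 'programming language in the world';

// substr_count() counts raw substrings, so the "in" inside "programming" is included.
$asSubstring = substr_count($text, 'in');

// A regex with \b word boundaries only matches "in" as a standalone word.
$asWord = preg_match_all('/\bin\b/', $text);

echo "substring matches: $asSubstring\n";  // 2
echo "whole-word matches: $asWord\n";      // 1
```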

>>$Words = array_unique($Words) ;
No way. Use that duplication to count the occurrences of each word.
<?php
header("Content-Type: text/plain");
function sortIt($a,$b){
	if($a==$b)return 0;
	if($a<$b) return 1;
	return -1;
}
$str=<<<END2
"About PHP
PHP is a powerful programming language and also widely used language
in the world. PHP version 5 is currently in use version 6 is about to release.
Try PHP language."

END2;
echo $str;
$word=str_word_count($str, 1);

$result=array();
foreach($word as $v)
{
	if(!isset($result[$v]))
	{
		$result[$v]=0;
	}
	++$result[$v];
}

uasort($result,"sortIt");
print_r($result);
?>


Hello kingroland,

Try this :


<?php
    $page = "About, PHP\nPHP is a powerful programming language and also widely used language \nin the world. PHP version 5 is currently in use version 6 is about to release.\nTry PHP language.";
    // REMOVE WORDS WITH ONE OR TWO LETTERS
    $a = preg_replace("!\\b\\w{1,2}\\b!", "", $page);
    // REPLACE NON ALPHANUMERICAL CHAR BY A SPACE
    $a = preg_replace("/\W/", " ", $a);
    // BUILD AN ARRAY BY SPLITTING THE PAGE EACH SPACE FOUND
    $a = preg_split("/\s+/", trim($a));
    // JUST FOR FUN(?) WE SORT THE ARRAY
    sort($a);
    // COUNT OCCURRENCES OF EACH WORD IN ARRAY $a
    $c = array_count_values($a);
    // DISPLAY EACH WORD AND ITS NUMBER OF OCCURRENCES
    foreach($c as $w=>$o) echo $w . " : " . $o . "\n";    
?>
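
The snippet above starts from plain text. For a real page you would probably need to strip the markup first. Here is a rough sketch of that step, assuming the HTML is already in a string; the $html value below is made up for illustration, and the regex used to drop script and style blocks is a simplification, not a full HTML parser.

```php
<?php
// Hypothetical HTML standing in for a fetched page (for example via file_get_contents()).
$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><h1>About PHP</h1><p>PHP is a powerful &amp; widely used language.</p>'
      . '<script>var language = "js";</script></body></html>';

// Drop <script> and <style> blocks first, because strip_tags() keeps their inner text.
$text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
// Put a space before every tag so words in adjacent elements do not run together.
$text = str_replace('<', ' <', $text);
$text = strip_tags($text);
// Turn entities such as &amp; back into plain characters.
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

// Count words case-insensitively and sort by count, highest first.
$counts = array_count_values(str_word_count(strtolower($text), 1));
arsort($counts);
print_r($counts);
```

Note that str_word_count() is aimed at plain ASCII text, so pages with accented or non-Latin words would need a different splitter.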



@hielo: I don't agree with you at all, but anyway your method will work too.
ASKER CERTIFIED SOLUTION
Ray Paseur

>>@hielo: I don't agree with you at all,
No problem.  However, it seems to me you don't understand what substr_count() does and what you suggested will not work. Here is what you suggested followed by the output:


<?php
$Text=<<<END
About PHP
PHP is a powerful programming language and also widely used language
in the world. PHP version 5 is currently in use version 6 is about to release.
Try PHP language.
END;

$Words = str_word_count($Text, 1);
$Words = array_unique($Words) ;
foreach($Words as $Word){

$Occurrences = substr_count($Text,$Word);

echo "\nThe Word ".$Word." is ".$Occurrences." times in the text.";
}
?>

The Word About is 1 times in the text.
The Word PHP is 4 times in the text.
The Word is is 3 times in the text.
The Word a is 12 times in the text.
The Word powerful is 1 times in the text.
The Word programming is 1 times in the text.
The Word language is 3 times in the text.
The Word and is 1 times in the text.
The Word also is 1 times in the text.
The Word widely is 1 times in the text.
The Word used is 1 times in the text.
The Word in is 3 times in the text.
The Word the is 1 times in the text.
The Word world is 1 times in the text.
The Word version is 2 times in the text.
The Word currently is 1 times in the text.
The Word use is 2 times in the text.
The Word about is 1 times in the text.
The Word to is 1 times in the text.
The Word release is 1 times in the text.
The Word Try is 1 times in the text.

Look at the fourth line. The word "a" appears once, not 12 times. The word "in" appears twice, not 3 times.


Regarding this: "(Only care about  words longer than 3 characters)"

Please note that the string, "PHP" is not longer than 3 characters

 ;-)
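
A small sketch of what that means in practice, using the sample text from the question. The two filters are only illustrations of the two possible readings of the requirement; which one kingroland actually wants is an assumption either way.

```php
<?php
$text = "About PHP\nPHP is a powerful programming language and also widely used language\n"
      . "in the world. PHP version 5 is currently in use version 6 is about to release.\n"
      . "Try PHP language.";

$counts = array_count_values(str_word_count($text, 1));

// "Longer than 3 characters" taken literally: strlen > 3, which discards "PHP".
$longerThan3 = array_filter($counts, function ($word) {
    return strlen($word) > 3;
}, ARRAY_FILTER_USE_KEY);

// Relaxed reading: at least 3 characters, which keeps "PHP".
$atLeast3 = array_filter($counts, function ($word) {
    return strlen($word) >= 3;
}, ARRAY_FILTER_USE_KEY);

arsort($atLeast3);
print_r($atLeast3);
```

With the literal filter the expected "PHP = 4" line from the question can never appear, so the relaxed one is probably closer to the sample results.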

@hielo:

"what you suggested will not work"

No way.  I didn't worry about "a" and "in" because kingroland said:

"(Only care about  words longer than 3 characters)"

Anyway, if you consider that a mistake, just change this line:

$Occurrences = substr_count($Text,$Word);

to:

$Occurrences = substr_count($Text, $Word.' ');

and add:

$search  = array('.', ',', ';');
$replace = array(' ', ' ', ' ');
$Text  = str_replace($search,$replace,$Text);

before that line.

And by the way, I think the code using my method is much simpler than yours. =)

As you can see the code will work perfectly:




<?php
$Text=<<<END
About PHP
PHP is a powerful programming language and also widely used language
in the world. PHP version 5 is currently in use version 6 is about to release.
Try PHP language.
END;

$search  = array('.', ',', ';');
$replace = array(' ', ' ', ' ');
$Text  = str_replace($search,$replace,$Text);

$Words = str_word_count($Text, 1);
$Words = array_unique($Words) ;
foreach($Words as $Word){

$Occurrences = substr_count($Text,$Word.' ');

echo "\nThe Word ".$Word." is ".$Occurrences." times in the text.".'<br>';
}
?>

The Word About is 1 times in the text.
The Word PHP is 3 times in the text.
The Word is is 3 times in the text.
The Word a is 1 times in the text.
The Word powerful is 1 times in the text.
The Word programming is 1 times in the text.
The Word language is 2 times in the text.
The Word and is 1 times in the text.
The Word also is 1 times in the text.
The Word widely is 1 times in the text.
The Word used is 1 times in the text.
The Word in is 2 times in the text.
The Word the is 1 times in the text.
The Word world is 1 times in the text.
The Word version is 2 times in the text.
The Word currently is 1 times in the text.
The Word use is 1 times in the text.
The Word about is 1 times in the text.
The Word to is 1 times in the text.
The Word release is 1 times in the text.
The Word Try is 1 times in the text.


=>chr(10), chr(13)
=>' '.$Word.' '
and little by little you keep hacking it more and more
:)
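
To illustrate why the space-padding idea stays fragile even once newlines and punctuation are handled, here is a sketch. The normalization step is an assumption about what a fully patched version would do; it is not code from any post above.

```php
<?php
$text = "About PHP\nPHP is a powerful programming language and also widely used language\n"
      . "in the world. PHP version 5 is currently in use version 6 is about to release.\n"
      . "Try PHP language.";

// Assumed normalization: every run of non-letters (newlines, digits, punctuation) becomes one space.
$padded = ' ' . trim(preg_replace('/[^A-Za-z]+/', ' ', $text)) . ' ';

// substr_count() does not count overlapping matches. In "... PHP PHP ..." the first
// match of " PHP " consumes the space that the second match would need.
$viaPadding = substr_count($padded, ' PHP ');

// Word boundaries are zero-width, so adjacent words do not interfere.
$viaRegex = preg_match_all('/\bPHP\b/', $text);

echo "padded substr_count: $viaPadding\n";
echo "regex word match:    $viaRegex\n";
```

On this text the padded version reports 3 and the regex reports 4, because "PHP" appears twice in a row across the first line break.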

'and little by little you keep hacking it more and more'

Mmmm... I don't think so. I think that my last code solves the problem in every way
(simpler, faster and using only built-in functions... =P )

I had to modify it twice, because I didn't try to provide a script the first time.
(MSG ID = ID 26141775 )

The first modification was to the code that you wrote, to show you what I had in mind.
The second modification solves the problem with chr(10), chr(13), a tricky possibility that I didn't consider.

Anyway, my code works perfectly now and your code works too, so kingroland must choose
which one is better.

;)


>>and using only built-in functions... =P
As opposed to what? The optional sorting function that I added? :)

And as for simplicity, care for a one-liner?
print_r( $word=array_count_values(str_word_count($str, 1)) );
Oh goodness, no hackery involved!
... =P
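
One caveat with the one-liner: it is case-sensitive, so "About" and "about" are counted separately, and the result is unsorted. A possible variation, assuming case-insensitive counts sorted by frequency are wanted:

```php
<?php
$str = "About PHP\nPHP is a powerful programming language and also widely used language\n"
     . "in the world. PHP version 5 is currently in use version 6 is about to release.\n"
     . "Try PHP language.";

// Lowercase first so "About" and "about" fall into the same bucket.
$counts = array_count_values(str_word_count(strtolower($str), 1));

// Sort by count, highest first, keeping the word => count association.
arsort($counts);
print_r($counts);
```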

print_r( $word=array_count_values(str_word_count($str, 1)) );

Great method. Looks like you have spent some time dusting off the PHP Manual... jaja. =P
Anyway, good job! It is sooooo much better than your initial script, simple and efficient.
>>Looks like you have spent some time dusting off the PHP Manual
Not the manual - this post ( ID: 26141963 ). I noticed it earlier but then got engaged in our conversation... jaja. =P

Take care.
kingroland

ASKER

To
All Mr. Expert

Thank you so much for your advice.
I have tested all the scripts and all of your code works. I am so impressed
by your talent and skills in programming. I give you full credit for your good work.
I wish you more success in your careers.



I accept the solutions from all 4 experts because all the code proved to work. I appreciate their advice and honor them for their talent and skills. Thank you all.