  • Status: Solved
  • Priority: Medium
  • Security: Public

Please advise: how to count the "occurrence of each word" on a webpage?

To
Mr. Expert

I am so thankful and grateful to you for all your advice in the past. It was a real LIFE-SAVER.
I now really need your help again, as I got stuck while trying
to finish my web project. I have done a lot of research online, on Google, and
now I am exhausted. Any kind of help is appreciated. Thanks.
My test page crashes repeatedly with warnings. It does not work.
The test page code (still in progress) is attached in the "Code" section below.

Please advise on:
"How to count the number of occurrences of every individual word on a webpage"?

What I really need is a PHP or JavaScript script that counts each word's occurrences on a given webpage. (Only care about  words longer than 3 characters)
Not just the total word count on the webpage.

I saw on the internet that some programmers did a good job (counting every word) on
their webpages, but they did not want to share the code.
-----------------------------------------------------
For example:
Given text block on a webpage as below:

"About PHP
PHP is a powerful programming language and also widely used language
in the world. PHP version 5 is currently in use version 6 is about to release.
Try PHP language."

---------------------------------

Results on Word Count (displayed on the same webpage):

PHP = 4
language = 3
version = 2
world = 1
powerful = 1
programming = 1
and so on ...




<?php

$url="http://mysite.com/search/product_info.php";

function get_web_page( $url )
{
	$options = array( 'http' => array(
		'user_agent'    => 'spider',        // who am i
		'max_redirects' => 10,              // stop after 10 redirects
		'timeout'       => 120,             // timeout on response
	) );
	$context = stream_context_create( $options );
	$page    = @file_get_contents( $url, false, $context );
	$result  = array( );
	if ( $page != false )
		$result['content'] = $page;
	else if ( !isset( $http_response_header ) )
		return null;    // Bad url, timeout

	// Save the header
	$result['header'] = $http_response_header;

	// Get the *last* HTTP status code
	$nLines = count( $http_response_header );
	for ( $i = $nLines-1; $i >= 0; $i-- )
	{
		$line = $http_response_header[$i];
		if ( strncasecmp( "HTTP", $line, 4 ) == 0 )
		{
			$response = explode( ' ', $line );
			$result['http_code'] = $response[1];
			break;
		}
	}

	return $result;
}


/* Get the content type from CURL */
//$content_type = curl_getinfo( $ch, CURLINFO_CONTENT_TYPE );
 
/* Get the MIME type and character set */
preg_match( '@([\w/+]+)(;\s+charset=(\S+))?@i', $content_type, $matches );
if ( isset( $matches[1] ) )
    $mime = $matches[1];
if ( isset( $matches[3] ) )
    $charset = $matches[3];


/* Get the MIME type and character set */
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $page, $matches );
if ( isset( $matches[1] ) )
    $mime = $matches[1];
if ( isset( $matches[3] ) )
    $charset = $matches[3];

$contentType = $result['content_type'];
preg_match( '@([\w/+]+)(;\s+charset=(\S+))?@i', $contentType, $matches );
$charset = $matches[3];



$text = iconv( $charset, "utf-8", $encodedText );

$text = strip_html_tags( $text);

$text = html_entity_decode( $text, ENT_QUOTES, "utf-8" );

$text = strip_punctuation( $text );

$text = strip_symbols( $text );


$text = strip_numbers( $text );



$text = mb_strtolower( $text, "utf-8" );


mb_regex_encoding( "utf-8" );
$words = mb_split( ' +', $text );


foreach ( $words as $key => $word )
    $words[$key] = PorterStemmer::Stem( $word, true );



$stopText  = file_get_contents( $stopWordsFilename );
$stopWords = mb_split( '[ \n]+', mb_strtolower( $stopText, 'utf-8' ) );
foreach ( $stopWords as $key => $word )
    $stopWords[$key] = PorterStemmer::Stem( $word, true );


$words = array_diff( $words, $stopWords );

$words = array_diff( $words, $unwantedWords );



$keywordCounts = array_count_values( $words );
arsort( $keywordCounts, SORT_NUMERIC );

$uniqueKeywords = array_keys( $keywordCounts );



echo $keywordCounts ;

echo $words;

?>

Asked by: kingroland

VanHackman commented:

PHP has a function that will help you a lot: str_word_count()

Here is the info from the PHP manual:

http://php.net/manual/en/function.str-word-count.php

As you can see, all you have to do is pass the text block to be analyzed to the function with format 1, and it will return an array containing every word found in the text block.

$Words = str_word_count($Text, 1);

After that you can remove duplicated words with:

$Words = array_unique($Words) ;

http://php.net/manual/en/function.array-unique.php

Finally just make a loop and use: substr_count()

http://www.php.net/manual/en/function.substr-count.php

and you will get the number of occurrences of each word.

Something like:

foreach($Words as $Word){

$Occurrences = substr_count($Text,$Word);

echo "The Word ".$Word." is ".$Occurrences." times in the text.";
}

So I think that your job is done.. :)

Bye!

hielo commented:
>>Finally just make a loop and use: substr_count()
No way. substr_count() counts substrings. So if you have:
programming language in the world

when it looks for "in", it will find two due to PROGRAMMinG, which is not what he wants.

>>$Words = array_unique($Words) ;
No way. Use that duplication to count the occurrences of each word.
<?php
header("Content-Type: text/plain");
function sortIt($a,$b){
	if($a==$b)return 0;
	if($a<$b) return 1;
	return -1;
}
$str=<<<END2
"About PHP
PHP is a powerful programming language and also widely used language
in the world. PHP version 5 is currently in use version 6 is about to release.
Try PHP language."

END2;
echo $str;
$word=str_word_count($str, 1);

$result=array();
foreach($word as $v)
{
	if(!isset($result[$v]))
	{
		$result[$v]=0;
	}
	++$result[$v];
}

uasort($result,"sortIt");
print_r($result);
?>

leakim971 (Pluritechnician) commented:
Hello kingroland,

Try this :


<?php
    $page = "About, PHP\nPHP is a powerful programming language and also widely used language \nin the world. PHP version 5 is currently in use version 6 is about to release.\nTry PHP language.";
    // REMOVE WORDS WITH ONE OR TWO LETTERS
    $a = preg_replace("!\\b\\w{1,2}\\b!", "", $page);
    // REPLACE NON ALPHANUMERICAL CHAR BY A SPACE
    $a = preg_replace("/\W/", " ", $a);
    // BUILD AN ARRAY BY SPLITTING THE PAGE EACH SPACE FOUND
    $a = preg_split("/\s+/", trim($a));
    // JUST FOR FUN(?) WE SORT THE ARRAY
    sort($a);
    // COUNT OCCURRENCES OF EACH WORD IN ARRAY $a
    $c = array_count_values($a);
    // DISPLAY EACH WORD AND ITS NUMBER OF OCCURRENCES
    foreach($c as $w=>$o) echo $w . " : " . $o . "\n";
?>
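
One caveat on the script above: without the /u modifier, \w and \W work byte by byte, so accented or non-Latin words can be split apart. A minimal sketch of a Unicode-aware variant follows. It assumes UTF-8 input and the mbstring extension, and count_words_utf8() is a hypothetical helper name, not something from this thread.

```php
<?php
// Sketch only: count_words_utf8() is a hypothetical helper, assuming UTF-8 input.
function count_words_utf8(string $text, int $minLength = 3): array
{
    // \p{L} matches any Unicode letter; /u makes PCRE treat the input as UTF-8.
    // A result of 0 (no words) or false (invalid UTF-8) yields an empty list.
    if (!preg_match_all('/\p{L}{' . $minLength . ',}/u', $text, $matches)) {
        return array();
    }

    // Fold case so "Café" and "café" are counted together.
    $words = array_map(function ($w) {
        return mb_strtolower($w, 'UTF-8');
    }, $matches[0]);

    $counts = array_count_values($words);
    arsort($counts, SORT_NUMERIC); // most frequent first
    return $counts;
}

print_r(count_words_utf8("Café café naïve PHP"));
```

The default minimum length of 3 mirrors the "remove words with one or two letters" step above; pass 4 to keep only words longer than 3 characters.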

VanHackman commented:

@hielo: I don't agree with you at all, but anyway your method will work too.

Ray Paseur commented:
You can install this and run it.  Please post back if you have any questions.  

Best wishes for the new year, ~Ray
<?php // RAY_word_count_web_page.php
error_reporting(E_ALL);

// COUNT INSTANCES OF ALL WORDS IN A WEB PAGE - CASE INSENSITIVE (SEE LINE 16)

// TEST DATA - PUT YOUR URL HERE
$url = 'http://www.nationalpres.org/sermon.php?d=2009-10-04';

// ACQUIRE THE CONTENT FROM THE WEB PAGE
$htm = file_get_contents($url);

// REMOVE THE NON-TEXT AND NON-WORD PORTIONS AND TIDY THE INPUT
$txt = strip_tags($htm);
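$str = preg_replace('/[^A-Z\s]/i', '', $txt); // KEEP LETTERS AND ALL WHITESPACE, SO WORDS ON ADJACENT LINES ARE NOT GLUED TOGETHER
$str = preg_replace('/\s+/', ' ', $str);      // EVERY RUN OF WHITESPACE (INCLUDING NEWLINES) BECOMES ONE SPACE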
$str = trim(strtoupper($str)); // ALL UPPER CASE == CASE-INSENSITIVE

// HOW MANY WORDS?  SEE HERE: http://us.php.net/manual/en/function.str-word-count.php
$num = str_word_count($str);

// EXPLODE THE STRING OF WORDS INTO AN ARRAY
$arr = explode(' ', $str);

// REMOVE THE WORDS THAT ARE THREE OR FEWER LETTERS
$ign = 0;
foreach ($arr as $ptr => $wrd)
{
    if (strlen($wrd) > 3) continue;
    unset($arr[$ptr]);
    $ign++;
}

// SORT AND REMOVE DUPLICATES - See http://us.php.net/manual/en/array.sorting.php
natcasesort($arr);
$unq = array_unique($arr);

// THE NUMBER OF UNIQUE WORDS
$uum = count($unq);

// COUNT THE INSTANCES OF EACH WORD
$cnt = array();
foreach ($arr as $wrd)
{
    if (empty($cnt[$wrd])) $cnt[$wrd] = 0;
    $cnt[$wrd]++;
}

// ACTIVATE THIS TO SEE THE GROSS COUNTS OF EACH WORD - ALPHABETICAL ORDER
// var_dump($cnt);

// SORT BY FREQUENCY
arsort($cnt);

// REPORT THE RESULTS
echo "<br/>WE PROCESSED $url \n";
echo "<br/>THERE ARE $num WORDS";
echo "<br/>WE REMOVED $ign SHORT WORDS";
echo "<br/>WE KEPT $uum UNIQUE WORDS \n";
$kount = 10;
echo "<br/>THE TEN MOST FREQUENTLY USED WORDS ARE:\n";
foreach ($cnt as $wrd => $nbr)
{
    echo "<br/>$wrd OCCURS $nbr TIMES\n";
    $kount--;
    if (!$kount) break;
}
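
A note on the tidy-up step in the script above: strip_tags() removes the tags but keeps the text inside script and style elements, and it can glue together words from adjacent elements. The sketch below shows one possible way to prepare the text first. html_to_text() is a hypothetical helper and the sample markup is made up for illustration; a regex-based cleanup like this is an approximation, not a full HTML parser.

```php
<?php
// Sketch only: html_to_text() is a hypothetical helper, not part of Ray's script.
function html_to_text(string $html): string
{
    // Drop <script> and <style> elements together with their contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html) ?? $html;

    // Put a space in front of every tag so "<p>one</p><p>two</p>" does not become "onetwo".
    $html = str_replace('<', ' <', $html);

    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Made-up sample page for illustration.
$sample = '<html><head><style>p { color: red; }</style></head>'
        . '<body><p>PHP is powerful</p><p>Try PHP</p>'
        . '<script>var language = "js";</script></body></html>';

echo html_to_text($sample), "\n"; // prints "PHP is powerful Try PHP"
```

The result can then be fed to whichever counting method you prefer in place of the plain strip_tags() output.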

hielo commented:
>>@hielo: I don't agree with you at all,
No problem. However, it seems to me you don't understand what substr_count() does, and what you suggested will not work. Here is what you suggested, followed by the output:


<?php
$Text=<<<END
About PHP
PHP is a powerful programming language and also widely used language
in the world. PHP version 5 is currently in use version 6 is about to release.
Try PHP language.
END;

$Words = str_word_count($Text, 1);
$Words = array_unique($Words) ;
foreach($Words as $Word){

$Occurrences = substr_count($Text,$Word);

echo "\nThe Word ".$Word." is ".$Occurrences." times in the text.";
}
?>

The Word About is 1 times in the text.
The Word PHP is 4 times in the text.
The Word is is 3 times in the text.
The Word a is 12 times in the text.
The Word powerful is 1 times in the text.
The Word programming is 1 times in the text.
The Word language is 3 times in the text.
The Word and is 1 times in the text.
The Word also is 1 times in the text.
The Word widely is 1 times in the text.
The Word used is 1 times in the text.
The Word in is 3 times in the text.
The Word the is 1 times in the text.
The Word world is 1 times in the text.
The Word version is 2 times in the text.
The Word currently is 1 times in the text.
The Word use is 2 times in the text.
The Word about is 1 times in the text.
The Word to is 1 times in the text.
The Word release is 1 times in the text.
The Word Try is 1 times in the text.

Look at the fourth line. "a" does not appear 12 times. The word "in" does not appear 3 times either.
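
To make the difference concrete, here is a small sketch that counts whole words with a regular expression instead of substrings. count_whole_word() is a hypothetical helper name introduced only for this illustration; the text is the sample from the question.

```php
<?php
// Sketch only: count_whole_word() is a hypothetical helper name.
function count_whole_word(string $text, string $word): int
{
    // \b is a word boundary, so "in" no longer matches inside "programming".
    // preg_quote() escapes any regex metacharacters in $word.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';
    return (int) preg_match_all($pattern, $text);
}

$text = "About PHP\n"
      . "PHP is a powerful programming language and also widely used language\n"
      . "in the world. PHP version 5 is currently in use version 6 is about to release.\n"
      . "Try PHP language.";

echo substr_count($text, 'in'), "\n";     // 3: includes the "in" inside "programming"
echo count_whole_word($text, 'in'), "\n"; // 2: only the standalone word
```

This still runs one scan of the text per word, so for a full frequency table it is usually simpler to split the text into words once and count the resulting array.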

Ray Paseur commented:
Regarding this: "(Only care about  words longer than 3 characters)"

Please note that the string, "PHP" is not longer than 3 characters

 ;-)

VanHackman commented:
@hielo:

"what you suggested will not work"

No way. I didn't worry about "a" and "in" because kingroland said:

"(Only care about  words longer than 3 characters)"

Anyway, if you consider that a mistake, just change this line:

$Occurrences = substr_count($Text,$Word);

to:

$Occurrences = substr_count($Text, $Word.' ');

and add:

$search  = array('.', ',', ';');
$replace = array(' ', ' ', ' ');
$Text  = str_replace($search,$replace,$Text);

before that line.

And by the way, I think that the code using my method is pretty much simpler than yours. =)

As you can see the code will work perfectly:




<?php
$Text=<<<END
About PHP
PHP is a powerful programming language and also widely used language
in the world. PHP version 5 is currently in use version 6 is about to release.
Try PHP language.
END;

$search  = array('.', ',', ';');
$replace = array(' ', ' ', ' ');
$Text  = str_replace($search,$replace,$Text);

$Words = str_word_count($Text, 1);
$Words = array_unique($Words) ;
foreach($Words as $Word){

$Occurrences = substr_count($Text,$Word.' ');

echo "\nThe Word ".$Word." is ".$Occurrences." times in the text.".'</br>';
}
?>

The Word About is 1 times in the text.
The Word PHP is 3 times in the text.
The Word is is 3 times in the text.
The Word a is 1 times in the text.
The Word powerful is 1 times in the text.
The Word programming is 1 times in the text.
The Word language is 2 times in the text.
The Word and is 1 times in the text.
The Word also is 1 times in the text.
The Word widely is 1 times in the text.
The Word used is 1 times in the text.
The Word in is 2 times in the text.
The Word the is 1 times in the text.
The Word world is 1 times in the text.
The Word version is 2 times in the text.
The Word currently is 1 times in the text.
The Word use is 1 times in the text.
The Word about is 1 times in the text.
The Word to is 1 times in the text.
The Word release is 1 times in the text.
The Word Try is 1 times in the text

hielo commented:
>> because kingroland said:
>>"(Only care about  words longer than 3 characters)"
The "a" and the "in" cases were not working because of the problem in your logic, not because they were less than 3 chars. As a matter of fact try your updated code with a slightly modified input:
$Text=<<<END
About PHP current power
The nation that produces the most contamination 
PHP is a powerful programming language and also widely used language
in the world. PHP version 5 is currently in use version 6 is about to release.
Try PHP language.
END;

$search  = array('.', ',', ';');
$replace = array(' ', ' ', ' ');
$Text  = str_replace($search,$replace,$Text);

$Words = str_word_count($Text, 1);
$Words = array_unique($Words) ;
foreach($Words as $Word){

$Occurrences = substr_count($Text,$Word.' ');

echo "\nThe Word ".$Word." is ".$Occurrences." times in the text.".'</br>';
}
exit;


Result:
...
The Word power is 0 times in the text.
...
The Word nation is 2 times in the text.
...


>>And by the way, I think that the code using my method is pretty much simpler than yours. =)
In my opinion, this is not useful if it doesn't work.

>>As you can see the code will work perfectly:
No Comment
:)

VanHackman commented:

Obviously I didn't consider chr(10)chr(13) in a "text block on a webpage". ¬¬

"because of the problem in your logic"  There is no problem in my logic:

The Word power is 1 times in the text.
...
The Word nation is 1 times in the text...
...

As you can see... the code using my method remains simpler than yours. =)



<?php

$Text=<<<END
About PHP current power
The nation that produces the most contamination 
PHP is a powerful programming language and also widely used language
in the world. PHP version 5 is currently in use version 6 is about to release.
Try PHP language.
END;

$search  = array('.', ',', ';', chr(10), chr(13));
$replace = array(' ', ' ', ' ', ' ', ' ');
$Text  = ' '.str_replace($search,$replace,$Text);

$Words = str_word_count($Text, 1);
$Words = array_unique($Words) ;
foreach($Words as $Word){

	$Occurrences = substr_count($Text,' '.$Word.' ');
	echo "\nThe Word ".$Word." is ".$Occurrences." times in the text.".'</br>';
}

?>
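
One more case worth checking with the space-padded needle: substr_count() does not count overlapping matches, so when the same word occurs twice in a row the two copies share a single separating space and only one of them is found. The sketch below is an illustration under that assumption; it applies the same normalisation as the script above to the original sample from the question, where "PHP" ends the first line and starts the second.

```php
<?php
// Sketch only: demonstrates the adjacent-duplicate case on the question's sample text.
$text = "About PHP\n"
      . "PHP is a powerful programming language and also widely used language\n"
      . "in the world. PHP version 5 is currently in use version 6 is about to release.\n"
      . "Try PHP language.";

// Same normalisation as the script above: punctuation and line breaks become spaces.
$search = array('.', ',', ';', chr(10), chr(13));
$padded = ' ' . str_replace($search, ' ', $text);

// "About PHP PHP is": the first match " PHP " consumes the space the second copy needs.
$bySubstring = substr_count($padded, ' PHP ');

// Counting the tokens themselves does not depend on the separators.
$byTokens = array_count_values(str_word_count($text, 1))['PHP'];

echo "substr_count: $bySubstring, array_count_values: $byTokens\n";
// prints "substr_count: 3, array_count_values: 4"
```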

hielo commented:
=>chr(10), chr(13)
=>' '.$Word.' '
and little by little you keep hacking it more and more
:)

VanHackman commented:

'and little by little you keep hacking it more and more'

Mmmm... I don't think so. I think that my last code solves the problem in every way
(simpler, faster and using only built-in functions... =P )

I had to modify it twice, because I didn't try to provide a script the first time.
(MSG ID = ID 26141775 )

The first modification was to the code that you wrote, to show you what I had in mind.
The second modification solves the problem with chr(10) and chr(13), a tricky possibility that I didn't consider.

Anyway, my code works perfectly now and your code works too, so kingroland must choose
which one is better.

;)


hielo commented:
>>and using only built-in functions... =P
As opposed to what? The optional sorting function that I added? :)

and as for simplicity, care for a one-liner?
print_r( $word=array_count_values(str_word_count($str, 1)) );
Oh goodness, no hackery involved!
... =P
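
For completeness, here is a rough sketch of how the one-liner could be extended to match what the question asked for: case-insensitive counts, only words longer than 3 characters, most frequent first, printed as "word = count". count_long_words() and its $minLength default are assumptions made for this illustration. Note that with this filter "PHP" itself is dropped, as Ray pointed out.

```php
<?php
// Sketch only: count_long_words() and its default are assumptions for illustration.
function count_long_words(string $text, int $minLength = 4): array
{
    // Lower-case first so "About" and "about" are one word (ASCII text assumed).
    $words = str_word_count(strtolower($text), 1);

    // Keep only words with at least $minLength letters ("longer than 3 characters").
    $words = array_filter($words, function ($w) use ($minLength) {
        return strlen($w) >= $minLength;
    });

    $counts = array_count_values($words);
    arsort($counts, SORT_NUMERIC); // highest count first
    return $counts;
}

$str = "About PHP\n"
     . "PHP is a powerful programming language and also widely used language\n"
     . "in the world. PHP version 5 is currently in use version 6 is about to release.\n"
     . "Try PHP language.";

foreach (count_long_words($str) as $word => $count) {
    echo "$word = $count\n";
}
```

str_word_count() treats apostrophes and hyphens as part of a word, so "don't" stays a single word here.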

VanHackman commented:

print_r( $word=array_count_values(str_word_count($str, 1)) );

Great method. Looks like you have spent some time dusting off the PHP manual... jaja. =P
Anyway, good job! It is sooooo much better than your initial script, simple and efficient.

hielo commented:
>>Looks like you have spent some time dusting off the PHP manual
Not the manual - this post ( ID: 26141963 ). I noticed it earlier but then got engaged in our conversation... jaja. =P

Take care.

kingroland (author) commented:
To
All Mr. Expert

Thank you so much for your advice.
I have tested all the scripts and all of your code works, and I am so impressed
by your talent and skills in programming. I give you full credit for your good work.
I wish you more success in your careers.



kingroland (author) commented:
I accept all the solutions from all 4 experts because all the code proved to be working. I appreciate their advice and honor them for their talent and skills. Thank you all.