
How can I extract, count, and compare sentences from a text document?

befidled asked on
PHP
I'm working on a script that needs to load a text document, extract each sentence into an array, and then tabulate how many times each sentence occurs. For instance, if the following were my text:

This is test sentence number one. This is test sentence number two. Is this a test sentence? This is the final test sentence before repeating a few. This is test sentence number one. This is test sentence number two. This is test sentence number one.

I'd have the following results:

SENTENCE (# OF OCCURRENCES)
This is test sentence number one. (3)
This is test sentence number two. (2)
Is this a test sentence? (1)
This is the final test sentence before repeating a few. (1)

Most sentences end with a period (.), but some end with a ? or even a ", which complicates the issue. A period can also appear in the body of a sentence, as in the following:

I like asking questions at Experts-Exchange.com.

I'd like a solution that handles each of those cases well, if possible. I've included some sample code as well.

<?php
 
//COMMENTING OUT FOR THE SAKE OF TESTING AND EXPLICITLY DECLARING STRING BELOW
//$filename = "test.txt";
//$handle = fopen($filename, "r");
//$file = fread($handle, filesize($filename));
 
//$contents = explode("\n", $file);
 
$string = 'This is test sentence number one. This is test sentence number two. Is this a test sentence? This is the final test sentence before repeating a few. This is test sentence number one. This is test sentence number two. This is test sentence number one. ';
 
$string = preg_replace("/<([^>]+)>/i", '', $string); //Delete all tags in the string
 
$results = array(); //Initialize an array to hold our results
 
$results = explode('. ', $string);
$resultcount = count($results);
 
//THIS WAS CODE I FOUND THAT WOULD DO WHAT I WANT TO DO ON A WORD BY WORD LEVEL BUT I COULDN'T GET IT TO WORK ON SENTENCES
//while(preg_match('/[A-Z].*?[.!?]((?=\s[A-Z])|$)/i', $string, $matches)) //While we can match any words with a-z or a ' in them do the following ($matches holds the item we have matched)
//{    $string = preg_replace("/\b{$matches[0]}\b/i", '', $string, -1, $count); //Remove any instances of the word we just matched, so we don't keep continually matching the same word. $count holds how many replacements of the word occur.
//    $results[$matches[0]] = $count; //Update the results array so the key is the word matched and the value is the number of times that word occurred
//}
 
echo $results[4]."<br>"; //Print the results
echo $resultcount."<br><br>";
 
print_r($results);
 
?>
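
One possible approach, offered as a minimal sketch and not as the accepted answer from this thread: treat a sentence boundary as whitespace that follows a ., ! or ? (optionally followed by a closing quote), split there with preg_split, and tally the pieces with array_count_values. The function name countSentences is hypothetical. Because the period in Experts-Exchange.com is not followed by whitespace, it does not end the sentence. Abbreviations such as "Dr. Smith" would still be split in the wrong place, so this is a starting point only.

```php
<?php
/**
 * Split text into sentences and count how often each one occurs.
 * Hypothetical helper: assumes a sentence ends with ., ! or ?
 * (optionally followed by a closing quote) and then whitespace or end of text.
 */
function countSentences(string $text): array
{
    // Remove tags, then collapse newlines and repeated spaces into single spaces.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($text)));
    if ($text === '') {
        return [];
    }

    // Split on whitespace preceded by a terminator, or by a terminator plus a quote.
    // The terminator stays attached to its sentence because lookbehinds consume nothing.
    $pattern = '/(?:(?<=[.!?])|(?<=[.!?]["\']))\s+/';
    $sentences = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Keys are sentences, values are counts, in order of first appearance.
    return array_count_values($sentences);
}

$string = 'This is test sentence number one. This is test sentence number two. '
    . 'Is this a test sentence? This is the final test sentence before repeating a few. '
    . 'This is test sentence number one. This is test sentence number two. '
    . 'This is test sentence number one. ';

foreach (countSentences($string) as $sentence => $count) {
    echo $sentence . ' (' . $count . ')' . PHP_EOL;
}
// Prints:
// This is test sentence number one. (3)
// This is test sentence number two. (2)
// Is this a test sentence? (1)
// This is the final test sentence before repeating a few. (1)
```

To run this against a file instead of a literal string, the text could be loaded with file_get_contents('test.txt') and passed to the same function. Sentences that differ only in case or internal punctuation are counted separately here; normalizing them first would be a further design choice.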