Method to find most used words in PHP

I'm trying to figure out the most efficient method to:

1. parse through the "description" field
2. break out each word
3. count how many times each word is used
4. return a list ordered from highest to lowest count, including only words longer than 3 characters.

Dataset attached (MySQL database).
Nathan Riley (Founder) asked:

Ray Paseur commented:
"Efficient" does not really matter with small data sets.  If you have several thousand lines of "description" to process every second you might consider tuning this script.
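One way to check whether tuning is worth the effort is to time the counting step on a data set of realistic size. The sketch below is only an illustration: the 5,000-row sample is synthetic, and the tokenizer is a simplified stand-in rather than the sanitize logic used elsewhere in this thread.

```php
<?php
// Synthetic data set (an assumption): 5,000 copies of one sample description.
$rows = array_fill(0, 5000, 'Put your ear close to experience a symphony of Pepper.');

$start = microtime(true);

$all = [];
foreach ($rows as $row) {
    // Simplified tokenizer for timing purposes only: split on non-alphanumerics.
    foreach (preg_split('/[^A-Za-z0-9]+/', $row, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        // Keep only words longer than 3 characters.
        if (strlen($word) > 3) $all[] = $word;
    }
}

// Count occurrences and sort by count, highest first.
$cnt = array_count_values($all);
arsort($cnt);

$elapsed = microtime(true) - $start;
printf("%d rows counted in %.4f seconds\n", count($rows), $elapsed);
```

If the reported time is far below the interval at which new rows arrive, the straightforward approach is probably good enough.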

As with all things related to text processing, the devil is in the details.  I tried to make some sense of the "sanitize" process here, but it may not be exactly what you want.  Questions to ask yourself include (1) does case sensitivity count, (2) what character set am I getting, (3) do I need a list of stop-words, (4) do I care about words that only appear once.  As you develop this application you will find other questions that must be answered along the way.

I could not use the Excel spreadsheet directly, so I copied its contents into an array.  This is as close as I could get to simulating the data base query results set.

Please see:

<?php // demo/temp_nathan_riley.php

echo '<pre>';

// Sample rows copied from the spreadsheet to simulate the query results set
$dat = array
( 'Don\'t judge a Pepper by its cover. ?#?SecretStash?'
, 'Happy Fourth.'
, 'Wake up and smell the Pepper.'
, 'Go for a 23 flavor taste bud ride.'
, 'Put your ear close to experience a symphony of Pepper.'
, 'Haven\'t had a Pepper all day? Deploy the 20oz.'
, 'Think you have lip sync skills to rival Lycia Faith? Post your version of "Who Let the Dogs Out" using ?#?OneofaKindLipSync? and ?#?contest? to Twitter, Instagram, or YouTube for a chance of being featured on TV when Season One of Lip Sync Battle returns.'
, 'Relax, you\'ve got Pepper by your side.'
, 'Because any float without Pepper would be too expected.'
, 'The classics never go out of style.'
);

$all = [];

foreach ($dat as $str)
{
    // Sanitize: drop apostrophes and sentence-ending periods, then replace
    // anything that is not a letter, digit, or period with a space
    $str = str_replace("'", '', $str);
    $str = str_replace(". ", ' ', $str);
    $str = trim($str);
    $str = preg_replace('/\.$/',         ' ', $str);
    // Assumed intent of the garbled pattern: a period followed by whitespace
    $str = preg_replace('/\.(?=\s)/',    ' ', $str);
    $str = preg_replace('/[^A-Z0-9.]/i', ' ', $str);
    $str = preg_replace('/\s\s+/',       ' ', $str);
    $str = trim($str);

    // Collect the words from every row into one array
    $arr = explode(' ', $str);
    $all = array_merge($all, $arr);
}

// Discard words of three characters or fewer
foreach ($all as $key => $str)
{
    if (strlen($str) <= 3) unset($all[$key]);
}

// Count occurrences, then sort by count from highest to lowest
$cnt = array_count_values($all);
arsort($cnt);
print_r($cnt);
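The questions raised earlier (case sensitivity, stop-words, words that appear only once) could be handled along the following lines. This is a minimal sketch, not part of the script above: the `count_words()` helper, its parameters, and the stop-word list are all hypothetical, the case folding is ASCII-only, and the arrow function assumes PHP 7.4 or later.

```php
<?php
// Hypothetical helper (an assumption, not from the thread): counts words with
// explicit choices for case folding, stop-words, and rarely used words.
function count_words(array $rows, array $stopWords = [], int $minLength = 4, int $minCount = 1): array
{
    $counts = [];
    foreach ($rows as $row) {
        // Fold case so "Pepper" and "pepper" are counted together (ASCII only).
        $row = strtolower($row);
        // Drop apostrophes so "you've" becomes "youve" rather than "you" and "ve".
        $row = str_replace("'", '', $row);
        // Split on any run of characters that is not a letter or digit.
        $words = preg_split('/[^a-z0-9]+/', $row, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (strlen($word) < $minLength || in_array($word, $stopWords, true)) {
                continue;
            }
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    // Discard words seen fewer than $minCount times, then sort highest first.
    $counts = array_filter($counts, fn ($n) => $n >= $minCount);
    arsort($counts);
    return $counts;
}

$rows = ['Wake up and smell the Pepper.', "Relax, you've got Pepper by your side.", 'Happy Fourth.'];
$top  = count_words($rows, ['your', 'with'], 4, 2);
print_r($top);
```

With these three sample rows, a minimum length of 4, and a minimum count of 2, only "pepper" survives, with a count of 2.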

Nathan Riley (Founder, Author) commented:
Thanks a bunch Ray, this really helps.  It also makes sense that there are other areas I may want to limit or clean up, but for now this is a great base to start with.

So the end result looks like an array, and it's already sorted. Does that mean I'd just run a final foreach loop to dump the name of each word?
Nathan Riley (Founder, Author) commented:
Hmm, still looking through:

foreach($cnt as $tag){
    echo $tag;
}

This gives me the numerical count, but I'm not sure how to get the tag name.
Ray Paseur commented:
Yes, foreach() makes sense.  The word itself is the key and the value is the count of occurrences.  Try something like this instead of print_r() at the end of the script.
foreach ($cnt as $word => $count)
{
    echo PHP_EOL . "$word OCCURS $count TIME";
    if ($count != 1) echo 'S';
}

Nathan Riley (Founder, Author) commented:
Ah, here we go.

foreach($cnt as $key => $value){
    echo $key;
}

Nathan Riley (Founder, Author) commented:
Great thanks, one last tweak.  Is there any way to limit the loop to only show the top 20?
Ray Paseur commented:
Sure, there are two ways.  You can truncate the array or just add a counter to the loop and stop when you get to 20.  It just depends on whether you want to use the elements past #20.  If they're expendable, you can just slice off the top of the array and keep that.

array_slice() will do it.  Be sure to set preserve_keys to true in the argument list.
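Both approaches can be sketched as follows. The sample counts are invented for illustration, including the numeric word 2015, which is there to show why preserve_keys matters. The limit is set to 3 instead of 20 to keep the sample short.

```php
<?php
// Invented sample counts, already sorted highest first as arsort() leaves them.
$cnt = ['Pepper' => 6, 2015 => 3, 'your' => 3, 'Sync' => 2, 'float' => 1];
$limit = 3; // the thread asks for 20; 3 keeps the sample short

// Option 1: truncate the array. The fourth argument is preserve_keys.
// Without it, integer keys such as 2015 would be renumbered from 0.
$top = array_slice($cnt, 0, $limit, true);

// Option 2: keep the full array and stop the loop with a counter.
$shown = [];
foreach ($cnt as $word => $count) {
    if (count($shown) >= $limit) break;
    $shown[$word] = $count;
}

foreach ($top as $word => $count) {
    echo "$word: $count", PHP_EOL;
}
```

String keys survive array_slice() either way, so preserve_keys only makes a difference when a counted "word" is purely numeric. Setting it to true is still the safer habit.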
Nathan Riley (Founder, Author) commented:
Got it thanks and have a good weekend.