Nura111

PHP function question about preg_match_all

Hi, I'm using preg_match_all to find the number of occurrences of each word in a text and saving the counts in $results.
For some reason, the results also contain a word as a key with a count of 0.

What am I missing here?

preg_match_all("/\b$word\b/", $text, $matches,PREG_PATTERN_ORDER);
$results[$word] = count($matches[0]);
Terry Woods

$text = "Hello there there is a dog over there.";
$word = "there";
preg_match_all("/\b$word\b/", $text, $matches,PREG_PATTERN_ORDER);
$results[$word] = count($matches[0]);
print_r($matches);
print_r($results);

Output:

Array
(
    [0] => Array
        (
            [0] => there
            [1] => there
            [2] => there
        )

)
Array
(
    [there] => 3
)

Looks ok to me - are you sure you're handling the results correctly?
ps: Unless $word is already sanitised, you should escape special characters like this:
preg_match_all("/\b".preg_quote($word)."\b/", $text, $matches,PREG_PATTERN_ORDER);
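
To illustrate why the escaping matters, here is a minimal sketch. The sample text and word are hypothetical, not from the question. It also passes "/" as the second argument to preg_quote, which is an addition to the line above: it makes preg_quote escape the pattern delimiter as well.

```php
<?php
// Hypothetical sample data, not taken from the thread.
$text = "Version 3.5 replaces 305 and 3x5.";
$word = "3.5";

// Unescaped: "." matches any character, so "305" and "3x5" are counted too.
$unescaped = preg_match_all('/\b' . $word . '\b/', $text, $m1);

// Escaped: preg_quote() turns "." into "\.", so only the literal "3.5" matches.
// The second argument additionally escapes the "/" delimiter.
$escaped = preg_match_all('/\b' . preg_quote($word, '/') . '\b/', $text, $m2);

echo "unescaped: $unescaped, escaped: $escaped\n"; // unescaped: 3, escaped: 1
```

For words made only of letters the two patterns behave the same, so this only matters once $word can contain characters such as ".", "+", "?" or "/".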
Nura111

ASKER

I've attached the code I'm using.

What does this do?
preg_match_all("/\b".preg_quote($word)."\b/", $text, $matches,PREG_PATTERN_ORDER);


$results = array();
$words = str_word_count($text, 1);
// print_r($words);
$words = array_unique($words); // not strictly needed, but kept in case of future changes
foreach ($words as $word) {
	preg_match_all("/\b$word\b/", $text, $matches, PREG_PATTERN_ORDER);
	$results[$word] = count($matches[0]);
}
arsort($results); // descending order by number of occurrences


Nura111

ASKER

For the following text, I'm getting the word "a" reported as occurring 0 times when I print the results:


foreach ($results as $k => $v) {
	$i += 1;
	$resultStr .= "The Word ".$k."-".$v." times in the text."."\n";
}
echo $resultStr;


Nura111

ASKER

These are the results from:
print_r($matches);
print_r($results);


How can it show [a] => 0?
)
Array
(
    [win] => 4
    [dream] => 1
    [car] => 4
    [support] => 2
    [students] => 1
    [schools] => 5
    [need] => 2
    [chance] => 1
    [following] => 2
    [cars] => 1
    [brand] => 4
    [new] => 3
    [range] => 1
    [rover] => 1
    [sport] => 1
    [chevrolet] => 1
    [camaro] => 1
    [audi] => 1
    [a] => 0
    [http] => 2
    [www] => 2
    [softwarecharity] => 2
    [org] => 3
    [raffles] => 3
    [major] => 1
    [fundraiser] => 1
    [our] => 5
    [charity] => 1
    [software] => 2
    [program] => 1
    [proceeds] => 1
    [help] => 1
    [many] => 1
    [great] => 1
    [who] => 1
    [their] => 3
    [own] => 1
    [would] => 1
    [not] => 3
    [able] => 1
    [afford] => 2
    [type] => 1
    [technology] => 1
    [international] => 1
    [all] => 6
    [countries] => 1
    [invited] => 1
    [participate] => 1
    [responsibility] => 2
    [comply] => 1
    [laws] => 2
    [area] => 1
    [supporting] => 1
    [cause] => 1
    [helping] => 3
    [children] => 1
    [throughout] => 1
    [world] => 1
    [increasing] => 1
    [literacy] => 1
    [programs] => 1
    [purchasing] => 2
    [raffle] => 15
    [tickets] => 6
    [continue] => 1
    [goal] => 1
    [certified] => 1
    [law] => 1
    [offices] => 1
    [kelly] => 1
    [g] => 1
    [rogers] => 1
    [quorum] => 1
    [dr] => 1
    [ste] => 1
    [dallas] => 1
    [official] => 3
    [rules] => 5
    [regulations] => 6
    [purpose] => 1
    [benefit] => 1
    [cannot] => 1
    [high] => 1
    [expense] => 1
    [learning] => 1
    [set] => 2
    [forth] => 2
    [below] => 1
    [ticket] => 2
    [agree] => 3
    [bound] => 1
    [these] => 2
    [ais] => 11
    [integral] => 10
    [charitable] => 10
    [foundation] => 10
    [interpretation] => 1
    [application] => 1
    [shall] => 2
    [final] => 1
    [must] => 1
    [years] => 1
    [old] => 1
    [purchase] => 2
    [prize] => 8
    [employees] => 3
    [directors] => 2
    [any] => 18
    [subsidiaries] => 1
    [eligible] => 1
    [digital] => 1
    [emailed] => 1
    [purchaser] => 1
    [no] => 3
    [more] => 1
    [sold] => 2
    [than] => 2
    [number] => 1
    [listed] => 1
    [page] => 1
    [drawn] => 1
    [random] => 2
    [using] => 1
    [winners] => 2
    [assume] => 1
    [local] => 1
    [state] => 2
    [federal] => 1
    [taxes] => 1
    [fees] => 1
    [incidental] => 2
    [expenses] => 1
    [where] => 1
    [applicable] => 1
    [may] => 1
    [required] => 1
    [execute] => 1
    [affidavit] => 1
    [eligibility] => 1
    [publicity] => 1
    [release] => 1
    [permitting] => 1
    [use] => 4
    [name] => 1
    [photograph] => 1
    [likeness] => 1
    [voice] => 1
    [promotional] => 1
    [purposes] => 1
    [media] => 1
    [agents] => 2
    [representatives] => 1
    [responsible] => 1
    [injuries] => 3
    [losses] => 3
    [damages] => 5
    [kind] => 3
    [arising] => 3
    [connection] => 2
    [result] => 2
    [winner] => 3
    [acceptance] => 3
    [nonuse] => 1
    [entering] => 1
    [each] => 2
    [participant] => 2
    [officers] => 1
    [from] => 4
    [liability] => 1
    [caused] => 1
    [resulting] => 1
    [possession] => 1
    [misuse] => 1
    [agrees] => 1
    [indemnify] => 1
    [hold] => 1
    [harmless] => 1
    [rights] => 1
    [claims] => 1
    [actions] => 1
    [there] => 1
    [representations] => 2
    [warranties] => 2
    [other] => 2
    [disclaims] => 1
    [express] => 1
    [implied] => 1
    [regarding] => 1
    [sole] => 1
    [exclusive] => 1
    [remedy] => 1
    [breach] => 1
    [limited] => 1
    [return] => 1
    [price] => 1
    [paid] => 1
    [his] => 1
    [event] => 1
    [liable] => 1
    [party] => 1
    [loss] => 1
    [earnings] => 1
    [profits] => 1
    [goodwill] => 1
    [special] => 1
    [punitive] => 1
    [consequential] => 1
    [person] => 1
    [entity] => 1
    [whether] => 1
    [contract] => 1
    [tort] => 1
    [otherwise] => 1
    [even] => 1
    [advised] => 1
    [possibility] => 1
    [such] => 1
    [take] => 2
    [country] => 1
    [reside] => 1
    [reserves] => 1
    [right] => 1
    [postpone] => 1
    [until] => 1
    [delivery] => 1
    [closest] => 1
    [authorized] => 1
    [auto] => 1
    [dealer] => 1
    [won] => 1

Nura111

ASKER

Something in this function is causing it, but I don't know what. I'm trying to clean common words and HTML tags out of the text.
function extractContent($text){
	$html=strip_tags($text);
	$commonWords = array('is','that','them','and','he','the','-','of','to','for','were','was','--','in','at','as','a','an','on','by','or','it',
	'us','be','her','me','we','will','so','she','i','this','has','have','off','been','nbsp','s','\'s','you','my','don\'t','can','your','won\'t','are','if','what','with','but','its');
	
	$text = strtolower($html);
	// Remove every common word that appears as a whole word
	$text = preg_replace('/\b('.implode('|',$commonWords).')\b/','',$text);
	return $text;
}


ASKER CERTIFIED SOLUTION
Terry Woods
If you copy and paste the text you ran the code on, it should be fairly clear where the "a" came from I think.
Nura111

ASKER

I don't understand how I can get a result saying that "a" appears 0 times when (a) "a" was cleaned from the text, so it should not be in the text at all, and (b) preg_match_all is not supposed to return words that occur 0 times in the text.
preg_match_all didn't return it. It was already in the $words array as a result of the str_word_count function, then you did this:

foreach($words as $word){
            preg_match_all("/\b$word\b/", $text, $matches,PREG_PATTERN_ORDER);
            $results[$word] = count($matches[0]);
      }

which searched for it, didn't find it, and put a count of 0 in the result. You could easily work around the issue by changing the code to:

foreach($words as $word){
            preg_match_all("/\b$word\b/", $text, $matches,PREG_PATTERN_ORDER);
            if (count($matches[0])>0) $results[$word] = count($matches[0]);
      }
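
As a self-contained sketch of that workaround: the text and word list below are hypothetical, and the word list is hard-coded so that it contains a word ("a") that the pattern cannot find in the text. preg_match_all returns the number of full matches, so its return value is used directly instead of count($matches[0]).

```php
<?php
// Hypothetical data: "a" is in the word list but never occurs as a whole word.
$text  = "raffle tickets raffle";
$words = array("raffle", "tickets", "a");

$results = array();
foreach ($words as $word) {
    // preg_match_all() returns how many times the full pattern matched.
    $count = preg_match_all('/\b' . preg_quote($word, '/') . '\b/', $text, $matches);
    if ($count > 0) {
        // Only record words that were actually found.
        $results[$word] = $count;
    }
}
arsort($results); // highest count first

print_r($results); // contains raffle => 2 and tickets => 1, but no entry for "a"
```

This hides the zero entry rather than explaining it; the word list and the regex still disagree about what counts as a word.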
Nura111

ASKER

I'm really going crazy here.
When I try to test it using
preg_match_all("/\b".preg_quote($word)."\b/", $text, $matches,PREG_PATTERN_ORDER);
            $results[$word] = count($matches[0]);
            if (array_search('a',$matches[0]) !== false)
            {
               print_r($matches[0]);
              print_r($results);
            }

"a" is not even found (while other words are),
so I don't understand where it's coming from. This is the text:

Win Your Dream car!


Support Students and Schools in need and have a chance to win a the Following cars:


1.	Brand new Range Rover Sport!
2.	Brand New Chevrolet Camaro!
3.	Brand New Audi A4


http://www.softwarecharity.org/

Car Raffles are a major fundraiser for our Charity Software program. The proceeds help us support many great schools who on their own would not be able to afford this type of technology.
Our Raffles are International and all countries are invited to participate. It is your responsibility to comply with laws in your area.


By supporting our cause you are helping children throughout the world and increasing literacy programs in schools. By purchasing our Car Raffle tickets, you are helping us to continue our goal of helping schools in need.


Raffles are certified by the  Law Offices of Kelly G. Rogers
5050 Quorum Dr., Ste. 320, Dallas, U.S.A.

http://www.softwarecharity.org/


Raffle Official Rules and Regulations:

1. The purpose of this raffle is to benefit Schools that cannot afford the high expense of learning software. The official rules and regulations of the raffle are set forth below. By purchasing a raffle ticket, you agree to be bound by these rules and regulations. AIS Integral’s Charitable Foundation interpretation and application of the rules and regulations shall be final.
2. You must be 18 years old or to purchase tickets or win a prize. Employees and Directors of AIS or any of its subsidiaries are not eligible to win a prize.
3. Raffle tickets digital and will be emailed to the purchaser. 4. No more raffle tickets will be sold than the number listed on the raffle prize page.
5. Raffle tickets will be drawn at random using Random.org
6. Winners assume all local, state, and federal taxes, fees, and incidental expenses where applicable.
7. Winners may be required to execute an affidavit of eligibility and a publicity release permitting AIS Integral Charitable Foundation to use their name, photograph, likeness, and voice for promotional purposes in any media.
8. AIS Integral Charitable Foundation and their agents, representatives and employees are not responsible for any injuries, losses, or damages of any kind arising in connection with or as a result of the winner’s acceptance, use, or non-use of any prize. By entering the raffle, each participant AIS Integral Charitable Foundation, its directors, officers, employees and agents from any and all liability for injuries, losses or damages of any kind caused by any prize or resulting from acceptance, possession, use or misuse of any prize, and each winner agrees to indemnify and hold AIS Integral Charitable Foundation harmless from any and all losses, damages, rights, claims and actions of any kind arising in connection with or as a result of the winner’s acceptance or use of any prize.
9. There are no representations and warranties other than as set forth in these official rules and regulations AIS Integral Charitable Foundation disclaims all other representations and warranties express or implied, regarding the raffle. A raffle participant’s sole and exclusive remedy for any breach AIS Integral Charitable Foundation shall be limited to the return of the purchase price paid for his or her raffle ticket(s). In no event AIS Integral Charitable Foundation be liable to any party for any loss or injuries to earnings, profits or goodwill, or for any incidental, special, punitive or consequential damages of any person or entity whether arising in contract, tort or otherwise, even if AIS Integral Charitable Foundation has been advised of the possibility of such damages.
10. You agree and take responsibility for following the laws and regulations in the country and state you reside.
11. AIS Integral Charitable Foundation reserves the right to postpone any raffle until all tickets are sold for that raffle.
12. You agree to take delivery of the prize from the closest authorized auto dealer for the brand of car won.


Chances are, you've got something like "a2" in the text.

By the way, if you want to ignore the case of the words, you'll need to use the "i" pattern modifier:

            preg_match_all("/\b$word\b/i", $text, $matches,PREG_PATTERN_ORDER);
It's the "A4"
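
A minimal sketch of how "A4" produces the zero count, using one lowercased line from the posted text. str_word_count stops a word at the digit, so it yields "a", while the regex sees no word boundary between "a" and "4".

```php
<?php
// One line from the posted text, lowercased as extractContent() does.
$text = strtolower("Brand New Audi A4"); // "brand new audi a4"

// str_word_count() only accepts letters (plus ' and -) by default,
// so the digit ends the word and "a4" is reported as "a".
$words = str_word_count($text, 1);
print_r($words); // brand, new, audi, a

// For PCRE, digits are word characters, so there is no \b between "a" and "4"
// and the standalone word "a" is never found.
$count = preg_match_all('/\ba\b/', $text, $matches);
echo $count, "\n"; // 0
```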
Nura111

ASKER

So do you think it's OK that I'm using str_word_count if I add the condition?
> By the way, if you want to ignore the case of the words, you'll need to use the "i" pattern modifier

Actually, you are probably handling that some other way, as your results seem correct.
> So do you think it's OK that I'm using str_word_count if I add the condition?

It depends on how much it matters to be 100% correct, and what you define as "correct". eg Do you want "A4" to be treated as a word?
Nura111

ASKER

Doesn't str_word_count count only whole words? I don't see where in the text this "a" could have come from.
preg_* treats A4 as a single word (word characters are alphanumeric or _ )
str_word_count treats ' characters as word characters, but not numbers, unless you specify them to be included like this:

$words = str_word_count($text,1,'0123456789');

I don't know how str_word_count treats underscore characters - it would be easy to test though.
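
Here is a quick test along those lines. The sample string is hypothetical. As far as I can tell from the documented definition of a word (alphabetic characters plus ' and -), the underscore is not a word character for str_word_count, so treat the underscore result as something to confirm on your own PHP version.

```php
<?php
// Hypothetical sample containing a digit and an underscore.
$text = "brand new audi a4 foo_bar";

// Default behaviour: digits and underscores end a word.
$default = str_word_count($text, 1);

// With the digits passed as additional word characters, "a4" stays whole.
$withDigits = str_word_count($text, 1, '0123456789');

print_r($default);    // brand, new, audi, a, foo, bar
print_r($withDigits); // brand, new, audi, a4, foo, bar
```

If the underscore should also stay inside words, to line up with what \b treats as a word character, it can be added to the same character list: '0123456789_'.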
Nura111

ASKER

> It depends on how much it matters to be 100% correct, and what you define as "correct". eg Do you want "A4" to be treated as a word?

I wouldn't mind it treating A4 as a word. The problem is that it treats it as the word "a" and not "a4".
Try this then:
$words = str_word_count($text,1,'0123456789');