Question

Count Keywords in html

Asked by: chrisj1963

Hi - I am struggling to get this to work with my current code.

Please see http://70.87.107.194/~g3crmco/delay.php?keyword=website&url=http%3A%2F%2Fwww.homestead.com&Submit=Submit

I am trying to add code at lines 105, 107, and 109, and at lines 248, 251, 254, 257, 260, and 263,
that would display how many times the keyword phrase entered as the variable "keyword" appears between the <h1>, <b>, and <body> tags.

I was provided with the following 2 solutions earlier, but I could not make either work (I just don't know how to add their code to my code to make it go).

Help would be very, very much appreciated....

Here is my code:
 
<?php
error_reporting(E_ALL ^ E_NOTICE);
//prevents script from timing out
set_time_limit(0);
 
 
//--> for google pagerank
function StrToNum($Str, $Check, $Magic)
{
	$Int32Unit = 4294967296;  // 2^32
 
	$length = strlen($Str);
	for ($i = 0; $i < $length; $i++)
	{
		$Check *= $Magic;
		//If the float is beyond the boundaries of integer (usually +/- 2.15e+9 = 2^31),
		//  the result of converting to integer is undefined
		if ($Check >= $Int32Unit)
		{
			$Check = ($Check - $Int32Unit * (int) ($Check / $Int32Unit));
			//if the check less than -2^31
			$Check = ($Check < -2147483648) ? ($Check + $Int32Unit) : $Check;
		}
		$Check += ord($Str{$i});
	}
	return $Check;
}
 
//--> for google pagerank
/*
* Generate a hash for a URL
*/
function HashURL($String)
{
    $Check1 = StrToNum($String, 0x1505, 0x21);
    $Check2 = StrToNum($String, 0, 0x1003F);
 
    $Check1 >>= 2;
    $Check1 = (($Check1 >> 4) & 0x3FFFFC0 ) | ($Check1 & 0x3F);
    $Check1 = (($Check1 >> 4) & 0x3FFC00 ) | ($Check1 & 0x3FF);
    $Check1 = (($Check1 >> 4) & 0x3C000 ) | ($Check1 & 0x3FFF);
 
    $T1 = (((($Check1 & 0x3C0) << 4) | ($Check1 & 0x3C)) <<2 ) | ($Check2 & 0xF0F );
    $T2 = (((($Check1 & 0xFFFFC000) << 4) | ($Check1 & 0x3C00)) << 0xA) | ($Check2 & 0xF0F0000 );
 
    return ($T1 | $T2);
}
 
//--> for google pagerank
/*
* Generate a checksum for the hash string
*/
function CheckHash($Hashnum)
{
    $CheckByte = 0;
    $Flag = 0;
 
    $HashStr = sprintf('%u', $Hashnum) ;
    $length = strlen($HashStr);
 
    for ($i = $length - 1;  $i >= 0;  $i --)
    {
        $Re = $HashStr{$i};
        if (1 === ($Flag % 2))
        {
            $Re += $Re;
            $Re = (int)($Re / 10) + ($Re % 10);
        }
        $CheckByte += $Re;
        $Flag ++;
    }
 
    $CheckByte %= 10;
    if (0 !== $CheckByte)
    {
        $CheckByte = 10 - $CheckByte;
        if (1 === ($Flag % 2) )
        {
            if (1 === ($CheckByte % 2))
            {
                $CheckByte += 9;
            }
            $CheckByte >>= 1;
        }
    }
 
    return '7'.$CheckByte.$HashStr;
}
 
//get google pagerank
function getpagerank($url)
{
    $query="http://toolbarqueries.google.com/search?client=navclient-auto&ch=".CheckHash(HashURL($url)). "&features=Rank&q=info:".$url."&num=100&filter=0";
    $data=file_get_contents_curl($query);
    //print_r($data);
    $pos = strpos($data, "Rank_");
    if($pos === false){} else
    {
        $pagerank = substr($data, $pos + 9);
        return $pagerank;
    }
}
 
//code for h1
 
//code for b
 
//code for body
 
 
//for POST request with curl
function do_post_request_curl($url, $data)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL,$url); // set url to post to
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1); // return into a variable
    curl_setopt($ch, CURLOPT_POST, 1); // set POST method
    curl_setopt($ch, CURLOPT_POSTFIELDS, $data); // add POST fields
    $result = curl_exec($ch); // run the whole process
    //echo $result;
    curl_close($ch);
    return $result;
}
 
function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); //Set curl to return the data instead of printing it to the browser.
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    curl_close($ch);
 
    return $data;
}
function getUrl ($sel_keyword) {
	$arrLinks=array();
	$cnt=0;
 
	$url = 'http://www.google.com/search?hl=en&q='.$sel_keyword.'&btnG=Google+Search&aq=f&oq=';
	$str = "'".file_get_contents($url)."'";
 
    $h1count = preg_match_all('/\<li class=g\>\<h3 class=r\>\<a href="(.*?)" class=l\>/',$str,$patterns);
 
    $href_add = $patterns[1];
 
 
    if(!empty($href_add[0]))
    {
        foreach($href_add as $key => $val)
        {
        $val = preg_replace("/</","&lt;a",$val);
            //echo "<li>" . htmlentities($val) . "</li>";
            $arrLinks[$cnt]=htmlentities($val);
            ++$cnt;
			//number of urls code CJ
            if($cnt==3)
            {
				break;
            }
 
        }
        echo "</ul>";
    }
	return $arrLinks;
}
?>
 
 
 
 
 
<!--<?php include('includes/header.php');?>-->
 
<body>
<form id="form1" name="form1" method="get" action="">
  KEYWORD:
	<input name="keyword" type="text" id="keyword" value="<?php echo $_REQUEST['keyword']; ?>" />
  URL :
  	<input name="url" type="text" size="30" id="url" value="<?php echo $_REQUEST['url']; ?>" />
  	<input type="submit" name="Submit" value="Submit" />
  	<br>
  <br>
</form>
<?php
	if ((isset($_REQUEST['Submit'])) && ($_REQUEST['url']!='')) {
?>
<fieldset>
<legend>Query result</legend>
<?php
	$urlarray = array();
 
	$urlarray[0]=$_REQUEST['url'];
 
 
	$arrLinks=getUrl($_REQUEST['keyword']);
 
	for($i=0,$j=1;$i<count($arrLinks);$i++,$j++){
		$urlarray[$j]=$arrLinks[$i];
	}
 
 
?>
 
<table border=1 width="80%">
	<tr>
		<th>Link</th>
		<th>Google pagerank</th>
        <th># of times keyword is <br>
	    in &lt;Title&gt;&lt;/Title&gt; tags</th>
		<th><p>Keyword in Title as %<br>
		  of all words in Title</p>
	    </th>
		<th># of times keyword is <br>
	    in Description tags</th>
		<th>Keyword in b as % <br>
	    of all words in Description</th>
		<th># of times keyword is <br>
	    in &lt;h1&gt;&lt;/h1&gt; tags</th>
		<th><p>Keyword in H1 as %<br>
		  of all words in Header 1s</p>
	    </th>
		<th># of times keyword is <br>
	    in &lt;b&gt;&lt;/b&gt; tags</th>
		<th>Keyword in b as % <br>
	    of all words in bold</th>
         <th># of times keyword is <br>
	    in &lt;body&gt;&lt;/body&gt; tags</th>
		 <th>Keyword in &lt;body&gt; as % <br>
	    of all words in body.</th>
 
  </tr>
<?php for($i=0;$i<count($urlarray);$i++){
// wait for 5 seconds = 5,000,000  HUMAN EMULATION
usleep(5000000); 
?>
	<tr>
		<td>
			<?php
			if($i==0){
			?>
				<a href="<?php echo $urlarray[$i];?>" style="color:blue"><?php echo $urlarray[$i];?></a>
			<?php
			}
			else{
			?>
				<a href="<?php echo $urlarray[$i];?>"><?php echo $urlarray[$i];?></a>
			<?php
			}
			?>
 
		</td>
		<td>
			<?php echo getpagerank($urlarray[$i]);?>
		</td>
	
        <td>
			Title # result
        </td>
        <td>
			Title % result
        </td>
        <td>
		    Description # result	
        </td>
        <td>
			Description % result
        </td>
                <td>
		    h1 # result	
        </td>
        <td>
			h1 % result
        </td>
                <td>
		    b # result	
        </td>
                <td>
			b % result
        </td>
                <td>
		    body # result	
        </td>
                <td>
			body % result
        </td>
	</tr>
<?php } ?>
</table>
 
</fieldset>
<?php
}
else if(isset($_REQUEST['Submit']))
{
	echo 'Please Enter a URL.';
}
?>
</body>
<!--<?php include('includes/footer.php'); ?>-->
</html>
 
 
HERE IS ONE SOLUTION FROM EARLIER: (I could not get this to work at all)
$needle1 = "<h1>";
$needle2 = "</h1>";
		$counter = substr_count($temp, $needle1);
		//echo $counter;
		$i=1;
		while($i<=$counter){
			$pos1 = stripos($temp, $needle1, $pos4+1);
			$pos2 = $pos1+4;
			$pos3 = stripos($temp, $needle2, $pos2);
			$stuffinh1[] = substr($temp, $pos2, ($pos3-$pos2));
			$pos4 = $pos1;
			$i++;
		}
 
HERE IS THE OTHER SOLUTION FROM EARLIER: (I got the code to run and print, but 1) there is an error and 2) I could not figure out how to integrate it with my code.)
 
<?php
$page="
<b>word2</b>	
<b>word1 word3 word3 word2 word2</b>
<b>word2 word2 word5 word2 word3</b>
<b>word1 word2 word3 word4 word5</b>
<b>word1 word3 word3 word2 word2</b>
<b>word2 word2 word5 word2 word3</b>
<b>word1 word2 word3 word4 word5</b>
<h1>word1 word3 word3 word2 word2</h1>
<h1>word2 word2 word5 word2 word3</h1>
<h1>word1 word2 word3 word4 word5</h1>
<h2>word1 word 2 word3 word4</h2>
";
$keyword = "word2";
 
$counth1 = "0";
$countb = "0";
$keywords_counth1 = "0";
$keywords_countb = "0";
 
preg_match_all("%<h1>[\w\s]*(?<!\w)(?=\w)($keyword)(?<=\w)(?!\w)[\w\s]*</h1>%m", $page, $allh1tags);
preg_match_all("%<b>[\w\s]*(?<!\w)(?=\w)($keyword)(?<=\w)(?!\w)[\w\s]*</b>%m", $page, $allbtags);
 
$count = (count($allh1tags['0']) > (count($allh1tags['0']))) ? count($allbtags['0']) : count($allbtags['0']);
 
for($i=0;$i<$count;$i++)
{
@$counth1+=count(explode(" ", $allh1tags['0'][$i]));
@$keywords_counth1+=count(explode($keyword, $allh1tags['0'][$i]))-1;
@$countb+=count(explode(" ", $allbtags['0'][$i]));
@$keywords_countb+=count(explode($keyword, $allbtags['0'][$i]))-1;
}
 
print $counth1; // total of all matching keywords betwen ALL tags <h1> and <b>  SUPPOSED TO BE JUST MATCHING H1.... 19 -- 				MATCHING <H1> & <B>
?>
<br>
<?php	 
print $keywords_counth1; // total number of matching keyword matches betwen <h1> tags 6 -- 			MATCHING <H1>
?>
 <br>
 <?php
print $countb; // total of all words betwen <b> tags 31 -- 											ALL <B>
?>
<br>
<?php
 
print $keywords_countb; // total number of keyword matches betwen b tags (YES) 13 -- 				MATCHING <B>
 
//based on above values you just do the math with them.
 
 
?>
                                  

Asked On: 2009-01-12 at 21:33:40 (ID 24046531)
Topic: PHP Scripting Language
Participating Experts: 3
Points: 500
Comments: 11


Answers

 

by: Roonaan | Posted on 2009-01-12 at 22:53:03 | ID: 23360489

Hi,

Try and change:

<td>
                  <?php echo getpagerank($urlarray[$i]);?>
            </td>
      
        <td>
                  h1 # result
        </td>
        <td>
                  h1 % result
        </td>
        <td>
                b # result      
        </td>
        <td>
                  b % result
        </td>
                <td>
                body # result      
        </td>
        <td>
                  body % result
        </td>
      </tr>

Into the following:

      <td>
                  <?php echo getpagerank($urlarray[$i]);?>
      </td>
      <?php
            # 1 Get the html for the page
        $page_contents = file_get_contents($urlarray[$i]);
        
        # 2 Distill the <body> section from the page content
        $body_content  = preg_replace('/^(.*)(<body(.*)<\/body>)(.*)/ims', '\2', $page_contents);
        
        # 3 Store the keyword in a local variable
        $keyword = $_REQUEST['keyword'];
        
        # 4 Test the h1's for keywords
        $all_h1_words_string = '';
        $all_h1_words_array = array();

        # 4.1 Find all h1 tags, store all separate words in an array as well as a string
        if(preg_match_all('#<h1>(.*?)</h1>#ims', $page_contents, $m)) {
              $all_h1_words_string = strip_tags(implode('.', $m[1]));
              $all_h1_words_array  = preg_split('/[^\w\-\']+/', $all_h1_words_string);
        }
        # 4.2 Count the number of times the keyword is in the string with all h1 contents
        $num_keyword_in_h1   = count(preg_match_all('/(^|\W)'.preg_quote($keyword).'(\W|$)/i', $all_h1_words_string, $m));
        # 4.3 Compare the counted number with the amount of words, taking into account that a division by zero error should be prevented
        if($num_keyword_in_h1   == 0 || count($all_h1_words_array) == 0) {
              $prc_keyword_in_h1 = 0;
        } else {
                  $prc_keyword_in_h1 = (100 / count($all_h1_words_array) * $num_keyword_in_h1);
            }
        
            
            # 5. Test the <b>'s for keywords. (Similar to h1)
        $all_bold_words_string = '';
        $all_bold_words_array = array();
        # 5.1
        if(preg_match_all('#<b>(.*?)</b>#ims', $page_contents, $m)) {
              $all_bold_words_string = implode('.', $m[1]);
              $all_bold_words_array  = preg_split('/[^\w\-\']+/', $all_bold_words_string);
        }
            # 5.2
        $num_keyword_in_bold = count(preg_match_all('/(^|\W)'.preg_quote($keyword).'(\W|$)/i', $all_bold_words_string, $m));
        # 5.3 (same as 4.3, but as a ternary expression: $value = $somebooleanexpression ? $value_if_expression_is_true : $value_if_expression_is_false)
        $prc_keyword_in_bold = $num_keyword_in_bold == 0 || count($all_bold_words_array) == 0 ? 0 : (100 / count($all_bold_words_array) * $num_keyword_in_bold);
        
        # 6. Test the body. We don't use a array for storing all words, as that might get memory intensive
        $num_keyword_in_body = count(preg_match_all('/(^|\W)'.preg_quote($keyword).'(\W|$)/i', strip_tags($body_content), $m));
        $num_words_in_body   = count(preg_split('/[^\w\-\']+/', strip_tags($body_content)));
            $prc_keyword_in_body = $num_keyword_in_body == 0 || $num_words_in_body == 0 ? 0 : (100 / $num_words_in_body * $num_keyword_in_body);
            
            # 7. Output: For the count, I used number_of_matches / number_of_words instead of only number_of_matches. This allows for better debugging for the time being
      ?>

       <td>
                  <?php printf('%d / %d', $num_keyword_in_h1, count($all_h1_words_array));?>
        </td>
        <td>
                  <?php printf('%0.2f', $prc_keyword_in_h1);?>
        </td>
        <td>
                  <?php printf('%d / %d', $num_keyword_in_bold, count($all_bold_words_array));?>
        </td>
        <td>
                  <?php printf('%0.2f', $prc_keyword_in_bold);?>
        </td>
                <td>
                  <?php printf('%d / %d', $num_keyword_in_body, $num_words_in_body);?>
        </td>
        <td>
                  <?php printf('%0.2f', $prc_keyword_in_body);?>
        </td>
      </tr>
Note that the code does not account for script tags, Flash content, or other HTML that may or may not be relevant to the count.
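
One way to handle the script-tag caveat is to remove script and style blocks before calling strip_tags(). The following is only a minimal sketch under that assumption: visible_text() and the $sample markup are hypothetical names made up for illustration, not part of the answer above, and a regex like this will not cope with every malformed page.

```php
<?php
// Hypothetical helper (not part of the original answer): remove <script> and
// <style> blocks, then strip the remaining tags and collapse whitespace.
function visible_text($html)
{
    // Drop script and style elements together with their contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Remove the remaining tags, decode entities such as &amp;, tidy whitespace.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Made-up sample markup: the keyword inside <script> should not be counted.
$sample = '<body><script>var law = "elder law";</script>'
        . '<h1>Elder Law</h1><style>b { color: red; }</style>'
        . '<p>We practice <b>elder law</b>.</p></body>';

echo visible_text($sample), "\n"; // prints: Elder Law We practice elder law.
```

The cleaned string could then be passed to the body count in step 6 in place of strip_tags($body_content).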

 

by: nizsmo | Posted on 2009-01-12 at 23:10:08 | ID: 23360541

Hi there, hope this solution is of help to you:
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_22815066.html

It discusses how to count the number of occurrences in a string.

Good luck!

 

by: al3cs12 | Posted on 2009-01-12 at 23:13:42 | ID: 23360552

Give this a try as well. Replace the part of your script starting from here:


<?php for($i=0;$i<count($urlarray);$i++){
// wait for 5 seconds = 5,000,000  HUMAN EMULATION
usleep(5000000);
$url_content = file_get_contents_curl($urlarray[$i]);
$keyword = $_REQUEST['keyword']; // keyword entered in the form; used in the patterns below
$words_h1 = "0";
$words_b = "0";
$words_body = "0";
$keywords_counth1 = "0";
$keywords_countb = "0";
$keywords_countbody = "0";

preg_match_all("%<h1>[\w\s]*(?<!\w)(?=\w)($keyword)(?<=\w)(?!\w)[\w\s]*</h1>%m", $url_content, $allh1match);
preg_match_all("%<b>[\w\s]*(?<!\w)(?=\w)($keyword)(?<=\w)(?!\w)[\w\s]*</b>%m", $url_content, $allbmatch);
preg_match_all("%<body>[\w\s]*(?<!\w)(?=\w)($keyword)(?<=\w)(?!\w)[\w\s]*</body>%m", $url_content, $allbodymatch);

$count = (count($allh1match['0']) > (count($allbmatch['0']))) ? count($allh1match['0']) : count($allbmatch['0']);
$count = ($count > (count($allbodymatch['0']))) ? $count : count($allbodymatch['0']);

for($j=0;$j<$count;$j++) // $j, so the outer $i loop over the URLs is not overwritten
{
@$words_h1+=count(explode(" ", $allh1match['0'][$j]));
@$keywords_counth1+=count(explode($keyword, $allh1match['0'][$j]))-1;
@$words_b+=count(explode(" ", $allbmatch['0'][$j]));
@$keywords_countb+=count(explode($keyword, $allbmatch['0'][$j]))-1;
@$words_body+=count(explode(" ", $allbodymatch['0'][$j]));
@$keywords_countbody+=count(explode($keyword, $allbodymatch['0'][$j]))-1;
}

?>
<tr>
 <td>
  <?php
  if($i==0){
  ?>
   <a href="<?php echo $urlarray[$i];?>" style="color:blue"><?php echo $urlarray[$i];?></a>
  <?php
  }
  else{
  ?>
   <a href="<?php echo $urlarray[$i];?>"><?php echo $urlarray[$i];?></a>
  <?php
  }
  ?>

  </td>
 <td>
  <?php echo getpagerank($urlarray[$i]);?>
 </td>

        <td>
  <?php echo $keywords_counth1; // TOTAL NUMBER OF KEYWORDS MATCHED BETWEEN H1 TAGS
  ?>
       </td>
       <td>
  <?php echo ($words_h1 > 0) ? ceil(($keywords_counth1/$words_h1)*100) : 0; // PERCENTAGE OF KEYWORDS MATCHED OUT OF THE TOTAL WORDS BETWEEN H1 TAGS
  ?>
       </td>
       <td>
     <?php echo $keywords_countb;  // TOTAL NUMBER OF KEYWORDS MATCHED BETWEEN B TAGS
  ?>
       </td>
       <td>
  <?php echo ($words_b > 0) ? ceil(($keywords_countb/$words_b)*100) : 0; // PERCENTAGE OF KEYWORDS MATCHED OUT OF THE TOTAL WORDS BETWEEN B TAGS
  ?>
       </td>
               <td>
     <?php echo $keywords_countbody; // TOTAL NUMBER OF KEYWORDS MATCHED BETWEEN BODY TAGS
  ?>
       </td>
       <td>
  <?php echo ($words_body > 0) ? ceil(($keywords_countbody/$words_body)*100) : 0; // PERCENTAGE OF KEYWORDS MATCHED OUT OF THE TOTAL WORDS BETWEEN BODY TAGS
  ?>
       </td>
</tr>
<?php } ?>
</table>

</fieldset>
<?php
}
else if(isset($_REQUEST['Submit']))
{
echo 'Please Enter a URL.';
}
?>
</body>
<!--<?php include('includes/footer.php'); ?>-->
</html>
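
The literal patterns in this thread (such as <h1>...</h1>) will not match tags that carry attributes or contain nested markup. As a possible alternative, and swapping in a different technique from the regex approach used above, PHP's bundled DOM extension can pull the text out of each tag. This is a rough sketch assuming the DOM extension is enabled; tag_texts() and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: collect the text of every element with the given tag
// name, using PHP's DOM extension instead of a regular expression.
function tag_texts($html, $tag)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = array();
    foreach ($doc->getElementsByTagName($tag) as $node) {
        // textContent includes the text of nested elements such as <em>.
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Made-up sample: an <h1> with an attribute and nested markup, and one <b>.
$sample = '<html><body><h1 class="title">Invisible <em>Thread</em></h1>'
        . '<p><b>6.141 Invisible Thread</b></p></body></html>';

print_r(tag_texts($sample, 'h1'));
print_r(tag_texts($sample, 'b'));
```

Each returned string could then be fed to whichever keyword count is used for the table cells.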

 

by: chrisj1963 | Posted on 2009-01-12 at 23:41:16 | ID: 23360639

Roonaan - Thanks very much for the response and the detailed comments. I don't think, though, that it is working quite right. For example, if I put "elder+law" in the "Keyword" text box and "http://www.grosskopfandblack.prontopage.com" in the URL box, I should get at least 2 results for "Elder Law", but the code only shows 1 result. Additionally, every cell that should show "X keywords / X total" always has a 1 as the first number, no matter what.

Can you please take a quick look at your code again and see if you can see what the issue is:

example: http://70.87.107.194/~g3crmco/deny2.php?keyword=elder%2Blaw&url=http%3A%2F%2Fwww.grosskopfandblack.prontopage.com&Submit=Submit

thanks very much.

 

by: Roonaan | Posted on 2009-01-13 at 00:01:40 | ID: 23360706

Yes, sorry. preg_match_all returns the number of matches as an integer (or false on error), not an array of matches, so wrapping it in count() did not work as expected and always produced 1.

After changing the code a little, I got 5 hits on the body, but zero on the h1 and bolds. Looking at the source html for the site, this seems about right.

# 1 Get the html for the page
	  $page_contents = file_get_contents($urlarray[$i]);
	  
	  # 2 Distill the <body> section from the page content
	  $body_content  = preg_replace('/^(.*)(<body(.*)<\/body>)(.*)/ims', '\2', $page_contents);
	  
	  # 3 Store the keyword in a local variable
	  $keyword = $_REQUEST['keyword'];
	  
	  # 4 Test the h1's for keywords
	  $all_h1_words_string = '';
	  $all_h1_words_array = array(); 
	  # 4.1 Find all h1 tags, store all separate words in an array as well as a string
	  if(preg_match_all('#<h1>(.*?)</h1>#ims', $page_contents, $m)) {
	  	$all_h1_words_string = strip_tags(implode('.', $m[1]));
	  	$all_h1_words_array  = preg_split('/[^\w\-\']+/', $all_h1_words_string);
	  }
	  # 4.2 Count the number of times the keyword is in the string with all h1 contents
	  $num_keyword_in_h1   = preg_match_all('/(^|\W)'.preg_quote($keyword).'(\W|$)/i', $all_h1_words_string, $m) ? count($m[0]) : 0;
	  # 4.3 Compare the counted number with the amount of words, taking into account that a division by zero error should be prevented
	  if($num_keyword_in_h1   == 0 || count($all_h1_words_array) == 0) {
	  	$prc_keyword_in_h1 = 0;
  	} else {
			$prc_keyword_in_h1 = (100 / count($all_h1_words_array) * $num_keyword_in_h1);
		}
	  
		
		# 5. Test the <b>'s for keywords. (Similar to h1)
	  $all_bold_words_string = '';
	  $all_bold_words_array = array();
	  # 5.1
	  if(preg_match_all('#<b>(.*?)</b>#ims', $page_contents, $m)) {
	  	$all_bold_words_string = implode('.', $m[1]);
	  	$all_bold_words_array  = preg_split('/[^\w\-\']+/', $all_bold_words_string);
	  }
		# 5.2
	  $num_keyword_in_bold = preg_match_all('/(^|\W)'.preg_quote($keyword).'(\W|$)/i', $all_bold_words_string, $m) ? count($m[0]) : 0;
	  # 5.3 (same as 4.3, but as a ternary expression: $value = $somebooleanexpression ? $value_if_expression_is_true : $value_if_expression_is_false)
	  $prc_keyword_in_bold = $num_keyword_in_bold == 0 || count($all_bold_words_array) == 0 ? 0 : (100 / count($all_bold_words_array) * $num_keyword_in_bold);
	  
	  # 6. Test the body. We don't use a array for storing all words, as that might get memory intensive
	  $num_keyword_in_body = preg_match_all('/(^|\W)'.preg_quote($keyword).'(\W|$)/i', strip_tags($body_content), $m) ? count($m[0]) : 0;
	  $num_words_in_body   = count(preg_split('/[^\w\-\']+/', strip_tags($body_content)));
		$prc_keyword_in_body = $num_keyword_in_body == 0 || $num_words_in_body == 0 ? 0 : (100 / $num_words_in_body * $num_keyword_in_body);
		
		# 7. Output: For the count, I used number_of_matches / number_of_words instead of only number_of_matches. This allows for better debugging for the time being
	?>
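
The counting logic in steps 4 to 6 repeats the same pattern three times, so it could be folded into one function. The sketch below is an assumption-laden illustration rather than the code from this answer: keyword_stats() is a hypothetical name, and it swaps the (^|\W)...(\W|$) pattern for lookarounds so that two occurrences separated by a single character are both counted. It relies on preg_match_all() returning the number of full matches directly.

```php
<?php
// Hypothetical helper, not code from the thread: returns
// array(keyword hits, total words, keyword percentage) for a piece of text.
function keyword_stats($text, $keyword)
{
    // Lookarounds keep the neighbouring characters out of the match, so
    // back-to-back occurrences are each counted. preg_quote() gets the
    // delimiter so a "/" in the keyword cannot break the pattern.
    $pattern = '/(?<!\w)' . preg_quote($keyword, '/') . '(?!\w)/i';
    // preg_match_all() returns the match count (false on error becomes 0).
    $hits = (int) preg_match_all($pattern, $text, $unused);
    // Same word definition as above, but without empty entries.
    $words = count(preg_split('/[^\w\-\']+/', $text, -1, PREG_SPLIT_NO_EMPTY));
    $percent = $words > 0 ? 100 * $hits / $words : 0;
    return array($hits, $words, $percent);
}

list($hits, $words, $percent) =
    keyword_stats('Elder law, elder law and estate planning', 'elder law');
printf("%d / %d (%0.2f%%)\n", $hits, $words, $percent); // prints: 2 / 7 (28.57%)
```

Note that a two-word phrase counts as one hit against the total number of single words, which matches how the percentages are computed in this answer.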

                                              

 

by: chrisj1963 | Posted on 2009-01-13 at 00:14:39 | ID: 23360758

Hey. Thanks again. I am still seeing an issue, though. For example:

http://70.87.107.194/~g3crmco/deny4.php?keyword=invisible%2Bthread&url=http%3A%2F%2Fwww.sewing.org&Submit=Submit

If you view the source of "www.sewing.org", there is an instance of "invisible thread" in the code.

<b>6.141 Invisible Thread</b>

Is there a reason why your code might be missing it?  

By the way, in my example, if you enter "invisible thread" in the Keyword field it won't work; you have to enter a + sign between the words of a phrase, in this case "invisible+thread".

Thanks for helping me with this.

 

by: Roonaan | Posted on 2009-01-13 at 00:22:39 | ID: 23360784

The + is only required for the Google page rank. But when you enter it, it will break the search, as the string "invisible+thread" does not appear anywhere in the <b>, <h1> or <body>.

For google page rank, you can solve that by using (around line 120):

replace
$url = 'http://www.google.com/search?hl=en&q='.$sel_keyword.'&btnG=Google+Search&aq=f&oq=';
with
$url = 'http://www.google.com/search?hl=en&q='.url_encode($sel_keyword).'&btnG=Google+Search&aq=f&oq=';

Then you can enter invisible thread without the + sign:
http://70.87.107.194/~g3crmco/deny4.php?keyword=invisible+thread&url=http%3A%2F%2Fwww.sewing.org&Submit=Submit
On that search I get a 1 on the keyword in bolds section.
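
The difference between the two example URLs in this thread comes down to query-string decoding: a bare + is decoded to a space, while %2B is decoded to a literal + character. A small sketch using parse_str(), which applies the same decoding PHP uses when it fills $_GET (the variable names $plain and $escaped are just for illustration):

```php
<?php
// parse_str() decodes a query string the same way PHP populates $_GET.
parse_str('keyword=invisible+thread', $plain);     // "+" decodes to a space
parse_str('keyword=invisible%2Bthread', $escaped); // "%2B" decodes to a literal "+"

var_dump($plain['keyword']);   // string(16) "invisible thread"
var_dump($escaped['keyword']); // string(16) "invisible+thread"
```

So typing "invisible+thread" into the form (which the browser submits as %2B) leaves a + inside the keyword, and that string never appears in the page text.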

 

by: chrisj1963 | Posted on 2009-01-13 at 00:34:15 | ID: 23360834

Unfortunately, that did not work. Please see http://70.87.107.194/~g3crmco/deny5.php.

I am going to close this out and open a new question. I think that this is a different issue.

If you could respond to that, I would appreciate it.

Thanks very much!

 

by: Roonaan | Posted on 2009-01-13 at 00:39:18 | ID: 23360855

Sorry, url_encode should be urlencode. I should have tested it before writing it from memory.
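
For reference, a minimal sketch of the corrected line in isolation, assuming the keyword is typed with a plain space. urlencode() is the built-in PHP function; the $sel_keyword value here is only an example.

```php
<?php
// urlencode() turns spaces into "+" and escapes other reserved characters,
// so the phrase can be typed naturally and encoded only when building the URL.
$sel_keyword = 'invisible thread'; // example value, as typed into the form
$url = 'http://www.google.com/search?hl=en&q=' . urlencode($sel_keyword)
     . '&btnG=Google+Search&aq=f&oq=';

echo $url, "\n";
// prints: http://www.google.com/search?hl=en&q=invisible+thread&btnG=Google+Search&aq=f&oq=
```

The unencoded $sel_keyword can still be used as-is for the <h1>, <b> and <body> counts.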

 

by: chrisj1963 | Posted on 2009-01-13 at 01:05:20 | ID: 23360961

Sorry, my internet went down right after my last post, and I could not get back on until now.
I did try the correction and it worked great. Thanks very much.
More questions will follow!

 

by: chrisj1963 | Posted on 2009-01-13 at 01:06:12 | ID: 31533882

Excellent help. Great knowledge. Patient. Thorough.
Thanks very much!
