Solved

extract overture keywords and save to text file line by line

Posted on 2004-04-30
Last Modified: 2013-11-28
hi there

I'm looking for a script that can extract Overture keywords and save them line by line to a text file. If possible, it should then also extract each of the keyword possibilities into the text file. Can the result be saved in alphanumeric order?

best regards

Question by:playstat
 
LVL 9

Expert Comment

by:techtonik
ID: 10962533
I'm not a native English speaker, but I am something of a PHP programmer, so if you provide an example of this Overture, then perhaps I could understand you. Otherwise I just can't help.
 
 

Author Comment

by:playstat
ID: 10964588
http://inventory.overture.com/d/searchinventory/suggestion/

This is the actual URL. Enter a keyword and the possibilities are displayed. I need something that can extract that info into a text file line by line. Notice also that the actual results link to another set of possibilities for that keyword.

A script that can do this would be great!
 

Author Comment

by:playstat
ID: 10964601
If you can extract everything under that keyword, that would be ideal. Please make sure there are no duplicates. Thanks.
 
LVL 9

Expert Comment

by:techtonik
ID: 10967998
Easy. Here is an example.

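<?php
// keyword to look up
$your_keyword = "mykey";

// fetch the suggestion page for that keyword (rawurlencode keeps spaces valid in the URL)
$htmlpage = file_get_contents("http://inventory.overture.com/d/searchinventory/suggestion/?term=".rawurlencode($your_keyword)."&mkt=us&lang=en_US");

// drop all html tags and the &nbsp; entities
$textpage = strip_tags( $htmlpage );
$textpage = str_replace("&nbsp;","", $textpage);

// keep only the text from "Searches done in" to the end of the page
$texttosave = strstr($textpage, "Searches done in");

// write the result to file.txt
$f = fopen("file.txt", "w");
fwrite($f, $texttosave);
fclose($f);

?>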

I think you get the idea. Further refinement can be done with the string functions:
http://us2.php.net/manual/en/ref.strings.php
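
As a rough illustration of that kind of refinement, here is a minimal sketch that takes the tag-stripped text and keeps one keyword per line, without the counts. It assumes the text consists of the header lines ("Searches done in ...", "Count", "Search Term") followed by alternating count and keyword lines. The function name text_to_keyword_lines is made up for this example, and a keyword consisting only of digits would be dropped along with the counts.

```php
<?php
// Hypothetical helper (not part of the original script): turns the
// tag-stripped page text into a list of keywords, one per entry.
function text_to_keyword_lines(string $text): array
{
    $keywords = [];
    foreach (explode("\n", $text) as $line) {
        $line = trim($line);
        // Skip blank lines and the count column (lines made of digits only).
        if ($line === '' || ctype_digit($line)) {
            continue;
        }
        // Skip the table header lines.
        if ($line === 'Count' || $line === 'Search Term'
            || strpos($line, 'Searches done in') === 0) {
            continue;
        }
        $keywords[] = $line;
    }
    return $keywords;
}

// Sample text shaped like the stripped page (assumed layout).
$sample = "Searches done in March 2004\n  Count\n  Search Term\n\n"
        . "24306\noverture\n\n2733\n1812 overture\n";
$lines = text_to_keyword_lines($sample);

// Save one keyword per line.
file_put_contents('file.txt', implode("\n", $lines) . "\n");
```

In the real script you would pass $texttosave instead of $sample. Any text that follows the result table on the page would still come through, so further filtering may be needed.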
 

Author Comment

by:playstat
ID: 10988778
Can you show me how to refine it further and how to take out the numbers it writes to the file? And, if possible, save the output line by line.

Also, it extracts the actual keywords, but what about going another level for each href, extracting those too, and then removing duplicates?
 
LVL 9

Expert Comment

by:techtonik
ID: 10989275
There are two possibilities for refinement: first, regular expressions; second, PHP string functions. Regexps are more convenient in many cases but require more effort to learn.
Falling back on the EE rules http://www.experts-exchange.com/Web/Web_Languages/PHP/help.jsp#hi56 I feel too lazy to write code for you =) since I don't know what kind of knowledge you require. If you are in doubt about how this script works or can't see a way to improve it, please show what exactly you do not understand.
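
For comparison, the regular expression route could look roughly like the sketch below. It assumes each result row holds a count cell (&nbsp; followed by digits) directly followed by a term cell, like the rows quoted in the accepted solution further down this thread. The function name parse_rows_with_regex and the pattern are illustrative only.

```php
<?php
// Hypothetical helper: pulls "search term" => count pairs out of result rows.
function parse_rows_with_regex(string $html): array
{
    // One count cell (&nbsp; plus digits) directly followed by one term cell.
    $pattern = '~<td[^>]*>(?:<[^>]+>)*&nbsp;(\d+)</td>\s*<td[^>]*>(.*?)</td>~s';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $result = [];
    foreach ($matches as $m) {
        // Drop the markup and the &nbsp; entity to leave the bare term.
        $term = trim(str_replace('&nbsp;', '', strip_tags($m[2])));
        $result[$term] = (int) $m[1];
    }
    return $result;
}

// Two sample rows shaped like the Overture result table.
$html = <<<'HTML'
<tr bgcolor=#333333>
<td><font face="verdana,sans-serif" size=2 color=E8E8E8>&nbsp;24306</td>
<td><font face="verdana,sans-serif" size=2 color=E8E8E8>&nbsp;overture</a></td>
</tr>
<tr>
<td><font face="verdana,sans-serif" size=1>&nbsp;2733</td>
<td>&nbsp;<a href="/d/searchinventory/suggestion/?term=1812%20overture&mkt=us&lang=en_US"><font face="verdana,sans-serif" size=1 color=#000000>1812 overture</a></td>
</tr>
HTML;

$counts = parse_rows_with_regex($html);
print_r($counts);
```

The header rows use <th> cells, so the pattern skips them without any extra cropping. The trade-off is that the pattern depends on the exact cell layout and would need adjusting if the markup changed.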
 

Author Comment

by:playstat
ID: 11006771
I'm trying to understand how the script extracts the information from the page.

For example:

How does it know where to start extracting and where to stop?

How does the filtering take place?

If you can give me several examples from, say, HTML or PHP pages, each with the appropriate filter, maybe I can work out the rest. I just need a way of doing this that I can then vary for other applications.

Could you make the text file output save line by line, without the numbers? I would be most grateful.

best regards

 
LVL 9

Accepted Solution

by:
techtonik earned 500 total points
ID: 11016217
OK, here we go. While building filters, echo your intermediate results to see what you have got.
<?php
// here you specify the keyword to substitute into the URL to fetch the html page with results
$your_keyword = "mykey";

// now read the whole page into a string with all html markup - note the
// use of $your_keyword defined above (rawurlencode keeps spaces valid in the URL)
$htmlpage = file_get_contents("http://inventory.overture.com/d/searchinventory/suggestion/?term=".rawurlencode($your_keyword)."&mkt=us&lang=en_US");

// filter section
// i'll modify it a bit from previous example, where I just stripped html tags
// here we will crop the text to contain only result table
$htmlpage = strstr($htmlpage, "Searches done in");
// RTFM: string strstr ( string haystack, string needle )
// strstr returns part of haystack string from the first occurrence of needle to
// the end of haystack php.net/strstr

// now $htmlpage variable contains following fragment
/*
Searches done in March 2004</font></th>
  </tr>
  <tr align=left bgcolor=#999999>
    <th><font face="verdana,sans-serif" size=2 color=E8E8E8>Count</font></th>
    <th><font face="verdana,sans-serif" size=2 color=E8E8E8>Search Term</font></th>
  </tr>
<tr bgcolor=#333333>
<td><font face="verdana,sans-serif" size=2 color=E8E8E8>&nbsp;24306</td>
<td><font face="verdana,sans-serif" size=2 color=E8E8E8>&nbsp;overture</a></td>
</tr>
<tr bgcolor="#F4F4F4">
<td><font face="verdana,sans-serif" size=1>&nbsp;4266</td>
<td>&nbsp;<a href="/d/searchinventory/suggestion/?term=080101%20ctxtid%20ilgan%2Ejoins%2Ecom%20ilgan%2Eshtml%20overture%20sports&mkt=us&lang=en_US"><font face="verdana,sans-serif" size=1 color=#000000>080101 ctxtid ilgan.joins.com ilgan.shtml overture sports</a></td>
</tr>
<tr>
<td><font face="verdana,sans-serif" size=1>&nbsp;2733</td>
<td>&nbsp;<a href="/d/searchinventory/suggestion/?term=1812%20overture&mkt=us&lang=en_US"><font face="verdana,sans-serif" size=1 color=#000000>1812 overture</a></td>
</tr>
<tr bgcolor="#F4F4F4">
<td><font face="verdana,sans-serif" size=1>&nbsp;1898</td>
<td>&nbsp;<a href="/d/searchinventory/suggestion/?term=international%20overture&mkt=us&lang=en_US"><font face="verdana,sans-serif" size=1 color=#000000>international overture</a></td>
</tr>
*/

// now, if you strip all html markup and &nbsp; entities you will end up with the output from the
// previous example, but now we will go a little bit further to make a more sophisticated filter
// we will extract the fields "count" and "search term" from the html table into an array, where
// the search term will be the key and "count" will be the value associated with that key
// additionally we will extract all links with suggestions into a second array to be
// able to parse these also

// now look at the html markup
// each html row begins with a <tr> tag, so we should split string by this tag to get an
// array of html rows for further processing, but first we need to strip table header
// that is, all up to first <tr bgcolor=#333333>
// since value of bgcolor is not 100% guaranteed to be #333333, we will use only first part
$htmlpage = strstr($htmlpage, "<tr bgcolor");
// since header rows begin with a <tr align=left they will not match and hence will be stripped
// you can check what you've got with the following construction
// echo $htmlpage; die();

// next, split string with php.net/explode
$htmlrowsarr = explode("<tr", $htmlpage);
// test your result with print_r($htmlrowsarr);

// now filling arrays
$overtures = array();
$overlinks = array();
for ($i = 1; $i < count($htmlrowsarr); $i++) {
  // the number begins right after the first &nbsp; and runs to the next closing tag </td>
  // strip everything before &nbsp;
  $str = strstr($htmlrowsarr[$i], "&nbsp;");
  // skip chunks that contain no &nbsp; at all (strstr returns false for them)
  if ($str === false) {
    continue;
  }
  // determine position of </td>
  $to = strpos($str, "</td>");
  // getting value for the first array - substring from the 7th character (skipping &nbsp;)
  // to the position of the </td> closing tag. indexes are counted from zero, so the
  // 7th character has index 6
  $value = substr($str, 6, $to - 6);
  // $to-6 is the number of characters we need to extract
  // echo "$value.";

  // the next &nbsp; precedes our search term or a link to another suggestion, so
  // search the string for &nbsp; with a preceding > to match only the second &nbsp;
  $str = strstr($str, ">&nbsp;");
  if ($str === false) {
    continue;
  }
  // determine where the link ends
  $to = strpos($str, "</td>");
  // getting link (7 characters are skipped: > plus &nbsp;)
  $link = substr($str, 7, $to - 7);
  // strip html markup to get the key
  $key = strip_tags($link);
  // extracting the actual URL from the href attribute
  // it is the substring from the character after href's opening quote up to the next quote
  // the first element is our own search term, so it doesn't have a link
  $from = strpos($link, "href=\"");
  if ($from !== false) {
    $from += 6;
    $to = strpos($link, "\"", $from);
    $url = substr($link, $from, $to - $from);
  } else {
    $url = "";
  }

  // now building arrays
  $overtures[$key] = $value;
  $overlinks[$key] = $url;
}

print_r($overtures);
// now that you've got this info - do what you want =)

// actually you do not need to extract the links - just supply $key as the value of
// $your_keyword at the beginning of this script and it will fetch that page for parsing
?>
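
To go one level deeper, as asked earlier in the thread, the same parsing can be repeated for every term found on the first page. The sketch below only shows the looping and duplicate handling. It assumes you have some callable that returns the list of search terms for one keyword (for example, the script above wrapped in a function). The name collect_two_levels and the stand-in data are invented for the example.

```php
<?php
// Hypothetical helper: $fetchTerms is any callable that returns the list of
// search terms for one keyword.
function collect_two_levels(string $keyword, callable $fetchTerms): array
{
    $seen = [];
    foreach ($fetchTerms($keyword) as $term) {
        $seen[$term] = true;              // first level
        if ($term === $keyword) {
            continue;                     // this page was already fetched
        }
        foreach ($fetchTerms($term) as $sub) {
            $seen[$sub] = true;           // second level
        }
    }
    // Array keys are unique, so duplicates have already collapsed.
    return array_map('strval', array_keys($seen));
}

// Stand-in data (made up) so the sketch runs without any network access.
$fake = [
    'overture'      => ['overture', '1812 overture'],
    '1812 overture' => ['1812 overture', '1812 overture mp3'],
];
$lookup = function (string $k) use ($fake): array {
    return $fake[$k] ?? [];
};

$terms = collect_two_levels('overture', $lookup);
print_r($terms);
```

Using the terms as array keys removes duplicates as a side effect, which avoids a separate array_unique pass. With a real fetcher every first-level term costs one extra page request.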

You can make a function from this example.
Download the manual and.. good luck. =)
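
For the last step of the original question (one keyword per line, no duplicates, sorted), a minimal sketch could look like this. It assumes you already have one or more lists of keywords, for example array_keys($overtures) from each parsed page. The function name save_keywords and the file name keywords.txt are made up for the example.

```php
<?php
// Hypothetical helper: merges keyword lists, drops duplicates, sorts them in
// natural order (numbers before letters, case-insensitive) and writes one
// keyword per line to $path.
function save_keywords(array $keywordLists, string $path): array
{
    $all = [];
    foreach ($keywordLists as $list) {
        foreach ($list as $keyword) {
            // Cast to string: purely numeric array keys come back as integers.
            $all[] = trim((string) $keyword);
        }
    }
    // Remove empty entries and duplicates, then reindex.
    $all = array_filter($all, function ($k) {
        return $k !== '';
    });
    $all = array_values(array_unique($all));
    sort($all, SORT_NATURAL | SORT_FLAG_CASE);

    file_put_contents($path, implode("\n", $all) . "\n");
    return $all;
}

// Sample lists standing in for array_keys($overtures) from two pages.
$path  = 'keywords.txt';
$saved = save_keywords(
    [['overture', '1812 overture'], ['international overture', 'overture']],
    $path
);
print_r($saved);
```

SORT_NATURAL compares embedded numbers by value, so "2 overture" would sort before "10 overture". Plain sort() would order them character by character instead.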