Solutionabc asked:
PHP Scraper hangs and will not complete
Hi,
I am running into a problem with a scraping script that I am using. I am using Simple_html_dom.php (http://simplehtmldom.sourceforge.net/) to help scrape the pages I want. The problem is that the script ends up hanging and stops updating the database after a couple hundred URLs. I also want to run this script from a cron job weekly or biweekly.
I am trying to scrape 3000-4000 different URLs, and I assume my struggles are coming from my inefficient code.
I basically create an array with all the different URLs, then iterate through each one and scrape it. I read somewhere that to scrape a lot of URLs you should use multithreading with cURL and scrape in "blocks" of 8 to improve performance.
Can someone shed some light on the best way to approach this hurdle, and maybe share a sample script or point me to a good resource/book? I have checked Google, but it's hard to hunt down the right answer when I'm doubtful of the correct approach.
thanks!
You might need to increase max_execution_time and max_input_time in your php.ini to reasonable values.
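For a quick test you can also raise the execution time and memory limit at the top of the script itself (max_input_time can only be changed in php.ini or .htaccess; the values below are just examples):
set_time_limit(0);               // remove the execution time limit (fine for a cron/CLI run)
ini_set('memory_limit', '512M'); // example value, adjust as needed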
ASKER
For testing purposes I increased the memory limit to 900MB and the time limit to 1200 seconds. It lasted longer but still didn't complete. That's why I am assuming there is a better approach.
Check whether you destroy the objects used by simplehtmldom so PHP has a chance to free the corresponding memory.
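For example, something along these lines after each page should release the memory (simple_html_dom holds circular references internally, so clear() followed by unset() is the usual pattern):
$html = new simple_html_dom();
$html->load_file('http://www.website.com/'.$page);
// ... scrape what you need ...
$html->clear();   // break the parser's internal references
unset($html);     // let PHP free the memory before the next page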
Post your code if that does not help.
ASKER
I think it might have helped, but it still hangs. Check out my code. thx
$pageArray holds all the page names (about 3000-4000).
foreach ($pageArray as $page) {
    $html = new simple_html_dom();
    $html->load_file('http://www.website.com/'.$page);
    $single = $html->find('#section', 0);
    $awards = strip_tags($single, '<img>');
    $html->clear();
    if (!empty($awards)) {
        $html = str_get_html($awards);
        $chanNames = "";
        foreach ($html->find('img') as $element) {
            $chanNames .= $element->alt.";";
        }
        $html->clear();
        $res = mysql_query("SELECT page FROM Table WHERE page = '".$page."'");
        $num = mysql_num_rows($res);
        if ($num) {
            mysql_query("UPDATE table"); // with $chanNames
        } else {
            mysql_query("INSERT table"); // with $chanNames
        }
    }
}
Are you scraping the HTML at the same time you are reading the HTML? If so, you might try a different strategy. Here is how I would do it (admittedly this is a pretty broad-brush solution description, but I have used the design pattern before and it worked well for me on about 50,000 URLs).
Create a list of the URLs you want to scrape. Store these URLs in a database table along with two true/false columns named "HTML_stored" and "HTML_scraped." The default value for these columns is zero, meaning not done yet.
Data retrieval: Make a SELECT query from the URLs table with a WHERE clause to select all the URLs that have HTML_stored == 0. Iterate over the result set. Use file_get_contents() or curl to read the HTML from each URL. Store the HTML string in the database. As each HTML string is retrieved and stored, update the URLs table to set HTML_stored == 1. If this process times out, just restart it. Eventually it will complete and you will have all the HTML gathered.
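A retrieval pass might look roughly like this (table and column names are only placeholders, and I am sticking with the mysql_* functions to match your snippet):
// Pass 1: fetch and store the raw HTML (restartable)
$res = mysql_query("SELECT id, url FROM urls WHERE HTML_stored = 0");
while ($row = mysql_fetch_assoc($res)) {
    $htm = @file_get_contents($row['url']);   // curl works here too
    if ($htm === FALSE) continue;             // skip failures; they get retried next run
    mysql_query("UPDATE urls
                 SET html = '" . mysql_real_escape_string($htm) . "', HTML_stored = 1
                 WHERE id = " . (int)$row['id']);
}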
Data analysis: Make a SELECT query from the URLs table with a WHERE clause to select all the URLs that have HTML_stored == 1 and HTML_scraped == 0. As you complete the scraping process and store the final data, update the URLs table to set HTML_scraped == 1. Again, this is a restartable process. It can be started even before the data retrieval is complete, but it should be run as a separate process, not as part of the retrieval.
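And the scraping pass could look something like this, reusing your Simple_html_dom selector on the stored string (again, names are placeholders):
// Pass 2: scrape the stored HTML (restartable, no HTTP involved)
$res = mysql_query("SELECT id, html FROM urls WHERE HTML_stored = 1 AND HTML_scraped = 0");
while ($row = mysql_fetch_assoc($res)) {
    $dom = str_get_html($row['html']);        // parse the saved string
    if ($dom) {
        $single = $dom->find('#section', 0);  // same selector you already use
        // ... extract $chanNames and UPDATE/INSERT your results table here ...
        $dom->clear();
        unset($dom);
    }
    mysql_query("UPDATE urls SET HTML_scraped = 1 WHERE id = " . (int)$row['id']);
}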
Hopefully the decoupling will result in some easier-to-use processes. Best of luck with it, ~Ray
ASKER
Yeah I can definitely try that method of scraping.
How can I tell that it has timed out and needs to be run again if I use a cronjob for execution?
And after I have the page HTML as a string, how do I extract the section of HTML that I am interested in? Using Simple_html_dom I would just use jQuery-style selectors.
Thx
ASKER
As a follow-up to my second question: I guess I could still use the Simple_html_dom selectors and just put the stored HTML string through it when the page is called.
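i.e. something like this (just guessing at the column name here):
$dom = str_get_html($row['html']);     // HTML string pulled from the db
$single = $dom->find('#section', 0);   // same selector I use now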
Do you recommend having another routine that goes through each record and cleans up the HTML so that it only contains the important information (stripping away the unwanted HTML), or should I extract the info from the stored string on the fly when a page is called? (I'm not sure how much that would add to my page load time.)