Pedro Chagas (Portugal) asked on:

Transform script for processing of multiple cURL handles in parallel

Hi E's, the code in the snippet below is what I use to fetch the content of a list of URLs.
I read the URLs from the database, and in a WHILE loop I fetch the content of each URL; if the fetch succeeds, I save the content back to the database.
Each time I execute the script, it fetches the content of 12 or more URLs, and the average execution time for the whole script is about 26 seconds.
26 seconds is too long for my project; the target is under 10 seconds. I know cURL has functions that can check all the URLs in parallel, and I need help implementing curl_multi in my script.
I have read about cURL, but I really don't know how to apply those features to my code.

What changes do I have to make so the script checks all the URLs in parallel?

Regards, JC
function my_curl($url) {
    // HEADERS FROM FIREFOX - APPEARS TO BE A BROWSER REFERRED BY GOOGLE
    $curl = curl_init();

    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.

    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com');
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_TIMEOUT, 1); // NOTE: 1 second is very aggressive; slow sites will come back FALSE

    if (!$html = curl_exec($curl)) { $html = FALSE; }
    curl_close($curl);
    return $html;
}

$urlc_result = mysql_query("SELECT * FROM urls where palavra = '$palavradb'", $db);
$urlc_rows = mysql_num_rows($urlc_result);
while ($urlc = mysql_fetch_object($urlc_result)) { //1000
    $ee = my_curl($urlc->url);
    if ($ee == FALSE) { //2000
        echo "false";
    } else { //M2000
        $codigo = mysql_real_escape_string($ee); // escape quotes so the HTML cannot break the INSERT
        mysql_query("insert into sites set url_bruto = '$urlc->url', contain = '$codigo'", $db);
    }
}


Ray Paseur (United States):

This page might be useful. You may want to turn your speakers off - the audio is annoying - but the code samples should be helpful.

http://www.askapache.com/php/curl-multi-downloads.html

Best regards, ~Ray
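For reference, a minimal curl_multi sketch along the lines of that article - fetching a set of URLs in parallel and returning each page's HTML keyed by URL. This is only a sketch under stated assumptions (the curl extension is loaded; the option list is trimmed, and the per-handle headers from my_curl() could be added the same way):

```php
<?php
// Fetch several URLs in parallel and return each page's HTML,
// keyed by URL, or FALSE for URLs that returned nothing.
function fetch_all_parallel($urls, $timeout = 10) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // keep content in memory
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    // Drive all transfers until none are still active.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh); // wait for socket activity instead of spinning
        }
    } while ($active && $status == CURLM_OK);
    // Collect the results and clean up.
    $results = array();
    foreach ($handles as $url => $ch) {
        $html = curl_multi_getcontent($ch);
        $results[$url] = ($html === '' ? FALSE : $html);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

The rows from the urls table could be collected into an array first, then each entry of the returned array saved with the same INSERT used in the WHILE loop.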
Pedro Chagas (ASKER):

Hi @Ray, I tried to adapt the code you recommended (see the snippet below), but I have two difficulties: I don't know where to put the code that saves to the database (mysql_query("insert into sites set url_bruto = '$url', entirecode = '???????'", $db);), and I don't know which variable holds the content of each URL (the '???????' in entirecode).

The code I posted in the snippet in my first post of this question came from you in another question, and I'm certain it is very useful, so I'm asking for your help to improve the askapache script based on your code.

Regards, JC



include ("database.php"); // database connection
$urlc_result = mysql_query("SELECT * FROM urls where palavra = '$palavradb'", $db);
$urlc_rows = mysql_num_rows($urlc_result);
$urls = array(); // initialize so the loops below are safe when no rows match
while ($urlc = mysql_fetch_object($urlc_result)) { //1000
    $urls[] = $urlc->url;
}
print_r($urls);

$save_to = 'aaaaaa/';

$mh = curl_multi_init();
foreach ($urls as $i => $url) {
    $g = $save_to . basename($url);
    if (!is_file($g)) {
        $conn[$i] = curl_init($url);
        $fp[$i] = fopen($g, "w");
        curl_setopt($conn[$i], CURLOPT_FILE, $fp[$i]); // content is written to the file, not returned
        curl_setopt($conn[$i], CURLOPT_HEADER, 0);
        curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 60);
        curl_multi_add_handle($mh, $conn[$i]);
    }
}
do {
    $n = curl_multi_exec($mh, $active);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($active);
foreach ($urls as $i => $url) {
    if (!isset($conn[$i])) continue; // skipped above because the file already existed
    curl_multi_remove_handle($mh, $conn[$i]);
    curl_close($conn[$i]);
    fclose($fp[$i]);
}
curl_multi_close($mh);


Hi again, I don't want to save the files to any folder ($save_to='aaaaaa/'); I want to put the content of each site in the database.

Regards, JC
Have you got the CURL part of this working - in other words, can you see the data you are retrieving from the foreign web sites?  If so, please show us where it shows up and how you visualize it.  Thanks, ~Ray
Hi, after I run the script I get what you see in the attached file (image). The content of the foreign sites goes to the "aaaaaa" path.
I don't want to save the code of the foreign URLs to any path; I just want to save the code of each URL in the database.
If you want more information, tell me - this script is very important for my project.

Regards, JC
curl.png
Let me ask this again...

Have you got the CURL part of this working?  Can you see the data you are retrieving from the foreign web sites?  For example, can you print out the foreign HTML with var_dump()?
Hi @Ray, I don't understand your answer well, but I tried to do what you said (my English is not so good).
I tried putting var_dump() in different places in the code with the variables $conn[$i] and $fp[$i]; the only thing I see in the browser is the word "NULL", so I can't print the HTML.
As I told you in my last post, the URLs are being saved in the folder /aaaaaa. The only thing I need is to INSERT the HTML of the foreign URLs into my database.

Please tell me if you need more information.

Regards, JC
ASKER CERTIFIED SOLUTION
Ray Paseur (United States)
[Solution text available only to Experts Exchange members.]
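The accepted solution is hidden above, so here is only a hedged sketch of the usual way to do this step: replace CURLOPT_FILE (which streams the content to disk, which is why var_dump() showed nothing useful) with CURLOPT_RETURNTRANSFER, then read each page back with curl_multi_getcontent(). The mysql_* lines mirror the thread's own INSERT and are commented out so the sketch stands alone; $urls would be filled from the urls table:

```php
<?php
// Keep each page's HTML in memory instead of writing it to a file,
// then pull it out of the handle with curl_multi_getcontent().
$urls = array(); // placeholder - would be filled from the urls table
$mh = curl_multi_init();
$conn = array();
foreach ($urls as $i => $url) {
    $conn[$i] = curl_init($url);
    curl_setopt($conn[$i], CURLOPT_RETURNTRANSFER, 1); // instead of CURLOPT_FILE
    curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 60);
    curl_multi_add_handle($mh, $conn[$i]);
}
do {
    curl_multi_exec($mh, $active);
    if ($active) curl_multi_select($mh);
} while ($active);
foreach ($conn as $i => $ch) {
    $codigo = curl_multi_getcontent($ch); // this variable holds each page's HTML
    // $codigo = mysql_real_escape_string($codigo); // escape before building the SQL
    // mysql_query("insert into sites set url_bruto = '{$urls[$i]}', contain = '$codigo'", $db);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
```

With this approach no $save_to folder is needed at all, since nothing is written to disk.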
Hi @Ray, that works now!
But now I have to erase the content of the foreign URLs that goes to the folder "/joao".
The erase should not empty the whole folder; it must identify the URLs and erase the files one by one, because that folder is a shared folder.

Please tell me how to erase the files in the /joao folder one by one.
I appreciate your big help, thank you.

Regards, JC
SOLUTION
[Solution text available only to Experts Exchange members.]
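That solution is also hidden, but deleting only the fetched files can be sketched like this - rebuilding each filename from the URL list with the same basename() logic the download loop used, so other files in the shared folder are left alone (the folder name and $urls are placeholders):

```php
<?php
// Remove only the files this script created in the shared folder.
// Returns the list of paths actually deleted.
function delete_fetched_files($urls, $save_to) {
    $deleted = array();
    foreach ($urls as $url) {
        $g = rtrim($save_to, '/') . '/' . basename($url);
        if (is_file($g) && unlink($g)) {
            $deleted[] = $g;
        }
    }
    return $deleted;
}
```

Called after the database inserts, e.g. delete_fetched_files($urls, 'joao/'), it erases the files one by one and skips anything it did not create.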
Hi @Ray, I discovered an issue in the script. When I check in the database what the script inserted from the foreign URLs, some URLs were transferred completely, but for others only part of the code was inserted.
I opened the content of each file in the folder /joao and the files are complete, but in the database some of them are not.
One of the reasons, I think, is special characters: when the system finds a special character it stops inserting into the database - characters like "`".

One example - URL checked (http://www.pizzasmaispizzas.com.br/):
==================what was inserted into the database=========================
<html>
<head>
   <title>Pizzas & Mais Pizzas - A melhor pizza de S
===========================================
If you check the URL http://www.pizzasmaispizzas.com.br/ and view the page source, you can see more than 3 lines, and when I check the file in the folder /joao I can see the entire HTML code of that site.
Note: after "A melhor pizza de S" the next letter is "ã", so I think the problem is the characters.

Does this issue have a solution? Do you want me to open a new question for it?

Regards, JC
Hi again, would urlencode() help?

Regards, JC
Not sure about urlencode() - that is not needed for putting things into the data base, just for putting things on the client screen. You might want to check that your fields are all UTF-8, both in the data base and in the HTML that you generate from the contents of the data base.

Another possible way to handle this is to use base64_encode() and base64_decode() on the data.  You can read up on those functions on php.net.  They should enable you to put anything - even binary data - into a text field in the data base.
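A minimal sketch of the base64 route. The table and column names mirror the thread's own INSERT; the mysql_* lines are commented out so the sketch stands alone, and the sample string is hypothetical:

```php
<?php
// Encode the page before storing, so quotes, backticks and accented
// characters like "ã" cannot truncate or break the INSERT statement.
$html = '<title>Pizzas & Mais Pizzas - A melhor pizza de São</title> `backtick` "quote"';
$codigo = base64_encode($html);
// mysql_query("insert into sites set url_bruto = '$url', contain = '$codigo'", $db);

// When reading it back:
// $row  = mysql_fetch_object($result);
// $html = base64_decode($row->contain);

// Alternative: mysql_real_escape_string($html) also protects the INSERT
// and keeps the stored HTML searchable with SQL, at the cost of still
// depending on the table's character set being UTF-8.
```

base64 output is ASCII-only, so it survives any column character set; the trade-off is roughly 33% more storage and no direct SQL searches on the content.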