Transform script to process multiple cURL handles in parallel

Hi Experts, the code in the snippet below is what I use to fetch the contents of URLs.
I read the URLs from the database and, in a WHILE loop, fetch the contents of each one; if the fetch succeeds, I save the contents back to the database.
Each run fetches 12 or more URLs, and the average execution time for the whole script is about 26 seconds.
26 seconds is too long for my project; the ideal time is less than 10 seconds. I know the cURL functions have the power to check all the URLs in parallel, and I need help implementing curl_multi in my script.
I have read about cURL, but I really don't know how to apply those features in my script.

What changes do I have to make to my code so the script checks all the URLs in parallel?

Regards, JC
function my_curl($url) {
    // HEADERS FROM FIREFOX - APPEARS TO BE A BROWSER REFERRED BY GOOGLE
    $curl = curl_init();

    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.

    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com');
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_TIMEOUT, 1);

    if (!$html = curl_exec($curl)) { $html = FALSE; }
    curl_close($curl);
    return $html;
}

$urlc_result = mysql_query("SELECT * FROM urls WHERE palavra = '$palavradb'", $db);
$urlc_rows   = mysql_num_rows($urlc_result);
while ($urlc = mysql_fetch_object($urlc_result)) { //1000
    $ee = my_curl($urlc->url);
    if ($ee == FALSE) { //2000
        echo "false";
    } else { //M2000
        $codigo = mysql_real_escape_string($ee); // escape the HTML so quotes do not break the query
        mysql_query("INSERT INTO sites SET url_bruto = '$urlc->url', contain = '$codigo'", $db);
    }
}


Pedro Chagas (Webmaster) asked:

Ray PaseurCommented:
This page might be useful. You might want to turn your speakers off; the audio is annoying, but the code samples might be helpful.

http://www.askapache.com/php/curl-multi-downloads.html

Best regards, ~Ray
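For reference, here is a minimal sketch of the curl_multi pattern that keeps each response in memory instead of writing files. The helper name fetch_all_parallel and the use of CURLOPT_RETURNTRANSFER with curl_multi_getcontent are illustrative assumptions, not part of the original script.

```php
<?php
// Sketch: fetch several URLs in parallel and return their bodies in an array.
// fetch_all_parallel() is a hypothetical helper name; adapt options as needed.
function fetch_all_parallel(array $urls, $timeout = 10)
{
    $mh      = curl_multi_init();
    $handles = array();

    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body in memory
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
    }

    // Run all transfers until none are still active.
    do {
        curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($active);

    // Collect each body; FALSE marks a failed or empty transfer.
    $results = array();
    foreach ($handles as $i => $ch) {
        $body        = curl_multi_getcontent($ch);
        $results[$i] = ($body === '' || $body === null) ? FALSE : $body;
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $results;
}
```

Each entry of the returned array could then be escaped with mysql_real_escape_string() and inserted the same way the sequential loop does.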
Pedro Chagas (Webmaster, Author) commented:
Hi @Ray, I tried to adapt the code you recommended (see the snippet below), but I had some difficulties because I don't know where to put the code that saves to the database (mysql_query("insert into sites set url_bruto = '$url', entirecode = '???????'", $db);), nor which variable holds the code of each URL (the '???????' in entirecode).

The code I posted in the snippet in my first post of this question was given to me by you in another question, and I am certain it is very useful, so I am asking for your help to make the askapache script work on that basis.

Regards, JC



include ("database.php"); // database connection
$urlc_result = mysql_query("SELECT * FROM urls where palavra = '$palavradb'", $db);
$urlc_rows = mysql_num_rows($urlc_result);
while($urlc = mysql_fetch_object($urlc_result)) { //1000
$urls[] = $urlc->url;
}
print_r($urls);
 
$save_to='aaaaaa/';
 
$mh = curl_multi_init();
foreach ($urls as $i => $url) {
    $g=$save_to.basename($url);
    if(!is_file($g)){
        $conn[$i]=curl_init($url);
        $fp[$i]=fopen ($g, "w");
        curl_setopt ($conn[$i], CURLOPT_FILE, $fp[$i]);
        curl_setopt ($conn[$i], CURLOPT_HEADER ,0);
        curl_setopt($conn[$i],CURLOPT_CONNECTTIMEOUT,60);
        curl_multi_add_handle ($mh,$conn[$i]);
    }
}
do {
    $n=curl_multi_exec($mh,$active);
}
while ($active);
foreach ($urls as $i => $url) {
    curl_multi_remove_handle($mh,$conn[$i]);
    curl_close($conn[$i]);
    fclose ($fp[$i]);
}
curl_multi_close($mh);


Pedro Chagas (Webmaster, Author) commented:
Hi again, I don't want to save the files to any folder ($save_to='aaaaaa/'); I want to put the contents of each site in the database.

Regards, JC

Ray PaseurCommented:
Have you got the CURL part of this working - in other words, can you see the data you are retrieving from the foreign web sites?  If so, please show us where it shows up and how you visualize it.  Thanks, ~Ray
Pedro Chagas (Webmaster, Author) commented:
Hi, after I run the script I get what you can see in the attached image. The contents of the foreign sites go to the "aaaaaa" path.
I don't want to save the code of the foreign URLs to any path; I just want to save the code of each URL in the database.
If you need more information, tell me; this script is very important for my project.

Regards, JC
curl.png
Ray PaseurCommented:
Let me ask this again...

Have you got the CURL part of this working?  Can you see the data you are retrieving from the foreign web sites?  For example, can you print out the foreign HTML with var_dump()?
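One quick way to run that check, assuming the curl_multi loop has been writing files into the save directory (the 'aaaaaa/' path from the earlier snippet), is to dump the start of each saved file rather than the handles themselves:

```php
<?php
// Sketch: confirm the fetched HTML actually arrived by dumping the beginning
// of each file the curl_multi loop wrote. The directory is the one used above.
$save_to = 'aaaaaa/';
foreach (glob($save_to . '*') as $g) {
    echo $g, "\n";
    var_dump(substr(file_get_contents($g), 0, 200)); // first 200 bytes only
}
```

Note that var_dump($conn[$i]) after curl_close() or var_dump($fp[$i]) after fclose() will not show the HTML, since those variables hold resources, not the page contents.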
Pedro Chagas (Webmaster, Author) commented:
Hi @Ray, I don't fully understand your answer, but I tried to do what you said (my English is not very good).
I tried putting var_dump() in different places in the code with the variables $conn[$i] and $fp[$i]; the only thing I see in the browser is the word "NULL", so I can't print the HTML.
As I told you in my last post, the URLs are being saved to the /aaaaaa folder. The only thing I need is to INSERT the HTML of the foreign URLs into my database.

Please tell me if you need more information.

Regards, JC
Ray PaseurCommented:
Hi, my Portuguese is not so good either!

Let's try something like this. First, get the contents of the URLs into local files on your server; then iterate over the list of files to add them to the database.

This assumes that you have a database connection and that you have created a table to hold the URLs and their contents.

Best regards, ~Ray
<?php // RAY_temp_joao.php
error_reporting(E_ALL);
 
// MAKE SOME TEST DATA
$urls[] = 'http://www.google.com';
$urls[] = 'http://news.google.com/nwshp?hl=en&tab=wn';
 
// SHOW THE TEST DATA ARRAY
print_r($urls);
 
// THE DIRECTORY TO SAVE TO
$save_to='joao/';
 
// THE CURL PROCESSING TO COPY DATA FROM URLS TO FILES
$mh = curl_multi_init();
foreach ($urls as $i => $url)
{
    $g = $save_to . basename($url);
    if(!is_file($g))
    {
        $conn[$i] = curl_init($url);
        $fp[$i] = fopen ($g, "w");
        curl_setopt ($conn[$i], CURLOPT_FILE, $fp[$i]);
        curl_setopt ($conn[$i], CURLOPT_HEADER ,0);
        curl_setopt ($conn[$i], CURLOPT_CONNECTTIMEOUT,60);
        curl_multi_add_handle ($mh,$conn[$i]);
    }
}
 
do
{
    $n = curl_multi_exec($mh,$active);
}
while ($active);
 
foreach ($urls as $i => $url)
{
    curl_multi_remove_handle($mh,$conn[$i]);
    curl_close($conn[$i]);
    fclose ($fp[$i]);
}
 
curl_multi_close($mh);
 
 
 
// GET THE FILES AND LOAD THEM INTO A DATA BASE
foreach ($urls as $i => $url)
{
 
// GET THE FILE NAME
    $g = $save_to . basename($url);
    echo "<br/>$g\n";
    if(is_file($g))
    {
// GET THE CONTENTS OF THE FILE
        $txt    = file_get_contents($g);
// ESCAPE THE DATA FOR THE MYSQL QUERY
        $my_url = mysql_real_escape_string($url);
        $my_txt = mysql_real_escape_string($txt);
// CONSTRUCT AND EXECUTE THE QUERY
        $sql = "INSERT INTO my_table (my_url, my_txt) VALUES (\"$my_url\", \"$my_txt\")";
        $res = mysql_query($sql);
        if (!$res) die( mysql_error() );
    }
}


Pedro Chagas (Webmaster, Author) commented:
Hi @Ray, that works now!
But now I have to erase the contents of the foreign URLs that go to the /joao folder.
I don't want to empty the whole folder; I need to identify each URL's file and erase them one by one, because that folder is a shared folder.

Please tell me how to erase the files in the /joao folder one by one.
I appreciate your big help, thank you.

Regards, JC
Ray PaseurCommented:
Sure. Here is the way to do that...

See: http://us3.php.net/manual/en/function.unlink.php


// REMOVE THE FILES AFTER LOADING THEM INTO THE DATA BASE
foreach ($urls as $i => $url)
{
 
// GET THE FILE NAME
    $g = $save_to . basename($url);
    if(is_file($g))
    {
        unlink($g); // REMOVE THE TEMPORARY WORKING FILE
    }
}


Pedro Chagas (Webmaster, Author) commented:
Hi @Ray, I discovered an issue in the script. When I look in the database at what the script inserted from the foreign URLs, some URLs were transferred completely, but others had only part of their code inserted.
I opened each file in the /joao folder and the files are complete, but in the database some of them are not.
I think one of the reasons is special characters: when the system finds a special character, such as "`", it stops inserting into the database.

One example - checked URL (http://www.pizzasmaispizzas.com.br/):
==================what was inserted in the database=========================
<html>
<head>
   <title>Pizzas & Mais Pizzas - A melhor pizza de S
===========================================
If you check http://www.pizzasmaispizzas.com.br/ and view the page source, you can see more than 3 lines, and when I check the file in the /joao folder I can see the entire HTML code of that site.
Note: after "A melhor pizza de S" the next letter is "ã", so I think the problem is the characters.

Does this issue have a solution? Do you want me to open a new question for it?

Regards, JC
 
Pedro Chagas (Webmaster, Author) commented:
Hi again, would urlencode() help?

Regards, JC
Ray PaseurCommented:
Not sure about urlencode() - that is not needed for putting things into the database, just for putting things on the client screen. You might want to check that your fields are all UTF-8, both in the database and in the HTML you generate from the contents of the database.

Another possible way to handle this is to use base64_encode() and base64_decode() on the data. You can read up on those functions on php.net. They should let you put anything - even binary data - into a text field in the database.
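As a sketch of that second approach: encoding before the INSERT and decoding after the SELECT keeps multi-byte characters such as "ã" intact regardless of the column's charset. The table and column names in the comments are the ones from the earlier INSERT, used illustratively.

```php
<?php
// Sketch: store arbitrary HTML (any charset, even binary data) safely in a
// text column by base64-encoding it first. Decoding restores it exactly.
$html = '<title>Pizzas & Mais Pizzas - A melhor pizza de São Paulo</title>';

// Before INSERT: encode. The result is plain ASCII, so no charset or quoting
// problem can truncate it.
$encoded = base64_encode($html);
// e.g. mysql_query("INSERT INTO sites SET url_bruto = '$my_url', contain = '$encoded'", $db);

// After SELECT: decode to get the original bytes back, "ã" included.
$decoded = base64_decode($encoded);
```

The trade-off is that the stored column is no longer searchable as text and grows by roughly a third, so fixing the UTF-8 settings end to end is usually the cleaner long-term solution.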
PHP