Solved

why does this cURL function fail on $_GET variables with '+'?

Posted on 2011-02-11
1,152 Views
Last Modified: 2012-06-27
Using a basic, very helpful cURL script (see below) that I pulled from the comments on the php.net cURL page, I tried fetching two pages

http://example.com/index.php?a=b

and

http://example.com/index.php?a=b+c

The first works, the second fails as a 'Bad Request'. It appears that any link with + in a get variable fails, whereas any other links I try work. I am curious what might be causing this behavior.

Here's a script that shows the output of the function for both of the cases above. (Substitute any page for the above URLs and you get the same problem with +. The actual URLs are irrelevant.)

<?php

/*==================================
Get URL content and response headers (given a URL, follows all redirections on it and returns the content and response headers of the final URL)

@return    array[0]    content
        array[1]    array of response headers
==================================*/
function get_url( $url,  $javascript_loop = 0, $timeout = 5 )
{
    $url = str_replace( "&amp;", "&", urldecode(trim($url)) );

    $cookie = tempnam ("/tmp", "CURLCOOKIE");
    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_ENCODING, "" );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
    curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );    # skips certificate verification for https urls (insecure)
    curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
    $content = curl_exec( $ch );
    $response = curl_getinfo( $ch );
    curl_close ( $ch );

    if ($response['http_code'] == 301 || $response['http_code'] == 302)
    {
        ini_set("user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");

        if ( $headers = get_headers($response['url']) )
        {
            foreach( $headers as $value )
            {
                if ( substr( strtolower($value), 0, 9 ) == "location:" )
                    return get_url( trim( substr( $value, 9, strlen($value) ) ) );
            }
        }
    }

    if (    ( preg_match("/>[[:space:]]+window\.location\.replace\('(.*)'\)/i", $content, $value) || preg_match("/>[[:space:]]+window\.location\=\"(.*)\"/i", $content, $value) ) &&
            $javascript_loop < 5
    )
    {
        return get_url( $value[1], $javascript_loop+1 );
    }
    else
    {
        return array( $content, $response );
    }
}

$url="http://example.com/index.php?a=b";

$output = get_url($url,0,20);

print_r($output);

$url="http://example.com/index.php?a=b+c";

$output = get_url($url,0,20);

print_r($output);

?>

Question by:bitt3n
 

Author Comment

by:bitt3n
ID: 34874756
the specific error message returned is:

Bad Request

Your browser sent a request that this server could not understand.
The request line contained invalid characters following the protocol string.
 
LVL 143

Accepted Solution

by:
Guy Hengel [angelIII / a3] earned 1000 total points
ID: 34874770
you shall not pass "+" in the url, but "%2B" instead

aka urlencode("+")
http://php.net/manual/en/function.urlencode.php
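
A minimal sketch of that advice, assuming the value you actually want to send is the literal string b+c. The variable names are only for illustration.

```php
<?php
// Encode the value, not the whole URL, so the '?' and '=' stay intact.
$value = 'b+c';
$url   = 'http://example.com/index.php?a=' . urlencode($value);

echo $url, "\n"; // http://example.com/index.php?a=b%2Bc

// urlencode() writes a space as '+', rawurlencode() writes it as '%20'.
echo urlencode('b c'), "\n";    // b+c
echo rawurlencode('b c'), "\n"; // b%20c
```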
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 34878381
Every key and value in the GET string must be URL-encoded with urlencode(); only the equal sign that ties keys to values and the ampersand that separates the pairs are left alone.  This is not unique to CURL; it is a rule of how URLs are written.  In a query string, the plus sign is decoded into a blank. So if you want to send a blank in a string like Hello World, you would send Hello+World.

You can see the effect in action here:
http://www.laprbass.com/RAY_bounce_get.php?a=b+c
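
To see the decoding without relying on a remote page, here is a small sketch using PHP's parse_str(), which parses a string as if it were the query string of a URL. The sample strings are made up for illustration.

```php
<?php
// parse_str() decodes the string the way PHP decodes an incoming query string.
parse_str('a=b+c', $plus);        // '+' arrives as a blank
parse_str('a=b%2Bc', $encoded);   // '%2B' arrives as a literal plus

var_dump($plus['a']);    // string(3) "b c"
var_dump($encoded['a']); // string(3) "b+c"

// Going the other way, urlencode() turns the blank back into '+'.
var_dump(urlencode('Hello World')); // string(11) "Hello+World"
```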
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 34878385
 

Author Comment

by:bitt3n
ID: 34880350
ok I think I get it.. the function is decoding the url via

$url = str_replace( "&amp;", "&", urldecode(trim($url)) );

prior to curl_exec(), so if there's a '+' in the url, this gets decoded to ' ', which results in a malformed url. (any url containing '+' in the get variable string will cause the above script to return 'Bad Request')

I am afraid I do not follow the logic of decoding the url before calling curl_exec(). wouldn't cURL require the url to be encoded? the fact that a '+' in the url causes the bad request error seems to be a result of this decoding.
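
The effect of that line can be reproduced in isolation. The sketch below copies the first statement of get_url() into a helper and compares it with a variant that skips urldecode(). Both helper names are hypothetical and not part of the php.net snippet.

```php
<?php
// The first statement of get_url(), copied into a helper.
function prepare_url_original($url)
{
    return str_replace("&amp;", "&", urldecode(trim($url)));
}

// Hypothetical variant: undo the HTML entity only, leave URL encoding alone.
function prepare_url_keep_encoding($url)
{
    return str_replace("&amp;", "&", trim($url));
}

$url = "http://example.com/index.php?a=b+c&amp;x=1";

echo prepare_url_original($url), "\n";
// http://example.com/index.php?a=b c&x=1   (raw space in the URL)

echo prepare_url_keep_encoding($url), "\n";
// http://example.com/index.php?a=b+c&x=1
```

A raw space in the request line would be consistent with the "invalid characters following the protocol string" message, although the exact behavior likely depends on the cURL and server versions involved.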
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 34880423
Just curious... Where did this code come from?  I have found lots of CURL examples (of widely varying quality) but did not see it on this page:
http://us.php.net/manual/en/book.curl.php

But a better question would be the practical one... What are you trying to do? What are the actual URLs and arguments you want to pass?  If you give us an accurate picture of the issues we can probably give you a concrete solution.

This sample uses POST, but the encoding would be the same whether you use GET or POST to send the arguments.  See lines 10-16, the block under PREPARE THE POST STRING.
<?php // RAY_curl_post_example.php
error_reporting(E_ALL);


// DEMONSTRATE HOW TO USE CURL POST TO START AN ASYNCHRONOUS PROCESS


function curl_post($url, $post_array, $timeout=2, $error_report=FALSE)
{
    // PREPARE THE POST STRING
    $post_string = '';
    foreach ($post_array as $key => $val)
    {
        $post_string .= urlencode($key) . '=' . urlencode($val) . '&';
    }
    $post_string = rtrim($post_string, '&');

    // PREPARE THE CURL CALL
    $curl = curl_init();
    curl_setopt( $curl, CURLOPT_URL,            $url         );
    curl_setopt( $curl, CURLOPT_HEADER,         FALSE        );
    curl_setopt( $curl, CURLOPT_POST,           TRUE         );
    curl_setopt( $curl, CURLOPT_POSTFIELDS,     $post_string );
    curl_setopt( $curl, CURLOPT_TIMEOUT,        $timeout     );
    curl_setopt( $curl, CURLOPT_RETURNTRANSFER, TRUE         );

    // EXECUTE THE CURL CALL
    $htm = curl_exec($curl);
    $err = curl_errno($curl);
    $inf = curl_getinfo($curl);

    // ON FAILURE
    if (!$htm)
    {
        // PROCESS ERRORS HERE
        if ($error_report)
        {
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            echo "<pre>\n";
            var_dump($inf);
            echo "</pre>\n";
        }
        curl_close($curl);
        return FALSE;
    }

    // ON SUCCESS
    curl_close($curl);
    return $htm;
}


// USAGE EXAMPLE CREATES ASSOCIATIVE ARRAY OF KEY=>VALUE PAIRS
$args["name"]  = 'Ray';
$args["email"] = 'Ray.Paseur@Gmail.com';

// ACTIVATE THIS TO SEE THE ARRAY OF ARGS
// var_dump($args);

// SET THE URL
$url = "http://LAPRBass.com/RAY_bounce_post.php";

// CALL CURL TO POST THE DATA
$htm = curl_post($url, $args, 3, TRUE);

// SHOW WHAT CAME BACK, IF ANYTHING
if ($htm)
{
    echo "<pre>";
    echo htmlentities($htm);
}
else
{
    echo "NO RESPONSE YET FROM $url -- MAYBE BECAUSE IT IS RUNNING ASYNCHRONOUSLY";
}
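
The loop that prepares the POST string can also be written with PHP's built-in http_build_query(), which URL-encodes every key and value. A short sketch, reusing the sample arguments from the example above:

```php
<?php
$args = array(
    'name'  => 'Ray',
    'email' => 'Ray.Paseur@Gmail.com',
);

// Third argument pins the separator to '&' regardless of php.ini settings.
$post_string = http_build_query($args, '', '&');

echo $post_string, "\n"; // name=Ray&email=Ray.Paseur%40Gmail.com
```

The result can be passed to CURLOPT_POSTFIELDS for a POST, or appended after a '?' for a GET.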

 

Author Comment

by:bitt3n
ID: 34883296
the function in my script is the second comment on this page

http://php.net/manual/en/ref.curl.php

I started using it after researching an issue with cURL timing out on a slow page.

I want to retrieve price and availability data from a few sites that have granted me permission to do so when linking to products they sell. My girlfriend blogs about clothes and other frippery, and she wanted some way to get details such as 'in stock' and 'current price' for products she blogs about, so as to include these data with links to the relevant stores. I figured this would be an excuse for me to learn about cURL and using Perl-style regexes, since I was only familiar with the POSIX kind, which I discovered have been deprecated since I last fiddled around with PHP some time ago.

The sites are generally friendly enough, granted my purpose, but much to my dismay, a lot of them offer an API for the data I want. Luckily not all of them do, so I've been tinkering around with cURL to see if it's worth using when necessary (but mostly just to see how it works and what I can do with it).

I came upon the above noted problem experimenting with cURL on the following bunch of get vars:

.asp?dept=1&page=2&sortorderrequested=&searchhandle_rhs=A%3dproduct+not+in+olet~B%3dproduct+not+in+olet~D%3d12~G%3d11221^2~K%3d4~L%3d1~M%3d31~N%3d2~&originalquery=product%20not%20in%20olet&sort_order_rhs=&sort_order=4%204&selectpagesize=12%2012&size=&colouranswerselected=&spellercorrections=&hd=&match=&fid=&productid=&productcolour=

(Obviously this is an ASP page, but I figured it wouldn't matter for parsing purposes.) When I tried parsing this page with a variation of the script provided in response to an earlier question

 
all I got back was an error page (customererror.php?applicationerror). Using the script from php.net posted in the original post above, I got the 'Bad Request' message I posted, which some tinkering revealed was on account of the '+' signs. Replacing these with %2B stopped the 'Bad Request' message, but just produced the same error page the original curl script I used (embedded in this comment) produced.

Strangely, *deleting* the '+' signs resulted in successful execution of the cURL via the php.net script (and only that script).

This site has made the data I need available to me directly, so there is no practical purpose for this exercise outside of learning what's going on. However, since that was my purpose to begin with, I'm curious both why deleting '+' would actually have a happy result, and why only in the php.net script (since the other script still fails with this new set of values).

I figured the more basic question to ask first is why '+' should cause the php.net script to return 'Bad Request', since '+' is a urlencoded ' ', which seems like it should be acceptable.

I understand now the error occurs when it gets decoded by

$url = str_replace( "&amp;", "&", urldecode(trim($url)) );

but I don't understand why it's being decoded. Presumably the fellow who wrote the function had some clear purpose for decoding the url, and the function otherwise appears to work admirably with every url I've provided.

Not knowing the answers to these questions isn't causing me any practical problems (so far), since if I replace '+' with '%2B' in other cases, the parsing works fine. At this point I'm just trying to appreciate what the logic is behind how the function works, because it is annoying to use code one doesn't understand.
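
One way to see why swapping '+' for '%2B' changes the outcome is to trace a few inputs through that first line and then through a form-style decoder. This is only a sketch: it assumes the receiving server decodes the query string the way PHP's parse_str() does, which may not hold for the ASP site in question, and trace() is a hypothetical helper.

```php
<?php
// Hypothetical helper: returns what get_url() would hand to cURL, and what a
// PHP-style server would then read for the parameter 'a'.
function trace($query)
{
    $sent = str_replace("&amp;", "&", urldecode(trim($query)));
    parse_str($sent, $received);
    return array($sent, $received['a']);
}

print_r(trace('a=b+c'));   // sent "a=b c" (raw space, malformed on the wire)
print_r(trace('a=b%20c')); // sent "a=b c" (same problem)
print_r(trace('a=b%2Bc')); // sent "a=b+c" (valid), read by the server as "b c"
```

Under these assumptions, %2B avoids the Bad Request because a '+' goes out on the wire, but a server that decodes form-style would still read it as a blank rather than a literal plus.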
 

Author Comment

by:bitt3n
ID: 34883313
edit: the error page was ASP of course, customerror.asp?Application_Error
 
LVL 111

Assisted Solution

by:Ray Paseur
Ray Paseur earned 1000 total points
ID: 34884177
Bad Request headers could come from any of a variety of conditions, including a malformed or unusable GET string.  Maybe the foreign server is not configured to accept GET requests that do not come from browsers?  Who knows?
http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

It sounds like the design you want is called a "remote facade" in which each web site that you want to communicate with would have its own method to get the data.  The facade code would normalize the information from the other site, and present a consistent set of properties back to the main script.
 

Author Comment

by:bitt3n
ID: 34885070
Yes, I reckon that I'll probably never know why that particular .asp site accepts the GET vars with '+' through the browser, but only with '+' removed via cURL.

What about the function in the original post that decodes the url, which makes any GET vars with '+' in them malformed. Is there some obvious reason why this decoding is necessary that I am not seeing? I suspect the answer is obvious, but I don't see the reason.

What I've been doing for the sites that require cURL is creating a function for each site that parses the data (price, brand, title, etc.), and then returns those data for another function to handle. Is that what you mean by remote facade?
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 34887160
"creating a function for each site" -- exactly.  The facade is one of the classic design patterns from the Gang of Four; the "remote facade" variant is catalogued in Martin Fowler's Patterns of Enterprise Application Architecture.  Design patterns are considered some of the seminal thinking in modern computer science.
http://www.amazon.com/Design-Patterns-Elements-Reusable-Object-Oriented/dp/0201633612
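
A rough sketch of the one-function-per-site idea: each site gets its own parser, and a single entry point returns the same keys regardless of the source. The site names, HTML snippets, and regular expressions below are invented for illustration; real pages would need their own patterns.

```php
<?php
// Hypothetical per-site parsers. Each takes raw HTML and returns the same keys.
function parse_shop_a($html)
{
    preg_match('/<span class="price">([^<]+)<\/span>/', $html, $m);
    return array(
        'price'    => isset($m[1]) ? trim($m[1]) : null,
        'in_stock' => strpos($html, 'In stock') !== false,
    );
}

function parse_shop_b($html)
{
    preg_match('/data-price="([^"]+)"/', $html, $m);
    return array(
        'price'    => isset($m[1]) ? $m[1] : null,
        'in_stock' => strpos($html, 'sold-out') === false,
    );
}

// The facade: callers name a site and get normalized data back.
function product_details($site, $html)
{
    $parsers = array('shop_a' => 'parse_shop_a', 'shop_b' => 'parse_shop_b');
    if (!isset($parsers[$site])) {
        return null; // unknown site
    }
    return call_user_func($parsers[$site], $html);
}

print_r(product_details('shop_a', '<span class="price"> 19.99 </span> In stock'));
print_r(product_details('shop_b', '<div data-price="42.00" class="sold-out">'));
```

The main script only ever calls product_details(), so adding a new store means adding one parser and one entry in the lookup array.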

I doubt if I can shed any light on why someone else's code does not work or contains unnecessary parts.  PHP.net does not test or "vet" the comments and while most are useful for learning something, some of them are dreck.  For example, this comment is wrong, or unnecessary and irrelevant at best:
http://www.php.net/manual/en/ref.curl.php#81651

I hope you have managed to get some code that works for you now, and that through this exercise you have learned that giving EE a URL like http://example.com/index.php?a=b is nowhere near as useful as giving us the actual URL!

Best of luck with your project, and remember to use urlencode() before you send text data over HTTP, ~Ray
 

Author Closing Comment

by:bitt3n
ID: 34899762
thanks for your help