• Status: Solved

Why does this cURL function fail on $_GET variables containing '+'?

Using a basic, very helpful cURL script (see below) that I pulled from the comments on the php.net cURL page, I tried parsing two pages:

http://example.com/index.php?a=b

and

http://example.com/index.php?a=b+c

The first works, the second fails as a 'Bad Request'. It appears that any link with + in a get variable fails, whereas any other links I try work. I am curious what might be causing this behavior.

Here's a script that shows the output of the function for both of the cases above. (Substitute any page for the URLs above and you get the same problem with '+'. The actual URLs are irrelevant.)

<?php

/*==================================
Get url content and response headers (given a url, follows all redirections on it and returned content and response headers of final url)

@return    array[0]    content
        array[1]    array of response headers
==================================*/
function get_url( $url,  $javascript_loop = 0, $timeout = 5 )
{
    $url = str_replace( "&amp;", "&", urldecode(trim($url)) );

    $cookie = tempnam ("/tmp", "CURLCOOKIE");
    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_ENCODING, "" );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
    curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );    # required for https urls
    curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
    $content = curl_exec( $ch );
    $response = curl_getinfo( $ch );
    curl_close ( $ch );

    if ($response['http_code'] == 301 || $response['http_code'] == 302)
    {
        ini_set("user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");

        if ( $headers = get_headers($response['url']) )
        {
            foreach( $headers as $value )
            {
                if ( substr( strtolower($value), 0, 9 ) == "location:" )
                    return get_url( trim( substr( $value, 9, strlen($value) ) ) );
            }
        }
    }

    if (    ( preg_match("/>[[:space:]]+window\.location\.replace\('(.*)'\)/i", $content, $value) || preg_match("/>[[:space:]]+window\.location\=\"(.*)\"/i", $content, $value) ) &&
            $javascript_loop < 5
    )
    {
        return get_url( $value[1], $javascript_loop+1 );
    }
    else
    {
        return array( $content, $response );
    }
}

$url="http://example.com/index.php?a=b";

$output = get_url($url,0,20);

print_r($output);

$url="http://example.com/index.php?a=b+c";

$output = get_url($url,0,20);

print_r($output);

?>

Asked by: bitt3n
 
bitt3n (Author) commented:
the specific error message returned is:

Bad Request

Your browser sent a request that this server could not understand.
The request line contained invalid characters following the protocol string.

Guy Hengel [angelIII / a3] (Billing Engineer) commented:
you shall not pass "+" in the url, but "%2B" instead

aka urlencode("+")
http://php.net/manual/en/function.urlencode.php

Ray Paseur commented:
Every key and value in the GET string must be passed through urlencode(); only the equal sign that ties keys to values and the ampersand that separates the pairs are left as-is.  This is not unique to cURL; it is a rule of HTTP.  A plus sign in a URL query string is decoded into a blank. So if you want to send a blank in a string like Hello World, you would send Hello+World.

You can see the effect in action here:
http://www.laprbass.com/RAY_bounce_get.php?a=b+c
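
A minimal sketch of the encoding rules described above, using only PHP's built-in urlencode(), rawurlencode() and urldecode(). The sample values are made up for illustration.

```php
<?php
// In a query string a literal space may be written as '+' (or %20),
// so a literal plus sign has to be escaped as %2B.
$encoded_space = urlencode('Hello World');     // "Hello+World"
$encoded_plus  = urlencode('b+c');             // "b%2Bc"
$raw_space     = rawurlencode('Hello World');  // "Hello%20World"

// Decoding reverses this: '+' comes back as a blank, %2B as a plus sign.
$decoded_plus   = urldecode('b+c');            // "b c"
$decoded_escape = urldecode('b%2Bc');          // "b+c"

echo $encoded_space, "\n", $encoded_plus, "\n", $raw_space, "\n";
echo $decoded_plus, "\n", $decoded_escape, "\n";
```

So a=b+c is a perfectly valid way to send the value "b c"; it only needs %2B when the value itself contains a plus sign.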
 
bitt3n (Author) commented:
ok I think I get it.. the function is decoding the url via

$url = str_replace( "&amp;", "&", urldecode(trim($url)) );

prior to curl_exec(), so if there's a '+' in the url, this gets decoded to ' ', which results in a malformed url. (any url containing '+' in the get variable string will cause the above script to return 'Bad Request')

I am afraid I do not follow the logic of decoding the url before calling curl_exec(). wouldn't cURL require the url to be encoded? the fact that a '+' in the url causes the bad request error seems to be a result of this decoding.
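
To make the effect concrete, here is a small sketch that applies the first line of get_url() to the two test URLs. The helper names normalize_like_get_url() and normalize_keep_encoding() are hypothetical, and the second helper is only one possible alternative (an assumption, not something the php.net comment author wrote).

```php
<?php
// Hypothetical helper: the same transformation get_url() applies to $url.
function normalize_like_get_url($url)
{
    return str_replace("&amp;", "&", urldecode(trim($url)));
}

// Hypothetical alternative (an assumption): undo the HTML entity but
// leave the percent/plus encoding of the URL untouched.
function normalize_keep_encoding($url)
{
    return str_replace("&amp;", "&", trim($url));
}

$plain = "http://example.com/index.php?a=b";
$plus  = "http://example.com/index.php?a=b+c";

var_dump(normalize_like_get_url($plain));  // unchanged
var_dump(normalize_like_get_url($plus));   // the '+' has become a raw space
var_dump(normalize_keep_encoding($plus));  // the '+' survives
```

If the decoded string reaches the server unchanged, the request line would read roughly "GET /index.php?a=b c HTTP/1.1", which would fit the "invalid characters following the protocol string" message quoted above. Whether cURL sends the space as-is depends on the libcurl version, so treat this as a likely explanation rather than a confirmed one.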

Ray Paseur commented:
Just curious... Where did this code come from?  I have found lots of CURL examples (of widely varying quality) but did not see it on this page:
http://us.php.net/manual/en/book.curl.php

But a better question would be the practical one... What are you trying to do? What are the actual URLs and arguments you want to pass?  If you give us an accurate picture of the issues we can probably give you a concrete solution.

This sample uses POST, but the encoding would be the same whether you use GET or POST to send the arguments.  See the "PREPARE THE POST STRING" block at the top of the function (lines 10-16 of the script).
<?php // RAY_curl_post_example.php
error_reporting(E_ALL);


// DEMONSTRATE HOW TO USE CURL POST TO START AN ASYNCHRONOUS PROCESS


function curl_post($url, $post_array, $timeout=2, $error_report=FALSE)
{
    // PREPARE THE POST STRING
    $post_string = '';
    foreach ($post_array as $key => $val)
    {
        $post_string .= urlencode($key) . '=' . urlencode($val) . '&';
    }
    $post_string = rtrim($post_string, '&');

    // PREPARE THE CURL CALL
    $curl = curl_init();
    curl_setopt( $curl, CURLOPT_URL,            $url         );
    curl_setopt( $curl, CURLOPT_HEADER,         FALSE        );
    curl_setopt( $curl, CURLOPT_POST,           TRUE         );
    curl_setopt( $curl, CURLOPT_POSTFIELDS,     $post_string );
    curl_setopt( $curl, CURLOPT_TIMEOUT,        $timeout     );
    curl_setopt( $curl, CURLOPT_RETURNTRANSFER, TRUE         );

    // EXECUTE THE CURL CALL
    $htm = curl_exec($curl);
    $err = curl_errno($curl);
    $inf = curl_getinfo($curl);

    // ON FAILURE
    if (!$htm)
    {
        // PROCESS ERRORS HERE
        if ($error_report)
        {
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            echo "<pre>\n";
            var_dump($inf);
            echo "</pre>\n";
        }
        curl_close($curl);
        return FALSE;
    }

    // ON SUCCESS
    curl_close($curl);
    return $htm;
}


// USAGE EXAMPLE CREATES ASSOCIATIVE ARRAY OF KEY=>VALUE PAIRS
$args["name"]  = 'Ray';
$args["email"] = 'Ray.Paseur@Gmail.com';

// ACTIVATE THIS TO SEE THE ARRAY OF ARGS
// var_dump($args);

// SET THE URL
$url = "http://LAPRBass.com/RAY_bounce_post.php";

// CALL CURL TO POST THE DATA
$htm = curl_post($url, $args, 3, TRUE);

// SHOW WHAT CAME BACK, IF ANYTHING
if ($htm)
{
    echo "<pre>";
    echo htmlentities($htm);
}
else
{
    echo "NO RESPONSE YET FROM $url -- MAYBE BECAUSE IT IS RUNNING ASYNCHRONOUSLY";
}
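
The same key/value encoding can be applied when building a GET URL. A small sketch, assuming the arguments are held in an associative array as in the example above: build_get_url() is a hypothetical helper name, the base URL is a placeholder, and http_build_query() is PHP's built-in that performs the urlencode()-and-join loop in one call.

```php
<?php
// Hypothetical helper: append urlencoded arguments to a base URL.
function build_get_url($base_url, array $args)
{
    // http_build_query() urlencodes each key and value and joins the
    // pairs; '&' is passed explicitly so arg_separator.output is ignored.
    $query = http_build_query($args, '', '&');
    return ($query === '') ? $base_url : $base_url . '?' . $query;
}

$url = build_get_url('http://example.com/index.php', array(
    'a'    => 'b+c',          // literal plus sign
    'name' => 'Hello World',  // literal space
));

echo $url, "\n";
```

Encoding each value once, at the point where the URL is assembled, avoids having to guess later whether a given '+' was meant as a space or as a plus sign.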

bitt3n (Author) commented:
the function in my script is the second comment on this page

http://php.net/manual/en/ref.curl.php

I started using it after researching an issue with cURL timing out on a slow page.

I want to retrieve price and availability data from a few sites that have granted me permission to do so when linking to products they sell. My girlfriend blogs about clothes and other frippery, and she wanted some way to get details such as 'in stock' and 'current price' for products she blogs about, so as to include these data with links to the relevant stores. I figured this would be an excuse for me to learn about cURL and Perl-style regexes (PCRE), since I was only familiar with the POSIX kind, which I discovered have been deprecated since I last fiddled around with PHP some time ago.

The sites are generally friendly enough, granted my purpose, but much to my dismay, a lot of them offer an API for the data I want. Luckily not all of them do, so I've been tinkering around with cURL to see if it's worth using when necessary (but mostly just to see how it works and what I can do with it).

I came upon the above noted problem experimenting with cURL on the following bunch of get vars:

.asp?dept=1&page=2&sortorderrequested=&searchhandle_rhs=A%3dproduct+not+in+olet~B%3dproduct+not+in+olet~D%3d12~G%3d11221^2~K%3d4~L%3d1~M%3d31~N%3d2~&originalquery=product%20not%20in%20olet&sort_order_rhs=&sort_order=4%204&selectpagesize=12%2012&size=&colouranswerselected=&spellercorrections=&hd=&match=&fid=&productid=&productcolour=

(Obviously this is an ASP page, but I figured it wouldn't matter for parsing purposes.) When I tried parsing this page with a variation of the script provided in response to an earlier question, all I got back was an error page (customererror.php?applicationerror). Using the script from php.net posted in the original post above, I got the 'Bad Request' message I posted, which some tinkering revealed was on account of the '+' signs. Replacing these with %2B stopped the 'Bad Request' message, but just produced the same error page that the original cURL script I used (embedded in this comment) produced.

Strangely, *deleting* the '+' signs resulted in successful execution of the cURL via the php.net script (and only that script).

This site has made the data I need available to me directly, so there is no practical purpose for this exercise outside of learning what's going on. However, since that was my purpose to begin with, I'm curious both why deleting '+' would actually have a happy result, and why only in the php.net script (since the other script still fails with this new set of values).

I figured the more basic question to ask first is why '+' should cause the php.net script to return 'Bad Request', since '+' is a urlencoded ' ', which seems like it should be acceptable.

I understand now the error occurs when it gets decoded by

$url = str_replace( "&amp;", "&", urldecode(trim($url)) );

but I don't understand why it's being decoded. Presumably the fellow who wrote the function had some clear purpose for decoding the url, and the function otherwise appears to work admirably with every url I've provided.

Not knowing the answers to these questions isn't causing me any practical problems (so far), since if I replace '+' with '%2B' in other cases, the parsing works fine. At this point I'm just trying to appreciate what the logic is behind how the function works, because it is annoying to use code one doesn't understand.
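
One way to see why swapping '+' for '%2B' is not a neutral change is to decode both variants. The sketch below uses a shortened version of the query string quoted above and PHP's parse_str(), which decodes a query string the way PHP fills $_GET. Using PHP's decoding as a stand-in for whatever the ASP application does is an assumption.

```php
<?php
// Shortened version of the query string quoted above; the remaining
// parameters are left out only to keep the example small.
$original = 'searchhandle_rhs=A%3dproduct+not+in+olet'
          . '&originalquery=product%20not%20in%20olet';

// Same string after replacing every '+' with '%2B'.
$replaced = str_replace('+', '%2B', $original);

// Decode both variants into key => value arrays.
parse_str($original, $seen_original);
parse_str($replaced, $seen_replaced);

print_r($seen_original);  // searchhandle_rhs => "A=product not in olet"
print_r($seen_replaced);  // searchhandle_rhs => "A=product+not+in+olet"
```

With '%2B' the request is well-formed, but the value the application receives now contains literal plus signs instead of spaces. That would be consistent with the 'Bad Request' disappearing while the site's own error page remains, although the thread does not confirm this.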

bitt3n (Author) commented:
edit: the error page was ASP of course, customerror.asp?Application_Error

Ray Paseur commented:
Bad Request headers could come from any of a variety of conditions, including a malformed or unusable GET string.  Maybe the foreign server is not configured to accept GET requests that do not come from browsers?  Who knows?
http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

It sounds like the design you want is called a "remote facade" in which each web site that you want to communicate with would have its own method to get the data.  The facade code would normalize the information from the other site, and present a consistent set of properties back to the main script.

bitt3n (Author) commented:
Yes, I reckon that I'll probably never know why that particular .asp site accepts the GET vars with '+' through the browser, but only with '+' removed via cURL.

What about the function in the original post that decodes the url, which makes any GET vars with '+' in them malformed? Is there some obvious reason why this decoding is necessary that I am not seeing? I suspect the answer is obvious, but I don't see the reason.

What I've been doing for the sites that require cURL is creating a function for each site that parses the data (price, brand, title, etc.), and then returns those data for another function to handle. Is that what you mean by remote facade?

Ray Paseur commented:
"creating a function for each site" -- exactly.  The facade is one of the classic design patterns from the Gang of Four (the "remote facade" variant is described in Martin Fowler's Patterns of Enterprise Application Architecture).  Design patterns are considered some of the seminal thinking in modern computer science.
http://www.amazon.com/Design-Patterns-Elements-Reusable-Object-Oriented/dp/0201633612
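
A rough sketch of the "one function per site, one common result" idea discussed above. Every name here (ProductSource, ExampleShopSource, the HTML snippet and its class attributes) is hypothetical and exists only for illustration; a real site would need its own patterns.

```php
<?php
// Common contract: each site-specific class turns raw HTML into the
// same normalized array, so the calling script never sees site details.
interface ProductSource
{
    // Returns array('title' => string, 'price' => float, 'in_stock' => bool).
    public function parse($html);
}

// Hypothetical parser for one imaginary shop's markup.
class ExampleShopSource implements ProductSource
{
    public function parse($html)
    {
        preg_match('~<h1 class="title">(.*?)</h1>~s', $html, $title);
        preg_match('~<span class="price">\$([0-9.]+)</span>~', $html, $price);

        return array(
            'title'    => isset($title[1]) ? trim($title[1]) : '',
            'price'    => isset($price[1]) ? (float) $price[1] : 0.0,
            'in_stock' => stripos($html, 'in stock') !== false,
        );
    }
}

// Made-up HTML standing in for a page fetched with cURL.
$html = '<h1 class="title"> Wool Scarf </h1>'
      . '<span class="price">$49.50</span><p>In stock</p>';

$source  = new ExampleShopSource();
$product = $source->parse($html);
print_r($product);
```

Adding another store then means adding another class that implements the same interface, while the code that formats the blog links stays unchanged.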

I doubt if I can shed any light on why someone else's code does not work or contains unnecessary parts.  PHP.net does not test or "vet" the comments and while most are useful for learning something, some of them are dreck.  For example, this comment is wrong, or unnecessary and irrelevant at best:
http://www.php.net/manual/en/ref.curl.php#81651

I hope you have managed to get some code that works for you now, and that through this exercise you have learned that giving EE a URL like http://example.com/index.php?a=b is nowhere near as useful as giving us the actual URL!

Best of luck with your project, and remember to use urlencode() before you send text data over HTTP, ~Ray

bitt3n (Author) commented:
thanks for your help
Question has a verified solution.
