Solved

why does this cURL function fail on $_GET variables with '+'?

Posted on 2011-02-11
Last Modified: 2012-06-27
Using a basic, very helpful cURL script (see below) that I pulled from the comments on the php.net cURL page, I tried fetching two pages:

http://example.com/index.php?a=b

and

http://example.com/index.php?a=b+c

The first works; the second fails with a 'Bad Request'. It appears that any link with a + in a GET variable fails, whereas every other link I try works. I am curious what might be causing this behavior.

Here's a script that shows the output of the function for both of the cases above. (Substitute any page for the above URLs and you get the same problem with +. The actual URLs are irrelevant.)

<?php

/*==================================
Get url content and response headers (given a url, follows all redirections on it and returned content and response headers of final url)

@return    array[0]    content
        array[1]    array of response headers
==================================*/
function get_url( $url,  $javascript_loop = 0, $timeout = 5 )
{
    $url = str_replace( "&amp;", "&", urldecode(trim($url)) );

    $cookie = tempnam ("/tmp", "CURLCOOKIE");
    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_ENCODING, "" );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
    curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );    # required for https urls
    curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
    $content = curl_exec( $ch );
    $response = curl_getinfo( $ch );
    curl_close ( $ch );

    if ($response['http_code'] == 301 || $response['http_code'] == 302)
    {
        ini_set("user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");

        if ( $headers = get_headers($response['url']) )
        {
            foreach( $headers as $value )
            {
                if ( substr( strtolower($value), 0, 9 ) == "location:" )
                    return get_url( trim( substr( $value, 9, strlen($value) ) ) );
            }
        }
    }

    if (    ( preg_match("/>[[:space:]]+window\.location\.replace\('(.*)'\)/i", $content, $value) || preg_match("/>[[:space:]]+window\.location\=\"(.*)\"/i", $content, $value) ) &&
            $javascript_loop < 5
    )
    {
        return get_url( $value[1], $javascript_loop+1 );
    }
    else
    {
        return array( $content, $response );
    }
}

$url="http://example.com/index.php?a=b";

$output = get_url($url,0,20);

print_r($output);

$url="http://example.com/index.php?a=b+c";

$output = get_url($url,0,20);

print_r($output);

?>

Question by:bitt3n
12 Comments
 

Author Comment

by:bitt3n
ID: 34874756
the specific error message returned is:

Bad Request

Your browser sent a request that this server could not understand.
The request line contained invalid characters following the protocol string.
 
LVL 142

Accepted Solution

by:
Guy Hengel [angelIII / a3] earned 250 total points
ID: 34874770
You should not pass a literal "+" in the URL; pass "%2B" instead,

i.e. urlencode("+")
http://php.net/manual/en/function.urlencode.php
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 34878381
Every key and value in the GET string must be passed through urlencode(); only the separators (the equal sign that ties keys to values, and the ampersand between pairs) are left as-is.  This is not unique to cURL; it follows from how URLs are defined.  In a query string, the plus sign is decoded into a blank. So if you want to send a blank in a string like Hello World, you would send Hello+World.

You can see the effect in action here:
http://www.laprbass.com/RAY_bounce_get.php?a=b+c
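
A minimal sketch of the plus/space rule described above, using only PHP's built-in encoding functions. The sample strings are illustrative and are not taken from any real URL in this thread.

```php
<?php
// In a query string, urlencode() writes a space as "+" and a literal plus as "%2B".
$encoded_space = urlencode('Hello World');   // "Hello+World"
$encoded_plus  = urlencode('b+c');           // "b%2Bc"

// urldecode() reverses this: "+" becomes a space, "%2B" becomes "+".
$decoded_space = urldecode('b+c');           // "b c"
$decoded_plus  = urldecode('b%2Bc');         // "b+c"

// rawurlencode() follows RFC 3986 and writes a space as "%20" instead of "+".
$raw_space = rawurlencode('Hello World');    // "Hello%20World"

// parse_str() applies the same decoding PHP uses when it fills $_GET.
parse_str('a=b+c', $vars);                   // $vars['a'] is "b c"

var_dump($encoded_space, $encoded_plus, $decoded_space, $decoded_plus, $raw_space, $vars);
```

So a receiving PHP script sees "b c" for ?a=b+c, and sees "b+c" only when the sender wrote ?a=b%2Bc.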
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 34878385
 

Author Comment

by:bitt3n
ID: 34880350
OK, I think I get it. The function is decoding the URL via

$url = str_replace( "&amp;", "&", urldecode(trim($url)) );

prior to curl_exec(), so if there's a '+' in the URL, it gets decoded to ' ', which results in a malformed URL. (Any URL containing '+' in the GET variable string will cause the above script to return 'Bad Request'.)

I am afraid I do not follow the logic of decoding the URL before calling curl_exec(). Wouldn't cURL require the URL to be encoded? The fact that a '+' in the URL causes the Bad Request error seems to be a result of this decoding.
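
To make the failure concrete, here is a small sketch of what the first line of get_url() does to the test URL, along with one possible alternative. The helper name clean_url() is hypothetical and is not part of the php.net script; it assumes the caller passes a URL whose query string is already encoded.

```php
<?php
$url = 'http://example.com/index.php?a=b+c';

// What get_url() does first: urldecode() turns "+" into a literal space.
$decoded = str_replace('&amp;', '&', urldecode(trim($url)));
// $decoded is now "http://example.com/index.php?a=b c". If that space is
// sent unescaped in the request line, the server finds extra text around
// the protocol string, which is consistent with the "Bad Request" message.

// Hypothetical alternative: undo HTML escaping of "&" but leave the
// URL encoding alone, so "+" and "%xx" sequences reach cURL unchanged.
function clean_url($url)
{
    return str_replace('&amp;', '&', trim($url));
}

$kept = clean_url(' http://example.com/index.php?a=b+c&amp;d=1 ');
// $kept is "http://example.com/index.php?a=b+c&d=1"

var_dump($decoded, $kept);
```

Whether skipping urldecode() is safe for every input depends on where the URLs come from; links scraped from HTML are usually already encoded, which is the case this sketch assumes.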
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 34880423
Just curious... Where did this code come from?  I have found lots of CURL examples (of widely varying quality) but did not see it on this page:
http://us.php.net/manual/en/book.curl.php

But a better question would be the practical one... What are you trying to do? What are the actual URLs and arguments you want to pass?  If you give us an accurate picture of the issues we can probably give you a concrete solution.

This sample uses POST, but the encoding would be the same whether you use GET or POST to send the arguments.  See lines 10-16.
<?php // RAY_curl_post_example.php
error_reporting(E_ALL);


// DEMONSTRATE HOW TO USE CURL POST TO START AN ASYNCHRONOUS PROCESS


function curl_post($url, $post_array, $timeout=2, $error_report=FALSE)
{
    // PREPARE THE POST STRING
    $post_string = '';
    foreach ($post_array as $key => $val)
    {
        $post_string .= urlencode($key) . '=' . urlencode($val) . '&';
    }
    $post_string = rtrim($post_string, '&');

    // PREPARE THE CURL CALL
    $curl = curl_init();
    curl_setopt( $curl, CURLOPT_URL,            $url         );
    curl_setopt( $curl, CURLOPT_HEADER,         FALSE        );
    curl_setopt( $curl, CURLOPT_POST,           TRUE         );
    curl_setopt( $curl, CURLOPT_POSTFIELDS,     $post_string );
    curl_setopt( $curl, CURLOPT_TIMEOUT,        $timeout     );
    curl_setopt( $curl, CURLOPT_RETURNTRANSFER, TRUE         );

    // EXECUTE THE CURL CALL
    $htm = curl_exec($curl);
    $err = curl_errno($curl);
    $inf = curl_getinfo($curl);

    // ON FAILURE
    if (!$htm)
    {
        // PROCESS ERRORS HERE
        if ($error_report)
        {
            echo "CURL FAIL: $url TIMEOUT=$timeout, CURL_ERRNO=$err";
            echo "<pre>\n";
            var_dump($inf);
            echo "</pre>\n";
        }
        curl_close($curl);
        return FALSE;
    }

    // ON SUCCESS
    curl_close($curl);
    return $htm;
}


// USAGE EXAMPLE CREATES ASSOCIATIVE ARRAY OF KEY=>VALUE PAIRS
$args["name"]  = 'Ray';
$args["email"] = 'Ray.Paseur@Gmail.com';

// ACTIVATE THIS TO SEE THE ARRAY OF ARGS
// var_dump($args);

// SET THE URL
$url = "http://LAPRBass.com/RAY_bounce_post.php";

// CALL CURL TO POST THE DATA
$htm = curl_post($url, $args, 3, TRUE);

// SHOW WHAT CAME BACK, IF ANYTHING
if ($htm)
{
    echo "<pre>";
    echo htmlentities($htm);
}
else
{
    echo "NO RESPONSE YET FROM $url -- MAYBE BECAUSE IT IS RUNNING ASYNCHRONOUSLY";
}


 

Author Comment

by:bitt3n
ID: 34883296
the function in my script is the second comment on this page

http://php.net/manual/en/ref.curl.php

I started using it after researching an issue with cURL timing out on a slow page.

I want to retrieve price and availability data from a few sites that have granted me permission to do so when linking to products they sell. My girlfriend blogs about clothes and other frippery, and she wanted some way to get details such as 'in stock' and 'current price' for products she blogs about, so as to include these data with links to the relevant stores. I figured this would be an excuse for me to learn about cURL and Perl-style regexes, since I was only familiar with the POSIX kind, which I discovered have been deprecated since I last fiddled around with PHP some time ago.

The sites are generally friendly enough, granted my purpose, but much to my dismay, a lot of them offer an API for the data I want. Luckily not all of them do, so I've been tinkering around with cURL to see if it's worth using when necessary (but mostly just to see how it works and what I can do with it).

I came upon the above noted problem experimenting with cURL on the following bunch of get vars:

.asp?dept=1&page=2&sortorderrequested=&searchhandle_rhs=A%3dproduct+not+in+olet~B%3dproduct+not+in+olet~D%3d12~G%3d11221^2~K%3d4~L%3d1~M%3d31~N%3d2~&originalquery=product%20not%20in%20olet&sort_order_rhs=&sort_order=4%204&selectpagesize=12%2012&size=&colouranswerselected=&spellercorrections=&hd=&match=&fid=&productid=&productcolour=

(Obviously this is an ASP page, but I figured it wouldn't matter for parsing purposes.) When I tried parsing this page with a variation of the script provided in response to an earlier question

 
all I got back was an error page (customererror.php?applicationerror). Using the script from php.net posted in the original post above, I got the 'Bad Request' message I posted, which some tinkering revealed was on account of the '+' signs. Replacing these with %2B stopped the 'Bad Request' message, but just produced the same error page the original curl script I used (embedded in this comment) produced.

Strangely, *deleting* the '+' signs resulted in successful execution of the cURL via the php.net script (and only that script).

This site has made the data I need available to me directly, so there is no practical purpose for this exercise outside of learning what's going on. However, since that was my purpose to begin with, I'm curious both why deleting '+' would actually have a happy result, and why only in the php.net script (since the other script still fails with this new set of values).

I figured the more basic question to ask first is why '+' should cause the php.net script to return 'Bad Request', since '+' is a urlencoded ' ', which seems like it should be acceptable.

I understand now the error occurs when it gets decoded by

$url = str_replace( "&amp;", "&", urldecode(trim($url)) );

but I don't understand why it's being decoded. Presumably the fellow who wrote the function had some clear purpose for decoding the url, and the function otherwise appears to work admirably with every url I've provided.

Not knowing the answers to these questions isn't causing me any practical problems (so far), since if I replace '+' with '%2B' in other cases, the parsing works fine. At this point I'm just trying to appreciate what the logic is behind how the function works, because it is annoying to use code one doesn't understand.
 

Author Comment

by:bitt3n
ID: 34883313
edit: the error page was ASP of course, customerror.asp?Application_Error
 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 250 total points
ID: 34884177
Bad Request headers could come from any of a variety of conditions, including a malformed or unusable GET string.  Maybe the foreign server is not configured to accept GET requests that do not come from browsers?  Who knows?
http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

It sounds like the design you want is called a "remote facade" in which each web site that you want to communicate with would have its own method to get the data.  The facade code would normalize the information from the other site, and present a consistent set of properties back to the main script.
 

Author Comment

by:bitt3n
ID: 34885070
Yes, I reckon that I'll probably never know why that particular .asp site accepts the GET vars with '+' through the browser, but only with '+' removed via cURL.

What about the function in the original post that decodes the url, which makes any GET vars with '+' in them malformed. Is there some obvious reason why this decoding is necessary that I am not seeing? I suspect the answer is obvious, but I don't see the reason.

What I've been doing for the sites that require cURL is creating a function for each site that parses the data (price, brand, title, etc.), and then returns those data for another function to handle. Is that what you mean by remote facade?
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 34887160
"creating a function for each site" -- exactly.  The facade is one of the classic design patterns from the Gang of Four (the "remote facade" variant is described in Martin Fowler's Patterns of Enterprise Application Architecture).  Design patterns are considered some of the seminal thinking in modern computer science.
http://www.amazon.com/Design-Patterns-Elements-Reusable-Object-Oriented/dp/0201633612
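
A rough sketch of the per-site approach discussed here: one function per store, each returning the same set of keys, with a single entry point in front of them. The store names, the HTML snippets, and the regular expressions are all invented for illustration; real pages would need their own patterns.

```php
<?php
// Each site-specific function returns the same keys: price, in_stock.
function parse_store_a($html)
{
    preg_match('/<span class="price">\$([0-9.]+)<\/span>/', $html, $m);
    return array(
        'price'    => isset($m[1]) ? (float) $m[1] : null,
        'in_stock' => strpos($html, 'In stock') !== false,
    );
}

function parse_store_b($html)
{
    preg_match('/data-price="([0-9.]+)"/', $html, $m);
    return array(
        'price'    => isset($m[1]) ? (float) $m[1] : null,
        'in_stock' => strpos($html, 'class="sold-out"') === false,
    );
}

// The facade: callers name a store and get a normalized result back.
function get_product_info($store, $html)
{
    $parsers = array('store_a' => 'parse_store_a', 'store_b' => 'parse_store_b');
    if (!isset($parsers[$store])) {
        return null; // unknown store
    }
    return call_user_func($parsers[$store], $html);
}

$a = get_product_info('store_a', '<span class="price">$19.99</span> In stock');
$b = get_product_info('store_b', '<div data-price="5.50" class="sold-out">');
var_dump($a, $b);
```

The calling script only ever sees the price and in_stock keys, so adding a store means adding one parser function and one entry in the lookup array.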

I doubt if I can shed any light on why someone else's code does not work or contains unnecessary parts.  PHP.net does not test or "vet" the comments and while most are useful for learning something, some of them are dreck.  For example, this comment is wrong, or unnecessary and irrelevant at best:
http://www.php.net/manual/en/ref.curl.php#81651

I hope you have managed to get some code that works for you now, and that through this exercise you have learned that giving EE a URL like http://example.com/index.php?a=b is nowhere near as useful as giving us the actual URL!

Best of luck with your project, and remember to use urlencode() before you send text data over the HTTP, ~Ray
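
One way to follow that advice without encoding each piece by hand is PHP's http_build_query(), which applies urlencode()-style escaping to every key and value. The base URL and the parameters below are placeholders, not the real site from this thread.

```php
<?php
$base   = 'http://example.com/index.php';
$params = array(
    'a'     => 'b+c',              // value containing a literal plus sign
    'query' => 'product not in',   // value containing spaces
);

// http_build_query() encodes keys and values and joins the pairs.
// Passing '&' explicitly avoids depending on the arg_separator.output setting.
$url = $base . '?' . http_build_query($params, '', '&');
// $url is "http://example.com/index.php?a=b%2Bc&query=product+not+in"

// $url can now be handed to cURL as-is, for example:
// curl_setopt($ch, CURLOPT_URL, $url);

echo $url, "\n";
```

The literal plus travels as %2B and the spaces travel as +, so the receiving script gets back exactly the values that were put into the array.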
 

Author Closing Comment

by:bitt3n
ID: 34899762
thanks for your help
