  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 380

Trying to download a redirected url through php

I'm trying to scrape data from this url:
http://www.treasurydirect.gov/NP/BPDLogin?application=np

What is complicating the matter is that the page does a refresh back onto itself.
What I think is going on is that the initial request results in a cookie being set (JSESSIONID) and then the 2nd request returns the html that is ultimately displayed (this contains the data I want).
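
To illustrate the cookie theory above, here is a minimal sketch that pulls cookie pairs out of a raw response header block so they could be passed to CURLOPT_COOKIE on a second request. The helper name `extract_cookie_header()` and the sample header text are hypothetical; the real site's headers were not captured in this thread.

```php
<?php
// Hypothetical helper: collect name=value pairs from Set-Cookie lines in a
// raw HTTP response header block, joined in the format CURLOPT_COOKIE expects.
function extract_cookie_header(string $raw_headers): string
{
    $pairs = [];
    foreach (preg_split('/\r\n|\n|\r/', $raw_headers) as $line) {
        // Match "Set-Cookie: name=value" and ignore attributes after the first ";".
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $line, $m)) {
            $pairs[$m[1]] = $m[1] . '=' . trim($m[2]);
        }
    }
    return implode('; ', $pairs);
}

// Made-up sample headers, for illustration only.
$sample = "HTTP/1.1 200 OK\r\n"
        . "Set-Cookie: JSESSIONID=abc123:xyz; Path=/\r\n"
        . "Content-Type: text/html\r\n";

echo extract_cookie_header($sample), PHP_EOL;
```

In practice the commented-out CURLOPT_COOKIEFILE / CURLOPT_COOKIEJAR options in the code below let cURL do this bookkeeping itself.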

I've written some PHP code to set the cookie and issue the url, but it returns the first html page, which doesn't contain the data I need. Here's the code:

 
<HTML>
<BODY>

<?php
function get_data($url,$form_properties,$cookie)
{

/* IF CURL_EXEC RETURNS NOTHING and no errors are being generated, check
the selinux configuration or disable selinux completely.*/

	$timeout = 5;

	// Settings shared by the GET and POST cases.
	$ch = curl_init($url);
	curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
	curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);

	if ($form_properties == "")
	{
		// Plain GET request; send the cookie only if one was supplied.
		if ($cookie != "")
		{
			curl_setopt($ch,CURLOPT_COOKIE,$cookie);
			//curl_setopt($ch,CURLOPT_COOKIEFILE,"cookies.txt");
			//curl_setopt($ch,CURLOPT_COOKIEJAR,"cookies.txt");
		}
	}
	else
	{
		// POST request carrying the form fields. Plain variables are used
		// instead of define(), which fails if the function is called twice.
		curl_setopt($ch,CURLOPT_POST,1);
		curl_setopt($ch,CURLOPT_POSTFIELDS,$form_properties);
	}

	$data = curl_exec($ch);

	// curl_exec() returns FALSE on failure, so check for errors after the call.
	if ($data === false)
	{
		echo "<BR>cURL error: ".htmlspecialchars(curl_error($ch))."<BR>";
		$data = "";
	}

	curl_close($ch);

	echo "<BR>Length of data returned from url: ".strlen($data)."<BR>";
	return $data;

}

$returned_content = get_data("http://www.treasurydirect.gov/NP/BPDLogin?application=np","","JSESSIONID=00013RqZE2nvtGydHtPSCWdTzt4:1388teu7f");

echo htmlspecialchars($returned_content);



?>
</BODY>
</HTML>


I may be off about the cookie thing but the cookie seems to be the only way that the second http request differs from the first.

Any suggestions on how to get this to work would be much appreciated.
Asked by: opike
3 Solutions
 
zappafan2k2 Commented:
Try adding this option:
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);


Set this to TRUE to follow any "Location: " header that the server sends as part of the HTTP header. Note that this is recursive: PHP will follow every "Location: " header it is sent, unless CURLOPT_MAXREDIRS is set.
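
A minimal sketch of that suggestion, assuming the cURL extension is loaded. The helper name `redirect_options()` and the cap of 5 redirects are illustrative choices, not values from the thread. Note that CURLOPT_FOLLOWLOCATION only follows HTTP "Location: " headers; it would not follow a refresh done inside the HTML itself.

```php
<?php
// Hypothetical helper: build a cURL option array that follows redirects
// but stops after a fixed number of hops.
function redirect_options(string $url, int $max_redirects = 5): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,           // follow "Location: " headers
        CURLOPT_MAXREDIRS      => $max_redirects, // cap the redirect chain
        CURLOPT_CONNECTTIMEOUT => 5,
    ];
}

$ch = curl_init();
$applied = curl_setopt_array(
    $ch,
    redirect_options('http://www.treasurydirect.gov/NP/BPDLogin?application=np')
);
echo $applied ? "options applied\n" : "could not apply options\n";

// Uncomment to perform the request:
// $html = curl_exec($ch);
```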
 
Ray Paseur Commented:
That's an interesting site, and my guess is that the information on this page is completely unenforceable:
http://www.treasurydirect.gov/terms.htm

You might want to file a freedom-of-information request and ask for an API to retrieve this data.  In Washington, DC, where I live, public access to public records is a hot topic and the community is doing everything we can legally and technologically to force the Government to divulge its data sources, processes and conclusions.  Transparency is the only way to preserve a free society.

I'll tinker around with the site a little and see if I can get the HTML out of it.  Best, ~Ray
 
Ray Paseur Commented:
This seems to work just fine:
http://www.laprbass.com/RAY_temp_opike.php
<?php // RAY_temp_opike.php
error_reporting(E_ALL);

// THE URL FROM THE POST AT EE
$url = 'http://www.treasurydirect.gov/NP/BPDLogin?application=np';

// READ THE WEB PAGE
$htm = file_get_contents($url);

// MAKE OUR OUTPUT EASY TO READ
echo "<pre>";

// GET THE TITLE
$txt = strip_tags($htm, '<title>');
$rgx
= '#'            // REGEX DELIMITER - START
. '\<title\>'    // TITLE TAG WITH ANGLE BRACKETS ESCAPED - START
. '(.*?)'        // GROUP OF ANYTHING
. '\</title\>'   // TITLE TAG WITH ANGLE BRACKETS ESCAPED - END
. '#'            // REGEX DELIMITER - END
. 'is'           // CASE-INSENSITIVE, SINGLE LINE
;
// FALL BACK TO A PLACEHOLDER IF THE PAGE HAS NO TITLE TAG
$ttl = preg_match($rgx, $txt, $arr) ? $arr[1] : '(no title found)';

// DISPLAY THE TITLE
echo
'<strong>'
. $ttl
. '</strong>'
. PHP_EOL
. PHP_EOL
;

// SHOW THE PAGE SOURCE
$src = htmlentities($htm);
echo $src;

 
opike (Author) Commented:
"In Washington, DC, where I live, public access to public records is a hot topic and the community is doing everything we can legally and technologically to force the Government to divulge its data sources, processes and conclusions.  Transparency is the only way to preserve a free society."

That's a noble sentiment and I applaud you for it. Maybe you can stop by the Marriner Eccles building and have a chat with the Bernank about allowing an audit :).

Thanks for your code.... I take it file_get_contents() follows redirects?
 
opike (Author) Commented:
@Ray - I tried using your code in my system and for me it returns the first html page (before the redirect). I even tried going to the treasury direct link first (in order to set the cookie) and then tried running your code, but that had no effect.
 
Ray Paseur Commented:
Did you try just clicking the link here:
http://www.laprbass.com/RAY_temp_opike.php

I see the html of the web page.  
http://www.treasurydirect.gov/NP/BPDLogin?application=np

BTW, it appears that it has not been updated since 5/12, which somewhat dilutes the argument that it is "Debt to the Penny."

 
opike (Author) Commented:
Maybe there's a php configuration that affects this behavior?
 
Ray Paseur Commented:
Yes, there could be.  Do you have the output of this script (snippet)?  You might want to read this man page and then check to see if your PHP settings are right.
http://us.php.net/manual/en/function.file-get-contents.php
<?php phpinfo();

 
opike (Author) Commented:
"Did you try just clicking the link here:
http://www.laprbass.com/RAY_temp_opike.php"

Yes, that link works but I need code that will work on my system here.

I'm aware of phpinfo() but I'm not sure which options I should be looking at. open_basedir and safe mode are both turned off, but those apply to CURLOPT_FOLLOWLOCATION and I'm not sure if that matters here.
 
Ray Paseur Commented:
Read the man page for file_get_contents() - it has some hints that might explain the difference between my server and your server.  I am running Linux FastCGI PHP 5.3.6.  Safe mode is off.  Allow URL Fopen is on.
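
For comparing two servers without reading the full phpinfo() output, a short sketch along these lines prints only the settings discussed here. The helper name `describe_http_settings()` is hypothetical, and including `user_agent` and `open_basedir` in the list is an assumption about what might differ, not something confirmed in this thread.

```php
<?php
// Hypothetical helper: gather the settings that commonly affect
// file_get_contents() when it is given a URL.
function describe_http_settings(): array
{
    return [
        'php_version'     => PHP_VERSION,
        // ini_get() returns a string, so normalize the boolean setting.
        'allow_url_fopen' => filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN) ? 'on' : 'off',
        'user_agent'      => (string) ini_get('user_agent'),
        'open_basedir'    => (string) ini_get('open_basedir'),
    ];
}

foreach (describe_http_settings() as $name => $value) {
    echo str_pad($name, 16), ': ', ($value === '' ? '(empty)' : $value), PHP_EOL;
}
```

Running the same script on both machines and comparing the output line by line is usually quicker than scanning two phpinfo() pages.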
 
opike (Author) Commented:
Success - I was able to get it to work with curl_exec by adding the following line:

curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
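
For context, here is a minimal sketch of where that line sits in a complete GET request, assuming the cURL extension is loaded. The wrapper name `fetch_with_user_agent()` and its return shape are hypothetical; only the CURLOPT_USERAGENT line comes from the fix above.

```php
<?php
// Hypothetical wrapper around the fix: a GET request that sends a
// browser-style User-Agent header and reports cURL failures.
function fetch_with_user_agent(string $url, string $user_agent): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);

    // curl_exec() returns FALSE on failure, so read the error afterwards.
    $body  = curl_exec($ch);
    $error = ($body === false) ? curl_error($ch) : '';

    return [
        'ok'    => $body !== false,
        'body'  => ($body === false) ? '' : $body,
        'error' => $error,
    ];
}

// Runs only when a URL is given on the command line, for example:
//   php fetch.php "http://www.treasurydirect.gov/NP/BPDLogin?application=np"
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    $result = fetch_with_user_agent($argv[1], 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
    echo $result['ok']
        ? strlen($result['body']) . " bytes\n"
        : "cURL error: {$result['error']}\n";
}
```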

 
Ray Paseur Commented:
Interesting... I am fairly sure that file_get_contents() uses a GET method request and does not provide a user agent, so maybe this is something that the foreign site looks for in POST requests.  Anyway, glad you got it working.
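
If the User-Agent header is indeed what the site checks, file_get_contents() can be made to send one through a stream context. This is a sketch under that assumption; the helper name `context_with_user_agent()` is hypothetical, and whether it changes the result on a given server was not tested in this thread.

```php
<?php
// Hypothetical helper: build a stream context so file_get_contents()
// sends a User-Agent header with its GET request.
function context_with_user_agent(string $user_agent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $user_agent,
            'timeout'    => 5,
        ],
    ]);
}

$context = context_with_user_agent('Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');

// Uncomment to perform the request:
// $htm = file_get_contents('http://www.treasurydirect.gov/NP/BPDLogin?application=np', false, $context);

// Show the options the context would use.
print_r(stream_context_get_options($context));
```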
 
opike (Author) Commented:
I figured out the solution but awarded the points in appreciation of the efforts made by the experts.