PHP cURL JSP page

Hello,

I'm trying to scrape content from the following site:

http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp

I'm trying to enter text in the input field (e.g. AP13B1001) and press the submit button.

I'd really appreciate help implementing this with cURL in PHP.

Here is the code I have tried so far.

<?php



 
// First screen section: fetch the form page
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL,"http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$response = curl_exec($ch);
curl_close($ch);
list($header, $body) = explode("\r\n\r\n", $response, 2);
print_r($header);


 
  $fields = array(
            //post parameters to be sent to the other website
            'ptMaint:enterNo'=>'AP13B1001',
        );

// URL-encode the data for the POST
$fields_string = http_build_query($fields);
 
$ch = curl_init();
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt ($ch, CURLOPT_URL,"http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt ($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);

curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);
//curl_setopt($ch, CURLOPT_HTTPHEADER, array('Expect:'))

curl_setopt($ch, CURLOPT_POST, 1);
$data=curl_exec ($ch); 
echo $data;

?> 


VAMSICA Asked:
 
Julian Hansen Commented:
Have you tried using cURL to post the value to the form URL?

/PublicView/faces/jsf/publicview.jsp;jsessionid=FE21AFBECE9CF980659CB753D5CA7D2F

Get the page, pull the form's action attribute, and post the value to that.
 
Dave Baldwin (Fixer of Problems) Commented:
cURL can't do that.  The submit button is actually a call to a javascript routine.  Unless you can figure out how to run the javascript, it will never work.
 
VAMSICA (Author) Commented:
Thank you. Can you suggest any techniques besides cURL?


Thanks,
Vamsi.
 
Dave Baldwin (Fixer of Problems) Commented:
Nothing that I know of will run the javascript on the page except a web browser.  Which is why they are doing that... to prevent you from 'scraping' their page.
 
VAMSICA (Author) Commented:
Thank you Dave.

Julian, I appreciate you looking into it. I haven't tried your suggestion yet. I'd be glad if you could explain more about how to extract the jsessionid and post it.

Thank you Julian.
 
Julian Hansen Commented:
I haven't time to code it, but the principle is as follows.

You get the jsessionid like this:
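<?php
// Fetch the form page, then read the session ID from the form's action attribute
$page = file_get_contents('http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp');
if (preg_match('/action="[^"]*jsessionid=([A-Z0-9]+)"/', $page, $matches)) {
    $jsessionid = $matches[1]; // the captured session ID
}
?>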



Now use cURL with the jsessionid to post the value of the form to the page.

I have not tried this, so I'm not sure it will work with their site, but the principle is sound.
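
A minimal sketch of that principle, not tested against this site. The helper names extractJsessionId and postForm are made up for illustration, and the character class used for the session ID is an assumption about what the server emits. Only standard PHP cURL options are used.

```php
<?php
// Hypothetical helper: read the jsessionid from a form's action attribute.
// Returns null when the page carries no ";jsessionid=" in an action URL.
function extractJsessionId(string $html): ?string
{
    if (preg_match('/action="[^"]*;jsessionid=([A-Za-z0-9.!_-]+)/', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: POST url-encoded fields and return the response body.
function postForm(string $url, array $fields): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    // http_build_query() url-encodes names and values (":" becomes "%3A").
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    // Return the body as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}
```

To use it, fetch the page, call extractJsessionId() on the HTML, append ';jsessionid=' plus the result to the page URL, and pass that URL with array('ptMaint:enterNo' => 'AP13B1001') to postForm(). Whether the server accepts the post with only that one field is not something this sketch can confirm.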
 
VAMSICA (Author) Commented:
Thank you Julian. Your earlier suggestion of sending the jsessionid to the form action is great. I'm close to a solution. I'll keep you posted.

Appreciate your time.
 
Ray Paseur Commented:
That site is a mess!
http://validator.w3.org/check?uri=http%3A%2F%2F220.227.242.169%3A9001%2FPublicView%2Ffaces%2Fjsf%2Fpublicview.jsp&charset=%28detect+automatically%29&doctype=Inline&group=0

But that aside, most of the time when a publisher wants to expose data for others to consume, the publisher will expose the data via an API.  Typically the API is stable and version-controlled so that consumers can depend on it.  It will produce XML (or more likely JSON, today) that is documented and easy to use.

If this project has any importance beyond a personal exercise, you should contact the publisher and ask for an API.  You should not "scrape" the web page.  There are several reasons why I recommend this.

1. You may be in violation of the copyright or terms of use for this site.  You do not want the police to show up on your doorstep.  You do not want a restraining order or a lawsuit.

2. Your dependency on the content of the web site will also create a dependency on the document structure.  With such a badly marked-up document, it's almost certain that the authors don't know what they are doing when it comes to semantic markup.  They will revise the document from time to time.  Each time they do that, your scraper script will break without notice and your application will fail.

3. Well-written APIs are commonplace today, and this is the way that internet resources share data.

4. If the publisher wants you to have the data, they will make an API!  If the publisher does not want you to have the data, they can prevent you from getting it.  It's an asymmetric struggle.  They can make changes with a few keystrokes that will cause you days of remedial work.  In other words, you can't win by scraping.

Executive summary: Just don't do that.  Contact the publisher and make formal arrangements to receive the data you need.
 
mrcoffee365 Commented:
I do not know of a PHP package which does this, but if you're using Java, HtmlUnit will retrieve web pages and make an effort to execute javascript on the page:
http://htmlunit.sourceforge.net/
 
VAMSICA (Author) Commented:
Hi @Julian,

I've tried what you suggested, but no luck yet.

Here is the code; I hope you can tell me what went wrong.

<?php

$headers[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$headers[] = "Connection: close";
$headers[] = "Content-type: application/x-www-form-urlencoded, text/html";
$user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.17) Gecko/20110422 Ubuntu/10.10 (maverick) Firefox/3.6.17";



// First screen section: fetch the form page
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL,"http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$response = curl_exec($ch);
curl_close($ch);
list($recheader, $body) = explode("\r\n\r\n", $response, 2);
$Jidpos   = strpos($recheader, 'JSESSIONID=');

$FindJId= substr($recheader, $Jidpos+11, strpos($recheader, ';',$Jidpos)-$Jidpos-11);


 
  $fields = array(
            //post parameters to be sent to the other website
            'ptMaint:enterNo'=>'ap13b1001',
        );

// URL-encode the data for the POST
$fields_string = http_build_query($fields);

$url="http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp;jsessionid=".$FindJId;
echo $url;

$ch = curl_init();

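curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);
// Return the response as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);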

$data=curl_exec ($ch); 
echo $data;
curl_close($ch);
?> 

 
VAMSICA (Author) Commented:
@Ray,

I understand your points. I'm authorized to access the content, but they don't have an API yet, so in the meantime this is the option I have.
 
mrcoffee365 Commented:
What went wrong?

Basically if you are not going to execute the javascript on the page, then you have to follow all of the accesses as if you were.  So perform each step, and have your code imitate those accesses.  

It helps to use a tool which displays what is sent and what is returned in the browser to try to figure out the right calls to make.

LiveHttpHeaders for Firefox is good.  Fiddler does a similar thing for IE, I believe.
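
One way to act on this advice, sketched under assumptions: copy the POST body shown by the browser tool and compare its field names with what the script sends. The helper names decodeFormBody and missingFields are hypothetical, and any sample body you feed it has to come from your own capture.

```php
<?php
// Hypothetical helper: decode a captured application/x-www-form-urlencoded body.
// parse_str() is avoided because it rewrites dots in field names to underscores,
// which would mangle names such as javax.faces.ViewState.
function decodeFormBody(string $body): array
{
    $fields = [];
    foreach (explode('&', $body) as $pair) {
        if ($pair === '') {
            continue;
        }
        $parts = explode('=', $pair, 2);
        $fields[urldecode($parts[0])] = urldecode($parts[1] ?? '');
    }
    return $fields;
}

// Hypothetical helper: names the browser sent that the script does not send yet.
function missingFields(string $capturedBody, array $scriptFields): array
{
    return array_keys(array_diff_key(decodeFormBody($capturedBody), $scriptFields));
}
```

For example, missingFields('a=1&b=2', array('a' => '1')) returns array('b'), which tells you the script still needs to send a field named b.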
 
VAMSICA (Author) Commented:
Thank you, mrcoffee365, for pointing me in the right direction. Problem solved.

I used the Chrome JavaScript console to identify the headers and fields submitted in the POST action.

This page uses JSF view state, so along with the jsessionid I parsed the javax.faces.ViewState value and sent it in the request (http://java-success.blogspot.in/2011/12/jsf-interview-questions-and-answers_09.html).

Thank you all for the help.

Cheers!
Vamsi,
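
The fix described above (parsing javax.faces.ViewState and sending it with the POST) could look roughly like the sketch below. The exact code used was not posted, so this is an illustration only: the helper names extractViewState and buildJsfPostBody are hypothetical, the regular expressions assume double-quoted attributes, and the real form may require further fields that the thread does not list.

```php
<?php
// Hypothetical helper: read the value of the hidden javax.faces.ViewState input.
// Assumes double-quoted attributes; attribute order does not matter.
function extractViewState(string $html): ?string
{
    // Step 1: isolate the <input> tag that carries the view state.
    if (!preg_match('/<input\b[^>]*\bname="javax\.faces\.ViewState"[^>]*>/i', $html, $tag)) {
        return null;
    }
    // Step 2: pull the value attribute out of that tag.
    if (!preg_match('/\bvalue="([^"]*)"/i', $tag[0], $value)) {
        return null;
    }
    // Undo HTML escaping such as &amp; before the value is sent back.
    return html_entity_decode($value[1], ENT_QUOTES | ENT_HTML5);
}

// Hypothetical helper: add the view state to the form fields and url-encode them.
// Returns null when the page contains no view state field.
function buildJsfPostBody(string $html, array $fields): ?string
{
    $viewState = extractViewState($html);
    if ($viewState === null) {
        return null;
    }
    $fields['javax.faces.ViewState'] = $viewState;
    return http_build_query($fields);
}
```

The resulting string can be passed to CURLOPT_POSTFIELDS when posting to the URL that carries the jsessionid. Encoding matters here because view state values are often base64 and contain characters such as + and = that must be escaped.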
 
VAMSICA (Author) Commented:
I've pointed out a few things that are required for this scenario.