VAMSICA

PHP cURL JSP page

Hello,

I'm trying to scrape content from the following site:

http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp

I want to enter text in the input field (for example, AP13B1001) and hit the submit button.

I'd really appreciate help implementing this with cURL in PHP.

Here is the code I have tried so far.

<?php

// First screen: fetch the form page and keep the response headers.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_HEADER, true);
$response = curl_exec($ch);
curl_close($ch);
list($header, $body) = explode("\r\n\r\n", $response, 2);
print_r($header);

// POST parameters to be sent to the other website.
$fields = array(
    'ptMaint:enterNo' => 'AP13B1001',
);

// URL-ify the data for the POST.
$fields_string = '';
foreach ($fields as $key => $value) { $fields_string .= $key . '=' . $value . '&'; }
$fields_string = rtrim($fields_string, '&');

$ch = curl_init();
// Disabled: CURLOPT_HTTPHEADER expects an array of request header lines,
// but $header is the raw response header string from the first request.
//curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_URL, "http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);
//curl_setopt($ch, CURLOPT_HTTPHEADER, array('Expect:'));
curl_setopt($ch, CURLOPT_POST, true);
$data = curl_exec($ch);
echo $data;

?>


Dave Baldwin

cURL can't do that.  The submit button is actually a call to a JavaScript routine.  Unless you can figure out how to run the JavaScript, it will never work.
VAMSICA

ASKER

Thank you. Can you suggest any techniques besides cURL?


Thanks,
Vamsi.
Nothing that I know of will run the JavaScript on the page except a web browser.  Which is why they are doing that: to prevent you from 'scraping' their page.
ASKER CERTIFIED SOLUTION
Julian Hansen
This solution is only available to members.

ASKER

Thank you Dave.

Julian, I appreciate you looking into it. I haven't tried your suggestion yet. I'd be glad if you could explain more about how to extract the jsessionid and post it.

Thank you Julian.
I haven't had time to code it, but the principle is as follows.

You get the jsessionid like this:
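<?php
$page = file_get_contents('http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp');
// Capture the id from the form's action URL (...;jsessionid=VALUE), stopping at the
// closing quote or a query string. Session id formats vary by server.
if (preg_match('/action="[^"]*;jsessionid=([^"?]+)/i', $page, $matches)) {
    $jsessionid = $matches[1];
}
?>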



Now use cURL with the jsessionid to post the form values to the page.

I have not tried this, so I'm not sure it will work with their site, but the principle is sound.
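
To make the principle concrete, here is a minimal sketch of the two steps: pull the jsessionid out of the form's action attribute, then post to a URL that carries it. The function names (`extractJsessionId`, `buildSessionUrl`, `postForm`) and the sample form tag are made up for illustration, and none of this has been run against the actual site.

```php
<?php
// Sketch only: function names and the sample HTML are illustrative assumptions.

/**
 * Return the jsessionid embedded in a form action URL, or null if none is found.
 */
function extractJsessionId(string $html): ?string
{
    // Match action="...;jsessionid=VALUE" and stop at a quote or query string.
    if (preg_match('/action="[^"]*;jsessionid=([^"?]+)/i', $html, $m)) {
        return $m[1];
    }
    return null;
}

/**
 * Append the session id to the page URL as a path parameter.
 */
function buildSessionUrl(string $baseUrl, string $sessionId): string
{
    return $baseUrl . ';jsessionid=' . $sessionId;
}

/**
 * POST the form fields to the given URL; returns the response body, or false on failure.
 */
function postForm(string $url, array $fields)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields)); // URL-encodes keys and values
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);                  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    return curl_exec($ch);
}

// Offline demonstration on a made-up form tag (no network access needed).
$sample = '<form id="ptMaint" method="post" action="/PublicView/faces/jsf/publicview.jsp;jsessionid=ABC123DEF456">';
$id = extractJsessionId($sample);
$url = buildSessionUrl('http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp', $id);
echo $url, "\n";

// Against the live page this would become something like:
// $html = postForm($url, array('ptMaint:enterNo' => 'AP13B1001'));
```

If the form also contains hidden inputs (JSF pages commonly carry a view-state field), those probably need to go into the field array as well. That is an assumption about this site, not something I have checked.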

ASKER

Thank you Julian. Your earlier suggestion of sending the jsessionid in the form action is great. I'm close to getting the solution. I'll keep you posted.

Appreciate your time.
That site is a mess!
http://validator.w3.org/check?uri=http%3A%2F%2F220.227.242.169%3A9001%2FPublicView%2Ffaces%2Fjsf%2Fpublicview.jsp&charset=%28detect+automatically%29&doctype=Inline&group=0

But that aside, most of the time when a publisher wants to expose data for others to consume, the publisher will expose the data via an API.  Typically the API is stable and version-controlled so that consumers can depend on it.  It will produce XML (or more likely JSON, today) that is documented and easy to use.
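
For comparison with scraping, this is roughly what consuming a documented JSON API looks like in PHP. The payload and its field names (`number`, `status`, `version`) are invented for illustration; this publisher has no such API that I know of.

```php
<?php
// Hypothetical payload: the field names below are invented for illustration.
$json = '{"number":"AP13B1001","status":"active","version":1}';

// Decode into an associative array and fail loudly on malformed input.
$record = json_decode($json, true);
if (!is_array($record)) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

echo $record['number'], ' => ', $record['status'], "\n";
```

There is no markup to parse and no document structure to depend on, only named fields.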

If this project has any importance beyond a personal exercise, you should contact the publisher and ask for an API.  You should not "scrape" the web page.  There are several reasons why I recommend this.

1. You may be in violation of the copyright or terms of use for this site.  You do not want the police to show up on your doorstep.  You do not want a restraining order or a lawsuit.

2. Your dependency on the content of the web site will also create a dependency on the document structure.  With such a badly marked-up document, it's almost certain that the authors don't know what they are doing when it comes to semantic markup.  They will revise the document from time to time.  Each time they do that, your scraper script will break without notice and your application will fail.

3. Well-written APIs are commonplace today, and this is the way that internet resources share data.

4. If the publisher wants you to have the data, they will make an API!  If the publisher does not want you to have the data, they can prevent you from getting it.  It's an asymmetric struggle.  They can make changes with a few keystrokes that will cause you days of remedial work.  In other words, you can't win by scraping.

Executive summary: Just don't do that.  Contact the publisher and make formal arrangements to receive the data you need.
I do not know of a PHP package that does this, but if you're using Java, HtmlUnit will retrieve web pages and make an effort to execute the JavaScript on the page:
http://htmlunit.sourceforge.net/

ASKER

Hi @Julian,

I've tried what you suggested, but no luck yet.

Here is the code; I hope you can tell me what went wrong.

<?php

// Request headers and user agent to present to the server.
$headers = array();
$headers[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$headers[] = "Connection: close";
$headers[] = "Content-Type: application/x-www-form-urlencoded";
$user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.17) Gecko/20110422 Ubuntu/10.10 (maverick) Firefox/3.6.17";

// First screen: fetch the form page and keep the response headers.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_HEADER, true);
$response = curl_exec($ch);
curl_close($ch);
list($recheader, $body) = explode("\r\n\r\n", $response, 2);

// Read the JSESSIONID value from the Set-Cookie header (empty string if absent).
$FindJId = preg_match('/JSESSIONID=([^;\s]+)/', $recheader, $m) ? $m[1] : '';

// POST parameters to be sent to the other website.
$fields = array(
    'ptMaint:enterNo' => 'ap13b1001',
);

// URL-ify the data for the POST.
$fields_string = '';
foreach ($fields as $key => $value) { $fields_string .= $key . '=' . $value . '&'; }
$fields_string = rtrim($fields_string, '&');

$url = "http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp;jsessionid=" . $FindJId;
echo $url;

// Second request: POST the form to the session URL.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it directly
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);

$data = curl_exec($ch);
echo $data;
curl_close($ch);
?>



ASKER

@Ray,

I understand your points. I'm authorized to access the content, but they don't have an API yet, so in the meantime this is the option I have.
SOLUTION
This solution is only available to members.
SOLUTION
This solution is only available to members.

ASKER

I've pointed out a few things that are required for this scenario.