Solved

PHP cURL JSP page

Posted on 2014-01-23
Last Modified: 2014-01-31
Hello,

I'm trying to scrape content from the following site:

http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp

I want to enter a value in the input field (e.g. AP13B1001) and submit the form.

I'd really appreciate help implementing this with cURL in PHP.

Here is the code I have so far.

<?php
// first screen section
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL,"http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$response = curl_exec($ch);
curl_close($ch);
list($header, $body) = explode("\r\n\r\n", $response, 2);
print_r($header);


 
  $fields = array(
            //post parameters to be sent to the other website
            'ptMaint:enterNo'=>'AP13B1001',
        );

// URL-encode the data for the POST
$fields_string = http_build_query($fields);
 
$ch = curl_init();
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt ($ch, CURLOPT_URL,"http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt ($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);

curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);
//curl_setopt($ch, CURLOPT_HTTPHEADER, array('Expect:'))

curl_setopt($ch, CURLOPT_POST, 1);
$data=curl_exec ($ch); 
echo $data;

?> 

Question by:VAMSICA
14 Comments
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 39805872
cURL can't do that.  The submit button is actually a call to a JavaScript routine.  Unless you can figure out how to run the JavaScript, it will never work.
 

Author Comment

by:VAMSICA
ID: 39805910
Thank you. Can you suggest any techniques besides cURL?


Thanks,
Vamsi.
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 39805919
Nothing that I know of will run the javascript on the page except a web browser.  Which is why they are doing that... to prevent you from 'scraping' their page.

 
LVL 54

Accepted Solution

by:
Julian Hansen earned 250 total points
ID: 39805925
Have you tried using cURL to post the value to the form URL?

/PublicView/faces/jsf/publicview.jsp;jsessionid=FE21AFBECE9CF980659CB753D5CA7D2F

Get the page, pull the form's action attribute, and post the value to that URL.
 

Author Comment

by:VAMSICA
ID: 39805952
Thank you Dave.

Julian, I appreciate you looking into it. I haven't tried your suggestion yet. I'd be glad if you could explain more about how to extract the jsessionid and post it.

Thank you Julian.
 
LVL 54

Expert Comment

by:Julian Hansen
ID: 39806111
I haven't time to code it, but the principle is as follows.

You get the jsessionid like this:
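<?php
// Fetch the form page, then pull the jsessionid out of the form's action attribute.
$page = file_get_contents('http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp');
// preg_match() returns 1 on a match; the captured session id is in $matches[1].
$jsessionid = preg_match('/action="[^"]*jsessionid=([A-Z0-9]+)"/', $page, $matches) ? $matches[1] : null;
?>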



Now use cURL with the jsessionid to post the form value to the page.

I have not tried this, so I'm not sure it will work with their site, but the principle is sound.
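
The posting step described above could look roughly like the sketch below. The helper names buildSessionUrl() and postForm() are made up for illustration, and whether the server accepts the session id in the URL (rather than only as a cookie) is an assumption that has to be checked against the real site.

```php
<?php
// Hypothetical helper: append the session id to the form URL in the
// ";jsessionid=..." form seen in the form's action attribute.
function buildSessionUrl(string $baseUrl, string $sessionId): string
{
    return $baseUrl . ';jsessionid=' . rawurlencode($sessionId);
}

// Hypothetical helper: POST the form fields and return the response body,
// or false if the request fails.
function postForm(string $url, array $fields)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    // http_build_query() URL-encodes keys and values, including the colon in "ptMaint:enterNo".
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}
```

A call would then be something like postForm(buildSessionUrl($formUrl, $jsessionid), array('ptMaint:enterNo' => 'AP13B1001')), where $formUrl and $jsessionid come from the previous step.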
 

Author Comment

by:VAMSICA
ID: 39806138
Thank you Julian. Your earlier suggestion of sending the jsessionid to the form action is great. I'm close to a solution and will keep you posted.

Appreciate your time.
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 39806319
That site is a mess!
http://validator.w3.org/check?uri=http%3A%2F%2F220.227.242.169%3A9001%2FPublicView%2Ffaces%2Fjsf%2Fpublicview.jsp&charset=%28detect+automatically%29&doctype=Inline&group=0

But that aside, most of the time when a publisher wants to expose data for others to consume, the publisher will expose the data via an API.  Typically the API is stable and version-controlled so that consumers can depend on it.  It will produce XML (or more likely JSON, today) that is documented and easy to use.

If this project has any importance beyond a personal exercise, you should contact the publisher and ask for an API.  You should not "scrape" the web page.  There are several reasons why I recommend this.

1. You may be in violation of the copyright or terms of use for this site.  You do not want the police to show up on your doorstep.  You do not want a restraining order or a lawsuit.

2. Your dependency on the content of the web site will also create a dependency on the document structure.  With such a badly marked-up document, it's almost certain that the authors don't know what they are doing when it comes to semantic markup.  They will revise the document from time to time.  Each time they do that, your scraper script will break without notice and your application will fail.

3. Well-written APIs are commonplace today, and this is the way that internet resources share data.

4. If the publisher wants you to have the data, they will make an API!  If the publisher does not want you to have the data, they can prevent you from getting it.  It's an asymmetric struggle.  They can make changes with a few keystrokes that will cause you days of remedial work.  In other words, you can't win by scraping.

Executive summary: Just don't do that.  Contact the publisher and make formal arrangements to receive the data you need.
 
LVL 27

Expert Comment

by:mrcoffee365
ID: 39808716
I do not know of a PHP package which does this, but if you're using Java, HtmlUnit will retrieve web pages and make an effort to execute javascript on the page:
http://htmlunit.sourceforge.net/
 

Author Comment

by:VAMSICA
ID: 39811055
Hi @Julian,

I've tried what you suggested, but no luck yet.

Here is the code; I hope you can tell me what went wrong.

<?php

$headers[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$headers[] = "Connection: close";
$headers[] = "Content-type: application/x-www-form-urlencoded, text/html";
$user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.17) Gecko/20110422 Ubuntu/10.10 (maverick) Firefox/3.6.17";

// first screen section
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL,"http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$response = curl_exec($ch);
curl_close($ch);
list($recheader, $body) = explode("\r\n\r\n", $response, 2);
$Jidpos   = strpos($recheader, 'JSESSIONID=');

$FindJId= substr($recheader, $Jidpos+11, strpos($recheader, ';',$Jidpos)-$Jidpos-11);


 
  $fields = array(
            //post parameters to be sent to the other website
            'ptMaint:enterNo'=>'ap13b1001',
        );

// URL-encode the data for the POST
$fields_string = http_build_query($fields);

$url="http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp;jsessionid=".$FindJId;
echo $url;

$ch = curl_init();

curl_setopt ($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch,CURLOPT_POSTFIELDS,$fields_string);

$data=curl_exec ($ch); 
echo $data;
curl_close($ch);
?> 

 

Author Comment

by:VAMSICA
ID: 39811072
@ Ray,

I understand your points. I'm authorized to access the content, but they don't have an API yet, so in the meantime this is the option I have.
 
LVL 27

Assisted Solution

by:mrcoffee365
mrcoffee365 earned 250 total points
ID: 39811138
What went wrong?

Basically if you are not going to execute the javascript on the page, then you have to follow all of the accesses as if you were.  So perform each step, and have your code imitate those accesses.  

It helps to use a tool which displays what is sent and what is returned in the browser to try to figure out the right calls to make.

LiveHttpHeaders for Firefox is good.  Fiddler does a similar thing for IE, I believe.
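
One access a browser repeats on every request is sending back the session cookie it received. As a minimal sketch of imitating that, the hypothetical helper below reads the JSESSIONID value from a raw response header block, such as the $recheader string in the script above. It assumes the server sends the id in a Set-Cookie header, which should be confirmed with one of the header tools mentioned here.

```php
<?php
// Hypothetical helper: return the JSESSIONID value from raw HTTP response
// headers, or null when no such cookie was set.
function sessionIdFromHeaders(string $rawHeaders): ?string
{
    // "m" lets ^ match at the start of each header line; "i" ignores case.
    if (preg_match('/^Set-Cookie:\s*JSESSIONID=([^;\s]+)/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null;
}
```

A regular expression anchored on the header name is less fragile than strpos()/substr() arithmetic, because it does not depend on a semicolon following the value.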
 

Assisted Solution

by:VAMSICA
VAMSICA earned 0 total points
ID: 39811209
Thank you mrcoffee365 for pointing me in the right direction. Problem solved.

I used the Chrome JavaScript console to identify the headers and fields submitted in the POST action.

This page uses JSF view state, so along with the jsessionid I parsed the javax.faces.ViewState value and sent it in the request (http://java-success.blogspot.in/2011/12/jsf-interview-questions-and-answers_09.html).
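
For anyone landing here later, the ViewState step could be sketched as follows. This is not the exact code used for this site: the helper name hiddenFields() and the sample markup are invented for illustration, and the real page's hidden field names and values must be read from the fetched HTML.

```php
<?php
// Hypothetical helper: collect every hidden <input> in the page as name => value,
// so fields such as javax.faces.ViewState can be sent back with the POST.
function hiddenFields(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally; invalid markup would otherwise emit notices.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = array();
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '' && strtolower($input->getAttribute('type')) === 'hidden') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Invented sample markup; the value "j_id1:j_id2" is a placeholder.
$sample = '<form id="ptMaint" action="publicview.jsp">'
        . '<input type="text" name="ptMaint:enterNo" value="">'
        . '<input type="hidden" name="javax.faces.ViewState" value="j_id1:j_id2">'
        . '</form>';

// Merge the hidden fields with the value typed into the visible input.
$post = hiddenFields($sample) + array('ptMaint:enterNo' => 'AP13B1001');
echo http_build_query($post), "\n";
```

Collecting all hidden inputs, rather than only javax.faces.ViewState, is a deliberate choice: it keeps the POST closer to what the browser sends if the form carries other hidden fields.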

Thank you all for the help.

Cheers!
Vamsi
 

Author Closing Comment

by:VAMSICA
ID: 39823554
I've pointed out a few things that are required for this scenario.