Solved

PHP cURL JSP page

Posted on 2014-01-23
Last Modified: 2014-01-31
Hello,

I'm trying to scrape content on the following site

http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp

I want to enter a value in the input field (for example, AP13B1001) and submit the form.

I'd really appreciate help implementing this with cURL in PHP.

Here is the code I have tried so far.

<?php

// first screen section
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL,"http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$response = curl_exec($ch);
curl_close($ch);
list($header, $body) = explode("\r\n\r\n", $response, 2);
print_r($header);


 
  $fields = array(
            //post parameters to be sent to the other website
            'ptMaint:enterNo'=>'AP13B1001',
        );

//url-ify the data for the POST
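$fields_string = '';
foreach($fields as $key=>$value) { $fields_string .= $key.'='.$value.'&';}
// rtrim() returns the trimmed string, so the result must be assigned back
$fields_string = rtrim($fields_string,'&');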
 
$ch = curl_init();
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt ($ch, CURLOPT_URL,"http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt ($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);

curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);
//curl_setopt($ch, CURLOPT_HTTPHEADER, array('Expect:'))

curl_setopt($ch, CURLOPT_POST, 1);
$data=curl_exec ($ch); 
echo $data;

?> 

Question by:VAMSICA
14 Comments
 
LVL 82

Expert Comment

by:Dave Baldwin
cURL can't do that.  The submit button is actually a call to a JavaScript routine.  Unless you can figure out how to run the JavaScript, it will never work.
 

Author Comment

by:VAMSICA
Thank you. Can you suggest any techniques besides cURL?


Thanks,
Vamsi.
 
LVL 82

Expert Comment

by:Dave Baldwin
Nothing that I know of will run the JavaScript on the page except a web browser.  Which is why they are doing that: to prevent you from 'scraping' their page.
 
LVL 51

Accepted Solution

by:
Julian Hansen earned 250 total points
Have you tried using cURL to post the value to the form URL?

/PublicView/faces/jsf/publicview.jsp;jsessionid=FE21AFBECE9CF980659CB753D5CA7D2F

Get the page, pull the form's action attribute, and post the value to that URL.
 

Author Comment

by:VAMSICA
Thank you Dave.

Julian, I appreciate you looking into it. I haven't tried your suggestion yet. I'd be glad if you could explain more about how to extract the jsessionid and post it.

Thank you Julian.
 
LVL 51

Expert Comment

by:Julian Hansen
I haven't had time to code it, but the principle is as follows.

You get the jsessionid like this:
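<?php
$page = file_get_contents('http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp');
// Capture the session ID from the form's action attribute; [^"]* keeps the match inside one attribute.
// preg_match() returns the match count, so the ID itself is read from $matches[1].
$jsessionid = preg_match('/action="[^"]*jsessionid=([A-Z0-9]+)/i', $page, $matches) ? $matches[1] : null;
?>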


Now use cURL with the jsessionid to post the value of the form to the page.

I have not tried this so not sure if it will work with their site but the principle is sound.
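
As a rough sketch of that last step (untested against this site, like the rest of this comment): the helper names buildSessionUrl and postForm are made up for illustration, and the form may well expect more fields than the single one shown in the question.

```php
<?php
// Sketch only: append the session ID to the form URL, mirroring the
// ";jsessionid=..." form of the action attribute quoted earlier.
function buildSessionUrl(string $baseUrl, string $jsessionId): string
{
    return $baseUrl . ';jsessionid=' . rawurlencode($jsessionId);
}

// POST $fields to $url and return the response body, or null on failure.
// (Hypothetical helper; it only wraps standard cURL options.)
function postForm(string $url, array $fields): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        // http_build_query() URL-encodes names and values for the POST body.
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}
```

Usage would look something like postForm(buildSessionUrl($pageUrl, $jsessionid), ['ptMaint:enterNo' => 'AP13B1001']), where $pageUrl is the publicview.jsp address and $jsessionid comes from the regular expression above.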
 

Author Comment

by:VAMSICA
Thank you Julian. Your earlier suggestion of sending the jsessionid to the form action is great. I'm close to a solution. I'll keep you posted.

Appreciate your time.
 
LVL 108

Expert Comment

by:Ray Paseur
That site is a mess!
http://validator.w3.org/check?uri=http%3A%2F%2F220.227.242.169%3A9001%2FPublicView%2Ffaces%2Fjsf%2Fpublicview.jsp&charset=%28detect+automatically%29&doctype=Inline&group=0

But that aside, most of the time when a publisher wants to expose data for others to consume, the publisher will expose the data via an API.  Typically the API is stable and version-controlled so that consumers can depend on it.  It will produce XML (or more likely JSON, today) that is documented and easy to use.

If this project has any importance beyond a personal exercise, you should contact the publisher and ask for an API.  You should not "scrape" the web page.  There are several reasons why I recommend this.

1. You may be in violation of the copyright or terms of use for this site.  You do not want the police to show up on your doorstep.  You do not want a restraining order or a lawsuit.

2. Your dependency on the content of the web site will also create a dependency on the document structure.  With such a badly marked-up document, it's almost certain that the authors don't know what they are doing when it comes to semantic markup.  They will revise the document from time to time.  Each time they do that, your scraper script will break without notice and your application will fail.

3. Well-written APIs are commonplace today, and this is the way that internet resources share data.

4. If the publisher wants you to have the data, they will make an API!  If the publisher does not want you to have the data, they can prevent you from getting it.  It's an asymmetric struggle.  They can make changes with a few keystrokes that will cause you days of remedial work.  In other words, you can't win by scraping.

Executive summary: Just don't do that.  Contact the publisher and make formal arrangements to receive the data you need.
 
LVL 26

Expert Comment

by:mrcoffee365
I do not know of a PHP package which does this, but if you're using Java, HtmlUnit will retrieve web pages and make an effort to execute JavaScript on the page:
http://htmlunit.sourceforge.net/
 

Author Comment

by:VAMSICA
Hi @Julian,

I've tried what you suggested, but no luck yet.

Here is the code. I hope you can tell me what went wrong.

<?php

$headers[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$headers[] = "Connection: close";
$headers[] = "Content-type: application/x-www-form-urlencoded, text/html";
$user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.17) Gecko/20110422 Ubuntu/10.10 (maverick) Firefox/3.6.17";



// first screen section
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL,"http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$response = curl_exec($ch);
curl_close($ch);
list($recheader, $body) = explode("\r\n\r\n", $response, 2);
$Jidpos   = strpos($recheader, 'JSESSIONID=');

$FindJId= substr($recheader, $Jidpos+11, strpos($recheader, ';',$Jidpos)-$Jidpos-11);


 
  $fields = array(
            //post parameters to be sent to the other website
            'ptMaint:enterNo'=>'ap13b1001',
        );

//url-ify the data for the POST
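$fields_string = '';
foreach($fields as $key=>$value) { $fields_string .= $key.'='.$value.'&';}
// rtrim() returns the trimmed string, so the result must be assigned back
$fields_string = rtrim($fields_string,'&');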

$url="http://220.227.242.169:9001/PublicView/faces/jsf/publicview.jsp;jsessionid=".$FindJId;
echo $url;

$ch = curl_init();

curl_setopt ($ch, CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_POST,count($fields));
curl_setopt($ch,CURLOPT_POSTFIELDS,$fields_string);

$data=curl_exec ($ch); 
echo $data;
curl_close($ch);
?> 

 

Author Comment

by:VAMSICA
@ Ray,

I understand your points. I'm authorized to access the content, but they don't have an API yet, so in the meantime this is my option.
 
LVL 26

Assisted Solution

by:mrcoffee365
mrcoffee365 earned 250 total points
What went wrong?

Basically, if you are not going to execute the JavaScript on the page, then you have to follow all of the accesses as if you were.  So perform each step, and have your code imitate those accesses.

It helps to use a tool which displays what is sent and what is returned in the browser to try to figure out the right calls to make.

LiveHttpHeaders for Firefox is good.  Fiddler does a similar thing for IE, I believe.
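
On the PHP side, cURL can report the request headers it actually sent if you set CURLINFO_HEADER_OUT with curl_setopt() and then read curl_getinfo($ch, CURLINFO_HEADER_OUT) after the transfer. A rough sketch of comparing that against what the browser tool shows follows; parseHeaderBlock and missingHeaders are made-up helper names, and the comparison only looks at header names, not values.

```php
<?php
// Sketch only: turn a raw header block into [lowercased-name => value].
// The request line ("POST /path HTTP/1.1") has no "Name: value" shape and is skipped.
function parseHeaderBlock(string $raw): array
{
    $headers = [];
    foreach (preg_split('/\r\n|\n/', trim($raw)) as $line) {
        if (preg_match('/^([A-Za-z0-9-]+):\s*(.*)$/', $line, $m)) {
            $headers[strtolower($m[1])] = trim($m[2]);
        }
    }
    return $headers;
}

// Names of headers the browser sent that the cURL request did not.
function missingHeaders(string $browserRaw, string $curlRaw): array
{
    return array_keys(
        array_diff_key(parseHeaderBlock($browserRaw), parseHeaderBlock($curlRaw))
    );
}
```

Paste the browser's request headers into $browserRaw, pass the string from curl_getinfo() as $curlRaw, and the result lists what your script is not yet sending (a Cookie header, for example). The POST body fields need the same side-by-side check.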
 

Assisted Solution

by:VAMSICA
VAMSICA earned 0 total points
Thank you, mrcoffee365, for pointing me in the right direction. Problem solved.

I used the Chrome JavaScript console to identify the headers and fields submitted in the POST action.

This page uses JSF view state, so along with the jsessionid I parsed the javax.faces.ViewState value and sent it in the request (http://java-success.blogspot.in/2011/12/jsf-interview-questions-and-answers_09.html).
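
For anyone following along, here is a minimal sketch of that parsing step. It is not the exact code used here: the helper names extractViewState and buildPostBody are invented for illustration, and it assumes the view state sits in a hidden input named javax.faces.ViewState with double-quoted attributes. Any other fields the browser submits (as seen in the console) would need to be added to the array as well.

```php
<?php
// Sketch only: pull the hidden javax.faces.ViewState value out of a JSF page.
function extractViewState(string $html): ?string
{
    // Inspect each <input ...> tag in turn so attribute order does not matter.
    if (!preg_match_all('/<input\b[^>]*>/i', $html, $tags)) {
        return null;
    }
    foreach ($tags[0] as $tag) {
        if (preg_match('/\bname="javax\.faces\.ViewState"/i', $tag)
            && preg_match('/\bvalue="([^"]*)"/i', $tag, $m)) {
            // Attribute values may contain HTML entities such as &amp;.
            return html_entity_decode($m[1], ENT_QUOTES);
        }
    }
    return null;
}

// Merge the view state (if found) into the caller's fields and encode the body.
function buildPostBody(string $html, array $userFields): string
{
    $fields = $userFields;
    $viewState = extractViewState($html);
    if ($viewState !== null) {
        $fields['javax.faces.ViewState'] = $viewState;
    }
    // http_build_query() URL-encodes names and values, e.g. the ":" in "ptMaint:enterNo".
    return http_build_query($fields);
}
```

The string returned by buildPostBody($body, ['ptMaint:enterNo' => 'AP13B1001']) can be passed to CURLOPT_POSTFIELDS, with the request sent to the ;jsessionid= URL from the same page load. A DOM parser would be more robust than regular expressions on markup this messy; the regex version just keeps the sketch short.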

Thank you all for the help.

Cheers!
Vamsi
 

Author Closing Comment

by:VAMSICA
I've pointed out a few things that are required for this scenario.