Solved

Screen Scraping Password Protected Sites

Posted on 2006-06-06
Last Modified: 2008-02-01
I need to do some screen scraping on a password protected site for which I have a valid login.  I create the URLConnection etc, and perform the POST operation for the URL in question, but the HTML I get back is the contents of the login page to which I have been redirected.  Setting up a PasswordAuthenticator using Authenticator.setDefault() did not help at all.  I am just curious what my general strategy for making this work should be?  I'm guessing I might need to actually perform the login, trap some cookies or something, remember them and then use them in requesting the resource in question.  Does this sound right?  I'm sure it is different depending on site, so any sort of general resource explaining how to do this would be perfect.
Question by:derekl
8 Comments
 
LVL 10

Expert Comment

by:radarsh
ID: 16846129
Hi derekl,

Try using the session cookie value that appears in the URL once you log in. See if you can
scrape the page using that long URL.

________
radarsh
 
LVL 14

Expert Comment

by:hoomanv
ID: 16846267
Yes, first look at the cookie convention the site uses: dump the response headers and look for the ones with headerName = Set-Cookie.
    // con is the URLConnection used for the login request.
    // Start at 1: for HTTP, index 0 can hold the status line, whose key is null.
    for (int i = 1; ; i++) {
        String headerName = con.getHeaderFieldKey(i);
        String headerValue = con.getHeaderField(i);
        if (headerName == null)
            break;
        System.out.println(headerName + " = " + headerValue);
    }

You can also use the Firefox extension "Live HTTP headers" to grab the cookie that is sent to the site on login.

// to send the cookie back from Java (set it before the connection is opened)
con.setRequestProperty("Cookie", "  value ");
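The two steps above (read the cookies the server sets, then send them back) can be tied together with a small helper. This is only a sketch: the class name CookieHeaderBuilder and the method cookieHeader are made up for illustration, and it assumes the site uses plain cookie-based sessions. It takes the map returned by URLConnection.getHeaderFields() and keeps just the name=value part of each Set-Cookie header.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any library.
class CookieHeaderBuilder {

    /**
     * Builds a value for the "Cookie" request header from response header
     * fields, e.g. the map returned by URLConnection.getHeaderFields().
     */
    static String cookieHeader(Map<String, List<String>> headerFields) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : headerFields.entrySet()) {
            String name = entry.getKey();
            // The status line is stored under a null key; header names are case-insensitive.
            if (name == null || !name.equalsIgnoreCase("Set-Cookie")) {
                continue;
            }
            for (String setCookie : entry.getValue()) {
                // Keep only "name=value"; drop attributes such as Path or Expires.
                int semicolon = setCookie.indexOf(';');
                String pair = semicolon < 0 ? setCookie : setCookie.substring(0, semicolon);
                pairs.add(pair.trim());
            }
        }
        return String.join("; ", pairs);
    }
}
```

Usage would look roughly like con.setRequestProperty("Cookie", CookieHeaderBuilder.cookieHeader(loginCon.getHeaderFields())), where loginCon is the connection that performed the login POST. If the login answers with a redirect, it may be necessary to call setInstanceFollowRedirects(false) on the login connection so that the Set-Cookie header of the redirect response itself stays visible.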
 
LVL 92

Accepted Solution

by:objects (earned 750 total points)
ID: 16847404
Try using HttpUnit or HttpClient; both can keep track of the session cookies for you.
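To illustrate what such a client does for you, here is a rough sketch that logs in and then requests the protected page with the same cookie store. Note the swap: it uses the JDK's own java.net.http.HttpClient (Java 11 and later) instead of the third-party libraries named above, so that it needs no extra jars. The class name LoginThenFetch, the method names, and the idea that the login is a plain form POST are all assumptions; the real field names and URLs depend on the site.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper class, for illustration only.
class LoginThenFetch {

    /** Encodes form fields as application/x-www-form-urlencoded. */
    static String formBody(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    /** Posts the login form, then fetches pageUrl with the cookies the login set. */
    static String fetchAfterLogin(String loginUrl, Map<String, String> loginFields, String pageUrl)
            throws IOException, InterruptedException {
        // The CookieManager stores cookies from responses and replays them on later requests.
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        HttpRequest login = HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(loginFields)))
                .build();
        client.send(login, HttpResponse.BodyHandlers.discarding());

        HttpRequest page = HttpRequest.newBuilder(URI.create(pageUrl)).GET().build();
        return client.send(page, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

The important design point is that one client instance, and therefore one cookie store, is used for both requests. Sites that add hidden form fields or tokens to the login form would need those values included in loginFields as well.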
 

Author Comment

by:derekl
ID: 16847938
Are HttpUnit or HttpClient standard Java APIs?
 
LVL 14

Expert Comment

by:hoomanv
ID: 16849388
 
LVL 30

Expert Comment

by:Mayank S
ID: 16849719
They are 3rd party, but quite commonly used in projects.
 

Author Comment

by:derekl
ID: 16854813
HttpClient is exactly what I needed, objects!  Many thanks.

One more question if you don't mind.  Is there a Java package capable of reading in some source HTML and extracting the elements on a page in an OO fashion?  I'm thinking specifically of the FORM elements on a page.  I may write one myself if not.  Thanks in advance.
 
LVL 30

Expert Comment

by:Mayank S
ID: 16858792
There are many open-source HTML parsers:

http://java-source.net/open-source/html-parsers
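For simple cases the JDK's built-in Swing HTML parser may already be enough, without any of the third-party parsers. The sketch below is only an illustration of that option: the class name FormFieldLister and the method inputFields are invented, and it only collects INPUT elements (SELECT and TEXTAREA would need extra handling in handleStartTag). The parser is an old, lenient HTML 3.2 parser, so results on modern markup may vary.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, for illustration only.
class FormFieldLister {

    /** Returns "name=value" for every INPUT element that has a name attribute. */
    static List<String> inputFields(String html) {
        final List<String> fields = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // INPUT has no closing tag, so the parser reports it as a simple tag.
                if (tag == HTML.Tag.INPUT) {
                    Object name = attrs.getAttribute(HTML.Attribute.NAME);
                    Object value = attrs.getAttribute(HTML.Attribute.VALUE);
                    if (name != null) {
                        fields.add(name + "=" + (value == null ? "" : value));
                    }
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fields;
    }
}
```

Calling FormFieldLister.inputFields(pageSource) on a login page would list its input fields, which can help in working out what a login POST has to contain, including any hidden fields.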