Screen Scraping Password Protected Sites

I need to do some screen scraping on a password protected site for which I have a valid login. I create the URLConnection etc, and perform the POST operation for the URL in question, but the HTML I get back is the contents of the login page to which I have been redirected. Setting up a PasswordAuthenticator using Authenticator.setDefault() did not help at all. I am just curious what my general strategy for making this work should be? I'm guessing I might need to actually perform the login, trap some cookies or something, remember them and then use them in requesting the resource in question. Does this sound right? I'm sure it is different depending on site, so any sort of general resource explaining how to do this would be perfect.

radarsh

Hi derekl,

Try using the session cookie that appears in the URL once you log in. See if you can
scrape the screen using that long URL.

________
radarsh

hoomanv

Yes. First see which cookie convention the site uses by printing the response headers; the cookies are the ones with headerName = Set-Cookie:

for (int i = 1; ; i++) {
    String headerName = con.getHeaderFieldKey(i);
    String headerValue = con.getHeaderField(i);
    // a null key at index >= 1 means there are no more headers
    if (headerName == null)
        break;
    System.out.println(headerName + " = " + headerValue);
}

You can also use the Firefox extension "Live HTTP Headers" to grab the cookie that is sent to the site on login.

// to send the cookie back through Java (set it before the connection is opened)
con.setRequestProperty("Cookie", " value ");
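
Putting the two steps above together, here is a rough sketch of the whole round trip with plain HttpURLConnection: POST the login form, collect the Set-Cookie values, and send them back in a Cookie header on the next request. The class name, the method names, and the idea that the site uses an ordinary form login are all assumptions for illustration; a real site may need extra hidden fields or more careful cookie handling (expiry, path, domain).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of any library.
class CookieLoginSketch {

    // Turns Set-Cookie response values into one Cookie request header value.
    static String toCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            // Keep only "name=value"; drop attributes such as Path or HttpOnly.
            int semicolon = value.indexOf(';');
            pairs.add((semicolon < 0 ? value : value.substring(0, semicolon)).trim());
        }
        return String.join("; ", pairs);
    }

    // POSTs an already URL-encoded form body and returns the cookies the server set.
    static String login(String loginUrl, String formBody) throws IOException {
        HttpURLConnection con = (HttpURLConnection) URI.create(loginUrl).toURL().openConnection();
        // Do not follow the redirect, otherwise the cookies set on the
        // login response itself are not visible here.
        con.setInstanceFollowRedirects(false);
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = con.getOutputStream()) {
            out.write(formBody.getBytes(StandardCharsets.UTF_8));
        }
        List<String> setCookies = new ArrayList<>();
        for (Map.Entry<String, List<String>> header : con.getHeaderFields().entrySet()) {
            // The status line is stored under a null key, so check for null first.
            if (header.getKey() != null && header.getKey().equalsIgnoreCase("Set-Cookie")) {
                setCookies.addAll(header.getValue());
            }
        }
        return toCookieHeader(setCookies);
    }

    // Requests a protected page, presenting the cookies obtained from login().
    static String fetch(String pageUrl, String cookieHeader) throws IOException {
        HttpURLConnection con = (HttpURLConnection) URI.create(pageUrl).toURL().openConnection();
        con.setRequestProperty("Cookie", cookieHeader);
        try (InputStream in = con.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

Usage would be something like fetch(pageUrl, login(loginUrl, "user=...&pass=...")), where the form field names come from the site's own login page.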

ASKER CERTIFIED SOLUTION

Mick Barry

This solution is only available to members.

derekl

ASKER

Are HttpUnit or HttpClient standard Java APIs?

hoomanv

No
HttpClient: http://jakarta.apache.org/commons/httpclient/
HttpUnit: http://httpunit.sourceforge.net/

Mayank S

They are 3rd party, but quite commonly used in projects.
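
For anyone who would rather stay inside the JDK: this is not the Jakarta Commons HttpClient linked above, but newer Java versions (11 and later) ship a java.net.http.HttpClient that can keep session cookies for you when given a CookieManager. A minimal sketch under that assumption follows; the class name, method names, and form field names are made up for illustration, and the real login form may expect different fields.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper class, not part of any library.
class SessionClientSketch {

    // Builds an application/x-www-form-urlencoded body from field names and values.
    static String formBody(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Logs in, then requests the protected page with the same client so that
    // the cookies stored by the CookieManager are sent automatically.
    static String loginAndFetch(String loginUrl, Map<String, String> credentials, String pageUrl)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        HttpRequest login = HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(credentials)))
                .build();
        // The login response body is not needed, only the cookies it sets.
        client.send(login, HttpResponse.BodyHandlers.discarding());

        HttpRequest page = HttpRequest.newBuilder(URI.create(pageUrl)).GET().build();
        return client.send(page, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

The point of the design is the same as with the third-party libraries: one client object holds the cookie store, so the login and the later page request share a session without any manual header handling.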

derekl

ASKER

HttpClient is exactly what I needed, objects! Many thanks.

One more question, if you don't mind. Is there a Java package capable of reading in some source HTML and extracting the elements on a page in an OO fashion? I'm thinking specifically of the FORM elements on a page. I may write one myself if not. Thanks in advance.

Mayank S

There are many open-source HTML parsers:

http://java-source.net/open-source/html-parsers
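
As a quick illustration of the idea, here is a small sketch that pulls the names of form fields out of a page using the HTML parser bundled with the JDK (javax.swing.text.html), which is not one of the libraries on the list above. It is an old, lenient parser aimed at roughly HTML 3.2, so treat this as a starting point under that assumption; the class and method names are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library.
class FormFieldSketch {

    // Returns the name attribute of every INPUT, SELECT and TEXTAREA that has one.
    static List<String> fieldNames(String html) throws IOException {
        List<String> names = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(tag, attrs); // empty elements such as <input>
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(tag, attrs); // container elements such as <select>
            }

            private void collect(HTML.Tag tag, MutableAttributeSet attrs) {
                if (tag == HTML.Tag.INPUT || tag == HTML.Tag.SELECT || tag == HTML.Tag.TEXTAREA) {
                    Object name = attrs.getAttribute(HTML.Attribute.NAME);
                    if (name != null) {
                        names.add(name.toString());
                    }
                }
            }
        };
        // The last argument tells the parser to ignore any charset declaration in the page.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return names;
    }
}
```

Fields without a name attribute (a bare submit button, for example) are skipped, since they would not be sent with the form anyway.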