derekl
Screen Scraping Password Protected Sites
I need to do some screen scraping on a password protected site for which I have a valid login. I create the URLConnection etc, and perform the POST operation for the URL in question, but the HTML I get back is the contents of the login page to which I have been redirected. Setting up a PasswordAuthenticator using Authenticator.setDefault() did not help at all. I am just curious what my general strategy for making this work should be? I'm guessing I might need to actually perform the login, trap some cookies or something, remember them and then use them in requesting the resource in question. Does this sound right? I'm sure it is different depending on site, so any sort of general resource explaining how to do this would be perfect.
Yes. First look at the cookie convention the site uses: print the response headers and look for the ones whose headerName is Set-Cookie.
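// Print every response header. Start at 1: index 0 is typically the status line.
for (int i = 1; ; i++) {
    String headerName = con.getHeaderFieldKey(i);
    String headerValue = con.getHeaderField(i);
    if (headerName == null) {
        break; // no more headers
    }
    System.out.println(headerName + " = " + headerValue);
}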
You can also use the Firefox extension "Live HTTP Headers" to capture the cookie that is sent to the site on login.
// To send the cookie back from Java (set this before connecting)
con.setRequestProperty("Cookie", "value"); // replace value with the captured cookie string
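The two steps above (read the cookies from the login response, then send them back) can be sketched roughly as follows. This is only a minimal illustration, assuming the site issues its session through Set-Cookie response headers; the class and method names (CookieHelper, setCookieValues, buildCookieHeader, copyCookies) are made up for this example and do not come from the thread.

```java
import java.net.HttpURLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any library.
class CookieHelper {

    // Collect all Set-Cookie values from a response header map.
    // Header names are matched case-insensitively; the status line has a null key.
    static List<String> setCookieValues(Map<String, List<String>> headers) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            if (e.getKey() != null && e.getKey().equalsIgnoreCase("Set-Cookie")) {
                out.addAll(e.getValue());
            }
        }
        return out;
    }

    // Reduce each "name=value; Path=/; HttpOnly" to "name=value" and join the
    // pairs with "; ", which is the shape a Cookie request header expects.
    static String buildCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            int semi = value.indexOf(';');
            String pair = (semi >= 0 ? value.substring(0, semi) : value).trim();
            if (!pair.isEmpty()) {
                pairs.add(pair);
            }
        }
        return String.join("; ", pairs);
    }

    // Copy cookies from an already-executed login response onto a request
    // that has not been connected yet.
    static void copyCookies(HttpURLConnection loginResponse, HttpURLConnection nextRequest) {
        String cookie = buildCookieHeader(setCookieValues(loginResponse.getHeaderFields()));
        if (!cookie.isEmpty()) {
            nextRequest.setRequestProperty("Cookie", cookie);
        }
    }
}
```

If the login POST answers with a redirect, the cookie may be set on the redirect response itself, so it can help to call setInstanceFollowRedirects(false) on the login connection before reading its headers.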
ASKER
Are HttpUnit or HttpClient standard Java APIs?
No
HttpClient: http://jakarta.apache.org/commons/httpclient/
HttpUnit: http://httpunit.sourceforge.net/
They are 3rd party, but quite commonly used in projects.
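For anyone who would rather stay inside the standard library, java.net.CookieManager (Java 6 and later) can remember cookies between URLConnection requests. The sketch below is not what was recommended in this thread; it is an alternative under the assumption that the site uses an ordinary form login with cookies. SessionScraper, installCookieManager and loginAndFetch are hypothetical names, and the URLs and form body are whatever the target site expects.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class SessionScraper {

    // Install a JVM-wide cookie store that HttpURLConnection consults.
    static CookieManager installCookieManager() {
        CookieManager manager = new CookieManager();
        CookieHandler.setDefault(manager);
        return manager;
    }

    // POST the login form, then GET the protected page. Assumes
    // installCookieManager() was called first so cookies carry over.
    static String loginAndFetch(String loginUrl, String formBody, String pageUrl)
            throws IOException {
        HttpURLConnection login = (HttpURLConnection) new URL(loginUrl).openConnection();
        login.setRequestMethod("POST");
        login.setDoOutput(true);
        login.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = login.getOutputStream()) {
            out.write(formBody.getBytes(StandardCharsets.UTF_8));
        }
        login.getResponseCode(); // forces the request to be sent

        HttpURLConnection page = (HttpURLConnection) new URL(pageUrl).openConnection();
        try (InputStream in = page.getInputStream()) {
            // Read the whole response body into memory.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

The form body would be something like "username=...&password=..." with the field names taken from the site's login form, URL-encoded.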
ASKER
HttpClient is exactly what I needed, objects! Many thanks.
One more question, if you don't mind. Is there a Java package capable of reading in some source HTML and extracting the elements on a page in an OO fashion? I'm thinking specifically of the FORM elements on a page. I may write one myself if not. Thanks in advance.
Try using the session value (the cookie) that appears in the URL once you log in. See if you can
scrape the page using that long URL.
________
radarsh