derekl

Screen Scraping Password Protected Sites

I need to do some screen scraping on a password protected site for which I have a valid login.  I create the URLConnection etc, and perform the POST operation for the URL in question, but the HTML I get back is the contents of the login page to which I have been redirected.  Setting up a PasswordAuthenticator using Authenticator.setDefault() did not help at all.  I am just curious what my general strategy for making this work should be?  I'm guessing I might need to actually perform the login, trap some cookies or something, remember them and then use them in requesting the resource in question.  Does this sound right?  I'm sure it is different depending on site, so any sort of general resource explaining how to do this would be perfect.
radarsh

Hi derekl,

Try using the cookie that will appear in the URL once you log in. See if you can
scrape the screen using that long URL.

________
radarsh
hoomanv

Yes, first check which cookies the site sets. Dump the response headers and look for those with headerName = Set-Cookie:
            for (int i = 1; ; i++) {
                String headerName = con.getHeaderFieldKey(i);
                String headerValue = con.getHeaderField(i);
                if(headerName == null)
                    break;
                System.out.println(headerName + " = " + headerValue);
            }

You can also use the Firefox extension "Live HTTP headers" to grab the cookie that is sent to the site on login.

// to send the cookie back from Java (set this before the connection is opened)
con.setRequestProperty("Cookie", "  value ");
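
The two snippets above can be tied together roughly as follows. This is only a sketch: the class and method names (SessionCookies, toCookieHeader) are made up for illustration, and it assumes the site tracks the login with ordinary Set-Cookie headers. It takes the header map returned by con.getHeaderFields() on the login response and builds the value to pass to setRequestProperty("Cookie", ...) on the next request.

```java
import java.net.HttpCookie;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any library.
class SessionCookies {

    /** Turns the Set-Cookie response headers into one Cookie request header value. */
    static String toCookieHeader(Map<String, List<String>> responseHeaders) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : responseHeaders.entrySet()) {
            // The status line is stored under a null key; header names are case-insensitive.
            if (entry.getKey() == null || !entry.getKey().equalsIgnoreCase("Set-Cookie")) {
                continue;
            }
            for (String headerValue : entry.getValue()) {
                // HttpCookie.parse drops attributes such as Path and keeps name and value.
                for (HttpCookie cookie : HttpCookie.parse(headerValue)) {
                    pairs.add(cookie.getName() + "=" + cookie.getValue());
                }
            }
        }
        return String.join("; ", pairs);
    }
}
```

Usage would look something like con.setRequestProperty("Cookie", SessionCookies.toCookieHeader(loginCon.getHeaderFields())). If the login POST answers with a redirect, it may be necessary to call loginCon.setInstanceFollowRedirects(false) first, since otherwise the headers you read may belong to the page you were redirected to rather than to the login response.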
ASKER CERTIFIED SOLUTION
Mick Barry
derekl

ASKER

Are HttpUnit or HttpClient standard Java APIs?
No, they are third-party libraries, but both are quite commonly used in projects.
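
If a third-party dependency is not wanted, the JDK's own java.net.CookieManager (Java 6 and later) can do the cookie bookkeeping for plain URLConnection code. The sketch below is not the HttpClient approach; it is an alternative using only standard classes. The class and method names are invented for illustration. installGlobally() is what a real scraper would call once at startup; cookiesSentAfterLogin() only simulates a login response offline to show what the manager would send back on a later request.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not part of any library.
class CookieManagerDemo {

    /** After this call, HttpURLConnection stores and resends cookies on its own. */
    static void installGlobally() {
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
    }

    /** Feeds one Set-Cookie header to a manager and returns what it would send to pageUri. */
    static List<String> cookiesSentAfterLogin(String setCookie, URI loginUri, URI pageUri) {
        try {
            CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);

            // Pretend the login response carried this Set-Cookie header.
            Map<String, List<String>> response = new HashMap<>();
            response.put("Set-Cookie", Arrays.asList(setCookie));
            manager.put(loginUri, response);

            // Ask which Cookie header values apply to the protected page.
            Map<String, List<String>> request =
                    manager.get(pageUri, new HashMap<String, List<String>>());
            List<String> cookies = request.get("Cookie");
            return cookies == null ? Collections.<String>emptyList() : cookies;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With installGlobally() in place, the flow would be: POST the login form with an ordinary HttpURLConnection, then request the protected page with a second connection; the session cookie should travel along without any manual header handling.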
derekl

ASKER

HttpClient is exactly what I needed, objects!  Many thanks.

One more question if you don't mind.  Is there a Java package capable of reading in some source HTML and extracting the elements on a page in an OO fashion?  I'm thinking specifically of the FORM elements on a page.  I may write one myself if not.  Thanks in advance.
There are many open-source HTML parsers:

http://java-source.net/open-source/html-parsers
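
For simple pages, the HTML parser that ships with the JDK under javax.swing.text.html may already be enough to pull out form fields. The sketch below is one possible approach, not a recommendation over the libraries in the list above: the parser is old and lenient, so results on modern markup can vary. The class name FormFieldLister is made up for illustration. It collects the name attribute of every INPUT tag it encounters.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of any library.
class FormFieldLister {

    /** Returns the name attribute of each named INPUT element, in document order. */
    static List<String> inputNames(String html) {
        final List<String> names = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // INPUT has no closing tag, so it is reported as a "simple" tag.
                if (tag == HTML.Tag.INPUT) {
                    Object name = attrs.getAttribute(HTML.Attribute.NAME);
                    if (name != null) {
                        names.add(name.toString());
                    }
                }
            }
        };
        try {
            // true = ignore any charset declared inside the document itself
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

The same callback class also has handleStartTag and handleEndTag, which could be overridden to pick up the FORM element itself (its action and method attributes) or SELECT and TEXTAREA fields.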