mcorsi62

asked on

Writing a script to act as a mozilla browser for screen scraping

I have a script that scrapes some files from a website. The site requires a username and password to log in. The script ran fine for a while, but now the site no longer recognizes the script as authorized to view the page, even though it still recognizes me in the browser when I log in through the cookie.

Here is an excerpt from the script:

use HTTP::Cookies::Netscape;
use LWP::UserAgent;

# Reuse the cookies saved by the Mozilla profile
my $cookie_jar = HTTP::Cookies::Netscape->new(
    file => "C:\\Documents and Settings\\mcorsi.SPEARREPORT\\Application Data\\Mozilla\\Profiles\\default\\wx9k36hh.slt\\cookies.txt",
);
my $browser = LWP::UserAgent->new;
$browser->cookie_jar( $cookie_jar );
my $response = $browser->get("http://www.somewebsite.com");


The script's response says that I need to be a registered user to view that page, even though the exact same URL displays the correct page in the Mozilla browser.

Any ideas?
Adam314

Are you sure the cookie file you reference is the one the mozilla browser is using?
If you open the cookies.txt file, do you see the cookie for the site you are going to?
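One way to check both points without reading the file by hand is to load it the same way the script does and list what LWP actually parsed. This is only a minimal sketch, assuming the file is in the Netscape cookies.txt format; `cookies_for_domain` is a hypothetical helper name, and the file path and domain are passed on the command line rather than taken from the thread.

```perl
use strict;
use warnings;
use HTTP::Cookies::Netscape;

# Hypothetical helper: return "domain path name=value" strings for every
# cookie loaded from $file whose domain contains $fragment.
sub cookies_for_domain {
    my ($file, $fragment) = @_;
    my $jar = HTTP::Cookies::Netscape->new(file => $file);
    my @matches;
    # scan() calls the callback once for each cookie held in the jar.
    $jar->scan(sub {
        my ($version, $key, $val, $path, $domain) = @_;
        push @matches, "$domain $path $key=$val"
            if index($domain, $fragment) >= 0;
    });
    return @matches;
}

# Usage: perl check_cookies.pl /path/to/cookies.txt somewebsite.com
if (@ARGV == 2) {
    my @found = cookies_for_domain(@ARGV);
    print(@found ? join("\n", @found) . "\n"
                 : "No cookies found for $ARGV[1]\n");
}
```

An empty result does not necessarily mean the browser has no cookie for the site: as far as I can tell, cookies whose expiry time has already passed are dropped when the file is loaded, so a stale login cookie would also show up as "not found".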

ASKER

Yes. It looks like the site is using a second method to verify that the request came from a browser. By default, LWP::UserAgent does not send the same set of headers that a browser does. Can anyone tell me what the default header parameters are that Mozilla normally sends?
If you use Firefox, you can use the LiveHTTPHeaders add-on to see all headers sent. I'm not sure about other Mozilla browsers.
It looks like Firefox sends these:
Host, User-Agent, Accept, Accept-Language, Accept-Encoding, Accept-Charset, Keep-Alive, Connection
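For reference, some of the headers in that list can be attached to every request the script makes with `default_header`. The sketch below is an assumption-laden example, not values captured from a real browser: the `%browser_headers` values and the `browser_like_agent` helper name are made up for illustration and should be replaced with whatever your own browser is seen to send. Host is left out because LWP derives it from the URL, and User-Agent is set separately through the `agent` method.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Illustrative values only (assumptions); copy the real ones from the
# headers your own browser sends.
my %browser_headers = (
    'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' => 'en-us,en;q=0.5',
    'Accept-Charset'  => 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
);

# Hypothetical helper: build a user agent that adds the headers above
# to every request it sends.
sub browser_like_agent {
    my $ua = LWP::UserAgent->new;
    $ua->default_header($_ => $browser_headers{$_}) for keys %browser_headers;
    return $ua;
}

my $browser = browser_like_agent();

# Show the defaults that will be added to each request.
print $browser->default_headers->as_string;
```

If you also add an Accept-Encoding header, the server may reply with a compressed body, in which case reading it through `$response->decoded_content` rather than `$response->content` is the safer choice.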
I think this is the issue:

The top line is from the Mozilla browser (against my own website); the next line is from the Perl script. If anyone knows LWP really well, can you tell me which settings to use so that these two requests look the same?
Mozilla:
71.235.224.80 - - [05/Feb/2008:16:02:20 -0500] "GET /Build/Homepage.php HTTP/1.1" 200 3286 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.13) Gecko/20060414"
Script:
71.235.224.80 - - [05/Feb/2008:16:03:11 -0500] "GET /Build/Homepage.php HTTP/1.1" 200 3286 "-" "libwww-perl/5.806"
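The only field that differs between the two log lines is the last one, the User-Agent string. A minimal sketch of making the script report the same string as the browser, reusing the value from the Mozilla log line above (the variable name `$mozilla_agent` is just for illustration):

```perl
use strict;
use warnings;
use LWP::UserAgent;

# User-Agent string copied from the Mozilla log line in this thread.
my $mozilla_agent =
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.13) Gecko/20060414';

# Pass the string at construction time; $browser->agent($mozilla_agent)
# on an existing object has the same effect.
my $browser = LWP::UserAgent->new(agent => $mozilla_agent);

print "Script now identifies as: ", $browser->agent, "\n";
```

This changes only what the access log shows in that final field; any other headers the site checks still have to be set separately.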
OK, I have the two requests (or at least as much of the requests as my web server logs) looking the same. What other parameters are sent in a request that I might be missing? My Mozilla browser can get onto the webpage through the cookie, but my script cannot. The site must be looking for some other information in the request headers to determine that the request comes from a browser. Any ideas?
Thanks, Adam. Now I just need to see those values. Could you send yours so I can extrapolate from there?
Never mind, Adam. I am going to switch to Firefox. Thanks for the help. If you pointed me down the path to solving this one, I will award you the points! ;-)
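Since the web server log records only part of the request, it can help to print the headers LWP would attach before anything goes over the wire and compare them with what the browser sends. The following is a rough sketch under a few assumptions: it uses `prepare_request` from LWP::UserAgent to fill in the request without sending it, the URL is the placeholder from the original question, and headers that LWP adds later at the protocol level (Host, as far as I can tell) will not appear in the output.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $browser = LWP::UserAgent->new;

# Build the request object but do not send it.
my $request = HTTP::Request->new(GET => 'http://www.somewebsite.com/');

# prepare_request() applies the agent string, any default headers and,
# when a cookie jar is attached, the matching Cookie header.
$browser->prepare_request($request);

# Print the request line and headers for comparison with the browser.
print $request->as_string;
```

Attaching the same cookie jar and agent string as the real script before calling `prepare_request` should show whether the login cookie is actually being added for that URL.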
The part that is different is the user agent. I'm guessing from your later post that you figured that out.
Do you have access to the Homepage.php?

The cookie should be enough for the server to allow the script.  I'm not sure what else it would be looking for.
Adam, where did you find those request parameters in Firefox?
It is looking for other parameters to ensure the request comes from a browser and not an LWP agent (that is my assumption).
ASKER CERTIFIED SOLUTION
Adam314

Adam, great work! Your Firefox add-on was just the ticket needed to make this script work. Thanks a lot!