Writing a script to act as a Mozilla browser for screen scraping

I have a script which scrapes some files from a website. The site needs a username and password to log in. The script ran fine for a while, but now, even though the site recognizes me in the browser when I log in through the cookie, it does not recognize the script as being authorized to view that page.

Here is an excerpt from the script:

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies::Netscape;
# Load the cookies saved by the Mozilla profile
my $cookie_jar = HTTP::Cookies::Netscape->new(
    file => "C:\\Documents and Settings\\mcorsi.SPEARREPORT\\Application Data\\Mozilla\\Profiles\\default\\wx9k36hh.slt\\cookies.txt",
);
my $browser = LWP::UserAgent->new;
$browser->cookie_jar($cookie_jar);
my $response = $browser->get("http://www.somewebsite.com");


The response to the script says that I need to be a registered user to view that page, even though when I go to the exact same URL with the Mozilla browser, the correct page is displayed.

Any ideas?
mcorsi62Asked:
 
Adam314Commented:
You need to install the add-on from here: https://addons.mozilla.org/en-US/firefox/addon/3829 (click "Install Now").
Then, under the Tools menu, you will get a Live HTTP Headers option. Select that.
This opens a new window. Click on the Headers tab.
Once it is open, the headers will show up there whenever you go to a page.
The format is:
    First, a line stating the method (e.g. GET or POST) followed by the page
    Then all of the headers the browser sent, and their values
    Then a blank line
    Then the status from the web server
    Then the headers sent by the web server, and their values
    Then a thin line
This repeats for every page you go to.
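To compare the browser's headers with what the script sends, it may help to print the request as LWP would prepare it. This is only a sketch: it uses prepare_request, which fills in the headers LWP::UserAgent adds itself without contacting the server, so headers added later at the connection level may not appear. The URL is the placeholder from the question.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $browser = LWP::UserAgent->new;
my $request = HTTP::Request->new( GET => 'http://www.somewebsite.com/' );

# prepare_request adds the headers the user agent is configured to send
# (User-Agent, any default headers, cookies from an attached jar).
# It does not send anything over the network.
$browser->prepare_request($request);

# Print the request line and headers for a side-by-side comparison
# with the Live HTTP Headers output.
print $request->as_string;
```

Any header that shows up in the browser's list but not in this output is a candidate for what the site is checking.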
 
Adam314Commented:
Are you sure the cookie file you reference is the one the Mozilla browser is using?
If you open the cookies.txt file, do you see the cookie for the site you are going to?
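One way to check this from Perl is to scan the cookie jar for entries whose domain matches the site. This is a minimal sketch: cookies_for_site is a hypothetical helper, and the file name and site name are placeholders. Keep in mind that cookies.txt normally holds only persistent cookies, so a session-only login cookie would not show up here.

```perl
use strict;
use warnings;
use HTTP::Cookies::Netscape;

# Hypothetical helper: list the cookies in $jar whose domain ends with $site.
sub cookies_for_site {
    my ( $jar, $site ) = @_;
    my @found;
    # scan() calls the callback once per cookie; the first five arguments
    # are version, key, value, path and domain.
    $jar->scan( sub {
        my ( $version, $key, $val, $path, $domain ) = @_;
        push @found, "$domain$path $key=$val" if $domain =~ /\Q$site\E$/i;
    } );
    return @found;
}

# Assumed path: point this at the cookies.txt from your own profile.
my $jar = HTTP::Cookies::Netscape->new( file => 'cookies.txt' );
print "$_\n" for cookies_for_site( $jar, 'somewebsite.com' );
```

If nothing is printed, the jar the script loads has no cookie for that site, and no header tweak will help until that is fixed.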
 
mcorsi62Author Commented:
Yes - it looks like the site is using a second method of verifying that the request came from a browser. By default, LWP::UserAgent does not send the headers a browser normally sends. Can anyone tell me what the default header parameters are that are normally sent from Mozilla?
 
Adam314Commented:
If you use Firefox, you can use the LiveHTTPHeaders add-on to see all headers sent. I'm not sure about other Mozilla browsers.
It looks like Firefox sends these:
Host, User-Agent, Accept, Accept-Language, Accept-Encoding, Accept-Charset, Keep-Alive, Connection
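As a rough sketch, several of these can be set on the LWP::UserAgent object with agent and default_header. The User-Agent string below is copied from the asker's Mozilla access-log entry in this thread. The Accept, Accept-Language and Accept-Charset values are illustrative assumptions, so replace them with whatever LiveHTTPHeaders shows for your browser. make_browser is a hypothetical helper name.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: build a user agent that sends browser-like headers.
sub make_browser {
    # keep_alive => 1 asks LWP to reuse connections where possible.
    my $browser = LWP::UserAgent->new( keep_alive => 1 );

    # Replace the default "libwww-perl/x.y" identification.
    $browser->agent('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.13) Gecko/20060414');

    # Assumed values; copy the real ones from LiveHTTPHeaders.
    $browser->default_header( 'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' );
    $browser->default_header( 'Accept-Language' => 'en-us,en;q=0.5' );
    $browser->default_header( 'Accept-Charset'  => 'ISO-8859-1,utf-8;q=0.7,*;q=0.7' );

    return $browser;
}

my $browser = make_browser();
```

Host is filled in by LWP from the URL. Accept-Encoding is left out here on purpose: if the script advertises gzip, the body should be read with decoded_content rather than content. After this, attach the cookie jar and call get as in the original script.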
 
mcorsi62Author Commented:
I think this is the issue:

The first line is from the Mozilla browser (against my own website), and the next line is from the Perl script. If anyone knows LWP really well, can you tell me what objects to use to sync these two requests so they look the same?
Mozilla:
71.235.224.80 - - [05/Feb/2008:16:02:20 -0500] "GET /Build/Homepage.php HTTP/1.1" 200 3286 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.13) Gecko/20060414"
Script:
71.235.224.80 - - [05/Feb/2008:16:03:11 -0500] "GET /Build/Homepage.php HTTP/1.1" 200 3286 "-" "libwww-perl/5.806"
 
mcorsi62Author Commented:
OK - I have the two requests (or at least as much of the requests as my web server logs) looking the same. What other parameters are sent in a request that I might be missing? My Mozilla browser can get onto the webpage through the cookie, but my script cannot. The site must be looking for some other information in the request headers in order to determine that this is a request from a browser. Any ideas?
 
mcorsi62Author Commented:
Thanks, Adam - now I just need to see those values. Could you send yours, and I will try to extrapolate from there?
 
mcorsi62Author Commented:
Never mind, Adam - I am going to switch to Firefox. Thanks for the help. If you pointed me down the path to solving this one, I will award you the points! ;-)
 
Adam314Commented:
The part that is different is the User-Agent. I'm guessing from your later post that you figured that out.
Do you have access to the Homepage.php?

The cookie should be enough for the server to allow the script.  I'm not sure what else it would be looking for.
 
mcorsi62Author Commented:
Adam - where did you find those request parameters in Firefox?
 
mcorsi62Author Commented:
It is looking for other parameters to ensure it is a browser and not an LWP agent (that is my assumption).
 
mcorsi62Author Commented:
Adam - great work! Your Firefox add-on was just the ticket needed to make this script work. Thanks a lot!
Question has a verified solution.
