Writing a script to act as a Mozilla browser for screen scraping

I have a script that scrapes some files from a website. The site requires a username and password to log in. The script ran fine for a while, but now the site no longer recognizes the script as authorized to view the page, even though it still recognizes me in the browser when I log in through the cookie.

Here is an excerpt from the script:

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies::Netscape;
# Load the cookies saved by the Mozilla profile
my $cookie_jar = HTTP::Cookies::Netscape->new(
    file => "C:\\Documents and Settings\\mcorsi.SPEARREPORT\\Application Data\\Mozilla\\Profiles\\default\\wx9k36hh.slt\\cookies.txt",
);
my $browser = LWP::UserAgent->new;
$browser->cookie_jar( $cookie_jar );
my $response = $browser->get("http://www.somewebsite.com");


The response to the script says that I need to be a registered user to view that page, even though the Mozilla browser displays the correct page when I go to the exact same URL.

Any ideas?
mcorsi62 Asked:
Adam314 Commented:
Are you sure the cookie file you reference is the one the Mozilla browser is using?
If you open the cookies.txt file, do you see the cookie for the site you are going to?
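The same check can be made from Perl rather than by opening the file by hand. The following is only a sketch: `cookies_for_site` is a hypothetical helper name, and the domain fragment passed to it is a placeholder. It loads a Netscape-format cookies.txt and lists the unexpired cookies whose domain contains the given string.

```perl
use strict;
use warnings;
use HTTP::Cookies::Netscape;

# Hypothetical helper: return "name=value" strings for every cookie in
# $file whose domain contains $site (for example "somewebsite.com").
sub cookies_for_site {
    my ($file, $site) = @_;
    my $jar = HTTP::Cookies::Netscape->new( file => $file );
    my @found;
    # scan() calls the callback once per stored cookie
    $jar->scan(sub {
        my ($version, $key, $val, $path, $domain) = @_;
        push @found, "$key=$val" if index($domain, $site) >= 0;
    });
    return @found;
}

# Usage (path and domain are placeholders):
# print "$_\n" for cookies_for_site('cookies.txt', 'somewebsite.com');
```

An empty list would suggest that the file is not the one the browser writes to, or that the cookie has already expired.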
mcorsi62 Author Commented:
Yes - it looks like the site is using a second method of verifying that the request came from a browser. By default LWP::UserAgent does not send the same headers a browser does. Can anyone tell me what the default header parameters are that Mozilla normally sends?
Adam314 Commented:
If you use Firefox, you can use the LiveHTTPHeaders add-on to see all headers sent. I'm not sure about other Mozilla browsers.
It looks like Firefox sends these:
Host, User-Agent, Accept, Accept-Language, Accept-Encoding, Accept-Charset, Keep-Alive, Connection
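Some of these headers can be added to every request the script makes with the `default_header` method of LWP::UserAgent. This is a minimal sketch. The header values are illustrative assumptions, not values captured from the site in question, and should be replaced with whatever your own browser actually sends.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Assumed, illustrative values: copy the real ones from your own browser.
$browser->default_header( 'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' );
$browser->default_header( 'Accept-Language' => 'en-us,en;q=0.5' );
$browser->default_header( 'Accept-Charset'  => 'ISO-8859-1,utf-8;q=0.7,*;q=0.7' );

# Every later $browser->get(...) will now carry these headers.
```

Accept-Encoding is left out here on purpose: advertising gzip means the body may come back compressed, in which case `$response->decoded_content` would be needed instead of `$response->content`.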
mcorsi62 Author Commented:
I think this is the issue:

The first log line is from the Mozilla browser (against my own website), the second is from the Perl script. If anyone knows LWP really well, can you tell me what objects or methods to use to make these two requests look the same?
Mozilla:
71.235.224.80 - - [05/Feb/2008:16:02:20 -0500] "GET /Build/Homepage.php HTTP/1.1" 200 3286 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.13) Gecko/20060414"
Script:
71.235.224.80 - - [05/Feb/2008:16:03:11 -0500] "GET /Build/Homepage.php HTTP/1.1" 200 3286 "-" "libwww-perl/5.806"
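The only field that differs between the two log entries is the last one, the User-Agent string. One way to make the script report the same string is the `agent` method of LWP::UserAgent. The sketch below simply reuses the browser string from the first log line.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Replace the default "libwww-perl/x.y" identification with the
# string the Mozilla browser reported in the access log above.
$browser->agent('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.13) Gecko/20060414');
```

This changes only the User-Agent header. Any other headers the site checks would still need to be set separately.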
mcorsi62 Author Commented:
OK - I have the two requests (or at least as much of the requests as my web server logs) looking the same. What other parameters are sent in a request that I might be missing? My Mozilla browser can get onto the webpage through the cookie, but my script cannot. The site must be looking for some other information in the request headers in order to determine that this is a request from a browser. Any ideas?
mcorsi62 Author Commented:
Thanks Adam - now I just need to see those values. Could you send yours, and I will try to extrapolate from there?
mcorsi62 Author Commented:
Never mind, Adam - I am going to switch to Firefox. Thanks for the help. If you have pointed me down the path to solving this one, I will award you the points! ;-)
Adam314 Commented:
The part that is different is the User-Agent. I'm guessing from your later post that you figured that out.
Do you have access to Homepage.php?

The cookie should be enough for the server to allow the script. I'm not sure what else it would be looking for.
mcorsi62 Author Commented:
Adam - where did you find those request parameters in Firefox?
mcorsi62 Author Commented:
It is looking for other parameters to ensure the request comes from a browser and not an LWP agent (that is my assumption).
Adam314 Commented:
You need to install the add-on from here: https://addons.mozilla.org/en-US/firefox/addon/3829 (click Install Now).
Then, under the Tools menu, you will get a Live HTTP Headers option; select it.
This will open a new window. Click on the Headers tab.
Once it is open, the headers will show up there whenever you go to a page.
The format is like this:
    First a line stating method (eg: GET or POST) followed by the page
    Then there will be all of the headers the browser sent, and their values
    Then a blank line
    Then the status from the webserver
    Then the headers sent by the webserver, and their values
    Then a thin line
This repeats for every page you go to.  
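To compare the script's side against what Live HTTP Headers shows, the request can be built and printed without sending it. This is a sketch under assumptions: the URL is the placeholder from the question, and `prepare_request` only fills in headers the user agent object knows about (agent string, default headers, and cookies if a cookie jar is attached). Headers that LWP adds only when the request actually goes out, such as Host, are not expected to appear in this output.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $browser = LWP::UserAgent->new;
$browser->agent('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.13) Gecko/20060414');

# Build the request without sending it, then let the user agent
# fill in the headers it would normally add.
my $request = HTTP::Request->new( GET => 'http://www.somewebsite.com/' );
$browser->prepare_request($request);

# Request line followed by the headers, one per line
my $dump = $request->as_string;
print $dump;
```

Putting this output next to the browser's capture makes it easier to spot which headers the script is still missing.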
mcorsi62 Author Commented:
Adam - great work! Your Firefox add-on was just the ticket I needed to make this script work. Thanks a lot!