mcorsi62
asked on
Writing a script to act as a mozilla browser for screen scraping
I have a script which scrapes some files from a website. The site needs a username and password to login. The script ran fine for a while, but now - even though the site recognizes me on the browser when I log in through the cookie - it does not recognize the script as being authorized to view that page.
Here is an excerpt from the script:
use LWP::UserAgent;
use HTTP::Cookies::Netscape;
# Load the cookies saved by the Mozilla profile
my $cookie_jar = HTTP::Cookies::Netscape->new(
file => "C:\\Documents and Settings\\mcorsi.SPEARREPORT\\Application Data\\Mozilla\\Profiles\\default\\wx9k36hh.slt\\cookies.txt",
);
my $browser = LWP::UserAgent->new;
$browser->cookie_jar( $cookie_jar );
my $response = $browser->get("http://www.somewebsite.com");
The response from the script is that I need to be a registered user to view that page even though when I go to the exact same URL with the mozilla browser I receive the correct page displayed.
Any ideas?
ASKER
Yes - it looks like the site is using a second method of verifying that the request came from a browser. By default LWP::UserAgent does not send the headers a browser sends, and it identifies itself as libwww-perl. Can anyone tell me what the default header parameters are that are normally sent from Mozilla?
If you use Firefox, you can use the LiveHTTPHeaders add-on to see all headers sent. I'm not sure about other Mozilla browsers.
It looks like FireFox sends these:
Host, User-Agent, Accept, Accept-Language, Accept-Encoding, Accept-Charset, Keep-Alive, Connection
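For anyone following along, here is a minimal sketch of how those headers might be set on an LWP::UserAgent object. The user agent string is the one from the access log quoted later in this thread; the Accept values are illustrative assumptions, so copy the real ones from LiveHTTPHeaders. Host and Connection are normally filled in by LWP itself, so they are not set here.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Replace the default "libwww-perl/x.y" identification with a browser-style string.
$browser->agent('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.13) Gecko/20060414');

# Assumed example values; substitute whatever your own browser actually sends.
$browser->default_header('Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8');
$browser->default_header('Accept-Language' => 'en-us,en;q=0.5');
$browser->default_header('Accept-Charset'  => 'ISO-8859-1,utf-8;q=0.7,*;q=0.7');

# Every request made through $browser from here on carries these headers.
print $browser->agent, "\n";
```

Accept-Encoding is left out on purpose: if you advertise gzip, you would need to read the body with $response->decoded_content rather than $response->content.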
ASKER
I think this is the issue:
The top line is from the mozilla browser (against my own website), the next line is from the perl script. If anyone knows LWP really well, can you tell me what objects to use to sync these two requests so they look the same?
Mozilla:
71.235.224.80 - - [05/Feb/2008:16:02:20 -0500] "GET /Build/Homepage.php HTTP/1.1" 200 3286 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.13) Gecko/20060414"
Script:
71.235.224.80 - - [05/Feb/2008:16:03:11 -0500] "GET /Build/Homepage.php HTTP/1.1" 200 3286 "-" "libwww-perl/5.806"
ASKER
Ok - I have the two requests (or at least as much of the requests as my webserver logs) looking the same. What other parameters are sent in a request that I might be missing? My Mozilla browser can get onto the webpage through the cookie, but my script cannot. The site must be looking for some other information in the request headers in order to determine that this is a request from a browser. Any ideas?
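Since the access log only records the user agent and referer, one way to see more of what the script sends is to build the request and print it before it goes out. This is a rough sketch, not a definitive method: the URL and header value are placeholders, and some headers (Host, for example) are added later by LWP's protocol layer, so they will not show up here.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $browser = LWP::UserAgent->new;
$browser->agent('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.13) Gecko/20060414');
$browser->default_header('Accept-Language' => 'en-us,en;q=0.5');   # assumed example value

# Build the request without sending it, then let the user agent add its defaults.
my $request  = HTTP::Request->new(GET => 'http://www.somewebsite.com/');
my $prepared = $browser->prepare_request($request);

# Prints the request line plus the headers LWP has attached so far.
print $prepared->as_string;
```

If a cookie jar is attached with $browser->cookie_jar(...), a matching Cookie header should appear in this output too, which makes it easy to compare against what LiveHTTPHeaders shows for the browser.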
ASKER
Thanks Adam - now I just need to see those values. Could you send yours and I will try to extrapolate from there?
ASKER
Nvm Adam - I am going to switch to Firefox - thanks for the help. If you point me down the path to solving this one, I will award you the points! ;-)
The part that is different is the useragent. I'm guessing from your later post you figured that out.
Do you have access to the Homepage.php?
The cookie should be enough for the server to allow the script. I'm not sure what else it would be looking for.
ASKER
Adam - where did you find those request parameters in Firefox?
ASKER
It is looking for other params to ensure it is a browser and not an LWP agent (that is my assumption)
ASKER
Adam - great work! Your Firefox add-on was just the ticket needed to make this script work. Thanks a lot!
If you open the cookies.txt file, do you see the cookie for the site you are going to?
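The same check can be done from Perl, which also shows what the script actually loaded rather than what is in the file. A minimal sketch: cookies_for_domain is a hypothetical helper, and the temporary file with a made-up session_id cookie only stands in for the real cookies.txt so the example runs on its own.

```perl
use strict;
use warnings;
use HTTP::Cookies::Netscape;
use File::Temp qw(tempfile);

# Hypothetical helper: return "name=value" for every loaded cookie
# whose domain contains $fragment.
sub cookies_for_domain {
    my ($file, $fragment) = @_;
    my $jar = HTTP::Cookies::Netscape->new(file => $file);
    my @found;
    $jar->scan(sub {
        my ($version, $key, $val, $path, $domain) = @_;
        push @found, "$key=$val" if index($domain, $fragment) >= 0;
    });
    return @found;
}

# Stand-in for cookies.txt: one made-up cookie that expires in an hour.
my ($fh, $tmp) = tempfile(UNLINK => 1);
print {$fh} "# Netscape HTTP Cookie File\n";
print {$fh} join("\t", '.somewebsite.com', 'TRUE', '/', 'FALSE',
                 time + 3600, 'session_id', 'abc123'), "\n";
close $fh;

my @cookies = cookies_for_domain($tmp, 'somewebsite.com');
print "$_\n" for @cookies;
```

To use it for real, pass the path to the profile's cookies.txt instead of $tmp. If the list comes back empty, the cookie may have expired, or it may be a session cookie that the browser keeps in memory and never writes to the file.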