GorGor1
asked on
copy text from source on a web page
This is not going to be easy to explain, and it will probably be challenging (if it's even possible) even for an expert Perl programmer, hence the high point value. I should add that I'm a Perl idiot, so even if I get the code from here, I won't easily be able to decipher how it works. What I need to do, with Perl, is open a web page, follow a link on that page to a second page, read specific text on the second page, and then write the copied text to a file. More specifically:
1) I need to open a specific web page. It is password protected, but I have an account, so I can put my username:password in the URL.
2) Once I'm on page 1, there is a link on that page that I need to follow (please tell me this is possible). Page 2 is on a different web server from page 1, and it can only be accessed by following the link from page 1. Getting to page 2 any other way won't work.
3) Once I'm on page 2 (if the above is even possible), I need to read its source code and copy two pieces of text (variables) from it. The code for page 2 will be in the format:
<b>Username: username1</b> and <b>Password: password1</b>
I need to copy the text "username1" and the text "password1" and write them to a *.js file. The *.js file has a specific format that I need to specify in the Perl file, with "username1" and "password1" inserted in the appropriate places. I also need to know how to insert the copied text where I need it.
3) Finally, I need to run this Perl file automatically every few hours. I am running Red Hat Linux 6.x, and I am unsure where to place the file and how to tell Linux to run it every specified number of minutes or hours.
I don't know if all this is possible using Perl or any other language, but it IS possible to do by hand (using Internet Explorer, my brain, my fingers, and a text editor), so one way or another it has to be possible to automate the process.
If it isn't possible with Perl, I am open to any suggestions on how I can do this (a combination of JavaScript and some other program or programming language). I'm putting up a lot of points in the hope of getting a good answer. Please remember that I'm not too good at programming and I'll probably need some hand-holding through this process. Thank you!
Tim
Use LWP::Simple to get the first page, like this:
use LWP::Simple;
$_ = get("http://www.you.com/dir?params");
Parse $_ to find your link (I assume there is something consistent about it, such as its title).
Get page 2 the same way, and parse it to extract the username and password.
Write the .js file.
Run "man crontab" for information on how to run cron jobs. Usually it is "crontab < mycron", where mycron is the schedule file.
ASKER
I don't understand most of what you guys are talking about. I don't know how to write much code in Perl and I have no idea of the format for these commands you are giving me.
The link on page1 that goes to page2 will have the source code [<A HREF="http://www.website.com/" TARGET="new"><IMG SRC="images/image.gif" WIDTH="120" HEIGHT="22" BORDER="0"></A><br>]
Please give me more specific examples of how to do what I need to do. Thanks!
Tim
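One way to handle the "find the link" step is a small regular expression over the page source. The sketch below is only an illustration, not something posted in the thread: the helper name find_link_by_image and the sample markup are assumptions, it presumes the link always wraps the same image (images/image.gif in the snippet above), and it is written for Perl 5.6 or later. A real HTML parser would be more robust if the markup varies.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the HREF of the first <A> tag that directly
# wraps an <IMG> whose SRC equals $image, or nothing if there is no such link.
sub find_link_by_image {
    my ($html, $image) = @_;
    if ($html =~ m{<a\s[^>]*href="([^"]+)"[^>]*>\s*<img\s[^>]*src="\Q$image\E"}i) {
        return $1;
    }
    return;
}

# Sample markup in the same shape as the link on page 1.
my $sample = '<A HREF="http://www.website.com/" TARGET="new">'
           . '<IMG SRC="images/image.gif" WIDTH="120" HEIGHT="22" BORDER="0"></A><br>';

my $link = find_link_by_image($sample, 'images/image.gif');
print defined $link ? "$link\n" : "link not found\n";
```

In a real script, $sample would be the source of page 1 as fetched with LWP.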
If page2 is on a different server and the link looks like what you've posted above then page2 is simply checking the HTTP_REFERER value passed in the header. You should be able to fake the header variable and go directly to page 2.
Here's sample code:
use HTTP::Request::Common qw(GET);
use LWP::UserAgent;
$ua = new LWP::UserAgent;
$req = GET 'http://www.page2.com';
$req->push_header(HTTP_REFERER => 'http://www.page1.com/somepage.html');
$page = $ua->request($req)->as_string;
($username) = $page =~ <b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ <b>Password:\s(.*?)<\/b>/i;
print "$username\n";
print "$password\n";
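The extraction step can be tried on its own, without any network access. This is a minimal sketch assuming Perl 5.6 or later and the page 2 format given in the question. The helper name extract_credentials and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the two values out of page 2's HTML.
# Returns (username, password); either is undef if its pattern is absent.
sub extract_credentials {
    my ($page) = @_;
    my ($username) = $page =~ m{<b>Username:\s*(.*?)</b>}i;
    my ($password) = $page =~ m{<b>Password:\s*(.*?)</b>}i;
    return ($username, $password);
}

my $sample = '<b>Username: username1</b> and <b>Password: password1</b>';
my ($user, $pass) = extract_credentials($sample);
print "username=$user password=$pass\n";   # prints: username=username1 password=password1
```

Checking whether the returned values are defined is a quick way to tell a page in the expected format from something else, such as an error page.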
ASKER
Let me see if I understand the code above. It SHOULD go directly to page2 by passing the referer to page2's server. It will read the 'username1' and 'password1' variables (by looking for the appropriate 'keywords' on the page) and then print them. I'm not too clear on the print commands. Where is it printing them? To the shell? If this is what it is doing and everything above the print commands works, how would I open a *.js file and print them to the *.js file (with text around them) instead of to the shell? Thanks again!
Tim
You need to create a template javascript file explicitly defining where you want the username and password to go. You then read the template, replace the sections you need to replace and write out the final file.
E.g.,
template.js
-----------
<script language="javascript">
document.write("--perl script will insert username here--");
document.write("--perl script will insert password here--");
</script>
Here's the perl script to create a file based on the template above. It would replace the print commands I posted earlier.
open(TEMPLATE, "< template.js") or die "Can't read template.js: $!";
open(FINAL, "> myscript.js") or die "Can't write myscript.js: $!";
while (<TEMPLATE>) {
s/--perl script will insert username here--/$username/g;
s/--perl script will insert password here--/$password/g;
print FINAL $_;
}
close TEMPLATE;
close FINAL;
So putting it all together:
use HTTP::Request::Common qw(GET);
use LWP::UserAgent;
$ua = new LWP::UserAgent;
$req = GET 'http://www.page2.com';
$req->push_header(HTTP_REFERER => 'http://www.page1.com/somepage.html');
$page = $ua->request($req)->as_string;
($username) = $page =~ <b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ <b>Password:\s(.*?)<\/b>/i;
open(TEMPLATE, "< template.js") or die "Can't read template.js: $!";
open(FINAL, "> myscript.js") or die "Can't write myscript.js: $!";
while (<TEMPLATE>) {
s/--perl script will insert username here--/$username/g;
s/--perl script will insert password here--/$password/g;
print FINAL $_;
}
close TEMPLATE;
close FINAL;
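The template step can also be wrapped in a function that checks for file errors. This is a sketch rather than the poster's code: the name fill_template is hypothetical, and it uses three-argument open with lexical filehandles, which need Perl 5.6 or later (newer than the 5.005 that shows up later in this thread).

```perl
use strict;
use warnings;

# Hypothetical helper: copy $template_path to $output_path, replacing the two
# placeholder strings from the template above with the given values.
sub fill_template {
    my ($template_path, $output_path, $username, $password) = @_;

    open(my $in,  '<', $template_path) or die "Cannot read $template_path: $!";
    open(my $out, '>', $output_path)   or die "Cannot write $output_path: $!";

    while (my $line = <$in>) {
        # \Q...\E makes the placeholder text match literally.
        $line =~ s/\Q--perl script will insert username here--\E/$username/g;
        $line =~ s/\Q--perl script will insert password here--\E/$password/g;
        print {$out} $line;
    }

    close($in);
    close($out) or die "Cannot finish writing $output_path: $!";
    return;
}
```

It would be called as fill_template('template.js', 'myscript.js', $username, $password) once both values have been extracted. Dying on a failed open means the script stops with a message if the template is missing.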
ASKER
I'm getting the following error when I try to run the script:
Can't locate HTTP/Request/Common.pm in @INC <@INC contains: /usr/lib/perl5/5.00503/i386-linux /usr/lib/perl5/5.00503 /usr/lib/perl5/site_perl/5.005/i386-linux /usr/lib/perl5/site_perl/5.005 .> at getuserpass.pl line 3.
BEGIN failed--compilation aborted at getuserpass.pl line 3.
Where this is my script:
#!/usr/bin/perl
use HTTP::Request::Common qw(GET);
use LWP::UserAgent;
$ua = new LWP::UserAgent;
$req = GET 'http://www.site2.com/';
$req->push_header(HTTP_REFERER => 'http://www.site1.com/');
$page = $ua->request($req)->as_string;
($username) = $page =~ <b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ <b>Password:\s(.*?)<\/b>/i;
print "$username\n";
print "$password\n";
exit;
See any reason why I'm getting this error?
You don't have the libwww module installed. Here's a link for the CPAN bundle:
http://theoryx5.uwinnipeg.ca/mod_perl/cpan-search?modinfo=451
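Before and after installing, a short script can confirm whether Perl is able to load the modules the sample code needs. This is a rough sketch assuming Perl 5.6 or later. The helper name missing_modules is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names from the argument list that cannot be loaded.
sub missing_modules {
    my @modules = @_;
    my @missing;
    for my $module (@modules) {
        (my $file = "$module.pm") =~ s{::}{/}g;   # HTTP::Request -> HTTP/Request.pm
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules('LWP::UserAgent', 'HTTP::Request::Common');
print @missing ? "Missing: @missing\n" : "All required modules are installed\n";
```

If anything is reported missing, installing libwww-perl from CPAN together with its dependencies should provide both modules.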
ASKER
OK, I installed about 6 packages, so now my Perl is up to date. When I tried to run the code above (no need to paste it here again), I got an error (I'm still just trying to get the username and password printed to the shell before I go any further):
Bareword found where operator expected at getuserpass.pl line 14, near "<b>Username"
<Missing operator before Username?>
syntax error at getuserpass.pl line 14, near "<b>Username"
Bareword found where operator expected at getuserpass.pl line 15, near "<b>Password"
<Missing operator before Password?>
syntax error at getuserpass.pl line 15, near "<b>Password"
Execution of getuserpass.pl aborted due to compilation errors.
Hmmmm....any ideas?
Sorry, it was a typo. I didn't have access to a copy and paste on the system I was on and made a mistake transferring my script to EE.
The line is missing a /.
#!/usr/bin/perl
use HTTP::Request::Common qw(GET);
use LWP::UserAgent;
$ua = new LWP::UserAgent;
$req = GET 'http://www.site2.com/';
$req->push_header(HTTP_REFERER => 'http://www.site1.com/');
$page = $ua->request($req)->as_string;
($username) = $page =~ /<b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ /<b>Password:\s(.*?)<\/b>/i;
print "$username\n";
print "$password\n";
ASKER
I fixed it and tried it again. Nothing is being printed to the shell. How do I troubleshoot and test to make sure that I am getting a connection to "http://www.site2.com"? Thanks again!
Add a print $page.
print "$page\n\n";
print "$username\n";
print "$password\n";
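Beyond printing the page, it can help to look at the response status before parsing anything. The following is a minimal sketch assuming Perl 5.6 or later. The helper name describe_response is hypothetical, and it expects the HTTP::Response object that $ua->request($req) returns.

```perl
use strict;
use warnings;

# Hypothetical helper: summarise a response object for debugging.
# $res is expected to be an HTTP::Response; only its is_success,
# status_line and content methods are used.
sub describe_response {
    my ($res) = @_;
    return "Request failed: " . $res->status_line unless $res->is_success;
    my $body = $res->content;
    return "Request succeeded but the body is empty" unless defined $body && length $body;
    return "Request succeeded, " . length($body) . " bytes received";
}
```

In the script it would be used as print describe_response($ua->request($req)), "\n";. Keep in mind that a server may return an error page with a success status, so the body still needs to be inspected.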
ASKER
Here's where I've gotten. If I go to http://site1.com/ and go to http://site2.com/ manually, I can view page2. However, when I run the perl script, it appears that the HTTP referer site1.com is not working since I get the "no authorization" page instead of the page I'm looking for.
Do you have the full URL of the page listed as the referer:
http://www.site1.com/default.html?anyparameters=whatever
It should be the full URL.
I realize you don't want to give out specific URLs and passwords, but without them it could be very difficult to get this to work. The fact that you don't know perl well isn't going to make it easier. :-)
ASKER
The link that brings me from http://www.site1.com/members/page.html to site2.com is simply "http://www.site2.com/". Site2's default index page is a cgi script that checks to make sure the referer is valid. If not valid, it returns a page "Not Authorized".
Is it possible to "click" on a link on a page using perl? I know it sounds funny, but there's no other way to ask it. I can read the source of the page that contains the link to http://www.site2.com/ and then read the link as a variable. I'm just wondering if there's any way to follow links with perl. I'm going to work on it more tomorrow. Any other ideas? Sorry that I'm being a total pain, but I really need this script to work. Thanks again.
ASKER
I forgot to add that if the referer IS valid, then a page loads that posts a username and password that will work in the future. This username/password changes a couple times/day.
P.S. Please don't give up on me, we're getting very close. I can feel it :o) Thanks!
In a browser, when you click on an external link (a link to a site not within the domain of the current page), the only information that should be transmitted to the receiving page is HTTP_REFERER. Adding the HTTP_REFERER header should be equivalent to "clicking" on a link to an outside domain.
If the sites were within the same domain, it could be looking for a cookie-- but you said the sites were not under the same domain.
Authentication information shouldn't transfer across domains.
I really don't think you've got the referer right. But I'm just guessing. Without specifics, it's going to be very hard to figure out exactly what it's looking for.
In the example you posted, the referer would be:
$req->push_header(HTTP_REFERER => 'http://www.site1.com/members/page.html');
ASKER
Clockwatcher....If you give me your e-mail address, I will give you all the information that is needed to get this to work. I don't want to post it here. After everything works, I will post the generic answer here and accept the answer :o)
My email is mark@yahright.com. I don't know if I'll be able to get to it tonight, as I've got a bunch of real work to get finished by tomorrow. But, if I get a chance, I'll look at it.
ASKER
Here's the answer I am accepting from ClockWatcher:
#!/usr/bin/perl
use HTTP::Request::Common qw(GET);
use LWP::UserAgent;
$ua = new LWP::UserAgent;
$ua->agent('Mozilla/4.0 (compatible; MSIE 5.0; Windows 95; DigExt)');
$req = GET 'http://www.site2.com/';
$req->push_header(REFERER => 'http://www.site1.com/page1.html');
$page = $ua->request($req)->as_string;
($username) = $page =~ /<b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ /<b>Password:\s(.*?)<\/b>/i;
print "$username\n";
print "$password\n";
exit;
-----------------------------------
Works like a charm!!!
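One plausible reason this version works where the earlier attempts did not (an editorial reading, not something stated in the thread) is the header name. HTTP_REFERER is the name of the CGI environment variable on the server side, while the header a client actually sends is called Referer. By default, HTTP::Headers also translates underscores in field names to dashes, so HTTP_REFERER goes out as a separate header named HTTP-REFERER. The accepted script also sets a browser-like User-Agent, which may have mattered too. The sketch below assumes libwww-perl is installed. It uses the thread's placeholder URLs and a hypothetical build_request helper, and it builds both requests without sending them.

```perl
use strict;
use warnings;
use HTTP::Request;

# Build (but do not send) a GET request carrying one extra header.
sub build_request {
    my ($url, $header_name, $header_value) = @_;
    my $req = HTTP::Request->new(GET => $url);
    $req->push_header($header_name => $header_value);
    return $req;
}

my $page1 = 'http://www.site1.com/page1.html';
my $wrong = build_request('http://www.site2.com/', 'HTTP_REFERER', $page1);
my $right = build_request('http://www.site2.com/', 'REFERER',      $page1);

# as_string shows which header lines each request would carry.
print $wrong->as_string, "\n";
print $right->as_string, "\n";
```

Comparing the two printouts shows that only the second request carries a Referer header.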
2) How would one identify the appropriate link on page 1?
3) crontab
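For the scheduling step, a crontab entry along these lines is a plausible starting point. The script path /home/tim/getuserpass.pl is a made-up location, and the three-hour interval is just one reading of "every few hours".

```
# minute hour day-of-month month day-of-week  command
0 */3 * * * /usr/bin/perl /home/tim/getuserpass.pl
```

Saving this in a file such as mycron and running "crontab mycron" installs it for the current user, and "crontab -l" lists what is installed. Cron does not start the job in the script's own directory, so it is safer to use absolute paths for template.js and the output file inside the script.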