GorGor1

copy text from source on a web page

This is not going to be easy to explain.  Plus, it will probably be very challenging (if it's even possible) even for an expert Perl programmer, thus the high point value.  Might I add that I'm a Perl idiot, so even if I get the code from here, I won't be able to decipher it very easily to see how it works.  What I need to do, with Perl, is open a web page, go from there to another page by following a link on the first page, read specific text on the second page, then write the copied text to a file.  Now, getting more specific and stating it more clearly:

1)  I need to open a specific web page on the web.  This is a password protected page, but I can put my username:password in the URL anyway (since I DO have one).

2)  Now that I'm on page 1, there is a link on that page that I need to follow (please tell me this is possible).  The second page is actually on a different web server that the 1st page links (refers) to.  The 2nd page can only be accessed by first accessing the 1st page.  There is a link on page1 that will take me directly to page2.  This is the ONLY way to get to page2; otherwise it won't work.

3)  Once I'm on page2 (if the above is even possible), I need to read the source code of page2 (if this is possible).  I need to copy two pieces of text (variables) from the code.  The code for page2 will be in the format of:
<b>Username: username1</b> and <b>Password: password1</b>

I need to copy the text "username1" and the text "password1" and write them to a *.js file.  The *.js file will have a specific format that I need to specify in the Perl file.  "username1" and "password1" will be inserted in the appropriate places.  I also need to know how to insert the copied text (variables) where I need it.

4)  Finally, I need to run this Perl file automatically every few hours.  I am running Red Hat Linux 6.x and I am still unsure where to place this file and how to tell Linux to run it once every specified number of minutes or hours.

I don't know if this all is possible using Perl or any other language, but it IS possible to do by hand (using internet explorer, my brain, my fingers, and a text editor), so one way or another it has to be possible to automate the process.

If it isn't possible with Perl, I am open to any suggestions on how I can do this (combination of javascript and some other program or programming language). I'm putting up a lot of points in hopes that I will get a good answer.  Please remember that I'm not too good at programming and I'll probably need some hand-holding through this process.  Thank you!

Tim
ozo

1) use LWP::UserAgent;
2) How would one identify the appropriate link on page 1?
3) crontab
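
For the scheduling part, a crontab entry along these lines might do. This is only a sketch: the path /home/tim/getuserpass.pl is a made-up location for the script, the three-hour interval is an assumption standing in for "every few hours", and the */3 step syntax assumes a cron that supports step values (Vixie cron does).

```
# minute  hour  day-of-month  month  day-of-week  command
0 */3 * * * /usr/bin/perl /home/tim/getuserpass.pl
```

Put the line in a file (mycron is just an example name) and load it with "crontab mycron", or edit the schedule directly with "crontab -e". Loading a file replaces whatever crontab you already had; "crontab -l" shows what is currently installed.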
malec

Use LWP::Simple to get the first page, like this:
$_ = get("http://www.you.com/dir?params");

Parse $_ to find your link (I assume there is something consistent about it, such as its title).

Get page 2 the same way. Parse it to extract the username and password, then write the .js file.

Run man crontab for details on setting up cron jobs. Usually: crontab mycron, where mycron is your schedule file.
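
The "parse to find your link" step could be sketched as below. It assumes the link is an ordinary <A HREF="..."> tag wrapped around a known image; the helper name find_image_link, the example.com URLs and the image name are all made up for illustration, and a regex like this is only good enough for simple, predictable HTML.

```perl
use strict;
use warnings;

# Return the HREF of the first link whose body contains an <IMG> tag
# with the given SRC, or nothing if no such link exists.
# The page layout and the image name are assumptions.
sub find_image_link {
    my ($html, $img_src) = @_;
    while ($html =~ /<a\s[^>]*href\s*=\s*"([^"]*)"[^>]*>(.*?)<\/a>/gis) {
        my ($href, $body) = ($1, $2);
        return $href if $body =~ /<img\s[^>]*src\s*=\s*"\Q$img_src\E"/i;
    }
    return;
}

# Stand-in for the text that get() would return for page 1.
my $sample = '<A HREF="http://www.example.com/other.html">Other</A>' . "\n"
           . '<A HREF="http://www.example.com/" TARGET="new">'
           . '<IMG SRC="images/image.gif" WIDTH="120"></A><br>';

my $link = find_image_link($sample, 'images/image.gif');
print defined $link ? "$link\n" : "link not found\n";
```

In the real script, $sample would be replaced by the page fetched with get(), and the returned URL would be passed to a second get().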
GorGor1

ASKER

I don't understand most of what you guys are talking about.  I don't know how to write much code in Perl and I have no idea of the format for these commands you are giving me.

The link on page1 that goes to page2 will have the source code [<A HREF="http://www.website.com/" TARGET="new"><IMG SRC="images/image.gif" WIDTH="120" HEIGHT="22" BORDER="0"></A><br>]

Please give me more specific examples how to do what I need to do.  Thanks!

Tim
If page2 is on a different server and the link looks like what you've posted above then page2 is simply checking the HTTP_REFERER value passed in the header.  You should be able to fake the header variable and go directly to page 2.

Here's sample code:

use HTTP::Request::Common qw(GET);
use LWP::UserAgent;

$ua = new LWP::UserAgent;

$req = GET 'http://www.page2.com';

$req->push_header(HTTP_REFERER => 'http://www.page1.com/somepage.html');

$page = $ua->request($req)->as_string;

($username) = $page =~ <b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ <b>Password:\s(.*?)<\/b>/i;

print "$username\n";
print "$password\n";
GorGor1

ASKER

Let me see if I understand the code above.  It SHOULD go directly to page2 by passing the referer to page2's server.  It will read the 'username1' and 'password1' variables (by looking for the appropriate 'keywords' on the page) and then print them.  I'm not too clear on the print commands.  Where is it printing them?  To the shell?  If that is what it is doing and everything above the print commands works, how would I open a *.js file and print them to the *.js file (with text around them) instead of to the shell?  Thanks again!

Tim
You need to create a template javascript file explicitly defining where you want the username and password to go.  You then read the template, replace the sections you need to replace and write out the final file.

E.g.,

template.js
-----------

<script language="javascript">
  document.write("--perl script will insert username here--");
  document.write("--perl script will insert password here--");
</script>

Here's the perl script to create a file based on the template above.  It would replace the print commands I posted earlier.

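open(TEMPLATE, "< template.js") or die "Can't read template.js: $!";
open(FINAL, "> myscript.js") or die "Can't write myscript.js: $!";

# Copy the template line by line, filling in the two placeholders.
while (<TEMPLATE>) {
  s/--perl script will insert username here--/$username/g;
  s/--perl script will insert password here--/$password/g;
  print FINAL $_;
}

close TEMPLATE;
close FINAL;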


So putting it all together:

use HTTP::Request::Common qw(GET);
use LWP::UserAgent;

$ua = new LWP::UserAgent;

$req = GET 'http://www.page2.com';

$req->push_header(HTTP_REFERER => 'http://www.page1.com/somepage.html');

$page = $ua->request($req)->as_string;

($username) = $page =~ <b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ <b>Password:\s(.*?)<\/b>/i;

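open(TEMPLATE, "< template.js") or die "Can't read template.js: $!";
open(FINAL, "> myscript.js") or die "Can't write myscript.js: $!";

# Copy the template line by line, filling in the two placeholders.
while (<TEMPLATE>) {
  s/--perl script will insert username here--/$username/g;
  s/--perl script will insert password here--/$password/g;
  print FINAL $_;
}

close TEMPLATE;
close FINAL;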
GorGor1

ASKER

I'm getting the following error when I try to run the script:

Can't locate HTTP/Request/Common.pm in @INC <@INC contains: /usr/lib/perl5/5.00503/i386-linux /usr/lib/perl5/5.00503 /usr/lib/perl5/site_perl/5.005/i386-linux /usr/lib/perl5/site_perl/5.005 .> at getuserpass.pl line 3.
BEGIN failed--compilation aborted at getuserpass.pl line 3.

Where this is my script:

#!/usr/bin/perl

use HTTP::Request::Common qw(GET);
use LWP::UserAgent;

$ua = new LWP::UserAgent;

$req = GET 'http://www.site2.com/';

$req->push_header(HTTP_REFERER => 'http://www.site1.com/');

$page = $ua->request($req)->as_string;

($username) = $page =~ <b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ <b>Password:\s(.*?)<\/b>/i;

print "$username\n";
print "$password\n";


exit;

See any reason why I'm getting this error?
You don't have the libwww-perl (LWP) modules installed.  Here's a link for the CPAN bundle:

  http://theoryx5.uwinnipeg.ca/mod_perl/cpan-search?modinfo=451
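
After installing, a quick check like the following can confirm that Perl can find the modules the script needs. It is a minimal sketch; the helper name missing_modules is made up, and it only tests whether each module loads, not whether its version is recent enough.

```perl
use strict;
use warnings;

# Return the names of the modules that cannot be loaded.
sub missing_modules {
    my @missing;
    for my $module (@_) {
        # Turn Some::Module into Some/Module.pm, the form require expects
        # when it is given a string instead of a bareword.
        (my $file = "$module.pm") =~ s{::}{/}g;
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules('LWP::UserAgent', 'HTTP::Request::Common');
print @missing ? "Missing: @missing\n" : "All modules found\n";
```

If a module shows up as missing, the "Can't locate ... in @INC" error will come back until it is installed somewhere on the @INC path.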
GorGor1

ASKER

Ok, I installed about 6 packages so now my Perl is up to date.  When I tried to run the code above (no need to paste it here again), I got an error (I'm still just trying to get the username and password printed to the shell before I go any further):


Bareword found where operator expected at getuserpass.pl line 14, near "<b>Username"
      <Missing operator before Username?>
syntax error at getuserpass.pl line 14, near "<b>Username"
Bareword found where operator expected at getuserpass.pl line 15, near "<b>Password"
      <Missing operator before Password?>
syntax error at getuserpass.pl line 15, near "<b>Password"
Execution of getuserpass.pl aborted due to compilation errors.


Hmmmm....any ideas?
Sorry, it was a typo.  I didn't have access to a copy and paste on the system I was on and made a mistake transferring my script to EE.

Each of the two regex lines is missing its opening /.

#!/usr/bin/perl

use HTTP::Request::Common qw(GET);
use LWP::UserAgent;

$ua = new LWP::UserAgent;

$req = GET 'http://www.site2.com/';

$req->push_header(HTTP_REFERER => 'http://www.site1.com/');

$page = $ua->request($req)->as_string;

($username) = $page =~ /<b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ /<b>Password:\s(.*?)<\/b>/i;

print "$username\n";
print "$password\n";
GorGor1

ASKER

I fixed it and tried it again.  Nothing is being printed to the shell.  How do I troubleshoot and test to make sure that I am getting a connection to "http://www.site2.com"?  Thanks again!
Add a print $page.

print "$page\n\n";
print "$username\n";
print "$password\n";
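
Since $page comes from as_string, printing it shows the status line and headers as well as the HTML. If you only want a one-line verdict on whether the request worked, something like this sketch could help. The helper name describe_response is made up; is_success and status_line are methods of the HTTP::Response object that $ua->request returns.

```perl
use strict;
use warnings;

# Summarize a response as "<status line> (OK)" or "<status line> (FAILED)".
# Works with any object that offers is_success() and status_line(),
# which HTTP::Response does.
sub describe_response {
    my ($res) = @_;
    my $verdict = $res->is_success ? 'OK' : 'FAILED';
    return sprintf '%s (%s)', $res->status_line, $verdict;
}

# In the script above it would be used roughly like this:
#   my $res = $ua->request($req);
#   print describe_response($res), "\n";
#   $page = $res->as_string;
```

A failed status means the connection or the server is the problem; a successful status with an empty $username means the page came back but the pattern did not match it.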
GorGor1

ASKER

Here's where I've gotten.  If I go to http://site1.com/ and go to http://site2.com/ manually, I can view page2.  However, when I run the perl script, it appears that the HTTP referer site1.com is not working since I get the "no authorization" page instead of the page I'm looking for.
Do you have the full URL of the page listed as the referer:

  http://www.site1.com/default.html?anyparameters=whatever

It should be the full URL.
I realize you don't want to give out specific URLs and passwords, but without them it could be very difficult to get this to work.  The fact that you don't know perl well isn't going to make it easier. :-)
GorGor1

ASKER

The link that brings me from http://www.site1.com/members/page.html to site2.com is simply "http://www.site2.com/".  Site2's default index page is a cgi script that checks to make sure the referer is valid.  If not valid, it returns a page "Not Authorized".

Is it possible to "click" on a link on a page using perl?  I know it sounds funny, but there's no other way to ask it.  I can read the source of the page that contains the link to http://www.site2.com/ and then read the link as a variable.  I'm just wondering if there's any way to follow links with perl.  I'm going to work on it more tomorrow.  Any other ideas?  Sorry that I'm being a total pain, but I really need this script to work.  Thanks again.
GorGor1

ASKER

I forgot to add that if the referer IS valid, then a page loads that posts a username and password that will work in the future.  This username/password changes a couple times/day.

P.S.  Please don't give up on me, we're getting very close.  I can feel it  :o)  Thanks!
In a browser, when you click on an external link (a link to a site not within the domain of the current page), the only information that should be transmitted to the receiving page is HTTP_REFERER.  Adding the HTTP_REFERER header should be equivalent to "clicking" on a link to an outside domain.

If the sites were within the same domain, it could be looking for a cookie-- but you said the sites were not under the same domain.

Authentication information shouldn't transfer across domains.

I really don't think you've got the referer right.  But I'm just guessing.  Without specifics, it's going to be very hard to figure out exactly what it's looking for.
In the example you posted, the referer would be:

$req->push_header(HTTP_REFERER => 'http://www.site1.com/members/page.html');
GorGor1

ASKER

Clockwatcher....If you give me your e-mail address, I will give you all the information that is needed to get this to work. I don't want to post it here.  After everything works, I will post the generic answer here and accept the answer  :o)
My email is mark@yahright.com.  I don't know if I'll be able to get to it tonight, as I've got a bunch of real work to get finished by tomorrow.  But, if I get a chance, I'll look at it.
ASKER CERTIFIED SOLUTION
clockwatcher

GorGor1

ASKER

Here's the answer I am accepting from ClockWatcher:

#!/usr/bin/perl
 
use HTTP::Request::Common qw(GET);
use LWP::UserAgent;
 
$ua = new LWP::UserAgent;
 
$ua->agent('Mozilla/4.0 (compatible; MSIE 5.0; Windows 95; DigExt)');
 
$req = GET 'http://www.site2.com/';
$req->push_header(REFERER => 'http://www.site1.com/page1.html');
 
$page = $ua->request($req)->as_string;
 
($username) = $page =~ /<b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ /<b>Password:\s(.*?)<\/b>/i;
 
print "$username\n";
print "$password\n";

exit;
-----------------------------------
Works like a charm!!!
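
Compared with the earlier attempts in this thread, the accepted script sets a browser-like user agent and names the header REFERER instead of HTTP_REFERER. HTTP_REFERER is the name a CGI script sees in its environment; the header a browser sends is called Referer. HTTP::Headers turns underscores in field names into dashes, so HTTP_REFERER ends up as a header named HTTP-REFERER rather than Referer. That may be why the earlier attempts got the "Not Authorized" page, although the user agent could have played a part too. The sketch below shows the difference without making any network request; the helper names request_with_header and has_referer are made up.

```perl
use strict;
use warnings;

# Build a GET request carrying one extra header.
sub request_with_header {
    my ($url, $name, $value) = @_;
    require HTTP::Request;    # installed alongside LWP
    my $req = HTTP::Request->new(GET => $url);
    $req->push_header($name => $value);
    return $req;
}

# True if the request carries a Referer header (lookup ignores case).
sub has_referer {
    my ($req) = @_;
    return defined $req->header('Referer') ? 1 : 0;
}

if (eval { require HTTP::Request; 1 }) {
    for my $name ('HTTP_REFERER', 'REFERER') {
        my $req = request_with_header('http://www.site2.com/', $name,
                                      'http://www.site1.com/page1.html');
        printf "%-12s -> Referer header %s\n", $name,
            has_referer($req) ? 'present' : 'absent';
    }
} else {
    print "HTTP::Request is not installed\n";
}
```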