Solved

copy text from source on a web page

Posted on 2000-01-30
Last Modified: 2008-03-06
This is not going to be easy to explain.  Plus, it will probably be very challenging (if it's even possible) for even an expert Perl programmer, thus the high point value.  Might I add that I'm a Perl idiot, so even if I get the code from here, I won't be able to decipher it very easily to see how it works.  What I need to do, with Perl, is open a web page, go from there to another page by following a link on the 1st page, read specific text on the second page, then write the copied text to a file.  Now, getting more specific and stating it more clearly:

1)  I need to open a specific web page on the web.  This is a password protected page, but I can put my username:password in the URL anyway (since I DO have one).

2)  Now that I'm on page 1, there is a link on that page that I need to follow (please tell me this is possible).  The second page is actually on a different web server, which the 1st page links (refers) to.  The 2nd page can only be accessed by first accessing the 1st page: the link on page1 takes me directly to page2, and this is the ONLY way to get to page2; otherwise it won't work.

3)  Once I'm on page2 (if the above is even possible), I need to read the source code of page2 (if this is possible).  I need to copy two pieces of text (variables) from the code.  The code for page2 will be in the format of:
<b>Username: username1</b> and <b>Password: password1</b>

I need to copy the text "username1" and the text "password1" and write them to a *.js file.  The *.js file will have a specific format that I need to specify in the Perl file.  "username1" and "password1" will be inserted in the appropriate places.  I also need to know how to insert the copied text (variables) where I need it.

3)  Finally, I need to automatically run this Perl file every few hours.  I am running Red Hat Linux 6.x and I am still unsure where to place this file and how to tell Linux to run the file once every specified # of minutes or hours.

I don't know if this all is possible using Perl or any other language, but it IS possible to do by hand (using internet explorer, my brain, my fingers, and a text editor), so one way or another it has to be possible to automate the process.

If it isn't possible with Perl, I am open to any suggestions on how I can do this (combination of javascript and some other program or programming language). I'm putting up a lot of points in hopes that I will get a good answer.  Please remember that I'm not too good at programming and I'll probably need some hand-holding through this process.  Thank you!

Tim
Question by:GorGor1
24 Comments
 
LVL 85

Expert Comment

by:ozo
ID: 2458583
1) use LWP::UserAgent;
2) How would one identify the appropriate link on page 1?
3) crontab
 
LVL 1

Expert Comment

by:malec
ID: 2462925
Use LWP::Simple to get the 1st page like this:
$_ = get("http://www.you.com/dir?params");

Parse $_ to find your link (I assume it has something consistent about it, like, say, the title).

Get page 2 the same way. Parse it to extract the username and password, then write the .js file.

Run man crontab for info on how to run cron jobs. Usually it's crontab < mycron, where mycron is the schedule file.
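
For the "parse $_ to find your link" step, a rough sketch might look like the following. It assumes the link can be recognized by the image it wraps; the sample HTML, the image name image.gif, and the variable names are placeholders, not details of the real page. A regex is only a shortcut here, and a proper HTML parser would be more robust.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the HTML that get() returned for page 1.
my $html = <<'HTML';
<p><A HREF="http://www.other.example/">Some other link</A></p>
<A HREF="http://www.website.com/" TARGET="new"><IMG SRC="images/image.gif" WIDTH="120" HEIGHT="22" BORDER="0"></A><br>
HTML

# Capture the HREF of the anchor whose body starts with the known image.
my ($link) = $html =~ m{<a\s+href="([^"]+)"[^>]*>\s*<img\s+src="images/image\.gif"}i;

die "Link to page 2 not found\n" unless defined $link;
print "$link\n";
```

The captured URL in $link is what you would then pass to get() for page 2.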
 
LVL 1

Author Comment

by:GorGor1
ID: 2469881
I don't understand most of what you guys are talking about.  I don't know how to write much code in Perl and I have no idea of the format for these commands you are giving me.

The link on page1 that goes to page2 will have the source code [<A HREF="http://www.website.com/" TARGET="new"><IMG SRC="images/image.gif" WIDTH="120" HEIGHT="22" BORDER="0"></A><br>]

Please give me more specific examples how to do what I need to do.  Thanks!

Tim
 
LVL 25

Expert Comment

by:clockwatcher
ID: 2476605
If page2 is on a different server and the link looks like what you've posted above then page2 is simply checking the HTTP_REFERER value passed in the header.  You should be able to fake the header variable and go directly to page 2.

Here's sample code:

use HTTP::Request::Common qw(GET);
use LWP::UserAgent;

$ua = new LWP::UserAgent;

$req = GET 'http://www.page2.com';

$req->push_header(HTTP_REFERER => 'http://www.page1.com/somepage.html');

$page = $ua->request($req)->as_string;

($username) = $page =~ <b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ <b>Password:\s(.*?)<\/b>/i;

print "$username\n";
print "$password\n";
 
LVL 1

Author Comment

by:GorGor1
ID: 2478390
Let me see if I understand the code above.  The code above SHOULD go directly to page2 by passing the referer to page2's server.  It will read the 'username1' and 'password1' variables (by looking for the appropriate 'keywords' on the page) and then print them.  I'm not too clear on the print commands.  Where is it printing them?  To the shell?  If this is what it is doing and everything above the print commands works, how would I open a *.js file and print them to the *.js file (with text around them) instead of to the shell?  Thanks again!

Tim
 
LVL 25

Expert Comment

by:clockwatcher
ID: 2479598
You need to create a template javascript file explicitly defining where you want the username and password to go.  You then read the template, replace the sections you need to replace and write out the final file.

E.g.,

template.js
-----------

<script language="javascript">
  document.write("--perl script will insert username here--");
  document.write("--perl script will insert password here--");
</script>

Here's the perl script to create a file based on the template above.  It would replace the print commands I posted earlier.

open(TEMPLATE, "< template.js") or die "Can't read template.js: $!";
open(FINAL, "> myscript.js") or die "Can't write myscript.js: $!";

while (<TEMPLATE>) {
  s/--perl script will insert username here--/$username/g;
  s/--perl script will insert password here--/$password/g;
  print FINAL $_;
}

close TEMPLATE;
close FINAL;
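
As a self-contained way to try the template idea without touching real files, here is a minimal sketch. It assumes placeholder values for $username and $password (in the real script they come from the pattern matches) and works in a scratch directory; the file names and variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Placeholder values; in the real script these come from the regex matches.
my ($username, $password) = ('username1', 'password1');

# Work in a scratch directory so the sketch does not touch real files.
my $dir      = tempdir(CLEANUP => 1);
my $template = "$dir/template.js";
my $final    = "$dir/myscript.js";

# Create a small template containing the two markers.
open(my $out, '>', $template) or die "Can't write $template: $!";
print $out qq{document.write("--perl script will insert username here--");\n};
print $out qq{document.write("--perl script will insert password here--");\n};
close $out;

# Copy the template line by line, swapping each marker for its value.
open(my $in, '<', $template) or die "Can't read $template: $!";
open(my $js, '>', $final)    or die "Can't write $final: $!";
while (my $line = <$in>) {
    $line =~ s/--perl script will insert username here--/$username/g;
    $line =~ s/--perl script will insert password here--/$password/g;
    print $js $line;
}
close $in;
close $js;

# Read the generated file back and show it.
open(my $check, '<', $final) or die "Can't read $final: $!";
my $result = do { local $/; <$check> };
close $check;
print $result;
```

Checking the return value of each open means a missing template or an unwritable directory stops the script with a message instead of silently producing an empty file.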


So putting it all together:

use HTTP::Request::Common qw(GET);
use LWP::UserAgent;

$ua = new LWP::UserAgent;

$req = GET 'http://www.page2.com';

$req->push_header(HTTP_REFERER => 'http://www.page1.com/somepage.html');

$page = $ua->request($req)->as_string;

($username) = $page =~ <b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ <b>Password:\s(.*?)<\/b>/i;

open(TEMPLATE, "< template.js") or die "Can't read template.js: $!";
open(FINAL, "> myscript.js") or die "Can't write myscript.js: $!";

while (<TEMPLATE>) {
  s/--perl script will insert username here--/$username/g;
  s/--perl script will insert password here--/$password/g;
  print FINAL $_;
}

close TEMPLATE;
close FINAL;
 
LVL 1

Author Comment

by:GorGor1
ID: 2480480
I'm getting the following error when I try to run the script:

Can't locate HTTP/Request/Common.pm in @INC <@INC contains: /usr/lib/perl5/5.00503/i386-linux /usr/lib/perl5/5.00503 /usr/lib/perl5/site_perl/5.005/i386-linux /usr/lib/perl5/site_perl/5.005 .> at getuserpass.pl line 3.
BEGIN failed--compilation aborted at getuserpass.pl line 3.

Where this is my script:

#!/usr/bin/perl

use HTTP::Request::Common qw(GET);
use LWP::UserAgent;

$ua = new LWP::UserAgent;

$req = GET 'http://www.site2.com/';

$req->push_header(HTTP_REFERER => 'http://www.site1.com/');

$page = $ua->request($req)->as_string;

($username) = $page =~ <b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ <b>Password:\s(.*?)<\/b>/i;

print "$username\n";
print "$password\n";


exit;

See any reason why I'm getting this error?
 
LVL 25

Expert Comment

by:clockwatcher
ID: 2480626
You don't have the libwww module installed.  Here's a link for the CPAN bundle:

  http://theoryx5.uwinnipeg.ca/mod_perl/cpan-search?modinfo=451
 
LVL 1

Author Comment

by:GorGor1
ID: 2480812
Ok, I installed about 6 packages so now my Perl is up to date.  When I tried to run the code above (no need to paste it here again), I got an error (I'm still just trying to get username and password to be printed to the shell before I go any further):


Bareword found where operator expected at getuserpass.pl line 14, near "<b>Username"
      <Missing operator before Username?>
syntax error at getuserpass.pl line 14, near "<b>Username"
Bareword found where operator expected at getuserpass.pl line 15, near "<b>Password"
      <Missing operator before Password?>
syntax error at getuserpass.pl line 15, near "<b>Password"
Execution of getuserpass.pl aborted due to compilation errors.


Hmmmm....any ideas?
 
LVL 25

Expert Comment

by:clockwatcher
ID: 2480830
Sorry, it was a typo.  I didn't have access to a copy and paste on the system I was on and made a mistake transferring my script to EE.

The line is missing a /.

#!/usr/bin/perl

use HTTP::Request::Common qw(GET);
use LWP::UserAgent;

$ua = new LWP::UserAgent;

$req = GET 'http://www.site2.com/';

$req->push_header(HTTP_REFERER => 'http://www.site1.com/');

$page = $ua->request($req)->as_string;

($username) = $page =~ /<b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ /<b>Password:\s(.*?)<\/b>/i;

print "$username\n";
print "$password\n";
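
Note that the two pattern matches fail silently when the page does not contain the expected text: the variables are simply left undefined. A small sketch of the same match with an explicit check, run against canned strings instead of a live request, could look like this. The helper name extract_field and both sample pages are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull the value that follows "<b>Label: " up to the closing </b>.
# Returns undef when the label is not on the page.
sub extract_field {
    my ($page, $label) = @_;
    my ($value) = $page =~ /<b>\Q$label\E:\s(.*?)<\/b>/i;
    return $value;
}

# Hypothetical page bodies standing in for what the server sends back.
my $good_page = '<b>Username: username1</b> and <b>Password: password1</b>';
my $bad_page  = '<html><body>Not Authorized</body></html>';

for my $page ($good_page, $bad_page) {
    my $username = extract_field($page, 'Username');
    my $password = extract_field($page, 'Password');
    if (defined $username && defined $password) {
        print "username=$username password=$password\n";
    } else {
        print "no credentials found; page was: $page\n";
    }
}
```

Printing the page when nothing matches makes it obvious whether the server sent back the expected content or an error page.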
 
LVL 1

Author Comment

by:GorGor1
ID: 2480882
I fixed it and tried it again.  Nothing is being printed to the shell.  How do I troubleshoot and test to make sure that I am getting a connection to "http://www.site2.com"?  Thanks again!
 
LVL 25

Expert Comment

by:clockwatcher
ID: 2480889
Add a print $page.

print "$page\n\n";
print "$username\n";
print "$password\n";
 
LVL 1

Author Comment

by:GorGor1
ID: 2480910
Here's where I've gotten.  If I go to http://site1.com/ and go to http://site2.com/ manually, I can view page2.  However, when I run the perl script, it appears that the HTTP referer site1.com is not working since I get the "no authorization" page instead of the page I'm looking for.
 
LVL 25

Expert Comment

by:clockwatcher
ID: 2480930
Do you have the full URL of the page listed as the referer:

  http://www.site1.com/default.html?anyparameters=whatever

It should be the full URL.
 
LVL 25

Expert Comment

by:clockwatcher
ID: 2480940
I realize you don't want to give out specific URLs and passwords, but without them it could be very difficult to get this to work.  The fact that you don't know perl well isn't going to make it easier. :-)
 
LVL 1

Author Comment

by:GorGor1
ID: 2480957
The link that brings me from http://www.site1.com/members/page.html to site2.com is simply "http://www.site2.com/".  Site2's default index page is a cgi script that checks to make sure the referer is valid.  If not valid, it returns a page "Not Authorized".

Is it possible to "click" on a link on a page using perl?  I know it sounds funny, but there's no other way to ask it.  I can read the source of the page that contains the link to http://www.site2.com/ and then read the link as a variable.  I'm just wondering if there's any way to follow links with perl.  I'm going to work on it more tomorrow.  Any other ideas?  Sorry that I'm being a total pain, but I really need this script to work.  Thanks again.
 
LVL 1

Author Comment

by:GorGor1
ID: 2480976
I forgot to add that if the referer IS valid, then a page loads that posts a username and password that will work in the future.  This username/password changes a couple times/day.

P.S.  Please don't give up on me, we're getting very close.  I can feel it  :o)  Thanks!
 
LVL 25

Expert Comment

by:clockwatcher
ID: 2480992
In a browser, when you click on an external link (a link to a site not within the domain of the current page), the only information that should be transmitted to the receiving page is HTTP_REFERER.  Adding the HTTP_REFERER header should be equivalent to "clicking" on a link to an outside domain.

If the sites were within the same domain, it could be looking for a cookie-- but you said the sites were not under the same domain.

Authentication information shouldn't transfer across domains.

I really don't think you've got the referer right.  But I'm just guessing.  Without specifics, it's going to be very hard to figure out exactly what it's looking for.
 
LVL 25

Expert Comment

by:clockwatcher
ID: 2481012
In the example you posted, the referer would be:

$req->push_header(HTTP_REFERER => 'http://www.site1.com/members/page.html');
 
LVL 1

Author Comment

by:GorGor1
ID: 2481673
Clockwatcher....If you give me your e-mail address, I will give you all the information that is needed to get this to work. I don't want to post it here.  After everything works, I will post the generic answer here and accept the answer  :o)
 
LVL 25

Expert Comment

by:clockwatcher
ID: 2483778
My email is mark@yahright.com.  I don't know if I'll be able to get to it tonight, as I've got a bunch of real work to get finished by tomorrow.  But, if I get a chance, I'll look at it.
 
LVL 25

Accepted Solution

by:
clockwatcher earned 1200 total points
ID: 2484236
It's looking for a bare REFERER header variable.  The HTTP_ gets added automatically.

#!/usr/bin/perl

use HTTP::Request::Common qw(GET);
use LWP::UserAgent;

$ua = new LWP::UserAgent;

$req = GET 'http://www.site2.com/';

$req->push_header(REFERER => 'http://www.site1.com/');

$page = $ua->request($req)->as_string;

($username) = $page =~ /<b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ /<b>Password:\s(.*?)<\/b>/i;

print "$username\n";
print "$password\n";
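
To illustrate why the bare name is the right one: under the CGI convention, the server turns each request header into an environment variable by uppercasing the name, replacing "-" with "_", and prefixing "HTTP_". The sketch below just applies that naming rule to a few header names; cgi_env_name is a made-up helper, not part of LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the CGI environment variable name for a request header:
# uppercase it, turn "-" into "_", and prefix "HTTP_".
sub cgi_env_name {
    my ($header) = @_;
    (my $name = uc $header) =~ tr/-/_/;
    return "HTTP_$name";
}

for my $header ('Referer', 'User-Agent', 'HTTP_REFERER') {
    printf "%-13s -> %s\n", $header, cgi_env_name($header);
}
```

By that rule, a header sent as Referer shows up on the server as HTTP_REFERER, while a header literally named HTTP_REFERER would at best end up as HTTP_HTTP_REFERER, which the script on page 2 presumably never looks at.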
 
LVL 1

Author Comment

by:GorGor1
ID: 2484442
Here's the answer I am accepting from ClockWatcher:

#!/usr/bin/perl
 
use HTTP::Request::Common qw(GET);
use LWP::UserAgent;
 
$ua = new LWP::UserAgent;
 
$ua->agent('Mozilla/4.0 (compatible; MSIE 5.0; Windows 95; DigExt)');
 
$req = GET 'http://www.site2.com/';
$req->push_header(REFERER => 'http://www.site1.com/page1.html');
 
$page = $ua->request($req)->as_string;
 
($username) = $page =~ /<b>Username:\s(.*?)<\/b>/i;
($password) = $page =~ /<b>Password:\s(.*?)<\/b>/i;
 
print "$username\n";
print "$password\n";

exit;
-----------------------------------
Works like a charm!!!
 
