perl parse webpage

Posted on 2007-12-06
Last Modified: 2008-02-01
I am trying to count the number of occurrences of the following sentence in the source code of a webpage.

<!-- changed logic:noMatch to logic:notEqual to prevent similar symbols from displaying together -->

My script looks like this, but it just runs and runs, and I have to cancel it.


use LWP::Simple;
my $content = get('http://www.website');
my $count= 0;
$count++ while $content =~/\s+changed logic:noMatch to logic:notEqual to prevent similar symbols from displaying together\s/;
  print $count;
Question by:mcgilljd
LVL 39

Expert Comment

by:Adam314
ID: 20421644
How long do you give it?  It might be taking a while to get the webpage.

Add a few print statements so you can see where it hangs...  What output do you get from this?

use LWP::Simple;
$|=1;
print "Getting page...\n";
my $content = get('http://www.website');
my $count= 0;
print "Got page: " . length($content) . " bytes\n";
 
print "Checking for line...\n";
$count++ while $content =~/\s+changed logic:noMatch to logic:notEqual to prevent similar symbols from displaying together\s/;
print "count=$count\n";


Author Comment

by:mcgilljd
ID: 20421977
Getting page...
Got page: 2582373 bytes
Checking for line...

Then it hangs.

The line should occur about 2100 times.
LVL 39

Accepted Solution

by:
Adam314 earned 2000 total points
ID: 20422605
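You need a /g on the end of your regex. Without it, every evaluation of the match in the while condition starts again from the beginning of $content, so once the pattern is found the loop never ends: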

$count++ while $content =~/\s+changed logic:noMatch to logic:notEqual to prevent similar symbols from displaying together\s/g;
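To see the effect of /g without downloading anything, here is a minimal sketch that runs the same pattern against a small in-memory string. The sample $content is made up for illustration and stands in for whatever get() returns; the one-line list-context count is an alternative idiom, not something from the original script.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the downloaded page: the comment repeated
# three times, followed by an unrelated line.
my $comment = '<!-- changed logic:noMatch to logic:notEqual to prevent similar symbols from displaying together -->';
my $content = join "\n", ($comment) x 3, '<p>unrelated</p>';

my $pattern = qr/\s+changed logic:noMatch to logic:notEqual to prevent similar symbols from displaying together\s/;

# With /g in scalar context, each successful match resumes where the
# previous one ended, so the loop stops when no further match is found.
my $count = 0;
$count++ while $content =~ /$pattern/g;

# Alternative: force list context and count the matches in one statement.
my $list_count = () = $content =~ /$pattern/g;

print "count=$count list_count=$list_count\n";
```

Both counts should agree. Dropping the /g from the while loop brings back the endless loop, because the match then succeeds at the same position every time.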


Author Comment

by:mcgilljd
ID: 20422746
I changed the code a little to this.

Now it just counts infinitely if the search string is found.

Maybe I need to split the content into lines?
print "Checking for line...\n";
while ($content =~/\s+changed logic:noMatch to logic:notEqual to prevent similar symbols from displaying together\s/)
{
    $count++;

    print "count=$count\n";
}

LVL 39

Expert Comment

by:Adam314
ID: 20422888
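The /g should cause it to count properly. The loop you just posted is still missing it, which is why it never stops. If the message is split across multiple lines, /s on its own will not help here, because it only changes what . matches and this pattern has no dot; the literal spaces in the pattern would need to become \s+ instead.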
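If the comment can wrap across lines in the page source, one option is to allow any run of whitespace between the words rather than single literal spaces. The /s modifier only affects what . matches, and this pattern contains no dot. The sketch below builds such a pattern from the phrase; the sample $content and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $phrase = 'changed logic:noMatch to logic:notEqual to prevent similar symbols from displaying together';

# Escape each word, then join the words with \s+ so that spaces, tabs,
# or newlines between words all match.
my $flexible = join '\s+', map { quotemeta($_) } split ' ', $phrase;
my $re = qr/$flexible/;

# Hypothetical sample: the first comment is wrapped across two lines,
# the second is on a single line.
my $content = "<!-- changed logic:noMatch to logic:notEqual\n"
            . "   to prevent similar symbols from displaying together -->\n"
            . "<!-- $phrase -->\n";

# Count every match in one statement using /g in list context.
my $count = () = $content =~ /$re/g;
print "count=$count\n";
```

Here both the wrapped and the single-line comment are counted, which a pattern with literal spaces would not do for the wrapped one.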