Solved

Perl script to rip text from an HTML file

Posted on 2010-01-11
Last Modified: 2012-05-08
This request is the same as my earlier resolved EE question: http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_24537501.html
The script was working well until recently. The problem occurred after updating the application to the latest version. I think something has changed in the HTML page, such as the tables. I have attached files below to show the results I used to get and the current ones.

Problems I am facing now:
    1) Endpoints are missing
    2) Byte calculations are wrong


error-received-when-running-scri.txt
NTF.txt
Data-before-updates.htm
Results-before-updates.xls
Data-after-updates.htm
Results-after-updates.xls
Question by:kavlins
3 Comments
 
LVL 28

Expert Comment

by:FishMonger
ID: 26288891
This is a prime example of why using a series of regexes to parse HTML is very fragile and, in almost all cases, the wrong approach.

You should look at using a module designed for parsing HTML, such as HTML::Parser.
http://search.cpan.org/~gaas/HTML-Parser-3.64/Parser.pm

Other module choices can be found here.
http://search.cpan.org/modlist/World_Wide_Web/HTML
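
To give a rough idea of what the HTML::Parser approach could look like, here is a minimal sketch. It assumes the values of interest sit inside a SPAN nested in an A element; the attached pages are not quoted in this thread, so that markup shape, the sample row, and the helper name link_span_texts are all hypothetical. HTML::Parser is a CPAN module and must be installed separately.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the text of every <span> that is nested
# inside an <a>. The markup shape is an assumption, not taken from the
# attached files.
sub link_span_texts {
    my ($html) = @_;
    my @texts;
    my ($in_a, $in_span) = (0, 0);
    my $buffer = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Tag names arrive lowercased, so <SPAN> and <span> look the same.
        start_h => [ sub {
            my ($tag) = @_;
            $in_a++    if $tag eq 'a';
            $in_span++ if $tag eq 'span' && $in_a;
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            if ($tag eq 'span' && $in_span) {
                $in_span--;
                if (!$in_span) {
                    push @texts, $buffer if length $buffer;
                    $buffer = '';
                }
            }
            $in_a-- if $tag eq 'a' && $in_a;
        }, 'tagname' ],
        # Text may be delivered in several chunks, so accumulate it.
        text_h => [ sub {
            my ($text) = @_;
            $buffer .= $text if $in_span;
        }, 'dtext' ],
    );

    $parser->parse($html);
    $parser->eof;
    return @texts;
}

# Made-up sample row, for illustration only.
my $sample = '<TABLE><TR><TD><A HREF="#ep1"><SPAN CLASS="n">Endpoint 1</SPAN></A></TD></TR></TABLE>';
print "$_\n" for link_span_texts($sample);
```

Because the parser tracks elements rather than exact character sequences, adding or removing a wrapper tag around the text is less likely to break the extraction than it is with a hand-written pattern.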
 
LVL 6

Author Comment

by:kavlins
ID: 26288980
I am a novice in Perl, so it will take me some time to grasp those...
 
LVL 6

Accepted Solution

by:
kavlins earned 0 total points
ID: 26457395
I figured it out with help from another source.
I added \<\/SPAN\> to the pattern, so the line now reads $Text =~ /\>([^<]+)\<\/SPAN\>\<\/A\>/;
Everything works as it used to.
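
For anyone hitting the same issue, here is a small sketch of how that pattern behaves. The two sample rows are assumptions about what the page looked like before and after the update (the attachments are not quoted here), and span_link_text is a hypothetical wrapper around the line above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern from the fix: returns the text
# that sits directly before </SPAN></A>, or undef when nothing matches.
sub span_link_text {
    my ($Text) = @_;
    return $Text =~ /\>([^<]+)\<\/SPAN\>\<\/A\>/ ? $1 : undef;
}

# Assumed markup shapes, not copied from the attached files.
my $old_row = '<TD><A HREF="#ep1">Endpoint 1</A></TD>';
my $new_row = '<TD><A HREF="#ep1"><SPAN>Endpoint 1</SPAN></A></TD>';

for my $row ($new_row, $old_row) {
    my $found = span_link_text($row);
    print defined $found ? "matched: $found\n" : "no match\n";
}
```

The pattern only matches when the text is immediately followed by </SPAN></A>, so it is tied to the exact tag sequence and would need to be adjusted again if the page layout changes.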