Solved

Retrieving only the HTML I want

Posted on 2000-04-15
Last Modified: 2010-03-05
Hello, I want to copy the headline news from one of my sites onto another of my sites. I'm trying to use simple and effective code, but I cannot retrieve exactly what I want.
The examples below should make my problem clear...

######################################
FILES I'M USING
#######################################
file: cord.pl
#######################################

#!/usr/bin/perl
use English;
use CGI;
use integer;
require LWP::UserAgent;
#####################
$ua = new LWP::UserAgent;
$the_cgi = CGI->new;
$lookup = new HTTP::Request 'GET', "http://www.elcordillerano.com.ar/hoy/bariloche.htm";
$response = $ua->request($lookup);
@lines = split (/<br>/, $response->content);
$up = "dotted\"\>\<font FACE=\"Arial\"\>\<b\>\<a"; # condition to print
$numheads = "30"; # number of headlines

print "Content-type: text/html\n\n";
$i = 0;
foreach $line (@lines)
 {
   if ($line =~ /$up/ && $i < $numheads)
    {
        print $line;      
            $i += 1;
      }    
  }
exit;
#######################################
file: cord.shtml (SSI)
#######################################
<html>
<head>
<title>Untitled</title>
<BASE href="http://www.elcordillerano.com.ar">
</head>
<body>
<!--#exec cgi="/cgi-local/cord.pl" -->
</body>
</html>
#######################################

WHAT I HAVE so far....
#####################################
<html>
<head>
<title>Untitled</title>
<BASE href="http://www.elcordillerano.com.ar">
</head>
<body>
<html>
####### Here starts SSI (cord.pl) ######
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="AUTHOR" content="Juan Pablo Duprez">
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">
<style>
<!--A{text-decoration:none}
A:hover {color: "#F88a50"}-->
</style>
<title>Bariloche</title>
</head>

<body vlink="#0000FF" alink="#0000FF">
<div align="center"><center>

<table border="0" cellpadding="0" cellspacing="0" width="100%" bgcolor="#FFFFFF">
  <tr>
    <td width="100%" colspan="2" bgcolor="#FFFFFF" height="65"><font color="#FFFF00"><strong><big><p
    align="right"></big><img src="imagenes/encabezados.jpg" alt="encabezados.jpg (4861 bytes)"
    width="105" height="25"><big></p>
    </big></strong></font><hr size="1" noshade color="#000000">
    </td>
  </tr>
  <tr>
    <td width="100%" colspan="2" bgcolor="#000000"><p align="center"><a name="Principio"><font
    face="Verdana" color="#FFC800"><strong><big><big>BARILOCHE</big></big></strong></font></a></td>
  </tr>
  <tr>
    <td width="6%" style="border-bottom: thin dotted"><p align="center"><img
    src="imagenes/flecha_tit.gif" alt="flecha_tit.gif (356 bytes)" width="15" height="15"></td>
    <td width="94%" style="border-bottom: thin dotted"><font FACE="Arial"><b><a
    href="#Una vez más, el SOYEM denunció">Una vez más, el SOYEM denunció
    grave situación financiera en el municipio</a></b></font></td>
  </tr>
  <tr>
    <td width="6%" style="border-bottom: thin dotted" align="center"><img
    src="imagenes/flecha_tit.gif" alt="flecha_tit.gif (356 bytes)" width="15" height="15"></td>
    <td width="94%" style="border-bottom: thin dotted"><font FACE="Arial"><b><a
    href="#Prometen que asfaltarán la Ruta 23">Prometen que asfaltarán la Ruta 23
    entre Bariloche y Pilcaniyeu</a></b></font></td>
  </tr>
  ............other headlines.........
  <tr>
    <td width="6%" style="border-bottom: thin dotted" align="center"><img
    src="imagenes/flecha_tit.gif" alt="flecha_tit.gif (356 bytes)" width="15" height="15"></td>
    <td width="94%" style="border-bottom: thin dotted"><font FACE="Arial"><b><a
    href="#Salud: profesionales deberán">Salud: profesionales deberán
</body>
</html>

The last one is truncated (another line of it is missing here; in the original there is a <br> at that point)

WHAT I want from SSI....
#####################################
<li><a href="#Una vez más, el SOYEM denunció">Una vez más, el SOYEM denunció grave situación financiera en el municipio</a><br>
<li><a href="#Prometen que asfaltarán la Ruta 23">Prometen que asfaltarán la Ruta 23 entre Bariloche y Pilcaniyeu</a><br>
<li>............other headlines.........
<li><a href="#Salud: profesionales deberán">Salud: profesionales deberán xxxxxxxx xxxxxxx xxxxxxx xxxx</a><br>

WITH THE TEXT OF THE LAST LINE COMPLETE...



Could you help me, please?
Thanks.
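
The truncation described above can be reproduced without fetching the page. The following is a minimal sketch, assuming a made-up table row (the sample HTML and the `#Salud` anchor are hypothetical, not copied from the real site) in which the link text contains a `<br>`. It shows that `split /<br>/` cuts the headline in two and that only the first half carries the marker cord.pl tests for.

```perl
use strict;
use warnings;

# Hypothetical sample row (not copied from the real page): the link text
# contains a <br>, as the truncated headline above apparently does.
my $html = join "\n",
    '<td style="border-bottom: thin dotted"><font FACE="Arial"><b><a',
    '    href="#Salud">Salud: profesionales<br>',
    '    deberan reunirse</a></b></font></td>';

# Same split as cord.pl: every <br> ends a chunk.
my @lines = split /<br>/, $html;

# Only the chunks that contain the marker get printed by cord.pl.
my $up = 'dotted"><font FACE="Arial"><b><a';
my @printed = grep { /\Q$up\E/ } @lines;

print "chunks after split: ", scalar(@lines), "\n";
print "printed chunk:\n$printed[0]\n";
print "left behind:\n$lines[1]\n";
```

The second chunk holds the rest of the headline and the closing `</a>`, but it never matches the marker, so it is never printed.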
Question by:milen
14 Comments
 

Author Comment

by:milen
Adjusted points from 50 to 100
 

Author Comment

by:milen
Adjusted points from 100 to 200
 
LVL 2

Expert Comment

by:garfld
This is a simple script that extracts the top story from CNN. It works as posted; maybe you can modify it to get what you want.

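use LWP::Simple;   # provides get()
use Text::Wrap;    # provides wrap()

use CGI qw(:standard);

$form = CGI->new ();

#$tick = $form -> param(tick);

$news = get ("http://www.cnn.com");
$news =~ s/^.*<H3><A href=//s ;   # drop everything up to the top story's link
$news =~ s/FULL STORY.*$//s;      # drop everything from "FULL STORY" onward
$news =~ s/<[^>]+>//g;            # strip the remaining complete HTML tags
$news =~ s/^.*>//s;               # drop whatever is left up to the last ">"

#open (DATA, ">c:/perl/html.txt") ;
#print DATA $news;
#close DATA;

print wrap('', '', $news );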

Betsy
 

Author Comment

by:milen
garfld,
my script works as well. My problem is modifying it to retrieve only what I want...

I think all the work is in the $line =~ s/.../ filters, but I don't know how to work with them.

If you cannot help me with that, do you have any link that would help me understand these filters?

Thanks,
Fito

 
LVL 84

Expert Comment

by:ozo
print "<li>",$line=~m{(<a\s*href.*?)(?:</a>|$)}im,"</a><br>";
 

Author Comment

by:milen
Hi ozo,
you're very close...

#####################
I'm trying with:
#####################
#!/usr/bin/perl
use English;
use CGI;
use integer;
require LWP::UserAgent;
$ua = new LWP::UserAgent;
$the_cgi = CGI->new;
$site = "http://www.elcordillerano.com.ar/hoy/bariloche.htm";
$lookup = new HTTP::Request 'GET', "$site";
$response = $ua->request($lookup);
@lines = split (/<br>/, $response->content);
$up = "dotted\"\>\<font FACE=\"Arial\"\>\<b\>\<a"; # condition
$numheads = "10"; # number of headlines
print "Content-type: text/html\n\n";
$i = 0;
foreach $line (@lines)
 {
   if ($line =~ /$up/ && $i < $numheads)
    {
    print "<li>",$line=~m{(<a\s*href.*?)(?:</a>|$)}im,"</a><br>";
      $i += 1;
    }    
 }
exit;


###############################
WHAT I GET...
###############################
<body>
<li><a href="#Aunque el invierno no sea nevador, familias de">Aunque el invierno no sea nevador, (INCOMPLETE)</a><br>
<li><a href="#Los &quot;blancos&quot; de la UCR piensan que &quot;el">Los &quot;blancos&quot; de
(INCOMPLETE)</a><br>
<li><a href="#Dimes y diretes">Dimes y diretes</a><br>

(A FULL HEADLINE IS MISSING HERE)

<li><a href="#Esgrimen razones para asfaltar la Ruta 23">Esgrimen razones para asfaltar la Ruta (INCOMPLETE)</a><br>
</body>

###############################
WHAT I WANT...
###############################
<li><a href="#Aunque el invierno no sea nevador, familias de">Aunque el invierno no sea nevador,
    familias de<br>
      barrios carenciados piden más leña y querosén</a><br>
<li><a href="#Los &quot;blancos&quot; de la UCR piensan que &quot;el">Los &quot;blancos&quot; de
    la UCR piensan que &quot;el<br>
    Gobierno está en un lugar y los radicales en otro&quot; </a><br>
<li><a href="#Dimes y diretes">Dimes y diretes</a><br>
<li><a href="#Gas zona oeste: analizarán la manera">Gas zona oeste: analizarán la manera<br>
    de pagar las cuotas de instalación</a><br>
<li><a href="#Esgrimen razones para asfaltar la Ruta 23">Esgrimen razones para asfaltar la Ruta
    23</a><br>
</body>
 
LVL 84

Expert Comment

by:ozo
# Then how about changing
@lines = split (/<br>/, $response->content);
#to
@lines = split ('</a>', $response->content);
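
A minimal sketch of what this change does, assuming two made-up headlines (the sample HTML and the `#uno`/`#dos` anchors are hypothetical): splitting on `</a>` ends each chunk where a link ends, so a `<br>` inside the link text no longer cuts the headline.

```perl
use strict;
use warnings;

# Hypothetical sample: the first headline has a <br> inside its link text.
my $html = join "\n",
    '<b><a href="#uno">Primera linea<br>',
    '    segunda linea</a></b>',
    '<b><a href="#dos">Titular corto</a></b>';

# The string '</a>' is used by split as a pattern; it has no metacharacters.
my @chunks = split '</a>', $html;

# Keep only the chunks that actually contain a link.
my @with_link = grep { /<a\s/i } @chunks;

print "chunk $_:\n$with_link[$_]\n\n" for 0 .. $#with_link;
```

Each chunk now ends just before the `</a>` that was removed by the split, so the closing tag has to be printed again by the script.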
 

Author Comment

by:milen
Now the 4th headline appears (the one written with a ":" in the middle),
but the others are still incomplete...

 
LVL 84

Accepted Solution

by:ozo (earned 200 total points)
print "<li>",$line=~m{(<a\s*href.*)}is,"</a><br>\n";
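
The difference between this pattern and the earlier one can be seen on a single chunk. This is a sketch under the assumption that the chunk came from `split '</a>'` and that its link text spans several lines; the sample string and the `#Gas` anchor are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical chunk: the link text continues on a second line.
my $chunk = qq{<td><font FACE="Arial"><b><a\n    href="#Gas">Gas zona oeste<br>\n    de pagar las cuotas};

# Earlier pattern: without /s, "." does not match a newline, and with /m
# "\$" can match before any newline, so the lazy .*? stops at the end of
# the first line of link text.
my ($short) = $chunk =~ m{(<a\s*href.*?)(?:</a>|$)}im;

# Accepted pattern: /s lets "." cross newlines, and the greedy .* runs to
# the end of the chunk.
my ($full) = $chunk =~ m{(<a\s*href.*)}is;

print "earlier pattern captured:\n$short\n\n";
print "accepted pattern captured:\n$full\n";
```

Because the chunk ends where `</a>` used to be, capturing to the end of the chunk and then printing `</a>` restores the complete link.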
 

Author Comment

by:milen
Yes!!!!  Great, ozo!!!
It works perfectly!!!
Sorry for one last question...
What if I don't want the sentences broken the way the original text is? (How can I skip the <br> and make one long sentence out of each headline?)

By the way, in your tip:
"$line=~m{(<a\s*href.*)}is"
what is the trailing "is" for?

Of course the points are yours ;-)
 

Author Comment

by:milen
Sorry, ozo...
I would like to replace my last question with a more important one:
what if I want to change the news anchors?
For example, from:
<a href="#whatever">
to:
<a href="javascript:rt2func(#whatever)">
 

Author Comment

by:milen
Thanks ozo !!!!
 
LVL 84

Expert Comment

by:ozo
/i   Do case-insensitive pattern matching.
/s   allow /./ to match \n

foreach( grep {/$up/ && $i++ < $numheads} split /<\/a>/i, $response->content ){
    s/\s*<br>\s*/ /g;
    s/.*<a\s*href="([^"]*)"(.*)/<li><a href="javascript:rt2func($1)"$2<\/a><br>\n/is;
    print;
}    
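
The loop above can be tried offline. The following sketch runs the same two substitutions on a made-up page fragment instead of `$response->content`; the sample HTML, the `#uno`/`#dos` anchors, and the `@items` array are assumptions added for illustration, and `\Q...\E` is used only to keep the marker literal.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $response->content.
my $content = join "\n",
    '<td style="border-bottom: thin dotted"><font FACE="Arial"><b><a',
    '    href="#uno">Primera linea<br>',
    '    segunda linea</a></b></font></td>',
    '<td style="border-bottom: thin dotted"><font FACE="Arial"><b><a',
    '    href="#dos">Titular corto</a></b></font></td>';

my $up       = 'dotted"><font FACE="Arial"><b><a';   # marker from cord.pl
my $numheads = 10;                                   # maximum headlines
my $i        = 0;
my @items;

for my $chunk (grep { /\Q$up\E/ && $i++ < $numheads } split /<\/a>/i, $content) {
    # Join the pieces of a headline that were separated by <br>.
    $chunk =~ s/\s*<br>\s*/ /g;
    # Drop everything before the link and rewrite the href.
    $chunk =~ s/.*<a\s*href="([^"]*)"(.*)/<li><a href="javascript:rt2func($1)"$2<\/a><br>\n/is;
    push @items, $chunk;
}

print @items;
```

The argument to `rt2func` is emitted exactly as requested in the thread, without quotes; a real page would probably need it quoted to be valid JavaScript.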
 
LVL 84

Expert Comment

by:ozo
/i   Do case-insensitive pattern matching.
/s   allow /./ to match \n
see `perldoc perlre`