milen asked:

Retrieving only the HTML I want

Hello, I want to copy the headline news from one of my sites onto another of my sites. I'm trying to use simple, effective code, but I cannot retrieve exactly what I want.
The examples below should make my problem clear...

######################################
FILES I'M USING
#######################################
file: cord.pl
#######################################

#!/usr/bin/perl
use English;
use CGI;
use integer;
require LWP::UserAgent;
#####################
$ua = new LWP::UserAgent;
$the_cgi = CGI->new;
$lookup = new HTTP::Request 'GET', "http://www.elcordillerano.com.ar/hoy/bariloche.htm";
$response = $ua->request($lookup);
@lines = split (/<br>/, $response->content);
$up = "dotted\"\>\<font FACE=\"Arial\"\>\<b\>\<a"; # condition to print
$numheads = "30"; # number of headlines

print "Content-type: text/html\n\n";
$i = 0;
foreach $line (@lines)
{
    if ($line =~ /$up/ && $i < $numheads)
    {
        print $line;
        $i += 1;
    }
}
exit;
#######################################
file: cord.shtml (SSI)
#######################################
<html>
<head>
<title>Untitled</title>
<BASE href="http://www.elcordillerano.com.ar">
</head>
<body>
<!--#exec cgi="/cgi-local/cord.pl" -->
</body>
</html>
#######################################

WHAT I HAVE until now....
#####################################
<html>
<head>
<title>Untitled</title>
<BASE href="http://www.elcordillerano.com.ar">
</head>
<body>
<html>
####### Here starts SSI (cord.pl) ######
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="AUTHOR" content="Juan Pablo Duprez">
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">
<style>
<!--A{text-decoration:none}
A:hover {color: "#F88a50"}-->
</style>
<title>Bariloche</title>
</head>

<body vlink="#0000FF" alink="#0000FF">
<div align="center"><center>

<table border="0" cellpadding="0" cellspacing="0" width="100%" bgcolor="#FFFFFF">
  <tr>
    <td width="100%" colspan="2" bgcolor="#FFFFFF" height="65"><font color="#FFFF00"><strong><big><p
    align="right"></big><img src="imagenes/encabezados.jpg" alt="encabezados.jpg (4861 bytes)"
    width="105" height="25"><big></p>
    </big></strong></font><hr size="1" noshade color="#000000">
    </td>
  </tr>
  <tr>
    <td width="100%" colspan="2" bgcolor="#000000"><p align="center"><a name="Principio"><font
    face="Verdana" color="#FFC800"><strong><big><big>BARILOCHE</big></big></strong></font></a></td>
  </tr>
  <tr>
    <td width="6%" style="border-bottom: thin dotted"><p align="center"><img
    src="imagenes/flecha_tit.gif" alt="flecha_tit.gif (356 bytes)" width="15" height="15"></td>
    <td width="94%" style="border-bottom: thin dotted"><font FACE="Arial"><b><a
    href="#Una vez más, el SOYEM denunció">Una vez más, el SOYEM denunció
    grave situación financiera en el municipio</a></b></font></td>
  </tr>
  <tr>
    <td width="6%" style="border-bottom: thin dotted" align="center"><img
    src="imagenes/flecha_tit.gif" alt="flecha_tit.gif (356 bytes)" width="15" height="15"></td>
    <td width="94%" style="border-bottom: thin dotted"><font FACE="Arial"><b><a
    href="#Prometen que asfaltarán la Ruta 23">Prometen que asfaltarán la Ruta 23
    entre Bariloche y Pilcaniyeu</a></b></font></td>
  </tr>
  ............other headlines.........
  <tr>
    <td width="6%" style="border-bottom: thin dotted" align="center"><img
    src="imagenes/flecha_tit.gif" alt="flecha_tit.gif (356 bytes)" width="15" height="15"></td>
    <td width="94%" style="border-bottom: thin dotted"><font FACE="Arial"><b><a
    href="#Salud: profesionales deberán">Salud: profesionales deberán
</body>
</html>

The last headline is truncated: its remaining line is missing, because the original has a <br> at that point.

WHAT I want from SSI....
#####################################
<li><a href="#Una vez más, el SOYEM denunció">Una vez más, el SOYEM denunció grave situación financiera en el municipio</a><br>
<li><a href="#Prometen que asfaltarán la Ruta 23">Prometen que asfaltarán la Ruta 23 entre Bariloche y Pilcaniyeu</a><br>
<li>............other headlines.........
<li><a href="#Salud: profesionales deberán">Salud: profesionales deberán xxxxxxxx xxxxxxx xxxxxxx xxxx</a><br>

THE LAST LINE WITH THE TEXT COMPLETE...



Could you help me, please?
Thanks.

milen (ASKER) adjusted points from 50 to 100, then from 100 to 200.

This is a simple script that extracts the top story from CNN. Maybe you can modify it to get what you want. This script works.

use LWP::Simple;
use Text::Wrap;

use CGI qw(:standard);

$form = CGI->new ();

#$tick = $form -> param(tick);

$news = get ("http://www.cnn.com");
$news =~ s/^.*<H3><A href=//s ;
$news =~ s/FULL STORY.*$//s;
$news =~ s/<[^>]+>//g;
$news =~ s/^.*>//s;

#open (DATA, ">c:/perl/html.txt") ;
#print DATA $news;
#close DATA;

print wrap('', '', $news );

Betsy
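
What each substitution in the script above does can be traced on a small inline sample. This is only a sketch: the sub name `top_story` and the sample markup are invented for illustration and do not reflect CNN's real page structure.

```perl
use strict;
use warnings;

# Apply the same four substitutions as the script above to a page held in a string.
sub top_story {
    my ($news) = @_;
    $news =~ s/^.*<H3><A href=//s;   # drop everything up to and including the marker
    $news =~ s/FULL STORY.*$//s;     # drop everything from "FULL STORY" onward
    $news =~ s/<[^>]+>//g;           # strip every complete tag that is left
    $news =~ s/^.*>//s;              # drop the leftover tail of the cut-open <A ...> tag
    return $news;
}

# Hypothetical page fragment, invented for this sketch.
my $sample = qq{<html><body><H3><A href="/story.html">Big headline</A></H3>\n}
           . qq{<p>Summary text. <a href="/story.html">FULL STORY</a></p></body></html>};

print top_story($sample), "\n";
```

The /s modifier on the first, second, and fourth substitutions lets "." cross newlines, so each one can cut across the whole page rather than a single line.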
Avatar of milen

ASKER

garfld,
my script works as well. My problem is modifying it to retrieve only what I want...

I think all the work is in the $line =~ s/// filters, but I don't know how to work with them.

If you cannot help me with that, do you have a link that would help me understand these filters?

Thanks,
Fito

Avatar of ozo
print "<li>",$line=~m{(<a\s*href.*?)(?:</a>|$)}im,"</a><br>";
Avatar of milen

ASKER

Hi ozo,
you're very near...

#####################
I'm trying with:
#####################
#!/usr/bin/perl
use English;
use CGI;
use integer;
require LWP::UserAgent;
$ua = new LWP::UserAgent;
$the_cgi = CGI->new;
$site = "http://www.elcordillerano.com.ar/hoy/bariloche.htm";
$lookup = new HTTP::Request 'GET', "$site";
$response = $ua->request($lookup);
@lines = split (/<br>/, $response->content);
$up = "dotted\"\>\<font FACE=\"Arial\"\>\<b\>\<a"; # condition
$numheads = "10"; # number of headlines
print "Content-type: text/html\n\n";
$i = 0;
foreach $line (@lines)
 {
   if ($line =~ /$up/ && $i < $numheads)
    {
    print "<li>",$line=~m{(<a\s*href.*?)(?:</a>|$)}im,"</a><br>";
      $i += 1;
    }    
 }
exit;


###############################
WHAT I GET...
###############################
<body>
<li><a href="#Aunque el invierno no sea nevador, familias de">Aunque el invierno no sea nevador, (INCOMPLETE)</a><br>
<li><a href="#Los &quot;blancos&quot; de la UCR piensan que &quot;el">Los &quot;blancos&quot; de
(INCOMPLETE)</a><br>
<li><a href="#Dimes y diretes">Dimes y diretes</a><br>

(A WHOLE HEADLINE IS MISSING HERE)

<li><a href="#Esgrimen razones para asfaltar la Ruta 23">Esgrimen razones para asfaltar la Ruta (INCOMPLETE)</a><br>
</body>

###############################
WHAT I WANT...
###############################
<li><a href="#Aunque el invierno no sea nevador, familias de">Aunque el invierno no sea nevador,
    familias de<br>
      barrios carenciados piden más leña y querosén</a><br>
<li><a href="#Los &quot;blancos&quot; de la UCR piensan que &quot;el">Los &quot;blancos&quot; de
    la UCR piensan que &quot;el<br>
    Gobierno está en un lugar y los radicales en otro&quot; </a><br>
<li><a href="#Dimes y diretes">Dimes y diretes</a><br>
<li><a href="#Gas zona oeste: analizarán la manera">Gas zona oeste: analizarán la manera<br>
    de pagar las cuotas de instalación</a><br>
<li><a href="#Esgrimen razones para asfaltar la Ruta 23">Esgrimen razones para asfaltar la Ruta
    23</a><br>
</body>
# Then how about changing
@lines = split (/<br>/, $response->content);
# to
@lines = split ('</a>', $response->content);
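
A minimal sketch of why the separator matters, run against a small inline sample instead of the live page. The sample HTML and the sub name `anchor_chunks` are assumptions made for illustration. Splitting on <br> cuts a headline that wraps with a <br> inside its anchor, while splitting on </a> keeps each anchor's text in one chunk.

```perl
use strict;
use warnings;

# Hypothetical sample: the first headline wraps with a <br> inside its anchor.
my $html = join "\n",
    '<td><a href="#one">First headline,<br>',
    '    continued on a second line</a></td>',
    '<td><a href="#two">Second headline</a></td>';

# Split $text on $separator and keep only the chunks that open an anchor.
sub anchor_chunks {
    my ($separator, $text) = @_;
    return grep { /<a\s+href/i } split /\Q$separator\E/i, $text;
}

for my $separator ('<br>', '</a>') {
    print "split on $separator:\n";
    print "  chunk: $_\n" for anchor_chunks($separator, $html);
}
```

With <br> as the separator, the first chunk ends at "First headline," and the continuation lands in the next chunk; with </a> the first chunk holds the whole headline, including its embedded <br> and newline.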
Avatar of milen

ASKER

Now the 4th headline appears (the one written with a ":" in it...),
but the others are still incomplete...

ASKER CERTIFIED SOLUTION
Avatar of ozo
ozo
Avatar of milen

ASKER

Yes!!!!  Great ozo!!!
It works perfectly!!!
Sorry, one last question:
what if I don't want the sentences broken the way the original text is? (How can I skip the <br> and make each headline one long sentence?)

By the way, in your tip:
"$line=~m{(<a\s*href.*)}is"
what is the trailing "is" for?

Of course, the points are yours ;-)
Avatar of milen

ASKER

Sorry, ozo...
I would like to replace my last question with a more important one:
what if I want to change the news anchors?
From:
<a href="#whatever">
to:
<a href="javascript:rt2func(#whatever)">
Avatar of milen

ASKER

Thanks ozo !!!!
/i   Do case-insensitive pattern matching.
/s   allow /./ to match \n

foreach( grep {/$up/ && $i++ < $numheads} split /<\/a>/i, $response->content ){
    s/\s*<br>\s*/ /g;
    s/.*<a\s*href="([^"]*)"(.*)/<li><a href="javascript:rt2func($1)"$2<\/a><br>\n/is;
    print;
}    
see `perldoc perlre`
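
The effect of the two modifiers can be seen in isolation in a short sketch. The sample string and the sub names `match_i` and `match_is` are assumptions for illustration; the pattern is the one quoted earlier in the thread, and the <br> substitution is the one from the loop above.

```perl
use strict;
use warnings;

# Match with /i only: case-insensitive, but . still stops at a newline.
sub match_i {
    my ($text) = @_;
    my ($found) = $text =~ m{(<a\s*href.*)}i;
    return $found;
}

# Match with /is: case-insensitive, and . also matches a newline.
sub match_is {
    my ($text) = @_;
    my ($found) = $text =~ m{(<a\s*href.*)}is;
    return $found;
}

# Hypothetical headline: upper-case tag, text wrapped with <br> onto a second line.
my $line = qq{<td><A HREF="#x">First part,<br>\n    second part};

my $short = match_i($line);    # capture ends at the first newline
my $full  = match_is($line);   # capture runs to the end of the string

# Join the wrapped headline into one sentence.
(my $joined = $full) =~ s/\s*<br>\s*/ /g;

print "with /i : $short\n";
print "with /is: $full\n";
print "joined  : $joined\n";
```

Without /i the lower-case pattern would not match the upper-case <A HREF at all; without /s the capture stops at the line break, which is the kind of truncation shown in the incomplete headlines earlier in the thread.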