Solved

Retrieving only the HTML I want

Posted on 2000-04-15
Last Modified: 2010-03-05
Hello, I want to show on one of my sites the news headlines from another of my sites. I'm trying to use simple, effective code, but I cannot retrieve exactly what I want.
The examples below should make my problem clear...

######################################
FILES I'M USING
#######################################
file: cord.pl
#######################################

#!/usr/bin/perl
use English;
use CGI;
use integer;
require LWP::UserAgent;
#####################
$ua = new LWP::UserAgent;
$the_cgi = CGI->new;
$lookup = new HTTP::Request 'GET', "http://www.elcordillerano.com.ar/hoy/bariloche.htm";
$response = $ua->request($lookup);
@lines = split (/<br>/, $response->content);
$up = "dotted\"\>\<font FACE=\"Arial\"\>\<b\>\<a"; # condition to print
$numheads = "30"; # number of headlines

print "Content-type: text/html\n\n";
$i = 0;
foreach $line (@lines)
 {
   if ($line =~ /$up/ && $i < $numheads)
    {
        print $line;      
            $i += 1;
      }    
  }
exit;
#######################################
file: cord.shtml (SSI)
#######################################
<html>
<head>
<title>Untitled</title>
<BASE href="http://www.elcordillerano.com.ar">
</head>
<body>
<!--#exec cgi="/cgi-local/cord.pl" -->
</body>
</html>
#######################################

WHAT I HAVE so far....
#####################################
<html>
<head>
<title>Untitled</title>
<BASE href="http://www.elcordillerano.com.ar">
</head>
<body>
<html>
####### Here starts SSI (cord.pl) ######
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="AUTHOR" content="Juan Pablo Duprez">
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">
<style>
<!--A{text-decoration:none}
A:hover {color: "#F88a50"}-->
</style>
<title>Bariloche</title>
</head>

<body vlink="#0000FF" alink="#0000FF">
<div align="center"><center>

<table border="0" cellpadding="0" cellspacing="0" width="100%" bgcolor="#FFFFFF">
  <tr>
    <td width="100%" colspan="2" bgcolor="#FFFFFF" height="65"><font color="#FFFF00"><strong><big><p
    align="right"></big><img src="imagenes/encabezados.jpg" alt="encabezados.jpg (4861 bytes)"
    width="105" height="25"><big></p>
    </big></strong></font><hr size="1" noshade color="#000000">
    </td>
  </tr>
  <tr>
    <td width="100%" colspan="2" bgcolor="#000000"><p align="center"><a name="Principio"><font
    face="Verdana" color="#FFC800"><strong><big><big>BARILOCHE</big></big></strong></font></a></td>
  </tr>
  <tr>
    <td width="6%" style="border-bottom: thin dotted"><p align="center"><img
    src="imagenes/flecha_tit.gif" alt="flecha_tit.gif (356 bytes)" width="15" height="15"></td>
    <td width="94%" style="border-bottom: thin dotted"><font FACE="Arial"><b><a
    href="#Una vez más, el SOYEM denunció">Una vez más, el SOYEM denunció
    grave situación financiera en el municipio</a></b></font></td>
  </tr>
  <tr>
    <td width="6%" style="border-bottom: thin dotted" align="center"><img
    src="imagenes/flecha_tit.gif" alt="flecha_tit.gif (356 bytes)" width="15" height="15"></td>
    <td width="94%" style="border-bottom: thin dotted"><font FACE="Arial"><b><a
    href="#Prometen que asfaltarán la Ruta 23">Prometen que asfaltarán la Ruta 23
    entre Bariloche y Pilcaniyeu</a></b></font></td>
  </tr>
  ............other headlines.........
  <tr>
    <td width="6%" style="border-bottom: thin dotted" align="center"><img
    src="imagenes/flecha_tit.gif" alt="flecha_tit.gif (356 bytes)" width="15" height="15"></td>
    <td width="94%" style="border-bottom: thin dotted"><font FACE="Arial"><b><a
    href="#Salud: profesionales deberán">Salud: profesionales deberán
</body>
</html>

The last one is truncated (the rest of the headline is missing; in the original there is a <br> at that point)

WHAT I want from SSI....
#####################################
<li><a href="#Una vez más, el SOYEM denunció">Una vez más, el SOYEM denunció grave situación financiera en el municipio</a><br>
<li><a href="#Prometen que asfaltarán la Ruta 23">Prometen que asfaltarán la Ruta 23 entre Bariloche y Pilcaniyeu</a><br>
<li>............other headlines.........
<li><a href="#Salud: profesionales deberán">Salud: profesionales deberán xxxxxxxx xxxxxxx xxxxxxx xxxx</a><br>

THE LAST LINE WITH THE TEXT COMPLETE...



Could you help me, please?
Thanks.
0
Comment
Question by:milen
 

Author Comment

by:milen
ID: 2719160
Adjusted points from 50 to 100
0
 

Author Comment

by:milen
ID: 2720325
Adjusted points from 100 to 200
0
 
LVL 2

Expert Comment

by:garfld
ID: 2721388
This is a simple script that extracts the top story from CNN. Maybe you can modify it to get what you want. This script works.

use LWP::Simple;
use Text::Wrap;

use CGI qw(:standard);

$form = CGI->new ();

#$tick = $form -> param(tick);

$news = get ("http://www.cnn.com");
$news =~ s/^.*<H3><A href=//s ;
$news =~ s/FULL STORY.*$//s;
$news =~ s/<[^>]+>//g;
$news =~ s/^.*>//s;

#open (DATA, ">c:/perl/html.txt") ;
#print DATA $news;
#close DATA;

print wrap('', '', $news );

Betsy
0
 

Author Comment

by:milen
ID: 2722285
garfld,
my script works as well. My problem is modifying it to retrieve only what I want...

I think all the work is in the $line =~ s/// filters, but I don't know how to use them well.

If you cannot help me with that, do you have a link that would help me understand these filters?

Thanks,
Fito

0
 
LVL 84

Expert Comment

by:ozo
ID: 2722397
print "<li>",$line=~m{(<a\s*href.*?)(?:</a>|$)}im,"</a><br>";
0
 

Author Comment

by:milen
ID: 2722613
Hi ozo,
you're very near...

#####################
I'm trying with:
#####################
#!/usr/bin/perl
use English;
use CGI;
use integer;
require LWP::UserAgent;
$ua = new LWP::UserAgent;
$the_cgi = CGI->new;
$site = "http://www.elcordillerano.com.ar/hoy/bariloche.htm";
$lookup = new HTTP::Request 'GET', "$site";
$response = $ua->request($lookup);
@lines = split (/<br>/, $response->content);
$up = "dotted\"\>\<font FACE=\"Arial\"\>\<b\>\<a"; # condition
$numheads = "10"; # number of headlines
print "Content-type: text/html\n\n";
$i = 0;
foreach $line (@lines)
 {
   if ($line =~ /$up/ && $i < $numheads)
    {
    print "<li>",$line=~m{(<a\s*href.*?)(?:</a>|$)}im,"</a><br>";
      $i += 1;
    }    
 }
exit;


###############################
WHAT I GET...
###############################
<body>
<li><a href="#Aunque el invierno no sea nevador, familias de">Aunque el invierno no sea nevador, (INCOMPLETE)</a><br>
<li><a href="#Los &quot;blancos&quot; de la UCR piensan que &quot;el">Los &quot;blancos&quot; de
(INCOMPLETE)</a><br>
<li><a href="#Dimes y diretes">Dimes y diretes</a><br>

(A FULL HEADLINE IS MISSING HERE)

<li><a href="#Esgrimen razones para asfaltar la Ruta 23">Esgrimen razones para asfaltar la Ruta (INCOMPLETE)</a><br>
</body>

###############################
WHAT I WANT...
###############################
<li><a href="#Aunque el invierno no sea nevador, familias de">Aunque el invierno no sea nevador,
    familias de<br>
      barrios carenciados piden más leña y querosén</a><br>
<li><a href="#Los &quot;blancos&quot; de la UCR piensan que &quot;el">Los &quot;blancos&quot; de
    la UCR piensan que &quot;el<br>
    Gobierno está en un lugar y los radicales en otro&quot; </a><br>
<li><a href="#Dimes y diretes">Dimes y diretes</a><br>
<li><a href="#Gas zona oeste: analizarán la manera">Gas zona oeste: analizarán la manera<br>
    de pagar las cuotas de instalación</a><br>
<li><a href="#Esgrimen razones para asfaltar la Ruta 23">Esgrimen razones para asfaltar la Ruta
    23</a><br>
</body>
0
 
LVL 84

Expert Comment

by:ozo
ID: 2723799
#then How about changing
@lines = split (/<br>/, $response->content);
#to
@lines = split ('</a>', $response->content);
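
The difference between the two separators can be seen without fetching the page. The following is only a sketch: the `$html` string is a made-up sample modeled on the markup quoted in the question, not the live page. It shows that splitting on `<br>` cuts a headline in two wherever the headline itself contains a `<br>`, while splitting on `</a>` keeps each anchor's text in one piece.

```perl
use strict;
use warnings;

# Hypothetical sample modeled on the markup quoted in the question.
my $html = join '',
    qq{<td style="border-bottom: thin dotted"><font FACE="Arial"><b><a\n},
    qq{href="#Gas zona oeste">Gas zona oeste: la manera<br>\n},
    qq{de pagar las cuotas</a></b></font></td>\n},
    qq{<td style="border-bottom: thin dotted"><font FACE="Arial"><b><a\n},
    qq{href="#Dimes y diretes">Dimes y diretes</a></b></font></td>\n};

my @by_br = split /<br>/,  $html;   # cuts inside the first headline
my @by_a  = split /<\/a>/, $html;   # cuts after each complete anchor text

printf "split on <br>:  %d pieces\n", scalar @by_br;
printf "split on </a>: %d pieces\n", scalar @by_a;
print  "first </a> piece:\n$by_a[0]\n";
```

Note that the `<br>` inside the headline is still present in the `</a>` pieces; it is only no longer used as a cut point.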
0
 

Author Comment

by:milen
ID: 2723964
Now the 4th headline appears (the one written with a ":" in it...)
but the others are still incomplete...

0
 
LVL 84

Accepted Solution

by:
ozo earned 200 total points
ID: 2724061
print "<li>",$line=~m{(<a\s*href.*)}is,"</a><br>\n";
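
A likely reason the earlier pattern truncated headlines: under `/m`, `$` matches before every newline, and without `/s` the `.` cannot cross a newline, so the non-greedy `.*?` stops at the end of the first source line. The sketch below compares both patterns on a hypothetical chunk (the `$line` value is invented for illustration, shaped like a piece produced by splitting on `</a>`).

```perl
use strict;
use warnings;

# Hypothetical chunk: the headline text wraps onto a second source line.
my $line = qq{<b><a\nhref="#Ruta 23">Esgrimen razones para asfaltar la Ruta\n    23};

# Earlier pattern: /m lets $ match before the newline, so the capture stops early.
my ($first)  = $line =~ m{(<a\s*href.*?)(?:</a>|$)}im;

# Accepted pattern: /s lets . match newlines, so the capture runs to the end of the chunk.
my ($second) = $line =~ m{(<a\s*href.*)}is;

print "earlier pattern:  [$first]\n";
print "accepted pattern: [$second]\n";
```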
0
 

Author Comment

by:milen
ID: 2724125
Yes!!!!  Great ozo!!!
It works perfectly!!!
Sorry, one last question...
what if I don't want the sentences broken the way the original text is? (How can I skip the <br> and make each headline a single long sentence?)

By the way, in your tip:
"$line=~m{(<a\s*href.*)}is"
what is the trailing "is" for?

of course points are yours ;-)
0
 

Author Comment

by:milen
ID: 2726337
Sorry, ozo...
I would like to swap my last question for a more important one:
what if I want to rewrite the news anchors?
For example, from:
<a href="#whatever">
to:
<a href="javascript:rt2func(#whatever)">
0
 

Author Comment

by:milen
ID: 2727074
Thanks ozo !!!!
0
 
LVL 84

Expert Comment

by:ozo
ID: 2731638
/i   Do case-insensitive pattern matching.
/s   allow /./ to match \n

foreach( grep {/$up/ && $i++ < $numheads} split /<\/a>/i, $response->content ){
    s/\s*<br>\s*/ /g;
    s/.*<a\s*href="([^"]*)"(.*)/<li><a href="javascript:rt2func($1)"$2<\/a><br>\n/is;
    print;
}    
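
To try the loop above without network access, it can be run against a fixed string. This is a sketch under assumptions: `$content` is an invented stand-in for `$response->content`, and each chunk is copied into `$item` and collected in `@items` (both names introduced here) so the result can be inspected.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $response->content.
my $content = join '',
    qq{<td width="94%" style="border-bottom: thin dotted"><font FACE="Arial"><b><a\n},
    qq{    href="#Gas zona oeste">Gas zona oeste: la manera<br>\n},
    qq{    de pagar las cuotas</a></b></font></td>\n},
    qq{<td width="94%" style="border-bottom: thin dotted"><font FACE="Arial"><b><a\n},
    qq{    href="#Dimes y diretes">Dimes y diretes</a></b></font></td>\n};

my $up       = qr/dotted"><font FACE="Arial"><b><a/;  # same marker text as in cord.pl
my $numheads = 10;                                     # maximum number of headlines
my $i        = 0;
my @items;

for my $chunk ( grep { /$up/ && $i++ < $numheads } split /<\/a>/i, $content ) {
    my $item = $chunk;
    $item =~ s/\s*<br>\s*/ /g;   # rejoin a headline that was broken with <br>
    # Drop everything before the anchor, rewrite the href, close the tag.
    $item =~ s/.*<a\s*href="([^"]*)"(.*)/<li><a href="javascript:rt2func($1)"$2<\/a><br>\n/is;
    push @items, $item;
}

print @items;
```

The rewritten href follows the form asked for in the question, `rt2func(#whatever)`. In a browser the argument would most likely need to be quoted as a JavaScript string for the call to work.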
0
 
LVL 84

Expert Comment

by:ozo
ID: 2731647
/i   Do case-insensitive pattern matching.
/s   allow /./ to match \n
see `perldoc perlre`
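
A small sketch of the two modifiers in isolation, using an invented `$text` value rather than anything from the page:

```perl
use strict;
use warnings;

# Hypothetical input: upper-case tag, text spanning two lines.
my $text = qq{<A HREF="#x">first line\nsecond line};

my $plain = ( $text =~ /<a href/  ) ? 1 : 0;   # case-sensitive: does not match <A HREF
my $ci    = ( $text =~ /<a href/i ) ? 1 : 0;   # /i: matches regardless of case

my ($no_s)   = $text =~ /(first.*)/;    # without /s, . stops at the newline
my ($with_s) = $text =~ /(first.*)/s;   # with /s, . also matches the newline

print "plain=$plain ci=$ci\n";
print "without /s: [$no_s]\n";
print "with /s:    [$with_s]\n";
```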
0

Question has a verified solution.