Solved

Retrieving only the HTML I want

Posted on 2000-04-15
14
187 Views
Last Modified: 2010-03-05
Hello, I want to copy the headline news from one of my sites into another of my sites. I'm trying to use simple and effective code, but I cannot retrieve exactly what I want.
Looking at the examples below, you'll understand my problem...

######################################
FILES I'M USING
#######################################
file: cord.pl
#######################################

#!/usr/bin/perl
use English;
use CGI;
use integer;
require LWP::UserAgent;
#####################
$ua = new LWP::UserAgent;
$the_cgi = CGI->new;
$lookup = new HTTP::Request 'GET', "http://www.elcordillerano.com.ar/hoy/bariloche.htm";
$response = $ua->request($lookup);
@lines = split (/<br>/, $response->content);
$up = "dotted\"\>\<font FACE=\"Arial\"\>\<b\>\<a"; # condition to print
$numheads = "30"; # number of headlines

print "Content-type: text/html\n\n";
$i = 0;
foreach $line (@lines)
 {
   if ($line =~ /$up/ && $i < $numheads)
    {
        print $line;      
            $i += 1;
      }    
  }
exit;
#######################################
file: cord.shtml (SSI)
#######################################
<html>
<head>
<title>Untitled</title>
<BASE href="http://www.elcordillerano.com.ar">
</head>
<body>
<!--#exec cgi="/cgi-local/cord.pl" -->
</body>
</html>
#######################################

WHAT I HAVE so far....
#####################################
<html>
<head>
<title>Untitled</title>
<BASE href="http://www.elcordillerano.com.ar">
</head>
<body>
<html>
####### Here starts SSI (cord.pl) ######
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="AUTHOR" content="Juan Pablo Duprez">
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">
<style>
<!--A{text-decoration:none}
A:hover {color: "#F88a50"}-->
</style>
<title>Bariloche</title>
</head>

<body vlink="#0000FF" alink="#0000FF">
<div align="center"><center>

<table border="0" cellpadding="0" cellspacing="0" width="100%" bgcolor="#FFFFFF">
  <tr>
    <td width="100%" colspan="2" bgcolor="#FFFFFF" height="65"><font color="#FFFF00"><strong><big><p
    align="right"></big><img src="imagenes/encabezados.jpg" alt="encabezados.jpg (4861 bytes)"
    width="105" height="25"><big></p>
    </big></strong></font><hr size="1" noshade color="#000000">
    </td>
  </tr>
  <tr>
    <td width="100%" colspan="2" bgcolor="#000000"><p align="center"><a name="Principio"><font
    face="Verdana" color="#FFC800"><strong><big><big>BARILOCHE</big></big></strong></font></a></td>
  </tr>
  <tr>
    <td width="6%" style="border-bottom: thin dotted"><p align="center"><img
    src="imagenes/flecha_tit.gif" alt="flecha_tit.gif (356 bytes)" width="15" height="15"></td>
    <td width="94%" style="border-bottom: thin dotted"><font FACE="Arial"><b><a
    href="#Una vez más, el SOYEM denunció">Una vez más, el SOYEM denunció
    grave situación financiera en el municipio</a></b></font></td>
  </tr>
  <tr>
    <td width="6%" style="border-bottom: thin dotted" align="center"><img
    src="imagenes/flecha_tit.gif" alt="flecha_tit.gif (356 bytes)" width="15" height="15"></td>
    <td width="94%" style="border-bottom: thin dotted"><font FACE="Arial"><b><a
    href="#Prometen que asfaltarán la Ruta 23">Prometen que asfaltarán la Ruta 23
    entre Bariloche y Pilcaniyeu</a></b></font></td>
  </tr>
  ............other headlines.........
  <tr>
    <td width="6%" style="border-bottom: thin dotted" align="center"><img
    src="imagenes/flecha_tit.gif" alt="flecha_tit.gif (356 bytes)" width="15" height="15"></td>
    <td width="94%" style="border-bottom: thin dotted"><font FACE="Arial"><b><a
    href="#Salud: profesionales deberán">Salud: profesionales deberán
</body>
</html>

The last one is truncated (another line is missing here; in the original there is a <br> at that point)

WHAT I want from SSI....
#####################################
<li><a href="#Una vez más, el SOYEM denunció">Una vez más, el SOYEM denunció grave situación financiera en el municipio</a><br>
<li><a href="#Prometen que asfaltarán la Ruta 23">Prometen que asfaltarán la Ruta 23 entre Bariloche y Pilcaniyeu</a><br>
<li>............other headlines.........
<li><a href="#Salud: profesionales deberán">Salud: profesionales deberán xxxxxxxx xxxxxxx xxxxxxx xxxx</a><br>

THE LAST LINE WITH THE TEXT COMPLETE...



Could you help me, please?
Thanks.
Question by:milen
14 Comments
 

Author Comment

by:milen
ID: 2719160
Adjusted points from 50 to 100
0
 

Author Comment

by:milen
ID: 2720325
Adjusted points from 100 to 200
0
 
LVL 2

Expert Comment

by:garfld
ID: 2721388
This is a simple script that extracts the top story from cnn. Maybe you can modify it to get what you want. This script works.

use LWP::Simple;
use Text::Wrap;

use CGI qw(:standard);

$form = CGI->new ();

#$tick = $form -> param(tick);

$news = get ("http://www.cnn.com");
$news =~ s/^.*<H3><A href=//s ;
$news =~ s/FULL STORY.*$//s;
$news =~ s/<[^>]+>//g;
$news =~ s/^.*>//s;

#open (DATA, ">c:/perl/html.txt") ;
#print DATA $news;
#close DATA;

print wrap('', '', $news );

Betsy
0

 

Author Comment

by:milen
ID: 2722285
garfld,
my script works as well. My problem is modifying it to retrieve only what I want...

I think all the work is in the $line =~ s/// filters, but I don't really know how to use them.

If you cannot help me with it, do you have a link that would help me understand these filters?

Thanks,
Fito

0
 
LVL 84

Expert Comment

by:ozo
ID: 2722397
print "<li>",$line=~m{(<a\s*href.*?)(?:</a>|$)}im,"</a><br>";
0
 

Author Comment

by:milen
ID: 2722613
Hi ozo,
you're very close...

#####################
I'm trying with:
#####################
#!/usr/bin/perl
use English;
use CGI;
use integer;
require LWP::UserAgent;
$ua = new LWP::UserAgent;
$the_cgi = CGI->new;
$site = "http://www.elcordillerano.com.ar/hoy/bariloche.htm";
$lookup = new HTTP::Request 'GET', "$site";
$response = $ua->request($lookup);
@lines = split (/<br>/, $response->content);
$up = "dotted\"\>\<font FACE=\"Arial\"\>\<b\>\<a"; # condition
$numheads = "10"; # number of headlines
print "Content-type: text/html\n\n";
$i = 0;
foreach $line (@lines)
 {
   if ($line =~ /$up/ && $i < $numheads)
    {
    print "<li>",$line=~m{(<a\s*href.*?)(?:</a>|$)}im,"</a><br>";
      $i += 1;
    }    
 }
exit;


###############################
WHAT I GET...
###############################
<body>
<li><a href="#Aunque el invierno no sea nevador, familias de">Aunque el invierno no sea nevador, (INCOMPLETE)</a><br>
<li><a href="#Los &quot;blancos&quot; de la UCR piensan que &quot;el">Los &quot;blancos&quot; de
(INCOMPLETE)</a><br>
<li><a href="#Dimes y diretes">Dimes y diretes</a><br>

(A FULL HEADLINE IS MISSING HERE)

<li><a href="#Esgrimen razones para asfaltar la Ruta 23">Esgrimen razones para asfaltar la Ruta (INCOMPLETE)</a><br>
</body>

###############################
WHAT I WANT...
###############################
<li><a href="#Aunque el invierno no sea nevador, familias de">Aunque el invierno no sea nevador,
    familias de<br>
      barrios carenciados piden más leña y querosén</a><br>
<li><a href="#Los &quot;blancos&quot; de la UCR piensan que &quot;el">Los &quot;blancos&quot; de
    la UCR piensan que &quot;el<br>
    Gobierno está en un lugar y los radicales en otro&quot; </a><br>
<li><a href="#Dimes y diretes">Dimes y diretes</a><br>
<li><a href="#Gas zona oeste: analizarán la manera">Gas zona oeste: analizarán la manera<br>
    de pagar las cuotas de instalación</a><br>
<li><a href="#Esgrimen razones para asfaltar la Ruta 23">Esgrimen razones para asfaltar la Ruta
    23</a><br>
</body>
0
 
LVL 84

Expert Comment

by:ozo
ID: 2723799
#then How about changing
@lines = split (/<br>/, $response->content);
#to
@lines = split ('</a>', $response->content);
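
The effect of changing the separator can be tried offline. This is only a sketch: the `$html` string is made-up sample data modeled on the page in the question, and the array names are hypothetical. Note that `split` treats its first argument as a pattern even when it is written as a quoted string.

```perl
use strict;
use warnings;

# Made-up sample: two links, the first with a <br> inside its text.
my $html = '<a href="#one">Gas zona oeste<br>de pagar las cuotas</a>'
         . '<a href="#two">Dimes y diretes</a>';

# Splitting on <br> separates the two halves of the first headline.
my @by_br = split /<br>/, $html;

# Splitting on </a> keeps each headline whole; the closing tag itself
# is consumed by split and has to be printed back afterwards.
my @by_a = split m{</a>}, $html;

print "first chunk by <br>: $by_br[0]\n";
print "first chunk by </a>: $by_a[0]\n";
```

Splitting on the closing tag alone does not complete every headline, because the extraction pattern still has to cope with line breaks inside each chunk.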
0
 

Author Comment

by:milen
ID: 2723964
Now the 4th headline appears (the one written with a ":" in it),
but the others are still incomplete...

0
 
LVL 84

Accepted Solution

by:
ozo earned 200 total points
ID: 2724061
print "<li>",$line=~m{(<a\s*href.*)}is,"</a><br>\n";
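
A small offline comparison may help show why this pattern succeeds where the earlier one cut headlines short. It is only a sketch: the sample chunk is made up to resemble the page in the question, and the sub names `first_attempt` and `accepted` are hypothetical.

```perl
use strict;
use warnings;

# Sample chunk as produced by splitting the page on '</a>' (made-up data
# modeled on the markup in the question; the headline spans two lines).
my $chunk = qq{<td><font FACE="Arial"><b><a\n    href="#Ruta 23">Prometen que asfaltaran la Ruta 23\n    entre Bariloche y Pilcaniyeu};

# Earlier attempt: without /s the dot stops at a newline, and with /m
# the '$' alternative matches at the end of the first line.
sub first_attempt {
    my ($text) = @_;
    my ($link) = $text =~ m{(<a\s*href.*?)(?:</a>|$)}im;
    return $link;
}

# Accepted pattern: /s lets the dot cross newlines, so the capture
# runs to the end of the chunk.
sub accepted {
    my ($text) = @_;
    my ($link) = $text =~ m{(<a\s*href.*)}is;
    return $link;
}

print "first attempt: ", first_attempt($chunk), "\n";
print "accepted:      ", accepted($chunk), "\n";
```

The first sub stops after "Ruta 23", while the second keeps the continuation line as well.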
0
 

Author Comment

by:milen
ID: 2724125
Yes!!!!  Great ozo!!!
It works perfectly!!!
Sorry, one last question...
what if I don't want to break the sentences the way the original text does? (How can I skip the <br> and make one long sentence out of each headline?)

By the way in your tip:
"$line=~m{(<a\s*href.*)}is"
what is the final "is" for?

of course points are yours ;-)
0
 

Author Comment

by:milen
ID: 2726337
Sorry, ozo...
I would like to replace my last question with a more important one:
what if I want to change the news anchors?
For example, from:
<a href="#whatever">
to:
<a href="javascript:rt2func(#whatever)">
0
 

Author Comment

by:milen
ID: 2727074
Thanks ozo !!!!
0
 
LVL 84

Expert Comment

by:ozo
ID: 2731638
/i   Do case-insensitive pattern matching.
/s   allow /./ to match \n

foreach( grep {/$up/ && $i++ < $numheads} split /<\/a>/i, $response->content ){
    s/\s*<br>\s*/ /g;
    s/.*<a\s*href="([^"]*)"(.*)/<li><a href="javascript:rt2func($1)"$2<\/a><br>\n/is;
    print;
}    
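
The same loop can be tried without fetching the page. The sketch below wraps it in a hypothetical helper, `format_headlines`, whose arguments stand in for `$response->content`, `$up` and `$numheads`; the sample HTML is made up to resemble the page in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: same steps as the loop above, returning the
# formatted items instead of printing them.
sub format_headlines {
    my ($html, $up, $max) = @_;
    my $count = 0;
    my @items;
    for my $chunk (split m{</a>}i, $html) {
        next unless $chunk =~ /$up/;     # keep headline rows only
        last if $count++ >= $max;        # stop after $max headlines
        $chunk =~ s/\s*<br>\s*/ /g;      # join a headline broken by <br>
        $chunk =~ s{.*<a\s*href="([^"]*)"(.*)}
                   {<li><a href="javascript:rt2func($1)"$2</a><br>\n}is;
        push @items, $chunk;
    }
    return @items;
}

# Made-up sample data: one headline row split over several lines.
my $up   = qr/dotted"><font FACE="Arial"><b><a/;
my $html = qq{<td style="border-bottom: thin dotted"><font FACE="Arial"><b><a\n}
         . qq{    href="#Gas zona oeste">Gas zona oeste: analizaran la manera<br>\n}
         . qq{    de pagar las cuotas</a></b></font></td>};

print format_headlines($html, $up, 10);
```

The anchor is passed to `rt2func` unquoted, exactly as requested in the question; an actual JavaScript call would presumably need the argument quoted.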
0
 
LVL 84

Expert Comment

by:ozo
ID: 2731647
/i   Do case-insensitive pattern matching.
/s   allow /./ to match \n
see `perldoc perlre`
0
