links scraper

I want to extract "Aaliyah" from the anchor tag, that is, the text between the anchor tags: <a>text</a>

The web page has a bunch of td entries like the one below, and I want to extract the anchor text from each of them using PHP/cURL, or maybe Perl. I think all the entries have the class v55, so maybe I can use that as the selection criterion.

<td width="33%" align="left" valign="top">
<a class="v55" href=-Aaliyah-1123.html> Aaliyah</a> <span class=v11>(5ft 5in)</span>
</td>


All the extracted text strings can be dumped to a file like:

text1
text2
text3

and so on
Vlearns Asked:

R-Byter Commented:
Take a look at the solution here and tell me if it helps.

http://www.dreamincode.net/forums/topic/84389-having-trouble-with-preg-match-all/

Regards
Justin Mathews Commented:
Assuming your html source is in source.html try this Perl command. The output will be in output.txt:

perl -e '$/=undef;open IN, "source.html";$lines=<IN>;print "$1\n" while $lines =~ s/<a\s*class="v55"[^>]+>([^<]+)<//is' >output.txt
Ray Paseur Commented:
What is the URL of the web page you want to scrape?  I'll show you how to do it.
Vlearns Author Commented:
#!/usr/bin/perl
my $cmd = `curl 'http://www.celebheights.com/s/A.html' > A.html`;
$cmd = `perl -e '$/=undef;open IN, 'A.html';$lines=<IN>;print "$1\n" while $lines =~ s/<a\s*class="v55"[^>]+>([^<]+)<//is' >output.txt`;


The above did not work.

Basically, I am trying to get all celeb names beginning with A into a file A.txt,

and the same for B, C, etc.

Justin Mathews Commented:
There is a mismatched single quote. Try this:

#!/usr/bin/perl
# Backticks interpolate like double quotes, so the inner script's $ and \ are escaped.
my $cmd = `curl 'http://www.celebheights.com/s/A.html' > A.html`;
$cmd = `perl -e '\$/=undef;open IN, "A.html";\$lines=<IN>;print "\$1\\n" while \$lines =~ s/<a\\s*class="v55"[^>]+>([^<]+)<//is' >output.txt`;
Vlearns Author Commented:
Nothing in output.txt ...
Justin Mathews Commented:
Can you post the content of A.html?
Vlearns Author Commented:
A.html
Justin Mathews Commented:
The class name is v11 and not v55 as you stated. So the script should be:

#!/usr/bin/perl
# Backticks interpolate like double quotes, so the inner script's $ and \ are escaped.
my $cmd = `curl 'http://www.celebheights.com/s/A.html' > A.html`;
$cmd = `perl -e '\$/=undef;open IN, "A.html";\$lines=<IN>;print "\$1\\n" while \$lines =~ s/<a\\s*class="v11"[^>]+>([^<]+)<//is' >output.txt`;
Ray Paseur Commented:
This tests out correctly.  You might want to trim() the names - it looks like the single names have a blank in front of them.
<?php // RAY_temp_vlearns.php
error_reporting(E_ALL);
echo "<pre>";


// SCRAPE ANCHOR TAG TEXT INFORMATION OUT OF THIS:
$url = 'http://www.celebheights.com/s/A.html';


$str = file_get_contents($url);

// STANDARDIZE WHITESPACE
$str = preg_replace('/\s\s+/', ' ', $str);

// PREPARE A REGEX
$rgx
= '#'                             // THE REGEX DELIMITER
. '(\<a class="v11" href=.*?\>)'  // GROUP 1: THE OPEN ANCHOR TAG
. '(.*?)'                         // GROUP 2: THE UNGREEDY "MATCH ANY STRING"
. '(\</a\>)'                      // GROUP 3: THE END ANCHOR TAG
. '#'                             // THE REGEX DELIMITER
. 'is'                            // CASE-INSENSITIVE, TREAT STRING AS A SINGLE LINE
;

// DECLOP THE HTML
preg_match_all($rgx, $str, $arr);

// ACTIVATE THIS AND USE 'VIEW SOURCE' TO SEE ALL THE REGEX OUTPUT
// var_dump($arr);

// CREATE A FILE FROM GROUP 2
$new = implode(PHP_EOL , $arr[2]);
echo $new;
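
Following the trim() suggestion above, here is a minimal sketch of cleaning the captured names before they are written out. The helper name extract_names(), the sample markup, and the second href are hypothetical, made up so the snippet runs without fetching the page; the regex is the same idea as the one in the script, reduced to a single capture group.

```php
<?php
// Hypothetical helper: pull the anchor text for a given class out of an HTML string.
function extract_names(string $html, string $class = 'v11'): array
{
    // Same pattern idea as the script above, with the class name quoted safely.
    $rgx = '#<a class="' . preg_quote($class, '#') . '" href=.*?>(.*?)</a>#is';
    preg_match_all($rgx, $html, $arr);

    // trim() each captured name, then drop any that end up empty.
    $names = array_map('trim', $arr[1]);
    return array_values(array_filter($names, 'strlen'));
}

// Made-up sample markup in the shape shown in the question.
$sample = '<td><a class="v11" href=-Aaliyah-1123.html> Aaliyah</a> <span class=v11>(5ft 5in)</span></td>'
        . '<td><a class="v11" href=-Aaron-Eckhart-22.html>Aaron Eckhart</a></td>';

echo implode(PHP_EOL, extract_names($sample)), PHP_EOL;
```

Run on the sample, this prints the two names one per line, without the leading blank in front of "Aaliyah". In the real script the same array_map('trim', ...) step could be applied to $arr[2] before the implode().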


Vlearns Author Commented:
Output is still blank with the v11 version of the Perl script. Does it work for you?
Vlearns Author Commented:
The PHP script works, but it adds a <pre> tag to the first entry. See attached.
Vlearns Author Commented:
output for A
output
Vlearns Author Commented:
output for B
outputB
Justin Mathews Commented:
I am getting the output. Try giving the full path to A.html like:
#!/usr/bin/perl
# Backticks interpolate like double quotes, so the inner script's $ and \ are escaped.
my $cmd = `curl 'http://www.celebheights.com/s/A.html' > A.html`;
$cmd = `perl -e '\$/=undef;open IN, "/home/mydir/A.html";\$lines=<IN>;print "\$1\\n" while \$lines =~ s/<a\\s*class="v11"[^>]+>([^<]+)<//is' >output.txt`;
Vlearns Author Commented:
Can I automate your PHP script so that I get all the lists from A to Z as files A.txt, B.txt, up to Z.txt?

thanks!
Ray Paseur Commented:
Sure, you can do that.  And you can remove the "<pre>" tag if you want.  It's on line 3.
Ray Paseur Commented:

<?php // RAY_temp_vlearns.php
error_reporting(E_ALL);


// SCRAPE ANCHOR TAG TEXT INFORMATION OUT OF THIS:
$url = 'http://www.celebheights.com/s/??.html';


$alphabet = range('A', 'Z');
$celebs = array();
foreach ($alphabet as $letter)
{
    $current_url = str_replace('??', $letter, $url);

    $str = file_get_contents($current_url);

    // STANDARDIZE WHITESPACE
    $str = preg_replace('/\s\s+/', ' ', $str);

    // PREPARE A REGEX
    $rgx
    = '#'                             // THE REGEX DELIMITER
    . '(\<a class="v11" href=.*?\>)'  // GROUP 1: THE OPEN ANCHOR TAG
    . '(.*?)'                         // GROUP 2: THE UNGREEDY "MATCH ANY STRING"
    . '(\</a\>)'                      // GROUP 3: THE END ANCHOR TAG
    . '#'                             // THE REGEX DELIMITER
    . 'is'                            // CASE-INSENSITIVE, TREAT STRING AS A SINGLE LINE
    ;

    // DECLOP THE HTML
    preg_match_all($rgx, $str, $arr);

    // GATHER THE RESULTS
    $celebs = array_merge($celebs, $arr[2]);

    set_time_limit(10);
    echo "<br/>$letter" . PHP_EOL;
}

// SHOW THE WORK PRODUCT
print_r($celebs);

Vlearns Author Commented:
Thanks a lot, Ray. Can we write the celebs by name to a file per letter, say A.html, B.html? I want to create a PHP-based web page that has tabs at the top for A, B, C, etc., and when I click a tab, I want to load the corresponding file (A.html for tab A) from disk and display it.

thanks

Ray Paseur Commented:
The last part about writing to a file...

Use implode(PHP_EOL, $celebs); to create a string.
Use file_put_contents() to write the string to your file.

Hope that wraps it up for you, ~Ray
Ray Paseur Commented:
... Or write the file for each letter.  In any case, the function is file_put_contents() documented here:
http://us.php.net/manual/en/function.file-put-contents.php
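
As a rough illustration of the "write the file for each letter" idea, here is a minimal sketch built on implode() and file_put_contents(). The helper name write_names_file(), the sample names, and the choice of the system temp directory as the output location are assumptions made for the example, not part of the script above.

```php
<?php
// Hypothetical helper: write one letter's names to "<dir>/<letter>.txt".
function write_names_file(string $dir, string $letter, array $names)
{
    $path = $dir . DIRECTORY_SEPARATOR . $letter . '.txt';

    // One name per line, with a trailing newline.
    $text = implode(PHP_EOL, $names) . PHP_EOL;

    // file_put_contents() returns the number of bytes written, or false on failure.
    return file_put_contents($path, $text, LOCK_EX);
}

// Made-up data standing in for the scraped results, keyed by letter.
$by_letter = [
    'A' => ['Aaliyah', 'Aaron Eckhart'],
    'B' => ['Barbara Example'],
];

// Assumption: write into the system temp directory; use your own path in practice.
$out_dir = sys_get_temp_dir();
foreach ($by_letter as $letter => $names) {
    if (write_names_file($out_dir, $letter, $names) === false) {
        echo "Could not write $letter.txt" . PHP_EOL;
    }
}
```

In the A-to-Z loop, the same call could sit inside the foreach, passing $letter and that iteration's $arr[2], so each letter gets its own file instead of one merged array.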
Vlearns Author Commented:
Thanks!