Extracting links from a web page

Posted on 2004-11-24
Last Modified: 2006-11-17
I have a series of links to content that I want to format for my site.  I am removing the graphics from the pages and then extracting just the links on each page so I can display them the way I want.

Two issues, though:

It seems that a cookie or session value is set when the top level page is displayed.  This allows the user to access the content without having to log in every time.  I know that I can get the settings from the content provider and just re-establish the setting, but I would like to find a more elegant way if possible.  Right now when I access the top level page, as you can see below, the URL parameters carry the authorization necessary to bypass the login.

If you click on the link below that I'm using in my variable $link, you'll see that you can then click on the links on that page and access the content without having to log in.

With the code below I can extract the links that I want, with two exceptions.  First, I also get links that point to either a PDF or HTML file with additional content points.  (I want to place these separately from the other links.)
Second, I also get a couple of footer links that I can't seem to get rid of.

You'll notice that when you view the page from the link below, you won't see one of the offending footer links, (home and privacy statement), but you'll see that the others are missing.

Here is my code:

$link            =      file_get_contents('');
$pattern            =      '(<img\ssrc=+".*>|<IMG\sSRC=+".*>)';
$replacement      =      '';
$link2            =      preg_replace($pattern, $replacement, $link);
$link2            =      str_replace('../', '', $link2);
$link2            =      str_replace('<a href="/courses/', '<a href="', $link2);

// Match all the links on the page
$pattern      =      '(<a\shref=+".*>)';
preg_match_all($pattern, $link2, $matches);
//echo $link2;

foreach($matches[0] as $links){
      print $links.'<br>';
}

Currently when you click the link above and then click one of the links on that page you will go to a page that contains the video and the content from the pdf or html link and the header and footer that I'm wanting to get rid of so I can format the content to fit my site.  

What I want to have at this point is the links that show up on the page by themselves, and when clicked a video that plays by itself.
Question by:alexhogan
    LVL 25

    Accepted Solution

    Firstly, you don't need to strip the images to avoid the image links - image urls are in src attributes, not hrefs, so you can use that to skip them.

    Some of your replacements are likely to do odd things too. For example, a link to ../../index.php would get ../ substituted twice by your code, giving a bad URL. You might do well to take a look at the realpath() and parse_url() functions and use their output to rebuild the links in the shape you want them (bear in mind that realpath() only resolves paths that exist on the local filesystem, so for remote pages parse_url() is the one to lean on).
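
As a rough illustration of rebuilding a link from parse_url() output instead of using str_replace(), here is a minimal sketch. The function name resolve_url() and the example.com addresses are made up for the example, and it ignores query strings, fragments and trailing slashes, so treat it as a starting point rather than a complete resolver.

```php
<?php
// Hypothetical helper (not a PHP built-in): resolve a relative link
// against the URL of the page it was found on.
function resolve_url(string $base, string $relative): string
{
    // Links that already carry a scheme (http:, https:, ...) are left alone.
    if (!empty(parse_url($relative, PHP_URL_SCHEME))) {
        return $relative;
    }

    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    if (strncmp($relative, '/', 1) === 0) {
        // Root-relative link: use it as the whole path.
        $path = $relative;
    } else {
        // Otherwise start from the directory of the base page.
        $basePath = $parts['path'] ?? '/';
        $dir      = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path     = $dir . $relative;
    }

    // Collapse "." and ".." segments one at a time.
    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }

    return $origin . '/' . implode('/', $segments);
}

// Assumed example page; the real base URL is not shown in the question.
$base = 'http://example.com/courses/unit1/index.html';
echo resolve_url($base, '../../index.php'), "\n";
echo resolve_url($base, 'lesson1.html'), "\n";
```

Because each ".." removes exactly one directory, a link such as ../../index.php climbs two levels from the page's directory instead of simply having the dots deleted.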

    You can add the i modifier to make preg patterns case insensitive, e.g. /img/i will also match IMG.

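    Note that patterns used with preg functions must be enclosed in delimiters, conventionally /.../. The outer parentheses in your patterns happen to be accepted as delimiters, but they are easy to misread as a capturing group.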
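
To make the points about modifiers and delimiters concrete, here is a small sketch. The $html string is invented for the example; only preg_match_all() and the i modifier come from PHP itself.

```php
<?php
// Invented sample markup with one upper-case and one lower-case tag.
$html = '<IMG SRC="logo.gif"><img src="photo.jpg">';

// Without the i modifier only the lower-case tag is counted.
$caseSensitive = preg_match_all('/<img\s/', $html);

// With the i modifier both spellings are counted.
$caseInsensitive = preg_match_all('/<img\s/i', $html);

// A different delimiter (#) works the same way and is handy when the
// pattern itself contains slashes.
$withHash = preg_match_all('#<img\s#i', $html);

printf("%d %d %d\n", $caseSensitive, $caseInsensitive, $withHash); // prints "1 2 2"
```

With the i modifier there is no need to write the pattern twice for img and IMG.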

    This pattern looks wrong:

    $pattern     =     '(<a\shref=+".*>)';

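    The + in there suggests that you're likely to encounter more than one = char! You're also matching too much if you just want to grab the link URL itself, because .* is greedy and will run on to the last matching character on the line. This should work better:

    $pattern = '/<a\s+href\s*=\s*"([^"]*)"/i';

    The [^"]* stops at the closing quote of the href, and the captured URLs end up in $matches[1].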
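
As a quick check of how a capturing group behaves with preg_match_all(), here is a sketch on an invented snippet of HTML. It uses [^"]* for the URL so that two links on one line are not merged into a single match; the file names are placeholders.

```php
<?php
// Invented sample: two links on the same line, in different letter case.
$html = '<a href="intro.html">Intro</a> <A HREF="notes.pdf" class="doc">Notes</A>';

// [^"]* stops at the first closing quote; the i modifier covers <A HREF=...>.
preg_match_all('/<a\s+href\s*=\s*"([^"]*)"/i', $html, $matches);

// $matches[0] holds the whole matched text, $matches[1] only the captured URLs.
foreach ($matches[1] as $url) {
    echo $url, "\n";
}
```

Looping over $matches[1] rather than $matches[0] gives the bare URLs, which is what any later filtering by file extension needs.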

    If you want to skip links to particular resource types, just filter them before output:

    foreach ($matches[1] as $link) {   // $matches[1] holds the captured URLs
      if (preg_match('/\.pdf$/i', $link))
         $pdflinks[] = $link;
      elseif (preg_match('/\.html$/i', $link))
         $weblinks[] = $link;
      else
         $otherlinks[] = $link;
    }

    After this you'll end up with 3 arrays of links to PDFs, HTML pages and other stuff.
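
For comparison, the same three-way split can be sketched with PHP's DOM extension instead of regular expressions. This is a different technique from the one described above, not a correction of it; the sample markup is invented, and the sketch assumes the dom extension is enabled.

```php
<?php
// Invented sample page; the real page from the question is not reproduced here.
$html = '<html><body>'
      . '<a href="video1.php?id=3">Video</a>'
      . '<a href="notes.pdf">Notes</a>'
      . '<a href="extra.html#top">Extra</a>'
      . '<img src="logo.gif">'
      . '</body></html>';

$doc = new DOMDocument();
// @ hides the warnings that untidy real-world markup tends to trigger.
@$doc->loadHTML($html);

$pdflinks = $weblinks = $otherlinks = [];

// Only <a> elements are visited, so <img src="..."> never shows up.
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');

    // Take the extension from the path only, ignoring ?query and #fragment.
    $path = (string) parse_url($href, PHP_URL_PATH);
    $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    if ($ext === 'pdf') {
        $pdflinks[] = $href;
    } elseif ($ext === 'html' || $ext === 'htm') {
        $weblinks[] = $href;
    } else {
        $otherlinks[] = $href;
    }
}

print_r($pdflinks);
print_r($weblinks);
print_r($otherlinks);
```

Checking the extension on the parsed path means a link like notes.pdf?x=1 would still be classed as a PDF, which a pattern anchored with $ would miss.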
    LVL 25

    Expert Comment

    Another observation: AUTH_LOGIN=TRUE isn't very secure is it!
    LVL 8

    Author Comment

    Another observation: AUTH_LOGIN=TRUE isn't very secure is it!

    Nope.., but that's what they gave me...
    LVL 8

    Author Comment

    If you want to skip links to particular resource types, just filter them before output:

    That makes some sense...

    Everything that they have uses relative URLs, so I'll have to identify the directory structure in the filter.
