

Extracting links from a web page

Posted on 2004-11-24
Last Modified: 2006-11-17
I have a series of links to content that I want to format for my site.  I am removing the graphics from the pages and then extracting just the links on each page so I can display them the way I want.

Two issues, though:

It seems that a cookie or session value is set when the top-level page is displayed.  This allows the user to access the content without having to log in every time.  I know that I can get the settings from the content provider and just re-establish them, but I would like to find a more elegant way if possible.  Right now, when I access the top-level page, as you can see below, the URL params carry the authorization necessary to bypass the login.

If you click on the link below that I'm using in my variable $link, you'll see that you can then click on the links on that page and access the content without having to log in.

With the code I have below I can extract the links that I want, with two exceptions.  First, I also get links that point to either a PDF or an HTML file with additional content points.  (I want to place these separately from the other links.)
Secondly, I also get a couple of footer links.  I can't seem to get rid of these.

You'll notice that when you view the page from the link below, you won't see one of the offending footer links (home and privacy statement), but you'll see that the others are missing.

Here is my code:

$link            =      file_get_contents('http://www.webedse.com/authorize/auth_wo_login.cfm?AUTH_LOGIN=TRUE&COMPANY_ID=1771&COURSE_ID=130');
$pattern            =      '(<img\ssrc=+".*>|<IMG\sSRC=+".*>)';
$replacement      =      '';
$link2            =      preg_replace($pattern, $replacement, $link);
$link2            =      str_replace('../', 'http://www.webedse.com/', $link2);
$link2            =      str_replace('<a href="/courses/', '<a href="http://www.webedse.com/courses/', $link2);

// Match all the links on the page
$pattern      =      '(<a\shref=+".*>)';
preg_match_all($pattern, $link2, $matches);
//echo $link2;

foreach($matches[0] as $links){
      print $links.'<br>';
}

Currently, when you click the link above and then click one of the links on that page, you go to a page that contains the video, the content from the PDF or HTML link, and the header and footer that I want to get rid of so I can format the content to fit my site.

What I want at this point is for the links to show up on the page by themselves and, when one is clicked, for the video to play by itself.
Question by:alexhogan

Accepted Solution

Marcus Bointon earned 2000 total points
ID: 12673412
Firstly, you don't need to strip the images to avoid the image links - image urls are in src attributes, not hrefs, so you can use that to skip them.

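Some of your replacements are likely to do odd things too. For example, a link to ../../index.php would get ../ substituted twice by your code, giving a bad URL. You might do well to take a look at the parse_url() function and use its output to rebuild the links in the shape you want them. realpath() does the kind of ../ collapsing you need, but note that it only works on paths that exist on the local filesystem, so for remote URLs you would have to do that step yourself.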
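
As a rough illustration of rebuilding links with parse_url(), here is a minimal sketch. The helper name resolve_url() and the sample paths are made up for this example; it assumes plain http(s) base URLs and does not try to cover every case in the URL RFCs.

```php
<?php
// Hypothetical helper (not from the original post): resolve a link
// against the URL of the page it was found on.
function resolve_url(string $base, string $rel): string
{
    // A link that already has a scheme is absolute: leave it alone.
    if (parse_url($rel, PHP_URL_SCHEME) !== null) {
        return $rel;
    }
    $b = parse_url($base);
    $root = $b['scheme'] . '://' . $b['host']
          . (isset($b['port']) ? ':' . $b['port'] : '');

    // Root-relative links replace the whole path; other links start
    // from the directory of the base page.
    if ($rel !== '' && $rel[0] === '/') {
        $path = $rel;
    } else {
        $basePath = $b['path'] ?? '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $rel;
    }

    // Set aside any query string or fragment so only the path is normalized.
    $tail = '';
    if (preg_match('/[?#].*$/', $path, $m)) {
        $tail = $m[0];
        $path = substr($path, 0, -strlen($tail));
    }

    // Collapse "." and ".." one segment at a time.
    $out = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '.' && $seg !== '') {
            $out[] = $seg;
        }
    }
    $trailing = (substr($path, -1) === '/' && $out) ? '/' : '';
    return $root . '/' . implode('/', $out) . $trailing . $tail;
}

// Example call with a made-up page path:
echo resolve_url('http://www.webedse.com/a/b/page.cfm', '../../index.php'), "\n";
```

Because each ".." removes exactly one directory, a link like ../../index.php resolves cleanly instead of being substituted twice.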

You can add the i modifier to make preg patterns case insensitive, e.g. /img/i will also match IMG.

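Note that patterns used with preg functions need delimiters, and / is the conventional choice. PHP is treating the outer parentheses in your patterns as the delimiters, which works but is easy to mistake for a capturing group.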

This pattern looks wrong:

$pattern     =     '(<a\shref=+".*>)';

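The + in there suggests that you're likely to encounter more than one = char! You're also matching too much if you just want to grab the link URL itself. This should work better:

$pattern = '/<a\s+href\s*=\s*"([^"]*)"/i';

Using [^"]* rather than .* stops the match at the closing quote, so two links on the same line are not swallowed as one. After preg_match_all(), the URLs themselves are in $matches[1].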
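
To see what ends up in each part of $matches, here is a small sketch run against made-up markup (the sample HTML and file names are assumptions, not taken from the real page). It uses a character class, [^"]*, so that each match stops at the closing quote of the href.

```php
<?php
// Sample markup invented for illustration; the real page will differ.
$html = '<IMG SRC="../images/logo.gif">'
      . '<a href="../courses/lesson1.cfm">Lesson 1</a> '
      . '<A HREF = "/courses/notes.pdf">Notes</A>';

// [^"]* stops at the closing quote, so two links on one line stay separate.
// The i modifier lets one pattern match both <a href> and <A HREF>.
$pattern = '/<a\s+href\s*=\s*"([^"]*)"/i';
preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full matched text, $matches[1] only the captured URLs.
$urls = $matches[1];
print_r($urls);
```

The image tag is never matched because its URL sits in a src attribute, so there is no need to strip images first.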

If you want to skip links to particular resource types, just filter them before output:

// $matches[1] holds the captured URLs (the part inside the quotes)
foreach ($matches[1] as $link) {
  if (preg_match('/\.pdf$/i', $link)) {
     $pdflinks[] = $link;
  } elseif (preg_match('/\.html?$/i', $link)) {
     $weblinks[] = $link;
  } else {
     $otherlinks[] = $link;
  }
}
After this you'll end up with 3 arrays of links to PDFs, HTML pages and other stuff.
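
If some of the links carry query strings or fragments, testing the end of the whole URL will miss them. One possible variation, sketched here with a hypothetical helper name (classify_links) and made-up URLs, looks only at the path part via parse_url() and pathinfo():

```php
<?php
// Hypothetical helper: group URLs by the file extension of their path.
function classify_links(array $urls): array
{
    $groups = ['pdf' => [], 'html' => [], 'other' => []];
    foreach ($urls as $url) {
        // Look only at the path so "?print=1" or "#top" cannot hide the extension.
        $path = (string) parse_url($url, PHP_URL_PATH);
        $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if ($ext === 'pdf') {
            $groups['pdf'][] = $url;
        } elseif ($ext === 'html' || $ext === 'htm') {
            $groups['html'][] = $url;
        } else {
            $groups['other'][] = $url;
        }
    }
    return $groups;
}

// Made-up URLs for illustration.
$groups = classify_links([
    '/courses/notes.pdf',
    '/courses/outline.html?print=1',
    '/courses/lesson1.cfm',
]);
print_r($groups);
```

Returning one array keyed by type keeps the three lists together, which may be easier to pass to whatever formats the output.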

Expert Comment

by:Marcus Bointon
ID: 12673416
Another observation: AUTH_LOGIN=TRUE isn't very secure is it!

Author Comment

ID: 12696237
Another observation: AUTH_LOGIN=TRUE isn't very secure is it!

Nope, but that's what they gave me...

Author Comment

ID: 12696281
If you want to skip links to particular resource types, just filter them before output:

That makes some sense...

Everything that they have uses relative URLs, so I'll have to identify the directory structure in the filter.
