Extracting links from a web page

I have a series of links to content that I want to format for my site.  I am removing the graphics that are in the pages and then just extracting the links on each of the pages to display as I want.


Two issues, though:

One...
It seems that a cookie or session value is set when the top-level page is displayed.  This allows the user to access the content without having to log in every time.  I know that I can get the settings from the content provider and just re-establish the setting, but I would like to find a more elegant way if possible.  Right now, when I access the top-level page, as you can see below, the URL params carry the authorization necessary to bypass the login.

If you click on the link below that I'm using in my variable $link, you'll see that you can then click on the links on that page and access the content without having to log in.
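
One possible way to carry the session forward, sketched under the assumption that the top-level page really does hand out its session through Set-Cookie headers (nothing in this thread confirms that): capture the cookies from the first response and send them back on later requests. The function names cookieHeaderFrom() and fetchWithCookies() are made up for this illustration, and the second URL in the usage comment is hypothetical.

```php
<?php
// Build a "Cookie:" header value from raw response header lines.
function cookieHeaderFrom(array $responseHeaders): string
{
    $pairs = [];
    foreach ($responseHeaders as $line) {
        // Keep only name=value; attributes such as path or expires are dropped.
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $line, $m)) {
            $pairs[$m[1]] = $m[1] . '=' . $m[2];
        }
    }
    return implode('; ', $pairs);
}

// Fetch a URL, optionally sending cookies; returns [body, response headers].
function fetchWithCookies(string $url, string $cookieHeader = ''): array
{
    $options = ['http' => ['method' => 'GET']];
    if ($cookieHeader !== '') {
        $options['http']['header'] = "Cookie: $cookieHeader\r\n";
    }
    $body = file_get_contents($url, false, stream_context_create($options));
    // The HTTP wrapper fills $http_response_header in the calling scope.
    return [$body, $http_response_header ?? []];
}

// Usage (not executed here; the second URL is hypothetical):
// [$top, $headers] = fetchWithCookies('http://www.webedse.com/authorize/auth_wo_login.cfm?AUTH_LOGIN=TRUE&COMPANY_ID=1771&COURSE_ID=130');
// [$page] = fetchWithCookies('http://www.webedse.com/courses/example.cfm', cookieHeaderFrom($headers));
```

If the site tracks the session some other way (for example a token in the URL), this sketch would not help and the provider's settings would still be needed.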

Two...
With the code I have below I can extract the links that I want, with two exceptions.  First, I also get links that point to either a pdf or html file with additional content points.  (I want to place these separately from the other links.)
Second, I also get a couple of footer links.  I can't seem to get rid of these.

You'll notice that when you view the page from the link below, you won't see one of the offending footer links, (home and privacy statement), but you'll see that the others are missing.



Here is my code:

$link = file_get_contents('http://www.webedse.com/authorize/auth_wo_login.cfm?AUTH_LOGIN=TRUE&COMPANY_ID=1771&COURSE_ID=130');
$pattern = '(<img\ssrc=+".*>|<IMG\sSRC=+".*>)';
$replacement = '';
$link2 = preg_replace($pattern, $replacement, $link);
$link2 = str_replace('../', 'http://www.webedse.com/', $link2);
$link2 = str_replace('<a href="/courses/', '<a href="http://www.webedse.com/courses/', $link2);


// Match all the links on the page
$pattern = '(<a\shref=+".*>)';
preg_match_all($pattern, $link2, $matches);
//echo $link2;

foreach($matches[0] as $links){
    print $links.'<br>';
}

Currently, when you click the link above and then click one of the links on that page, you go to a page that contains the video, the content from the pdf or html link, and the header and footer that I want to get rid of so I can format the content to fit my site.

What I want at this point is the links that show up on the page by themselves and, when one is clicked, a video that plays by itself.
Asked by alexhogan
Marcus Bointon commented:
Firstly, you don't need to strip the images to avoid the image links - image urls are in src attributes, not hrefs, so you can use that to skip them.
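
As a rough illustration of reading only href attributes, here is a sketch that uses PHP's DOM extension instead of a regular expression, so img tags and their src attributes never come into play. It assumes the DOM extension is available; the function name extractHrefs() is invented for this example.

```php
<?php
// Return the href value of every <a> element in an HTML string.
function extractHrefs(string $html): array
{
    // Real-world pages are rarely valid HTML; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }
    return $hrefs;
}
```

Called as extractHrefs($link), this would return the raw href values exactly as written in the page, relative paths included, so they would still need to be made absolute afterwards.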

Some of your replacements are likely to do odd things too. For example, a link to ../../index.php would get ../ substituted twice by your code, giving a bad URL. You might do well to take a look at the parse_url() function and use its output to rebuild the links in the shape you want them. (realpath() does the same kind of ../ clean-up, but only for paths that exist on the local filesystem, so it won't help with remote URLs.)
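
A minimal sketch of rebuilding links with parse_url(), assuming the base is the absolute URL of the page the link was found on. resolveUrl() is a name made up for this example, and it is deliberately simplified: protocol-relative links (//host/path) are not handled and a trailing slash on the path is dropped.

```php
<?php
// Turn a relative link into an absolute URL, given the page it came from.
function resolveUrl(string $base, string $relative): string
{
    // Leave absolute URLs (http:, https:, mailto:, ...) untouched.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $relative)) {
        return $relative;
    }

    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host']
           . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Set any query string or fragment aside so it is not split into segments.
    $suffix = '';
    if (preg_match('/[?#].*$/s', $relative, $m)) {
        $suffix   = $m[0];
        $relative = substr($relative, 0, strlen($relative) - strlen($suffix));
    }

    if (substr($relative, 0, 1) === '/') {
        $path = $relative;                       // root-relative link
    } else {
        // Directory of the base document, e.g. /authorize/ for /authorize/page.cfm
        $dir  = preg_replace('#/[^/]*$#', '/', $parts['path'] ?? '/');
        $path = $dir . $relative;
    }

    // Collapse "." and ".." segments.
    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '' && $segment !== '.') {
            $segments[] = $segment;
        }
    }

    return $root . '/' . implode('/', $segments) . $suffix;
}
```

With the page URL from the question as the base, ../../index.php comes out as http://www.webedse.com/index.php rather than a URL with the host name repeated.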

You can add the i modifier to make preg patterns case insensitive, e.g. /img/i will also match IMG.

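Note that patterns used with preg functions need delimiters, and slashes are the usual choice, e.g. /pattern/. The parentheses around your patterns are accepted as bracket-style delimiters, but they are easy to mistake for a capture group.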

This pattern looks wrong:

$pattern     =     '(<a\shref=+".*>)';

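The + in there suggests that you're likely to encounter more than one = char! You're also matching too much if you just want to grab the link URL itself. This should work better:

$pattern = '/<a\s+href\s*=\s*"([^"]*)"/i';

The captured URLs end up in $matches[1]. Using [^"]* makes the capture stop at the closing quote, whereas .* would run on to the last quote on the line.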
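
To see how the quantifier inside the capture group changes what preg_match_all() hands back, here is a small comparison on a made-up line containing two links. The sample HTML and the variable names $greedy and $bounded are invented for this illustration.

```php
<?php
// Two links on one line, as is common in generated pages.
$html = '<a href="one.html">One</a> <a href="two.pdf">Two</a>';

// Greedy .* runs to the LAST double quote on the line: one oversized capture.
preg_match_all('/<a\s+href\s*=\s*"(.*)"/i', $html, $greedy);

// [^"]* cannot cross a double quote: one clean capture per link.
preg_match_all('/<a\s+href\s*=\s*"([^"]*)"/i', $html, $bounded);

// Index 1 of the result holds the contents of the first capture group.
print_r($greedy[1]);
print_r($bounded[1]);
```

The first print_r shows a single entry that swallows everything between the first and last quotes, while the second shows one.html and two.pdf as separate entries.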

If you want to skip links to particular resource types, just filter them before output:

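$pdflinks = $weblinks = $otherlinks = [];
// $matches[1] holds the captured URLs; $matches[0] holds the whole matched text
foreach ($matches[1] as $link) {
  if (preg_match('/\.pdf$/i', $link))
     $pdflinks[] = $link;
  elseif (preg_match('/\.html$/i', $link))
     $weblinks[] = $link;
  else
    $otherlinks[] = $link;
}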

After this you'll end up with 3 arrays of links to PDFs, HTML pages and other stuff.

Marcus Bointon commented:
Another observation: AUTH_LOGIN=TRUE isn't very secure is it!
alexhogan (author) commented:
[snip]
Another observation: AUTH_LOGIN=TRUE isn't very secure is it!
[/snip]

Nope, but that's what they gave me...
alexhogan (author) commented:
[snip]
If you want to skip links to particular resource types, just filter them before output:
[/snip]

That makes some sense...

Everything that they have is in relative URLs, so I'll have to identify the directory structure in the filter.
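
A sketch of what a directory-aware filter could look like, assuming the links have already been made absolute and that the wanted content lives under /courses/ while footer links (home, privacy statement) live elsewhere. That layout is a guess based on the /courses/ replacement in the question, and classifyLinks() is a name invented for this example.

```php
<?php
// Sort URLs into groups by file type, skipping anything outside $keepPrefix.
function classifyLinks(array $urls, string $keepPrefix = '/courses/'): array
{
    $groups = ['pdf' => [], 'html' => [], 'other' => [], 'skipped' => []];

    foreach ($urls as $url) {
        // Look at the path only, so query strings do not hide the extension.
        $path = (string) parse_url($url, PHP_URL_PATH);

        // Anything outside the wanted directory (e.g. footer links) is set aside.
        if (strncmp($path, $keepPrefix, strlen($keepPrefix)) !== 0) {
            $groups['skipped'][] = $url;
            continue;
        }

        $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if ($extension === 'pdf') {
            $groups['pdf'][] = $url;
        } elseif ($extension === 'html' || $extension === 'htm') {
            $groups['html'][] = $url;
        } else {
            $groups['other'][] = $url;
        }
    }

    return $groups;
}
```

The prefix is a parameter so it can be adjusted once the provider's real directory structure is known.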
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
PHP

From novice to tech pro — start learning today.
