alexhogan

asked on

Extracting links from a web page

I have a series of links to content that I want to format for my site.  I am removing the graphics from the pages and then extracting just the links on each page to display as I want.


Two issues though..,

One...
It seems that there is a cookie or session value that is set when the top level page is displayed.  This allows the user to access the content without having to log in every time.  I know that I can get the settings from the content provider and just re-establish the setting, but I would like to find a more elegant way if possible.  Right now when I access the top level page, as you can see below, the URL parameters carry the authorization necessary to bypass the login.

If you click on the link below that I'm using in my variable $link, you'll see that you can then click on the links on that page and access the content without having to log in.
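
One possible way to carry the session forward without hard-coding the provider's settings is sketched here. It assumes the top level page sets ordinary HTTP cookies, which is not confirmed anywhere in this thread. The helper names cookieHeaderFromResponse and fetchWithCookies are made up for illustration; $http_response_header is the variable PHP's HTTP stream wrapper fills in after file_get_contents().

```php
<?php
// Hypothetical helper: build a "Cookie:" request header from the raw
// response headers of an earlier request (e.g. $http_response_header).
function cookieHeaderFromResponse(array $responseHeaders): string
{
    $pairs = [];
    foreach ($responseHeaders as $header) {
        // Keep only the name=value part of each Set-Cookie header;
        // attributes such as path or expires are ignored in this sketch.
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $header, $m)) {
            $pairs[$m[1]] = $m[1] . '=' . $m[2];
        }
    }
    return $pairs ? 'Cookie: ' . implode('; ', $pairs) : '';
}

// Hypothetical helper: fetch a URL, optionally sending a Cookie header.
// Returns [body, responseHeaders].
function fetchWithCookies(string $url, string $cookieHeader = ''): array
{
    $options = ['http' => ['method' => 'GET']];
    if ($cookieHeader !== '') {
        $options['http']['header'] = $cookieHeader;
    }
    $body = file_get_contents($url, false, stream_context_create($options));
    // The HTTP wrapper populates $http_response_header in this scope.
    return [$body, $http_response_header ?? []];
}

// Usage sketch (needs network access, so it is left commented out):
// [$top, $headers] = fetchWithCookies($topLevelUrl);
// [$page] = fetchWithCookies($contentUrl, cookieHeaderFromResponse($headers));
```

If the provider tracks the session some other way (for example through URL parameters only), this approach would not apply.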

Two...
With the code I have below I can extract the links that I want, with two exceptions.  First, I also get links that point to either a PDF or HTML file with additional content points.  (I want to place these separately from the other links.)
Second, I also get a couple of footer links.  I can't seem to get rid of these.

You'll notice that when you view the page from the link below, you won't see one of the offending footer links, (home and privacy statement), but you'll see that the others are missing.



Here is my code:

// Fetch the top level page; the URL parameters carry the authorization.
$link        = file_get_contents('http://www.webedse.com/authorize/auth_wo_login.cfm?AUTH_LOGIN=TRUE&COMPANY_ID=1771&COURSE_ID=130');

// Strip the image tags.
$pattern     = '(<img\ssrc=+".*>|<IMG\sSRC=+".*>)';
$replacement = '';
$link2       = preg_replace($pattern, $replacement, $link);

// Turn the relative links into absolute ones.
$link2       = str_replace('../', 'http://www.webedse.com/', $link2);
$link2       = str_replace('<a href="/courses/', '<a href="http://www.webedse.com/courses/', $link2);

// Match all the links on the page
$pattern     = '(<a\shref=+".*>)';
preg_match_all($pattern, $link2, $matches);
//echo $link2;

foreach ($matches[0] as $links) {
    print $links.'<br>';
}

Currently, when you click the link above and then click one of the links on that page, you go to a page that contains the video, the content from the PDF or HTML link, and the header and footer that I want to get rid of so I can format the content to fit my site.

What I want at this point is the links that show up on the page by themselves and, when one is clicked, a video that plays by itself.
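
A note on the patterns in the code above: `.*` is greedy, so each match runs to the last `>` on its line and can take neighbouring markup with it. A minimal sketch of a narrower pattern follows, assuming the hrefs are double-quoted as in the code above. The function name extractLinks is hypothetical, and a regex remains a rough tool for HTML.

```php
<?php
// Hypothetical helper: collect each anchor as ['href' => ..., 'text' => ...].
function extractLinks(string $html): array
{
    // [^>]* stops at the end of the opening tag and .*? stops at the first
    // </a>, so one match never swallows the rest of the line.
    $pattern = '~<a\s[^>]*?href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>~is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        $links[] = [
            'href' => trim($m[1]),
            // strip_tags() drops any <img> nested inside the link.
            'text' => trim(strip_tags($m[2])),
        ];
    }
    return $links;
}

// Usage: print each link on its own line.
// foreach (extractLinks($link) as $l) {
//     print '<a href="' . htmlspecialchars($l['href']) . '">'
//         . htmlspecialchars($l['text']) . '</a><br>';
// }
```

Because the link text is taken separately from the tag, the image-stripping preg_replace step is no longer needed for links handled this way.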
ASKER CERTIFIED SOLUTION
Marcus Bointon
Another observation: AUTH_LOGIN=TRUE isn't very secure is it!
alexhogan

ASKER

[snip]
Another observation: AUTH_LOGIN=TRUE isn't very secure is it!
[/snip]

Nope.., but that's what they gave me...
[snip]
If you want to skip links to particular resource types, just filter them before output:
[/snip]

That makes some sense...
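
The code that followed the quoted sentence is not visible in this thread, so the following is only a guess at what such a filter could look like, not the accepted solution. It splits hrefs by file extension so the PDF and HTML links can be shown apart from the rest; splitByType and its default extension list are assumptions.

```php
<?php
// Hypothetical helper: split hrefs into "documents" (.pdf, .htm, .html)
// and everything else, based on the extension of the URL path.
function splitByType(array $hrefs, array $docExtensions = ['pdf', 'htm', 'html']): array
{
    $groups = ['documents' => [], 'other' => []];
    foreach ($hrefs as $href) {
        // Taking only the path component means "x.pdf?id=1" still counts.
        $path = (string) parse_url($href, PHP_URL_PATH);
        $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        $key  = in_array($ext, $docExtensions, true) ? 'documents' : 'other';
        $groups[$key][] = $href;
    }
    return $groups;
}

// Usage:
// $groups = splitByType(['/courses/lesson1.cfm', '../docs/notes.pdf']);
// $groups['documents'] holds the PDF/HTML links, $groups['other'] the rest.
```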

Everything that they have is in relative URLs, so I'll have to identify the directory structure in the filter.
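
A rough sketch of that idea: resolve each relative href against the page it came from, then keep only the URLs under a given directory (which would also drop footer links that live elsewhere). Both function names are hypothetical, the /courses/ directory is taken from the code in the question, and the resolver only covers absolute URLs, root-relative paths, "./" and "../"; protocol-relative and fragment-only hrefs are not handled.

```php
<?php
// Hypothetical helper: resolve a relative href against the URL of the page
// it came from.
function resolveUrl(string $base, string $href): string
{
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href)) {
        return $href; // already absolute
    }
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    if ($href !== '' && $href[0] === '/') {
        $path = $href; // root-relative
    } else {
        // Relative to the directory of the base page.
        $basePath = $parts['path'] ?? '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;
    }
    // Split off the query string so dot segments are resolved in the path only.
    [$path, $query] = array_pad(explode('?', $path, 2), 2, null);

    $segments = [];
    foreach (explode('/', ltrim($path, '/')) as $segment) {
        if ($segment === '..') {
            array_pop($segments); // step up one directory
        } elseif ($segment !== '.') {
            $segments[] = $segment;
        }
    }
    return $origin . '/' . implode('/', $segments)
         . ($query !== null ? '?' . $query : '');
}

// Hypothetical filter: true when the URL's path starts with $directory.
function inDirectory(string $url, string $directory): bool
{
    return strpos((string) parse_url($url, PHP_URL_PATH), $directory) === 0;
}

// Usage:
// $url = resolveUrl($pageUrl, '../courses/lesson.cfm');
// if (inDirectory($url, '/courses/')) { /* keep the link */ }
```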