• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 271

PHP link checker

I need to run a link checker on about 2000 links. How would I check that many links?
I am trying to check whether the links are active (i.e. not returning a 404 error or something like that). I think the fopen() function is used for that, but I'm not sure...


Any help is appreciated.
Asked by: noobie
1 Solution
 
VGRCommented:
Easy:
1) loop over your X links; say $links[$i] is the current one
2) access URI $links[$i] and check the result for a 404 or any other error
3) remember $i if the URI is bad, otherwise do nothing
4) continue the loop

Something like this:
<?php
// inits
$bad = array();
// loop through $links[] (filled in beforehand by you; 0-indexed via $links[] = ...)
for ($i = 0; $i < count($links); $i++) {
  // try to access that link
  $isgood = CheckURI($links[$i]);
  // remember the index if the link is bad
  if (!$isgood) $bad[] = $i;
}
// display bad links
$badlinks = count($bad);
for ($i = 0; $i < $badlinks; $i++) echo "bad link '".$links[$bad[$i]]."' (index=$i)<BR>";
// done

function CheckURI($parurl) {
  // inits
  $result = TRUE;
  $tobec = TRUE; // "to be continued" flag for the read loop
  // try to open the URI
  $fd = @fopen($parurl, "r");
  if ($fd) { // page found
    while (!feof($fd) && $tobec) {
      $ligne = fgets($fd, 4096);
      // stop as soon as a 404 marker is encountered in the body
      if (strpos($ligne, '[404] Not Found') !== false) $tobec = FALSE;
      $contents[] = $ligne;
    } // blocking read loop
    fclose($fd);
    if ($tobec) { // file read entirely: OK
      // nothing to do, $result is TRUE already; this block is here in case
      // you want to log something like "last date at which the URI was found OK"
    } else { // we stopped before the end: a 404 was found
      $result = FALSE;
    }
  } else { // page not found
    $result = FALSE;
  }
  return $result;
} // CheckURI boolean function
?>
 
noobieAuthor Commented:
So how would this script work?
What do I have to do? Create a data file?
 
HatembenCommented:
Are your links in a database or a text file?
 
noobieAuthor Commented:
Well, the links are in this format:
filename.php?go=Download&id=1
........
filename.php?go=Download&id=9999

First... they skip numbers.
Second... I want to generate the links (all of the ids are in a database).
Third... I want to check whether they are active (i.e. whether they return 404 errors).

Thanks a lot.
Anyone that helps me complete this gets 500 points.
 
VGRCommented:
Just do this at the beginning of the script (not tested, by the way):

$links=array();
$links[]='http://www.netscape.com';
$links[]='http://www.badlink.zob';
$links[]='http://www.experts-exchange.com';

and you'll see...

You just have to get your links into an array called $links (how surprising :/ ) and test the script... :/ If they live in a database, see the sketch below.
 
noobieAuthor Commented:
Wait, so I have to do:
$links=array();
$links[]='http://www.mydomain.com';

?
And it will list all of the links on the site? (There are many pages... for example filename.php?page=1-20.)
 
VGRCommented:
Well noobie, you wrote "I need to run a link checker on about 2000 links. How would I check that many links?" so I supposed that you had this list of links :/

Don't you?

Call this list $links[] and my code will become crystal clear ;-)

In a word: yes, do

<?php
$links = array();
$links[] = 'http://www.netscape.com';
$links[] = 'http://www.badlink.zob';
$links[] = 'http://www.experts-exchange.com';

// ... followed by the same loop and the same CheckURI() function as above
?>

I don't guarantee it's typo-free or error-free, but it's at least 85% of what you'll need in the end.
 
VGRCommented:
OK, I TESTED IT AND IT WORKS

I had some typos and minor errors (things forgotten).


So now the code is:
<?php
$links = array();
$links[1] = 'http://www.netscape.com';
$links[2] = 'http://www.badlink.zob';
$links[3] = 'http://www.experts-exchange.com';

// test
$DEBUGTEST = 1;
if ($DEBUGTEST == 1) echo count($links)." links in input<BR>";
//
// inits
$badlinks = 0;
$bad = array();
// loop through $links[] (filled in beforehand by you; 1-indexed here)
for ($i = 1; $i <= count($links); $i++) {
  // try to access that link
  $isgood = CheckURI($links[$i]);
  if ($DEBUGTEST == 1) echo "link $i '".$links[$i]."' is ".(($isgood) ? 'OK' : 'KO')."<BR>";
  // remember the index if the link is bad
  if (!$isgood) $bad[] = $i;
}
// display bad links
$badlinks = count($bad);
// test
if ($DEBUGTEST == 1) echo "$badlinks bad links found<BR>";
//
for ($i = 0; $i < $badlinks; $i++) echo "bad link '".$links[$bad[$i]]."' (index=$i)<BR>";
// done

function CheckURI($parurl) {
  // inits
  $result = TRUE;
  $tobec = TRUE; // "to be continued" flag for the read loop
  // try to open the URI
  $fd = @fopen($parurl, "r");
  if ($fd) { // page found
    while (!feof($fd) && $tobec) {
      $ligne = fgets($fd, 4096);
      // stop as soon as a 404 marker is encountered in the body
      // (note: we could stop after the first few lines; a '404' message
      // won't show up at line 345)
      if (strpos($ligne, '[404] Not Found') !== false) $tobec = FALSE;
      $contents[] = $ligne;
    } // blocking read loop
    fclose($fd);
    if ($tobec) { // file read entirely: OK
      // nothing to do, $result is TRUE already; this block is here in case
      // you want to log something like "last date at which the URI was found OK"
    } else { // we stopped before the end: a 404 was found
      $result = FALSE;
    }
  } else { // page not found
    $result = FALSE;
  }
  return $result;
} // CheckURI boolean function
?>

and it produces (correctly):
3 links in input
link 1 'http://www.netscape.com' is OK
link 2 'http://www.badlink.zob' is KO
link 3 'http://www.experts-exchange.com' is OK
1 bad links found
bad link 'http://www.badlink.zob' (index=0)

Just set $DEBUGTEST=0 and your code will behave as expected by you.
 
noobieAuthor Commented:
The script works, but I want to check all of the links that are associated with the site...
If I put in yahoo.com, I want it to check the site's entire site map! All of the links the page links to, and all of the pages those linked pages link to.

Later.
 
VGRCommented:
That's not at all what your original question was about...

... anyway, it's feasible (same CheckURI calls), but only after having read the page and CheckURI-ed every link encountered in it.

I'll let you build this loop, given that it's a different question. I even suggest you ask a new one, because I fairly answered your original one.

I would do this (see the rough sketch below):
- for each URL in the original sites' list,
- check it using the technique above, BUT
- modify CheckURI so that it recursively checks all URIs encountered in the currently-being-checked page;
- you have to provide an external constant "maximum depth" to stop the recursion;
- you have to parse the $contents[] array for tags (A HREF, IMG, FORM ACTION=, etc.); it's a lot of work: build a local array, then loop through it and call the same function again recursively.

Feasible, but time-consuming if you go deeper than the first level (i.e., verifying the sites and their immediate links, not the links of linked pages).
 
Morph007x2bCommented:
You could try one of those free link harvesters :) Search Google: http://www.google.com/search?q=Link+Harvestor
 
snoyes_jwCommented:
No comment has been added to this question in more than 21 days, so it is now classified as abandoned.

I will leave the following recommendation for this question in the Cleanup topic area:
    Accept: VGR {http:#8144259}

Any objections should be posted here in the next 4 days. After that time, the question will be closed.

snoyes_jw
EE Cleanup Volunteer
