Validate URLs?

www.centralasiacommerce.com

This is a business directory that lists links to various sites. Some links were added a while back. Is there a tool that can scan the site and validate that all the URLs are still valid and working?

Plus: Can such tools verify that the company URL is unchanged, i.e. detect whether the company is now redirecting to a new domain?
Asked by: sandshakimi

Daniel Wilson commented:
Google for "URL validator" and you will find a bunch of answers. This one looks like it may do what you want.

commons.apache.org/proper/commons-validator/apidocs/org/apache/commons/validator/UrlValidator.html

The trick may be limiting the depth of validation more than getting it to validate in the first place.

I'd also see what dmoz.org uses for the purpose. If I had to guess, I'd say they've used / created / enhanced an open source tool for it.

Ray Paseur commented:
This is a question with a lot of "latitude" in the answer. Please see the PHP answer to FILTER_VALIDATE_URL on this page.
http://php.net/manual/en/filter.filters.validate.php
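
As a minimal sketch of that check: `filter_var()` with `FILTER_VALIDATE_URL` only tells you whether a string is shaped like a URL. It sends no request, so a mistyped but well-formed address such as the http://jwww.query.com/ entry in the directory still passes. The helper name `is_valid_url_syntax` is made up for illustration.

```php
<?php
// Hypothetical helper: TRUE when the string is a syntactically valid URL.
// This is a format check only; nothing is requested from the site.
function is_valid_url_syntax(string $url): bool
{
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

var_dump(is_valid_url_syntax('http://www.famfamfam.com/')); // well-formed
var_dump(is_valid_url_syntax('http://jwww.query.com/'));    // well-formed, even if the host is a typo
var_dump(is_valid_url_syntax('not a url'));                 // rejected: no scheme or host
```

So this is useful as a first filter before spending time on network requests, not as a replacement for them.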

To scan the site, you might want to consider using a screen scraper. Any experienced PHP programmer can write one for you. I'll give it a try as time permits and post back here if I can get a good result. But I'm not optimistic. When I clicked the Tajikistan link in the header it seemed to take forever to get a response.

To detect what has changed and what has not, you need to have your own database of baseline and current URLs.
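
A rough sketch of that comparison, assuming the baseline is kept as a simple array that maps each listed URL to the final URL seen on an earlier scan. The array shape, the example.* addresses, and the function names `host_of` and `changed_domains` are all assumptions for illustration; a real baseline would more likely live in a database table.

```php
<?php
// Return the lower-cased host part of a URL, or '' if there is none.
function host_of(string $url): string
{
    return strtolower((string) parse_url($url, PHP_URL_HOST));
}

// Compare an earlier scan with the current one and return the listings
// whose final host differs. Both arrays map listed URL => final URL.
function changed_domains(array $baseline, array $current): array
{
    $changed = [];
    foreach ($current as $listed => $final) {
        if (!isset($baseline[$listed])) {
            continue; // new listing, nothing to compare against
        }
        if (host_of($baseline[$listed]) !== host_of($final)) {
            $changed[$listed] = $final;
        }
    }
    return $changed;
}

// Made-up data: the second company now lands on a different domain.
$baseline = [
    'http://www.example.com/' => 'http://www.example.com/',
    'http://www.example.org/' => 'http://www.example.org/',
];
$current = [
    'http://www.example.com/' => 'http://www.example.com/index.html',
    'http://www.example.org/' => 'http://www.example.net/home',
];

print_r(changed_domains($baseline, $current));
```

Comparing hosts rather than whole URLs keeps harmless path changes out of the report.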

You may also want to apply some "human intelligence" to this project.  For example, see http://www.famfamfam.com/ which is listed under web development.  The link works, but the site itself appears to have entered the steady state in 2006.  A lot has changed since then!

Julian Hansen commented:
If I understand the question correctly, you want a tool that reads that page and:
a) Checks each URL to see if it is still valid, i.e. if a request to the URL returns an error (such as 404), reports it as such.
b) Reports which URLs issue a redirect.

The page links to other pages, so you would need a crawler that extracts the links from each page, restricted to a particular container. On the main page that would be id=maindirectorycontainer.
On the child pages the results are in rows of a table that spans other elements of the page, so you might pick up URLs that are not directory URLs.
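
One way to restrict extraction to a container is PHP's DOM extension with an XPath query. This is only a sketch: the markup below is invented, `links_in_container` is a hypothetical helper, and it assumes the id contains no double quotes.

```php
<?php
// Return href/title pairs for every <a href> inside the element with the given id.
function links_in_container(string $html, string $containerId): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects empty input
    }
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = sprintf('//*[@id="%s"]//a[@href]', $containerId);

    $links = [];
    foreach ($xpath->query($query) as $a) {
        $links[] = [
            'href'  => $a->getAttribute('href'),
            'title' => $a->getAttribute('title'),
        ];
    }
    return $links;
}

// Invented markup: only the first link sits inside the directory container.
$html = '<div id="maindirectorycontainer">'
      . '<a href="/tajikistan" title="Tajikistan">Tajikistan</a></div>'
      . '<div id="footer"><a href="/about">About</a></div>';

print_r(links_in_container($html, 'maindirectorycontainer'));
```

For the child pages the same idea applies with a different XPath expression, but you would still need to filter out links that are not directory entries.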

You would then use something like cURL to check each link ... etc
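
Something along these lines could cover both a) and b). It is a sketch, not a finished checker: `classify_link` and `check_link` are hypothetical names, the 10-second timeout and the limit of 5 redirects are arbitrary, and some servers reject HEAD requests, in which case you would have to fall back to GET.

```php
<?php
// Turn the raw cURL results into a simple status label.
function classify_link(int $errno, int $httpCode, string $requested, string $effective): string
{
    if ($errno !== 0) {
        return 'unreachable';               // DNS failure, timeout, refused connection...
    }
    if ($httpCode >= 400) {
        return 'broken';                    // 404, 410, 500...
    }
    $from = strtolower((string) parse_url($requested, PHP_URL_HOST));
    $to   = strtolower((string) parse_url($effective, PHP_URL_HOST));
    if ($from !== $to) {
        return 'redirected to another host';
    }
    return 'ok';
}

// Request one URL, follow redirects, and classify the outcome.
function check_link(string $url, int $timeout = 10): string
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY,         true);   // HEAD request, skip the body
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_MAXREDIRS,      5);
    curl_setopt($curl, CURLOPT_TIMEOUT,        $timeout);
    curl_exec($curl);

    $errno     = curl_errno($curl);
    $code      = (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
    $effective = (string) curl_getinfo($curl, CURLINFO_EFFECTIVE_URL); // URL after redirects

    return classify_link($errno, $code, $url, $effective);
}

// Usage (needs network access and the cURL extension):
// echo check_link('http://www.famfamfam.com/'), PHP_EOL;
```

Note that a redirect from example.com to www.example.com would also be reported as a host change here, so you may want to normalise hosts before comparing.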

Or you could just use an existing tool like this
https://validator.w3.org/checklink

There are other freemium sites out there that do the same thing. If you google "broken link checker" you will find them.

Ray Paseur commented:
This script works, "sort of." The issue appears to be that the site is occasionally very slow to respond. You can sometimes fix that problem by raising the cURL timeout, but when you do that, the script may run much longer before it dies. So you might want to adopt a strategy of getting all of the URLs from the web site first, storing them somewhere, then using your stored list to check the links. That is maybe not as true an evaluation as a real-time acquisition of the links, but it might work more reliably.
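
The two-pass idea could look roughly like this. The JSON file in the temp directory and the helper names `save_links` and `load_links` are assumptions made for the sketch; a database table would do just as well.

```php
<?php
// Pass 1: store the de-duplicated list of URLs scraped from the directory.
function save_links(array $links, string $file): void
{
    $unique = array_values(array_unique($links));
    file_put_contents($file, json_encode($unique, JSON_PRETTY_PRINT));
}

// Pass 2: read the stored list back; an unreadable or missing file yields [].
function load_links(string $file): array
{
    if (!is_file($file)) {
        return [];
    }
    $links = json_decode(file_get_contents($file), true);
    return is_array($links) ? $links : [];
}

// Hypothetical storage location.
$file = sys_get_temp_dir() . '/directory_links.json';

save_links([
    'http://www.famfamfam.com/',
    'http://jwww.query.com/',
    'http://www.famfamfam.com/',   // duplicate, stored once
], $file);

foreach (load_links($file) as $url) {
    echo $url, PHP_EOL;            // check each stored URL here, in a separate run
}
```

Because the second pass works from the file, it can be re-run or resumed after a timeout without scraping the slow directory pages again.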

Also, the directory contains entries like this one. The URL is obviously wrong, but I don't know how you would want to handle it.
<a href="http://jwww.query.com/" target='_new' title="jQuery">
jQuery</a> - JavaScript library with capabilities to make HTML document traversal and manipulation, event handling, animation, and Ajax functionality</p>

Here is the sort of thing you'll get from running this script. [screen shot]
<?php // demo/temp_sandshakimi.php

/**
 * http://www.experts-exchange.com/questions/28708309/Validate-URLS.html
 *
 * http://curl.haxx.se/libcurl/c/libcurl-errors.html
 */
error_reporting(E_ALL);

Class Page_Response_Object
{
    public $href, $title, $http_code, $errno, $info, $urls, $document;
    public function __construct($href, $title)
    {
        // AVOID TIMEOUT FOR LONG RUNNING SCRIPT
        set_time_limit(10);

        $this->href  = $href;
        $this->title = $title;
        $this->urls  = [];
        if (!$this->my_curl($href))
        {
            // ACTIVATE THIS TO SEE THE ERRORS AS THEY OCCUR
            // trigger_error("Errno: $this->errno; HTTP: $this->http_code; URL: $this->href", E_USER_WARNING);
        }
    }
    public function tidy_up()
    {
        // IF NO ERRORS, TIDY UP EXTRANEOUS INFORMATION
        if ($this->http_code == 200) unset($this->document);
        if ($this->http_code == 200) unset($this->info);
    }
    public function find_urls()
    {
        $rgx_href
        = '#'          // REGEX DELIMITER
        . 'href="'     // SIGNAL STRING
        . '(.*?)'      // CAPTURE GROUP
        . '"'          // SIGNAL STRING
        . '#'
        ;
        $rgx_title
        = '#'          // REGEX DELIMITER
        . 'title="'    // SIGNAL STRING
        . '(.*?)'      // CAPTURE GROUP
        . '"'          // SIGNAL STRING
        . '#'
        ;

        $doc = $this->document;
        if (!is_string($doc)) return;   // REQUEST FAILED, NOTHING TO PARSE
        $doc = preg_replace('/\s+/', ' ', $doc);   // COLLAPSE RUNS OF WHITESPACE
        $sig = '<!-- START Directory -->';
        $arr = explode($sig, $doc);
        if (!isset($arr[1])) return;    // NO DIRECTORY SECTION ON THIS PAGE
        $doc = $arr[1];
        $sig = '<!-- END Directory -->';
        $arr = explode($sig, $doc);
        $doc = $arr[0];
        $doc = strip_tags($doc, '<a>');

        $rgx = '#' . preg_quote('<a href=') . '.*?' . preg_quote('>') . '#';
        preg_match_all($rgx, $doc, $mat);
        foreach ($mat[0] as $atag)
        {
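            // SKIP ANCHORS WITHOUT A DOUBLE-QUOTED HREF
            if (!preg_match($rgx_href, $atag, $amat)) continue;
            $url = $amat[1];

            // THE TITLE ATTRIBUTE IS OPTIONAL
            $title = preg_match($rgx_title, $atag, $amat) ? $amat[1] : '';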

            $pro = new Page_Response_Object($url, $title);
            $pro->tidy_up();
            $this->urls[] = $pro;
        }
    }
    protected function my_curl($url, $timeout=3)
    {
        $curl = curl_init();

        // HEADERS AND OPTIONS APPEAR TO BE A FIREFOX BROWSER REFERRED BY GOOGLE
        $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] = "Cache-Control: max-age=0";
        $header[] = "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // BROWSERS USUALLY LEAVE THIS BLANK

        // SET THE CURL OPTIONS - SEE http://php.net/manual/en/function.curl-setopt.php
        curl_setopt( $curl, CURLOPT_URL,            $url  );
        curl_setopt( $curl, CURLOPT_USERAGENT,      'Mozilla/5.0 (Windows NT 6.1; rv:22.0) Gecko/20100101 Firefox/22.0'  );
        curl_setopt( $curl, CURLOPT_HTTPHEADER,     $header  );
        curl_setopt( $curl, CURLOPT_REFERER,        'http://www.google.com'  );
        curl_setopt( $curl, CURLOPT_ENCODING,       'gzip,deflate'  );
        curl_setopt( $curl, CURLOPT_AUTOREFERER,    TRUE  );
        curl_setopt( $curl, CURLOPT_RETURNTRANSFER, TRUE  );
        curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, TRUE  );
        curl_setopt( $curl, CURLOPT_TIMEOUT,        $timeout  );
        curl_setopt( $curl, CURLOPT_VERBOSE,        TRUE   );
        curl_setopt( $curl, CURLOPT_FAILONERROR,    TRUE   );

        // IF USING SSL, THIS INFORMATION MAY BE IMPORTANT
        // http://php.net/manual/en/function.curl-setopt.php#110457
        // http://php.net/manual/en/function.curl-setopt.php#115993
        // http://php.net/manual/en/function.curl-setopt.php#113754
        // REDACTED IN 2015 curl_setopt( $curl, CURLOPT_SSLVERSION, 3 );
        curl_setopt( $curl, CURLOPT_SSL_VERIFYHOST, FALSE  );
        curl_setopt( $curl, CURLOPT_SSL_VERIFYPEER, FALSE  );


        // RUN THE CURL REQUEST AND GET THE RESULTS
        $this->document  = curl_exec($curl);
        $this->errno     = curl_errno($curl);
        $this->info      = curl_getinfo($curl);
        $this->http_code = $this->info['http_code'];
        curl_close($curl);

        // RETURN DOCUMENT SUCCESS SIGNAL
        return $this->document;
    }
}

echo '<pre>';

$url = 'http://www.centralasiacommerce.com/';
$htm = file_get_contents($url);
if ($htm === false) die("Unable to read $url");

// ISOLATE THE URLS AND TITLES INTO AN ARRAY OF OBJECTS
$links = [];
$rgx_href
= '#'          // REGEX DELIMITER
. 'href="'     // SIGNAL STRING
. '(.*?)'      // CAPTURE GROUP
. '"'          // SIGNAL STRING
. '#'
;
$rgx_title
= '#'          // REGEX DELIMITER
. 'title="'    // SIGNAL STRING
. '(.*?)'      // CAPTURE GROUP
. '"'          // SIGNAL STRING
. '#'
;

$sig = '<ul class="directorycolumns">';
$arr = explode($sig, $htm);
unset($arr[0]);
foreach ($arr as $str)
{
    $str = explode('</ul>', $str);
    $str = $str[0];
    $str = strip_tags($str, '<a>');
    $str = trim($str);

    $link = new StdClass;

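    // SKIP LISTS THAT CONTAIN NO DOUBLE-QUOTED HREF
    if (!preg_match($rgx_href, $str, $mat)) continue;
    $link->href = $url . $mat[1];

    // THE TITLE ATTRIBUTE IS OPTIONAL
    $link->title = preg_match($rgx_title, $str, $mat) ? $mat[1] : '';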

    $links[] = $link;
}

echo '<h2>' . count($links) . " Pages to be Searched on $url" . '</h2>';


// ACQUIRE PAGES FOR EACH OF THE MAIN LINKS
foreach ($links as $key => $link)
{
    $pro = new Page_Response_Object($link->href, $link->title);
    $pro->find_urls();
    $pro->tidy_up();
    $num = count($pro->urls);

    echo '<h2>' . "$pro->title has $num Pages at URL: $pro->href" . '</h2>';

    foreach ($pro->urls as $obj)
    {
        echo PHP_EOL;
        if ($obj->errno)
        {
            echo PHP_EOL;
            echo '<span style="background-color:red;">';
            echo "<b>cURL Errno: $obj->errno, ";
            echo "HTTP Resp: $obj->http_code</b> ";
            echo '</span> ';
        }
        echo $obj->title;
        echo PHP_EOL . '<a target="_new" href="' . $obj->href . '">' . $obj->href . '</a>';
        echo PHP_EOL;
    }
    echo PHP_EOL;
    flush();
}

[edited to correct whitespace and alignment in code snippet]

sandshakimi (author) commented:
All this is good feedback for me to

Question has a verified solution.