Can you explain this PHP Snippet?

Hello experts,

I'm working with a bit of code that was passed down to me and I don't quite have the ability to parse it to fully understand it. It's a crawler - so it's doing a lot of matching.

Could someone run through this and give me an explanation of what this is doing?

I have written some PHP but some of this is new on me, particularly starting with
 if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#")
    continue;

Thank you!

 function get_a_href($url){
    $url = htmlentities(strip_tags($url));
    $ExplodeUrlInArray = explode('/',$url);
    $SubDomainName = $ExplodeUrlInArray[1];
    $DomainName = $ExplodeUrlInArray[2];
    $file = @file_get_contents($url);
    $h1count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i',$file,$patterns);
    $linksInArray = $patterns[2];
    $CountOfLinks = count($linksInArray);
    $InternalLinkCount = 0;
    $ExternalLinkCount = 0;
    for($Counter=0;$Counter<$CountOfLinks;$Counter++)
    {

    if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#")
    continue;
    preg_match('/javascript:/', $linksInArray[$Counter],$CheckJavascriptLink);
    if($CheckJavascriptLink != NULL)
    continue;
    $Link = $linksInArray[$Counter];
    preg_match('/\?/', $linksInArray[$Counter],$CheckForArgumentsInUrl);
    if($CheckForArgumentsInUrl != NULL)
    {
    $ExplodeLink = explode('?',$linksInArray[$Counter]);
    $Link = $ExplodeLink[0];
    }
    preg_match('/'.$DomainName.'/',$Link,$Check);
    if($Check == NULL)
    {
    preg_match('/http:\/\//',$Link,$ExternalLinkCheck);
    if($ExternalLinkCheck == NULL)
    {
    $InternalDomainsInArray[$InternalLinkCount] = $Link;
    $InternalLinkCount++;
    }
    else
    {
    $ExternalDomainsInArray[$ExternalLinkCount] = $Link;
    $ExternalLinkCount++;
    }

    }
    else
    {
    $InternalDomainsInArray[$InternalLinkCount] = $Link;
    $InternalLinkCount++;
    }
    }

    $LinksResultsInArray = array(
    'ExternalLinks'=>$ExternalDomainsInArray,
    'InternalLinks'=>$InternalDomainsInArray
    );
    return $LinksResultsInArray;
    }

Open in new window

EffinGoodAsked:
Who is Participating?

Improve company productivity with a Business Account.Sign Up

x
 
Ray PaseurConnect With a Mentor Commented:
Annotated with comments.  You can use var_dump() to print out the data so you can see what the code is creating.

<?php // demo/effingood.php
error_reporting(E_ALL);


// SEE http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28379282.html


function get_a_href($url)
{
    // SANITIZE THE $url ARGUMENT
    $url = htmlentities(strip_tags($url));

    // BREAK THE ARGUMENT STRING APART ON THE DIRECTOR-SEPARATOR SLASH
    $ExplodeUrlInArray = explode('/',$url);

    // GET SOME PARTS OF THE EXPLODED STRING
    $SubDomainName = $ExplodeUrlInArray[1];
    $DomainName = $ExplodeUrlInArray[2];

    // READ THE CONTENTS OF THE URL RESOURCE, BUT SUPPRESS ERROR MESSAGES
    $file = @file_get_contents($url);

    // FIND THE LINKS IN THE DOCUMENT WITH REGEX GROUPING
    $h1count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i',$file,$patterns);

    // THE LINKS ARE HERE
    $linksInArray = $patterns[2];

    // THE NUMBER OF LINKS ARE HERE
    $CountOfLinks = count($linksInArray);

    // SET THESE VARIABLES TO ZERO
    $InternalLinkCount = 0;
    $ExternalLinkCount = 0;

    // ITERATE THROUGH THE LINKS (THIS SHOULD BE FOREACH() INSTEAD OF FOR)
    for($Counter=0;$Counter<$CountOfLinks;$Counter++)
    {
        // IF THE ELEMENT IS NULL OR # SKIP IT
        if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#") continue;

        // LOOK FOR JAVASCRIPT LINKS
        preg_match('/javascript:/', $linksInArray[$Counter],$CheckJavascriptLink);

        // SKIP JAVASCRIPT LINKS
        if($CheckJavascriptLink != NULL) continue;

        // COPY THIS LINK DATA TO ANOTHER VARIABLE
        $Link = $linksInArray[$Counter];

        // TRY TO MATCH THE QUESTION MARK
        preg_match('/\?/', $linksInArray[$Counter],$CheckForArgumentsInUrl);

        // IF THERE IS A MATCH ON THE QUESTION MARK
        if($CheckForArgumentsInUrl != NULL)
        {
            // FIND THE LINK WITHOUT THE REQUEST ARGUMENTS
            $ExplodeLink = explode('?',$linksInArray[$Counter]);
            $Link = $ExplodeLink[0];
        }

        // DETERMINE WHETHER THIS IS AN INTERANL PAGE LINK OR AN EXTERNAL PAGE LINKS
        preg_match('/'.$DomainName.'/',$Link,$Check);
        if($Check == NULL)
        {
            preg_match('/http:\/\//',$Link,$ExternalLinkCheck);
            if($ExternalLinkCheck == NULL)
            {
                $InternalDomainsInArray[$InternalLinkCount] = $Link;
                $InternalLinkCount++;
            }
            else
            {
                $ExternalDomainsInArray[$ExternalLinkCount] = $Link;
                $ExternalLinkCount++;
            }
        }
        else
        {
            $InternalDomainsInArray[$InternalLinkCount] = $Link;
            $InternalLinkCount++;
        }
    }

    // SET UP AN ARRAY OF ARRAYS - GIVING THE EXTERNAL AND INTERNAL LINKS
    $LinksResultsInArray = array(
    'ExternalLinks'=>$ExternalDomainsInArray,
    'InternalLinks'=>$InternalDomainsInArray
    );

    // RETURN THE MULTI-DIMENSIONAL ARRAY
    return $LinksResultsInArray;
}

Open in new window

0
 
Dan CraciunConnect With a Mentor IT ConsultantCommented:
if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#")
    continue;

Open in new window

simply means:
- if it's not a link (empty string) or
- if it's a dummy link (#)
skipt the rest of the loop, as it's not needed to be parsed.

The rest will be explained by someone else :)

HTH,
Dan
0
 
Ray PaseurCommented:
You might want to consider applying a coding standard to the script.  If you just indent the control structures in a sensible manner, a lot of the logic will be visible.
0
Get your problem seen by more experts

Be seen. Boost your question’s priority for more expert views and faster solutions

 
EffinGoodAuthor Commented:
Hi Dan,

Thanks man, that # was throwing me off. Couldn't figure that one out! It's looking for an anchor. Check.
0
 
Dave BaldwinFixer of ProblemsCommented:
To make it easy on myself, especially my future self that may have to change the code, I usually use more parenthesis to group the statements to make it clearer what I think I'm doing.
if(($linksInArray[$Counter] == "") || ($linksInArray[$Counter] == "#"))
    continue;

Open in new window

0
 
Dan CraciunConnect With a Mentor IT ConsultantCommented:
Here's how I understood the code:
function get_a_href($url){
    $url = htmlentities(strip_tags($url));
    $ExplodeUrlInArray = explode('/',$url);
    $SubDomainName = $ExplodeUrlInArray[1];
    $DomainName = $ExplodeUrlInArray[2];
    $file = @file_get_contents($url);
    $h1count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i',$file,$patterns);
    $linksInArray = $patterns[2];
    $CountOfLinks = count($linksInArray);
    $InternalLinkCount = 0;
    $ExternalLinkCount = 0;
    for($Counter=0;$Counter<$CountOfLinks;$Counter++)
	{
		if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#") continue; //ignore null or anchor # links
		preg_match('/javascript:/', $linksInArray[$Counter],$CheckJavascriptLink);  // check if link contains js
		if($CheckJavascriptLink != NULL) continue; // ignore js links
		$Link = $linksInArray[$Counter];
		preg_match('/\?/', $linksInArray[$Counter],$CheckForArgumentsInUrl);  // check for ? - are there arguments
		if($CheckForArgumentsInUrl != NULL) {  // if there are arguments in link
			$ExplodeLink = explode('?',$linksInArray[$Counter]);
			$Link = $ExplodeLink[0];  // set $link as the part before ?
		}
		preg_match('/'.$DomainName.'/',$Link,$Check);  //check if it's an internal link - contains the internal domain
		if($Check == NULL) {  // it it does not contain the internal domain
			preg_match('/http:\/\//',$Link,$ExternalLinkCheck); // check if it's an absolute link - contains http
			if($ExternalLinkCheck == NULL) {  // if it does not contain http means it'a relative link, so it's internal
				$InternalDomainsInArray[$InternalLinkCount] = $Link;
				$InternalLinkCount++;
			} else {  // it's an external link
				$ExternalDomainsInArray[$ExternalLinkCount] = $Link;
				$ExternalLinkCount++;
			}
		} else {  // it's an internal link
			$InternalDomainsInArray[$InternalLinkCount] = $Link;
			$InternalLinkCount++;
		}
    }

    $LinksResultsInArray = array(
    'ExternalLinks'=>$ExternalDomainsInArray,
    'InternalLinks'=>$InternalDomainsInArray
    );
    return $LinksResultsInArray;
}

Open in new window

0
 
EffinGoodAuthor Commented:
Wow, thank you gentlemen. I wasn't 100% sure on how to break up points on your deeeelish answers. You make a lady feel special. Thank you!
0
 
Ray PaseurCommented:
Glad we were able to help!  Thanks for the points and thanks for using EE, ~Ray
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.