Can you explain this PHP Snippet?

EffinGood
Hello experts,

I'm working with a bit of code that was passed down to me and I don't quite have the ability to parse it to fully understand it. It's a crawler - so it's doing a lot of matching.

Could someone run through this and give me an explanation of what this is doing?

I have written some PHP, but some of this is new to me, particularly starting with
 if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#")
    continue;

Thank you!

 function get_a_href($url){
    $url = htmlentities(strip_tags($url));
    $ExplodeUrlInArray = explode('/',$url);
    $SubDomainName = $ExplodeUrlInArray[1];
    $DomainName = $ExplodeUrlInArray[2];
    $file = @file_get_contents($url);
    $h1count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i',$file,$patterns);
    $linksInArray = $patterns[2];
    $CountOfLinks = count($linksInArray);
    $InternalLinkCount = 0;
    $ExternalLinkCount = 0;
    for($Counter=0;$Counter<$CountOfLinks;$Counter++)
    {

    if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#")
    continue;
    preg_match('/javascript:/', $linksInArray[$Counter],$CheckJavascriptLink);
    if($CheckJavascriptLink != NULL)
    continue;
    $Link = $linksInArray[$Counter];
    preg_match('/\?/', $linksInArray[$Counter],$CheckForArgumentsInUrl);
    if($CheckForArgumentsInUrl != NULL)
    {
    $ExplodeLink = explode('?',$linksInArray[$Counter]);
    $Link = $ExplodeLink[0];
    }
    preg_match('/'.$DomainName.'/',$Link,$Check);
    if($Check == NULL)
    {
    preg_match('/http:\/\//',$Link,$ExternalLinkCheck);
    if($ExternalLinkCheck == NULL)
    {
    $InternalDomainsInArray[$InternalLinkCount] = $Link;
    $InternalLinkCount++;
    }
    else
    {
    $ExternalDomainsInArray[$ExternalLinkCount] = $Link;
    $ExternalLinkCount++;
    }

    }
    else
    {
    $InternalDomainsInArray[$InternalLinkCount] = $Link;
    $InternalLinkCount++;
    }
    }

    $LinksResultsInArray = array(
    'ExternalLinks'=>$ExternalDomainsInArray,
    'InternalLinks'=>$InternalDomainsInArray
    );
    return $LinksResultsInArray;
    }

Commented:
if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#")
    continue;


simply means:
- if it's not a link (empty string) or
- if it's a dummy link (#)
skip the rest of this loop iteration and move on to the next link, since there is nothing to parse.
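
A minimal sketch of that skipping behaviour, in case it helps to see it run on its own. The $links array and the $kept array are made-up sample names, not part of the original script:

```php
<?php
// Hypothetical sample data standing in for $linksInArray
$links = array('', '#', 'about.html', '#', 'http://example.com/');

$kept = array();
foreach ($links as $link) {
    // Skip empty href values and bare "#" placeholders
    if ($link == "" || $link == "#") {
        continue; // jump straight to the next iteration
    }
    $kept[] = $link; // only reached for real links
}

print_r($kept);
```

Only about.html and http://example.com/ end up in $kept; the other three entries never reach the line after the continue.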

The rest will be explained by someone else :)

HTH,
Dan
Most Valuable Expert 2011
Top Expert 2016

Commented:
You might want to consider applying a coding standard to the script.  If you just indent the control structures in a sensible manner, a lot of the logic will be visible.

Author

Commented:
Hi Dan,

Thanks man, that # was throwing me off. Couldn't figure that one out! It's looking for an anchor. Check.

Dave Baldwin, Fixer of Problems
Most Valuable Expert 2014

Commented:
To make it easy on myself, especially my future self who may have to change the code, I usually use more parentheses to group the statements so it is clearer what I think I'm doing.
if(($linksInArray[$Counter] == "") || ($linksInArray[$Counter] == "#"))
    continue;
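
For what it's worth, the extra parentheses here are for readability only. In PHP, == binds more tightly than ||, so both spellings evaluate the same way. A quick sketch to confirm it, using made-up sample values ($samples and $same are illustrative names):

```php
<?php
// Hypothetical sample values: empty string, dummy anchor, ordinary link
$samples = array('', '#', 'page.html');

$same = array();
foreach ($samples as $value) {
    $without = $value == "" || $value == "#";     // original form
    $with    = ($value == "") || ($value == "#"); // parenthesised form
    // == is evaluated before ||, so the two results should always agree
    $same[] = ($without === $with);
}

var_dump($same);
```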


Commented:
Annotated with comments.  You can use var_dump() to print out the data so you can see what the code is creating.

<?php // demo/effingood.php
error_reporting(E_ALL);


// SEE http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28379282.html


function get_a_href($url)
{
    // SANITIZE THE $url ARGUMENT
    $url = htmlentities(strip_tags($url));

    // BREAK THE ARGUMENT STRING APART ON THE DIRECTORY-SEPARATOR SLASH
    $ExplodeUrlInArray = explode('/',$url);

    // GET SOME PARTS OF THE EXPLODED STRING (FOR http://example.com/page, ELEMENT 1 IS EMPTY AND ELEMENT 2 IS THE HOST NAME)
    $SubDomainName = $ExplodeUrlInArray[1];
    $DomainName = $ExplodeUrlInArray[2];

    // READ THE CONTENTS OF THE URL RESOURCE, BUT SUPPRESS ERROR MESSAGES
    $file = @file_get_contents($url);

    // FIND THE LINKS IN THE DOCUMENT WITH REGEX GROUPING
    $h1count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i',$file,$patterns);

    // THE LINKS ARE HERE
    $linksInArray = $patterns[2];

    // THE NUMBER OF LINKS IS HERE
    $CountOfLinks = count($linksInArray);

    // SET THESE VARIABLES TO ZERO
    $InternalLinkCount = 0;
    $ExternalLinkCount = 0;

    // ITERATE THROUGH THE LINKS (THIS SHOULD BE FOREACH() INSTEAD OF FOR)
    for($Counter=0;$Counter<$CountOfLinks;$Counter++)
    {
        // IF THE ELEMENT IS EMPTY OR # SKIP IT
        if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#") continue;

        // LOOK FOR JAVASCRIPT LINKS
        preg_match('/javascript:/', $linksInArray[$Counter],$CheckJavascriptLink);

        // SKIP JAVASCRIPT LINKS
        if($CheckJavascriptLink != NULL) continue;

        // COPY THIS LINK DATA TO ANOTHER VARIABLE
        $Link = $linksInArray[$Counter];

        // TRY TO MATCH THE QUESTION MARK
        preg_match('/\?/', $linksInArray[$Counter],$CheckForArgumentsInUrl);

        // IF THERE IS A MATCH ON THE QUESTION MARK
        if($CheckForArgumentsInUrl != NULL)
        {
            // FIND THE LINK WITHOUT THE REQUEST ARGUMENTS
            $ExplodeLink = explode('?',$linksInArray[$Counter]);
            $Link = $ExplodeLink[0];
        }

        // DETERMINE WHETHER THIS IS AN INTERNAL PAGE LINK OR AN EXTERNAL PAGE LINK
        preg_match('/'.$DomainName.'/',$Link,$Check);
        if($Check == NULL)
        {
            preg_match('/http:\/\//',$Link,$ExternalLinkCheck);
            if($ExternalLinkCheck == NULL)
            {
                $InternalDomainsInArray[$InternalLinkCount] = $Link;
                $InternalLinkCount++;
            }
            else
            {
                $ExternalDomainsInArray[$ExternalLinkCount] = $Link;
                $ExternalLinkCount++;
            }
        }
        else
        {
            $InternalDomainsInArray[$InternalLinkCount] = $Link;
            $InternalLinkCount++;
        }
    }

    // SET UP AN ARRAY OF ARRAYS - GIVING THE EXTERNAL AND INTERNAL LINKS
    $LinksResultsInArray = array(
    'ExternalLinks'=>$ExternalDomainsInArray,
    'InternalLinks'=>$InternalDomainsInArray
    );

    // RETURN THE MULTI-DIMENSIONAL ARRAY
    return $LinksResultsInArray;
}
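
As a rough illustration of the var_dump() suggestion, the sketch below runs the same href pattern against a small hard-coded HTML string instead of a fetched page, so it works without network access. The sample HTML and the $count variable are assumptions made up for the demo:

```php
<?php
// Hypothetical HTML standing in for the page fetched by file_get_contents()
$file = '<a href="/about.html">About</a> <a HREF=\'http://example.org/x?id=1\'>X</a> <a href="#">Top</a>';

// Same pattern as in the function above
$count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i', $file, $patterns);

// $patterns[0] holds the full matches, [1] the opening href= and quote,
// [2] the URL text between the quotes, and [3] the closing quote
var_dump($count);
var_dump($patterns[2]);
```

Dumping $patterns[2] shows why the function reads the links from index 2: that is the second parenthesised group, the part between the quotes.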


Commented:
Here's how I understood the code:
function get_a_href($url){
    $url = htmlentities(strip_tags($url));
    $ExplodeUrlInArray = explode('/',$url);
    $SubDomainName = $ExplodeUrlInArray[1];
    $DomainName = $ExplodeUrlInArray[2];
    $file = @file_get_contents($url);
    $h1count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i',$file,$patterns);
    $linksInArray = $patterns[2];
    $CountOfLinks = count($linksInArray);
    $InternalLinkCount = 0;
    $ExternalLinkCount = 0;
    for($Counter=0;$Counter<$CountOfLinks;$Counter++)
	{
		if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#") continue; //ignore null or anchor # links
		preg_match('/javascript:/', $linksInArray[$Counter],$CheckJavascriptLink);  // check if link contains js
		if($CheckJavascriptLink != NULL) continue; // ignore js links
		$Link = $linksInArray[$Counter];
		preg_match('/\?/', $linksInArray[$Counter],$CheckForArgumentsInUrl);  // check for ? - are there arguments
		if($CheckForArgumentsInUrl != NULL) {  // if there are arguments in link
			$ExplodeLink = explode('?',$linksInArray[$Counter]);
			$Link = $ExplodeLink[0];  // set $link as the part before ?
		}
		preg_match('/'.$DomainName.'/',$Link,$Check);  //check if it's an internal link - contains the internal domain
		if($Check == NULL) {  // if it does not contain the internal domain
			preg_match('/http:\/\//',$Link,$ExternalLinkCheck); // check if it's an absolute link - contains http
			if($ExternalLinkCheck == NULL) {  // if it does not contain http, it's a relative link, so it's internal
				$InternalDomainsInArray[$InternalLinkCount] = $Link;
				$InternalLinkCount++;
			} else {  // it's an external link
				$ExternalDomainsInArray[$ExternalLinkCount] = $Link;
				$ExternalLinkCount++;
			}
		} else {  // it's an internal link
			$InternalDomainsInArray[$InternalLinkCount] = $Link;
			$InternalLinkCount++;
		}
    }

    $LinksResultsInArray = array(
    'ExternalLinks'=>$ExternalDomainsInArray,
    'InternalLinks'=>$InternalDomainsInArray
    );
    return $LinksResultsInArray;
}
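
One thing to watch in the internal/external test: it only looks for http://, so an https:// link to another site would be filed as internal, and the domain check is a substring match, so any link that merely contains the domain text counts as internal too. Below is a possible alternative sketch that compares host names with parse_url() instead. This is a swapped-in technique, not what the original script does, and classify_link() plus the sample links are hypothetical names for illustration:

```php
<?php
// Hypothetical helper, not part of the original script
function classify_link($link, $siteHost)
{
    // parse_url() gives the host for absolute URLs, null when there is none,
    // and false for seriously malformed URLs
    $host = parse_url($link, PHP_URL_HOST);

    // No host means a relative link such as "about.html" or "/contact"
    if ($host === null || $host === false) {
        return 'internal';
    }

    // Compare host names case-insensitively
    return (strcasecmp($host, $siteHost) === 0) ? 'internal' : 'external';
}

// Start with empty arrays so the result is defined even when no links are found
$results = array('InternalLinks' => array(), 'ExternalLinks' => array());
$links = array('about.html', 'http://example.com/a', 'https://other.example/b');

foreach ($links as $link) {
    $key = (classify_link($link, 'example.com') === 'internal') ? 'InternalLinks' : 'ExternalLinks';
    $results[$key][] = $link;
}

print_r($results);
```

The foreach loop and the $results[$key][] appends also remove the need for the separate counter variables, though that is a matter of taste.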


Author

Commented:
Wow, thank you gentlemen. I wasn't 100% sure on how to break up points on your deeeelish answers. You make a lady feel special. Thank you!

Commented:
Glad we were able to help!  Thanks for the points and thanks for using EE, ~Ray
