Can you explain this PHP Snippet?

Hello experts,

I'm working with a bit of inherited code that I can't quite parse well enough to fully understand. It's a crawler, so it's doing a lot of matching.

Could someone run through this and give me an explanation of what this is doing?

I have written some PHP, but some of this is new to me, particularly starting with
 if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#")
    continue;

Thank you!

 function get_a_href($url){
    $url = htmlentities(strip_tags($url));
    $ExplodeUrlInArray = explode('/',$url);
    $SubDomainName = $ExplodeUrlInArray[1];
    $DomainName = $ExplodeUrlInArray[2];
    $file = @file_get_contents($url);
    $h1count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i',$file,$patterns);
    $linksInArray = $patterns[2];
    $CountOfLinks = count($linksInArray);
    $InternalLinkCount = 0;
    $ExternalLinkCount = 0;
    for($Counter=0;$Counter<$CountOfLinks;$Counter++)
    {

    if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#")
    continue;
    preg_match('/javascript:/', $linksInArray[$Counter],$CheckJavascriptLink);
    if($CheckJavascriptLink != NULL)
    continue;
    $Link = $linksInArray[$Counter];
    preg_match('/\?/', $linksInArray[$Counter],$CheckForArgumentsInUrl);
    if($CheckForArgumentsInUrl != NULL)
    {
    $ExplodeLink = explode('?',$linksInArray[$Counter]);
    $Link = $ExplodeLink[0];
    }
    preg_match('/'.$DomainName.'/',$Link,$Check);
    if($Check == NULL)
    {
    preg_match('/http:\/\//',$Link,$ExternalLinkCheck);
    if($ExternalLinkCheck == NULL)
    {
    $InternalDomainsInArray[$InternalLinkCount] = $Link;
    $InternalLinkCount++;
    }
    else
    {
    $ExternalDomainsInArray[$ExternalLinkCount] = $Link;
    $ExternalLinkCount++;
    }

    }
    else
    {
    $InternalDomainsInArray[$InternalLinkCount] = $Link;
    $InternalLinkCount++;
    }
    }

    $LinksResultsInArray = array(
    'ExternalLinks'=>$ExternalDomainsInArray,
    'InternalLinks'=>$InternalDomainsInArray
    );
    return $LinksResultsInArray;
    }

Asked by: EffinGood

Dan Craciun (IT Consultant) commented:
if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#")
    continue;

simply means:
- if it's not a link (empty string) or
- if it's a dummy link (#)
skip the rest of the loop body and move on to the next link, as this one doesn't need to be parsed.
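
As a minimal sketch of that behaviour (the `$links` array below is made-up sample data standing in for `$linksInArray`, and `$kept` is a name introduced only for this illustration), `continue` jumps straight to the next iteration, so the empty and "#" entries never reach the code below the check:

```php
<?php
// Hypothetical sample data standing in for $linksInArray.
$links = array('', '#', '/about.html', 'http://example.com/');
$kept = array();

foreach ($links as $link) {
    // Skip empty href values and bare "#" placeholders.
    if ($link == "" || $link == "#") {
        continue; // nothing below this line runs for this entry
    }
    $kept[] = $link;
}

// Only '/about.html' and 'http://example.com/' are left.
print_r($kept);
```

In the original function the same check guards everything that follows it inside the `for` loop.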

The rest will be explained by someone else :)

HTH,
Dan
Ray Paseur commented:
You might want to consider applying a coding standard to the script.  If you just indent the control structures in a sensible manner, a lot of the logic will be visible.
EffinGood (Author) commented:
Hi Dan,

Thanks man, that # was throwing me off. Couldn't figure that one out! It's looking for an anchor. Check.
Dave Baldwin (Fixer of Problems) commented:
To make it easy on myself, especially my future self who may have to change the code, I usually use more parentheses to group the statements and make it clearer what I think I'm doing.
if(($linksInArray[$Counter] == "") || ($linksInArray[$Counter] == "#"))
    continue;
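
A small sketch to show that the extra parentheses change readability only, not the result: in PHP, `==` binds more tightly than `||`, so both spellings evaluate the same way. The sample values and the `$results` array are assumptions made up for this illustration.

```php
<?php
// Hypothetical sample values; any strings would do.
$results = array();

foreach (array('', '#', '/page.html') as $link) {
    $plain   = $link == "" || $link == "#";      // relies on operator precedence
    $grouped = ($link == "") || ($link == "#");  // grouping made explicit
    $results[] = ($plain === $grouped);
}

// Every entry is true: the two forms agree for each value.
var_dump($results);
```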

Ray Paseur commented:
Annotated with comments.  You can use var_dump() to print out the data so you can see what the code is creating.

<?php // demo/effingood.php
error_reporting(E_ALL);


// SEE http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28379282.html


function get_a_href($url)
{
    // SANITIZE THE $url ARGUMENT
    $url = htmlentities(strip_tags($url));

    // BREAK THE ARGUMENT STRING APART ON THE DIRECTORY-SEPARATOR SLASH
    $ExplodeUrlInArray = explode('/',$url);

    // GET SOME PARTS OF THE EXPLODED STRING
    // (FOR http://host/path, ELEMENT 1 IS EMPTY AND ELEMENT 2 IS THE HOST NAME)
    $SubDomainName = $ExplodeUrlInArray[1];
    $DomainName = $ExplodeUrlInArray[2];

    // READ THE CONTENTS OF THE URL RESOURCE, BUT SUPPRESS ERROR MESSAGES
    $file = @file_get_contents($url);

    // FIND THE LINKS IN THE DOCUMENT WITH REGEX GROUPING
    $h1count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i',$file,$patterns);

    // THE LINKS ARE HERE
    $linksInArray = $patterns[2];

    // THE NUMBER OF LINKS IS HERE
    $CountOfLinks = count($linksInArray);

    // SET THESE VARIABLES TO ZERO
    $InternalLinkCount = 0;
    $ExternalLinkCount = 0;

    // ITERATE THROUGH THE LINKS (THIS SHOULD BE FOREACH() INSTEAD OF FOR)
    for($Counter=0;$Counter<$CountOfLinks;$Counter++)
    {
        // IF THE ELEMENT IS EMPTY OR # SKIP IT
        if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#") continue;

        // LOOK FOR JAVASCRIPT LINKS
        preg_match('/javascript:/', $linksInArray[$Counter],$CheckJavascriptLink);

        // SKIP JAVASCRIPT LINKS
        if($CheckJavascriptLink != NULL) continue;

        // COPY THIS LINK DATA TO ANOTHER VARIABLE
        $Link = $linksInArray[$Counter];

        // TRY TO MATCH THE QUESTION MARK
        preg_match('/\?/', $linksInArray[$Counter],$CheckForArgumentsInUrl);

        // IF THERE IS A MATCH ON THE QUESTION MARK
        if($CheckForArgumentsInUrl != NULL)
        {
            // FIND THE LINK WITHOUT THE REQUEST ARGUMENTS
            $ExplodeLink = explode('?',$linksInArray[$Counter]);
            $Link = $ExplodeLink[0];
        }

        // DETERMINE WHETHER THIS IS AN INTERNAL PAGE LINK OR AN EXTERNAL PAGE LINK
        preg_match('/'.$DomainName.'/',$Link,$Check);
        if($Check == NULL)
        {
            preg_match('/http:\/\//',$Link,$ExternalLinkCheck);
            if($ExternalLinkCheck == NULL)
            {
                $InternalDomainsInArray[$InternalLinkCount] = $Link;
                $InternalLinkCount++;
            }
            else
            {
                $ExternalDomainsInArray[$ExternalLinkCount] = $Link;
                $ExternalLinkCount++;
            }
        }
        else
        {
            $InternalDomainsInArray[$InternalLinkCount] = $Link;
            $InternalLinkCount++;
        }
    }

    // SET UP AN ARRAY OF ARRAYS - GIVING THE EXTERNAL AND INTERNAL LINKS
    $LinksResultsInArray = array(
    'ExternalLinks'=>$ExternalDomainsInArray,
    'InternalLinks'=>$InternalDomainsInArray
    );

    // RETURN THE MULTI-DIMENSIONAL ARRAY
    return $LinksResultsInArray;
}
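
To see the intermediate data without fetching a real page, here is a rough sketch that feeds a hard-coded HTML string through the same pattern and dumps the results. The HTML, the example URL, and the variable names `$count` and `$parts` are assumptions made up for this illustration; the pattern is the one from the function.

```php
<?php
// Hypothetical HTML standing in for what file_get_contents() would return.
$file = '<a href="/about.html">About</a> '
      . '<a href=\'http://other.example/\'>Other</a> '
      . '<a href="#">Top</a>';

// Same pattern as in the function: group 2 captures the text between the quotes.
$count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i', $file, $patterns);

var_dump($count);       // int(3): three href attributes were found
var_dump($patterns[2]); // the captured link targets, in document order

// How the URL argument gets split on "/".
$parts = explode('/', 'http://www.example.com/index.html');
var_dump($parts);       // element 1 is empty, element 2 is the host name
```

Dumping `$parts` shows why `$SubDomainName` (element 1) ends up as an empty string for an ordinary `http://` URL, while `$DomainName` (element 2) holds the host.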

Dan Craciun (IT Consultant) commented:
Here's how I understood the code:
function get_a_href($url){
    $url = htmlentities(strip_tags($url));
    $ExplodeUrlInArray = explode('/',$url);
    $SubDomainName = $ExplodeUrlInArray[1];
    $DomainName = $ExplodeUrlInArray[2];
    $file = @file_get_contents($url);
    $h1count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i',$file,$patterns);
    $linksInArray = $patterns[2];
    $CountOfLinks = count($linksInArray);
    $InternalLinkCount = 0;
    $ExternalLinkCount = 0;
    for($Counter=0;$Counter<$CountOfLinks;$Counter++)
	{
		if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#") continue; //ignore null or anchor # links
		preg_match('/javascript:/', $linksInArray[$Counter],$CheckJavascriptLink);  // check if link contains js
		if($CheckJavascriptLink != NULL) continue; // ignore js links
		$Link = $linksInArray[$Counter];
		preg_match('/\?/', $linksInArray[$Counter],$CheckForArgumentsInUrl);  // check for ? - are there arguments
		if($CheckForArgumentsInUrl != NULL) {  // if there are arguments in link
			$ExplodeLink = explode('?',$linksInArray[$Counter]);
			$Link = $ExplodeLink[0];  // set $link as the part before ?
		}
		preg_match('/'.$DomainName.'/',$Link,$Check);  //check if it's an internal link - contains the internal domain
		if($Check == NULL) {  // if it does not contain the internal domain
			preg_match('/http:\/\//',$Link,$ExternalLinkCheck); // check if it's an absolute link - contains http
			if($ExternalLinkCheck == NULL) {  // if it does not contain http, it's a relative link, so it's internal
				$InternalDomainsInArray[$InternalLinkCount] = $Link;
				$InternalLinkCount++;
			} else {  // it's an external link
				$ExternalDomainsInArray[$ExternalLinkCount] = $Link;
				$ExternalLinkCount++;
			}
		} else {  // it's an internal link
			$InternalDomainsInArray[$InternalLinkCount] = $Link;
			$InternalLinkCount++;
		}
    }

    $LinksResultsInArray = array(
    'ExternalLinks'=>$ExternalDomainsInArray,
    'InternalLinks'=>$InternalDomainsInArray
    );
    return $LinksResultsInArray;
}
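
For comparison, here is one possible tidied-up sketch of the same classification logic. It is not the original author's code: the function name `classify_links` and its parameters `$html` and `$baseUrl` are hypothetical, it takes the HTML as a string instead of fetching it, and it assumes `$baseUrl` is an absolute URL. It also behaves differently in a few places. It compares host names via `parse_url()` rather than building a regex from `$DomainName` (where the dots would act as wildcards), so an `https://` link to another site counts as external, whereas the original files it under internal because it only looks for `http://`. It also skips only links that begin with `javascript:`.

```php
<?php
// Hypothetical helper, not part of the original script: sorts the links found
// in an HTML string into internal and external, relative to the host of $baseUrl.
function classify_links($html, $baseUrl)
{
    $host = (string) parse_url($baseUrl, PHP_URL_HOST);

    // Initialise both lists so the result is well-defined even with no links.
    $result = array('ExternalLinks' => array(), 'InternalLinks' => array());

    preg_match_all('/href=["\'](.*?)["\']/i', $html, $matches);

    foreach ($matches[1] as $link) {
        // Skip empty values, bare anchors, and javascript: pseudo-links.
        if ($link === '' || $link === '#' || stripos($link, 'javascript:') === 0) {
            continue;
        }

        // Drop the query string, as the original does.
        $pieces = explode('?', $link);
        $link = $pieces[0];

        $linkHost = parse_url($link, PHP_URL_HOST);

        // No host means a relative link; a matching host means internal.
        if (!is_string($linkHost) || strcasecmp($linkHost, $host) === 0) {
            $result['InternalLinks'][] = $link;
        } else {
            $result['ExternalLinks'][] = $link;
        }
    }

    return $result;
}

// Usage with made-up sample HTML.
$html = '<a href="/about.html">A</a>'
      . '<a href="http://www.example.com/contact.html?x=1">C</a>'
      . '<a href="https://other.example/">O</a>'
      . '<a href="#">T</a>'
      . '<a href="javascript:void(0)">J</a>';

$links = classify_links($html, 'http://www.example.com/index.html');
print_r($links);
```

Using `foreach` and `$array[] = ...` removes the need for the three counter variables, and starting from two empty arrays avoids undefined-variable notices when a page has no internal or no external links.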

EffinGood (Author) commented:
Wow, thank you gentlemen. I wasn't 100% sure on how to break up points on your deeeelish answers. You make a lady feel special. Thank you!
Ray Paseur commented:
Glad we were able to help!  Thanks for the points and thanks for using EE, ~Ray