Solved

Can you explain this PHP Snippet?

Posted on 2014-03-03
Last Modified: 2014-03-03
Hello experts,

I'm working with a bit of code that was passed down to me and I don't quite have the ability to parse it to fully understand it. It's a crawler - so it's doing a lot of matching.

Could someone run through this and give me an explanation of what this is doing?

I have written some PHP, but some of this is new to me, particularly starting with:
 if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#")
    continue;

Thank you!

 function get_a_href($url){
    $url = htmlentities(strip_tags($url));
    $ExplodeUrlInArray = explode('/',$url);
    $SubDomainName = $ExplodeUrlInArray[1];
    $DomainName = $ExplodeUrlInArray[2];
    $file = @file_get_contents($url);
    $h1count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i',$file,$patterns);
    $linksInArray = $patterns[2];
    $CountOfLinks = count($linksInArray);
    $InternalLinkCount = 0;
    $ExternalLinkCount = 0;
    for($Counter=0;$Counter<$CountOfLinks;$Counter++)
    {

    if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#")
    continue;
    preg_match('/javascript:/', $linksInArray[$Counter],$CheckJavascriptLink);
    if($CheckJavascriptLink != NULL)
    continue;
    $Link = $linksInArray[$Counter];
    preg_match('/\?/', $linksInArray[$Counter],$CheckForArgumentsInUrl);
    if($CheckForArgumentsInUrl != NULL)
    {
    $ExplodeLink = explode('?',$linksInArray[$Counter]);
    $Link = $ExplodeLink[0];
    }
    preg_match('/'.$DomainName.'/',$Link,$Check);
    if($Check == NULL)
    {
    preg_match('/http:\/\//',$Link,$ExternalLinkCheck);
    if($ExternalLinkCheck == NULL)
    {
    $InternalDomainsInArray[$InternalLinkCount] = $Link;
    $InternalLinkCount++;
    }
    else
    {
    $ExternalDomainsInArray[$ExternalLinkCount] = $Link;
    $ExternalLinkCount++;
    }

    }
    else
    {
    $InternalDomainsInArray[$InternalLinkCount] = $Link;
    $InternalLinkCount++;
    }
    }

    $LinksResultsInArray = array(
    'ExternalLinks'=>$ExternalDomainsInArray,
    'InternalLinks'=>$InternalDomainsInArray
    );
    return $LinksResultsInArray;
    }

Question by:EffinGood
8 Comments
 
LVL 34

Assisted Solution

by:Dan Craciun
Dan Craciun earned 150 total points
ID: 39901831
if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#")
    continue;


simply means:
- if it's not a link (empty string) or
- if it's a dummy link (#)
skip the rest of the current loop iteration, since that entry doesn't need to be parsed.
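
As a minimal sketch of that skip behaviour (the sample links are made up for illustration and are not taken from the crawler), continue jumps straight to the next iteration, so nothing below it runs for that entry:

```php
<?php
// Hypothetical sample data standing in for the hrefs pulled out of a page.
$links = array('', '#', '/about', 'http://example.com/');
$kept  = array();

foreach ($links as $link) {
    if ($link == "" || $link == "#") {
        continue; // nothing below this line runs for this entry
    }
    $kept[] = $link;
}

// Only '/about' and 'http://example.com/' are left.
print_r($kept);
```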

The rest will be explained by someone else :)

HTH,
Dan
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 39901833
You might want to consider applying a coding standard to the script.  If you just indent the control structures in a sensible manner, a lot of the logic will be visible.
 

Author Comment

by:EffinGood
ID: 39901835
Hi Dan,

Thanks man, that # was throwing me off. Couldn't figure that one out! It's looking for an anchor. Check.

 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 39901848
To make it easy on myself, especially my future self who may have to change the code, I usually add more parentheses to group the conditions, which makes it clearer what I think I'm doing.
if(($linksInArray[$Counter] == "") || ($linksInArray[$Counter] == "#"))
    continue;
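
A quick check, offered only as a sketch with made-up sample values: in PHP, == binds more tightly than ||, so the grouped and ungrouped conditions should evaluate identically. The extra parentheses are there for the reader, not the interpreter.

```php
<?php
// Sample values are invented for illustration.
$samples    = array('', '#', '/about');
$sameResult = true;

foreach ($samples as $link) {
    $plain   = $link == "" || $link == "#";     // as in the original code
    $grouped = ($link == "") || ($link == "#"); // with explicit grouping
    if ($plain !== $grouped) {
        $sameResult = false;
    }
}

var_dump($sameResult);
```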

 
LVL 109

Accepted Solution

by:Ray Paseur
Ray Paseur earned 350 total points
ID: 39901865
Annotated with comments.  You can use var_dump() to print out the data so you can see what the code is creating.

<?php // demo/effingood.php
error_reporting(E_ALL);


// SEE http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28379282.html


function get_a_href($url)
{
    // SANITIZE THE $url ARGUMENT
    $url = htmlentities(strip_tags($url));

    // BREAK THE ARGUMENT STRING APART ON THE DIRECTORY-SEPARATOR SLASH
    $ExplodeUrlInArray = explode('/',$url);

    // GET SOME PARTS OF THE EXPLODED STRING
    $SubDomainName = $ExplodeUrlInArray[1];
    $DomainName = $ExplodeUrlInArray[2];

    // READ THE CONTENTS OF THE URL RESOURCE, BUT SUPPRESS ERROR MESSAGES
    $file = @file_get_contents($url);

    // FIND THE LINKS IN THE DOCUMENT WITH REGEX GROUPING
    $h1count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i',$file,$patterns);

    // THE LINKS ARE HERE
    $linksInArray = $patterns[2];

    // THE NUMBER OF LINKS IS HERE
    $CountOfLinks = count($linksInArray);

    // SET THESE VARIABLES TO ZERO
    $InternalLinkCount = 0;
    $ExternalLinkCount = 0;

    // ITERATE THROUGH THE LINKS (THIS SHOULD BE FOREACH() INSTEAD OF FOR)
    for($Counter=0;$Counter<$CountOfLinks;$Counter++)
    {
        // IF THE ELEMENT IS AN EMPTY STRING OR A BARE # SKIP IT
        if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#") continue;

        // LOOK FOR JAVASCRIPT LINKS
        preg_match('/javascript:/', $linksInArray[$Counter],$CheckJavascriptLink);

        // SKIP JAVASCRIPT LINKS
        if($CheckJavascriptLink != NULL) continue;

        // COPY THIS LINK DATA TO ANOTHER VARIABLE
        $Link = $linksInArray[$Counter];

        // TRY TO MATCH THE QUESTION MARK
        preg_match('/\?/', $linksInArray[$Counter],$CheckForArgumentsInUrl);

        // IF THERE IS A MATCH ON THE QUESTION MARK
        if($CheckForArgumentsInUrl != NULL)
        {
            // FIND THE LINK WITHOUT THE REQUEST ARGUMENTS
            $ExplodeLink = explode('?',$linksInArray[$Counter]);
            $Link = $ExplodeLink[0];
        }

        // DETERMINE WHETHER THIS IS AN INTERNAL PAGE LINK OR AN EXTERNAL PAGE LINK
        preg_match('/'.$DomainName.'/',$Link,$Check);
        if($Check == NULL)
        {
            preg_match('/http:\/\//',$Link,$ExternalLinkCheck);
            if($ExternalLinkCheck == NULL)
            {
                $InternalDomainsInArray[$InternalLinkCount] = $Link;
                $InternalLinkCount++;
            }
            else
            {
                $ExternalDomainsInArray[$ExternalLinkCount] = $Link;
                $ExternalLinkCount++;
            }
        }
        else
        {
            $InternalDomainsInArray[$InternalLinkCount] = $Link;
            $InternalLinkCount++;
        }
    }

    // SET UP AN ARRAY OF ARRAYS - GIVING THE EXTERNAL AND INTERNAL LINKS
    $LinksResultsInArray = array(
    'ExternalLinks'=>$ExternalDomainsInArray,
    'InternalLinks'=>$InternalDomainsInArray
    );

    // RETURN THE MULTI-DIMENSIONAL ARRAY
    return $LinksResultsInArray;
}
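
To try the var_dump() suggestion without fetching a live page, here is a small sketch that runs the same explode() and preg_match_all() calls on sample input. The URL and the HTML string are assumptions made up for the demonstration. Note that for a URL starting with http://, element 1 of the exploded array is an empty string, so $SubDomainName ends up empty and $DomainName holds the host name.

```php
<?php
// Hypothetical sample inputs; the real function fetches $file over HTTP.
$url  = 'http://www.example.com/page';
$file = '<a href="/about">A</a> <a href=\'#\'>B</a> <a href="http://other.test/x?y=1">C</a>';

// Same explode() as the function: 'http:', '', 'www.example.com', 'page'.
$parts = explode('/', $url);
var_dump($parts);

// Same pattern as the function; group 2 holds the text between the quotes.
preg_match_all('/(href=["|\'])(.*?)(["|\'])/i', $file, $patterns);
var_dump($patterns[2]);
```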

 
LVL 34

Assisted Solution

by:Dan Craciun
Dan Craciun earned 150 total points
ID: 39901866
Here's how I understood the code:
function get_a_href($url){
    $url = htmlentities(strip_tags($url));
    $ExplodeUrlInArray = explode('/',$url);
    $SubDomainName = $ExplodeUrlInArray[1];
    $DomainName = $ExplodeUrlInArray[2];
    $file = @file_get_contents($url);
    $h1count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i',$file,$patterns);
    $linksInArray = $patterns[2];
    $CountOfLinks = count($linksInArray);
    $InternalLinkCount = 0;
    $ExternalLinkCount = 0;
    for($Counter=0;$Counter<$CountOfLinks;$Counter++)
	{
		if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#") continue; //ignore null or anchor # links
		preg_match('/javascript:/', $linksInArray[$Counter],$CheckJavascriptLink);  // check if link contains js
		if($CheckJavascriptLink != NULL) continue; // ignore js links
		$Link = $linksInArray[$Counter];
		preg_match('/\?/', $linksInArray[$Counter],$CheckForArgumentsInUrl);  // check for ? - are there arguments
		if($CheckForArgumentsInUrl != NULL) {  // if there are arguments in link
			$ExplodeLink = explode('?',$linksInArray[$Counter]);
			$Link = $ExplodeLink[0];  // set $link as the part before ?
		}
		preg_match('/'.$DomainName.'/',$Link,$Check);  //check if it's an internal link - contains the internal domain
		if($Check == NULL) {  // if it does not contain the internal domain
			preg_match('/http:\/\//',$Link,$ExternalLinkCheck); // check if it's an absolute link - contains http
			if($ExternalLinkCheck == NULL) {  // if it does not contain http:// it's a relative link, so it's internal
				$InternalDomainsInArray[$InternalLinkCount] = $Link;
				$InternalLinkCount++;
			} else {  // it's an external link
				$ExternalDomainsInArray[$ExternalLinkCount] = $Link;
				$ExternalLinkCount++;
			}
		} else {  // it's an internal link
			$InternalDomainsInArray[$InternalLinkCount] = $Link;
			$InternalLinkCount++;
		}
    }

    $LinksResultsInArray = array(
    'ExternalLinks'=>$ExternalDomainsInArray,
    'InternalLinks'=>$InternalDomainsInArray
    );
    return $LinksResultsInArray;
}
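
A few things stand out when reading the code this way. The two result arrays are never initialized, so PHP raises an undefined-variable notice at the return if one kind of link is never found. Only http:// is tested, so an https:// link to another site lands in the internal list. And $DomainName goes into the pattern without preg_quote(), so its dots match any character. The following is only a sketch of one way to handle those cases, not the original author's method. The function name classify_links, its parameters and the sample data are hypothetical, and it works on already extracted href values rather than fetching a page.

```php
<?php
// Hypothetical helper, not part of the original script: classifies already
// extracted href values against the host of the page they came from.
function classify_links(array $links, $pageHost)
{
    // Start with empty arrays so the result is well defined when nothing matches.
    $result = array('ExternalLinks' => array(), 'InternalLinks' => array());

    foreach ($links as $link) {
        // Skip empty hrefs, bare anchors and links that start with javascript:
        if ($link === '' || $link === '#' || stripos($link, 'javascript:') === 0) {
            continue;
        }

        // Drop the query string, as the original does with explode('?').
        $parts = explode('?', $link, 2);
        $link  = $parts[0];

        // parse_url() returns null for the host of a relative link.
        $linkHost = parse_url($link, PHP_URL_HOST);

        // No host (relative link) or the page's own host counts as internal.
        if (!$linkHost || strcasecmp($linkHost, $pageHost) === 0) {
            $result['InternalLinks'][] = $link;
        } else {
            $result['ExternalLinks'][] = $link;
        }
    }

    return $result;
}

// Made-up sample data for the demonstration.
$sample = array('', '#', 'javascript:void(0)', '/about?x=1',
    'https://www.example.com/contact', 'https://other.test/page');
var_dump(classify_links($sample, 'www.example.com'));
```

Comparing hosts with parse_url() avoids building a regular expression from the domain and treats http and https links alike. Links with other schemes, such as mailto:, have no host and would still be counted as internal in this sketch.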

 

Author Closing Comment

by:EffinGood
ID: 39901888
Wow, thank you gentlemen. I wasn't 100% sure on how to break up points on your deeeelish answers. You make a lady feel special. Thank you!
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 39902150
Glad we were able to help!  Thanks for the points and thanks for using EE, ~Ray