Solved

Can you explain this PHP Snippet?

Posted on 2014-03-03
Last Modified: 2014-03-03
Hello experts,

I'm working with a bit of code that was passed down to me and I don't quite have the ability to parse it to fully understand it. It's a crawler - so it's doing a lot of matching.

Could someone run through this and give me an explanation of what this is doing?

I have written some PHP but some of this is new to me, particularly starting with
 if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#")
    continue;

Thank you!

 function get_a_href($url){
    $url = htmlentities(strip_tags($url));
    $ExplodeUrlInArray = explode('/',$url);
    $SubDomainName = $ExplodeUrlInArray[1];
    $DomainName = $ExplodeUrlInArray[2];
    $file = @file_get_contents($url);
    $h1count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i',$file,$patterns);
    $linksInArray = $patterns[2];
    $CountOfLinks = count($linksInArray);
    $InternalLinkCount = 0;
    $ExternalLinkCount = 0;
    for($Counter=0;$Counter<$CountOfLinks;$Counter++)
    {

    if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#")
    continue;
    preg_match('/javascript:/', $linksInArray[$Counter],$CheckJavascriptLink);
    if($CheckJavascriptLink != NULL)
    continue;
    $Link = $linksInArray[$Counter];
    preg_match('/\?/', $linksInArray[$Counter],$CheckForArgumentsInUrl);
    if($CheckForArgumentsInUrl != NULL)
    {
    $ExplodeLink = explode('?',$linksInArray[$Counter]);
    $Link = $ExplodeLink[0];
    }
    preg_match('/'.$DomainName.'/',$Link,$Check);
    if($Check == NULL)
    {
    preg_match('/http:\/\//',$Link,$ExternalLinkCheck);
    if($ExternalLinkCheck == NULL)
    {
    $InternalDomainsInArray[$InternalLinkCount] = $Link;
    $InternalLinkCount++;
    }
    else
    {
    $ExternalDomainsInArray[$ExternalLinkCount] = $Link;
    $ExternalLinkCount++;
    }

    }
    else
    {
    $InternalDomainsInArray[$InternalLinkCount] = $Link;
    $InternalLinkCount++;
    }
    }

    $LinksResultsInArray = array(
    'ExternalLinks'=>$ExternalDomainsInArray,
    'InternalLinks'=>$InternalDomainsInArray
    );
    return $LinksResultsInArray;
    }

Question by:EffinGood
8 Comments
 
LVL 34

Assisted Solution

by:Dan Craciun
Dan Craciun earned 150 total points
ID: 39901831
if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#")
    continue;

simply means:
- if it's not a link (empty string) or
- if it's a dummy link (#)
skip the rest of the current loop iteration, as the link doesn't need to be parsed.

The rest will be explained by someone else :)

HTH,
Dan
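
To see the effect of `continue` in isolation, here is a minimal sketch. The `$links` array is made-up sample data standing in for `$linksInArray`, and `foreach` is used instead of the counter loop purely for brevity.

```php
<?php
// Hypothetical sample data standing in for $linksInArray.
$links = array('', '#', '/about', 'http://example.com/');

$kept = array();
foreach ($links as $link) {
    // Empty href or bare "#": jump straight to the next iteration.
    if ($link == "" || $link == "#") {
        continue;
    }
    // Only reached for links that survived the check above.
    $kept[] = $link;
}

print_r($kept);
```

Only `/about` and `http://example.com/` end up in `$kept`; the first two entries never reach the line after the `if`.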
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 39901833
You might want to consider applying a coding standard to the script.  If you just indent the control structures in a sensible manner, a lot of the logic will be visible.
 

Author Comment

by:EffinGood
ID: 39901835
Hi Dan,

Thanks man, that # was throwing me off. Couldn't figure that one out! It's looking for an anchor. Check.
 
LVL 83

Expert Comment

by:Dave Baldwin
ID: 39901848
To make it easy on myself, especially my future self that may have to change the code, I usually use more parentheses to group the statements to make it clearer what I think I'm doing.
if(($linksInArray[$Counter] == "") || ($linksInArray[$Counter] == "#"))
    continue;
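
The extra parentheses are for readability only: in PHP `==` binds more tightly than `||`, so both spellings evaluate the same way. A small sketch to check that, using made-up sample values where `$link` stands in for `$linksInArray[$Counter]`:

```php
<?php
// Hypothetical sample values; $link stands in for $linksInArray[$Counter].
$agree = array();
foreach (array('', '#', '/contact') as $link) {
    $plain   = ($link == "" || $link == "#");
    $grouped = (($link == "") || ($link == "#"));
    // == binds more tightly than ||, so both spellings give the same result.
    $agree[] = ($plain === $grouped);
}
var_dump($agree);
```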

 
LVL 109

Accepted Solution

by:Ray Paseur
Ray Paseur earned 350 total points
ID: 39901865
Annotated with comments.  You can use var_dump() to print out the data so you can see what the code is creating.

<?php // demo/effingood.php
error_reporting(E_ALL);


// SEE http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28379282.html


function get_a_href($url)
{
    // SANITIZE THE $url ARGUMENT
    $url = htmlentities(strip_tags($url));

    // BREAK THE ARGUMENT STRING APART ON THE DIRECTORY-SEPARATOR SLASH
    $ExplodeUrlInArray = explode('/',$url);

    // GET SOME PARTS OF THE EXPLODED STRING
    $SubDomainName = $ExplodeUrlInArray[1];
    $DomainName = $ExplodeUrlInArray[2];

    // READ THE CONTENTS OF THE URL RESOURCE, BUT SUPPRESS ERROR MESSAGES
    $file = @file_get_contents($url);

    // FIND THE LINKS IN THE DOCUMENT WITH REGEX GROUPING
    $h1count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i',$file,$patterns);

    // THE LINKS ARE HERE
    $linksInArray = $patterns[2];

    // THE NUMBER OF LINKS IS HERE
    $CountOfLinks = count($linksInArray);

    // SET THESE VARIABLES TO ZERO
    $InternalLinkCount = 0;
    $ExternalLinkCount = 0;

    // ITERATE THROUGH THE LINKS (THIS SHOULD BE FOREACH() INSTEAD OF FOR)
    for($Counter=0;$Counter<$CountOfLinks;$Counter++)
    {
        // IF THE ELEMENT IS EMPTY OR # SKIP IT
        if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#") continue;

        // LOOK FOR JAVASCRIPT LINKS
        preg_match('/javascript:/', $linksInArray[$Counter],$CheckJavascriptLink);

        // SKIP JAVASCRIPT LINKS
        if($CheckJavascriptLink != NULL) continue;

        // COPY THIS LINK DATA TO ANOTHER VARIABLE
        $Link = $linksInArray[$Counter];

        // TRY TO MATCH THE QUESTION MARK
        preg_match('/\?/', $linksInArray[$Counter],$CheckForArgumentsInUrl);

        // IF THERE IS A MATCH ON THE QUESTION MARK
        if($CheckForArgumentsInUrl != NULL)
        {
            // FIND THE LINK WITHOUT THE REQUEST ARGUMENTS
            $ExplodeLink = explode('?',$linksInArray[$Counter]);
            $Link = $ExplodeLink[0];
        }

        // DETERMINE WHETHER THIS IS AN INTERNAL PAGE LINK OR AN EXTERNAL PAGE LINK
        preg_match('/'.$DomainName.'/',$Link,$Check);
        if($Check == NULL)
        {
            preg_match('/http:\/\//',$Link,$ExternalLinkCheck);
            if($ExternalLinkCheck == NULL)
            {
                $InternalDomainsInArray[$InternalLinkCount] = $Link;
                $InternalLinkCount++;
            }
            else
            {
                $ExternalDomainsInArray[$ExternalLinkCount] = $Link;
                $ExternalLinkCount++;
            }
        }
        else
        {
            $InternalDomainsInArray[$InternalLinkCount] = $Link;
            $InternalLinkCount++;
        }
    }

    // SET UP AN ARRAY OF ARRAYS - GIVING THE EXTERNAL AND INTERNAL LINKS
    $LinksResultsInArray = array(
    'ExternalLinks'=>$ExternalDomainsInArray,
    'InternalLinks'=>$InternalDomainsInArray
    );

    // RETURN THE MULTI-DIMENSIONAL ARRAY
    return $LinksResultsInArray;
}
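
Following the var_dump() suggestion, here is a minimal sketch that prints the intermediate values without fetching anything over the network. The URL and the HTML string are invented for illustration; only the `explode()` call and the regular expression are taken from the function above.

```php
<?php
// Hypothetical inputs: example.com and the HTML below are made up for illustration.
$url  = 'http://www.example.com/index.html';
$html = '<a href="/about?x=1">About</a> <a href=\'#\'>Top</a> '
      . '<a href="http://other.example.org/">Other</a>';

// explode() on "/" yields: "http:", "", "www.example.com", "index.html".
$parts = explode('/', $url);
var_dump($parts[1]); // the empty string between the two slashes
var_dump($parts[2]); // the host name

// Same pattern as in the function; $patterns[2] holds the href values.
preg_match_all('/(href=["|\'])(.*?)(["|\'])/i', $html, $patterns);
var_dump($patterns[2]);
```

For a URL of this shape, `$SubDomainName` (element 1) is an empty string and `$DomainName` (element 2) is the full host name, so the variable names are a little misleading.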

 
LVL 34

Assisted Solution

by:Dan Craciun
Dan Craciun earned 150 total points
ID: 39901866
Here's how I understood the code:
function get_a_href($url){
    $url = htmlentities(strip_tags($url));
    $ExplodeUrlInArray = explode('/',$url);
    $SubDomainName = $ExplodeUrlInArray[1];
    $DomainName = $ExplodeUrlInArray[2];
    $file = @file_get_contents($url);
    $h1count = preg_match_all('/(href=["|\'])(.*?)(["|\'])/i',$file,$patterns);
    $linksInArray = $patterns[2];
    $CountOfLinks = count($linksInArray);
    $InternalLinkCount = 0;
    $ExternalLinkCount = 0;
    for($Counter=0;$Counter<$CountOfLinks;$Counter++)
	{
		if($linksInArray[$Counter] == "" || $linksInArray[$Counter] == "#") continue; //ignore null or anchor # links
		preg_match('/javascript:/', $linksInArray[$Counter],$CheckJavascriptLink);  // check if link contains js
		if($CheckJavascriptLink != NULL) continue; // ignore js links
		$Link = $linksInArray[$Counter];
		preg_match('/\?/', $linksInArray[$Counter],$CheckForArgumentsInUrl);  // check for ? - are there arguments
		if($CheckForArgumentsInUrl != NULL) {  // if there are arguments in link
			$ExplodeLink = explode('?',$linksInArray[$Counter]);
			$Link = $ExplodeLink[0];  // set $link as the part before ?
		}
		preg_match('/'.$DomainName.'/',$Link,$Check);  //check if it's an internal link - contains the internal domain
		if($Check == NULL) {  // if it does not contain the internal domain
			preg_match('/http:\/\//',$Link,$ExternalLinkCheck); // check if it's an absolute link - contains http
			if($ExternalLinkCheck == NULL) {  // if it does not contain http it's a relative link, so it's internal
				$InternalDomainsInArray[$InternalLinkCount] = $Link;
				$InternalLinkCount++;
			} else {  // it's an external link
				$ExternalDomainsInArray[$ExternalLinkCount] = $Link;
				$ExternalLinkCount++;
			}
		} else {  // it's an internal link
			$InternalDomainsInArray[$InternalLinkCount] = $Link;
			$InternalLinkCount++;
		}
    }

    $LinksResultsInArray = array(
    'ExternalLinks'=>$ExternalDomainsInArray,
    'InternalLinks'=>$InternalDomainsInArray
    );
    return $LinksResultsInArray;
}
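
With the logic spelled out, the same classification can be written more compactly. The following is only a sketch under stated assumptions, not the original author's code: `classify_links()` is a hypothetical helper that takes the page HTML as a string (so it runs without network access), uses `foreach` as suggested earlier in the thread, initialises both result arrays, escapes the domain with `preg_quote()`, and, unlike the original, also treats `https://` links as absolute.

```php
<?php
// Sketch only: classify_links() is a hypothetical helper, not part of the original code.
function classify_links($html, $domainName)
{
    // Initialise both lists so the return value is defined even when no links are found.
    $result = array('ExternalLinks' => array(), 'InternalLinks' => array());

    // Capture the href value between single or double quotes.
    preg_match_all('/href=["\'](.*?)["\']/i', $html, $patterns);

    foreach ($patterns[1] as $link) {
        // Skip empty hrefs, bare anchors and javascript: pseudo-links.
        if ($link == "" || $link == "#" || strpos($link, 'javascript:') !== false) {
            continue;
        }

        // Drop the query string, if any.
        $parts = explode('?', $link);
        $link  = $parts[0];

        // Internal if it names our domain or is not an absolute http(s) URL.
        $hasDomain  = preg_match('/' . preg_quote($domainName, '/') . '/', $link);
        $isAbsolute = preg_match('/https?:\/\//', $link); // assumption: https counts too

        if ($hasDomain || !$isAbsolute) {
            $result['InternalLinks'][] = $link;
        } else {
            $result['ExternalLinks'][] = $link;
        }
    }
    return $result;
}

// Made-up sample page for illustration.
$html = '<a href="/about?x=1">A</a> <a href="#">B</a> '
      . '<a href="javascript:void(0)">C</a> '
      . '<a href="http://www.example.com/contact">D</a> '
      . '<a href="https://other.example.org/">E</a>';

print_r(classify_links($html, 'www.example.com'));
```

On this sample, `/about` and `http://www.example.com/contact` are reported as internal and `https://other.example.org/` as external; the `#` and `javascript:` links are skipped.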

 

Author Closing Comment

by:EffinGood
ID: 39901888
Wow, thank you gentlemen. I wasn't 100% sure on how to break up points on your deeeelish answers. You make a lady feel special. Thank you!
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 39902150
Glad we were able to help!  Thanks for the points and thanks for using EE, ~Ray
