TheMaximumWeasel

Get Domain from Hostname Easy 500 points urgent

I have a hostname (will not have http://)

Hostname: 98-465-47-54.AUSTIN.ISP.example.com
then Domain would be example.com

or

Hostname: sgd.example.com/sfgsdfg/sdf
then Domain would be example.com

How would I do that? The hostname can be pretty much anything; it could also be just example.com, in which case the domain would be example.com.

Max
ASKER CERTIFIED SOLUTION
tdterry
This solution is only available to members.
For URL parsing there are a lot of examples and standard functions in PHP.
The easiest way in your situation is to use the manual's example:

Example 3. Getting the domain name out of a URL
<?php
// get host name from URL
preg_match("/^(http:\/\/)?([^\/]+)/i", "http://www.php.net/index.html", $matches);
$host = $matches[2];

// get last two segments of host name
preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
echo "domain name is: {$matches[0]}\n";
?>
This example will produce:
domain name is: php.net
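A minimal sketch of how the manual's two regexes could be applied to the inputs from the question, which have no "http://" and may carry a path. The function name lastTwoLabels() is hypothetical and not part of the manual's example.

```php
<?php
// Hypothetical helper: returns the last two dot-separated labels of a host.
// Accepts input with or without a leading "http://" and with an optional path.
function lastTwoLabels(string $input): ?string
{
    // Step 1 (as in the manual's example): cut the host out of the input.
    if (!preg_match('/^(http:\/\/)?([^\/]+)/i', $input, $m)) {
        return null; // empty input or input starting with "/"
    }
    // Step 2 (as in the manual's example): keep the last two labels of the host.
    if (!preg_match('/[^.\/]+\.[^.\/]+$/', $m[2], $d)) {
        return null; // host contains no dot, e.g. "localhost"
    }
    return $d[0];
}

echo lastTwoLabels('98-465-47-54.AUSTIN.ISP.example.com'), "\n"; // example.com
echo lastTwoLabels('sgd.example.com/sfgsdfg/sdf'), "\n";         // example.com
echo lastTwoLabels('example.com'), "\n";                         // example.com
```

Like the manual's example, this always keeps exactly two labels, so a host such as www.bbc.co.uk comes back as co.uk.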
spotx

as the domain name for
98-465-47-54.AUSTIN.ISP.example.com
is actually
AUSTIN.ISP.example.com
not
example.com
that is the domain name of
ISP.example.com
as the domain name of
example.com
is
com

the only sure way to get exactly what you seem to be after
is to check that last zone
if it is com net biz... etc
then the info that you are after will be in the last two
otherwise
it is more likely to be in the last three

strip out the domain part as tdterry said above
<?php
$first_slash = strpos($url, '/');
$hostname = ($first_slash !== false) ? substr($url, 0, $first_slash) : $url;
?>
then this will work for most
<?php
$root = substr($hostname,strrpos($hostname,'.')+1);
$US_root = array('com','edu','gov','biz'); //can't recall them all need to add all the US domains
$domain = explode('.',$hostname); // explode() instead of split(), which was removed in PHP 7
$c = count($domain);
if(in_array($root,$US_root) || $c < 3)
print implode('.',array_slice($domain,-2)); // last two labels (or fewer if the host is shorter)
else
print $domain[$c-3].'.'.$domain[$c-2].'.'.$domain[$c-1];
?>

 
Dear spotx,

Your knowledge of what a domain is is great, but TheMaximumWeasel explained that he needs to get:
---
Hostname: 98-465-47-54.AUSTIN.ISP.example.com
then Domain would be example.com
or
Hostname: sgd.example.com/sfgsdfg/sdf
then Domain would be example.com
---
So he needs to get the last two segments of the hostname.

Anyway, even if you need the domain name as you described it, it can still be done much more easily...
Let's extend the manual's example to do it:

<?php
preg_match("/^(http:\/\/)?([^\/]+)/i", "http://www.php.net/index.html", $matches);
$host = $matches[2];
$domain = preg_replace("/^[^\.\/]+\./", "", $host);
?>

It's only three lines...
As Russians say, "Don't reinvent the bicycle..."
Just take a quick look at the manual, think a little, and you'll find a simple solution...

PS Have you tested your code? Try testing it... Hint: in_array() searches arrays, not strings...
Dear ixti
Yeah, I did test my code. You need to feed it a URL.
if(in_array($root,$US_root))  //$US_root is an array as set in a previous line
try putting
$url = '98-465-47-54.AUSTIN.ISP.example.com';
as the first line
then try
$url = '98-465-47-54.AUSTIN.ISP.example.com.au';

Anyway this was not my point
as the world is bigger than just the US
to get
example.com.au
or
example.co.nz
example.co.uk
etc.etc.etc

2 zones will not cut it (thus my long-winded explanation)
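A minimal sketch of the lookup-table idea, assuming a hand-maintained list of suffixes. The constant KNOWN_SUFFIXES and the function registrableDomain() are hypothetical names, and the list is deliberately incomplete: it only holds suffixes mentioned in this thread plus "net".

```php
<?php
// Hypothetical, incomplete suffix table. A real one would need every suffix
// you care about; these entries are only illustrative.
const KNOWN_SUFFIXES = ['com', 'edu', 'gov', 'biz', 'net', 'com.au', 'co.nz', 'co.uk'];

// Returns the longest known suffix plus one more label, or null if nothing fits.
// Note: the result is lowercased.
function registrableDomain(string $hostname, array $suffixes = KNOWN_SUFFIXES): ?string
{
    $labels = explode('.', strtolower($hostname));
    $count  = count($labels);

    // Try the longest candidate suffix first so "co.uk" is found before "uk".
    for ($start = 1; $start < $count; $start++) {
        $suffix = implode('.', array_slice($labels, $start));
        if (in_array($suffix, $suffixes, true)) {
            return $labels[$start - 1] . '.' . $suffix;
        }
    }
    return null; // no known suffix, or nothing in front of it
}

echo registrableDomain('98-465-47-54.AUSTIN.ISP.example.com'), "\n";    // example.com
echo registrableDomain('98-465-47-54.AUSTIN.ISP.example.com.au'), "\n"; // example.com.au
echo registrableDomain('www.bbc.co.uk'), "\n";                          // bbc.co.uk
```

Matching whole suffixes rather than counting labels avoids having to guess between "last two" and "last three", at the cost of keeping the table up to date.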

Dear spotx
Sorry, I was inattentive.
Here's my take on it:

$url = parse_url('http://my.server.co.uk/path/to/file.html');
$host = array_reverse(explode('.',$url['host']));
$domain = $host[2].'.'.$host[1].'.'.$host[0];

That will yield $domain === 'server.co.uk' although, as spotx pointed out, it's up to you to determine how many pieces you consider to be 'the domain'.  You should also guard against short host names by checking isset() on $host[] elements before you try to reference them.
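A minimal sketch of the guard against short host names. It swaps the isset() checks for array_slice(), which simply returns fewer labels when the host is short. The function name lastLabels() is hypothetical, and $parts is assumed to be at least 1.

```php
<?php
// Hypothetical helper: joins the last $parts labels of the URL's host, or all
// of them when the host has fewer, instead of reading undefined array offsets.
function lastLabels(string $url, int $parts = 3): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // parse_url() found no host component
    }
    $labels = explode('.', $host);
    // A negative offset larger than the array just starts at the beginning.
    return implode('.', array_slice($labels, -$parts));
}

echo lastLabels('http://my.server.co.uk/path/to/file.html'), "\n"; // server.co.uk
echo lastLabels('http://cnn.com/'), "\n";                          // cnn.com
```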
WilliamFrantz, and what will happen if $url = "http://web.server.com/path/to/file.html" ?
<?php

function rootHost($text)
{
        $url = parse_url($text);
        $hostArray = explode('.', $url['host']);
        $host = array_pop($hostArray);
        while (count($hostArray) && gethostbyname($host) === $host)
                $host = array_pop($hostArray).'.'.$host;
        return($host);
}

echo '<p>'.rootHost('http://www.bbc.co.uk/');
echo '<p>'.rootHost('http://www.cnn.com/');
?>

That will get you 'bbc.co.uk' and 'cnn.com'.  Maybe that's what you were looking for.

Note that if the domain doesn't resolve to anything then you'll just get it back.  For example, rootHost('http://www.sub.foobar.edu') will return www.sub.foobar.edu which does not exist.
Well, guys, you're reinventing the bicycle...

First, WilliamFrantz, please look carefully at the first post. Its first line is: "I have a hostname (will not have http://)", so parse_url() will not work as is. The solution in this case is to check whether http:// is present at the beginning of the string and, if not, add it...
The manual gives the easiest way to solve this problem, so I really don't understand you...
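The "add http:// if it is missing" step could look roughly like this. It is only a sketch: the function name hostFromInput() is hypothetical, and it assumes any input without a "scheme://" prefix is a bare host, optionally followed by a path.

```php
<?php
// Hypothetical helper: extracts the host from input that may or may not
// start with a scheme such as "http://".
function hostFromInput(string $input): ?string
{
    // Prepend a scheme only when none is present, so parse_url() sees a full URL.
    if (!preg_match('/^[a-z][a-z0-9+.\-]*:\/\//i', $input)) {
        $input = 'http://' . $input;
    }
    $host = parse_url($input, PHP_URL_HOST);
    return is_string($host) ? $host : null;
}

echo hostFromInput('sgd.example.com/sfgsdfg/sdf'), "\n";   // sgd.example.com
echo hostFromInput('http://www.php.net/index.html'), "\n"; // www.php.net
```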

Maybe you are all afraid of regular expressions? Don't be afraid of them.
I guess the only way is to test the speed of all the solutions. The speed tester will look like this:

<?php
for ($i = 0; $i < 3; $i++)
{
    $test['start']      = microtime(true);
    /* Calling function */
    $test['stop']       = microtime(true);
    $test['result'][]   = $test['stop'] - $test['start'];
}
echo ($test['result'][0] * $test['result'][1] * $test['result'][2]) / 3;
?>
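Note that the tester above multiplies the three timings before dividing by 3, so the printed figure is not a mean. A minimal sketch of a tester that averages instead follows; the function name meanRuntime() is hypothetical and $runs is assumed to be at least 1.

```php
<?php
// Hypothetical helper: calls $fn $runs times and returns the mean wall-clock
// time per call, in seconds.
function meanRuntime(callable $fn, int $runs = 3): float
{
    $results = [];
    for ($i = 0; $i < $runs; $i++) {
        $start = microtime(true);
        $fn();
        $results[] = microtime(true) - $start;
    }
    // Sum the timings, then divide by the number of runs.
    return array_sum($results) / count($results);
}

// Usage: a trivial callable stands in for the function under test.
echo meanRuntime(function () { return strlen('www.bbc.co.uk'); }), "\n";
```

Three runs of a microsecond-scale function are dominated by timer noise, so a much larger $runs would likely be needed for a meaningful comparison.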

----[ TESTS ]------------------------------------------------------------------

    +--> Step 1. Testing WilliamFrantz's function:
        <?php
        function WilliamFrantz($text)
        {
            $url = parse_url($text);
            $hostArray = explode('.', $url['host']);
            $host = array_pop($hostArray);
            while (count($hostArray) && gethostbyname($host) === $host)
                $host = array_pop($hostArray).'.'.$host;
            return($host);
        }
        echo "SpeedTime of WilliamFrantz's function is: ";
        for ($i = 0; $i < 3; $i++)
        {
            $test['start']      = microtime(true);
            WilliamFrantz('http://www.bbc.co.uk');
            $test['stop']       = microtime(true);
            $test['result'][]   = $test['stop'] - $test['start'];
        }
        echo ($test['result'][0] * $test['result'][1] * $test['result'][2]) / 3;
        echo "<br />\n";
        ?>
       
        The result is: "SpeedTime of WilliamFrantz's function is: 30.300546278358"

    +--> Step 2. Testing spotx's function:
        <?php
        function spotx($url)
        {
            $first_slash = strpos($url, '/');
            $hostname = ($first_slash !== false) ? substr($url, 0, $first_slash) : $url;
            $root = substr($hostname,strrpos($hostname,'.')+1);
            $US_root = array('com','edu','gov','biz'); //can't recall them all need to add all the US domains
            $domain = explode('.',$hostname);
            $c = count($domain);
            if(in_array($root,$US_root))
            return $domain[$c-2].'.'.$domain[$c-1];
            else
            return $domain[$c-3].'.'.$domain[$c-2].'.'.$domain[$c-1];
        }
        echo "SpeedTime of spotx's function is: ";
        for ($i = 0; $i < 3; $i++)
        {
            $test['start']      = microtime(true);
            spotx('http://www.bbc.co.uk');
            $test['stop']       = microtime(true);
            $test['result'][]   = $test['stop'] - $test['start'];
        }
        echo ($test['result'][0] * $test['result'][1] * $test['result'][2]) / 3;
        echo "<br />\n";
        ?>
       
        The result contains errors. It looks like this:
                SpeedTime of spotx's function is:
                Notice: Undefined offset: -2 in C:\www\speedtester.php on line 13
                Notice: Undefined offset: -1 in C:\www\speedtester.php on line 13
                Notice: Undefined offset: -2 in C:\www\speedtester.php on line 13
                Notice: Undefined offset: -1 in C:\www\speedtester.php on line 13
                Notice: Undefined offset: -2 in C:\www\speedtester.php on line 13
                Notice: Undefined offset: -1 in C:\www\speedtester.php on line 13
                0.00013702416356664
        The speed is great... Awfully great... Majestic...

    +--> Step 3. And finally, testing the manual's solution, slightly modified by ixti:
        <?php
        function ixti($url)
        {
            preg_match("/^(http:\/\/)?([^\/]+)/i", $url, $matches);
            $host = $matches[2];
            return preg_replace("/^[^\.\/]+\./", "", $host);
        }
        echo "SpeedTime of Manual's function little modified by ixti is: ";
        for ($i = 0; $i < 3; $i++)
        {
            $test['start']      = microtime(true);
            ixti('http://www.bbc.com');
            $test['stop']       = microtime(true);
            $test['result'][]   = $test['stop'] - $test['start'];
        }
        echo ($test['result'][0] * $test['result'][1] * $test['result'][2]) / 3;
        echo "<br />\n";
        ?>
        The result is: "SpeedTime of Manual's function little modified by ixti is: 4.626226973118E-014"

----[ TESTS :: RESULTS ]------------------------------------------------------------

WilliamFrantz's function: Line count - 6, Word count - 23, Byte count - 221
spotx's function: Line count - 10, Word count - 64, Byte count - 455
Manual's solution little modified by ixti: Line count - 3, Word count - 17, Byte count - 127

I don't want to say anything more; just test them yourself...
Or invent a new bicycle; I guess you love doing that...
Oh! Sorry! I forgot...
spotx's solution fails because it expects input without "http://". So if we give the function a URL like "www.bbc.com", it will work.
So let's retest his function:
        <?php
        function spotx($url)
        {
            $first_slash = strpos($url, '/');
            $hostname = ($first_slash !== false) ? substr($url, 0, $first_slash) : $url;
            $root = substr($hostname,strrpos($hostname,'.')+1);
            $US_root = array('com','edu','gov','biz'); //can't recall them all need to add all the US domains
            $domain = explode('.',$hostname);
            $c = count($domain);
            if(in_array($root,$US_root))
            return $domain[$c-2].'.'.$domain[$c-1];
            else
            return $domain[$c-3].'.'.$domain[$c-2].'.'.$domain[$c-1];
        }
        echo "SpeedTime of spotx's function is: ";
        for ($i = 0; $i < 3; $i++)
        {
            $test['start']      = microtime(true);
            spotx('www.bbc.com');
            $test['stop']       = microtime(true);
            $test['result'][]   = $test['stop'] - $test['start'];
        }
        echo ($test['result'][0] * $test['result'][1] * $test['result'][2]) / 3;
        echo "<br />\n";
        ?>
The result is "SpeedTime of spotx's function is: 4.7026365729748E-014"

So spotx's function is really fast...

PS Times may differ, so the times for spotx's function and the manual's function may even be 1.6971694232273E-014.
PPS Both of these functions are fast (honestly, the manual's solution is a little faster), and which of them to use is your decision.

Yours, ixti.
well that all was fun
Love spotx
Come now ixti, you are comparing apples/oranges.  Furthermore IMHO, using a home brew regex to solve this problem is re-inventing the wheel since parse_url() already exists.

> Hostname: 98-465-47-54.AUSTIN.ISP.example.com
> then Domain would be example.com

...and ixti() returns "AUSTIN.ISP.example.com".  I'm afraid you failed the first test case, ixti.  Heck, anybody can make a super-fast, super-short function that returns the wrong result.  :)

The reason my function takes so long is because it actually performs several domain name look-ups to find the correct answer.  In fact, I have given the only complete solution that would actually work.  The next closest solution was from spotx who suggested using a lookup table of known TLDs.  That's a good idea too, BTW.

The author's test case is actually not a good example since there is no real server at 'example.com'.  Try these real world examples instead:

1. http://cnn.com/si
2. http://www.cnn.com
3. http://bbc.co.uk/tv/
4. http://www.foxsports.news.com.au/story/0,8659,18794676-32463,00.html

The ixti() function returns:

1. com
2. cnn.com
3. co.uk
4. foxsports.news.com.au

Based on my understanding of the requirements, 3 of those are wrong.  Now run them through the WilliamFrantz() function:

1. cnn.com
2. cnn.com
3. bbc.co.uk
4. news.com.au

Ah, I believe those are the answers the author was looking for.
Dear WilliamFrantz,

It's not difficult to modify my variant to cover the "cnn.com" case. If you can't see how, I can post a modification...
If I check more than 5 URLs with your function, execution will be aborted. Guess why?
And what if I don't need to check whether the domain exists?..

Anyway, please don't be lazy: re-read all the posts in this question from first to last.
Read carefully, and maybe then you'll find that:

1. I wrote that you can use parse_url(), but first you need to validate the URL with regular expressions. I am not "comparing apples/oranges". Read the PHP manual on parse_url(). There you'll find that "This function is not meant to validate the given URL, it only breaks it up into the above listed parts."
parse_url() is a great function when you are sure the URL will be valid, but if it will NOT HAVE an HTTP:// prefix, you need to check for its existence and add it if it is not present! So which is more correct IN THIS CASE: "home brew" regular expressions together with parse_url(), or regular expressions alone?
And of course, maybe you'll ask the PHP team to write every function you need, to save everyone from writing "home brew" code?

2. Read VERY CAREFULLY my post "Comment from ixti Date: 04/10/2006 08:57AM GMT+04:00"; there you'll find what you are talking about. Then read my discussion with spotx about what this function must return!
Talk is cheap ixti.  The fact is, my function works and yours doesn't.  Show me how you'd rework your algorithm to cover the 4 examples I gave.  You can't do it without using a function like gethostbyname().  Your routine just blindly removes the first label of the host name and it doesn't even do that particularly efficiently.  You should have done it like this:

function chopFirstLabel($url)
{
    preg_match("/[a-z0-9\-]+\.([a-z0-9\-\.]+)/i", $url, $matches);
    return $matches[1];
}

That returns exactly the same results as your function.  It's shorter, faster, and it's still wrong.

The parse_url() function isn't the point of this exercise, however, if you really want to get rid of it:

function WilliamFrantz2($url)
{
    preg_match("/[a-z0-9\-]+\.[a-z0-9\-\.]+/i", $url, $matches);
    $hostArray = explode('.', $matches[0]);
    $host = array_pop($hostArray);
    while (count($hostArray) && gethostbyname($host) === $host)
         $host = array_pop($hostArray).'.'.$host;
    return($host);
}

P.S. Whether or not this function causes your script to timeout depends on the speed of your DNS requests and your setting for max_execution_time.  You can't simply say that 5 URLs will cause the script to abort.
Another reason to use parse_url() instead of making your own regex is to cover cases you might not have considered.  For example, WilliamFrantz2('http://localhost') will fail.  Likewise, ixti('http://www.cnn.com:80/') and ixti('ftp://user@cnn.com/') will both fail.  The original WilliamFrantz() function would correctly parse all those cases because of parse_url().

Sure you could fix the regex to handle those, but the point is you may not have thought of those cases in the first place.  It's a latent defect.  The parse_url() function is better tested.  It would be wise to take advantage of that.
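As a quick illustration, here are the three inputs named above passed through parse_url(). This is only a sketch of what the built-in returns for the host component.

```php
<?php
// The three inputs named above, passed through parse_url().
$inputs = ['http://localhost', 'http://www.cnn.com:80/', 'ftp://user@cnn.com/'];
foreach ($inputs as $input) {
    // PHP_URL_HOST asks for the host only; port and user info are left out.
    echo $input, ' => ', parse_url($input, PHP_URL_HOST), "\n";
}
// Prints:
// http://localhost => localhost
// http://www.cnn.com:80/ => www.cnn.com
// ftp://user@cnn.com/ => cnn.com
```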
Can't you understand that the URL may not be valid, so parse_url() will fail, and in that case you need to validate the URL?
Have you read the PHP manual on parse_url()? How do you think it parses the URL?

So you want me to write a function that covers all the situations you listed? If you don't know how to rewrite that function, I can help you. Here is the solution:

<?php
function urlParser($url)
{
    preg_match("/^(.*:\/\/)?(.+@)?([^\/^:]+)/i", $url, $matches);
    return $matches[3];
}

var_dump(urlParser('http://cnn.com/si'));
var_dump(urlParser('http://www.cnn.com'));
var_dump(urlParser('http://bbc.co.uk/tv/'));
var_dump(urlParser('http://www.foxsports.news.com.au/story/0,8659,18794676-32463,00.html'));
var_dump(urlParser('http://localhost'));
var_dump(urlParser('http://www.cnn.com:80/'));
var_dump(urlParser('ftp://user@cnn.com/'));
?>

I didn't say that parse_url() is not a good function, but it needs URL validation, so in this case parse_url() is not the right solution. For what the author wanted, a regexp is the more correct solution. In the cases you have described, it's smarter to use parse_url() with validation. Everything depends on what you can have as input and what you want as output...

For example, not long ago I posted an example (https://www.experts-exchange.com/questions/21810097/Generate-an-absolute-link-from-a-URL-plus-a-link.html) where I used parse_url().


So:
1. If you are not sure what you can have as input, it's better to check the input. In this case it's easier to create a regexp to do all the dirty work.
2. If you are sure what the input will be, you can use parse_url(); otherwise you still have to test the input and do something if the test fails.
3. If you need independence, it's easier to use a regexp, or substr() + strpos().
4. parse_url(), IMHO, is great for showing in schools to demonstrate the power of PHP.
*: All solutions depend on what you want! In the author's case your solution, IMHO, is not the best... Please don't try to contest my opinion.

Anyway, it was a very interesting discussion. Thanks to everyone :))
Please forgive me if I hurt somebody's feelings. It's all because of my poor knowledge of English. :))

Sincerely yours, ixti.
Revisiting this problem is getting tedious, but here is the output of the urlParser() function for all the examples:

1. cnn.com
2. www.cnn.com
3. bbc.co.uk
4. www.foxsports.news.com.au
5. localhost
6. www.cnn.com
7. cnn.com

Three of those are still wrong.  While the urlParser() function does a good job of extracting the host name, it doesn't do what the author originally requested.  Still, it's a handy regex and I'll be making a note of it.
I just thought that was what you wanted as output :))

There was a Russian poet, Mayakovsky (who lived in the time of the USSR), who wrote a poem about jobs:
"... All the jobs are good
So choose them by your taste ..."

I guess now we can say that topic is closed :))