hiddenpearls asked:

Get domain name from URL in PHP

Hi,
I'm trying to write code that extracts the domain name from a list of URLs. The URLs are stored in a txt file. The following code works for most of the URLs, but not for ones with a multi-part suffix such as .co.uk.

<?php
if(isset($_POST['submit']))
{
	$lines = file($_FILES['domainUploadFile']['tmp_name']);
	
	foreach ($lines as $line_num => $url) {
		preg_match('@^(?:http://)?([^/]+)@i',
		$url, $matches);
		$host = $matches[1];
	
		// get last two segments of host name
		preg_match('/[^.]+\.[^.]+$/', $host, $matches);
		echo "domain name is: {$matches[0]}\n"."<br />\n";
	   //echo "Line #<b>{$line_num}</b> : " . htmlspecialchars($line) . "<br />\n";
	}
}
?>


This is the input:

adnan.com/?att=1&att=2&att3
abc.net
www.giwww.com
http://sites.google.com/
http://www.banksy.co.uk/
http://en.wikipedia.org/wiki/Site

And this is the output:

domain name is: adnan.com
domain name is: abc.net
domain name is: giwww.com
domain name is: google.com
domain name is: co.uk
domain name is: wikipedia.org
hernst42

Or simply use this in your loop:
echo str_replace('www.', '', parse_url($url, PHP_URL_HOST));
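
One caveat with str_replace here: it removes every occurrence of 'www.', not only a leading one, so a host such as www.giwww.com from the sample input would lose more than intended. A minimal sketch of an anchored alternative, assuming the URL already carries a scheme (the helper name strip_www is made up for illustration):

```php
<?php
// Hypothetical helper: drop only a leading "www." label from a host name.
function strip_www(string $host): string
{
    // The ^ anchor limits the match to the start of the string.
    return preg_replace('/^www\./i', '', $host);
}

$host = parse_url('http://www.giwww.com/', PHP_URL_HOST); // "www.giwww.com"
echo str_replace('www.', '', $host), "\n"; // prints "gicom"
echo strip_www($host), "\n";               // prints "giwww.com"
```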
hiddenpearls (ASKER)

PHP_URL_HOST doesn't work when the URL has no scheme, e.g. adnan.com/?att=1&att=2&att3 (that is, without http://).
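
The reason is that parse_url() only reports a host when the string has a scheme (or starts with //); otherwise adnan.com/ is treated as part of the path. A possible workaround, sketched under the assumption that every line in the file is meant to be an HTTP URL, is to prepend a scheme when none is present (host_of is a hypothetical helper name):

```php
<?php
// Hypothetical helper: return the host part of a URL that may lack a scheme.
function host_of(string $url): ?string
{
    $url = trim($url); // file() keeps the trailing newline on each line
    // Prepend a scheme if the string does not already start with one.
    if (!preg_match('@^[a-z][a-z0-9+.-]*://@i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    return is_string($host) ? $host : null;
}

echo host_of("adnan.com/?att=1&att=2&att3\n"), "\n"; // adnan.com
echo host_of('http://www.banksy.co.uk/'), "\n";      // www.banksy.co.uk
```

This only recovers the full host name; it does not by itself solve the .co.uk problem.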
kaufmed
This is extremely difficult to do with regex alone, as you have already seen. Host names can have any number of parts, and TLDs can be anywhere from 2 to 6 (maybe more) characters long and can themselves consist of multiple parts, as in .co.uk.

Is there any way to sort the URLs into general categories according to how they are constructed? For example, given your sample input, we could say you have 2 categories:

[server].[domain].[tld]
[server].[domain].[co].[country_code]
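
One way to act on this categorization is to keep a list of multi-part suffixes and take one extra label whenever the host ends in one of them. The sketch below hard-codes a tiny, incomplete list purely for illustration; a real solution would more likely load a maintained source such as the Public Suffix List. The function name registrable_domain and the $suffixes array are assumptions, not part of the original code.

```php
<?php
// Illustrative only: a tiny, incomplete list of two-label suffixes.
$suffixes = ['co.uk', 'org.uk', 'com.au', 'co.nz'];

// Hypothetical helper: return the domain plus its suffix, e.g. "banksy.co.uk".
function registrable_domain(string $host, array $suffixes): string
{
    $labels = explode('.', strtolower($host));
    $count = count($labels);
    // Default: keep the last two labels ("google.com").
    $keep = 2;
    // If the last two labels form a known suffix, keep three ("banksy.co.uk").
    if ($count >= 3 && in_array(implode('.', array_slice($labels, -2)), $suffixes, true)) {
        $keep = 3;
    }
    return implode('.', array_slice($labels, -min($keep, $count)));
}

echo registrable_domain('www.banksy.co.uk', $suffixes), "\n"; // banksy.co.uk
echo registrable_domain('sites.google.com', $suffixes), "\n"; // google.com
```

The host passed in is assumed to be already extracted from the URL, with any path and query string removed.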
SOLUTION
Marco Gasi
[The text of this solution is not included here.]

ASKER CERTIFIED SOLUTION
[The text of this solution is not included here.]

Can you give some feedback, please?
Looking at http://en.wikipedia.org/wiki/Site, which you want to turn into wikipedia.org, I wonder why you want to discard the "en" part of the name. The subdomain is fairly important, as is "banksy" in http://www.banksy.co.uk/.

What is the desired output from the examples?