
Regex to get part of hostname

Hi all,

I am looking for two regular expressions, each handling a different case. The first regex should extract the string "host" from the following URLs:

http://www.host.com,
http://www.host.eu,
http://subdomain1.host.com,
ftp.host.com,
host.eu,
host.net/login.jsp,
https://host.jp,
and https://www.host.ca/?address=us

Thanks!

I'm not sure about the Java syntax, but this works in PHP:
<?php
$text= <<<X
http://www.host.com,
http://www.host.eu,
http://subdomain1.host.com,
ftp.host.com,
host.eu,
host.net/login.jsp,
https://host.jp,
https://www.host.ca/?address=us
X;

preg_match_all("@^(?:https?://)?(?:\w+\.)?(\w+)(?:\.\w+)@m",$text,$arr);

print_r($arr);


Terry WoodsIT Guru
Most Valuable Expert 2011

Commented:
Use this pattern:

^(?:(?:https?|s?ftp|mailto|gopher)://?)?(?:[a-z0-9-]+\.){0,}([a-z0-9-]+)\.(?:[a-z0-9-]+){1,}(?:.*?)

Code generated below from ddrudik's website www.myregextester.com:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1{
  public static void main(String[] asd){
  String sourcestring = "source string to match with pattern";
  // The pattern must sit on one line: a Java string literal cannot contain a raw line break.
  Pattern re = Pattern.compile("^(?:(?:https?|s?ftp|mailto|gopher)://?)?(?:[a-z0-9-]+\\.){0,}([a-z0-9-]+)\\.(?:[a-z0-9-]+){1,}(?:.*?)",Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
  Matcher m = re.matcher(sourcestring);
  int mIdx = 0;
    while (m.find()){
      for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){
        System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}


Terry WoodsIT Guru
Most Valuable Expert 2011

Commented:

Author

Commented:
Hi experts,

Thanks for such a quick response!

I used http://regexpal.com/ to test the regexes you recommended. But with this regex:

^(?:(?:https?|s?ftp|mailto|gopher)://?)?(?:[a-z0-9-]+\.){0,}([a-z0-9-]+)\.(?:[a-z0-9-]+){1,}(?:.*?)

It matches the whole "http://www.host.com", while I only want the "host" inside the URL.

For this regex:

#(?:(?:https?|s?ftp|mailto|gopher)://?)?(?:[a-z0-9-]+\.){0,}([a-z0-9-]+){1,}\.#i

It doesn't match any of the URL examples in my original post. The same is true for this regex:

@^(?:https?://)?(?:\w+\.)?(\w+)(?:\.\w+)@m

Am I missing something?
I think you entered the expression exactly as posted, delimiters and flag included.

Enter the expression as

^(?:https?://)?(?:\w+\.)?(\w+)(?:\.\w+)

and tick the "^$ match at line breaks (m)" option.

It highlights the whole domain part, but the capturing group (\w+) contains just "host".
Terry WoodsIT Guru
Most Valuable Expert 2011

Commented:
With the pattern I provided, you'll need to get the host from the first match group. With the provided code, you'd ignore the first group index and get the result from the second group index value.
IT Guru
Most Valuable Expert 2011
Commented:
Sorry - I'm confusing things! Correction:

With the pattern I provided, you'll need to get the host from the *second* match group. With the provided code, you'd ignore the first group index and get the result from the second group index value.
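
To make the group indexing concrete, here is a minimal Java sketch that applies the pattern above to the URLs from the question and keeps only capture group 1. The class name HostGroupDemo and the method extractHosts are made up for illustration; the pattern is copied unchanged from this thread. In java.util.regex, group(0) is the whole match and group(1) is the first parenthesised capture, which is the value shown at group index 1 in the generated code's output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HostGroupDemo {
    // Pattern copied from the thread; only group 1 is of interest.
    private static final Pattern HOST = Pattern.compile(
        "^(?:(?:https?|s?ftp|mailto|gopher)://?)?(?:[a-z0-9-]+\\.){0,}([a-z0-9-]+)\\.(?:[a-z0-9-]+){1,}(?:.*?)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Hypothetical helper: returns group 1 of every match, one per input line.
    static List<String> extractHosts(String text) {
        List<String> hosts = new ArrayList<>();
        Matcher m = HOST.matcher(text);
        while (m.find()) {
            hosts.add(m.group(1)); // group(0) would be the whole matched prefix
        }
        return hosts;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "http://www.host.com",
            "http://www.host.eu",
            "http://subdomain1.host.com",
            "ftp.host.com",
            "host.eu",
            "host.net/login.jsp",
            "https://host.jp",
            "https://www.host.ca/?address=us");
        System.out.println(extractHosts(text));
    }
}
```

For the eight sample URLs this should print "host" eight times. Hosts with a two-part suffix such as host.com.jp are not handled by this pattern.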
The name can also be extracted easily with a replace operation. To test:
Go to http://www.myregextester.com/index.php

Give pattern: ^(?:https?://)?(?:\w+\.)?(\w+)(?:\.\w+).*$
Select operation as  "Replace"
Select Delimiter as "@"
Select multiline option "m"
Select Show Code for Java/Javascript
Select Replace pattern and enter \1
Enter your text in Source Text
Click submit

To see my run: http://www.myregextester.com/?r=fa8e26e2
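
The replace approach above could look roughly like this in Java. This is only a sketch: the class name HostReplaceDemo and the method hostsOnly are invented for the example, and the pattern is the one given in the steps. Note that Java's replacement syntax refers to the first group as $1 rather than \1.

```java
import java.util.regex.Pattern;

class HostReplaceDemo {
    // Same pattern as in the steps above; MULTILINE makes ^ and $ work per line.
    private static final Pattern URL_LINE = Pattern.compile(
        "^(?:https?://)?(?:\\w+\\.)?(\\w+)(?:\\.\\w+).*$",
        Pattern.MULTILINE);

    // Hypothetical helper: replaces each matching line with its first capture group.
    static String hostsOnly(String text) {
        return URL_LINE.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "http://www.host.com\nhost.net/login.jsp\nhttps://www.host.ca/?address=us";
        System.out.println(hostsOnly(text));
    }
}
```

Because the pattern allows at most one label before the captured group, an input with two subdomain levels (for example a.b.host.com) would return the wrong label.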
The regexes the experts have given here are fine, but as an alternative, in case you don't need to use a regex, you could try the parse_url function.
http://php.net/manual/en/function.parse-url.php
Most Valuable Expert 2011
Top Expert 2015

Commented:
>>  you could try the parse_url function.

I don't think PHP code works under Java  ;)
Most Valuable Expert 2011
Top Expert 2015

Commented:
I think you will only get approximate answers as DOMAIN could be located in any of several positions. Any of the following are valid URLs to my knowledge:

    www.host.com
    www.subdomain1.host.com
    www.host.com.jp
    host.com

Given the above, how would you differentiate between "www.subdomain1.host.com" and "www.host.com.jp"?
>> I don't think PHP code works under Java  ;)

Crumbs, forgot this was a Java question, too many languages, so little time ;)
Although you could use a similar Java version:
http://java.sun.com/docs/books/tutorial/networking/urls/urlInfo.html
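
A rough Java sketch of that idea follows. It uses java.net.URI instead of the URL class from the linked tutorial, because URI.create does not throw a checked exception. The class name UriHostDemo, the method hostOf, and the step that prepends "http://" to inputs without a scheme are all assumptions made for this example. The parser only returns the full host name, so picking "host" out of "www.host.ca" still needs a regex or a split afterwards.

```java
import java.net.URI;

class UriHostDemo {
    // Hypothetical helper: returns the full host name, e.g. "www.host.ca".
    static String hostOf(String input) {
        // Assumption: inputs without "://" are bare host names or host/path strings,
        // so a dummy scheme is added to let URI find the authority part.
        String candidate = input.contains("://") ? input : "http://" + input;
        return URI.create(candidate).getHost();
    }

    public static void main(String[] args) {
        System.out.println(hostOf("https://www.host.ca/?address=us"));
        System.out.println(hostOf("host.net/login.jsp"));
    }
}
```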

Author

Commented:
kaufmed,

For your question above, I am only concerned about domain suffixes such as .com.jp, .com.co, .com.ag, .com.bz, .com.es, .net.bz, .net.br, and .net.in.

Is it possible to add ".com" or ".net" into the regex to match?

Thanks.

Author

Commented:
TerryAtOpus:

You provided two references. Did you mean I should use the first reference and search the second group? Ok, I will test it.

Thanks.
Most Valuable Expert 2011
Top Expert 2015

Commented:
I think you misinterpreted my question. What I am asking is: when you have, for the sake of discussion, a 4-part host, how will you know whether the second or the third index is the domain?

So for my previous question, let's take "www.subdomain1.host.com" and "www.host.com.jp". We can write a regex to extract out either the second or third index as the domain. If we write it to extract the second index, then

    www.subdomain1.host.com - returns "subdomain1"
    www.host.com.jp - returns "host"

If we write the regex to extract the third index, then

    www.subdomain1.host.com - returns "host"
    www.host.com.jp - returns "com"

In either case, there will be inconsistencies. It may be possible to craft a conditional regex, but your conditions will need to be very well-defined (e.g. for four-part hosts, if the third part is "com" or "net", take the second part as the domain).
Most Valuable Expert 2011
Top Expert 2015

Commented:
Of course the conditional concept described above would be applicable to only 4-part hosts and may break on 5+ part host names.
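
As an illustration of such a conditional rule, here is a hedged Java sketch. It assumes the rule "if the name ends in com or net followed by a two-letter code, take the label before com/net; otherwise take the label before the last one". That rule is a generalisation of the suffixes listed earlier (.com.jp, .net.br, and so on), not something confirmed in this thread, and the names DomainSuffixDemo and domainOf are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DomainSuffixDemo {
    // Optional scheme, lazily skipped subdomain labels, the captured label,
    // then either "com.xx"/"net.xx" (assumed rule) or a single final label.
    // The lookahead requires the host name to end right after the suffix.
    private static final Pattern DOMAIN = Pattern.compile(
        "^(?:[a-z][a-z0-9+.-]*://)?(?:[a-z0-9-]+\\.)*?([a-z0-9-]+)\\."
            + "(?:(?:com|net)\\.[a-z]{2}|[a-z0-9-]+)(?=[/:?#]|$)",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the captured label, or null when nothing matches.
    static String domainOf(String url) {
        Matcher m = DOMAIN.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(domainOf("www.subdomain1.host.com"));
        System.out.println(domainOf("www.host.com.jp"));
        System.out.println(domainOf("https://www.host.ca/?address=us"));
    }
}
```

Under the assumed rule all three calls should yield "host". Any suffix outside the rule (for example .co.uk) would still give the wrong label, so the list of second-level suffixes has to match the data being processed.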

Author

Commented:
kaufmed,

Is it possible to write a regex that gets the string immediately before .com, .net, .org, etc.?

Thanks.
Most Valuable Expert 2011
Top Expert 2015

Commented:
I believe TerryAtOpus already answered that in #32990621.

Author

Commented:
TerryAtOpus:

The regex you provided causes errors in Eclipse and won't compile. Is an escape character missing somewhere?

Thanks.
Most Valuable Expert 2011
Top Expert 2015

Commented:
>>  The regex you provided causes errors

Could you be a bit more specific?

Author

Commented:
This line of code shows errors when I put it into Eclipse. The errors concern the first argument to Pattern.compile.

Pattern re = Pattern.compile("^(?:(?:https?|s?ftp|mailto|gopher)://?)?(?:[a-z0-9-]+\\.){0,}([a-z0-9-]+)\\.(?:[a-z0-9-]+){1,}(?:.*?)",Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

Something must be missing in
Most Valuable Expert 2011
Top Expert 2015
Commented:
I copied the line exactly from your last post and it compiles fine in NetBeans. Is there anything specific in the error message itself?

Author

Commented:
Thanks a lot!