wsyy

How to use regex to get things out of hostname

Hi,

I would like to use ONE regular expression to get abc.com or abc.com.cn out of all the following host names:

1) www.abc.com
2) www.abc.com.cn
3) www.xyz.abc.com
4) www.xyz.abc.com.cn
5) xyz.abc.com
6) xyz.abc.com.cn

Thanks!
Gurvinder Pal Singh

http://www.exampledepot.com/egs/java.lang/HasSubstr.html

Just check:

if (string.indexOf("abc.com.cn") != -1 || string.indexOf("abc.com") != -1)
{
   // string contains one of the required substrings
}
stachenov

Something like this works:
 
// Needs java.util.regex.Matcher and java.util.regex.Pattern.
String[] t = {
    "www.abc.com",
    "www.abc.com.cn",
    "www.xyz.abc.com",
    "www.xyz.abc.com.cn",
    "xyz.abc.com",
    "xyz.abc.com.cn",
};
// One host name label (no leading or trailing hyphen),
// followed by ".com.cn" or ".com" at the end of the string.
Pattern p = Pattern.compile("((?:[a-z0-9][-a-z0-9]*[a-z0-9]|[a-z0-9])"
        + "(?:\\.com\\.cn|\\.com)$)");
for (String s : t) {
    Matcher m = p.matcher(s);
    if (m.find()) {
        System.out.println("Found " + m.group(1) + " in " + s);
    } else {
        System.out.println("Not found in " + s);
    }
}


Looks a bit ugly because I couldn't find a more elegant way to enforce the "host name can't end or start with a hyphen" rule.

If you need to match more domains, not just ".com.cn" and ".com", then the second part should contain more complicated alternatives, but the idea stays the same.
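A rough sketch of that idea, assuming the suffixes are kept in a list and joined into the alternation. The class name, the method name and the suffix list itself are made up for illustration; a real list would be much longer.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SuffixMatcher {
    // Hypothetical suffix list, for illustration only.
    static final List<String> SUFFIXES =
            List.of("com.cn", "net.cn", "co.uk", "com", "net", "org");

    // Same label rule as above, followed by a dot and any one of the
    // listed suffixes, anchored at the end of the host name.
    static final Pattern DOMAIN = Pattern.compile(
            "((?:[a-z0-9][-a-z0-9]*[a-z0-9]|[a-z0-9])\\.(?:"
            + SUFFIXES.stream().map(Pattern::quote).collect(Collectors.joining("|"))
            + "))$");

    // Returns the last label plus suffix, or null when no listed suffix matches.
    static String registeredDomain(String host) {
        Matcher m = DOMAIN.matcher(host);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] hosts = {"www.abc.com", "www.xyz.abc.com.cn", "shop.abc.co.uk", "www.abc.info"};
        for (String s : hosts) {
            System.out.println(s + " -> " + registeredDomain(s));
        }
    }
}
```

Pattern.quote is used so the dots inside each suffix are taken literally. Hosts with a suffix that is not in the list (like the last one here) come back as null.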
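Try this code:
public class TestSubstring {
    public static void main(String[] args) {
        String[] hosts = {"www.abc.com", "www.abc.com.cn", "www.xyz.abc.com", "www.xyz.abc.com.cn", "xyz.abc.com", "xyz.abc.com.cn"};
        for (String host : hosts) {
            // Keep everything after the dot that precedes "abc.com".
            // If ".abc.com" is absent, indexOf returns -1 and the host is printed unchanged.
            System.out.println(host.substring(host.indexOf(".abc.com") + 1));
        }
    }
}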


String[] hosts = {
    "www.abc.com",
    "www.abc.com.cn",
    "www.xyz.abc.com",
    "www.xyz.abc.com.cn",
    "xyz.abc.com",
    "xyz.abc.com.cn"
};

for (String sh : hosts) {
    sh = sh.replaceAll(".*\\.(.+?\\.com)", "$1");
    System.out.println("result: " + sh);
}


Output:

result: abc.com
result: abc.com.cn
result: abc.com
result: abc.com.cn
result: abc.com
result: abc.com.cn


ASKER CERTIFIED SOLUTION
for_yan

@for_yan, this doesn't work for something like "com.cn.com.cn".

Why? It returns:
result: cn.com.cn

And com.cn.com returns:

cn.com

That is what is expected, as I understand it.

And certainly, for any regex you can invent some string that will break it.


Sorry, I was wrong, it actually works.
No problem.
Though nothing is ideal, and I'm sure there is some string that will break it.
Still, it helps in the great majority of cases.
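To illustrate the point, here is one input that seems to trip the replaceAll pattern: a host where a label merely starts with "com". The host www.abc.company.cn is made up and not from the question. The anchored pattern next to it is only one possible tightening, assuming ".com" and ".com.cn" are the only suffixes of interest; the class and method names are invented for the sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchoredDomain {
    // The replaceAll pattern from the thread, unchanged.
    static String unanchored(String host) {
        return host.replaceAll(".*\\.(.+?\\.com)", "$1");
    }

    // Assumed alternative: the whole host must end in "<label>.com" or
    // "<label>.com.cn"; anything before that label is optional.
    static final Pattern ANCHORED =
            Pattern.compile("^(?:.*\\.)?([^.]+\\.com(?:\\.cn)?)$");

    // Returns the captured domain, or null when the host does not fit.
    static String anchored(String host) {
        Matcher m = ANCHORED.matcher(host);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tricky = "www.abc.company.cn"; // hypothetical host
        System.out.println(unanchored(tricky)); // "www." is stripped although the host is not under .com
        System.out.println(anchored(tricky));   // no match, so null
        System.out.println(anchored("www.xyz.abc.com.cn"));
    }
}
```

The anchored version rejects hosts it does not understand instead of silently rewriting them, at the price of handling only the two suffixes it names.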
Personally, I would use URL.getHost.
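A minimal sketch of that suggestion, assuming the input is a full URL rather than a bare host name. The sample address and the helper name hostOf are made up. URL.getHost only returns the host part; trimming it down to abc.com or abc.com.cn would still need one of the approaches above.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class HostFromUrl {
    // Parses the address and returns only its host part.
    static String hostOf(String address) {
        try {
            URL url = URI.create(address).toURL();
            return url.getHost();
        } catch (MalformedURLException e) {
            // Unknown protocol or similar; reported as an unchecked exception here.
            throw new IllegalArgumentException("Not a usable URL: " + address, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical address; path, port and query are dropped by getHost.
        System.out.println(hostOf("http://www.xyz.abc.com.cn:8080/index.html?q=1"));
    }
}
```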