modsiw

java - break url into parts

I have a long list of URLs (10 million). I need to parse each one and extract its protocol, domain, path, and query string.

URL u = new URL(URLDecoder.decode(url, "UTF-8")); isn't viable because many of the URLs are malformed and this throws exceptions.

I'd like methods that just return null if they can't parse and keep going, or any other efficient way of breaking up the URLs without throwing exceptions.
cmalakar

If the URL is malformed, why not just ignore the exception and proceed to the other URLs?
You can write a method that catches the exception and ignores it.


Something like this, though you might consider
handling exceptions other than MalformedURLException differently
and doing something about them, or not:

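import java.net.URL;
import java.net.URLDecoder;

// Returns the parsed URL, or null if the string cannot be decoded or parsed.
public URL myDecode(String url) {
    try {
        return new URL(URLDecoder.decode(url, "UTF-8"));
    } catch (Exception ex) {
        // Covers MalformedURLException from new URL(...) and the
        // IllegalArgumentException that URLDecoder.decode throws on bad % escapes.
        return null;
    }
}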

That'll be what you asked for: null instead of an exception.
modsiw

ASKER

I need to prevent the exceptions from being formed. I can wrap them in try / catch.

The act of creating and throwing an exception is a performance issue.
You can look at and analyze most of your problems, work out what causes them,
and then do a preliminary check with substrings.
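To illustrate the substring pre-check idea, here is a minimal sketch. The class name UrlPreCheck, the method looksParseable, and the specific rules it applies are assumptions made for illustration, not a complete validity test. It only filters out the cheapest-to-detect bad inputs before the string ever reaches new URL(...), so a try/catch fallback would still be needed for whatever slips through.

```java
final class UrlPreCheck {

    // Hypothetical helper: returns false for strings that obviously cannot
    // be parsed as "scheme://host...", without throwing any exception.
    static boolean looksParseable(String url) {
        if (url == null) {
            return false;
        }
        int sep = url.indexOf("://");
        if (sep <= 0) {
            return false; // no scheme separator, or empty scheme
        }
        // Scheme: a letter first, then letters, digits, '+', '-' or '.'.
        for (int i = 0; i < sep; i++) {
            char c = url.charAt(i);
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            boolean other = (c >= '0' && c <= '9') || c == '+' || c == '-' || c == '.';
            if (!letter && !(i > 0 && other)) {
                return false;
            }
        }
        int hostStart = sep + 3;
        if (hostStart >= url.length()) {
            return false; // nothing after "://"
        }
        char first = url.charAt(hostStart);
        if (first == '/' || first == '?' || first == '#') {
            return false; // empty host
        }
        // Reject whitespace and control characters anywhere after the scheme.
        for (int i = hostStart; i < url.length(); i++) {
            if (url.charAt(i) <= ' ') {
                return false;
            }
        }
        return true;
    }
}
```

A caller would skip (or log) any string for which looksParseable returns false and only hand the rest to the real parser.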
modsiw

ASKER

I'm also interested in a way of doing this without creating a new URL object for each url.
modsiw

ASKER

>>You can look at and analyze most of your problems, work out what causes them,
and then do a preliminary check with substrings.

There are a lot of cases. I don't want to check them by hand. I'm hoping for a premade something that already does this.

New odd cases will continuously spring up; it needs a solution more robust than the hackery I'd turn out in a few hours.
Well, you can also find the source code for URLDecoder and modify it
so that it does not throw an exception.
I can't believe that this is the actual source code, it is so short,
but maybe it can still be useful to you.
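As a rough idea of what such a modified decoder could look like, here is a minimal sketch written from scratch rather than copied from the JDK source. The class name LenientDecoder and the choice to return null on a malformed escape are assumptions for illustration. It assumes UTF-8 and, like URLDecoder, turns '+' into a space.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class LenientDecoder {

    // Hypothetical helper: percent-decodes s as UTF-8 and returns null,
    // instead of throwing, when a % escape is truncated or not hexadecimal.
    static String decode(String s) {
        if (s == null) {
            return null;
        }
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        for (int i = 0; i < in.length; i++) {
            int b = in[i];
            if (b == '+') {
                out.write(' ');
            } else if (b == '%') {
                if (i + 2 >= in.length) {
                    return null; // truncated escape such as "%4"
                }
                int hi = hex(in[i + 1]);
                int lo = hex(in[i + 2]);
                if (hi < 0 || lo < 0) {
                    return null; // non-hex escape such as "%zz"
                }
                out.write((hi << 4) | lo);
                i += 2;
            } else {
                out.write(b);
            }
        }
        // Invalid UTF-8 byte sequences become U+FFFD here; nothing is thrown.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Value of one hex digit, or -1 if c is not a hex digit.
    private static int hex(int c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }
}
```

Returning null keeps the "bad input" signal that the exception used to carry, without the cost of creating and throwing one.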
modsiw

ASKER

URL is a much bigger piece of the puzzle. I need to extract the protocol, domain, path, and query string separately while identifying bad URLs.
ASKER CERTIFIED SOLUTION
Mick Barry
Try the following regular expression. I don't know how efficient this will be (it might be faster to use the URL class and handle the exceptions), but you can performance test it. You should be able to match any valid URL except mailto: and ftp://username:password@ with this expression and, once it has matched, you can use the Matcher.group() method to retrieve the URL parts, or null/empty string if they did not exist (see the javadoc for Matcher.group(): it returns null if the group did not take part in the match and an empty string if the group matched the empty string). I just put this together and haven't tested it, so there may be some errors, but hopefully it's a starting point.

If you are parsing private URLs with a non-standard TLD, you can replace the TLD section of the expression below with: (?:[a-zA-Z][a-zA-Z0-9\\-]*[a-zA-Z0-9])

(?:([a-zA-Z]*)://)?((?:(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|(?:[a-zA-Z0-9\\-]*[a-zA-Z0-9]\\.)+(?i:aero|asia|biz|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|xxx|[a-z]{2}))(?::(0?\\d{1,4}|6[0-4]\\d{3}|65[0-4]\\d{2}|655[0-2]\\d|6553[0-5]))?(?:((?:/[^/#?]*)*)(?:#([a-zA-Z0-9]*))?(?:\\?((?:[a-zA-Z]\\w*=[^&]*&)*(?:[a-zA-Z]\\w*=[^&]*)))?)?


Quick explanation, with group indexes.

(?:([a-zA-Z]*)://)? # matches protocol, if present, in format "xyz://", GROUP 1
(
  (?:
    (?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}
    (?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])
  ) # match IPv4 address
  | # OR DNS name
  (?:[a-zA-Z0-9\\-]*[a-zA-Z0-9]\\.)+
  (?i:aero|asia|biz|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|xxx|[a-z]{2})
) # server address, GROUP 2
(?::(0?\\d{1,4}|6[0-4]\\d{3}|65[0-4]\\d{2}|655[0-2]\\d|6553[0-5]))? # matches optional port 0-65535, GROUP 3
(?:
  ((?:/[^/#?]*)*) # matches path, if present, GROUP 4
  (?:#([a-zA-Z0-9]*))? # matches hash, if present, GROUP 5
  (?:\\?((?:[a-zA-Z]\\w*=[^&]*&)*(?:[a-zA-Z]\\w*=[^&]*)))? # matches query string, if present, GROUP 6
)?
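
To show how the Matcher.group() approach fits together in Java, here is a minimal sketch. Note that it does not use the expression above: it swaps in the much shorter, more permissive expression from RFC 3986 Appendix B, and the class name UrlParts, the method split, and the rule "reject when protocol or domain is missing" are assumptions for illustration. The Pattern is compiled once and reused, no URL object is created, and bad input yields null rather than an exception.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlParts {

    // RFC 3986 Appendix B expression: group 2 = scheme, 4 = authority,
    // 5 = path, 7 = query, 9 = fragment. Compiled once, reused for every URL.
    private static final Pattern URI = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Hypothetical helper: returns {protocol, domain, path, query}, or null
    // when the string has no protocol or no domain. The query entry is null
    // when the URL has no query string.
    static String[] split(String url) {
        if (url == null) {
            return null;
        }
        Matcher m = URI.matcher(url);
        if (!m.matches()) {
            return null; // e.g. a line break inside the fragment
        }
        String protocol = m.group(2);
        String domain = m.group(4);
        if (protocol == null || domain == null || domain.isEmpty()) {
            return null; // treated as a bad URL in this sketch
        }
        return new String[] { protocol, domain, m.group(5), m.group(7) };
    }
}
```

Because this expression accepts almost any characters in each part, a stricter host or query check (such as the one in the expression above) would have to be applied to the extracted groups if bad URLs need to be identified more precisely.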