modsiw
java - break url into parts
I have a long list of URLs (10 million). I need to parse each one and extract its protocol, domain, path, and query string.
URL u = new URL(URLDecoder.decode(url, "UTF-8"));
This isn't viable because many of the URLs are malformed and it throws exceptions.
I'd like methods that just return null if they can't parse and keep going, or any other efficient way of breaking up the URLs without throwing exceptions.
If a URL is malformed, why not just ignore the exception and proceed to the next URL?
You can write a method that catches the exception and ignores it.
Something like the following, though you might consider handling exceptions other than MalformedURLException differently (or not):
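import java.net.URL;
import java.net.URLDecoder;

public URL myDecode(String url) {
    try {
        return new URL(URLDecoder.decode(url, "UTF-8"));
    } catch (Exception ex) {
        // Malformed URL, bad percent-encoding, or unsupported charset
        return null;
    }
}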
That gives you what you asked for: null instead of an exception.
ASKER
I need to prevent the exceptions from being created in the first place. I can already wrap the calls in try/catch.
The act of creating and throwing an exception is a performance issue.
You can analyze your problem URLs to find the most common causes,
and then do a preliminary check with substrings.
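A minimal sketch of what such a substring-based check could look like, assuming only well-formed "protocol://host/path?query" shapes count as valid. The class UrlParts and its split method are hypothetical names made up for this illustration, not part of any library. It uses plain indexOf/substring calls, never throws for a bad URL, and returns null instead. The host is left as-is, so a port or user info would stay attached to it.

```java
// Hypothetical helper for illustration only; not from the JDK or any library.
final class UrlParts {
    final String protocol;
    final String host;   // may still contain a port or user info
    final String path;   // empty string when the URL has no path
    final String query;  // null when the URL has no query string

    private UrlParts(String protocol, String host, String path, String query) {
        this.protocol = protocol;
        this.host = host;
        this.path = path;
        this.query = query;
    }

    /** Returns the parts of the URL, or null if it does not look like one. */
    static UrlParts split(String url) {
        if (url == null) return null;

        int schemeEnd = url.indexOf("://");
        if (schemeEnd <= 0) return null;                 // missing or empty protocol
        for (int i = 0; i < schemeEnd; i++) {
            char c = url.charAt(i);
            boolean asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (!asciiLetter) return null;               // protocol must be letters only
        }
        String protocol = url.substring(0, schemeEnd);

        int hostStart = schemeEnd + 3;
        int hash = url.indexOf('#', hostStart);
        int end = hash >= 0 ? hash : url.length();       // ignore the fragment
        int q = url.indexOf('?', hostStart);
        if (q >= end) q = -1;                            // '?' inside the fragment
        int slash = url.indexOf('/', hostStart);

        // The host ends at the first '/', '?' or '#', whichever comes first.
        int hostEnd = end;
        if (slash >= 0 && slash < hostEnd) hostEnd = slash;
        if (q >= 0 && q < hostEnd) hostEnd = q;
        if (hostEnd == hostStart) return null;           // empty host

        String host = url.substring(hostStart, hostEnd);
        String path = url.substring(hostEnd, q >= 0 ? q : end);
        String query = q >= 0 ? url.substring(q + 1, end) : null;
        return new UrlParts(protocol, host, path, query);
    }
}
```

Calling UrlParts.split(line) once per input line and skipping null results keeps the loop free of exceptions. Stricter checks (allowed host characters, port range) would need to be added for real data.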
ASKER
I'm also interested in a way of doing this without creating a new URL object for each URL.
ASKER
>> You can analyze your problem URLs to find the most common causes,
and then do a preliminary check with substrings.
There are a lot of cases. I don't want to check them by hand. I'm hoping for a premade something that already does this.
New odd cases will continuously spring up; it needs a solution more robust than the hackery I'd turn out in a few hours.
Well, you can also find the source code for URLDecoder and modify it so that it does not throw an exception.
I can't believe this is the actual source code; it's so short.
But maybe it can still be useful to you.
ASKER
URL is a much bigger piece of the puzzle. I need to extract the protocol, domain, path, and query string separately while identifying bad URLs.
ASKER CERTIFIED SOLUTION
Try the following regular expression. I don't know how efficient this will be (it might be faster to use the URL class and handle the exceptions), but you can performance test it. You should be able to match any valid URL except mailto: and ftp://username:password@ with this expression. Once a URL has matched, you can use the Matcher.group() method to retrieve the URL parts, or null/empty string if they did not exist (see the javadoc for Matcher.group(): it returns null for a group that took no part in the match and an empty string if the group matched the empty string). I just put this together and haven't tested it, so there may be some errors, but hopefully it's a starting point.
If you are parsing private URLs with a non-standard TLD, you can replace the TLD section of the expression below with: (?:[a-zA-Z][a-zA-Z0-9\\-]*[a-zA-Z0-9])
(?:([a-zA-Z]*)://)?((?:(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|(?:[a-zA-Z0-9\\-]*[a-zA-Z0-9]\\.)+(?i:aero|asia|biz|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|xxx|[a-z]{2}))(?::(0?\\d{1,4}|6[0-4]\\d{3}|65[0-4]\\d{2}|655[0-2]\\d|6553[0-5]))?(?:((?:/[^#?/]*)*)(?:#([a-zA-Z0-9]*))?(?:\\?((?:[a-zA-Z]\\w*=[^&]*&)*(?:[a-zA-Z]\\w*=[^&]*)))?)?
Quick explanation, with group indexes.
(?:([a-zA-Z]*)://)? # matches protocol, if present, in format "xyz://", GROUP 1
(
  (?:
    (?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}
    (?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])
  ) # matches IPv4 address
  | # OR DNS name
  (?:[a-zA-Z0-9\\-]*[a-zA-Z0-9]\\.)+
  (?i:aero|asia|biz|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|xxx|[a-z]{2})
) # server address, GROUP 2
(?::(0?\\d{1,4}|6[0-4]\\d{3}|65[0-4]\\d{2}|655[0-2]\\d|6553[0-5]))? # matches optional port 0-65535, GROUP 3
(?:
  ((?:/[^#?/]*)*) # matches path, if present, GROUP 4
  (?:#([a-zA-Z0-9]*))? # matches hash, if present, GROUP 5
  (?:\\?((?:[a-zA-Z]\\w*=[^&]*&)*(?:[a-zA-Z]\\w*=[^&]*)))? # matches query string, if present, GROUP 6
)?
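To show how the Matcher.group() step could look in practice, here is a small sketch. It deliberately uses a much simpler pattern than the expression above, with named groups instead of group indexes, so it does not validate TLDs, IPv4 octets, or port ranges. It also places the query string before the fragment, which is the order URLs use. The class name RegexUrlSplitter, the method parse, and the group names are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class RegexUrlSplitter {
    // Compiled once and reused for every URL; Pattern is immutable and thread-safe.
    private static final Pattern URL = Pattern.compile(
          "(?<protocol>[a-zA-Z][a-zA-Z0-9+.\\-]*)://"               // required protocol
        + "(?<host>[a-zA-Z0-9](?:[a-zA-Z0-9.\\-]*[a-zA-Z0-9])?)"    // host name or IP, not validated
        + "(?::(?<port>\\d{1,5}))?"                                 // optional port, range not checked
        + "(?<path>/[^?#]*)?"                                       // optional path
        + "(?:\\?(?<query>[^#]*))?"                                 // optional query string
        + "(?:#(?<fragment>.*))?");                                 // optional fragment

    private static final String[] PARTS =
        {"protocol", "host", "port", "path", "query", "fragment"};

    /** Returns the URL parts by name, or null if the whole string does not match. */
    static Map<String, String> parse(String url) {
        if (url == null) return null;
        Matcher m = URL.matcher(url);
        if (!m.matches()) return null;          // no exception, just "not a URL"
        Map<String, String> parts = new HashMap<>();
        for (String name : PARTS) {
            // group(name) is null when that optional group took no part in the match
            parts.put(name, m.group(name));
        }
        return parts;
    }
}
```

A failed match costs one matcher run and no exception, so bad URLs can be counted or logged and skipped. Whether this beats new URL(...) with try/catch on real data is something only a performance test on the actual list can tell.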