Solved

java - break url into parts

Posted on 2011-04-26
Last Modified: 2012-05-11
I have a long list of URLs (10 million). I need to parse each one and extract its protocol, domain, path, and query string.

URL u = new URL(URLDecoder.decode(url, "UTF-8")); isn't viable, because many of the URLs are malformed and this throws exceptions.

I'd like methods that just return null when they can't parse and keep going, or any other efficient way of breaking up the URLs without throwing exceptions.
Question by:modsiw
14 Comments
 
LVL 23

Expert Comment

by:cmalakar
ID: 35468178
If the URL is malformed, why not just ignore the exception and proceed to the other URLs?
 
LVL 47

Expert Comment

by:for_yan
ID: 35468319
You can write a method that catches the exception and ignores it.
 
LVL 47

Expert Comment

by:for_yan
ID: 35468356


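Something like this, though you might consider distinguishing exceptions other than MalformedURLException and handling them separately, or not:

import java.net.URL;
import java.net.URLDecoder;

public URL myDecode(String url) {
    try {
        // Decode first, then parse; either step can throw on malformed input
        return new URL(URLDecoder.decode(url, "UTF-8"));
    } catch (Exception ex) {
        // Signal failure with null instead of propagating the exception
        return null;
    }
}

That gives you what you asked for: null instead of an exception.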
 
LVL 3

Author Comment

by:modsiw
ID: 35468425
I need to prevent the exceptions from being created in the first place; wrapping the calls in try/catch is something I can already do.

The act of creating and throwing an exception is a performance issue.
 
LVL 47

Expert Comment

by:for_yan
ID: 35468453
You can look at and analyze most of your problem cases, work out what causes them,
and then do a preliminary check with substrings.
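
A minimal sketch of such a preliminary check, under stated assumptions: the class and method names (UrlPreCheck, looksParseable, parseOrNull) are hypothetical, and the rules (http/https only, no whitespace or control characters) are guesses about what the data looks like, not a complete validity test. The idea is only to reject the cheap-to-detect cases before new URL(...) gets a chance to throw.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UrlPreCheck {
    // Cheap string tests that reject obviously bad input without allocating
    // anything. Assumption: only http and https URLs are of interest.
    static boolean looksParseable(String s) {
        if (s == null) return false;
        boolean http = s.regionMatches(true, 0, "http://", 0, 7);
        boolean https = s.regionMatches(true, 0, "https://", 0, 8);
        if (!http && !https) return false;
        int hostStart = http ? 7 : 8;
        if (s.length() == hostStart) return false;   // nothing after the scheme
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) <= ' ') return false;     // whitespace or control character
        }
        return true;
    }

    // Returns null instead of throwing; the try/catch remains as a safety net
    // for anything the pre-check lets through.
    static URL parseOrNull(String s) {
        if (!looksParseable(s)) return null;
        try {
            return new URL(s);
        } catch (MalformedURLException ex) {
            return null;
        }
    }
}
```

With a check like this, the exception path should only be reached for inputs the pre-check does not recognize as bad, so its cost matters less; whether that is enough for 10 million URLs would need measuring.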
 
LVL 3

Author Comment

by:modsiw
ID: 35468464
I'm also interested in a way of doing this without creating a new URL object for each url.
 
LVL 3

Author Comment

by:modsiw
ID: 35468501
>>You can look at and analyze most of your problem cases, work out what causes them,
and then do a preliminary check with substrings.

There are a lot of cases. I don't want to check them by hand. I'm hoping for a premade something that already does this.

New odd cases will continuously spring up; it needs a solution more robust than the hackery I'd turn out in a few hours.
 
LVL 47

Expert Comment

by:for_yan
ID: 35468529
Well, you can also find the source code for URLDecoder and modify it so that it does not throw an exception.
 
LVL 47

Expert Comment

by:for_yan
ID: 35468546
I can't believe this is the actual source code; it is so short.
But maybe it can still be useful to you.
 
LVL 3

Author Comment

by:modsiw
ID: 35468590
URL is a much bigger piece of the puzzle. I need to extract the protocol, domain, path, and query string separately while identifying bad URLs.
 
LVL 92

Accepted Solution

by:objects (earned 2000 total points)
ID: 35471694
I doubt you will find anything ready-made.
Your best bet is probably to grab the source of the URL class and modify it to meet your needs:
http://www.docjar.com/html/api/java/net/URL.java.html
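
As a rough illustration of what a stripped-down, exception-free parser could look like, here is a sketch that splits a string with indexOf only. It is not derived from the JDK's URL source; the names UrlSplitter, Parts, and split are hypothetical, and it assumes that "domain" may include any user info and port, and that a string without "://" or without a host counts as bad.

```java
final class UrlSplitter {
    /** The four requested pieces; path and query are null when absent. */
    static final class Parts {
        final String protocol, domain, path, query;
        Parts(String protocol, String domain, String path, String query) {
            this.protocol = protocol;
            this.domain = domain;
            this.path = path;
            this.query = query;
        }
    }

    // Returns null for input that cannot be split; never throws on bad input.
    static Parts split(String url) {
        if (url == null) return null;
        int schemeEnd = url.indexOf("://");
        if (schemeEnd <= 0) return null;                 // missing or empty protocol
        int hostStart = schemeEnd + 3;

        // Cut off the fragment first so a '?' or '/' after '#' is ignored.
        int end = url.indexOf('#', hostStart);
        if (end < 0) end = url.length();

        int queryStart = url.indexOf('?', hostStart);
        if (queryStart < 0 || queryStart > end) queryStart = end;

        int pathStart = url.indexOf('/', hostStart);
        if (pathStart < 0 || pathStart > queryStart) pathStart = queryStart;

        if (pathStart == hostStart) return null;         // empty host

        String protocol = url.substring(0, schemeEnd);
        String domain = url.substring(hostStart, pathStart);  // may include user info and port
        String path = pathStart < queryStart ? url.substring(pathStart, queryStart) : null;
        String query = queryStart < end ? url.substring(queryStart + 1, end) : null;
        return new Parts(protocol, domain, path, query);
    }
}
```

For example, split("http://example.com/a/b?x=1&y=2#top") yields protocol "http", domain "example.com", path "/a/b", and query "x=1&y=2", while split("example.com/foo") returns null. It still allocates one small object per URL, and it does no validation of the host or of percent-escapes; both would have to be added if they matter.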
 
LVL 10

Expert Comment

by:gordon_vt02
ID: 35476091
Try the following regular expression. I don't know how efficient it will be; it might be faster to use the URL class and handle the exceptions, but you can performance-test it. The expression should match any valid URL except the mailto: and ftp://username:password@ forms. Once a URL has matched, you can use the Matcher.group() method to retrieve its parts (see the javadoc for Matcher.group(): it returns null for a group that did not take part in the match, and an empty string for a group that matched the empty string). I just put this together and haven't tested it, so there may be some errors, but hopefully it's a starting point.

If you are parsing private URLs with a non-standard TLD, you can replace the TLD section of the expression below with: (?:[a-zA-Z][a-zA-Z0-9\\-]*[a-zA-Z0-9])

(?:([a-zA-Z]*)://)?((?:(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|(?:(?:[a-zA-Z0-9\\-]*[a-zA-Z0-9]\\.)+(?i:aero|asia|biz|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|xxx|[a-z]{2})))(?::(0?\\d{1,4}|[1-5]\\d{4}|6[0-4]\\d{3}|65[0-4]\\d{2}|655[0-2]\\d|6553[0-5]))?(?:((?:/[^#?/]*)*)(?:\\?((?:[a-zA-Z]\\w*=[^&#]*&)*(?:[a-zA-Z]\\w*=[^&#]*)))?(?:#([a-zA-Z0-9]*))?)?

Quick explanation, with group indexes. (Backslashes are doubled because the expression is written as a Java string literal. The query string is matched before the hash, since that is the order in which they appear in a URL.)

(?:([a-zA-Z]*)://)? # matches protocol, if present, in format "xyz://", GROUP 1
(
  (?:
    (?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}
    (?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])
  ) # match IPv4 address
  | # OR DNS name
  (?:
    (?:[a-zA-Z0-9\\-]*[a-zA-Z0-9]\\.)+
    (?i:aero|asia|biz|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|xxx|[a-z]{2})
  )
) # server address, GROUP 2
(?::(0?\\d{1,4}|[1-5]\\d{4}|6[0-4]\\d{3}|65[0-4]\\d{2}|655[0-2]\\d|6553[0-5]))? # matches optional port 0-65535, GROUP 3
(?:
  ((?:/[^#?/]*)*) # matches path, if present, GROUP 4
  (?:\\?((?:[a-zA-Z]\\w*=[^&#]*&)*(?:[a-zA-Z]\\w*=[^&#]*)))? # matches query string, if present, GROUP 5
  (?:#([a-zA-Z0-9]*))? # matches hash, if present, GROUP 6
)?
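
To show how the group lookup could be wired up in Java, here is a small sketch. It deliberately uses a much simpler pattern than the expression above (no TLD list, no octet or port range checks), so it accepts more strings; the class name UrlRegexDemo and the method parts are hypothetical. Like the expression above, it does not handle the username:password@ form, and it does not handle IPv6 literals either.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlRegexDemo {
    // Compiled once and reused; Pattern instances are safe to share.
    private static final Pattern URL_PATTERN = Pattern.compile(
          "([a-zA-Z][a-zA-Z0-9+.-]*)://"   // group 1: protocol
        + "([^/?#:]+)"                      // group 2: host
        + "(?::(\\d{1,5}))?"                // group 3: optional port
        + "(/[^?#]*)?"                      // group 4: optional path
        + "(?:\\?([^#]*))?"                 // group 5: optional query string
        + "(?:#(.*))?");                    // group 6: optional fragment

    // Returns {protocol, host, port, path, query, fragment}, with null for
    // any part that is absent, or null when the whole string does not match.
    static String[] parts(String url) {
        if (url == null) return null;
        Matcher m = URL_PATTERN.matcher(url);
        if (!m.matches()) return null;       // no exception on bad input
        String[] out = new String[6];
        for (int i = 0; i < 6; i++) {
            out[i] = m.group(i + 1);         // null if the group did not participate
        }
        return out;
    }
}
```

For "http://example.com:8080/a/b?x=1#top" this returns "http", "example.com", "8080", "/a/b", "x=1", and "top"; for "not a url" it returns null. Whether this beats the try/catch approach on the real data is something only a benchmark on that data can answer.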
