Solved

Parsing (getting) URLs from text

Posted on 2006-04-20
Last Modified: 2010-04-23
Ok, so, I just want to know how to parse a string so that I can extract all the URLs (and only the URLs) from the string. Suppose that this is the text:
"The Guqin is the modern name for a plucked seven-string Chinese musical instrument of the zither family. For more info on it, check this page: http://en.wikipedia.org/wiki/Guqin."
I would want to know how to get just "http://en.wikipedia.org/wiki/Guqin" from that.
Obviously, there are a few things that will need to be considered... Not every URL will be followed by a space or a comma; sometimes a URL comes at the end of a sentence and is followed by a period. Not every URL will end in a file extension (the one above doesn't). The URLs can appear anywhere in the text and there is no fixed formatting. All URLs will start with http://, which should make things easier...

Thanks in advance.
Question by:codemaster3
10 Comments
 
LVL 4

Expert Comment

by:g_johnson
ID: 16502392
Can it be assumed that urls will be separated by ONE of these:  space, comma, period?

If so, I would search the string (using the InStr function) for "http://"  (P1)
then, starting from P1, for a space (P2)
then for a comma (P3)
then for a period (P4)

then I would take the string between P1 and the smallest of P2, P3, and P4

Does that help?

P.S.  I'm fairly new to .NET, so maybe IndexOf instead of InStr
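
For illustration, here is a rough sketch of that search-and-cut idea, repeated for every "http://" in the string. It is written in Python rather than VB.NET purely as an example; `str.find` plays the role of InStr/IndexOf, and the function name `urls_by_delimiters` is made up for this sketch. Note that a period inside the host name counts as a delimiter here too, so the results can come out shorter than the real URL.

```python
def urls_by_delimiters(text, delimiters=(" ", ",", ".")):
    """Cut each URL at the nearest delimiter found after its "http://" prefix."""
    urls = []
    pos = 0
    while True:
        p1 = text.find("http://", pos)  # P1: start of the next URL
        if p1 == -1:
            break
        body_start = p1 + len("http://")
        # P2, P3, P4: position of each delimiter after the prefix (-1 = not found)
        hits = [text.find(d, body_start) for d in delimiters]
        hits = [h for h in hits if h != -1]
        end = min(hits) if hits else len(text)
        urls.append(text[p1:end])
        pos = end  # continue searching after this URL
    return urls


print(urls_by_delimiters("a http://localhost/x, b http://host/y z"))
print(urls_by_delimiters("http://www.yahoo.com"))  # cut at the first period
```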
 
LVL 4

Expert Comment

by:g_johnson
ID: 16502402
oh, and by the way,

"repeat if necessary" -- to find more urls in the same string
 

Author Comment

by:codemaster3
ID: 16502454
Yes, but there would still be a few things you'd have to check for. For example, what if this is the text:
"http://www.site.com/something.somethingelse/file.rar. http://anothersite.com"
Then parsing by periods would only return "http://www" and "http://anothersite", while parsing by spaces would return "http://www.site.com/something.somethingelse/file.rar." and "http://anothersite.com". The unnecessary period at the end of the first URL would screw it up...
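
One possible way around the trailing period, sketched in Python purely for illustration: split the text on any whitespace, keep the pieces that contain "http://", and strip sentence punctuation from the end of each piece. The function name and the `TRAILING_PUNCTUATION` set are assumptions made for this sketch, and it assumes a real URL never ends in one of those characters, which is not always true.

```python
# Assumption: these characters at the very end of a token belong to the
# sentence, not to the URL.
TRAILING_PUNCTUATION = ".,;:!?\"')"


def urls_from_tokens(text):
    """Split on whitespace, then trim trailing punctuation from each URL token."""
    urls = []
    for token in text.split():  # split() with no argument splits on any whitespace
        start = token.find("http://")
        if start == -1:
            continue
        url = token[start:].rstrip(TRAILING_PUNCTUATION)
        if url != "http://":  # skip a bare prefix with nothing after it
            urls.append(url)
    return urls


sample = "http://www.site.com/something.somethingelse/file.rar. http://anothersite.com"
print(urls_from_tokens(sample))
```

Periods inside the URL are left alone because only the end of each token is trimmed.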

Author Comment

by:codemaster3
ID: 16502471
Oh, and also a newline character could separate the URLs.
 
LVL 4

Expert Comment

by:g_johnson
ID: 16503467
yep, forgot about the period after www etc.
hmmm
not sure how to solve this
is the string ALL urls or is there other information in it?
 

Author Comment

by:codemaster3
ID: 16503974
Other info... Basically it's supposed to let you paste a big block of text that also contains descriptions of each file, without having to filter out the extra info yourself...
 
LVL 4

Accepted Solution

by:
g_johnson earned 600 total points
ID: 16506801
I honestly don't know what to do.  The fact that a period can be the separator messes the whole thing up.  I might have to resort to "testing" strings after I've parsed them out, i.e., by parsing "http://www.yahoo.com.  And then another sentence." I get:

http://www then
http://www.yahoo then
http://www.yahoo.com and finally
http://www.yahoo.com And then another sentence

the first three would pass as a valid web address, the last one wouldn't, so #3 (the longest one that still passes) is my URL.

Check this link to test for valid urls:
http://vbnet.mvps.org/index.html?code/fileapi/pathisurl.htm

let me know if that helps
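
A minimal sketch of that "parse, then test" idea, in Python for illustration only. The check `looks_like_url` is a hypothetical stand-in for a real validity test such as the one linked above: it only looks at syntax (no whitespace, an http scheme, a non-empty host) and never touches the network. Like the description, the sketch only cuts at periods, so a URL that is followed by a space with no period after it is not handled.

```python
from urllib.parse import urlparse


def looks_like_url(candidate):
    """Hypothetical validity test: syntax only, no network access."""
    if any(ch.isspace() for ch in candidate):
        return False
    parts = urlparse(candidate)
    return parts.scheme == "http" and parts.netloc != ""


def longest_valid_prefix(text):
    """Grow the candidate one period at a time; keep the last one that passes."""
    start = text.find("http://")
    if start == -1:
        return None
    best = None
    pos = start + len("http://")
    while True:
        dot = text.find(".", pos)
        end = dot if dot != -1 else len(text)
        candidate = text[start:end]
        if not looks_like_url(candidate):
            break  # the candidate ran into the next sentence
        best = candidate
        if dot == -1:
            break  # reached the end of the text
        pos = dot + 1
    return best


print(longest_valid_prefix("http://www.yahoo.com.  And then another sentence."))
```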
 

Author Comment

by:codemaster3
ID: 16509736
Yeah, I think I might just do this... I had imagined that I would have to resort to something like this, although I had wondered if there was an easier way, something made specifically for this. Thanks for the help anyway :)
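
On the "easier way" point: a regular expression is one commonly used alternative, though it is a different technique from the one accepted in this thread. The sketch below is in Python for illustration only; the pattern, the function name, and the choice to treat commas, quotes, and angle brackets as never being part of a URL are all assumptions, not a complete URL grammar.

```python
import re

# Assumption: a URL runs from "http://" up to the next whitespace, comma,
# double quote, or angle bracket.
URL_PATTERN = re.compile(r'http://[^\s,"<>]+')


def urls_by_regex(text):
    """Find every http:// run, then trim sentence punctuation from its end."""
    urls = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".;:!?')")
        if url != "http://":  # nothing but punctuation followed the prefix
            urls.append(url)
    return urls


text = ('For more info on it, check this page: '
        'http://en.wikipedia.org/wiki/Guqin."')
print(urls_by_regex(text))
```

Unlike splitting on whitespace, this also picks up URLs that are glued to the preceding text or separated only by commas.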
 
LVL 4

Expert Comment

by:g_johnson
ID: 16510021
hey, we tried, right?!   LOL
 

Author Comment

by:codemaster3
ID: 16510070
Hehe, yea, thanks.