
Getting a uniform URL

How do you get a uniform web address using C#?
Example: converting the strings "www.yahoo.com", "http://yahoo.com", "http://yahoo.com/", or any other possible combination into one unique string such as "http://www.yahoo.com". Is there a function out there that I can use?
Asked:
skyrise11
1 Solution
 
DropZone Commented:
What you are asking for is called "canonicalization", from the verb "to canonicalize", meaning to convert something to its base or canonical form. You can use a Regex object to do this. However, there is a problem: you'll have to know for sure what the canonical form is.

For example, "yahoo.com" may be a valid address. It is certainly well-formed as a host name under the RFC that defines URIs, so how do you "know" for sure that it requires a "www" before it? Perhaps it's "w3c.yahoo.com", or maybe "my.yahoo.com". The only way to find out would be to have a list of all known URLs beforehand and look it up, which, we can agree, is not a very practical solution.

You'll also have to consider that the URL may contain a path at the end, or perhaps a query string, such as "http://www.yahoo.com/mypage" or "http://www.yahoo.com/mypage?id=123". These are all valid URLs, so you'll have to make sure that you canonicalize only the domain part.
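As a rough illustration of isolating the domain part, here is a minimal sketch that leans on the framework's System.Uri class instead of a hand-written pattern. The class and method names (UrlNormalizer, GetSiteRoot) are made up for this example, and it assumes that an input without an explicit http or https scheme is meant as http. It deliberately does not decide whether "yahoo.com" and "www.yahoo.com" are the same site.

```csharp
using System;

public static class UrlNormalizer
{
    // Hypothetical helper: returns only the scheme and authority of the input
    // (for example "http://www.yahoo.com"), dropping path, query string, and fragment.
    public static string GetSiteRoot(string input)
    {
        if (string.IsNullOrWhiteSpace(input))
            throw new ArgumentException("URL is empty.", nameof(input));

        string candidate = input.Trim();

        // Assumption: anything that does not start with http:// or https://
        // is a bare host (optionally with a path) and is treated as http.
        bool hasScheme =
            candidate.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
            candidate.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
        if (!hasScheme)
            candidate = "http://" + candidate;

        if (!Uri.TryCreate(candidate, UriKind.Absolute, out Uri uri))
            throw new FormatException("Not a valid absolute URL: " + input);

        // UriPartial.Authority keeps the scheme and host (and a non-default port).
        return uri.GetLeftPart(UriPartial.Authority);
    }
}
```

With this sketch, "www.yahoo.com", "http://www.yahoo.com/" and "http://www.yahoo.com/mypage?id=123" all reduce to the same string, while "http://yahoo.com" stays distinct from "http://www.yahoo.com".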

Once you settle on the specific criteria that you want to evaluate, and you are comfortable that they define the canonical form for your URLs, then it's straightforward to create a regular expression pattern for it. I can help with that.

    -dZ.
 
skyrise11 (Author) Commented:
Well, my goal is basically to extract images from many sites and store them locally for quick access. In order to figure out which images are stored for which sites, I need to store each site's URL. Since yahoo.com, www.yahoo.com, etc. all point to the same site, I want to reduce the number of times I have to parse a site and store its images.

I'm not sure which option would be best.
 
DropZone Commented:
I understand what you want to do, but like I said, there isn't a perfect solution to that without knowing beforehand what the correct URL is.

For example, if you already had "www.yahoo.com" on your list when someone enters "yahoo.com", you could perform a domain search in your list, notice that they match, and complete it. However, what if "yahoo.com" was the first one entered? And what if the two point to different servers? It is very common for web URLs to start with "www", but it is not a universal rule: perhaps "mydomain.com" resolves directly to "images.mydomain.com".

It's a delicate issue. You could require at least a 3-level domain (one of the form "third.second.tld") and perform a match on the existing ones, or you could keep a list of the most common ones you expect users to enter, and canonicalize them.
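The "match against the existing ones" idea could be sketched roughly as follows. SiteRegistry, HostKey and GetOrAdd are hypothetical names, and the rule that a leading "www." is ignored when comparing hosts is only an assumed policy, which is wrong for any site where the two names point to different servers.

```csharp
using System;
using System.Collections.Generic;

public class SiteRegistry
{
    // Maps a comparison key (host without a leading "www.") to the URL first stored for it.
    private readonly Dictionary<string, string> _sites =
        new Dictionary<string, string>(StringComparer.Ordinal);

    // Assumed policy: "www.example.com" and "example.com" are treated as one site.
    public static string HostKey(string host)
    {
        string h = host.Trim().TrimEnd('.').ToLowerInvariant();
        return h.StartsWith("www.", StringComparison.Ordinal) ? h.Substring(4) : h;
    }

    // Returns the URL already stored for an equivalent host,
    // or stores the given URL and returns it if the host is new.
    public string GetOrAdd(string host, string url)
    {
        string key = HostKey(host);
        if (_sites.TryGetValue(key, out string existing))
            return existing;

        _sites[key] = url;
        return url;
    }
}
```

Whichever form is entered first becomes the stored one, so entering "yahoo.com" before "www.yahoo.com" keeps "yahoo.com". Other subdomains such as "my.yahoo.com" keep their own key and are not merged.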

A third option, and perhaps the best one, is to perform the search or the automatic canonicalization and then confirm the result with the user: if the user enters "yahoo.com", present them with "http://www.yahoo.com" and ask if it is correct. Additionally, you could perform an HTTP request to that URL behind the scenes just to make sure it exists and is valid (I do that with an old site directory I used to keep).
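The behind-the-scenes check might look something like the sketch below, which uses HttpClient to send a HEAD request. UrlChecker and ExistsAsync are invented names, and treating any 2xx response (after the client's default redirect handling) as "exists" is an assumption. Some servers reject HEAD, so a real check may need to fall back to GET.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class UrlChecker
{
    // Hypothetical helper: true if the URL is a well-formed http/https address
    // and the server answers a HEAD request with a success status code.
    public static async Task<bool> ExistsAsync(string url, HttpClient client)
    {
        if (!Uri.TryCreate(url, UriKind.Absolute, out Uri uri) ||
            (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps))
        {
            return false; // not something we can request over HTTP
        }

        try
        {
            using (var request = new HttpRequestMessage(HttpMethod.Head, uri))
            using (HttpResponseMessage response = await client.SendAsync(request))
            {
                return response.IsSuccessStatusCode;
            }
        }
        catch (HttpRequestException)
        {
            return false; // DNS failure, refused connection, and similar
        }
        catch (TaskCanceledException)
        {
            return false; // request timed out
        }
    }
}
```

A caller would typically share one HttpClient instance and run this on the candidate URL before showing it to the user for confirmation.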

      -dZ.