?
Solved

What is the best way to parse a URL

Posted on 2010-08-19
6
Medium Priority
?
591 Views
Last Modified: 2013-11-10
What is the best way to parse a url for use in a data warehouse?
 I need to extract the "name of the referrer" and the referrer domain,
which ends up being the domain and subdomain, I think.  
 I have been using a t-sql function, but need something more dynamic .  
0
Comment
Question by:claytonhd
  • 3
  • 2
6 Comments
 
LVL 16

Expert Comment

by:vdr1620
ID: 33480215
try patindex along with substring
0
 
LVL 42

Expert Comment

by:dqmq
ID: 33480282
>I have been using a t-sql function, but need something more dynamic .  

Don't know how you could get more dynamic than that.  What do you mean by "more dynamic".  Perhaps illustrate with examples of the strings you are parsing and what you want to parse out.  Wouldn't hurt to provide your existing function and explain its shortcomings.
0
 

Author Comment

by:claytonhd
ID: 33480387
ok. here is the function which I got off the web, don't remember exactly where (sorry to the person that posted it).
 RETURNS varchar(256)
AS
BEGIN
IF CHARINDEX('http://',@strURL) > 0 OR CHARINDEX('https://',@strURL) > 0
-- Ghetto-tastic
SELECT @strURL = REPLACE(@strURL,'https://','')
SELECT @strURL = REPLACE(@strURL,'http://','')
SELECT @strURL = REPLACE(@strURL,'www','')
-- Remove everything after "/" if one exists
IF CHARINDEX('/',@strURL) > 0 (SELECT @strURL = LEFT(@strURL,CHARINDEX('/',@strURL)-1))

--Optional: Remove subdomains but differentiate between www.google.com and www.google.com.au
IF (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') < 3 -- if there are less than 3 periods
SELECT @strURL = PARSENAME(@strURL,2) + '.' + PARSENAME(@strURL,1)

if (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') = 4 -- if there mor than 3 periods
SELECT @strURL = substring(@strURL,10,len(@strURL))
ELSE -- It's likely a google.co.uk, or google.com.au
SELECT @strURL = PARSENAME(@strURL,3) + '.' + PARSENAME(@strURL,2) + '.' + PARSENAME(@strURL,1)
if SUBSTRING(@strURL,1,1) = '.'
select @strURL = substring(@strURL,2,len(@strURL))

return @strURL
END

The problem is with URL's like:
http://aa.mc503.mail.yahoo.com/mc/welcome?.gx=1&.tm=1263103624&.rand=f0m5a3r281s8l
http://co111w.col111.mail.live.com/mail/InboxLight.aspx?FolderID=0000
I want respectively:
mail.yahoo.com
mail.live.com

I tried adding anothr "if" statement as follows:

if (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') = 4 -- if there are 4 periods
SELECT @strURL = substring(@strURL,10,len(@strURL))

which returns:
mail.yahoo.com
but
l111.mail.live.com

so the problem is I need a flexible way to strip off the varying lengths of preceeding characters. Does that make sense?

I can reverse the url, but then I don't know how to find the length of characters to remove on the end.
0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 42

Expert Comment

by:dqmq
ID: 33486745
After stripping off the front and back, you are left with an expression delimited by a variable number of periods.  It's apparent that you want to reduce it to an expression with only 2 periods.  But I cannot ascertain a reliable pattern for truncating at the front or the back or both.

We could, of course, code for the specific examples you gave, but we will no sooner fix those and you will undoubtedly find more exceptions.  I expect you will be chasing exceptions for a long time and will eventually to the point where fixing one special case breaks the coding for other URLs.  

In other words, unless you can come up with some definitive rules for parsing expressions with more than 2 periods, I think you're between a rock and a hard spot.
0
 

Accepted Solution

by:
claytonhd earned 0 total points
ID: 33486917
A coworker actually came up with a workable solution.  Here it is:
After stripping off the front and back:
1.  Reverse the string.
2.  Find the number of characters to the second dot.
3.  Substring from 1 to the number of characters to the 2nd dot.
4.  Reverse the string again to the correct direction.

Here is the whole function:  

CREATE FUNCTION [dbo].[parseURL]  (@strURL varchar(1000))
RETURNS varchar(256)
AS
BEGIN
declare @chars int;
set @chars = 0

IF CHARINDEX('http://',@strURL) > 0 OR CHARINDEX('https://',@strURL) > 0
-- Ghetto-tastic
SELECT @strURL = REPLACE(@strURL,'https://','')
SELECT @strURL = REPLACE(@strURL,'http://','')
SELECT @strURL = REPLACE(@strURL,'www','')
-- Remove everything after "/" if one exists
IF CHARINDEX('/',@strURL) > 0 (SELECT @strURL = LEFT(@strURL,CHARINDEX('/',@strURL)-1))


SELECT @strURL = reverse(@strURL)
set @chars = CHARINDEX('.', @strURL)
set @chars = @chars+CHARINDEX('.', (substring(@strURL,6,LEN(@strURL))))
set @strURL = substring(@strURL,1,@chars)
SELECT @strURL = reverse(@strURL)


return @strURL
END

Works pretty well.  Still need to add some conditions for less than two dots etc.  
0
 
LVL 42

Expert Comment

by:dqmq
ID: 33489798
If you're happy, i'm happy...but something tells me you will be revisiting your function again and again and again

For example:  if you have "www" in the middle of your URL somewhere
For example: if you have "WWW" instead of "www" or "HTTPS://"

BTW, stripping off the front can be simplified to this (untested, but I think it's good):

IF CHARINDEX('://',@strURL) > 0
  SET @strURL = RIGHT(@strURL,len(@strURL)-CHARINDEX('://',@strURL)-2)
IF left(ucase(@strURL),4)='WWW.'
  SET @strURL = RIGHT(@strURL,len(@strURL)-4)




0

Featured Post

[Webinar] Cloud and Mobile-First Strategy

Maybe you’ve fully adopted the cloud since the beginning. Or maybe you started with on-prem resources but are pursuing a “cloud and mobile first” strategy. Getting to that end state has its challenges. Discover how to build out a 100% cloud and mobile IT strategy in this webinar.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

What if you have to shut down the entire Citrix infrastructure for hardware maintenance, software upgrades or "the unknown"? I developed this plan for "the unknown" and hope that it helps you as well. This article explains how to properly shut down …
It is possible to export the data of a SQL Table in SSMS and generate INSERT statements. It's neatly tucked away in the generate scripts option of a database.
Via a live example, show how to extract information from SQL Server on Database, Connection and Server properties
Viewers will learn how to use the INSERT statement to insert data into their tables. It will also introduce the NULL statement, to show them what happens when no value is giving for any given column.
Suggested Courses

864 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question