Solved

What is the best way to parse a URL

Posted on 2010-08-19
6
548 Views
Last Modified: 2013-11-10
What is the best way to parse a url for use in a data warehouse?
 I need to extract the "name of the referrer" and the referrer domain,
which ends up being the domain and subdomain, I think.  
 I have been using a t-sql function, but need something more dynamic .  
0
Comment
Question by:claytonhd
  • 3
  • 2
6 Comments
 
LVL 16

Expert Comment

by:vdr1620
ID: 33480215
try patindex along with substring
0
 
LVL 42

Expert Comment

by:dqmq
ID: 33480282
>I have been using a t-sql function, but need something more dynamic .  

Don't know how you could get more dynamic than that.  What do you mean by "more dynamic".  Perhaps illustrate with examples of the strings you are parsing and what you want to parse out.  Wouldn't hurt to provide your existing function and explain its shortcomings.
0
 

Author Comment

by:claytonhd
ID: 33480387
ok. here is the function which I got off the web, don't remember exactly where (sorry to the person that posted it).
 RETURNS varchar(256)
AS
BEGIN
IF CHARINDEX('http://',@strURL) > 0 OR CHARINDEX('https://',@strURL) > 0
-- Ghetto-tastic
SELECT @strURL = REPLACE(@strURL,'https://','')
SELECT @strURL = REPLACE(@strURL,'http://','')
SELECT @strURL = REPLACE(@strURL,'www','')
-- Remove everything after "/" if one exists
IF CHARINDEX('/',@strURL) > 0 (SELECT @strURL = LEFT(@strURL,CHARINDEX('/',@strURL)-1))

--Optional: Remove subdomains but differentiate between www.google.com and www.google.com.au
IF (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') < 3 -- if there are less than 3 periods
SELECT @strURL = PARSENAME(@strURL,2) + '.' + PARSENAME(@strURL,1)

if (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') = 4 -- if there mor than 3 periods
SELECT @strURL = substring(@strURL,10,len(@strURL))
ELSE -- It's likely a google.co.uk, or google.com.au
SELECT @strURL = PARSENAME(@strURL,3) + '.' + PARSENAME(@strURL,2) + '.' + PARSENAME(@strURL,1)
if SUBSTRING(@strURL,1,1) = '.'
select @strURL = substring(@strURL,2,len(@strURL))

return @strURL
END

The problem is with URL's like:
http://aa.mc503.mail.yahoo.com/mc/welcome?.gx=1&.tm=1263103624&.rand=f0m5a3r281s8l
http://co111w.col111.mail.live.com/mail/InboxLight.aspx?FolderID=0000
I want respectively:
mail.yahoo.com
mail.live.com

I tried adding anothr "if" statement as follows:

if (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') = 4 -- if there are 4 periods
SELECT @strURL = substring(@strURL,10,len(@strURL))

which returns:
mail.yahoo.com
but
l111.mail.live.com

so the problem is I need a flexible way to strip off the varying lengths of preceeding characters. Does that make sense?

I can reverse the url, but then I don't know how to find the length of characters to remove on the end.
0
Control application downtime with dependency maps

Visualize the interdependencies between application components better with Applications Manager's automated application discovery and dependency mapping feature. Resolve performance issues faster by quickly isolating problematic components.

 
LVL 42

Expert Comment

by:dqmq
ID: 33486745
After stripping off the front and back, you are left with an expression delimited by a variable number of periods.  It's apparent that you want to reduce it to an expression with only 2 periods.  But I cannot ascertain a reliable pattern for truncating at the front or the back or both.

We could, of course, code for the specific examples you gave, but we will no sooner fix those and you will undoubtedly find more exceptions.  I expect you will be chasing exceptions for a long time and will eventually to the point where fixing one special case breaks the coding for other URLs.  

In other words, unless you can come up with some definitive rules for parsing expressions with more than 2 periods, I think you're between a rock and a hard spot.
0
 

Accepted Solution

by:
claytonhd earned 0 total points
ID: 33486917
A coworker actually came up with a workable solution.  Here it is:
After stripping off the front and back:
1.  Reverse the string.
2.  Find the number of characters to the second dot.
3.  Substring from 1 to the number of characters to the 2nd dot.
4.  Reverse the string again to the correct direction.

Here is the whole function:  

CREATE FUNCTION [dbo].[parseURL]  (@strURL varchar(1000))
RETURNS varchar(256)
AS
BEGIN
declare @chars int;
set @chars = 0

IF CHARINDEX('http://',@strURL) > 0 OR CHARINDEX('https://',@strURL) > 0
-- Ghetto-tastic
SELECT @strURL = REPLACE(@strURL,'https://','')
SELECT @strURL = REPLACE(@strURL,'http://','')
SELECT @strURL = REPLACE(@strURL,'www','')
-- Remove everything after "/" if one exists
IF CHARINDEX('/',@strURL) > 0 (SELECT @strURL = LEFT(@strURL,CHARINDEX('/',@strURL)-1))


SELECT @strURL = reverse(@strURL)
set @chars = CHARINDEX('.', @strURL)
set @chars = @chars+CHARINDEX('.', (substring(@strURL,6,LEN(@strURL))))
set @strURL = substring(@strURL,1,@chars)
SELECT @strURL = reverse(@strURL)


return @strURL
END

Works pretty well.  Still need to add some conditions for less than two dots etc.  
0
 
LVL 42

Expert Comment

by:dqmq
ID: 33489798
If you're happy, i'm happy...but something tells me you will be revisiting your function again and again and again

For example:  if you have "www" in the middle of your URL somewhere
For example: if you have "WWW" instead of "www" or "HTTPS://"

BTW, stripping off the front can be simplified to this (untested, but I think it's good):

IF CHARINDEX('://',@strURL) > 0
  SET @strURL = RIGHT(@strURL,len(@strURL)-CHARINDEX('://',@strURL)-2)
IF left(ucase(@strURL),4)='WWW.'
  SET @strURL = RIGHT(@strURL,len(@strURL)-4)




0

Featured Post

Comprehensive Backup Solutions for Microsoft

Acronis protects the complete Microsoft technology stack: Windows Server, Windows PC, laptop and Surface data; Microsoft business applications; Microsoft Hyper-V; Azure VMs; Microsoft Windows Server 2016; Microsoft Exchange 2016 and SQL Server 2016.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Ever wondered why sometimes your SQL Server is slow or unresponsive with connections spiking up but by the time you go in, all is well? The following article will show you how to install and configure a SQL job that will send you email alerts includ…
I have a large data set and a SSIS package. How can I load this file in multi threading?
Via a live example, show how to backup a database, simulate a failure backup the tail of the database transaction log and perform the restore.
Viewers will learn how to use the SELECT statement in SQL to return specific rows and columns, with various degrees of sorting and limits in place.

863 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

22 Experts available now in Live!

Get 1:1 Help Now