Solved

What is the best way to parse a URL

Posted on 2010-08-19
6
557 Views
Last Modified: 2013-11-10
What is the best way to parse a url for use in a data warehouse?
 I need to extract the "name of the referrer" and the referrer domain,
which ends up being the domain and subdomain, I think.  
 I have been using a t-sql function, but need something more dynamic .  
0
Comment
Question by:claytonhd
  • 3
  • 2
6 Comments
 
LVL 16

Expert Comment

by:vdr1620
ID: 33480215
try patindex along with substring
0
 
LVL 42

Expert Comment

by:dqmq
ID: 33480282
>I have been using a t-sql function, but need something more dynamic .  

Don't know how you could get more dynamic than that.  What do you mean by "more dynamic".  Perhaps illustrate with examples of the strings you are parsing and what you want to parse out.  Wouldn't hurt to provide your existing function and explain its shortcomings.
0
 

Author Comment

by:claytonhd
ID: 33480387
ok. here is the function which I got off the web, don't remember exactly where (sorry to the person that posted it).
 RETURNS varchar(256)
AS
BEGIN
IF CHARINDEX('http://',@strURL) > 0 OR CHARINDEX('https://',@strURL) > 0
-- Ghetto-tastic
SELECT @strURL = REPLACE(@strURL,'https://','')
SELECT @strURL = REPLACE(@strURL,'http://','')
SELECT @strURL = REPLACE(@strURL,'www','')
-- Remove everything after "/" if one exists
IF CHARINDEX('/',@strURL) > 0 (SELECT @strURL = LEFT(@strURL,CHARINDEX('/',@strURL)-1))

--Optional: Remove subdomains but differentiate between www.google.com and www.google.com.au
IF (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') < 3 -- if there are less than 3 periods
SELECT @strURL = PARSENAME(@strURL,2) + '.' + PARSENAME(@strURL,1)

if (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') = 4 -- if there mor than 3 periods
SELECT @strURL = substring(@strURL,10,len(@strURL))
ELSE -- It's likely a google.co.uk, or google.com.au
SELECT @strURL = PARSENAME(@strURL,3) + '.' + PARSENAME(@strURL,2) + '.' + PARSENAME(@strURL,1)
if SUBSTRING(@strURL,1,1) = '.'
select @strURL = substring(@strURL,2,len(@strURL))

return @strURL
END

The problem is with URL's like:
http://aa.mc503.mail.yahoo.com/mc/welcome?.gx=1&.tm=1263103624&.rand=f0m5a3r281s8l
http://co111w.col111.mail.live.com/mail/InboxLight.aspx?FolderID=0000
I want respectively:
mail.yahoo.com
mail.live.com

I tried adding anothr "if" statement as follows:

if (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') = 4 -- if there are 4 periods
SELECT @strURL = substring(@strURL,10,len(@strURL))

which returns:
mail.yahoo.com
but
l111.mail.live.com

so the problem is I need a flexible way to strip off the varying lengths of preceeding characters. Does that make sense?

I can reverse the url, but then I don't know how to find the length of characters to remove on the end.
0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 42

Expert Comment

by:dqmq
ID: 33486745
After stripping off the front and back, you are left with an expression delimited by a variable number of periods.  It's apparent that you want to reduce it to an expression with only 2 periods.  But I cannot ascertain a reliable pattern for truncating at the front or the back or both.

We could, of course, code for the specific examples you gave, but we will no sooner fix those and you will undoubtedly find more exceptions.  I expect you will be chasing exceptions for a long time and will eventually to the point where fixing one special case breaks the coding for other URLs.  

In other words, unless you can come up with some definitive rules for parsing expressions with more than 2 periods, I think you're between a rock and a hard spot.
0
 

Accepted Solution

by:
claytonhd earned 0 total points
ID: 33486917
A coworker actually came up with a workable solution.  Here it is:
After stripping off the front and back:
1.  Reverse the string.
2.  Find the number of characters to the second dot.
3.  Substring from 1 to the number of characters to the 2nd dot.
4.  Reverse the string again to the correct direction.

Here is the whole function:  

CREATE FUNCTION [dbo].[parseURL]  (@strURL varchar(1000))
RETURNS varchar(256)
AS
BEGIN
declare @chars int;
set @chars = 0

IF CHARINDEX('http://',@strURL) > 0 OR CHARINDEX('https://',@strURL) > 0
-- Ghetto-tastic
SELECT @strURL = REPLACE(@strURL,'https://','')
SELECT @strURL = REPLACE(@strURL,'http://','')
SELECT @strURL = REPLACE(@strURL,'www','')
-- Remove everything after "/" if one exists
IF CHARINDEX('/',@strURL) > 0 (SELECT @strURL = LEFT(@strURL,CHARINDEX('/',@strURL)-1))


SELECT @strURL = reverse(@strURL)
set @chars = CHARINDEX('.', @strURL)
set @chars = @chars+CHARINDEX('.', (substring(@strURL,6,LEN(@strURL))))
set @strURL = substring(@strURL,1,@chars)
SELECT @strURL = reverse(@strURL)


return @strURL
END

Works pretty well.  Still need to add some conditions for less than two dots etc.  
0
 
LVL 42

Expert Comment

by:dqmq
ID: 33489798
If you're happy, i'm happy...but something tells me you will be revisiting your function again and again and again

For example:  if you have "www" in the middle of your URL somewhere
For example: if you have "WWW" instead of "www" or "HTTPS://"

BTW, stripping off the front can be simplified to this (untested, but I think it's good):

IF CHARINDEX('://',@strURL) > 0
  SET @strURL = RIGHT(@strURL,len(@strURL)-CHARINDEX('://',@strURL)-2)
IF left(ucase(@strURL),4)='WWW.'
  SET @strURL = RIGHT(@strURL,len(@strURL)-4)




0

Featured Post

Ransomware-A Revenue Bonanza for Service Providers

Ransomware – malware that gets on your customers’ computers, encrypts their data, and extorts a hefty ransom for the decryption keys – is a surging new threat.  The purpose of this eBook is to educate the reader about ransomware attacks.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Why is this different from all of the other step by step guides?  Because I make a living as a DBA and not as a writer and I lived through this experience. Defining the name: When I talk to people they say different names on this subject stuff l…
In this article I will describe the Copy Database Wizard method as one possible migration process and I will add the extra tasks needed for an upgrade when and where is applied so it will cover all.
Familiarize people with the process of retrieving data from SQL Server using an Access pass-thru query. Microsoft Access is a very powerful client/server development tool. One of the ways that you can retrieve data from a SQL Server is by using a pa…
Via a live example, show how to set up a backup for SQL Server using a Maintenance Plan and how to schedule the job into SQL Server Agent.

733 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question