Solved

What is the best way to parse a URL

Posted on 2010-08-19
6
543 Views
Last Modified: 2013-11-10
What is the best way to parse a url for use in a data warehouse?
 I need to extract the "name of the referrer" and the referrer domain,
which ends up being the domain and subdomain, I think.  
 I have been using a t-sql function, but need something more dynamic .  
0
Comment
Question by:claytonhd
  • 3
  • 2
6 Comments
 
LVL 16

Expert Comment

by:vdr1620
ID: 33480215
try patindex along with substring
0
 
LVL 42

Expert Comment

by:dqmq
ID: 33480282
>I have been using a t-sql function, but need something more dynamic .  

Don't know how you could get more dynamic than that.  What do you mean by "more dynamic".  Perhaps illustrate with examples of the strings you are parsing and what you want to parse out.  Wouldn't hurt to provide your existing function and explain its shortcomings.
0
 

Author Comment

by:claytonhd
ID: 33480387
ok. here is the function which I got off the web, don't remember exactly where (sorry to the person that posted it).
 RETURNS varchar(256)
AS
BEGIN
IF CHARINDEX('http://',@strURL) > 0 OR CHARINDEX('https://',@strURL) > 0
-- Ghetto-tastic
SELECT @strURL = REPLACE(@strURL,'https://','')
SELECT @strURL = REPLACE(@strURL,'http://','')
SELECT @strURL = REPLACE(@strURL,'www','')
-- Remove everything after "/" if one exists
IF CHARINDEX('/',@strURL) > 0 (SELECT @strURL = LEFT(@strURL,CHARINDEX('/',@strURL)-1))

--Optional: Remove subdomains but differentiate between www.google.com and www.google.com.au
IF (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') < 3 -- if there are less than 3 periods
SELECT @strURL = PARSENAME(@strURL,2) + '.' + PARSENAME(@strURL,1)

if (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') = 4 -- if there mor than 3 periods
SELECT @strURL = substring(@strURL,10,len(@strURL))
ELSE -- It's likely a google.co.uk, or google.com.au
SELECT @strURL = PARSENAME(@strURL,3) + '.' + PARSENAME(@strURL,2) + '.' + PARSENAME(@strURL,1)
if SUBSTRING(@strURL,1,1) = '.'
select @strURL = substring(@strURL,2,len(@strURL))

return @strURL
END

The problem is with URL's like:
http://aa.mc503.mail.yahoo.com/mc/welcome?.gx=1&amp;.tm=1263103624&amp;.rand=f0m5a3r281s8l
http://co111w.col111.mail.live.com/mail/InboxLight.aspx?FolderID=0000
I want respectively:
mail.yahoo.com
mail.live.com

I tried adding anothr "if" statement as follows:

if (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') = 4 -- if there are 4 periods
SELECT @strURL = substring(@strURL,10,len(@strURL))

which returns:
mail.yahoo.com
but
l111.mail.live.com

so the problem is I need a flexible way to strip off the varying lengths of preceeding characters. Does that make sense?

I can reverse the url, but then I don't know how to find the length of characters to remove on the end.
0
Maximize Your Threat Intelligence Reporting

Reporting is one of the most important and least talked about aspects of a world-class threat intelligence program. Here’s how to do it right.

 
LVL 42

Expert Comment

by:dqmq
ID: 33486745
After stripping off the front and back, you are left with an expression delimited by a variable number of periods.  It's apparent that you want to reduce it to an expression with only 2 periods.  But I cannot ascertain a reliable pattern for truncating at the front or the back or both.

We could, of course, code for the specific examples you gave, but we will no sooner fix those and you will undoubtedly find more exceptions.  I expect you will be chasing exceptions for a long time and will eventually to the point where fixing one special case breaks the coding for other URLs.  

In other words, unless you can come up with some definitive rules for parsing expressions with more than 2 periods, I think you're between a rock and a hard spot.
0
 

Accepted Solution

by:
claytonhd earned 0 total points
ID: 33486917
A coworker actually came up with a workable solution.  Here it is:
After stripping off the front and back:
1.  Reverse the string.
2.  Find the number of characters to the second dot.
3.  Substring from 1 to the number of characters to the 2nd dot.
4.  Reverse the string again to the correct direction.

Here is the whole function:  

CREATE FUNCTION [dbo].[parseURL]  (@strURL varchar(1000))
RETURNS varchar(256)
AS
BEGIN
declare @chars int;
set @chars = 0

IF CHARINDEX('http://',@strURL) > 0 OR CHARINDEX('https://',@strURL) > 0
-- Ghetto-tastic
SELECT @strURL = REPLACE(@strURL,'https://','')
SELECT @strURL = REPLACE(@strURL,'http://','')
SELECT @strURL = REPLACE(@strURL,'www','')
-- Remove everything after "/" if one exists
IF CHARINDEX('/',@strURL) > 0 (SELECT @strURL = LEFT(@strURL,CHARINDEX('/',@strURL)-1))


SELECT @strURL = reverse(@strURL)
set @chars = CHARINDEX('.', @strURL)
set @chars = @chars+CHARINDEX('.', (substring(@strURL,6,LEN(@strURL))))
set @strURL = substring(@strURL,1,@chars)
SELECT @strURL = reverse(@strURL)


return @strURL
END

Works pretty well.  Still need to add some conditions for less than two dots etc.  
0
 
LVL 42

Expert Comment

by:dqmq
ID: 33489798
If you're happy, i'm happy...but something tells me you will be revisiting your function again and again and again

For example:  if you have "www" in the middle of your URL somewhere
For example: if you have "WWW" instead of "www" or "HTTPS://"

BTW, stripping off the front can be simplified to this (untested, but I think it's good):

IF CHARINDEX('://',@strURL) > 0
  SET @strURL = RIGHT(@strURL,len(@strURL)-CHARINDEX('://',@strURL)-2)
IF left(ucase(@strURL),4)='WWW.'
  SET @strURL = RIGHT(@strURL,len(@strURL)-4)




0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

JSON is being used more and more, besides XML, and you surely wanted to parse the data out into SQL instead of doing it in some Javascript. The below function in SQL Server can do the job for you, returning a quick table with the parsed data.
Slowly Changing Dimension Transformation component in data task flow is very useful for us to manage and control how data changes in SSIS.
Using examples as well as descriptions, and references to Books Online, show the different Recovery Models available in SQL Server and explain, as well as show how full, differential and transaction log backups are performed
Via a live example, show how to shrink a transaction log file down to a reasonable size.

758 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

13 Experts available now in Live!

Get 1:1 Help Now