?
Solved

What is the best way to parse a URL

Posted on 2010-08-19
6
Medium Priority
?
575 Views
Last Modified: 2013-11-10
What is the best way to parse a url for use in a data warehouse?
 I need to extract the "name of the referrer" and the referrer domain,
which ends up being the domain and subdomain, I think.  
 I have been using a t-sql function, but need something more dynamic .  
0
Comment
Question by:claytonhd
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 3
  • 2
6 Comments
 
LVL 16

Expert Comment

by:vdr1620
ID: 33480215
try patindex along with substring
0
 
LVL 42

Expert Comment

by:dqmq
ID: 33480282
>I have been using a t-sql function, but need something more dynamic .  

Don't know how you could get more dynamic than that.  What do you mean by "more dynamic".  Perhaps illustrate with examples of the strings you are parsing and what you want to parse out.  Wouldn't hurt to provide your existing function and explain its shortcomings.
0
 

Author Comment

by:claytonhd
ID: 33480387
ok. here is the function which I got off the web, don't remember exactly where (sorry to the person that posted it).
 RETURNS varchar(256)
AS
BEGIN
IF CHARINDEX('http://',@strURL) > 0 OR CHARINDEX('https://',@strURL) > 0
-- Ghetto-tastic
SELECT @strURL = REPLACE(@strURL,'https://','')
SELECT @strURL = REPLACE(@strURL,'http://','')
SELECT @strURL = REPLACE(@strURL,'www','')
-- Remove everything after "/" if one exists
IF CHARINDEX('/',@strURL) > 0 (SELECT @strURL = LEFT(@strURL,CHARINDEX('/',@strURL)-1))

--Optional: Remove subdomains but differentiate between www.google.com and www.google.com.au
IF (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') < 3 -- if there are less than 3 periods
SELECT @strURL = PARSENAME(@strURL,2) + '.' + PARSENAME(@strURL,1)

if (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') = 4 -- if there mor than 3 periods
SELECT @strURL = substring(@strURL,10,len(@strURL))
ELSE -- It's likely a google.co.uk, or google.com.au
SELECT @strURL = PARSENAME(@strURL,3) + '.' + PARSENAME(@strURL,2) + '.' + PARSENAME(@strURL,1)
if SUBSTRING(@strURL,1,1) = '.'
select @strURL = substring(@strURL,2,len(@strURL))

return @strURL
END

The problem is with URL's like:
http://aa.mc503.mail.yahoo.com/mc/welcome?.gx=1&.tm=1263103624&.rand=f0m5a3r281s8l
http://co111w.col111.mail.live.com/mail/InboxLight.aspx?FolderID=0000
I want respectively:
mail.yahoo.com
mail.live.com

I tried adding anothr "if" statement as follows:

if (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') = 4 -- if there are 4 periods
SELECT @strURL = substring(@strURL,10,len(@strURL))

which returns:
mail.yahoo.com
but
l111.mail.live.com

so the problem is I need a flexible way to strip off the varying lengths of preceeding characters. Does that make sense?

I can reverse the url, but then I don't know how to find the length of characters to remove on the end.
0
Get MySQL database support online, now!

At Percona’s web store you can order your MySQL database support needs in minutes. No hassles, no fuss, just pick and click. Pay online with a credit card.

 
LVL 42

Expert Comment

by:dqmq
ID: 33486745
After stripping off the front and back, you are left with an expression delimited by a variable number of periods.  It's apparent that you want to reduce it to an expression with only 2 periods.  But I cannot ascertain a reliable pattern for truncating at the front or the back or both.

We could, of course, code for the specific examples you gave, but we will no sooner fix those and you will undoubtedly find more exceptions.  I expect you will be chasing exceptions for a long time and will eventually to the point where fixing one special case breaks the coding for other URLs.  

In other words, unless you can come up with some definitive rules for parsing expressions with more than 2 periods, I think you're between a rock and a hard spot.
0
 

Accepted Solution

by:
claytonhd earned 0 total points
ID: 33486917
A coworker actually came up with a workable solution.  Here it is:
After stripping off the front and back:
1.  Reverse the string.
2.  Find the number of characters to the second dot.
3.  Substring from 1 to the number of characters to the 2nd dot.
4.  Reverse the string again to the correct direction.

Here is the whole function:  

CREATE FUNCTION [dbo].[parseURL]  (@strURL varchar(1000))
RETURNS varchar(256)
AS
BEGIN
declare @chars int;
set @chars = 0

IF CHARINDEX('http://',@strURL) > 0 OR CHARINDEX('https://',@strURL) > 0
-- Ghetto-tastic
SELECT @strURL = REPLACE(@strURL,'https://','')
SELECT @strURL = REPLACE(@strURL,'http://','')
SELECT @strURL = REPLACE(@strURL,'www','')
-- Remove everything after "/" if one exists
IF CHARINDEX('/',@strURL) > 0 (SELECT @strURL = LEFT(@strURL,CHARINDEX('/',@strURL)-1))


SELECT @strURL = reverse(@strURL)
set @chars = CHARINDEX('.', @strURL)
set @chars = @chars+CHARINDEX('.', (substring(@strURL,6,LEN(@strURL))))
set @strURL = substring(@strURL,1,@chars)
SELECT @strURL = reverse(@strURL)


return @strURL
END

Works pretty well.  Still need to add some conditions for less than two dots etc.  
0
 
LVL 42

Expert Comment

by:dqmq
ID: 33489798
If you're happy, i'm happy...but something tells me you will be revisiting your function again and again and again

For example:  if you have "www" in the middle of your URL somewhere
For example: if you have "WWW" instead of "www" or "HTTPS://"

BTW, stripping off the front can be simplified to this (untested, but I think it's good):

IF CHARINDEX('://',@strURL) > 0
  SET @strURL = RIGHT(@strURL,len(@strURL)-CHARINDEX('://',@strURL)-2)
IF left(ucase(@strURL),4)='WWW.'
  SET @strURL = RIGHT(@strURL,len(@strURL)-4)




0

Featured Post

Free Tool: SSL Checker

Scans your site and returns information about your SSL implementation and certificate. Helpful for debugging and validating your SSL configuration.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Slowly Changing Dimension Transformation component in data task flow is very useful for us to manage and control how data changes in SSIS.
In the first part of this tutorial we will cover the prerequisites for installing SQL Server vNext on Linux.
This video shows, step by step, how to configure Oracle Heterogeneous Services via the Generic Gateway Agent in order to make a connection from an Oracle session and access a remote SQL Server database table.
This video shows how to set up a shell script to accept a positional parameter when called, pass that to a SQL script, accept the output from the statement back and then manipulate it in the Shell.

764 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question