Link to home
Start Free TrialLog in
Avatar of jobprojn
jobprojnFlag for United States of America

asked on

Function for FAST String Parse and Summation

We've got a table that stores monthly cost balances in a semi-colon delimited field, where each delimiter represents a month of the year. The data in the field looks like this:

100;100;100;100;100;100;100;100;100;100;0;50

We need a fast function to sum the balances for all months up to the specified period.  For example, when month 6 is specified it should sum to $600, when month 12 is specified it should sum to $1050.  

We currently sum this up by running a basic split function that parses each part of the string and then adds up those parsed numbers.  The table that stores these values contains many many records and the result is becoming very slow.  The function we're currently running is below.

Does anyone have ideas for a FASTER method for splitting and summing the string values?  Any suggestions are appreciated.

CREATE FUNCTION [dbo].[Split] ( @CharacterExpression VARCHAR(8000), @Delimiter CHAR(1), @Position INTEGER)
 
RETURNS VARCHAR(8000)
AS
BEGIN
 
If @Position<1 return null
if len(@Delimiter)<>1 return null
declare @Start integer
set @Start=1
while @Position>1
 
BEGIN
 
Set @Start=ISNULL(CHARINDEX(@Delimiter, @CharacterExpression, @Start),0)
IF @Start=0 return null
set @position= @position-1
set @Start=@Start+1
 
END
 
Declare @End INTEGER
Set @End= ISNULL(CHARINDEX(@Delimiter, @CharacterExpression, @Start),0)
If @End=0 Set @End=LEN(@CharacterExpression)+1
RETURN SUBSTRING(@CharacterExpression, @Start, @End-@Start)
 
END

Open in new window

Avatar of Kevin Cross
Kevin Cross
Flag of United States of America image

You can go with a simplier split modified from your above like http://crossedlogic.blogspot.com/2008/09/converting-delimited-string-to-separate.html -- can use while loop instead of recursion -- and would add an column in table that signified the order in the original string which wll translate to month like 1, 2, 3 ... 12.  I saw an example in a previous question that I will post to shorten development time.

You can then do sum outside of function.
SELECT SUM(CAST(entry AS INT))
FROM dbo.split('100;100;100;100;100;100;100;100;100;100;0;50', ';')
WHERE indexColumn <= 6

Open in new window

SOLUTION
Avatar of Kevin Cross
Kevin Cross
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of adlink_la
adlink_la

Yes, the bottome line is that there really isn't any "good" way to get at that data.  You can see how much code is required.  If you want fast your schema should allow the data to be accessed without any procedural code.  I don't know if you are able to make a schema change but that would be the way to go if you are having performance problems.

My split function is REALLY fast.  To speed it up, create a physical table in place of vw_nums with a clustered index on the number.  That's what we have in our system.
I agree with Brandon as that is what I was showing user in my comments.  The point of the question I think was creating code that allows calculation/split to occur fast during processing NOT what is fastest way to code it if that makes sense.
create table #t
(
      id int identity,
      val int
)
insert  #t
SELECT CAST(data AS INT)
FROM dbo.split(@CharacterExpression , @Delimiter )

select sum(val) from #t
where id < @Position
jindalankush:

I'm not saying that yours won't work in the same way, but yours uses looping to find all the individual members and that is not an efficient way to do it.
Trust me I know because my first delimited string parser that I wrote worked exactly the same way :).
>>The point of the question I think was creating code that allows calculation/split to occur fast during processing NOT what is fastest way to code it if that makes sense.

Yes, but seeing that we like to refer to ourselves as the "Experts" around here, I think we have an obligation to point things like that out.  To simply provide a solution and ignore an obvious problem is doing a disservice to the person asking the question.  Now granted, the OP has not stated whether he/she has the ability (or desire) to make a change to the schema.  They could be working with a 3rd party product and not have that luxury.  Nonetheless, the point should still be made.

That said, I find Brandon's solution very interesting.  I haven't dug into it yet to totally understand what it is doing but I like the fact that it does not use any looping.  For those times when we do need to parse data out of a column, it looks like a very clean solution.  I am definitely going to build that table and give it a try.  Very nice!
Yes, common table expressions are very cool.  The link in my first post doesn't use loops either but uses recursion within function.  Also very fast and low amount of code for what it is doing; however, it doesn't match up to the coolness that is CTE. :)
WOOHOOO!  I LOVE IT!!  I just built the numbers table and tested the function.  Very nice the way the output table gives the the ident, position of the value and the value itself.  This is going to be a handy function to have in the arsenal.  Thanks Brandon!  
np.

It has gone through several iterations.  From single character wildcards to multi.  from varchar to nvarchar.  I don't know if this is the most recent that supports space(s) as a delimiter but I have a rendition somewhere that does.  It was used a while back when someone needed to could the number of times each word appeared in a body of text.  I replaced all ,.() with spaces, stripped out double spaces, then used that function as a way to split each body of text into a table of words, sum each and store a counter.  I called it wordify.sql :)

The performance of a static table vs. a cte is big.  While the CTE to generate the #'s is fast, nothing compared to a real table.  But, the CTE I have above is a great way to populate that table if #s.


as a side not adlink_la.  If you do stash that goodie away in case you intend to re-post it here, can you throw my name in somewhere as the author.  I'd appreciate the little bit of advertisement for writing something useful.
hi BrandonGalderisi:
you are right,
i hv read ur solution before my post,
i hv tried to simplfy ur process,
sorry for inconvience
Brandon,
If it ever comes up, rather than reposting the code, I will likely just refer back to this thread.  

At any rate, I have been digging into the code a bit to see exactly what it is doing.  It looks to me like you might want to replace the two occurances of:

len(@delimiter)

with:

(datalength(@delimiter)/2)

The reason I mention it is the LEN() function ignores trailing blanks, so if the delimiter is like *   then the function returns incorrect results.  However, the above change fixes that problem.  

Do you agree?

I guess my quotation marks got replaced by those little boxes.  I was trying to show astrisk, space, space as the delimiter.
You're right.  At first it was all len().  I thought I replaced them all.
Avatar of jobprojn

ASKER

You guys rock.  Thanks.
>>I'd appreciate the little bit of advertisement for writing something useful.

Useful?  You deserve an award for this little nugget my friend.  One thing that I try to impress upon my fellow developers is to try to approach database problems from a database standpoint rather than from a programmer standpoint.  Often the database can do things much more efficiently.  This is a perfect example of that.  What better way to get at the values of a delimited string than to just select them out in one shot!  Absolutely brilliant!  Very nicely done!
Just know that the CTE in the view I use I found online, I just don't recall where.  But that is easily replaced with a real numbers table which is what we have in our system.

The numbers is very useful for displaying date ranges as well.