Avatar of dbaSQL
dbaSQL

asked on

v2008 - enhancements tSQL (i hope)

the best means of converting a varchar(16) string to a unique integer
function or otherwise

any ideas?
Avatar of chapmandew
chapmandew

what are you trying to accomplish?  have you tried newid()?
Avatar of dbaSQL

ASKER

yep. unfortunately newid isn't going to work.  at least what i know of it, it won't work
i have a mammoth dataset.  way back when it was designed they were unable to give me integer values for the record ID attributes.  instead, we went with varchar(100).  lovely.
i got them to trim it down over time, but the overhead/performance and accuracy of this ID is crap.

they've finally agreed upstream to push me a numeric unique ID.  this applies, of course, to the new data going forward, once my new design is in place.  for the legacy data, i need to dump the ID string into a table, gen a unique ID per record in the existing dataset, and use this in my new ID attribute, which will be a bigint.

mammoth = 3B records

fortunately, i'm in process of moving to 08.  i've built a partitioned structure, and am feeling many of the gains on the new engine, in terms of performance/speed/manageability, etc.  

i have done some digging on cast/convert.  today i fell upon several docs on the 'tSQL enhancements in 08'.   i know there are ample functions simply to do the conversion.  but i am unsure of how to do so uniquely.  
Avatar of dbaSQL

ASKER

but it can't just be random unique number.  it must be converted from the existing ID strings, per record.  this will then be used to map the data to the old string, from the new front end, for reporting/analysis/troubleshooting...  that sort of thing
ASKER CERTIFIED SOLUTION
Avatar of simonsabin
simonsabin

Avatar of dbaSQL

ASKER

yep, i thought of that first thing.  i've been asked, however, to do something different.  in retro, however, i just don't see why the identity wouldn't work.  i will revoice that tomorrow, we'll see what i get.  i'm being asked to ' just convert that string to a 64bit integer '
The most important consideration will be: "Is anything dependent on the values already established?" If the answer is no, then I would be inclined to rename the old one if needed historically, and create a new identity column (bigint).
And if creating a new identity column, then consider making that the clustered primary key - but with 3 billion rows, those types of restructures are going to take a while. If the current "value" is already numeric, then you could try simply changing it to type bigint and making it an identity. Could do a select with a convert to bigint to see if it crashes and bombs - that would be the acid test... Make sure you have plenty of tempdb space if changing structure like that.
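A minimal sketch of that identity route - table and column names here (big_table, old_idvalue, new_id) are hypothetical, so adjust to the real schema:

```sql
-- Sketch only: add a surrogate bigint identity alongside the old char key.
-- Adding the column rewrites every row, so on 3 billion rows expect a
-- long-running operation and plenty of log/tempdb usage.
alter table big_table add new_id bigint identity(1,1) not null;

-- Optionally make it the clustered primary key afterwards:
alter table big_table
    add constraint pk_big_table primary key clustered (new_id);
```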
Avatar of simonsabin
simonsabin

It's just not possible.

100 chars = 100 bytes
64-bit int = 8 bytes

8 <> 100

If you have values 8 chars or less then you could try

cast(cast('12345678' as varbinary(8))as bigint)

@simonsabin: correct, but it is kinda dependent on what is in that column more so than the characteristics, can always kill trailing spaces...

declare @char char(20)   -- doing 20 for visual size sake
set @char = '123456'

select convert(bigint,@char),'|'+@char+'|'
and is not really limited to 8 bytes as such, it is limited to values that can be represented... e.g.


declare @char char(20)   -- doing 20 for size sake
set @char = '1234567890123456'

select convert(bigint,@char),'|'+@char+'|'
You are assuming that he has numbers in his char column. The fact he wants to generate integers suggests he doesn't have integers to start with.
Therefore converting 'ABCDEFGH' to a number is more difficult.

The converting to varbinary works with character data as well.


select cast(cast(cast(cast('ABCEDFGH' as varbinary(8))as bigint) as varbinary(8))as varchar(10))
Avatar of dbaSQL

ASKER

Thank you each for the input.  And yes, simonsabin, you are correct.  The string is mixed alpha numeric.  

I'm forcing the issue on identity.  I am being fought, so for now, I still need to pursue the conversion.  Unfortunately, I still don't have it.

I pumped two records into a temp table.  these are the ID values:

717JHNSZD695
717JHNSZD696

select IDvalue, cast(cast(IDvalue as varbinary(8))as bigint)    from temptable gives this:
717JHNSZD695          3977020738125058906
717JHNSZD696          3977020738125058906

Note, the casted value is the same, yet the source ID differs.

and simonsabin, your last post returns:   717JHNSZ

Surely there is a means of converting the alpha numeric string to integer.  Uniquely?  Hard to say... but, just the conversion?
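The duplicates fall out of truncation: a cast to varbinary(8) keeps only the first 8 bytes, and both IDs agree through '717JHNSZ', so they encode identically. A quick sketch reproducing the effect:

```sql
-- cast to varbinary(8) silently truncates to the first 8 bytes,
-- so any two IDs sharing their first 8 characters collide.
select cast(cast('717JHNSZD695' as varbinary(8)) as bigint) as id1,
       cast(cast('717JHNSZD696' as varbinary(8)) as bigint) as id2;
-- both columns return the same value, 3977020738125058906
```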
need to increase your scale...

select IDvalue, cast(cast(IDvalue as varbinary(12))as bigint)    
from (
select '717JHNSZD695' as idvalue union all
select '717JHNSZD696') a
What is the 'longest' example ?

can always do a conversion, it is a question of will it fit...
Nah, that doesn't convert it - the cast to bigint just takes the last 8 bytes of the varbinary.

bigint does go up to a value of 9,223,372,036,854,775,807 so big enough to cover 3 billion rows, it just depends on the content of that column.

Avatar of dbaSQL

ASKER

by 'longest', do you just mean data length, mark?  the attribute needing conversion is char(16)
Without some compression it's not possible.

A bigint is just not big enough.

Each char represents a value from A-Z and 0-9 - that's 36 possibilities (assuming not case sensitive, and only numbers or letters). To hold a value up to 36 requires 6 bits. A bigint has 8 bytes, so you have 64 bits, allowing for 10 six-bit values. So the max you can hold is 10 chars. If you only have A-Z then the size required reduces to 5 bits, so you can hold 12 characters.
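The arithmetic above can be sanity-checked directly - the space of 16-character alphanumeric strings is roughly six orders of magnitude larger than the bigint range, so no lossless 1:1 mapping can exist:

```sql
-- 36^16 possible char(16) values (~7.96e24) vs the bigint maximum (~9.22e18)
select power(cast(36 as float), 16)       as distinct_char16_values,
       cast(9223372036854775807 as float) as bigint_max;
```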

The mapping table with an identity column is the correct solution.

Especially if you have Enterprise edition and are using 2008 as you can compress your small bigint values to less than 8 bytes
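A sketch of that 2008 Enterprise option, applied to the mapping table discussed in this thread (treat the table name as illustrative):

```sql
-- Row compression (SQL Server 2008 Enterprise) stores small bigint
-- identity values in fewer than 8 bytes on disk.
alter table my_idvalue_identity_map
    rebuild with (data_compression = row);
```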
SOLUTION
A decimal(32) takes up 17 bytes, which is 1 more than using char(16) - which defeats the reason for converting to an integer.
I don't care about size (well yes, I do), all I am doing is giving a possible solution to converting from char to numeric as per:

"I'm forcing the issue on identity.  I am being fought, so for now, I still need to pursue the conversion.  Unfortunately, I still don't have it. "

I think everyone is in total agreement that identity is the single best option, purpose built for this type of scenario, but there is a battle to be fought...

You would think that having two MVPs (though chapmandew has gone quiet) and a mere expert giving advice would back up a strong case for identity...

Cheers,
Mark Wills (not an MVP)
Avatar of dbaSQL

ASKER

Yes, there is a battle to be fought... unfortunately.  I am, however, actively creating the identity.  I've already created the mapping table, ID, String;  the ID is bigint identity(1,1), the string is going in now in batches of a number i am embarrassed to post.  (i'm the only one on the box right... no biggie....)

but.  I am a little surprised... SQL is SQL.  as you all know, it can do some pretty crazy/clever things, far beyond simply storing the data.  I will be very surprised if this is truly un-doable.

the function i have not looked at yet.  horribly slow and horrid is a problem, given the magnitude of this dataset.   but.  i will keep working at it and see what i can do.

worst case - or best, depending on where you're sitting -- the identity 'conversion' is being created now
I have not followed all of this, but if bigint is not big enough you can always define a numeric(38, 0) field as an IDENTITY column; you can then have up to:
99,999,999,999,999,999,999,999,999,999,999,999,999
The whole point is that using a char(16) column as a PK is not good, as it is large (16 bytes). Changing to a bigint results in an 8-byte PK.

Use of numeric(38,0) uses 17 bytes and so is worse than using char(16)
Fair enough.  As I stated, I did not read it all through.  I just wanted to make sure that people were not under the illusion that only integers could be IDENTITY.
The identity conversion should be "easy", but why is there a need to map ?

select identity(bigint,1,1),  idvalue into my_idvalue_identity_map from my_blud_dee_big_table    -- (excuse the nomenclature - seemed to fit at the time)

As for the function, 'horrid' refers to the resulting number; in terms of speed, it depends on the content - try it out...

select top 30000 IDvalue, dbo.udf_convert_to_number(idvalue) from my_same_table_as_above    -- (see, context sensitive table names :)

But then there is the size problem, and being too big is not good...

Suppose you could run through that function, pick up the lowest calculated number and use that as number '1', so long as all that is required is uniqueness. Or, if there are at least 4 numeric digits, you can subtract 48, which would take it from a potentially 32-digit to a 28-digit number - that's 13 bytes (and fits into a page a bit better)


User-defined functions are evil from a performance perspective: http://sqlblogcasts.com/blogs/simons/archive/2008/11/03/TSQL-Scalar-functions-are-evil-.aspx

This will be especially true in a bulk operation.

The map is essential if you have child tables.
Nice article... Did you just back-date that?

While it might be true, sometimes a scalar function is worth the cost, maybe for no other reason than to prototype. Unless of course you are a developer as well and can do CLR off the top of your head - an area still largely unvisited by a lot of DBAs (sounds like an article in the making).

Surely in a one-off event such as this you might consider toning down the "evil" aspect ;)

The child tables / dependencies still worry me a lot - I don't think we got an answer to that comment back up toward the top...
Avatar of dbaSQL

ASKER

'....from my_blud_dee_big_table ...'
I just love that.


Still on it... back soon with status.
http://sqlblogcasts.com/blogs/simons/archive/2005/01/17/User-Defined-Function-performance-comparison.aspx

Doing a CLR function is very, very easy.

I would never touch a multi-statement function - they really are evil.

For anything involving more than one row, I would never use a user-defined function unless you can use the table-valued function subquery trick.
Avatar of dbaSQL

ASKER

production outage brewing with my v2008 instance.  back whenever i have resolved
Simon, totally capitulate on UDFs...

Not all DBAs can do CLR - have asked several and they basically shrug. A lot of them are pissed that they even have to get into VS for a variety of things...

How are you on linked servers? 2008 64-bit to 2000, and getting intermittent timeouts (deadlocks) increasing in frequency...

dbaSQL has another thread running, and it is nearly 6:00am down here and need some shut eye, maybe you could pop in and help ?

https://www.experts-exchange.com/questions/24389116/SQL-v2000-vs-v2008-maybe-a-big-problem.html?cid=1066&anchorAnswerId=24329844#a24329844
That just bugs me. It's soooo easy and they will get better performance. It's like someone saying they are going to travel by horseback because they can't be bothered to learn to drive a car.

Spend a little time and reap the rewards.
>>That just bugs me. It's soooo easy and they will get better performance.<<
Then you have not been in this industry very long:  You may have discovered the best thing since sliced bread, but if it does not sell ...

CLR in SQL Server has been out for nearly 5 years now and has had zero traction.  Just do a search on this site for CLR questions and see for yourself.

I hate to be the harbinger of bad news, but as they say around here, perhaps you should wake up and smell the coffee.
Why does something bugging me indicate I haven't been in this industry very long? I was No. 1 in this forum 10 years ago.

I never said it had traction. I just said that if you want performance then CLR functions are what you should be using, not TSQL user-defined functions.

and just to split hairs, it's coming up on being out for 4 years in November this year.
>>Why if something bugs me indicates I haven't been in this industry very long. <<
Is that what you understood?  Perhaps you should re-read my comments.

>>I was No 1 in this forum 10 years ago.<<
I know I participated in some of your questions back then and then you disappeared for a long time only to show up recently like Rip Van Winkle. :)

>>and just to split hairs its coming up for being out for 4 years in November this year.<<
A few of us don't wait for the RC to come out before we explore all the new features.
I remember your handle now.

I lost interest; I couldn't compete with the likes of angel. I wasn't enjoying it and I felt I should focus on my day job. It was interesting that EE wasn't noticed by Microsoft back then.

I was just being pedantic :) Whilst many get their hands dirty, many don't until SP1 is out.

I would be interested to see how long it took, if it ever has, for Java to be taken on board in Oracle systems.

I think MS have missed the boat for this and other features like Service Broker. Although the latter has more traction than CLR IMHO.

I also still find companies on 2000 and I think that is hurting 2005 features. Many apps are also just upgraded and not adapted to new technologies.
Hi guys, back again....

There are also some companies still running NT.  I was a gold partner a while back, and we had various incentives to get people to start using the new technology. A lot were licensed, but never installed, and very reluctant to actually take the next step. MS do not make it overly easy, and in fact dbaSQL is suffering as a result in trying to keep a production site up and running while trickle-migrating to SQL 2008 from 2000. Those stories and experiences are all too commonplace - regardless of cause, regardless of ability, regardless of technology.

The main problem is DBAs have been fighting Programmers since the day dot (maybe not, but seems that way), and now MS are saying thou shalt use Programmer Technology to get thy DBA job done. Sometimes it is "easier" to stick to one's knitting on tried and proven (albeit without performance gains) approaches and methodologies. CLR does not have the takeup it should. There are a lot of technologies within the MS world that fall into the same category. Cool features, but hey, let someone else do the hard yards first, and then if we really need to, we can think about it. MS do not do themselves any great favours that way. Remember back the first time you went to do a new DTS thingy in SQL Server - bet any money (have none) that the temptation or first move was to fire up this great all-encompassing Management Studio.

I don't think it is horse and cart, but more like a point-a-to-point-b school-and-shops jalopy versus a Ferrari.

P.S. simon, thanks for getting involved in that other thread :)

Avatar of dbaSQL

ASKER

I've gone w/the IDENTITY approach.  i have the udf in place, too, but am unable to really validate its use.
no biggie... we all know the IDENTITY is more appropriate anyway

(mr. wills, if you could give me a pointer or two on testing the udf, i would be grateful)

I will split amongst simonsabin and mark wills, and close the inquiry
thank you for all of the input
Sure, always happy to help...



-- modify function to include an offset so numbers remain at single digit.
 
ALTER function udf_convert_to_number(@str varchar(100), @offset char(1))
returns decimal(32,0)
as
begin
 
   declare @s varchar(100)   -- accumulated digit string
   declare @x varchar(100)   -- cleaned-up input
   declare @i int            -- loop position
   declare @o int            -- ascii offset (48 maps '0'-'9' back to 0-9)
 
   set @x = upper(rtrim(left(@str,16)))
 
   -- reject anything outside 0-9 / A-Z
   if patindex('%[^0-9A-Z]%',@x) > 0 return -1
 
   if isnull(@offset,'N') = 'Y' set @o = 48 else set @o = 0
 
   set @s = ''
   set @i = 1
 
   -- concatenate the (offset) ascii code of each character
   while @i <= len(@x)
   begin
      set @s = @s + convert(varchar, ascii(substring(@x,@i,1)) - @o)
      set @i = @i + 1
   end
   return @s   -- implicitly converted to decimal(32,0)
end
GO
 
-- now create our mapping table from the source data
 
if object_id('my_idvalue_identity_map','U') is not null drop table my_idvalue_identity_map  -- WARNING drops a table
 
select identity(bigint,1,1) as id,idvalue into my_idvalue_identity_map
from (
select '012-3456789' as idvalue union all
select '0123456789' as idvalue union all
select 'ABCDEFGHIJKLM' as idvalue union all
select 'NOPQRSTUVWXYZ' as idvalue union all
select 'SZD695' as idvalue union all
select '21SZD695' as idvalue union all
select 'ZNSZD695' as idvalue union all
select 'HNSZD695' as idvalue union all
select '87HNSZD695' as idvalue union all
select 'qas21HNSZD695' as idvalue union all
select '717JHNSZD695' as idvalue union all
select '717JHNSZD696' as idvalue
) a
GO
 
-- to populate the above, could also select from the real table "from my_real_table" instead of "from (...union query...) a"
-- then afterwards if it is going to remain, create the clustered primary key on the ID column, and an index on idvalue with include id
 
 
-- now test / use the function
 
select *,dbo.udf_convert_to_number(idvalue,'N') from my_idvalue_identity_map
 
 
 
-- could also use that function as a third column in the building of my_idvalue_identity_map


NB I did add that offset thingy just to see what happens. Use 'N' to ignore it - meaning each char becomes a 2-digit number; use 'Y' - meaning digits remain single digits, chars are 2 digits.
Avatar of dbaSQL

ASKER

aaah... I hadn't used the N before.  that's where i was tripping.  
just pumped two of my records into a temp table... it works:

717JHNSZD695          554955747278839068545753
717JHNSZD696          554955747278839068545754

good to have in my back pocket.  i know they're going to come back to me on this.
what do you think about unique-ness, mark?  say I ran this on a gazillion records (each with a unique string id)... any problem w/the converted values duplicating?  
If you use the 'N' option then no - it is completely reversible back to the original characters by doing a char() on every 2 digits

e.g. 717H5 = 5549557253     where 55 = 7, 49 = 1, 55 = 7, 72 = H, 53 = 5

Using the 'Y' option, it depends on the format of the strings in the first place. If they have a pattern like 999XXXXXX999 for each and every code, then it will be OK in terms of being unique, but if there are different formats, then it is unpredictable, e.g. 9999 gives a 4-digit result, but so does 99X.

Does that make sense ?
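That reversibility with 'N' can be sketched as a decode loop - a hypothetical inverse helper, not part of the function above:

```sql
-- Hypothetical inverse of udf_convert_to_number with @offset = 'N':
-- walk the digit string two characters at a time and map each
-- 2-digit ascii code back to its character.
declare @encoded varchar(100) = '5549557253'   -- encodes '717H5'
declare @decoded varchar(50) = ''
declare @i int = 1

while @i < len(@encoded)
begin
   set @decoded = @decoded + char(cast(substring(@encoded, @i, 2) as int))
   set @i = @i + 2
end

select @decoded as decoded   -- returns '717H5'
```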
Avatar of dbaSQL

ASKER

Perfect sense.
Thank you, sir.
Avatar of dbaSQL

ASKER

now i just gotta figure out how to use the darned mapping ID to load the new table

>>for the legacy data, i need to dump the ID string into a table, gen a unique ID per record in the existing dataset, and use this in my new ID attribute, which will be a bigint.

I'll get there....
have a look at that select ... into

as an example...

select top 1000 * into  tst_idvalue_table
from my_blud_dee_big_table

go

select identity(bigint,1,1) as id,idvalue,dbo.udf_convert_to_number(idvalue,'N') as hashed_idvalue into my_idvalue_identity_map
from tst_idvalue_table

go

select * from my_idvalue_identity_map

go

alter table tst_idvalue_table add new_idvalue decimal(32)

go

update tst_idvalue_table set new_idvalue = hashed_idvalue
from tst_idvalue_table
inner join my_idvalue_identity_map m on m.idvalue = tst_idvalue_table.idvalue

go

select * from tst_idvalue_table

-- let me know if I can help any more... But we are really starting to get into the realm of a new question maybe :)