Still celebrating National IT Professionals Day with 3 months of free Premium Membership. Use Code ITDAY17

x
Solved

# Algorithm for converting a alphanumeric code to an integer

Posted on 2007-12-05
Medium Priority
675 Views
The short question; I have some account numbers that I need to generate an integer from.  I could use the binary value for it, but it seems a bit long.  Is there an algorithm I could use to convert them or will those numbers be just as long as the binary?

example of codes:
ACCO/COC/3263
ACCO/PDW/531
ACCO/CUP/558

My plan was to convert ACCO/CUP/ to a value, multiply it by 100000 and add the number on the end i.e.
The last code might look like this; 3462354600558

The full problem stands that when my data warehouse gets populated from the live data, I need some of the IDs to be converted to Integers so the SQL cube is more efficient and the data isn't so huge (30 million rows with those codes in takes up a bit more space than if it were Ints!).  Ideally I need the IDs to stay the same every time I refresh the data, hence the idea of using an algorithm.

Any help would be much appreciated.
Regards,
Kinton
0
Question by:kinton
[X]
###### Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

• Help others & share knowledge
• Earn cash & points
• 5
• 5
• 4
• +5

LVL 25

Expert Comment

ID: 20412958
Why not making a new table which contains the existing account numbers, give it an Identity-column and you have your integer.

Then you have to find out a way to ensure each new accountnumber gets added into that table before you  use it.
0

LVL 53

Expert Comment

ID: 20412959
What are the possible values for the three fields ? Will the first two always be 4 and 3 uppercase characters respectively ? Or can they be shorter/longer and/or consist of different characters ? What are the minimum and maximum values for the last field ?

What is the size of an int in your database ? 32 bits ? 64 bits ?

If I understand you correctly, there has to be a one-to-one relationship, and you have to be able to reverse the "encoding", yes ?
0

LVL 84

Expert Comment

ID: 20413151
3462354600558 is larger than 2^32 so perhaps that means ints in your data basr are 64 bits
So if there are no more than 184467440737096 possible combinations of ACCO/CUP it would fit in 64 bits after multiplying it by 100000
0

LVL 2

Expert Comment

ID: 20413308
Hi Kinton,

If each of the 7 letters in the "ACCO/COC" portion of the code can be any letters A-Z, that gives you 26^7 combinations (8031810176).

To get the number, you would follow this basic algorithm, which essentially converts a 'base 26' string to a decimal. (it also assumes that you remove the / from your code, leaving a string of 7 letters).

codenumber = 0
Foreach letter in code (from letterindex 0 to letterindex 6)
letternum = convert letter to respective number (0-25)
codenumber += (letternum * (26^letterindex))
end foreach

The above pseudocode (when properly translated into the language of your choice) should give the following results:

aaaaaaa = 0
aaaaaab = 1
aaaaaaz = 25
aaaaaba = 26
aaaabaa = 676
ACCOCOC = 24924486
ACCOPDW = 24933008
ACCOCUP = 24924655

You'll then want to multiply this 'codenumber' by a suitable amount to shift it above the range of your digits (as you described in your question). However, you will need to be careful that you use a suitable integer type (eg, a 64-bit int) because of the number of combinations that are possible!

Alternatively, to avoid huge integers, you could store the parts of the code in separate integers (and therfore separate fields in your database table).

If the first 4 letters of your code are always 'ACCO' then they could be disregarded, giving you a much smaller range of 26^3 (17576) which will easily fit in a 32 bit int.

hope this helps!

- Ben

0

LVL 84

Expert Comment

ID: 20413422
If they are always one of
ACCO/COC
ACCO/PDW
ACCO/CUP
they could be stored in two bits.
If they can be any 7 A-Z,0-9 character, that  would be 78364164096 possibilities, which would require 37 bits
if it can be any 8 A-Z,a-z,0-9,/ characters, that would be 248155780267521 possibilities, which would require 48 bits
That's wny it's important to know what are the possible values before choosing a conversion algorithm
0

LVL 27

Expert Comment

ID: 20415053
I would suggest creating a 4 char or 8 char field containing the following ID forms:

AAAA
AAAB
AAAC
....
AABA
AABB
.....
ZZZZ

In the language I use, they take the least space, humanly readable and have a faster index search.

If your language supports it, use the full ASCII scale 0-255.
0

LVL 2

Author Comment

ID: 20418932
Thanks for all of your replies, i'm still having a read through but some initial responses;

How do I find out if my INTs are 32 or 64 bit?  I thought Int in databases & programming was just 32 bit - though I couldn't say why I thought that.

Theres always 4 capital letters, a slash, 3 capital letters, another slash then numbers.  At the moment the highest is about 35000 but in theory this could conceivable reach 100,000.  Can't really see it reaching a million though.

I'll continue to read and understand (or try at least) your replies then i'll post again.  Thanks again mathematical geniuses of EE :)
0

LVL 53

Expert Comment

ID: 20419040
>> How do I find out if my INTs are 32 or 64 bit?  I thought Int in databases & programming was just 32 bit - though I couldn't say why I thought that.

What database are you using ? And on what system ?

>> Theres always 4 capital letters, a slash, 3 capital letters, another slash then numbers.  At the moment the highest is about 35000 but in theory this could conceivable reach 100,000.  Can't really see it reaching a million though.

4 + 3 capital letters gives 26^7 possibilities (8031810176). That's already more than a 32bit int can hold.

Are there any limits to the letter combinations ? Or is any combination possible ?

I'll assume that the ints are 64bit. Then you'd have 31 bits left for storing the number at the end, which is enough to contain values between 0 and 2147483648 (and thus certainly values up to 100000).
0

LVL 2

Author Comment

ID: 20421422
Hi Infinity,

Its a Windows Server 2003 standard edition, running SQL 2005 standard edition.  I now agree, I think it does support 64bit Ints.  I clearly wasn't paying attention when I made the variable (i.e. it was Int by default hence didnt click on the list to change it).  I have now gone back and clicked on the list and found Int16, 32 & 64 (and UInt32 & 64).

Theres 3,553,132 unique codes, however i'm not 100% sure but I think theres only 16 iterations of the first group and 112 iterations of the first 2 groups together.
The first group is companies, the second is products and the 3rd is their ID (i.e. there could be 112 records in the table where the 3rd group = 1).

To pad out the info a bit more, Account number is a dimension in a cube.  It gets its data from the same table as the fact table.  Theres 15 million rows in it and account number is included.  When the dimension is built the query doesnt group it so i assume it goes in multiple times.
My issues are 1. its duplicated for every copy of the product sold and 2. not an INT - thus slow & takes up lots of space.

I thought it would be better to have an INT value in my fact table and the code in a seperate table the dimension uses.
I could populate an Account table every night and generate a new key, but if anyone's saved any pivot tables using accout numbers the companies they've filtered for will change every night - I can see how that might be entertaining, but I don't think the users will.
Option 1 is Jogos solution.  Option 2 is base it on an algorithm so I can always generate the same number.

I can't update the original table as its all in DB2 and way too complicated to go messing with.

Regards,
Kinton.
0

LVL 20

Expert Comment

ID: 20421921
If I'm understanding your case correctly, you have an existing database (DB2) that contains the account numbers, then you populate another database (SQL 2k5) with these account numbers on a period basis (nightly, weekly, it doesn't really matter).

Is your DB2 database the one that creates the account numbers as well? If so, you can't just recreate the SQL2k5 table each time without a 1-to-1 algorithm (otherwise you run the risk of duplicating ids, etc).

Are the account numbers already created in the string form "AAAA/AAA/ddd" or is your conversion process doing that?

0

LVL 53

Accepted Solution

Infinity08 earned 1000 total points
ID: 20422505
Since you're using 64bit integers, there's no problem. You can cover all possible combinations in it. So :

ABCD/XYZ/12345

would become :

(A * 26^6 + B * 26^5 + C * 26^4 + D * 26^3 + X * 26^2 + Y * 26 + Z) * 2^31 + 12345
=  (0 * 26^6 + 1 * 26^5 + 2 * 26^4 + 3 * 26^3 + 23 * 26^2 + 24 * 26 + 25) * 2^31 + 12345
=  27625772961247289

and reversed :

27625772961247289 mod 2^31 = 12345                  <====  12345
27625772961247289 div 2^31 = 12864253
12864253 mod 26 = 25                                              <====  Z
12864253 div 26 = 494778
494778 mod 26 = 24                                                  <====  Y
494778 div 26 = 19029
19029 mod 26 = 23                                                    <====  X
19029 div 26 = 731
731 mod 26 = 3                                                          <====  D
731 div 26 = 28
28 mod 26 = 2                                                            <====  C
28 div 26 = 1
1 mod 26 = 1                                                              <====  B
1 div 26 = 0
0 mod 26 = 0                                                              <====  A

so, we get ABCD/XYZ/12345, which was the original code.

Note that A-Z are mapped onto the values 0-25.
0

LVL 24

Expert Comment

ID: 20424997
Resisting comment like ascii conversion, where this is about products, I would first shoot for example such as of ozo. Rather than allow for every possible combination, I would try to address only those that are needed, even if a lookup table is used.

> but I think theres only 16 iterations of the first group and 112 iterations of the first 2 groups together.

There ya go, cheat. For just 16, they could even be represented in hex. Find out what exactly they all are and look for the commonness and difference, and then it can be better seen if there are some algorthms that can be more efficient than a lookup table.

> ACCO/COC/3263
> ACCO/PDW/531
> ACCO/CUP/558
> My plan was to convert ACCO/CUP/ to a value

Now, what if the first four are always ACCO? While most of that looks like hex conversion can apply, can it be possibly we can also completely ignore the first four since they are always the same value?
0

LVL 24

Expert Comment

ID: 20425024
> and add the number on the end i.e.

Also, is it even possible that the last number is unique? Run a quick check for duplicates, maybe this is a forest/trees issue, so many numbers and letters they look like they can be anything, may they are not

> 16 iterations of the first group

Some may want to try a hash.

If that is the first four characters, still look for something to trim or ignore, something in common

ACC0
ACC1
ACC2
...
ACCN
ACCO
ACCP
...

Do the same for all potential groups, simplify everything.

Is it possible that each of the ~16 have dedicated range for one of the other groups? If so then that can lead to another reduction or simplification.

In short, rather than address all combinations, I like to (cheat) use inside knowledge about the data to try to simplify the processes to use.
0

LVL 53

Expert Comment

ID: 20426470
>> In short, rather than address all combinations, I like to (cheat) use inside knowledge about the data to try to simplify the processes to use.

The problem is that it might not be future proof, because new combinations might be added later. Also, if you take the lookup table approach ... you still have to store the lookup table, and the whole point was to save space ;) (weak argument I know heh).

Anyway, since the 64bit int is large enough to hold ALL combinations, why not just do it ? You don't lose anything (maybe a slight bit more time when converting).
0

LVL 20

Expert Comment

ID: 20427111
>>The problem is that it might not be future proof, because new combinations might be added later

Yep, and this is how a small problem now may become a huge problem later.
0

LVL 24

Expert Comment

ID: 20428937
Unable to respond without the knowledge. I tend to leave room. Suppose it actually was format of ACC* where ther were 15 variants of *. I can extend to 16 for hex compaction, but that is still another conversion. I can agree to extending it to 26 alpha, even adding the numerics, even all possibilities (blanks?) but if the first three are a constant, there is an obvious pattern, and one can then check with responsible officials about the meanings, and potential for changes. If it is to remain consistent then one can ignore it.

Companies can and do embed codes, such as for company, location, warehouse, type, model, and that can be fixed towards some formats. Major changes can be handles at that time. Where the issue regards size and timing, then expediencies can be sought to gain the productivity if not clarity. For addressing personnel types, they can use a long number but it has embedded codes, never will they have as many job titles as individuals when they have that many.

> 30 million rows
> to be converted to ....more efficient
> more time when converting

Simple as well doing own benchmarkings, example programs to manage data, and repetetive there may be the conversions. Is it more editing or more reporting etc. One time conversation or every instance of daily use, whether access is for on demand or offline, etc Models themselves are important and som exist, so inside knowledge is important for not only getting it correct in the first place (beware false assumptions) and for leveraging to convert from a barely adequate system to something actually useable. It is still too often that coders do not get contact with users and users are delivered what they did not know they want, and may not appreciate taking two steps backward for every step forward.
0

LVL 24

Expert Comment

ID: 20428958
> a small problem now may become a huge problem later.

The same can be said for extra times allocated to algorithms, get the clock upset and lose transactions and not even know there is a problem?
0

LVL 2

Author Comment

ID: 20440705
It looks like my options are a conversion to 64bit int or keep a table of account number to INT key mapping and add the new ones to it every night.
I think i'll have a crack at the first (Infinity08s algorithm), then resort to the second if I can't get that working.

Just to confirm some details:
The first part isn't always ACCO, there are 15 other codes at the moment which will grow when more companies get added.
Other codes for the first part include CIPD & IMAG, they're always capital letters.  No numbers or other chars.
The last part isn't unqiue, there are current anything upto 112 of each number.
The full code is what is held in DB2, its not pulled from different DB2 fields and concatinated into the 1 field in SQL.  Not sure if there are fields in DB2 holding the seperate values; I don't have anything to do with that side of it.  I've just been informed the field I should use.
The conversion takes place on a nightly basis.
The DB2 data is put into SQL, however I need an INT key instead of a string so my cube is more efficient in processing & running, however, I need each account number to maintain the same INT key as it had the previous night.

I'll post back when i've had a go with the algorithm.  Thanks again!
0

LVL 24

Expert Comment

ID: 20446872
> I think i'll have a crack at the first (Infinity08s algorithm),

OK by me, it looks decent enough, just remember that for efficiently you can do some of the math up front, ex

For x*2*2*2*3 in algorithm
A program can do it as x*24
Doing such simplifications is important when the number of such calcs (rows) is high
0

LVL 2

Author Comment

ID: 20456359
Thanks for helping everyone.  Thanks Infinity for providing the solution I used.  Firstly, it worked, secondly, I understood it!
0

LVL 53

Expert Comment

ID: 20456401
>> Firstly, it worked, secondly, I understood it!

Nice ;)
0

## Featured Post

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In this series, we will discuss common questions received as a database Solutions Engineer at Percona. In this role, we speak with a wide array of MySQL and MongoDB users responsible for both extremely large and complex environments to smaller singlâ€¦