Link to home
Start Free TrialLog in
Avatar of chsalvia
chsalvia

asked on

Substituting strings with checksums

If you're using some kind of a data structure such as a hashtable or b-tree, which maps keys to values, can you substitute keys with checksums of those keys?

This would save disk-space/memory, but I'm not sure if checksums are necessarily unique for each key.  If a checksum takes into account every bit of the key, presumably it would always generate a unique integer for every key.  But I'm not sure if that's true.

Are there checksum algorithms which are known to generate unique integers for every key?  
Avatar of Infinity08
Infinity08
Flag of Belgium image

>> If you're using some kind of a data structure such as a hashtable or b-tree, which maps keys to values, can you substitute keys with checksums of those keys?
Yes, that's basically how a hash table works ...

>> but I'm not sure if checksums are necessarily unique for each key
Usually they're not unique - read further.

>> If a checksum takes into account every bit of the key, presumably it would always generate a unique integer for every key.  But I'm not sure if that's true.
Not really. Suppose you have a 10 bit key. So, there are 2^10 = 1024 possible keys. If you now take the 5 bit checksum over that key, you get only 2^5 = 32 possible checksums. So, there are gonna be keys that generate the same checksum.
If your checksum size is smaller than the key size (which is usually the case - otherwise there's not much point in using a checksum), then it's not gonna be unique.

Unless you can exploit knowledge of the contents of thekey. Suppose for example that the key will only contain alphabetical characters, you can use that to create a checksum algorithm that generates a unique checksum value ...


I suggest you read up a bit on hash tables : http://en.wikipedia.org/wiki/Hash_table
It will make things a lot clearer for you ...
Avatar of chsalvia
chsalvia

ASKER

>>>> If you're using some kind of a data structure such as a hashtable or b-tree, which maps keys to values, can you substitute >>keys with checksums of those keys?
>>Yes, that's basically how a hash table works ...

Using a checksum to reduce the table size seems like an interesting idea.  But every single hash table implementation I've ever seen just stores the keys themselves in the table.

>>Unless you can exploit knowledge of the contents of thekey. Suppose for example that the key will only contain alphabetical >>characters, you can use that to create a checksum algorithm that generates a unique checksum value ...

I was looking into a few algorithms, such as Fletcher's algorithm.  Most checksum algorithms I find can't be used with that many keys, because of repeats.  Of course, a 64-bit checksum would give at least 2^64 potential keys, which would be great.  But, it seems as if the only way to ensure unique keys all the time is to use some kind of encryption algorithm, like SHA-1.  But that's a lot less efficient that a regular checksum.
>> But every single hash table implementation I've ever seen just stores the keys themselves in the table.
I should have been more specific : the key is usually a kind of checksum of (part of) the actual data that is stored.
You've got data you want to store in a hash table. You come up with a hash function that creates a key from (part of) that data. This hash is usually quite similar to a checksum, but can be quite different too. Everything depends on the kind of data you want to store in the hash table, and how you want to be able to retrieve that data.

>> Most checksum algorithms I find can't be used with that many keys, because of repeats.
A hash table is based on repeating keys. If you have a hash table of 10 entries, in which you want to store integers, then a simple hash function could be (x % 10), so your hash table would look like this :

    0 : 20  40  50
    1 : 1  91
    2 :
    3 : 33
    4 : 84  54
    5 :
    6 :
    7 : 7
    8 :
    9 : 89

If I want to find back value 54, I use the hash function (x % 10 = 54 % 10 = 4) to know that it will be located on position 4 in the hash table. All I have to do then is iterate over the values stored on position 4 in the hash table ...

It doesn't matter that different data values will have the same key - that's how a hash table works.


However, I get the idea that you might not want a hash table for your needs ... What is it that you want to do exactly ? What data do you want to store, for which purpose, and how do you want to retrieve it ?
>> It doesn't matter that different data values will have the same key - that's how a hash table works.

I understand the idea behind the hash table itself - i.e. a hash function converts keys to indexes.  When a key hashes to the same index you need to come up with some collision resolution method, like a linked list or internal probing.  

I was just wondering if, for example, you have keys which are strings, it would be feasible to store the strings as checksums within the hash table - and look up the strings by hashing the checksums rather than the strings.

But I think this will only work if you have a relatively small amount of keys, because checksums will not necessarily be unique for every string.  
Storing checksums instead of strings is only a good idea if you don't need the strings themselves.
Other than that you're right of course that the checksum should be unique, or at least unique enough for your pruposes.

Can you give a bit more info :
>> What is it that you want to do exactly ? What data do you want to store, for which purpose, and how do you want to retrieve it ?

We might be able to come up with a better idea ...
>> What is it that you want to do exactly ? What data do you want to store, for which purpose, and how do you want to retrieve it ?

I'm just experimenting with a bunch of different indexing possibilities for some data sorting programs I've been working on.  I came across the idea of checksums and it seemed interesting.

One idea I had was to use checksums to create a compact dictionary of translations between languages.  For example, if I want to store a dictionary file that translates between English and Spanish.  Since I would only need the translation, the English word (the key) could be stored as a checksum rather than a string.  This way the hash table "key" column would only need to be 4 bytes for each slot.  (The size of a 32-bit integer.)  

But I don't know if this is a good idea, because it could result in incorrect values if two English words produce the same checksum.  Given that there's hundreds of thousands of English words, it's quite possible there'd be duplicate checksums.  

Then again, a 32-bit unsigned integer can represent over 4 billion possible values, which is way more words than there are in any language.  But again, there's no guarentee that all checksums would be unique.
I've referred to this earlier : if your keys are only alphabetical characters, there are only 26 possibilities for each letter, as opposed to the 256 possibilities for a normal 8-bit character in a string. That can serve as the basis of your checksum algorithm.
Another thing is that each language has letter combinations that'll never occur - you can take advantage if that and similar facts too.

As you said, a 32 bit value can store 4 billion different values - no language has that many words. The challenge is to find a checksum algorithm that'll generate a unique checksum for each word.

The question is : will all this really gain you anything ? How much would you really compress the dictionary ? Not much - is it worth the extra effort ?
ASKER CERTIFIED SOLUTION
Avatar of Infinity08
Infinity08
Flag of Belgium image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial