Solved

Substituting strings with checksums

Posted on 2006-10-27
8
215 Views
Last Modified: 2010-04-15
If you're using some kind of a data structure such as a hashtable or b-tree, which maps keys to values, can you substitute keys with checksums of those keys?

This would save disk-space/memory, but I'm not sure if checksums are necessarily unique for each key.  If a checksum takes into account every bit of the key, presumably it would always generate a unique integer for every key.  But I'm not sure if that's true.

Are there checksum algorithms which are known to generate unique integers for every key?  
0
Comment
Question by:chsalvia
  • 5
  • 3
8 Comments
 
LVL 53

Expert Comment

by:Infinity08
ID: 17823149
>> If you're using some kind of a data structure such as a hashtable or b-tree, which maps keys to values, can you substitute keys with checksums of those keys?
Yes, that's basically how a hash table works ...

>> but I'm not sure if checksums are necessarily unique for each key
Usually they're not unique - read further.

>> If a checksum takes into account every bit of the key, presumably it would always generate a unique integer for every key.  But I'm not sure if that's true.
Not really. Suppose you have a 10 bit key. So, there are 2^10 = 1024 possible keys. If you now take the 5 bit checksum over that key, you get only 2^5 = 32 possible checksums. So, there are gonna be keys that generate the same checksum.
If your checksum size is smaller than the key size (which is usually the case - otherwise there's not much point in using a checksum), then it's not gonna be unique.

Unless you can exploit knowledge of the contents of thekey. Suppose for example that the key will only contain alphabetical characters, you can use that to create a checksum algorithm that generates a unique checksum value ...


I suggest you read up a bit on hash tables : http://en.wikipedia.org/wiki/Hash_table
It will make things a lot clearer for you ...
0
 

Author Comment

by:chsalvia
ID: 17825784
>>>> If you're using some kind of a data structure such as a hashtable or b-tree, which maps keys to values, can you substitute >>keys with checksums of those keys?
>>Yes, that's basically how a hash table works ...

Using a checksum to reduce the table size seems like an interesting idea.  But every single hash table implementation I've ever seen just stores the keys themselves in the table.

>>Unless you can exploit knowledge of the contents of thekey. Suppose for example that the key will only contain alphabetical >>characters, you can use that to create a checksum algorithm that generates a unique checksum value ...

I was looking into a few algorithms, such as Fletcher's algorithm.  Most checksum algorithms I find can't be used with that many keys, because of repeats.  Of course, a 64-bit checksum would give at least 2^64 potential keys, which would be great.  But, it seems as if the only way to ensure unique keys all the time is to use some kind of encryption algorithm, like SHA-1.  But that's a lot less efficient that a regular checksum.
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 17826733
>> But every single hash table implementation I've ever seen just stores the keys themselves in the table.
I should have been more specific : the key is usually a kind of checksum of (part of) the actual data that is stored.
You've got data you want to store in a hash table. You come up with a hash function that creates a key from (part of) that data. This hash is usually quite similar to a checksum, but can be quite different too. Everything depends on the kind of data you want to store in the hash table, and how you want to be able to retrieve that data.

>> Most checksum algorithms I find can't be used with that many keys, because of repeats.
A hash table is based on repeating keys. If you have a hash table of 10 entries, in which you want to store integers, then a simple hash function could be (x % 10), so your hash table would look like this :

    0 : 20  40  50
    1 : 1  91
    2 :
    3 : 33
    4 : 84  54
    5 :
    6 :
    7 : 7
    8 :
    9 : 89

If I want to find back value 54, I use the hash function (x % 10 = 54 % 10 = 4) to know that it will be located on position 4 in the hash table. All I have to do then is iterate over the values stored on position 4 in the hash table ...

It doesn't matter that different data values will have the same key - that's how a hash table works.


However, I get the idea that you might not want a hash table for your needs ... What is it that you want to do exactly ? What data do you want to store, for which purpose, and how do you want to retrieve it ?
0
 

Author Comment

by:chsalvia
ID: 17826860
>> It doesn't matter that different data values will have the same key - that's how a hash table works.

I understand the idea behind the hash table itself - i.e. a hash function converts keys to indexes.  When a key hashes to the same index you need to come up with some collision resolution method, like a linked list or internal probing.  

I was just wondering if, for example, you have keys which are strings, it would be feasible to store the strings as checksums within the hash table - and look up the strings by hashing the checksums rather than the strings.

But I think this will only work if you have a relatively small amount of keys, because checksums will not necessarily be unique for every string.  
0
Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

 
LVL 53

Expert Comment

by:Infinity08
ID: 17826886
Storing checksums instead of strings is only a good idea if you don't need the strings themselves.
Other than that you're right of course that the checksum should be unique, or at least unique enough for your pruposes.

Can you give a bit more info :
>> What is it that you want to do exactly ? What data do you want to store, for which purpose, and how do you want to retrieve it ?

We might be able to come up with a better idea ...
0
 

Author Comment

by:chsalvia
ID: 17826933
>> What is it that you want to do exactly ? What data do you want to store, for which purpose, and how do you want to retrieve it ?

I'm just experimenting with a bunch of different indexing possibilities for some data sorting programs I've been working on.  I came across the idea of checksums and it seemed interesting.

One idea I had was to use checksums to create a compact dictionary of translations between languages.  For example, if I want to store a dictionary file that translates between English and Spanish.  Since I would only need the translation, the English word (the key) could be stored as a checksum rather than a string.  This way the hash table "key" column would only need to be 4 bytes for each slot.  (The size of a 32-bit integer.)  

But I don't know if this is a good idea, because it could result in incorrect values if two English words produce the same checksum.  Given that there's hundreds of thousands of English words, it's quite possible there'd be duplicate checksums.  

Then again, a 32-bit unsigned integer can represent over 4 billion possible values, which is way more words than there are in any language.  But again, there's no guarentee that all checksums would be unique.
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 17826978
I've referred to this earlier : if your keys are only alphabetical characters, there are only 26 possibilities for each letter, as opposed to the 256 possibilities for a normal 8-bit character in a string. That can serve as the basis of your checksum algorithm.
Another thing is that each language has letter combinations that'll never occur - you can take advantage if that and similar facts too.

As you said, a 32 bit value can store 4 billion different values - no language has that many words. The challenge is to find a checksum algorithm that'll generate a unique checksum for each word.

The question is : will all this really gain you anything ? How much would you really compress the dictionary ? Not much - is it worth the extra effort ?
0
 
LVL 53

Accepted Solution

by:
Infinity08 earned 250 total points
ID: 17827021
A dictionary data structure might be more handy for this purpose btw. Link on "dads" (dictionary of algorithms and data structures) :

http://www.nist.gov/dads/HTML/dictionary.html
0

Featured Post

Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

Join & Write a Comment

Summary: This tutorial covers some basics of pointer, pointer arithmetic and function pointer. What is a pointer: A pointer is a variable which holds an address. This address might be address of another variable/address of devices/address of fu…
This is a short and sweet, but (hopefully) to the point article. There seems to be some fundamental misunderstanding about the function prototype for the "main" function in C and C++, more specifically what type this function should return. I see so…
The goal of this video is to provide viewers with basic examples to understand and use structures in the C programming language.
The goal of this video is to provide viewers with basic examples to understand how to use strings and some functions related to them in the C programming language.

747 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now