The results are in! Meet the top members of our 2017 Expert Awards. Congratulations to all who qualified!

Published on

7,004 Points

Looking for a way to avoid searching through large data sets for data that doesn't exist? A Bloom Filter might be what you need. This data structure is a probabilistic filter that allows you to avoid unnecessary searches when you know the data definitely won't be found. Read on to find out more...

In general, the worse case scenario when searching through a data set is when the datum being searched for doesn’t exist. In this case, the complete data storage needs to be searched before it’s possible to conclude that the datum cannot be found. If only there was a way to eliminate the need to perform unnecessary searches when we know the data won’t be found. Fortunately, there is a data structure that allows you to do just that. This data structure is knows as a “Bloom Filter”.

So, how does this witchcraft work I can hear you ask? Well, a Bloom Filter is a probabilistic data structure that can tell you if an item definitely doesn’t exist in a data set. What it can’t tell you, with any degree of certainty, is if an item does exist. This clever data structure was invented by a chap called Burton Howard Bloom circa 1970. The power of a Bloom Filter is that checking it for existence of a datum is significantly faster than checking a complete data store. For those who are into Big O notation, the time complexity for performing a check on a Bloom Filter is O(k), where k is the number of hash functions used and bits set (see below).

The fact we can know for definite that an item doesn’t exist in a data store means we don’t have to waste our time searching for something we’re definitely not going to find. Unfortunately, we can only know for sure that data doesn't exist. We can get false positive hits, which means the data might exist and in these cases a search of the data store is still necessary. With careful usage of a Bloom Filter, we can avoid performing expensive searches if we know the data won’t be found. Neat, eh?

The principle of how a Bloom Filter works is quite simple, when an item is “added” to the Bloom Filter a statistically unique “bit pattern” is generated using a series of hash functions (one hash function for each bit), which is then written into a single bit vector (the same bit vector is used to store all bit patterns - hence the possibility of a false positive). When checking to see if an item has been added to the Bloom Filter we check to see if the same bit pattern exists in the bit vector. If it doesn’t then we know the item was never added to the filter and so won't be found in the data store. If we find a matching bit pattern the item may exist in the data store and so a full search is required.

The reason we can’t know for sure that it wasn’t added is because, over time, as more items are added to the filter there is a chance that there will be a collision on the specific bit pattern for a particular item, because one or more other items may have generated the same bit pattern (either singularly or as a group). This means, we cannot say for sure that an item does exist, only that it doesn’t – if the bit pattern for a particular item isn’t set then it cannot exist.

Let’s consider an example. Let’s assume we have a very simple Bloom Filter that is using a 16 bit filter (normally we'd use many more bits that this). We’re going to add three numbers, 21, 34 and 57, and each number will generate 3 unique bit patterns (note, that these bits are just an example and the actual number and value of the bits set in a real Bloom Filter will depend on the number and type of hash functions used):

+------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | bits | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | +------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | 21 | | | X | | | | | X | X | | | | | | | | +------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | 34 | | | | | X | | | | | X | | | X | | | | +------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | 57 | | | | | | | | | X | | | X | | | | X | +------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Now, let’s assume we want to see if number 85 is in the set. This generates the following bit pattern:

+------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | bits | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | +------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | 85 | | | X | | | | | | | | | | X | | X | | +------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Bit 2 is set in the bit vector, bit 12 is set, but bit 14 is not set and so we know for certain 85 is not in there; we can be sure it doesn’t exist in the data set.

Now, let’s do the same for 91.

+------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | bits | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | +------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | 91 | | | | | | | | X | | | | X | | | | X | +------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

In this case, we have a collision of bits, that just happen to match with numbers 21 and 57, which means 91 might exist in the data set and so a full search of the data store is necessary. Notice that we’ve never actually added 91, but because we have a collision cause by bits set when adding 21 (bit 7) and 57 (bits 11 and 15) we cannot rule out that 91 might exists, hence we have no choice but to search the data store to see if this number exists or not.

This is why a Bloom Filter is probabilistic. We can say an item definitely doesn’t exist or it might exist. This doesn’t help us in the case where data might exist, but it can save us performing unnecessary expensive searches when we know the data definitely doesn’t exist. That’s the job of a Bloom Filter, to allow us to identify those cases where data definitely doesn’t exist, hopefully saving us from performing an expensive search.

Now, for a Bloom Filter to be useful, we want to avoid as many collisions as we can and there is a trade off between the number of bits we set when adding a value (a Bloom Filter can work with any data, not just numbers), the size of the bit vector and the number of values we add. The more bits you set the less chance of a false positive; however the smaller your bit vector the quicker the space will become polluted and so the more chance of a false positive.

Unfortunately, there is no hard and fast rule in determining the size of the bit vector nor the number of hash functions to use and so trial and error is necessary. Experiment with a representative sample of your data set to try and find the ideal number of hashes vs. the size of your bit vector. As a rule of thumb, the more data you inject into the filter and the more bits you set, the larger the bit field needs to be to avoid collisions.

To set the bits, when adding a new item to the Bloom Filter, it is necessary to use different hash functions, one for each bit. Each hash function will generate a new and statistically unique hash value for the datum and the bit position can then be obtained by performing a modulo calculation against the hash using the number of bits in the bit vector. Each hash function should generate a different unique hash value, thus setting a different bit. For example, by using three different hash functions we can set three different bits.

Implementing a Bloom Filter in C++ is pretty simple, with the tricky part deciding on which hash functions to use. This is a decision that needs to be made during the implementation of the Bloom Filter and it’s a good idea to test a number of different hash functions to ensure they give an even distribution of bits within the bit vector for the type of data you plan to add to the filter. You can filter any type of data you like, the only requirement is that the data is suitable for hashing.

Once you’ve identified suitable hash functions, it’s just a case of deciding how large your bit vector needs to be and then storing the bit patterns for each datum added into the bit vector. When performing a lookup in the filter we just need to re-generate the bit pattern and see if it exists in the bit vector. If it doesn’t then the datum does not exist in our storage, if it does then the datum might exist in the storage and so a full search will be necessary.

The full code for a very simple Bloom Filter can be found, below. Although there is quite a lot of code, most of this is just the implementation of a few simple hash functions. The main guts of the Bloom Filter (see the bloom_filter class) is actually very straight forward. In this implementation, the hash functions are passed into the Bloom Filter’s constructor and so this means the filter can use any number of hash functions. The size of the bit vector is fixed at 128, but this can be any size and can even be decided at run-time. The only requirement is that it’s bigger than the number of hash functions. The actual size is dependent on how much data you plan to add to the filter.

#include <cstdint> #include <string> #include <vector> #include <memory> // =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= // Original hash function implementations and descriptions can be found here: // http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx // =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= class hash { public: virtual ~hash() = default; virtual uint32_t operator() (void const * key, size_t len) const = 0; uint32_t operator() (std::string const & s) const { return (*this)(s.c_str(), s.size()); } }; // =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= class djb_hash : public hash { public: uint32_t operator() (void const * key, size_t len) const override { auto p = reinterpret_cast<unsigned char const *>(key); auto h = uint32_t(0); for(auto i = size_t(0); i < len; i++) { h = 33 * h + p[i]; } return h; } using hash::operator(); }; // =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= class sax_hash : public hash { public: uint32_t operator() (void const * key, size_t len) const override { auto p = reinterpret_cast<unsigned char const *>(key); auto h = uint32_t(0); for(auto i = size_t(0); i < len; i++) { h ^= (h << 5) + (h >> 2) + p[i]; } return h; } using hash::operator(); }; // =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= class fnv_hash : public hash { public: uint32_t operator() (void const * key, size_t len) const override { auto p = reinterpret_cast<unsigned char const *>(key); uint32_t h = 2166136261; for(auto i = size_t(0); i < len; i++) { h = (h * 16777619) ^ p[i]; } return h; } using hash::operator(); }; // =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= class oat_hash : public hash { public: uint32_t operator() (void const * key, size_t len) const override { auto p = reinterpret_cast<unsigned char const *>(key); auto h = uint32_t(0); for(auto i = size_t(0); i < len; i++) { h += p[i]; h += (h << 10); h ^= (h >> 6); } h += (h << 3); h ^= (h >> 11); h += (h << 15); return h; } using hash::operator(); }; // =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= // The actual Bloom Filter class bloom_filter { public: static size_t const bit_count = 128; bloom_filter(std::vector<std::shared_ptr<hash>> && hashers) : hashers_(std::move(hashers)) , bits_(bit_count) { } void add(std::string const & s) { for(auto const hasher : hashers_) { size_t idx = (*hasher)(s) % bit_count; bits_[idx] = true; } } bool exists(std::string const & s) { for(auto const hasher : hashers_) { size_t idx = (*hasher)(s) % bit_count; if(!bits_[idx]) { return false; } } return true; } private: std::vector<std::shared_ptr<hash>> hashers_; std::vector<bool> bits_; }; // =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= int main() { auto && bfilt = bloom_filter{ std::vector<std::shared_ptr<hash>>{ std::make_shared<djb_hash>(), std::make_shared<sax_hash>(), std::make_shared<fnv_hash>(), std::make_shared<oat_hash>(), } }; bfilt.add("hello"); bfilt.add("world"); // these should exist assert(bfilt.exists("hello")); assert(bfilt.exists("world")); // these should not exist assert(!bfilt.exists("foobar")); assert(!bfilt.exists("eggplant")); }

The hash functions used in the example are some popular ones that are simple to implement and code for them can be found on the web. The choice of hash functions is really up to you and you should be sure to chose ones that give a good distribution for your data. You can read more on the hashes I used in this example here.

0 Comments

Course of the Month7 days, 4 hours left to enroll

Next Article:File Server Migration Methods