Solved

Bloom Filter intersection prone to Birthday Paradox?

Posted on 2010-08-24
8
848 Views
Last Modified: 2013-11-13
Consider the following abstract idea, for bloom filter matching, that was proposed to me as an alternative to looking up individual candidates one at a time...

Assume we are using a general purpose bloom filter that sets K bits per entry where K is undetermined but > 1

We build a bloom filter for a collection. We then have a candidate list of items we want to check against that filter so we build another filter out of the candidate list. We then perform perform a bit intersection (binary and operation) on the two filters. It is assumed that if any of the bits in the result are set there is a reasonably good probability that one of the items in Set B can be found in set A.

It is my contention that, actually, the probability of a false positive is very high because we are saying if any bit in filter A matches any bit in filter B there is a match. I see this as being no different to the Birthday Paradox where we ask if any person in the room shares any birthday.

Am I wrong or right and (without going beyond simple high school maths - as I am really not that clever) can you prove it? Would this still be an issue if we only set 1 bit per entry?

NB. This is not a "homework question" - check my profile - it is a real world discussion I am having.

Thank you.
0
Comment
Question by:evilrix
  • 3
  • 3
  • 2
8 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 33508512
> It is assumed that if any of the bits in the result are set there is a reasonably good probability that one of the items in Set B can be found in set A.
In this assumption, what are "the result", "Set B" and "set A"?
0
 
LVL 84

Expert Comment

by:ozo
ID: 33508617
I think I can make some guess about what you meant by "result", "A" and "B",
but I'm still not sure whether you are asking about what conclusion follows from the assumption, or whether you are questioning the validity of the assumption
(in which case you should state the premises you are starting from)
The birthday paradox can help you to judge the probability that one randomly chosen set of K bits out of n had a collision with another set of K bits out of n
for given values of K and n,
it can also help to judge the probability that some item from a random set A with cardinality |A| can be found in a set B with cardinality |B| out of a universe of cardinality |N|
Whether any of them are "very high" or "reasonably good" may depend on the sizes of the sets involved and what probablilites are considered "very high" or "reasonably good"
0
 
LVL 40

Author Comment

by:evilrix
ID: 33508664
>> In this assumption, what are "the result", "Set B" and "set A"?

Set A is a collection of 32 bit values.
Set B is a candidate set of 32 bit values there none, some or all items may be in Set A.

When items are inserted into the bloom filter K hash functions are used to set a bit pattern to represent a digest of the item being inserted - one bit per hash function will be set (this is just a standard bloom filter technique)

Result is just a bit pattern that is the result of performing a binary AND (or a set intersection) of the bits that represent the filter for Set A and the bits that represent the filter for Set B.

The assertion is that if just one bit is set in the result we have a match. I assert there is a high degree of probability that there will be a false positive for the reasons stated in the question.
0
Active Directory Webinar

We all know we need to protect and secure our privileges, but where to start? Join Experts Exchange and ManageEngine on Tuesday, April 11, 2017 10:00 AM PDT to learn how to track and secure privileged users in Active Directory.

 
LVL 84

Expert Comment

by:ozo
ID: 33508822
> Set A is a collection of 32 bit values.
How large a collection?
> Set B is a candidate set of 32 bit values
How large a set?
> K hash functions are used to set a bit pattern
How large a pattern?  is it K bits out of 32?

Given the above information, you can evaluate the relevant probabilities and determine whether they are of high degree.

0
 
LVL 40

Author Comment

by:evilrix
ID: 33508851
For the sake of argument let's assume the following....

- Each item hashed will set 2 bits in the filter
- The definitive set contains 10000 items to be hashed into the filter (setting 2 bits each).
- The candidate set is 100 items to be hashed into the candidate filter (setting 2 bits each).
- Each filter is 64K bits


0
 
LVL 53

Accepted Solution

by:
Infinity08 earned 500 total points
ID: 33508995
>> I assert there is a high degree of probability that there will be a false positive for the reasons stated in the question.

Intuitively, I'd come to the same conclusion.


But let's try to do the math.

Take a simple example of a set B of size 1. That would be the normal case when checking if an item is present in the bloom filter. However, there is a large difference in the approach taken to perform that check :

        normal approach : check if all bits of B are set in A
        proposed approach : check if at least one bit of B is set in A

The chance of false positives is respectively (where m is the number of bits in the bloom filter, k is the number of hash functions, and n is the number of items stored in the bloom filter) :

        normal approach : p^k
        proposed approach : 1 - (1 - p)^k

where p is the probability that a certain bit is 1 in A :

        p = (1 - ((1 - (1 / m))^(kn)))

with k = 3, m = 256, n = 64 for example, we get that p ~= 0.52833, so that would give the following chances of false positives :

        normal approach : (0.52833)^3 ~= 0.14747
        proposed approach : 1 - (1 - 0.52833)^3 ~= 0.89507

so, the proposed approach has a 6 times higher chance of a false positive (for this example), or almost a 90% chance of a false positive (for this example).


Now, let's extrapolate that to a set B of size l. Now :

        normal approach : for each of the items, check if all bits from the hash functions are set in A, sequentially
        proposed approach : check if at least one bit of B is set in A

The chance of false positives is now respectively :

        normal approach : 1 - (1 - p^k)^l
        proposed approach : 1 - (1 - p)^qm

where q is the probability that a certain bit is 1 in B :

        q = (1 - ((1 - (1 / m))^(kl)))

with k = 3, m = 256, n = 64, l = 8 for example, we get that p ~= 0.52833 and q ~= 0.08966, so that would give the following chances of false positives :

        normal approach : 1 - (1 - (0.52833)^3)^8 ~= 0.72096
        proposed approach : 1 - (1 - 0.52833)^(0.08966 * 256) ~= 0.99999996

so, the proposed approach has almost a 100% chance of a false positive (for this example).


I know you said not to use maths, but it was the easiest way for me to prove that our intuition is right, and it seems our intuition didn't deceive us heh.


Please do check my calculations, since I just quickly jotted them down.
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 33509032
>> - Each item hashed will set 2 bits in the filter
>> - The definitive set contains 10000 items to be hashed into the filter (setting 2 bits each).
>> - The candidate set is 100 items to be hashed into the candidate filter (setting 2 bits each).
>> - Each filter is 64K bits

The math for these values would be :

        k = 2
        m = 64*1024 = 65536
        n = 10000
        l = 100

        p ~= 0.26301
        q ~= 0.00305

        normal approach : 1 - (1 - (0.26301)^2)^100 ~= 0.99923
        proposed approach : 1 - (1 - 0.26301)^(0.00305 * 65536) ~= 1.00000

So, both approaches would yield false positives with almost 100% probability, but the alternative approach has a higher probability.
0
 
LVL 40

Author Closing Comment

by:evilrix
ID: 33509747
Thanks :)
0

Featured Post

Free Tool: Postgres Monitoring System

A PHP and Perl based system to collect and display usage statistics from PostgreSQL databases.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
What statistical method/formula can be used for Range and Trend? 4 95
Math Question 1 105
Math Equation 23 106
Understanding factorial 4 19
Introduction This article discusses the Chain of Responsibility pattern, explaining What it is;Why it is; andHow it is At the end of this article, I hope you will be able to describe the use and benefits of Chain of Responsibility.  Backgrou…
Article by: Nicole
This is a research brief on the potential colonization of humans on Mars.
Finds all prime numbers in a range requested and places them in a public primes() array. I've demostrated a template size of 30 (2 * 3 * 5) but larger templates can be built such 210  (2 * 3 * 5 * 7) or 2310  (2 * 3 * 5 * 7 * 11). The larger templa…
I've attached the XLSM Excel spreadsheet I used in the video and also text files containing the macros used below. https://filedb.experts-exchange.com/incoming/2017/03_w12/1151775/Permutations.txt https://filedb.experts-exchange.com/incoming/201…

821 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question