Consider the following abstract idea, for bloom filter matching, that was proposed to me as an alternative to looking up individual candidates one at a time...

Assume we are using a general purpose bloom filter that sets K bits per entry where K is undetermined but > 1

We build a bloom filter for a collection (set A). We then have a candidate list of items we want to check against that filter, so we build another filter out of the candidate list (set B). We then perform a bit intersection (binary AND operation) on the two filters. It is assumed that if any of the bits in the result are set, there is a reasonably good probability that one of the items in set B can be found in set A.
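To make the idea concrete, here is a minimal sketch of the proposed scheme. The 256-bit filter size, 3 hashes per entry, and salted-SHA-256 bit derivation are my own illustrative choices, not part of the proposal:

```python
import hashlib

M = 256  # filter size in bits (illustrative choice)
K = 3    # bits set per entry (illustrative choice)

def bit_positions(item: str) -> list[int]:
    # derive K bit positions from K salted SHA-256 hashes; a stand-in
    # for whatever hash functions the real filter would use
    return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % M
            for i in range(K)]

def build_filter(items) -> int:
    bits = 0
    for item in items:
        for pos in bit_positions(item):
            bits |= 1 << pos
    return bits

def intersect_matches(filter_a: int, filter_b: int) -> bool:
    # the proposed test: AND the filters, report a match if ANY bit survives
    return (filter_a & filter_b) != 0
```

A single shared bit between any item of A and any item of B is enough for `intersect_matches` to report a match, which is exactly the "any bit matches any bit" condition at issue.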

It is my contention that, actually, the probability of a false positive is very high, because we are saying that if any bit in filter A matches any bit in filter B there is a match. I see this as being no different from the Birthday Paradox, where we ask whether any two people in the room share a birthday.

Am I wrong or right and (without going beyond simple high school maths - as I am really not that clever) can you prove it? Would this still be an issue if we only set 1 bit per entry?

NB. This is not a "homework question" - check my profile - it is a real world discussion I am having.

Thank you.

Intuitively, I'd come to the same conclusion.

But let's try to do the math.

Take a simple example of a set B of size 1. That would be the normal case when checking whether an item is present in the bloom filter. However, there is a large difference between the two approaches to performing that check:

normal approach: check if all bits of B are set in A

proposed approach: check if at least one bit of B is set in A

The chances of false positives are, respectively (where m is the number of bits in the bloom filter, k is the number of hash functions, and n is the number of items stored in the bloom filter):

normal approach: p^k

proposed approach: 1 - (1 - p)^k

where p is the probability that a given bit is 1 in A:

p = 1 - (1 - 1/m)^(kn)

with k = 3, m = 256, n = 64 for example, we get p ~= 0.52833, which gives the following chances of false positives:

normal approach: (0.52833)^3 ~= 0.14747

proposed approach: 1 - (1 - 0.52833)^3 ~= 0.89507

So, for this example, the proposed approach has roughly a 6 times higher chance of a false positive than the normal approach: almost 90%.
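Those single-item numbers are easy to check numerically; here is the calculation in plain Python, just plugging the example parameters into the formulas above:

```python
# single-item example: k = 3, m = 256, n = 64
m, k, n = 256, 3, 64

# probability that a given bit is 1 in A
p = 1 - (1 - 1 / m) ** (k * n)   # ~0.52833

normal = p ** k                  # all k bits set  -> ~0.14747
proposed = 1 - (1 - p) ** k      # any bit set     -> ~0.895

print(p, normal, proposed)
```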

Now, let's extrapolate that to a set B of size l:

normal approach: for each of the l items, check sequentially whether all bits from its hash functions are set in A

proposed approach: check if at least one bit of B is set in A

The chances of false positives are now, respectively:

normal approach: 1 - (1 - p^k)^l

proposed approach: 1 - (1 - p)^(qm)

where q is the probability that a given bit is 1 in B:

q = 1 - (1 - 1/m)^(kl)

with k = 3, m = 256, n = 64, l = 8 for example, we get p ~= 0.52833 and q ~= 0.08966, which gives the following chances of false positives:

normal approach: 1 - (1 - (0.52833)^3)^8 ~= 0.72096

proposed approach: 1 - (1 - 0.52833)^(0.08966 * 256) ~= 0.99999996

So, for this example, the proposed approach has an almost 100% chance of reporting a false positive.
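And the same numeric check for the set-of-8 case:

```python
# set example: k = 3, m = 256, n = 64, l = 8
m, k, n, l = 256, 3, 64, 8

p = 1 - (1 - 1 / m) ** (k * n)   # bit set in A, ~0.52833
q = 1 - (1 - 1 / m) ** (k * l)   # bit set in B, ~0.08966

normal = 1 - (1 - p ** k) ** l    # any of the l lookups falsely matches, ~0.721
proposed = 1 - (1 - p) ** (q * m) # any of B's ~q*m set bits also set in A

print(normal, proposed)           # proposed is vanishingly close to 1
```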

I know you asked to keep things to simple high school maths, and I don't think the above goes beyond that; it was the easiest way to show that our intuition is right, and it seems it didn't deceive us, heh.

Please do check my calculations, since I just quickly jotted them down.
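As one extra sanity check, here is a quick Monte Carlo simulation of the single-candidate case. It idealises the hash functions as uniform random bit positions, which is the same assumption the formulas make, and it lands near the predicted ~0.895 false-positive rate:

```python
import random

random.seed(1)

M, K, N = 256, 3, 64   # same parameters as the worked example
TRIALS = 3000

false_positives = 0
for _ in range(TRIALS):
    # build filter A from N items, each setting K (idealised) random bits
    filter_a = set()
    for _ in range(N):
        filter_a.update(random.randrange(M) for _ in range(K))
    # a fresh candidate that was never inserted into A
    candidate = [random.randrange(M) for _ in range(K)]
    # the proposed "any bit" test
    if any(bit in filter_a for bit in candidate):
        false_positives += 1

rate = false_positives / TRIALS
print(rate)  # clusters around the predicted ~0.895
```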