Compressed Bloom Filters Michael Mitzenmacher Jonathan Ledlie (jonathan@eecs) October 11, 2001 The author argues that compressed bloom filters can achieve similarly acceptable rates of false positives than their regularly-sized compatriots and use fewer bits at the same time. The author says that the possible drawback of increased computation time due to compression, decompression, and potentially different numbers of hash functions is offset by the bonus of having filters of smaller sizes when they are passed as messages, like in the Squid web cache hierarchy proposed by Fan and Cai (Go Badgers!). While I agree that having smaller messages is good, I think that the number of messages passing between the caches was the real problem and this compression mechanism leaves the number of messages the same. Still, it certainly does not increase the number of messages and it still allows filter deltas to work properly. [I'm not so sure how they work. Here's my guess] Compressed bloom filters work by choosing a different p = 1/2, where p is the probability of an entry in the filter being either 0 or 1. By building our original filter with functions which set the bits in the filter at lower rates than 1/2, the filter has less entropy, and is therefore more compressible. The compressed filter is decompressed before being probed at the receiver. I do not understand why the paper shows compressed bloom filter false positive rates vs uncompressed, unless the meaning of these experiments is to show that filters generated where p != 1/2, then compressed, sent, decompressed, and used is about as good (or better) than standard, uncompressed filters (originally the paper seemed to be saying that the compressed filters were probed directly, and those were the results). The compressed bloom filters use arithmetic encoding for three main reasons: 1) arithmetic encoding can exploit the decreased entropy caused by setting the bits in the filter to 1 with differing probabilities than 1/2, 2) it tends to output consistent sizes and this is beneficial when filters might be squeezed into IP packets, and 3) it is fast.