
Broder-IM 2004

Network Applications of Bloom Filters: A Survey

Broder and Mitzenmacher

bloom filters hashing set membership compression

@article{broder:im-2004,
  title={Network Applications of {Bloom} Filters: A Survey},
  author={Broder, A. and Mitzenmacher, M.},
  journal={Internet Mathematics},
  volume={1},
  number={4},
  pages={485--509},
  year={2004},
  publisher={AK Peters}
}

Bloom filters: Space efficient set membership data structure

  • Allows false positives, which can be made very rare by appropriate parameter choice
  • Standard structure does not permit element removal; the filter must be rebuilt
  • Provides straightforward set operators

Side benefit: Obfuscation

  • Not a secure mechanism, but determining the members from the filter requires some work

Used in databases since early '70s

Emerging network applications:

  • Summarizing content in overlay and P2P networks
  • Locating resources
  • Speed up or simplify packet routing
  • Summarize measurement data

"The Bloom filter principle: Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated."

Bloom filter basics

  • Straightforward construction algorithm
    • Set S = {x_1, x_2, ..., x_n} of n elements
    • Described by array of m bits, initially 0
    • k independent hash functions h_1, ..., h_k with range {1, ..., m}
      • MD5 is commonly used as hash function
    • For each set member for each hash function, set the array bit indexed by the hash to 1
      • Bits may be set multiple times
      • An interesting, similarly performing twist is to assign each hash function a particular subset of the bit array
        • Potentially speeds up parallel processing
  • To check for membership, apply each hash to the key
    • If any referenced bit is 0 then the key is not in the set
    • If all referenced bits are 1 then the key *might* be in the set
      • It's possible another key or keys set one or more of the bits, creating a potential false positive that must be checked through some other mechanism
  • From elsewhere (flipcode): If the hash is high quality, a single key can be hashed to (k * log m) bits, using each run of log m bits as one hash function's result
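A minimal sketch of the construction above, using salted MD5 to stand in for k independent hash functions (the class and parameter names are illustrative, not from the survey):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m-bit array, k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _indexes(self, key):
        # Derive k hashes by salting MD5 (the notes mention MD5 as a
        # common choice); each index falls in {0, ..., m-1}.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for i in self._indexes(key):
            self.bits[i] = 1          # bits may be set multiple times

    def __contains__(self, key):
        # Any 0 bit: definitely absent.  All 1 bits: *might* be present.
        return all(self.bits[i] for i in self._indexes(key))

bf = BloomFilter(m=256, k=4)
bf.add("alice")
print("alice" in bf)   # always True: members never miss
```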

Given perfect random hashing, probability that a bit is 0 is roughly p = e^(-kn/m)

  • Approximates the true probability to within O(1/m)
  • Probability of false positive is then approximately (1-p)^k
    • False positive rate is minimized when k = ln 2 * (m/n)
  • If e is a desired false positive rate bound, the optimal filter needs m ≈ 1.44 * n * log_2(1/e) bits; n * log_2(1/e) is the information-theoretic lower bound
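Plugging illustrative numbers into the formulas above (ideal random hashing assumed):

```python
import math

n, m = 1000, 10000                   # 1,000 elements, 10 bits each (illustrative)
k = round(math.log(2) * (m / n))     # optimal k = ln 2 * (m/n) -> 7

p = math.exp(-k * n / m)             # probability a given bit is still 0
fp = (1 - p) ** k                    # approximate false positive rate
print(f"k={k}, false positive rate ~ {fp:.4f}")   # under 1% at 10 bits/element
```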

There are schemes that use fewer bits than Bloom filters at the same error rates, but require more computation

  • Compressed Bloom filters, perfect hashing

Several easy standard operators

  • Set union: Apply bitwise or to filters
  • Shrink: Divide the filter in half and bitwise or the two halves
    • Ignore the high order bit when hashing to do a lookup
  • Can estimate size of the intersection, even without knowing size of the sets
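The union and shrink operators above, sketched over plain bit lists (both filters must share the same m and hash functions; the function names are illustrative):

```python
def union(bits_a, bits_b):
    # Set union: bitwise OR of the two filters.
    return [a | b for a, b in zip(bits_a, bits_b)]

def shrink(bits):
    # Halve the filter by ORing its two halves; lookups on the result
    # use the old index modulo m/2 (i.e., ignore the high-order bit).
    half = len(bits) // 2
    return [a | b for a, b in zip(bits[:half], bits[half:])]

print(union([1, 0, 0, 1], [0, 1, 0, 1]))   # [1, 1, 0, 1]
print(shrink([1, 0, 0, 1]))                # [1, 1]
```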

Counting Bloom filters are an extension wherein each array element is a count rather than a bit

  • Four bits works well for most applications (must avoid overflow)
  • Enables element deletion by reducing counts
  • Even if a counter does overflow (unlikely), it only causes a false negative if later deletions bring it back to zero, which would likely take a long time
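A sketch of the counting variant, modeling 4-bit counters by saturating at 15 (all names are illustrative):

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counts = [0] * m

    def _indexes(self, key):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for i in self._indexes(key):
            self.counts[i] = min(self.counts[i] + 1, 15)  # 4-bit cap

    def remove(self, key):
        # Deletion by decrementing -- no rebuild needed.
        for i in self._indexes(key):
            if self.counts[i] > 0:
                self.counts[i] -= 1

    def __contains__(self, key):
        return all(self.counts[i] > 0 for i in self._indexes(key))

cbf = CountingBloomFilter(m=128, k=4)
cbf.add("flow-1")
cbf.remove("flow-1")
print("flow-1" in cbf)   # False: deletion cleared its counters
```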

Compressed Bloom filters: Use a larger but sparser array that maintains the false positive rate, and compress the filter before sending

Database applications

  • Speed up distributed semijoins, e.g. by sending a Bloom filter of join keys between databases (matching rows are sent back to the originating host, which removes false positives)
  • Tracking changed records (consulting logs to remove false positives)
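A toy sketch of the Bloom-filter semijoin idea: one site sends a filter of its join keys, the other ships back only candidate rows, and false positives are removed on return (all names and data are illustrative):

```python
import hashlib

def bloom_indexes(key, m=64, k=3):
    # Salted MD5 stands in for k independent hash functions.
    for i in range(k):
        digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
        yield int(digest, 16) % m

def make_filter(keys, m=64):
    bits = [0] * m
    for key in keys:
        for i in bloom_indexes(key, m):
            bits[i] = 1
    return bits

def candidate_rows(rows, bits):
    # Remote site: ship back only rows whose key might be in the filter.
    return [r for r in rows
            if all(bits[i] for i in bloom_indexes(r[0], len(bits)))]

local_keys = {"a1", "a2"}                        # join keys at the originating site
remote = [("a1", "x"), ("a9", "y"), ("a2", "z")]
shipped = candidate_rows(remote, make_filter(local_keys))
# Originating site removes any false positives with an exact check:
joined = [r for r in shipped if r[0] in local_keys]
print(joined)   # [('a1', 'x'), ('a2', 'z')]
```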

Distributed caching: Collection of Web caches work together to serve local requests

  • Squid uses this; its Cache Digests are Bloom Filters
  • Each cache must know what the other caches hold, so requests can go to them rather than to the origin web server
  • Reduce traffic, state by passing Bloom filters rather than list
  • Using counting filters enables contents to be updated w/out rebuilding as cache is frequently changed

Web annotation

  • Users cooperate to build centralized repository of web page annotations
  • Repository periodically pushes out Bloom filter of pages w/ annotations, which is used to determine if comments should be fetched

Moderate size P2P networks

  • DHTs don't necessarily scale down well to smaller networks, though keeping lists of objects stored at other nodes may also be prohibitive
  • However, keeping Bloom filters for each node, representing its contents, may be feasible
  • Bloom filters may similarly be used for keyword search, similar to database joins
  • Can apply a scheme to aggregate received filters together then propagate on, providing a basis for routing
    • Need to know particular nodes to avoid unnecessary traffic (i.e., sending multiple copies unnecessarily, thinking they're going to different nodes)
    • Need to know hop counts to pick best paths
      • Can solve this with an attenuated Bloom filter
        • For each link maintain an array a of d Bloom filters; each filter a_i represents objects accessible from that link at i hops, up to d away
        • Some other scheme, e.g. flooding, can be used to find objects that fall outside the hop count
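A hypothetical sketch of an attenuated filter for a single link: d layered filters, where layer i holds objects reachable in i hops over that link (names and parameters are illustrative):

```python
import hashlib

def indexes(key, m=128, k=3):
    # Salted MD5 stands in for k independent hash functions.
    for i in range(k):
        digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
        yield int(digest, 16) % m

class AttenuatedFilter:
    def __init__(self, d, m=128):
        # One plain Bloom filter per hop count, 1..d.
        self.layers = [[0] * m for _ in range(d)]

    def add(self, key, hops):
        for i in indexes(key):
            self.layers[hops - 1][i] = 1

    def min_hops(self, key):
        # Smallest hop count at which the key (possibly) matches;
        # None means it is definitely not reachable within d hops.
        for depth, layer in enumerate(self.layers, start=1):
            if all(layer[i] for i in indexes(key)):
                return depth
        return None

link = AttenuatedFilter(d=3)
link.add("object-42", hops=2)
print(link.min_hops("object-42"))   # 2
```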

Duplicate packet detection

  • Each packet keeps a Bloom filter of nodes it has visited; nodes drop packet if they find themselves in the packet
    • Not obvious how to correct from false positives, since they are likely to be a problem for some time if they occur, unless the network changes
      • One mechanism to do this is to change the hash functions periodically, ensuring no one is punished forever
  • Can be used to then trace the packet's path backward, from the destination checking each link to see if it is in the filter and then proceeding from there
    • Can be flipped, with each router keeping a filter of packets it has seen; probably more reasonable for this use, though it requires more querying (asking neighbors to check their filters, whereas a packet-carried filter may be checked locally and then forwarded only to the matching node)

Queue management

  • Use a counting Bloom filter to track how much traffic has been seen for each flow, identified e.g. by source and destination
  • Can similarly be used for measuring heavy flows
    • An upper bound on a flow's traffic is the minimum of its k counters in the filter
      • Interesting trick (conservative update): a packet of B bytes hashes to k counters; the flow can account for at most M bytes, the minimum of those counters; rather than add B to each counter, set each counter c_i to max(c_i, M + B), i.e. never increase a counter already higher than the current flow could explain; this reduces the effect of false positives in practice
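The trick above sketched directly over counters (hashing elided; idx stands for the k counter positions the packet hashed to):

```python
def conservative_update(counters, idx, nbytes):
    # Upper bound on this flow so far: minimum of its k counters.
    estimate = min(counters[i] for i in idx)
    target = estimate + nbytes
    for i in idx:
        # Never raise a counter above what the flow could account for;
        # counters already higher reflect other flows' traffic.
        counters[i] = max(counters[i], target)
    return target

counters = [0, 5, 0, 9]
conservative_update(counters, idx=[0, 1, 3], nbytes=4)
print(counters)   # [4, 5, 0, 9]: only the minimum-bound counter rises
```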

Of note:

  • Cuena-Acuna 2002
  • Ledlie 2002
  • Reynolds and Vahdat 2003
  • Czerwinski 1999
  • Rhea and Kubiatowicz 2002
  • Estan and Varghese
  • Chazelle, Kilian, Rubinfeld, and Tal 2004
  • Cormode and Muthukrishnan 2004