
Broder-IM 2004

Network Applications of Bloom Filters: A Survey

Broder and Mitzenmacher

bloom filters hashing set membership compression

@article{broder:im-2004,
  title={Network Applications of {Bloom} Filters: A Survey},
  author={Broder, A. and Mitzenmacher, M.},
  journal={Internet Mathematics},
  volume={1},
  number={4},
  pages={485--509},
  year={2004},
  publisher={AK Peters}
}

Bloom filters: Space efficient set membership data structure

  • Allows false positives, which can be made very rare by appropriate parameter choice
  • Standard structure does not permit element removal; the filter must be rebuilt
  • Provides straightforward set operators

Side benefit: Obfuscation

  • Not a secure mechanism, but determining the members from the filter requires some work

Used in databases since early '70s

Emerging network applications:

  • Summarizing content in overlay and P2P networks
  • Locating resources
  • Speed up or simplify packet routing
  • Summarize measurement data

"The Bloom filter principle: Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated."

Bloom filter basics

  • Straightforward construction algorithm
    • Set S = {x_1, x_2, ..., x_n} of n elements
    • Described by array of m bits, initially 0
    • k independent hash functions h_1, ..., h_k with range {1, ..., m}
      • MD5 is commonly used as hash function
    • For each set member for each hash function, set the array bit indexed by the hash to 1
      • Bits may be set multiple times
      • An interesting, similarly performing twist is to assign each hash function a particular subset of the bit array
        • Potentially speeds up parallel processing
  • To check for membership, apply each hash to the key
    • If any referenced bit is 0 then the key is not in the set
    • If all referenced bits are 1 then the key *might* be in the set
      • It's possible another key or keys set one or more of the bits, creating a potential false positive that must be checked through some other mechanism
  • From elsewhere (flipcode): If the hash is high quality, a single key can be hashed to (k * log m) bits, using each run of log m bits as one hash function's result
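A minimal sketch of the construction above, using salted MD5 to stand in for k independent hash functions (the class and parameter names are illustrative, not from the survey):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m-bit array, k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _indexes(self, key):
        # Derive k hashes by salting MD5 (the notes mention MD5 as a
        # common choice); each index falls in {0, ..., m-1}.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for i in self._indexes(key):
            self.bits[i] = 1          # bits may be set multiple times

    def __contains__(self, key):
        # Any 0 bit: definitely absent.  All 1 bits: *might* be present.
        return all(self.bits[i] for i in self._indexes(key))

bf = BloomFilter(m=256, k=4)
bf.add("alice")
print("alice" in bf)   # always True: members never miss
```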

Given perfect random hashing, probability that a bit is 0 is roughly p = e^(-kn/m)

  • Approximates the true probability to within O(1/m)
  • Probability of false positive is then approximately (1-p)^k
    • False positive rate is minimized when k = ln 2 * (m/n)
  • If e is a desired false positive rate bound, the optimal filter needs m ≈ 1.44 * n * log_2(1/e) bits; n * log_2(1/e) is the information-theoretic lower bound
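Plugging illustrative numbers into the formulas above (ideal random hashing assumed):

```python
import math

n, m = 1000, 10000                   # 1,000 elements, 10 bits each (illustrative)
k = round(math.log(2) * (m / n))     # optimal k = ln 2 * (m/n) -> 7

p = math.exp(-k * n / m)             # probability a given bit is still 0
fp = (1 - p) ** k                    # approximate false positive rate
print(f"k={k}, false positive rate ~ {fp:.4f}")   # under 1% at 10 bits/element
```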

There are schemes that use fewer bits than Bloom filters at the same error rates, but require more computation

  • Compressed Bloom filters, perfect hashing

Several easy standard operators

  • Set union: Apply bitwise or to filters
  • Shrink: Divide the filter in half and bitwise or the two halves
    • Ignore the high order bit when hashing to do a lookup
  • Can estimate size of the intersection, even without knowing size of the sets
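The union and shrink operators above, sketched over plain bit lists (both filters must share the same m and hash functions; the function names are illustrative):

```python
def union(bits_a, bits_b):
    # Set union: bitwise OR of the two filters.
    return [a | b for a, b in zip(bits_a, bits_b)]

def shrink(bits):
    # Halve the filter by ORing its two halves; lookups on the result
    # use the old index modulo m/2 (i.e., ignore the high-order bit).
    half = len(bits) // 2
    return [a | b for a, b in zip(bits[:half], bits[half:])]

print(union([1, 0, 0, 1], [0, 1, 0, 1]))   # [1, 1, 0, 1]
print(shrink([1, 0, 0, 1]))                # [1, 1]
```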

Counting Bloom filters are an extension wherein each array element is a count rather than a bit

  • Four bits works well for most applications (must avoid overflow)
  • Enables element deletion by reducing counts
  • Even if a counter does overflow (unlikely), it only causes a false negative if later deletions bring it back to zero, which would likely take a long time
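A sketch of the counting variant, modeling 4-bit counters by saturating at 15 (all names are illustrative):

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counts = [0] * m

    def _indexes(self, key):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for i in self._indexes(key):
            self.counts[i] = min(self.counts[i] + 1, 15)  # 4-bit cap

    def remove(self, key):
        # Deletion by decrementing -- no rebuild needed.
        for i in self._indexes(key):
            if self.counts[i] > 0:
                self.counts[i] -= 1

    def __contains__(self, key):
        return all(self.counts[i] > 0 for i in self._indexes(key))

cbf = CountingBloomFilter(m=128, k=4)
cbf.add("flow-1")
cbf.remove("flow-1")
print("flow-1" in cbf)   # False: deletion cleared its counters
```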

Compressed Bloom filters: Use a larger but sparser array that maintains the false positive rate, and compress the filter before sending

Database applications

  • Speed up distributed semijoins, e.g. by sending a Bloom filter of join keys between databases (matching rows are sent back to the originating host, which removes false positives)
  • Tracking changed records (consulting logs to remove false positives)
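A toy sketch of the Bloom-filter semijoin idea: one site sends a filter of its join keys, the other ships back only candidate rows, and false positives are removed on return (all names and data are illustrative):

```python
import hashlib

def bloom_indexes(key, m=64, k=3):
    # Salted MD5 stands in for k independent hash functions.
    for i in range(k):
        digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
        yield int(digest, 16) % m

def make_filter(keys, m=64):
    bits = [0] * m
    for key in keys:
        for i in bloom_indexes(key, m):
            bits[i] = 1
    return bits

def candidate_rows(rows, bits):
    # Remote site: ship back only rows whose key might be in the filter.
    return [r for r in rows
            if all(bits[i] for i in bloom_indexes(r[0], len(bits)))]

local_keys = {"a1", "a2"}                        # join keys at the originating site
remote = [("a1", "x"), ("a9", "y"), ("a2", "z")]
shipped = candidate_rows(remote, make_filter(local_keys))
# Originating site removes any false positives with an exact check:
joined = [r for r in shipped if r[0] in local_keys]
print(joined)   # [('a1', 'x'), ('a2', 'z')]
```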

Distributed caching: Collection of Web caches work together to serve local requests

  • Squid uses this; its Cache Digests are Bloom Filters
  • Each cache must know what the other caches hold, so requests can go to them rather than to the origin web server
  • Reduce traffic, state by passing Bloom filters rather than list
  • Using counting filters enables contents to be updated w/out rebuilding as cache is frequently changed

Web annotation

  • Users cooperate to build centralized repository of web page annotations
  • Repository periodically pushes out Bloom filter of pages w/ annotations, which is used to determine if comments should be fetched

Moderate size P2P networks

  • DHTs don't necessarily scale down well to smaller networks, though keeping lists of objects stored at other nodes may also be prohibitive
  • However, keeping Bloom filters for each node, representing its contents, may be feasible
  • Bloom filters may similarly be used for keyword search, similar to database joins
  • Can apply a scheme to aggregate received filters together then propagate on, providing a basis for routing
    • Need to know particular nodes to avoid unnecessary traffic (i.e., sending multiple copies unnecessarily, thinking they're going to different nodes)
    • Need to know hop counts to pick best paths
      • Can solve this with an attenuated Bloom filter
        • For each link maintain an array a of d Bloom filters; each filter a_i represents objects accessible from that link at i hops, up to d away
        • Some other scheme, e.g. flooding, can be used to find objects that fall outside the hop count
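A hypothetical sketch of an attenuated filter for a single link: d layered filters, where layer i holds objects reachable in i hops over that link (names and parameters are illustrative):

```python
import hashlib

def indexes(key, m=128, k=3):
    # Salted MD5 stands in for k independent hash functions.
    for i in range(k):
        digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
        yield int(digest, 16) % m

class AttenuatedFilter:
    def __init__(self, d, m=128):
        # One plain Bloom filter per hop count, 1..d.
        self.layers = [[0] * m for _ in range(d)]

    def add(self, key, hops):
        for i in indexes(key):
            self.layers[hops - 1][i] = 1

    def min_hops(self, key):
        # Smallest hop count at which the key (possibly) matches;
        # None means it is definitely not reachable within d hops.
        for depth, layer in enumerate(self.layers, start=1):
            if all(layer[i] for i in indexes(key)):
                return depth
        return None

link = AttenuatedFilter(d=3)
link.add("object-42", hops=2)
print(link.min_hops("object-42"))   # 2
```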

Duplicate packet detection

  • Each packet keeps a Bloom filter of nodes it has visited; nodes drop packet if they find themselves in the packet
    • Not obvious how to correct from false positives, since they are likely to be a problem for some time if they occur, unless the network changes
      • One mechanism to do this is to change the hash functions periodically, ensuring no one is punished forever
  • Can be used to then trace the packet's path backward, from the destination checking each link to see if it is in the filter and then proceeding from there
    • Can be flipped, with each router keeping a filter of packets it has seen; probably more reasonable for this use, though it requires more querying (asking neighbors to check their filters, whereas a packet-carried filter may be checked locally and then forwarded only to the matching node)

Queue management

  • Use a counting Bloom filter to track how much traffic has been seen for each flow, identified e.g. by source and destination
  • Can similarly be used for measuring heavy flows
    • An upper bound on a flow's traffic is the minimum of its k counters in the filter
      • Interesting trick (conservative update): a packet of B bytes hashes to k counters; the flow can account for at most M bytes, the minimum of those counters; rather than add B to each counter, set each counter c_i to max(c_i, M + B), i.e. never increase a counter already higher than the current flow could explain; this reduces the effect of false positives in practice
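The trick above sketched directly over counters (hashing elided; idx stands for the k counter positions the packet hashed to):

```python
def conservative_update(counters, idx, nbytes):
    # Upper bound on this flow so far: minimum of its k counters.
    estimate = min(counters[i] for i in idx)
    target = estimate + nbytes
    for i in idx:
        # Never raise a counter above what the flow could account for;
        # counters already higher reflect other flows' traffic.
        counters[i] = max(counters[i], target)
    return target

counters = [0, 5, 0, 9]
conservative_update(counters, idx=[0, 1, 3], nbytes=4)
print(counters)   # [4, 5, 0, 9]: only the minimum-bound counter rises
```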

Of note:

  • Cuena-Acuna 2002
  • Ledlie 2002
  • Reynolds and Vahdat 2003
  • Czerwinski 1999
  • Rhea and Kubiatowicz 2002
  • Estan and Varghese
  • Chazelle, Kilian, Rubinfeld, and Tal 2004
  • Cormode and Muthukrishnan 2004