A Performance vs. Cost Framework for Evaluating DHT Design Tradeoffs Under Churn
Li et al.
dht network analysis study performance simulation p2p experiment
@inproceedings{li:infocom-2005,
  title={A Performance vs. Cost Framework for Evaluating {DHT} Design Tradeoffs Under Churn},
  author={Li, J. and Stribling, J. and Morris, R. and Kaashoek, M.F. and Gil, T.M.},
  booktitle={24th Annual Joint Conference of the {IEEE} Computer and Communications Societies ({INFOCOM})},
  volume={1},
  pages={225--236},
  year={2005},
  organization={IEEE}
}
Evaluating lookups on static networks favors large routing tables, since without churn they yield big latency payoffs while incurring few maintenance penalties
DHTs should adjust routing table size based on available bandwidth and churn
Should snoop on existing lookup traffic to construct routes, rather than conduct active routing-table stabilization (a toy sketch follows)
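A toy sketch of what that passive learning might look like, in Python. The XOR-prefix slot rule, class, and method names below are invented for illustration; they are not taken from the paper or from any of the protocols it simulates:

```python
# Every message a node handles (a lookup's source, hops listed in a reply)
# refreshes its routing table for free, so dedicated stabilization traffic
# is only needed when lookup traffic is scarce.
class PassiveRoutingTable:
    def __init__(self, my_id: int):
        self.my_id = my_id
        self.slots: dict[int, int] = {}  # prefix-match level -> node ID

    def _slot(self, node_id: int) -> int:
        # Longer shared ID prefix with us -> smaller XOR -> lower slot.
        return (self.my_id ^ node_id).bit_length()

    def observe(self, node_id: int) -> None:
        """Snoop any traffic and keep the most recently seen node per
        slot, displacing possibly-dead older entries."""
        if node_id != self.my_id:
            self.slots[self._slot(node_id)] = node_id
```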
Important to look at cost of protocol features alongside performance gains
- Cost here: Bytes transmitted
- Metrics: Lookup latency, lookup failure rate
Parameters make it hard to gauge performance---results could be significantly affected by the chosen values
- Protocol parameters and scenario parameters
Two setups: churn-intensive and lookup-intensive
Estimates the effect of a parameter by comparing the convex hull of performance vs. cost with that parameter held fixed against the overall convex hull obtained by varying all parameters (sketched below)
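A minimal sketch of that hull comparison, assuming each simulation run is recorded as its parameter settings plus a measured cost (bytes per node per second) and mean lookup latency. The synthetic sweep, the parameter names, and the `parameter_effect` helper are invented here; the paper drives p2psim over real protocol parameter spaces:

```python
from itertools import product
import random


def efficient_frontier(points):
    """Lower convex hull of (cost, latency) points via Andrew's monotone
    chain, trimmed to the segment where latency still decreases: the best
    achievable latency at each cost budget."""
    pts = sorted(set(points))
    if not pts:
        return []
    hull = []
    for p in pts:
        # Pop while the last two hull points and p fail to turn
        # counter-clockwise, i.e. hull[-1] lies on or above hull[-2]->p.
        while len(hull) >= 2:
            (ox, oy), (ax, ay) = hull[-2], hull[-1]
            if (ax - ox) * (p[1] - oy) - (ay - oy) * (p[0] - ox) <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    lowest = min(range(len(hull)), key=lambda i: hull[i][1])
    return hull[:lowest + 1]


def parameter_effect(results, param, value):
    """Frontier over all runs vs. the frontier when `param` is pinned
    to `value`."""
    everything = [(r["cost"], r["latency"]) for r in results]
    pinned = [(r["cost"], r["latency"]) for r in results
              if r["params"][param] == value]
    return efficient_frontier(everything), efficient_frontier(pinned)


# Fake sweep over two parameters with a made-up cost/latency model.
random.seed(0)
results = []
for base, stab_s in product([2, 4, 8, 16], [4, 16, 64, 256]):
    results.append({
        "params": {"base": base, "stab_interval_s": stab_s},
        "cost": 1.5 * base + 1000 / stab_s + random.random(),
        "latency": 400 / base + 0.5 * stab_s + 10 * random.random(),
    })

overall, pinned = parameter_effect(results, "base", 2)
```

A large gap between the pinned frontier and the overall frontier means the parameter matters for that workload; near-overlap means it can be fixed without giving up cost-efficiency.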
Verbatim quotes:
- Minimizing lookup latency requires complex workload-dependent parameter tuning.
- The ability of a protocol to control its bandwidth usage has a direct impact on its scalability and performance under different network sizes.
- DHTs that distinguish between state used for the correctness of lookups and state used for lookup performance can more efficiently achieve low lookup failure rates under churn.
- The strictness of a DHT protocol's routing distance metric, while useful for ensuring progress during lookups, limits the number of possible paths, causing poor performance under pathological network conditions such as non-transitive connectivity.
- Increasing routing table size to reduce the number of expected lookup hops is a more cost-efficient way to cope with churn-related timeouts than stabilizing more often.
- Issuing copies of a lookup along many paths in parallel is more effective at reducing lookup latency due to timeouts than faster stabilization under a churn intensive workload.
- Learning about new nodes during the lookup process can essentially eliminate the need for stabilization in some workloads.
- Increasing the rate of lookups in the workload, relative to the rate of churn, favors all design choices that reduce the overall lookup traffic. For example, one should use extra state to reduce lookup hops (and hence forwarded lookup traffic). Less lookup parallelism is also preferred as it generates less redundant lookup traffic.
Other comments:
That said, it's important to keep in mind that it's essentially looking at two particular aspects of the typical DHT scenario: membership churn and workload. In some respects, I would consider these just slightly secondary to the overriding DHT consideration of scalability. That's what DHTs stem from, and that's what they primarily address. In contrast, if churn or workload were the overriding considerations, you would see more emphasis on things like broadcast (for churn) and multicast propagation/result caching (for workload). The paper needs to be interpreted with an implicit assumption of scalability as the common strong goal of the protocols, which the paper doesn't really address up front. By not directly incorporating scalability, the study opens itself up to including protocols with slightly different goals, skewing the results. In the event, that's why you get the somewhat awkward section where OneHop comes up as the dominating protocol and they have to do a separate study justifying that you can't actually say it's the best because it doesn't scale as well.
A much more minor point, but an interesting question, revolves around the measurement period. The experiments run for 6 simulated hours, and they collect data only in the second half. This is a reasonable but ad hoc, brute-force answer to the question: how do you determine when the system has reached a quiescent state, ready for "fair" measuring, when the scenario itself is working against quiescence? This is particularly true when you have a variety of very different protocols and no clear expectations. Obviously, a human observer could pick up on that point as the run goes, but applying an automatic mechanism seems to require a good amount of instrumentation and analysis. It's probably worth it, however, when running such a large set of computationally expensive experiments, to try to closely estimate the onset of the quiescent state.
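One simple automatic mechanism, sketched below under the assumption that the simulator can emit a stability-sensitive metric once per sampling interval (say, per-interval mean lookup latency). The window size, tolerance, confirmation count, and function name are all invented here, not taken from the paper:

```python
def steady_state_onset(samples, window=30, tol=0.05, confirmations=3):
    """Index of the first sample of a run of `confirmations` consecutive
    windows whose means each agree with the preceding window within `tol`
    relative difference; None if the metric never settles."""
    hits = 0
    for start in range(window, len(samples) - window + 1, window):
        prev = samples[start - window:start]
        curr = samples[start:start + window]
        m_prev = sum(prev) / window
        m_curr = sum(curr) / window
        if abs(m_curr - m_prev) <= tol * max(abs(m_prev), 1e-9):
            hits += 1
            if hits >= confirmations:
                # Stable since the first of the confirming windows.
                return start - (confirmations - 1) * window
        else:
            hits = 0
    return None
```

Measurement would then start at the detected onset for each run rather than at the fixed three-hour mark, at the cost of instrumenting every protocol to export a comparable metric.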