Active Probing Approach for Fault Localization in Computer Networks
Natu, Sethi
fault management detection networks probing
@inproceedings{natu:e2emon-2006,
title={Active Probing Approach for Fault Localization in Computer Networks},
author={Natu, M. and Sethi, A.S.},
booktitle={{IEEE}/{IFIP} Workshop on End-to-End Monitoring
Techniques and Services ({E2EMON})},
month={April},
year={2006}
}
Measuring a variety of aspects: Connectivity, link or node failure, bandwidth, traffic levels, loss, jitter, path MTU, SLA violations (response time thresholds, loss thresholds), topology
Passive measures sample at a set point, e.g., throughput at a node and packet size distribution
Active measures capture information about paths, e.g., latency, loss, route availability
- Can selectively probe to determine the specific problem point
- May include application layer probing, e.g. HTTP requests
Three primary steps in fault localization
- Probe station selection
- Problem detection
- Problem determination
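Sketch of the detection and determination steps, to make the idea concrete. This is not the paper's algorithm: it assumes each probe is described by the set of nodes on its path, picks detection probes with a plain greedy set cover, and narrows suspects by elimination. Probe station selection is left out. The names `select_detection_probes`, `localize`, and the toy topology are made up for illustration.

```python
from typing import Dict, List, Set, Tuple


def select_detection_probes(probes: Dict[str, Set[str]], nodes: Set[str]) -> List[str]:
    """Greedy set cover: choose probes until every node lies on a chosen path."""
    uncovered = set(nodes)
    chosen: List[str] = []
    while uncovered:
        # Pick the probe covering the most still-uncovered nodes (ties: name order).
        best = max(sorted(probes), key=lambda p: len(probes[p] & uncovered))
        if not probes[best] & uncovered:
            break  # some nodes are not on any available probe path
        chosen.append(best)
        uncovered -= probes[best]
    return chosen


def localize(probes: Dict[str, Set[str]], results: Dict[str, bool]) -> Tuple[Set[str], Set[str]]:
    """Return (healthy, suspects) given probe outcomes (True = probe succeeded)."""
    healthy: Set[str] = set()
    for name, ok in results.items():
        if ok:
            healthy |= probes[name]  # every node on a successful path is up
    suspects: Set[str] = set()
    for name, ok in results.items():
        if not ok:
            suspects |= probes[name] - healthy  # failed path, not yet cleared
    return healthy, suspects


# Hypothetical topology: probe name -> nodes on its path.
probes = {"p1": {"A", "B"}, "p2": {"A", "C"}, "p3": {"B", "C", "D"}}

detection = select_detection_probes(probes, {"A", "B", "C", "D"})
print(detection)  # ['p3', 'p1']

# p3 fails, p1 succeeds: C and D both remain suspects.
_, suspects = localize(probes, {"p1": True, "p3": False})
print(sorted(suspects))  # ['C', 'D']

# Sending one extra probe (p2) that succeeds clears C and leaves D.
_, suspects = localize(probes, {"p1": True, "p3": False, "p2": True})
print(sorted(suspects))  # ['D']
```

The second call shows why a small detection set alone is not enough: extra probes are chosen only after a failure, and only over the suspect nodes.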
Pre-planned probing, followed by passive data mining
- Lots of management traffic?
- Can't predict faults that might occur
- Delay before scheduled probes detect problem
Active probing with a small number of probes, expanded in type and quantity to explore potential problems
- 1-packet: Estimate link bandwidth from round trip delays of different sized packets, assuming delay grows linearly with size
- Pair: Measure increase in gap between two packets to estimate bottleneck conditions
- Train: Similar idea, using a longer burst of back-to-back packets instead of a pair
- Tailgating: Trains of large packets with limited TTL interleaved with small packets of higher TTL... ???
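Rough arithmetic for the 1-packet and pair techniques above. A sketch under simplifying assumptions: a single bottleneck link, round-trip delay modeled as `base + size / bandwidth`, and queueing noise filtered by taking the minimum delay seen per packet size. Function names and sample values are hypothetical, not from the paper.

```python
from typing import Dict, List, Tuple


def one_packet_bandwidth(samples: List[Tuple[int, float]]) -> float:
    """Estimate bandwidth (bytes/s) from (size_bytes, rtt_seconds) samples.

    Fits a least-squares line to the minimum RTT per packet size; under the
    linear-delay assumption the slope is 1 / bandwidth.
    """
    min_rtt: Dict[int, float] = {}
    for size, rtt in samples:
        min_rtt[size] = min(rtt, min_rtt.get(size, float("inf")))
    if len(min_rtt) < 2:
        raise ValueError("need at least two distinct packet sizes")
    sizes = sorted(min_rtt)
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(min_rtt[s] for s in sizes) / n
    sxx = sum((s - mean_x) ** 2 for s in sizes)
    sxy = sum((s - mean_x) * (min_rtt[s] - mean_y) for s in sizes)
    slope = sxy / sxx  # seconds per byte
    if slope <= 0:
        raise ValueError("delay does not grow with size; cannot estimate")
    return 1.0 / slope


def packet_pair_bandwidth(size_bytes: int, gap_seconds: float) -> float:
    """Estimate bottleneck bandwidth (bytes/s) from the receive gap of a pair.

    Two back-to-back packets leave the bottleneck separated by the time it
    takes to transmit the second one, so bandwidth = size / gap.
    """
    if gap_seconds <= 0:
        raise ValueError("gap must be positive")
    return size_bytes / gap_seconds


# Synthetic samples for a 1,000,000 bytes/s link with 10 ms base delay,
# plus two samples inflated by queueing that the min filter discards.
samples = [(s, 0.01 + s / 1_000_000) for s in (100, 500, 1000, 1500)]
samples += [(500, 0.05), (1500, 0.08)]
print(round(one_packet_bandwidth(samples)))
print(round(packet_pair_bandwidth(1500, 0.0012)))
```

Both estimates degrade under cross traffic, which is presumably why the train and tailgating variants exist.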
Event correlation
Could instrument everything to emit alarms when conditions change
- But alarms may not arrive, may not trigger successfully, etc.
Can be difficult to determine if particular links are down
What if the probe station fails?
- Must monitor the monitoring
Of note:
- skitter, for topology probing