Active Probing Approach for Fault Localization in Computer Networks

Natu, Sethi

fault management detection networks probing

  title={Active Probing Approach for Fault Localization in Computer Networks},
  author={Natu, M. and Sethi, A.S.},
  booktitle={{IEEE}/{IFIP} Workshop on End-to-End Monitoring
             Techniques and Services ({E2EMON})},

Measurements can target a variety of aspects: connectivity, link or node failure, bandwidth, traffic levels, loss, jitter, path MTU, SLA violations (response-time thresholds, loss thresholds), and topology

Passive measurements sample at a fixed point, e.g., throughput at a node or the packet-size distribution

Active measurements inject probes and capture information about paths, e.g., latency, loss, route availability

  • Probes can be sent selectively to pinpoint the specific problem point
  • May include application-layer probing, e.g., HTTP requests

Three primary steps in fault localization

  • Probe station selection
  • Problem detection
  • Problem determination
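The detection and determination steps can be sketched with a simple path model. This is a minimal sketch under assumptions not taken from the paper: each candidate probe follows a known, fixed set of nodes; at most one node fails at a time; and probe station selection is treated as already done. Detection uses greedy set cover so that every node lies on at least one probe; determination keeps the nodes that lie on every failed probe and on no successful one. The probe names, topology, and function names are hypothetical.

```python
from typing import Dict, List, Set

Probe = str
Node = str

# Hypothetical topology: probe station S reaches nodes A..E over fixed routes.
EXAMPLE_PATHS: Dict[Probe, Set[Node]] = {
    "p1": {"S", "A", "B"},
    "p2": {"S", "A", "C"},
    "p3": {"S", "D"},
    "p4": {"S", "A", "B", "E"},
}


def select_detection_probes(paths: Dict[Probe, Set[Node]]) -> List[Probe]:
    """Problem detection: greedily pick probes until every node is on some probe."""
    uncovered: Set[Node] = set().union(*paths.values())
    chosen: List[Probe] = []
    while uncovered:
        # Take the probe covering the most still-uncovered nodes (ties: first listed).
        best = max(paths, key=lambda p: len(paths[p] & uncovered))
        chosen.append(best)
        uncovered -= paths[best]
    return chosen


def suspect_nodes(paths: Dict[Probe, Set[Node]], results: Dict[Probe, bool]) -> Set[Node]:
    """Problem determination under a single-failure assumption.

    results maps each probe sent to True (succeeded) or False (failed).
    """
    failed = [paths[p] for p, ok in results.items() if not ok]
    if not failed:
        return set()
    healthy: Set[Node] = set().union(*(paths[p] for p, ok in results.items() if ok))
    # The single faulty node must lie on every failed probe and on no successful one.
    return set.intersection(*failed) - healthy
```

With the greedy choice ["p4", "p2", "p3"], a failure of p4 alone leaves {"B", "E"} as suspects; sending p1 as a follow-up probe and seeing it succeed narrows this to {"E"}.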

Pre-planned probing, followed by passive data mining

  • Lots of management traffic?
  • Can't anticipate every fault that might occur
  • Delay before scheduled probes detect a problem

Alternative: start active probing with a small number of probes, then expand their type and quantity to explore potential problems
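One simple way to expand probing around a failed probe is binary search over hop-limited probes along the failed path; this is a stand-in technique, not necessarily the paper's algorithm. The sketch assumes a hypothetical `reaches(i)` callback that sends a TTL-limited probe to hop i and reports whether that hop answered, and that a single fault makes every hop at or beyond it silent, so the answers are monotone. Localization then takes about log2(n) probes instead of n.

```python
from typing import Callable, List, Optional, Tuple


def locate_first_failure(
    path: List[str], reaches: Callable[[int], bool]
) -> Tuple[Optional[str], int]:
    """Binary-search a failed path for the first hop that does not answer.

    reaches(i) is a hypothetical callback: send a probe limited to hop i
    (index into path) and return True if hop i replied. Assumes replies are
    monotone: every hop before the fault answers, none at or after it does.
    Returns (first silent hop, or None if all answered; probes sent).
    """
    lo, hi = 0, len(path)  # the first silent index lies in [lo, hi]
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if reaches(mid):
            lo = mid + 1  # fault is beyond mid
        else:
            hi = mid  # fault is at mid or earlier
    return (path[lo] if lo < len(path) else None), probes
```

For an 8-hop path broken at the sixth hop, `locate_first_failure([f"R{i}" for i in range(1, 9)], lambda i: i < 5)` returns `("R6", 3)`: three probes rather than up to eight sent hop by hop.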

  • 1-packet: Estimate link bandwidth from round-trip delays of different-sized packets, assuming delay grows linearly with packet size
  • Packet pair: Measure the increase in the gap between two back-to-back packets to estimate bottleneck bandwidth
  • Packet train: Same idea as packet pair, using a longer sequence of packets
  • Tailgating: Trains of large packets with limited TTL interleaved with small packets of higher TTL... ???
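The first two techniques reduce to simple formulas. The sketch below assumes a single link, sizes in bits, times in seconds, and replies small enough that only the outbound packet's transmission time grows with size; taking the minimum RTT per size is a common way to filter out queuing delay. Under the 1-packet model, RTT ≈ a + s/b, so the bandwidth b is the reciprocal of the fitted slope; for a packet pair, b ≈ s / gap, where gap is the spacing between the two packets at the receiver. Function names and sample data are hypothetical.

```python
from typing import Dict, List


def one_packet_bandwidth(samples: Dict[int, List[float]]) -> float:
    """Estimate link bandwidth (bits/s) from RTT samples keyed by packet size in bits.

    Needs at least two distinct sizes so the slope is defined.
    """
    # Keep the minimum RTT per size: the sample least inflated by queuing.
    xs = sorted(samples)
    ys = [min(samples[s]) for s in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares slope of RTT versus size (seconds per bit).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return 1.0 / slope


def packet_pair_bandwidth(size_bits: float, gap_s: float) -> float:
    """Bottleneck bandwidth (bits/s) from the receiver-side spacing of a back-to-back pair."""
    return size_bits / gap_s


# Hypothetical measurements for a 10 Mbit/s link with 2 ms of fixed delay.
SAMPLES = {
    800: [0.00208, 0.0031],
    4000: [0.0024],
    8000: [0.0035, 0.0028],
    12000: [0.0032, 0.004],
}
```

Both `one_packet_bandwidth(SAMPLES)` and `packet_pair_bandwidth(12000, 0.0012)` come out at roughly 1e7 bits/s for this synthetic link.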

Event correlation

Could instrument everything to emit alarms when conditions change

  • But alarms may be lost, may fail to trigger, etc.

It can be difficult to determine whether particular links are down

What if the probe station fails?

  • Must monitor the monitoring

Of note:

  • skitter, for topology probing
