Fault Localization and Self-Healing with Dynamic Domain Configuration
Kant, McAuley, Morera, Sethi, Steinder
fault detection network management self-healing manet
@inproceedings{kant:milcom-2003,
title={Fault Localization and Self-Healing with Dynamic Domain Configuration},
author={Kant, L. and McAuley, A. and Morera, R. and
Sethi, A. S. and Steinder, M.},
booktitle={Military Communications Conference ({MILCOM})},
year={2003}
}
Survivability and automation are key targets for the Future Combat Systems (FCS) program
Fault management goals
- Rapid localization (complexity better than exponential)
- Accurate localization (high detection rate, low false-positive rate)
- Automated recovery of (mission critical) applications
- Notably, survivability requirements may vary: non-mission-critical applications may be allowed to drop or be repaired more slowly
- Fault management must therefore obtain these per-application requirements as input
- Low cost, low complexity for battlefield use
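The accuracy goal can be made measurable with two set-based metrics. The notes do not define them, so the definitions here are an assumption: detection rate is the fraction of actual faults that the diagnosis identifies, and false-positive rate is the fraction of diagnosed faults that are not actual faults. The function name and fault labels are hypothetical.

```python
def detection_metrics(actual: set[str], diagnosed: set[str]) -> tuple[float, float]:
    """Return (detection_rate, false_positive_rate) for one diagnosis."""
    true_pos = actual & diagnosed
    # Fraction of actual faults that the diagnosis found.
    detection_rate = len(true_pos) / len(actual) if actual else 1.0
    # Fraction of diagnosed faults that are not actual faults.
    false_pos_rate = len(diagnosed - actual) / len(diagnosed) if diagnosed else 0.0
    return detection_rate, false_pos_rate


print(detection_metrics({"link_a", "node_3"}, {"node_3", "node_7"}))  # (0.5, 0.5)
```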
Dynamic domain configuration is also important
- Automatically reconfigure the network to suit changing network dynamics and policies
- Partition the network to better localize errors or isolate malfunctioning elements
Fault localization must work with multiple simultaneous faults and with limited, noisy symptom observations
- Must handle both hard failures (e.g., hardware faults) and soft failures (e.g., QoS problems)
One approach: capture dependencies between system components in a multi-layer Bayesian model
- Iterative belief updating on singly connected belief networks
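Here is a minimal sketch of the kind of model involved. It assumes a two-layer fault/symptom network with noisy-OR dependencies; the priors, link probabilities, and names are hypothetical. For clarity it computes fault posteriors by brute-force enumeration, which is exponential in the number of faults. Iterative belief updating (Pearl's belief propagation) reaches the same exact marginals on singly connected networks, such as this one, without enumerating every fault combination.

```python
from itertools import product

# Hypothetical model: fault priors and noisy-OR link probabilities,
# i.e. P(fault alone causes symptom | fault present).
PRIORS = {"f1": 0.01, "f2": 0.02}
LINKS = {  # symptom -> {parent fault: causal probability}
    "s1": {"f1": 0.9, "f2": 0.8},
    "s2": {"f1": 0.7},
}


def p_symptom(symptom, faults_on):
    """Noisy-OR: the symptom is absent only if every present parent fails to cause it."""
    p_absent = 1.0
    for fault, p in LINKS[symptom].items():
        if fault in faults_on:
            p_absent *= 1.0 - p
    return 1.0 - p_absent


def fault_posteriors(observed):
    """Exact P(fault | observations) by enumerating all fault combinations.

    `observed` maps a symptom name to True (seen) or False (known absent).
    """
    names = list(PRIORS)
    marg = {f: 0.0 for f in names}
    total = 0.0
    for states in product([False, True], repeat=len(names)):
        on = {f for f, s in zip(names, states) if s}
        # Joint weight: prior of this fault combination times symptom likelihoods.
        w = 1.0
        for f, s in zip(names, states):
            w *= PRIORS[f] if s else 1.0 - PRIORS[f]
        for sym, seen in observed.items():
            p = p_symptom(sym, on)
            w *= p if seen else 1.0 - p
        total += w
        for f in on:
            marg[f] += w
    return {f: marg[f] / total for f in names}


post = fault_posteriors({"s1": True, "s2": False})
print({f: round(p, 3) for f, p in post.items()})  # {'f1': 0.146, 'f2': 0.857}
```

With s1 observed and s2 known absent, f2 becomes the more likely explanation, because f1 would usually also have caused s2.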
Another approach: incremental hypothesis updating, where each hypothesis is a conjunction of faults that together explain the observed symptoms
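Here is a minimal sketch of incremental hypothesis updating. It assumes a deterministic map from each symptom to the faults that can cause it; the map and names are hypothetical. Each hypothesis is a frozenset of faults. When a symptom arrives, any hypothesis that already explains it is kept, and every other hypothesis is extended with each candidate cause. Non-minimal hypotheses are then pruned. The notes do not say how competing hypotheses are ranked, so this sketch only keeps the minimal ones.

```python
# Hypothetical fault-symptom map: the faults that can explain each symptom.
CAUSES = {
    "s1": {"link_ab", "node_b"},
    "s2": {"node_b", "link_bc"},
    "s3": {"link_cd"},
}


def update(hypotheses, symptom):
    """Extend each hypothesis so it also explains `symptom`; keep minimal sets."""
    candidates = CAUSES[symptom]
    extended = set()
    for h in hypotheses:
        if h & candidates:
            extended.add(h)  # already explains the new symptom
        else:
            for fault in candidates:
                extended.add(h | {fault})  # frozenset | set -> frozenset
    # Drop any hypothesis that strictly contains another (it is not minimal).
    return {h for h in extended if not any(o < h for o in extended)}


hyps = {frozenset()}  # before any symptom, the empty hypothesis explains everything
for s in ["s1", "s2", "s3"]:
    hyps = update(hyps, s)

print(sorted(sorted(h) for h in hyps))
# [['link_ab', 'link_bc', 'link_cd'], ['link_cd', 'node_b']]
```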
Layer 1: automatic protection switching
- Deals with hard failures
- Restores quickly but is resource-expensive
Layer 2
Layer 3
- On-demand survivability-based re-routing
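Here is a minimal sketch of on-demand rerouting. It assumes the topology is an adjacency list and that failed nodes are already known from fault localization. The notes do not say which metric survivability-based re-routing optimizes, so this sketch substitutes a plain fewest-hop breadth-first search that skips failed nodes. The names and topology are hypothetical.

```python
from collections import deque


def reroute(adj, src, dst, failed):
    """Return a fewest-hop path from src to dst that avoids failed nodes, or None."""
    if src in failed or dst in failed:
        return None
    prev = {src: None}  # predecessor map; also marks visited nodes
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:  # walk predecessors back to src
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj.get(u, ()):
            if v not in prev and v not in failed:
                prev[v] = u
                queue.append(v)
    return None


topology = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "E"],
    "D": ["B", "F"], "E": ["C", "F"], "F": ["D", "E"],
}
print(reroute(topology, "A", "F", failed=set()))  # ['A', 'B', 'D', 'F']
print(reroute(topology, "A", "F", failed={"D"}))  # ['A', 'C', 'E', 'F']
```

Because the path is computed only when a failure is reported, no backup capacity is reserved in advance. Recovery is therefore slower than Layer 1 protection switching but cheaper in resources.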
Use dynamic domain configuration to:
- Fully or partially isolate misbehaving nodes, depending on the level of malfunction and on the capabilities each node provides or needs
- Address service problems, such as crashed, partitioned, or overloaded services
- Isolate poor-quality regions and apply different protocols, policies, etc. to them
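Here is a minimal sketch of a graded isolation decision. It assumes each node has a malfunction score in [0, 1] and a flag saying whether it provides a capability the domain still needs. The thresholds, names, and the meaning given to partial isolation are hypothetical.

```python
from enum import Enum


class Isolation(Enum):
    NONE = "none"
    PARTIAL = "partial"  # hypothetical: e.g., stop using the node as a relay
    FULL = "full"        # hypothetical: remove the node from the domain entirely


# Hypothetical thresholds on the malfunction score.
PARTIAL_AT = 0.3
FULL_AT = 0.7


def isolation_level(score: float, provides_needed_capability: bool) -> Isolation:
    """Pick an isolation level from malfunction severity and capability needs."""
    if score >= FULL_AT and not provides_needed_capability:
        return Isolation.FULL
    if score >= PARTIAL_AT:
        # A badly misbehaving node that is still needed is only partially isolated.
        return Isolation.PARTIAL
    return Isolation.NONE


print(isolation_level(0.9, provides_needed_capability=True))  # Isolation.PARTIAL
```

Keeping a needed node only partially isolated trades some exposure to its misbehavior for retaining the capability it provides.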