*Proceedings of the 14th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS'12)*, Springer, October 2012, pp. 16–30. Toronto, Canada.

# FESA

Fault-containing self-stabilizing algorithms for large networked systems without infrastructure

| | |
|---|---|
| Contact | Prof. Dr. rer. nat. Volker Turau |
| Start | 1 September 2008 |
| End | 31 December 2011 |
| Financing | German Research Foundation (DFG) |

## Project Description

The goal of this project is to develop a methodology for increasing the fault tolerance of large infrastructureless networks. A system is fault-tolerant if it maintains its functionality in the presence of unexpected events or failures in hardware or software. For large infrastructureless networks, this can only be achieved with decentralized approaches such as self-stabilization. A system is self-stabilizing if it returns to a legitimate state without any external intervention. Until now, the design of self-stabilizing algorithms has focused on minimizing the time the system needs to reach a legitimate state after the occurrence of a fault. During that time, considerable parts of the system may not work correctly.
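To make the notion of self-stabilization concrete, the following minimal sketch simulates Dijkstra's classic K-state token ring, a textbook example and not one of the algorithms developed in this project. The function names, the random central scheduler, and the step limit are assumptions made for illustration. A configuration counts as legitimate when exactly one process is privileged.

```python
import random

def privileged(x):
    """Return the indices of all processes that currently hold a privilege."""
    n = len(x)
    result = [0] if x[0] == x[n - 1] else []
    result += [i for i in range(1, n) if x[i] != x[i - 1]]
    return result

def move(x, i, K):
    """Execute the move of privileged process i (in place)."""
    if i == 0:
        x[0] = (x[0] + 1) % K      # the distinguished process increments
    else:
        x[i] = x[i - 1]            # all others copy their predecessor

def stabilize(x, K, rng, max_steps=10_000):
    """Let a central scheduler pick one privileged process per step until
    exactly one privilege remains. The step limit is only a safety net
    for this sketch."""
    x = list(x)
    for steps in range(max_steps):
        priv = privileged(x)
        if len(priv) == 1:
            return x, steps
        move(x, rng.choice(priv), K)
    raise RuntimeError("did not stabilize within max_steps")

# Arbitrary (possibly corrupted) initial state of a ring with 5 processes.
# K = 6 satisfies the requirement K >= n.
rng = random.Random(0)
final, steps = stabilize([3, 1, 4, 1, 5], K=6, rng=rng)
print(final, steps)
```

No process is reset or coordinated from outside. The ring reaches a legitimate configuration from any initial state, and every later move keeps it legitimate.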

This project aims to limit the impact of a transient fault not only in time but also in space. A methodology is developed for designing algorithms that contain faults within a small region, so that only nodes in the immediate neighborhood of the fault's source are involved in the recovery. Other parts of the network can remain intact and functional during the repair. The goal is a problem-independent approach that adds the property of fault-containment to existing self-stabilizing algorithms. The primary field of application is algorithms for wireless networks, for example sensor networks. The corresponding conditions, such as the asynchronous model, broadcasts as communication primitives, unreliable communication, and limited resources, will be taken into account.
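The spatial footprint of a fault can be measured, as the following sketch illustrates. It uses a well-known self-stabilizing rule set for maximal independent sets under a central scheduler, chosen here only as an example; it is not the project's transformation. The helper names, the lowest-id scheduling policy, and the path topology are assumptions. The sketch corrupts one node in a legitimate configuration and records which nodes move during recovery and how far they are from the fault.

```python
from collections import deque

def enabled(graph, state, v):
    """A node is enabled if it is IN with an IN neighbor,
    or OUT without any IN neighbor."""
    neighbor_in = any(state[u] for u in graph[v])
    return state[v] == neighbor_in

def recover(graph, state):
    """Central scheduler (lowest id first). Returns the final state
    and the set of nodes that moved at least once."""
    state = dict(state)
    moved = set()
    while True:
        candidates = [v for v in sorted(graph) if enabled(graph, state, v)]
        if not candidates:
            return state, moved
        v = candidates[0]
        state[v] = not state[v]
        moved.add(v)

def distances(graph, source):
    """Hop distances from source, computed by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in graph[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

# Path 0-1-...-6 with a legitimate configuration: even nodes are IN.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
legit = {i: i % 2 == 0 for i in range(7)}

# Transient fault: the state of node 3 is corrupted.
faulty = dict(legit)
faulty[3] = True

final, moved = recover(path, faulty)
dist = distances(path, 3)
radius = max(dist[v] for v in moved)
print(sorted(moved), radius)
```

With this schedule the fault at node 3 also causes its neighbor, node 2, to move, although correcting node 3 alone would have sufficed. A fault-containing algorithm bounds this footprint regardless of the schedule.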

FESA is funded by the German Research Foundation (DFG) for 3 years.

## Publications

**Abstract**: Large-scale distributed systems require replication of resources to amplify availability and to provide fault tolerance. The placement of replicated resources significantly impacts performance. This paper considers local k-placements: Each node of a network has to place k replicas of a resource among its direct neighbors. The load of a node in a given local k-placement is the number of replicas it stores. The local k-placement problem is to achieve a distribution of the loads that is as homogeneous as possible. We present a novel self-stabilizing, distributed, asynchronous, scalable algorithm for the k-placement problem such that the standard deviation of the distribution of the loads assumes a local minimum.
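The definitions of load and local minimum can be illustrated with a small sequential sketch. It is a plain hill-climbing procedure and not the self-stabilizing algorithm of the paper. The function names and the assumption that each node picks k distinct neighbors are hypothetical. A node moves a replica from neighbor a to neighbor b only if load(a) > load(b) + 1, which lowers the sum of squared loads and therefore the standard deviation.

```python
from statistics import pstdev

def loads(graph, placement):
    """Load of a node = number of replicas that neighbors stored on it."""
    load = {v: 0 for v in graph}
    for chosen in placement.values():
        for u in chosen:
            load[u] += 1
    return load

def improve(graph, placement):
    """Apply load-reducing replica moves until no node can improve."""
    placement = {v: set(c) for v, c in placement.items()}
    load = loads(graph, placement)
    changed = True
    while changed:
        changed = False
        for v in sorted(graph):
            unused = [u for u in graph[v] if u not in placement[v]]
            if not unused or not placement[v]:
                continue
            a = max(sorted(placement[v]), key=lambda u: load[u])
            b = min(unused, key=lambda u: load[u])
            if load[a] > load[b] + 1:
                # Moving one replica from a to b strictly reduces
                # the sum of squared loads.
                placement[v].remove(a)
                placement[v].add(b)
                load[a] -= 1
                load[b] += 1
                changed = True
    return placement

# Complete graph on 4 nodes, k = 1: every node initially picks its
# lowest-numbered neighbor, which overloads node 0.
graph = {v: [u for u in range(4) if u != v] for v in range(4)}
initial = {v: {min(graph[v])} for v in graph}

before = loads(graph, initial)
improved = improve(graph, initial)
after = loads(graph, improved)
print(pstdev(before.values()), pstdev(after.values()))
```

The procedure terminates because every move reduces the integer-valued sum of squared loads. The result is a local minimum, not necessarily a global one.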

*Distributed Computing*, 25(3):207–224, 2012.

**Abstract**: This paper presents a new transformation which adds fault-containment properties to silent self-stabilizing algorithms. The transformation features a constant slow-down factor, and the fault-gap, that is, the minimal time between two containable faults, is also constant. The transformation scales well to arbitrarily large systems and avoids global synchronization. The presented transformation is the first with a constant fault-gap and requires no knowledge of the system size.

*Proceedings of the 13th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS'11)*, Springer, October 2011, pp. 311–325. Grenoble, France.

**Abstract**: Bounding the impact of transient small-scale faults by self-stabilizing protocols has been pursued with independent objectives: optimizing the system's reaction upon topological changes (e.g. super-stabilization), and reducing system recovery time from memory corruptions (e.g. fault-containment). Even though transformations adding either super-stabilization or fault-containment to existing protocols exist, none of them preserves the other. This paper makes a first attempt to combine both objectives. We provide a transformation adding fault-containment to silent self-stabilizing protocols while simultaneously preserving the property of self-stabilization and the protocol's behavior in the face of topological changes. In particular, the protocol's response to a topology change remains unchanged even if a memory corruption occurs in parallel to the topology change. The presented transformation increases the memory footprint only by a factor of 4 and adds O(1) bits per edge. All previously known transformations for fault-containing self-stabilization increase the memory footprint by a factor of 2m/n.

*Theoretical Computer Science*, 412(33):4361–4371, 2011.

**Abstract**: The non-computability of many distributed tasks in anonymous networks is well known. This paper presents a deterministic self-stabilizing algorithm to compute a (3 − 2/(Δ+1))-approximation of a minimum vertex cover in anonymous networks. The algorithm operates under the distributed unfair scheduler, stabilizes after O(n+m) moves and O(Δ) rounds, and requires O(log n) storage per node. Recovery from a single fault is reached within constant time, and the contamination number is O(Δ). For trees the algorithm computes a 2-approximation of a minimum vertex cover.

*Proceedings of the 12th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS'10)*, Springer, September 2010, pp. 65–79. New York, NY, USA.

**Abstract**: Proving stabilization of a complex algorithm under the distributed scheduler is a non-trivial task. This paper introduces a new method which allows proofs for the central scheduler to be extended to the distributed scheduler. The practicability of the method is shown by applying it to two existing algorithms, for which stabilization under the distributed scheduler was an open problem.

*Proceedings of the 30th IEEE International Conference on Distributed Computing Systems (ICDCS'10)*, IEEE Computer Society, June 2010, pp. 418–427. Genoa, Italy.

**Abstract**: This paper presents a new transformation which adds fault-containment properties to any silent self-stabilizing protocol. The transformation features a constant slow-down factor, and the fault-gap, that is, the minimal time between two containable faults, is constant. The transformation scales well to arbitrarily large systems and avoids global synchronization.