Next:Conclusions Up:Contents Previous:Performance Evaluation

7 Related Work

Most previous work on replication techniques assumed benign faults, e.g., [17,23,18,19] or a synchronous system model, e.g., [28]. Earlier Byzantine-fault-tolerant systems [26,16,20], including the algorithm we described in [6], could guarantee safety only if fewer than 1/3 of the replicas were faulty during the lifetime of the system. This guarantee is too weak for long-lived systems. Our system improves this guarantee by recovering replicas proactively and frequently; it can tolerate any number of faults if fewer than 1/3 of the replicas become faulty within a window of vulnerability, which can be made small under normal load conditions with low impact on performance.

In a previous paper [6], we described a system that tolerated Byzantine faults in asynchronous systems and performed well. This paper extends that work by providing recovery, a state transfer mechanism, and a new view change mechanism that enables both recovery and an important optimization -- the use of MACs instead of public-key cryptography.

Rampart [26] and SecureRing [16] provide group membership protocols that can be used to implement recovery, but only in the presence of benign faults. These approaches cannot be guaranteed to work in the presence of Byzantine faults for two reasons. First, the system may be unable to provide safety if a replica that is not faulty is removed from the group to be recovered. Second, the algorithms rely on messages signed by replicas even after they are removed from the group and there is no way to prevent attackers from impersonating removed replicas that they controlled.

The problem of efficient state transfer has not been addressed by previous work on Byzantine-fault-tolerant replication. We present an efficient state transfer mechanism that enables frequent proactive recoveries with low performance degradation.

Public-key cryptography was the major performance bottleneck in previous systems [26,16] despite the fact that these systems include sophisticated techniques to reduce the cost of public-key cryptography at the expense of security or latency. They cannot use MACs instead of signatures because they rely on the extra power of digital signatures to work correctly: signatures allow the receiver of a message to prove to others that the message is authentic, whereas this may be impossible with MACs. The view change mechanism described in this paper does not require signatures. It allows public-key cryptography to be eliminated, except for obtaining new secret keys. This approach improves performance by up to two orders of magnitude without loosing security.

The concept of a system that can tolerate more than f faults provided no more than f nodes in the system become faulty in some time window was introduced in [24]. This concept has previously been applied in synchronous systems to secret-sharing schemes [13], threshold cryptography [14], and more recently secure information storage and retrieval [10] (which provides single-writer single-reader replicated variables). But our algorithm is more general; it allows a group of nodes in an asynchronous system to implement an arbitrary state machine.

Miguel Castro and Barbara Liskov, "Proactive Recovery in a Byzantine-Fault-Tolerant System", in Proceedings of the Fourth Symposium on Operating Systems Design and Implementation, San Diego, USA, October 2000.