My research interests include distributed computing, fault-tolerance and real-time. I work on methodologies, paradigms, and algorithms for fault-tolerant distributed systems, in both message-passing and shared-memory systems. My long-term goal is to bridge the gap between theoretical results and the need for efficient and practical solutions. In collaboration with Tushar Chandra and Prasad Jayanti, two Ph.D. Computer Science students, we continued our work on unreliable failure detectors for message-passing systems, and on wait-free objects for shared-memory systems.
A fundamental result of fault-tolerant distributed computing states that the Consensus problem cannot be solved (with a deterministic algorithm) in asynchronous systems. This impossibility result is due to the inherent difficulty of determining whether a process has crashed (or is merely very slow) in such a system. In our work, we were able to determine exactly how much information about failures is necessary and sufficient to solve Consensus. We first showed one can use W, an unreliable failure detector that can make an infinite number of mistakes, to solve Consensus in systems with a majority of correct processes. We then proved that to solve Consensus, any failure detector has to provide at least as much information about failures as W. Thus, W is the weakest failure detector for solving Consensus in asynchronous systems with a majority of correct processes. We are now exploring the practicality of implementing W, and of applications that rely on W for their correctness.
A concurrent system consists of processes communicating via shared objects. A shared object is wait-free if each process that accesses this object is guaranteed to get a response even if all the other processes crash. We are now exploring wait-free hierarchies of object types, where each object (type) is assigned to a level that corresponds to its ability in implementing other wait-free objects. In particular, Prasad Jayanti has shown that a well-known hierarchy (Herlihy's) is not robust: Informally, in this hierarchy there is an object at level 2 that can be used to implement wait-free objects at any level. We are now exploring the question of whether robust wait-free hierarchies exist.