Fault Tolerance


Types of issues that can cause an RPC to fail
  • Unreliable networks
    • Network errors can be transient or persistent. Transient errors disappear upon retrying the same network request and are usually caused by a spike in network traffic. Persistent errors require external intervention to be resolved. For instance, incorrect DNS configurations can cause a microservice or even an entire cluster to be unreachable.
  • Application-level bugs
    • Bugs are an inevitable part of software development; even the best test suites and CI pipelines cannot keep all of them from disrupting RPCs. In a distributed deployment, a bug may affect only a subset of a microservice's replicas, in which case retrying the request against a different replica can solve the problem. If the bug affects all replicas, the RPC keeps failing until someone intervenes. Systems should be designed to continue normal operation despite bugs.
  • Database-level errors
    • Issues can arise when a microservice depends on a database. For example, if the connection fails, or if the database server is down or under heavy load, the microservice may return an empty response or raise an exception. The client must detect these errors and handle them explicitly, for example by retrying the request or taking a fallback path.
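The distinction between transient and persistent errors can be made concrete with a small retry wrapper. The following is a minimal sketch, not a definitive implementation: `TransientError`, `PersistentError`, and `call_with_retries` are hypothetical names introduced here for illustration, and the sketch assumes the RPC layer can tell the two kinds of error apart.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type: the failure may disappear on retry."""


class PersistentError(Exception):
    """Hypothetical error type: retrying will not help."""


def call_with_retries(rpc, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call rpc(), retrying only on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            # PersistentError is deliberately not caught: it propagates at once.
            return rpc()
        except TransientError:
            if attempt == max_attempts:
                raise  # still failing after all attempts: surface the error
            # Back off exponentially, with jitter so that many clients
            # do not retry in lockstep and prolong the traffic spike.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

A request that fails twice because of a traffic spike and then succeeds returns normally on the third attempt. A `PersistentError`, such as one raised for a bad DNS configuration, reaches the caller after a single attempt.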
Strategies to handle RPC failures
  • Rerouting traffic from a faulty microservice to a healthy one
    • When the load balancer detects an unresponsive replica, it can redirect that replica's traffic to healthy replicas.
  • Adding a fallback path to the RPC
    • A fallback uses an alternative path to recover from persistent communication errors. If none of the available microservice replicas can process an RPC, the result is fetched from another source, which should be independent of the primary path. Degraded performance is preferable to unavailability.
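The two strategies can be combined: try each replica in turn, and take the fallback path only when every replica has failed. The following is a minimal sketch under simplifying assumptions. `call_with_failover` is a hypothetical helper, and replicas and the fallback are modeled as plain callables rather than real network endpoints.

```python
def call_with_failover(replicas, request, fallback):
    """Try each replica in order; use the fallback if every replica fails."""
    for replica in replicas:
        try:
            return replica(request)
        except Exception:
            # Broad catch on purpose in this sketch: any failure simply
            # moves the request on to the next replica.
            continue
    # No replica could serve the request: return a degraded result
    # from an independent source instead of failing outright.
    return fallback(request)
```

In practice the fallback might serve a cached or default value, which is less fresh than the primary result but keeps the caller available.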


Failure models

Failure models give us a framework for reasoning about the impact of failures and the possible ways to deal with them.
Failure types, ordered from easiest to hardest to handle:
  • Fail-stop
    • Node halts permanently, but other nodes can still detect the halt. Example: power outage on a single node.
  • Crash
    • Node halts silently, and other nodes cannot detect the halt. Example: node failure due to hardware malfunction.
  • Omission
    • Node fails to send or receive messages. Example: network congestion causing packet drops.
  • Temporal (timing)
    • Node generates correct results, but too late to be useful. Example: software bug causing delays in processing.
  • Byzantine
    • Node exhibits arbitrary behavior, possibly due to an attack or software bug. Example: malicious node intentionally transmitting false data.
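One reason some of these failures are harder to handle than others is that an observer often has only timeouts to go on. The sketch below is a hypothetical heartbeat monitor (`HeartbeatMonitor` and its methods are names invented for this illustration, and timestamps are passed in explicitly to keep it deterministic). It can only report that a node is suspected. It cannot tell whether the node halted silently, is failing to send messages, or is merely responding too late.

```python
class HeartbeatMonitor:
    """Suspects nodes whose last heartbeat is older than a timeout."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence tolerated
        self.last_seen = {}      # node name -> time of last heartbeat

    def record(self, node, now):
        """Note that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now):
        """Return the nodes that have been silent for longer than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A suspected node may still be alive, so the timeout is a trade-off: a short timeout detects halted nodes quickly but wrongly suspects slow ones more often.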