Note--Fault Tolerance

Types of issues that can cause an RPC to fail

Unreliable networks

Network errors can be transient or persistent. Transient errors disappear upon retrying the same network request and are usually caused by a spike in network traffic. Persistent errors require external intervention to be resolved. For instance, incorrect DNS configurations can cause a microservice or even an entire cluster to be unreachable.

Application-level bugs

Bugs are an inevitable part of software development; even the best testing suites and CI pipelines can't prevent them from disrupting RPCs. In distributed deployments, a bug may affect only a subset of a microservice's replicas, in which case retrying the request against a different replica can solve the problem. If the bug affects all replicas, the RPC will keep failing until someone intervenes. Systems should therefore be designed to continue operating despite bugs.

Database-level errors

Issues can arise when a microservice depends on a database. For example, if the connection fails, or if the database server is down or under heavy load, the microservice may return an empty response or raise an exception. The client must detect these errors and respond accordingly, for example by retrying or by falling back to another source.

Strategies to handle RPC failures

Retrying a failed RPC

Retries are typically spaced out using an exponential backoff algorithm, which increases the wait between successive attempts.
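
A minimal sketch of retrying with exponential backoff, in Python. The names `retry_with_backoff` and `TransientError`, the default delays, and the use of random jitter are all assumptions made for illustration; they are not taken from the referenced article.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def retry_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Call `call()` until it succeeds or `max_attempts` is exhausted."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles each attempt: base, 2*base, 4*base, ...
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Assumption: random jitter in [0, cap] so that many clients
            # do not all retry at the same moment.
            sleep(random.uniform(0, cap))
```

Only errors marked as transient are retried here; a persistent error would be raised immediately, since retrying it cannot help.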

Rerouting traffic from a faulty microservice to a healthy one

When the load balancer detects an unresponsive replica, it can redirect traffic going to that replica to healthy ones.
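
The rerouting idea can be sketched as a round-robin balancer that skips replicas marked unhealthy. This is a simplified illustration: the `LoadBalancer` class and its methods are hypothetical, and how a replica is detected as unresponsive (health checks, timeouts) is left out.

```python
class LoadBalancer:
    """Round-robin balancer that skips replicas marked unhealthy."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.unhealthy = set()
        self._next = 0  # index of the next replica to try

    def mark_unhealthy(self, replica):
        self.unhealthy.add(replica)

    def mark_healthy(self, replica):
        self.unhealthy.discard(replica)

    def pick(self):
        """Return the next healthy replica, or raise if none is left."""
        # Look at each replica at most once per call.
        for _ in range(len(self.replicas)):
            replica = self.replicas[self._next]
            self._next = (self._next + 1) % len(self.replicas)
            if replica not in self.unhealthy:
                return replica
        raise RuntimeError("no healthy replicas available")
```

For example, with replicas `a`, `b`, `c` and `b` marked unhealthy, successive calls to `pick()` alternate between `a` and `c`.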

Adding a fallback path to the RPC

Fallbacks use an alternative communication link to recover from persistent errors on the primary link. If none of the available microservice replicas can process an RPC, the result is fetched from another source. This source should be independent of the primary path, so that a single failure does not take down both. Degraded performance is preferable to unavailability.
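
A minimal sketch of a fallback path, assuming each replica and the fallback source are plain callables that take a request. The function name `call_with_fallback` and the returned `"primary"`/`"fallback"` label are hypothetical, added only to make the degraded path visible to the caller.

```python
def call_with_fallback(replicas, fallback, request):
    """Try every replica on the primary path, then the fallback source."""
    for replica in replicas:
        try:
            return replica(request), "primary"
        except Exception:
            # Broad catch for brevity; real code would catch only the
            # errors that indicate a failed RPC.
            continue
    # No replica could serve the request: degrade instead of failing.
    return fallback(request), "fallback"
```

The fallback might return older or less complete data than the primary path, which matches the preference for degraded performance over unavailability.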

Ref

Building Fault Tolerance with RPC Fallbacks in DoorDash's Microservices

Failures are inevitable, so building fault tolerance through retries, replication, and fallbacks is critical to ensuring a positive user experience.

https://doordash.engineering/2022/06/07/improving-fault-tolerance-with-rpc-fallbacks-in-doordashs-microservices/

Failure models

Failure models give us a framework for reasoning about the impact of failures and the possible ways to deal with them.

Failure Type	Description	Detectability	Difficulty Level	Example
Fail-stop	Node halts permanently but can still be detected by other nodes	Detectable	Low	Power outage on a single node
Crash	Node halts silently and cannot be detected by other nodes	Undetectable	Moderate	Node failure due to hardware malfunction
Omission	Node fails to send or receive messages	-	Moderate	Network congestion causing packet drops
Temporal	Node generates correct results, but too late to be useful	-	Moderate	Software bug causing delays in processing
Byzantine	Node exhibits arbitrary behavior, possibly due to an attack or software bug	-	High	Malicious node intentionally transmitting false data