How to design a robust service

After resigning recently, while preparing my resume and for interviews, I reviewed the service incidents that occurred at my previous job. In several of them, one or more services were overwhelmed by peak traffic; the trigger might be a database problem, memory pressure, or something else, but at a deeper level the root cause was that the services themselves were not robust enough. Once one or a few nodes of a service were overwhelmed and triggered a retry storm, a cascading failure followed, and if those services sat on the main path, it could escalate into a system-wide outage.

[Figure: load vs. output of a typical fragile service]

Many of our services behave like the graph above under pressure. They serve normally until they reach their maximum load, but once traffic exceeds that capacity, the service's output plummets. Even if the excess traffic then subsides, or is degraded all the way to zero, it takes the service a long time to recover.

In some cases, even after the service has recovered and upstream traffic is switched back, it is instantly overwhelmed again by a flood of manual or automatic retries.

If our backend services are so "fragile" that they collapse the moment they are overloaded and take a long time to recover, stability work becomes impossible. In a large application the call chain involves thousands of services, each with many nodes; there is no way to guarantee and monitor the stability of every node and keep all of them from ever being swamped by traffic, because unexpected local failures will always occur.

Therefore, we need the application as a whole to be robust, which means each individual service must have a certain level of robustness. For a single service (whether standalone or distributed), this is not a demanding requirement and is not hard to achieve.

[Figure: load vs. output of a robust service]

A "robust" service should behave like the graph above under pressure. As load increases, output should increase roughly linearly. When external load exceeds the service's maximum capacity, then after a brief jitter the service should keep delivering stable output no matter how high the load climbs, ideally close to its theoretical maximum. In other words, even when overloaded, a "robust" service keeps producing stable output and promptly rejects the requests it cannot handle. This requires two things:

  • Each node of a service needs a metric that accurately reflects its own load. When that metric exceeds the node's capacity limit, the requests beyond that capacity should be rejected.
  • There should be a feedback mechanism between the server and the client (here "client" means not only the app but also the service's own client, which is usually embedded in another service's SDK or even in the app's SDK). The client should be aware of the backend's load and adjust its own strategy accordingly.

In simple terms, for each node of each service: 1. You need to know whether you are busy or not, whether you are close to your limit. 2. If you are busy, you need to let your client know so that it doesn't accidentally overwhelm you...
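
As a minimal sketch of point 1 in Go (the `Gate` type, `maxInFlight` limit, and `ErrBusy` value are all illustrative, not from any particular framework), a node can count its in-flight requests and refuse anything beyond a limit found through load testing:

```go
package robust

import (
	"errors"
	"sync/atomic"
)

// ErrBusy is returned to the caller instead of letting excess
// requests pile up inside the node.
var ErrBusy = errors.New("node over capacity")

// Gate rejects work once the number of in-flight requests exceeds
// a limit found through load testing.
type Gate struct {
	inFlight    int64
	maxInFlight int64
}

func NewGate(max int64) *Gate { return &Gate{maxInFlight: max} }

// Do runs handler only if the node still has headroom; otherwise it
// answers "busy" immediately rather than silently queueing the request.
func (g *Gate) Do(handler func() error) error {
	if atomic.AddInt64(&g.inFlight, 1) > g.maxInFlight {
		atomic.AddInt64(&g.inFlight, -1)
		return ErrBusy
	}
	defer atomic.AddInt64(&g.inFlight, -1)
	return handler()
}
```

Wrapping every request handler in `Do` keeps the node answering quickly even when overloaded, instead of queueing work it will never finish in time.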

Next, let's discuss these two points in detail: how to find metrics that represent the service load and how to negotiate with the client during overload.

Metrics Representing Service Load

A metric that represents service load is an indicator such that, once it exceeds a certain threshold, the node's service capacity deteriorates sharply and takes a long time to recover.

We can use load testing to find reasonable QPS, TPS, and similar values on a standard server environment, and then enforce those peaks with tools such as rate limiting and circuit breaking. Depending on the actual situation, the peak limit can be set within a range (see the sketch after this list):

  1. If the service can be scaled horizontally, the limit does not need to be very precise, and there is no need to chase the absolute limit of CPU or other resources. Based on the load-test results, pick a somewhat conservative value; resource utilization can be squeezed further through co-location or overselling.
  2. Take the request timeout into account; otherwise a request may be processed in vain, because the client has already timed out and will discard the response.
  3. Consider whether the machine also runs other processes, such as monitoring agents, primary-backup synchronization, and heartbeats, and leave some headroom in the threshold accordingly. More sophisticated services can even grade requests by priority and keep a separate counter for each priority.
  4. Some people may wonder whether the threshold can be made adaptive, that is, adjusted dynamically based on the node's internal resource usage. Honestly, in most cases this is unnecessary: the simpler the mechanism, the less likely it is to go wrong. Only rarely do we need to squeeze the last bit of performance out of a single node, and in an era when machine counts are exploding, service stability is worth more than machines.
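
As a concrete illustration of items 1 and 3, here is a minimal sketch of a per-priority limiter built on golang.org/x/time/rate. The priority names, QPS values, and burst sizes are placeholders; in practice they would come from load testing, with headroom reserved for co-resident processes.

```go
package robust

import "golang.org/x/time/rate"

// Priority of an incoming request. The per-priority budgets below are
// illustrative; real values come from load testing, with headroom left
// for monitoring agents, replication, heartbeats, and the like.
type Priority int

const (
	High Priority = iota
	Low
)

// PriorityLimiter keeps one token bucket per priority, so that
// low-priority traffic is shed first under pressure.
type PriorityLimiter struct {
	limiters map[Priority]*rate.Limiter
}

func NewPriorityLimiter() *PriorityLimiter {
	return &PriorityLimiter{
		limiters: map[Priority]*rate.Limiter{
			High: rate.NewLimiter(rate.Limit(800), 100), // e.g. ~80% of the load-tested peak
			Low:  rate.NewLimiter(rate.Limit(200), 50),
		},
	}
}

// Allow reports whether a request of the given priority should be
// accepted right now; callers should answer RET_BUSY when it is false.
func (l *PriorityLimiter) Allow(p Priority) bool {
	return l.limiters[p].Allow()
}
```

A request whose `Allow` call returns false should be answered with a "busy" code rather than queued, which ties into the client-side handling discussed below.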

Collaboration Between the Client and the Server

To be honest, I am not very familiar with client-side development, but this problem is essentially about optimizing retry logic when the backend is busy.

A client's retry strategy deserves careful design, yet it is easily overlooked. The most common strategy is to retry at a fixed interval; more thoughtful designs increase the interval gradually, for example geometrically or along a Fibonacci sequence.
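
For illustration only, here is a small sketch of such an interval schedule in Go (the function name and parameters are hypothetical): it doubles the wait on each attempt and caps it at a maximum.

```go
package robust

import "time"

// BackoffInterval returns the wait before the n-th retry (n starting
// at 0): a geometric progression capped at maxWait. A Fibonacci
// schedule grows more gently; which to use is a tuning decision.
func BackoffInterval(n int, base, maxWait time.Duration) time.Duration {
	d := base
	for i := 0; i < n && d < maxWait; i++ {
		d *= 2
	}
	if d > maxWait {
		d = maxWait
	}
	return d
}
```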

If you think about it carefully, though, retries actually address two different problems. One is failover for network packet loss or a momentary disconnection; the other is failover for a failed server node. For the former, the shorter the retry interval the better, ideally an immediate retry. For the latter, too short an interval easily triggers a retry storm and makes the struggling backend nodes even worse off. The trouble is that, from the client's perspective, we cannot tell which of the two has happened.

Therefore, the simplest solution is that when a server node is busy, instead of silently discarding the requests it receives, it returns RET_BUSY for them. On receiving RET_BUSY, the client exits its current retry loop and directly extends the retry interval, avoiding a retry storm. The server can go further and grade its return values, for example RET_RETRY to prompt an immediate retry, and RET_BUSY and RET_VERY_BUSY to make the client extend the retry interval by different amounts. Many such fine-grained techniques can be layered on to handle more complex situations; try them out in your own engineering scenarios.
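
Here is a hedged sketch of that negotiation from the client's side, assuming the RPC layer surfaces return codes like the ones just described; the code names, sleep intervals, and the `CallWithRetry` helper are illustrative, not any real framework's API.

```go
package robust

import "time"

// Illustrative return codes a busy server might use; a real RPC
// framework would carry these in its response metadata.
type RetCode int

const (
	RetOK       RetCode = iota
	RetRetry            // transient failure: retry immediately
	RetBusy             // server overloaded: back off
	RetVeryBusy         // server badly overloaded: back off much longer
)

// CallWithRetry lets the server's return code drive the retry interval
// instead of retrying blindly at a fixed pace.
func CallWithRetry(call func() RetCode, maxAttempts int) RetCode {
	code := RetOK
	for attempt := 0; attempt < maxAttempts; attempt++ {
		code = call()
		switch code {
		case RetOK:
			return RetOK
		case RetRetry:
			continue // likely packet loss or a momentary blip
		case RetBusy:
			time.Sleep(1 * time.Second) // the server asked us to slow down
		case RetVeryBusy:
			time.Sleep(10 * time.Second) // the server is drowning: wait much longer
		}
	}
	return code
}
```

The exact intervals matter less than the principle: the server's return code, not a blind timer, decides how aggressively the client retries.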

So far, we have explained how to build a "robust" service. Essentially, it takes only very simple engineering: a server with a rough awareness of its own busyness, and a client that adapts its retries to the server's busyness. Such a service can deliver controlled output no matter how much external request pressure it faces.

In actual engineering practice, service nodes built this way should be able to keep providing continuous, stable service when average CPU utilization exceeds 70% and instantaneous CPU utilization exceeds 90%; in the testing environment, they should be able to deliver stable output under even more extreme conditions.

Can the above robustness be achieved without modifying the service itself?

Can this mechanism be implemented externally through monitoring combined with traffic degradation plans so that the service itself does not need to be modified?

First of all, monitoring combined with traffic-degradation plans is essential; it is the final backstop for everything. However, monitoring lags, and traffic degradation is lossy, so degradation has to be applied with caution. These properties mean that monitoring plus degradation exists to handle failures, not to make a service more robust. A service's robustness has to be solved by the service itself.

Which services need to be modified?

The mechanism above can be implemented inside the RPC framework, which is sufficient for most services without demanding requirements.

Distributed storage systems have higher requirements. The engineering work behind storage stability is indeed the most complex and fundamental: a storage node's problems cannot be fixed by simply restarting it, and besides CPU there are many more resource dimensions to consider, such as memory, page cache, network throughput and interrupts, and disk IOPS and throughput. Moreover, once traffic is degraded, an online storage system usually cannot recover quickly (it may need to wait for primary-replica synchronization, minor or major compaction, or flushing to disk). So the robustness of an online storage system has to take many more factors into account.
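
As a hedged sketch of what the "am I busy" check might look like on a storage node, combining several resource dimensions (every field, threshold, and value here is a hypothetical placeholder; real numbers would come from /proc, cgroups, or the storage engine's own statistics):

```go
package robust

// ResourceStats is a hypothetical snapshot of the dimensions a storage
// node might watch; real values would come from /proc, cgroups, or the
// storage engine's own counters.
type ResourceStats struct {
	CPUUtil       float64 // 0.0-1.0
	MemUtil       float64
	DiskIOPSUtil  float64 // fraction of the device's IOPS budget
	NetThroughput float64 // fraction of NIC capacity
}

// BusyLevel returns 0 (ok), 1 (busy) or 2 (very busy) based on the
// worst dimension: any single exhausted resource makes the node busy.
func BusyLevel(s ResourceStats, busy, veryBusy float64) int {
	worst := s.CPUUtil
	for _, v := range []float64{s.MemUtil, s.DiskIOPSUtil, s.NetThroughput} {
		if v > worst {
			worst = v
		}
	}
	switch {
	case worst >= veryBusy:
		return 2
	case worst >= busy:
		return 1
	default:
		return 0
	}
}
```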
