Stability Construction from the Perspective of Microservices

This article introduces the common problems of stability construction in a microservices architecture, from two angles: "preventing stability risks" and "reducing the impact of failures".

1. Preventing Stability Risks#

Microservices architecture makes each service's functionality more cohesive and iteration faster. However, it also increases the complexity of service dependencies, and with it the difficulty of stability construction. Although the dependency graph is complex, it can be abstracted into the relationships among upstream services, the service itself, and downstream services; the main idea of preventing stability risks is to address risks in these three areas.

Figure: upstream services, the service itself, and downstream services.

1.1 Preventing Upstream Risks#

Rate limiting, input validation.

The common upstream risks to guard against are "traffic increases" and "input errors". Expected traffic increases can be evaluated in advance, with response plans prepared ahead of time; for unexpected traffic increases, rely on pre-configured rate-limiting plans.

The purpose of rate limiting is self-protection or isolation of impact. After core traffic is rate limited, the impact can be evaluated, and then capacity can be expanded or the rate limiting threshold can be temporarily adjusted.
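
As a concrete illustration, below is a minimal sketch of interface-level rate limiting using Guava's RateLimiter. The 100-permits-per-second threshold, the OrderQueryService class, and the queryOrders method are hypothetical; in practice the threshold would come from the pre-set rate-limiting plan or a config center.

```java
import com.google.common.util.concurrent.RateLimiter;

import java.util.Collections;
import java.util.List;

public class OrderQueryService {

    // Hypothetical threshold: at most 100 permits per second for this interface;
    // in practice the value comes from the rate-limiting plan or a config center.
    private final RateLimiter limiter = RateLimiter.create(100.0);

    public List<String> queryOrders(String userId) {
        // Reject excess traffic immediately instead of letting it queue up
        // and exhaust threads, connections, or memory.
        if (!limiter.tryAcquire()) {
            // Degraded response; callers should treat this as "try again later".
            return Collections.emptyList();
        }
        return doQuery(userId);
    }

    private List<String> doQuery(String userId) {
        // ... actual query logic against storage ...
        return Collections.emptyList();
    }
}
```

Here excess requests are rejected up front rather than queued, which is exactly the self-protection behavior described above.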

"Input errors" are often caused by unrestricted range parameters. For example, if only one day of data is expected to be queried, but the request parameter is set to query one month, the database may crash due to the lack of restrictions in the interface.

1.2 Preventing Downstream Risks#

Remove strong dependencies, implement degradation, conduct weak dependency verification, and have flow-cutting plans.

In the industry, dependencies that do not affect the core business process or system availability when exceptions occur are called weak dependencies, while the rest are called strong dependencies. The most direct way to prevent downstream risks is to remove strong dependencies.

  1. When designing the system, it is necessary to comprehensively analyze the strong and weak dependency relationships of the system. After the system goes online, the dependency relationships can be further analyzed by collecting online traffic using tools.
  2. Historical business needs to be transformed, with trade-offs made among functionality, user experience, and stability. To ensure stability, minimize strong dependencies on downstream systems; for non-core functions, when a downstream dependency fails, cut the function off so that core functions remain available.

Weak dependencies require degradation plans, for which open-source traffic-governance components such as Sentinel can be used. To ensure the plans can be executed efficiently, it is recommended to combine fault tolerance in business code with automatic circuit breaking.
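
As a rough sketch of combining business-code fault tolerance with automatic circuit breaking, the example below follows Sentinel's documented SphU.entry / BlockException usage together with a DegradeRule. The resource name, thresholds, and fallback behavior are illustrative assumptions rather than a prescribed configuration.

```java
import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;

import java.util.Collections;

public class RecommendationClient {

    /** Registers a circuit-breaking rule for the downstream call; the values are illustrative. */
    public static void initDegradeRule() {
        DegradeRule rule = new DegradeRule();
        rule.setResource("queryRecommendations");
        rule.setGrade(RuleConstant.DEGRADE_GRADE_RT); // break on slow calls
        rule.setCount(200);                           // calls slower than 200 ms count as slow
        rule.setTimeWindow(10);                       // keep the breaker open for 10 seconds
        DegradeRuleManager.loadRules(Collections.singletonList(rule));
    }

    public String queryRecommendations(String userId) {
        try (Entry entry = SphU.entry("queryRecommendations")) {
            return callDownstream(userId);   // the weak downstream dependency
        } catch (BlockException e) {
            return fallback(userId);         // breaker is open: skip the downstream call
        } catch (Exception e) {
            return fallback(userId);         // business-code fault tolerance for other failures
        }
    }

    private String callDownstream(String userId) {
        // ... RPC call to the recommendation service ...
        return "personalized";
    }

    private String fallback(String userId) {
        // Degraded result, e.g. a default or cached recommendation list.
        return "default";
    }
}
```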

The choice of degradation method is closely tied to the business impact of degrading. Generally, functions whose degradation has a significant impact use manual degradation, while functions whose degradation has a small impact, or that can recover automatically soon afterwards, can use automatic degradation.
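
For the manual case, the plan is often just a switch that an operator flips, typically pushed from a config center. A minimal sketch with hypothetical names:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

public class FeedService {

    // Flipped by an operator when the degradation plan is executed,
    // typically via a config-center listener rather than a direct setter.
    private final AtomicBoolean rankingDegraded = new AtomicBoolean(false);

    public void setRankingDegraded(boolean degraded) {
        rankingDegraded.set(degraded);
    }

    public List<String> getFeed(String userId) {
        if (rankingDegraded.get()) {
            // High-impact function: only served in degraded mode when a human decides so.
            return defaultFeed();
        }
        return personalizedFeed(userId);
    }

    private List<String> personalizedFeed(String userId) {
        // ... expensive ranking path ...
        return Collections.singletonList("personalized");
    }

    private List<String> defaultFeed() {
        // ... cheap, precomputed path ...
        return Collections.singletonList("default");
    }
}
```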

Strong and weak dependency governance needs to be verified regularly. If the interfaces or services involved are relatively simple, unit tests are enough for verification; if the services are numerous and complex, regular fault drills are needed to expose issues.
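
For the simple case, a unit test can assert that the core flow survives a weak-dependency failure. The sketch below uses JUnit 5 and Mockito, with hypothetical CouponClient and OrderService types standing in for a real downstream client and business service:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.jupiter.api.Test;

class WeakDependencyTest {

    // Hypothetical downstream client and service under test.
    interface CouponClient { String queryCoupon(String userId); }

    static class OrderService {
        private final CouponClient couponClient;
        OrderService(CouponClient couponClient) { this.couponClient = couponClient; }

        /** Coupon lookup is a weak dependency: its failure must not break order creation. */
        String createOrder(String userId) {
            String coupon;
            try {
                coupon = couponClient.queryCoupon(userId);
            } catch (Exception e) {
                coupon = "NONE"; // degrade: proceed without a coupon
            }
            return "order-created:" + coupon;
        }
    }

    @Test
    void orderCreationSurvivesCouponFailure() {
        CouponClient failing = mock(CouponClient.class);
        when(failing.queryCoupon("u1")).thenThrow(new RuntimeException("coupon service down"));

        OrderService service = new OrderService(failing);

        // The weak dependency failing should still leave the core flow available.
        assertEquals("order-created:NONE", service.createOrder("u1"));
    }
}
```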

For strong dependencies that cannot be removed, consider the following ways to reduce risk, improve stability, and prevent major incidents:

  1. For MySQL, adding enough shards reduces the blast radius of a single shard failure (see the routing sketch after this list).
  2. Prepare sound emergency response plans as a fallback, while still providing a reasonable user experience.
  3. In the event of a single data center failure, prioritize flow cutting to another data center.
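
As referenced in item 1, here is a minimal sketch of routing by user ID across a fixed number of shards. The shard count, naming, and hash-modulo strategy are illustrative assumptions; the point is that with N shards, a single shard failure affects roughly 1/N of users.

```java
public final class ShardRouter {

    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    /** Maps a user ID to a shard; one shard failing only affects about 1/shardCount of users. */
    public int shardFor(String userId) {
        // Math.floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(userId.hashCode(), shardCount);
    }

    public String dataSourceName(String userId) {
        return "order_db_" + shardFor(userId);
    }
}
```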

1.3 Preventing Self Risks#

Architectural risks, capacity risks, flow-cutting plans, online change specifications, and development and testing quality assurance.

The basic approach is to avoid single points of failure through redundant deployment and active-active flow cutting, and to reduce capacity risks with elastic cloud and automatic scaling. Periodic sentinel load testing, end-to-end load testing, and module-level load testing are conducted for capacity evaluation.

Judging from the frequent causes of online incidents, code changes and configuration changes account for the majority. Therefore, improving development and testing quality and strictly following online change specifications are the keys to preventing self risks.

To improve development quality, from the perspective of stability, developers need to have the awareness of writing automated test cases. Although writing test cases may increase the time cost in the short term, they can greatly improve the testing efficiency and code quality in the later stages. For core business systems, continuous iteration is inevitable, so the long-term cost of writing test cases should be acceptable.

2. Reducing the Impact of Failures#

Mistakes are inevitable for humans, so failures are inevitable. In addition to preventing risks, we also need measures to reduce the impact of failures.

2.1 Self Interface Degradation#

Clarify how upstream core links depend on our services, and support degrading our own interface capabilities.

As part of the business chain, we need to clarify how our services are depended upon, strongly or weakly, within the upstream core links. If upstream services depend on us only weakly, we need to ensure that the interfaces they rely on support degradation. If upstream services depend on us strongly, we should consider pushing the upstream to remove that strong dependency; if it cannot be removed, we should consider alternative channels or other measures to reduce the impact on upstream services, such as user-facing fault guidance messages, announcements, and so on.
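
To make "degrading our own interface capabilities" concrete, below is a sketch of an interface that keeps its contract alive with a conservative answer when its own logic fails, so that upstream callers' core flows are not broken. PriceService, the cached price, and the degraded flag are hypothetical.

```java
import java.util.Optional;

public class PriceService {

    /** Result with a flag so upstream callers know they received a degraded answer. */
    public static final class PriceResult {
        public final long priceInCents;
        public final boolean degraded;

        public PriceResult(long priceInCents, boolean degraded) {
            this.priceInCents = priceInCents;
            this.degraded = degraded;
        }
    }

    /**
     * Our own interface capability with a built-in degraded mode: if the precise price
     * cannot be computed, answer with the last known good price instead of failing the caller.
     */
    public PriceResult queryPrice(String skuId) {
        try {
            return new PriceResult(computePrecisePrice(skuId), false);
        } catch (Exception e) {
            // Keep the contract alive with a conservative answer so upstream core flows survive.
            return cachedPrice(skuId)
                    .map(p -> new PriceResult(p, true))
                    .orElse(new PriceResult(-1L, true));
        }
    }

    private long computePrecisePrice(String skuId) {
        // ... real-time pricing, promotions, etc. ...
        return 9900L;
    }

    private Optional<Long> cachedPrice(String skuId) {
        // ... last known good price from a local cache ...
        return Optional.of(9900L);
    }
}
```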

In summary, we not only need to focus on the stability of our own services but also pay attention to the upstream dependency on our services and build response plans to reduce the impact of our own service failures on upstream services. Please note that the interface degradation here is different from the dependency degradation mentioned earlier. The interface degradation here refers to the degradation of our own service capabilities, aiming to reduce the impact on upstream services. The dependency degradation mentioned earlier is the degradation of downstream dependencies to reduce the impact of downstream failures on our own services. These are different plans when services are at different levels.

2.2 Fault Perception and Localization#

Monitoring and alerting, fault root cause localization, emergency response processes.

It is crucial to monitor and alert on core service metrics and business metrics as comprehensively as possible, paying attention not only to coverage but also to the timeliness and accuracy of alerts. Observable call chains, traceable logs, and visualized server performance are all effective tools for fault perception and root-cause localization.

When building metrics, it is recommended to standardize metric naming and definitions, which reduces the cost of understanding and improves the efficiency of problem localization.

To improve the timeliness and accuracy of alerts on core metrics while keeping maintenance costs down, it is recommended to focus monitoring on one primary angle: alert on business result metrics, and use process metrics to assist in problem localization. The reason is that business process metrics are numerous, change frequently, and may span multiple systems, which makes them dispersed, whereas business result metrics tend to converge.
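
As a sketch of what standardized metrics might look like, the example below uses Micrometer and distinguishes a business result metric (used for alerting) from a process metric (used to assist localization). The naming convention and metric names are assumptions, not a fixed standard.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;

public class OrderMetrics {

    private final MeterRegistry registry;

    public OrderMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    /** Business result metric: counts order creation outcomes; the primary alerting signal. */
    public void recordOrderCreated(boolean success) {
        Counter.builder("order.create.result")              // assumed convention: <domain>.<action>.<result>
                .tag("status", success ? "success" : "fail")
                .register(registry)
                .increment();
    }

    /** Process metric: latency of one intermediate step; used for localization, not primary alerting. */
    public void recordRiskCheckLatency(long millis) {
        Timer.builder("order.create.risk_check.latency")
                .register(registry)
                .record(Duration.ofMillis(millis));
    }

    public static void main(String[] args) {
        OrderMetrics metrics = new OrderMetrics(new SimpleMeterRegistry());
        metrics.recordOrderCreated(true);
        metrics.recordRiskCheckLatency(35);
    }
}
```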
