Chaos Engineering Principles
Mind Map Summary
- Topic: Chaos Engineering
- Definition: The discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
- Principles:
  - Hypothesize about Steady State: Define a measurable output of the system, such as throughput, error rate, or latency percentiles, that indicates normal behavior, and predict that it will hold during the experiment.
  - Vary Real-World Events: Inject failures that mirror real incidents, such as server crashes, network latency, and disk failures.
  - Run Experiments in Production: Experiment against production traffic where possible, because only production shows how the system really behaves under stress.
  - Automate Experiments to Run Continuously: Schedule experiments to run automatically so that regressions in resilience are caught as the system changes.
  - Minimize Blast Radius: Limit the scope of each experiment so that any impact on users is small and contained.
- Benefits:
  - Increased Resilience: Proactively identify and fix weaknesses before they cause outages.
  - Improved Understanding of the System: Learn how the system actually behaves when dependencies fail or slow down, rather than how it is assumed to behave.
  - Increased Confidence in the System: Build evidence-based confidence in the system’s ability to withstand failures.
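The principles above can be sketched as a small experiment loop: check the steady state, inject a fault, check again, and always roll back. This is a minimal illustration, not the API of any chaos tool. The names `SteadyState`, `run_experiment`, `measure`, `inject`, and `rollback` are hypothetical, and the choice of a p95 threshold as the steady-state metric is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


def p95(samples: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer form of ceil(0.95 * n)
    return ordered[rank - 1]


@dataclass(frozen=True)
class SteadyState:
    """A measurable definition of 'normal', e.g. p95 load time <= 2.0 s."""
    metric: str
    max_p95: float

    def holds(self, samples: Sequence[float]) -> bool:
        return p95(samples) <= self.max_p95


def run_experiment(
    steady_state: SteadyState,
    measure: Callable[[], Sequence[float]],  # hypothetical: returns metric samples
    inject: Callable[[], None],              # hypothetical: starts the fault
    rollback: Callable[[], None],            # hypothetical: stops the fault
) -> str:
    # Do not start from an unhealthy baseline: the result would be meaningless.
    if not steady_state.holds(measure()):
        return "aborted: system not in steady state"
    inject()
    try:
        held = steady_state.holds(measure())
    finally:
        rollback()  # always undo the fault, even if measuring raises
    return "hypothesis held" if held else "weakness found"
```

In a real setup, `measure` would query a monitoring system and `inject`/`rollback` would call the chaos tool. Running `run_experiment` on a schedule is one way to approach the "automate continuously" principle.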
Practice Exercise
Design a chaos experiment for a microservices-based application. For example, what would you test by injecting latency between the API gateway and a downstream service? What tool could you use (e.g., Azure Chaos Studio or Gremlin)? Note that Chaos Monkey, the best-known chaos tool, terminates instances and does not inject latency.
Answer
Chaos Experiment: Injecting Latency
- Hypothesis: If we inject 500 ms of latency between the API gateway and the product service, the user-facing dashboard will still load within its 2-second SLA.
- Experiment:
  - Select a tool: We will use Azure Chaos Studio to inject the latency.
  - Define the blast radius: We will run the experiment in a single region of our pre-production environment first, and widen the scope toward production only once the hypothesis holds there.
  - Inject the failure: We will use Azure Chaos Studio to add 500 ms of latency to traffic between the API gateway and the product service for 10 minutes.
  - Measure the impact: We will monitor the dashboard load time before, during, and after the injection.
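- Verification: We will verify that the dashboard load time stays within the 2-second SLA for the full duration of the experiment.
- Learnings: If the load time exceeds the SLA, we will investigate the cause and implement a fix, such as a request timeout with a fallback response or a circuit breaker around calls to the product service, and then rerun the experiment to confirm the fix.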
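To make the hypothesis and the timeout fix concrete, the experiment can be approximated with a small simulation. It does not call Azure Chaos Studio; it only models latencies as numbers. The 120 ms base latency, the four sequential calls per dashboard load, and the 300 ms timeout are made-up values for illustration, and `ProductService`, `dashboard_load_time`, and `hypothesis_holds` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

SLA_SECONDS = 2.0  # dashboard load-time limit from the hypothesis


@dataclass
class ProductService:
    base_latency: float = 0.12     # assumed healthy response time (s)
    injected_latency: float = 0.0  # extra delay added by the chaos tool (s)

    def response_time(self) -> float:
        return self.base_latency + self.injected_latency


def dashboard_load_time(
    service: ProductService, calls: int, timeout: Optional[float] = None
) -> float:
    """Simulated load time for `calls` sequential requests through the gateway."""
    total = 0.0
    for _ in range(calls):
        elapsed = service.response_time()
        if timeout is not None and elapsed > timeout:
            elapsed = timeout  # gateway gives up and serves a fallback response
        total += elapsed
    return total


def hypothesis_holds(load_time: float) -> bool:
    return load_time <= SLA_SECONDS


if __name__ == "__main__":
    service = ProductService()
    print("baseline ok:", hypothesis_holds(dashboard_load_time(service, calls=4)))

    service.injected_latency = 0.5  # the 500 ms fault from the experiment
    print("fault, no timeout ok:",
          hypothesis_holds(dashboard_load_time(service, calls=4)))
    print("fault, 300 ms timeout ok:",
          hypothesis_holds(dashboard_load_time(service, calls=4, timeout=0.3)))
```

With these assumed numbers, four sequential calls at 620 ms each add up to about 2.48 s, so the hypothesis fails even though each call is well under 2 seconds. Capping each call at 300 ms brings the total to about 1.2 s. This is the kind of amplification effect the experiment is meant to surface.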