Chaos Engineering Principles

Mind Map Summary

  • Topic: Chaos Engineering
  • Definition: The discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
  • Principles:
    • Hypothesize about Steady State: Start by defining a measurable output of your system that indicates normal behavior.
    • Vary Real-World Events: Introduce failures into the system, such as server crashes, network latency, and disk failures.
    • Run Experiments in Production: Run experiments in a production environment to get a realistic understanding of how the system behaves under stress.
    • Automate Experiments to Run Continuously: Automate experiments to run continuously to ensure that the system remains resilient over time.
  • Benefits:
    • Increased Resilience: Proactively identify and fix weaknesses in the system before they cause outages.
    • Improved Understanding of the System: Gain a better understanding of how the system behaves under stress.
    • Increased Confidence in the System: Build confidence in the system’s ability to withstand failures.

Practice Exercise

Design a chaos experiment for a microservices-based application. For example, what would you test by injecting latency between the API gateway and a downstream service? What tool could you use (e.g., Chaos Monkey, Azure Chaos Studio)?

Answer

Chaos Experiment: Injecting Latency

  • Hypothesis: If we inject 500ms of latency between the API gateway and the product service, the user-facing dashboard will still load within 2 seconds.
  • Experiment:
    1. Select a tool: We will use Azure Chaos Studio to inject the latency.
    2. Define the blast radius: We will run the experiment in a single region in our pre-production environment.
    3. Inject the failure: We will use Azure Chaos Studio to inject 500ms of latency between the API gateway and the product service for 10 minutes.
    4. Measure the impact: We will monitor the load time of the user-facing dashboard during the experiment.
  • Verification: We will verify that the dashboard load time remains within the 2-second SLA.
  • Learnings: If the dashboard load time exceeds the SLA, we will investigate the cause and implement a fix, such as adding a timeout or a circuit breaker.