Chaos Engineering with Litmus and Okteto Cloud
Cloud Native applications are, by definition, highly distributed, elastic, resilient to failure, and loosely coupled. That's easy to say, and even to diagram. But how do we validate that our applications will perform as expected under different failure conditions?
Enter Chaos Engineering. Chaos Engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions. It is a great tool to help us find weaknesses and misconfigurations in our services. It is particularly important for Cloud Native applications, which, due to their distributed and elastic nature, need to be resilient by default.
Litmus is a CNCF sandbox project for practicing Chaos Engineering in Cloud Native environments. Litmus provides a chaos operator, a large set of chaos experiments in its hub, detailed documentation, a quick demo, and a friendly community. In this post, we'll show you how you can use Litmus and Okteto together to start chaos testing your applications in a few seconds.
When chaos testing an application with LitmusChaos, there are four components that you'll need to keep in mind.
The Chaos Operator is the core part of LitmusChaos. The operator is in charge of executing the experiments and reporting the results once each experiment is finished.
The Chaos Experiment is the chaos action that will be performed on your application. Experiments range from Kubernetes-specific actions, like deleting a pod or hogging the network, to application-specific actions, like randomly deleting an OpenEBS drive.
The LitmusChaos community maintains an online hub of chaos experiments.
The Chaos Engine is the link between the chaos experiment and the application under test. This is where you specify the parameters of your experiment, such as its duration and enable/disable policies (e.g., enable/disable monitoring), as well as information on how to find the targets of the experiment (typically, the application under test).
The Application Under Test is the application that will be the "target" of the chaos experiment. Currently, LitmusChaos supports Deployments, StatefulSets and DaemonSets. Under the default configuration, you need to add the litmuschaos.io/chaos: "true" annotation to the resource so that the Chaos Operator can find it, and to prevent other applications from being affected.
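As a minimal sketch of opting a workload in, the marker can be added with kubectl annotate. The function name enable_chaos is hypothetical, and the app=hello-world selector is borrowed from the example used later in this post; adjust both to your own application.

```shell
# Hypothetical helper: opt every Deployment matching a label selector into
# chaos testing by adding the annotation the Chaos Operator looks for.
enable_chaos() {
  local selector="$1"   # for example app=hello-world (assumed label)
  # --overwrite keeps the command idempotent if the annotation already exists.
  kubectl annotate deployment -l "$selector" litmuschaos.io/chaos="true" --overwrite
}
```

Calling enable_chaos app=hello-world would annotate the sample application; resources without the annotation should be left alone by the operator under the default configuration.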
To chaos-test your application you'll need:
- The okteto CLI.
- A free Okteto Cloud account.
- kubectl configured to talk to Okteto Cloud.
- Your favorite IDE or text editor.
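Before going further, it can help to confirm that the command-line tools from this list are actually on your PATH. The sketch below is one way to do that; check_tools is a hypothetical helper name, and it only checks that the commands exist, not that kubectl points at Okteto Cloud.

```shell
# Hypothetical helper: report any required command-line tool that is not on the PATH.
check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  # Returns 0 when every tool was found, 1 otherwise.
  return "$missing"
}
```

For this walkthrough you would call it as check_tools okteto kubectl.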
You could install every component by hand. Instead, I'm taking advantage of Okteto's pre-configured development environments. Just click on the Develop on Okteto button below to deploy your chaos-ready development environment:
This will automatically deploy the following resources on your Okteto Cloud account:
- The Litmus Chaos operator, from the Okteto catalog
- The pod-delete chaos experiment
- The application under test
Now that we have our development environment, let's chaos test the application. For this example, we are using the traditional Hello World application, deployed with two replicas. Click on the link and call it a few times to verify that it works fine.
With the application running, we are ready to start the chaos experiment. In Litmus-speak, this means creating the ChaosEngine resource. Create a file called engine.yaml, open it in your favorite IDE, and paste the content below:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-killer-chaos
spec:
  annotationCheck: 'true'
  engineState: 'active'
  appinfo:
    applabel: 'app=hello-world'
    appkind: 'deployment'
  chaosServiceAccount: default
  monitoring: false
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: KILL_COUNT
              value: '1'
            - name: TOTAL_CHAOS_DURATION
              value: '60s'
            - name: CHAOS_INTERVAL
              value: '15s'
The ChaosEngine resource has three main sections:
appinfo: This tells the Litmus operator which application to target. You have to specify a label selector and the type of resource.
experiments: A list of experiments to run. In this case, we are running the Pod Delete experiment.
experiments.spec.components: The experiment-specific value overrides. In this case, we are telling the experiment to kill 1 pod over 60 seconds. The available values come from the
Start the chaos experiment by creating the ChaosEngine resource with:
$ kubectl apply -f engine.yaml
The experiment will kill one of our application's pods. If you run the command below once the experiment has started, you'll see how a random pod is killed and then automatically recreated:
$ kubectl get pod -l=app=hello-world
NAME                           READY   STATUS              RESTARTS   AGE
hello-world-75947547d4-2fcbc   1/1     Running             0          57m
hello-world-75947547d4-c6wsv   0/1     ContainerCreating   0          10s
While the experiment is running, keep refreshing the browser. Notice how the responses show different pod names, but the calls are never interrupted? That's because our application is resilient to pod destruction 💪🏻!
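If you'd rather not refresh by hand, the same check can be scripted. This is only a sketch: probe_app and the PROBE_INTERVAL variable are hypothetical names, and you need to pass in the URL of your own hello-world endpoint, which this post does not spell out.

```shell
# Hypothetical helper: call a URL repeatedly and print how many requests
# did not come back with HTTP 200.
probe_app() {
  local url="$1"
  local count="${2:-30}"
  local failures=0
  local i=0
  local code
  while [ "$i" -lt "$count" ]; do
    # -s: silent, -o /dev/null: discard the body, -w: print only the status code
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")" || code="000"
    [ "$code" = "200" ] || failures=$((failures + 1))
    i=$((i + 1))
    sleep "${PROBE_INTERVAL:-1}"
  done
  echo "$failures"
}
```

Run it in a second terminal while the experiment is active; a printed count of 0 means no request failed during the window.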
When an experiment is created, Litmus creates a ChaosResult resource to hold the result of the experiment. The status.experimentstatus.verdict key is set to Awaited while the experiment is in progress. Once it finishes, it will change to either Pass or Fail:
$ kubectl describe chaosresult pod-killer-chaos-pod-delete
Name:         pod-killer-chaos-pod-delete
Namespace:    rberrelleza
Labels:       name=pod-killer-chaos-pod-delete
Annotations:  <none>
API Version:  litmuschaos.io/v1alpha1
Kind:         ChaosResult
Metadata:
  Creation Timestamp:  2020-08-05T21:14:05Z
  Generation:          5
  Resource Version:    165298631
  Self Link:           /apis/litmuschaos.io/v1alpha1/namespaces/rberrelleza/chaosresults/pod-killer-chaos-pod-delete
  UID:                 a7f50d28-1f14-4a03-9013-94e72d69eb72
Spec:
  Engine:      pod-killer-chaos
  Experiment:  pod-delete
Status:
  Experimentstatus:
    Fail Step:  N/A
    Phase:      Running
    Verdict:    Awaited
Events:
  Type    Reason   Age   From                     Message
  ----    ------   ----  ----                     -------
  Normal  Summary  45m   experiment-l0k004-5x2fj  pod-delete experiment has been Passed
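Instead of re-running describe until the verdict changes, you could poll for it. The sketch below assumes the verdict lives at .status.experimentstatus.verdict, which is inferred from the Status section of the output above rather than taken from the Litmus API reference; wait_for_verdict and POLL_INTERVAL are hypothetical names.

```shell
# Hypothetical helper: poll a ChaosResult until its verdict is no longer "Awaited".
wait_for_verdict() {
  local result="$1"
  local attempts="${2:-30}"
  local verdict=""
  while [ "$attempts" -gt 0 ]; do
    # Assumed field path, inferred from the describe output of the ChaosResult.
    verdict="$(kubectl get chaosresult "$result" \
      -o jsonpath='{.status.experimentstatus.verdict}')" || verdict=""
    if [ -n "$verdict" ] && [ "$verdict" != "Awaited" ]; then
      echo "$verdict"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "${POLL_INTERVAL:-10}"
  done
  echo "timed out waiting for $result" >&2
  return 1
}
```

For the engine in this post, the call would be wait_for_verdict pod-killer-chaos-pod-delete. The function prints the final verdict and returns a non-zero status if it gives up first, which makes it easy to use in a script.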
In a future post I'll show how you can take your chaos testing to the next level and write your own application-specific experiments. Can't wait? Karthik from MayaData wrote a pretty cool getting started guide. And it happens to use the okteto CLI as part of the dev flow. How cool is that?
Litmus has a monthly community call where the community gets together and talks about their cool use cases and needs. The team was nice enough to invite me to this month's call to demo the workflow I showed you in this post. It's a great place to talk with and learn from other practitioners.
In this post, we showed how you can deploy a replicable development environment that includes an application, the LitmusChaos operator, and your chaos experiment, all in one click. Then, we ran a chaos experiment, validating that our application is resilient to a pod failure.
This is a great example of how you can use Okteto to accelerate your entire team. One person configures the app with the chaos tools, and everyone else can create their own namespace on demand, deploy a pre-configured, chaos-ready development environment, and start running experiments without having to think twice about installation scripts or infrastructure configuration.
Thanks to Karthik Satchitanand and Prithvi Raj for reading drafts of this.