Chaos Engineering with Litmus and Okteto Cloud

Cloud Native applications are, by definition, highly distributed, elastic, resistant to failure, and loosely coupled. That’s easy to say, and even to diagram. But how do we validate that our applications will perform as expected under different failure conditions?

Enter Chaos Engineering. Chaos Engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions. It is a great tool for finding weaknesses and misconfigurations in our services. It is particularly important for Cloud Native applications, which, due to their distributed and elastic nature, need to be resilient by default.

Litmus is a CNCF sandbox project for practicing Chaos Engineering in Cloud Native environments. Litmus provides a chaos-operator, a large set of chaos experiments in its hub, detailed documentation, a quick demo, and a friendly community. In this post we’ll show you how you can use Litmus and Okteto together to start chaos testing your applications in a few seconds.

Chaos Testing with Litmus

When chaos testing an application with LitmusChaos, there are four components that you’ll need to keep in mind.

Chaos Operator

This is the core part of LitmusChaos. The operator is in charge of executing the experiments, and reporting the results once the experiment is finished.

You can install it directly from the command line, with the official Helm chart, or from the Okteto Cloud catalog.

Chaos Experiment

This is the chaos action that will be performed on your application. Experiments range from Kubernetes-specific actions, like deleting a pod or hogging the network, to application-specific actions, like randomly deleting an OpenEBS drive.

The LitmusChaos community maintains an online hub of chaos experiments.

Chaos Engine

The Chaos Engine is the link between the chaos experiment and the application under test. This is where you specify the parameters of your experiment, such as its duration, the policies to enable or disable (e.g., monitoring), and information on how to find the targets of the experiment (typically, the application under test).

Application Under Test

This is the application that will be the “target” of the chaos experiment. Currently, LitmusChaos supports Deployments, StatefulSets and DaemonSets. Under the default configuration, you need to add the litmuschaos.io/chaos: "true" annotation to the resource so that the Chaos Operator can find it, and to prevent other applications from being affected.
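The targeting rule above can be sketched as a small check. This only illustrates the rule as described in this post and is not the Chaos Operator's actual code: the function name is_chaos_target is hypothetical, and the marker is assumed to be stored under metadata.annotations of the workload.

```python
# Workload kinds this post says LitmusChaos supports.
SUPPORTED_KINDS = {"Deployment", "StatefulSet", "DaemonSet"}
# Opt-in marker named in this post; assumed to be stored as an annotation.
CHAOS_ANNOTATION = "litmuschaos.io/chaos"


def is_chaos_target(manifest: dict) -> bool:
    """Return True if a workload manifest has opted in to chaos experiments."""
    metadata = manifest.get("metadata") or {}
    annotations = metadata.get("annotations") or {}
    return (
        manifest.get("kind") in SUPPORTED_KINDS
        and annotations.get(CHAOS_ANNOTATION) == "true"
    )
```

A Deployment without the marker returns False, which mirrors how unmarked applications are left alone under the default configuration.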

Prerequisites

To chaos test your application you’ll need:

  1. The okteto CLI.
  2. A free Okteto Cloud account.
  3. kubectl configured to talk to Okteto Cloud.
  4. Your favorite IDE or text editor.

Deploy your Chaos-ready Development Environment

You can always install every component by hand. Instead, I’m taking advantage of Okteto’s pre-configured development environments. Just click on the Develop on Okteto button below and deploy your chaos-ready development environment:

This will automatically deploy the application, the LitmusChaos operator, and the chaos experiment on your Okteto Cloud account.

Chaos Test the Application

Now that we have our development environment, let’s chaos test the application. For this example, we are using the traditional Hello World application, deployed with two replicas. Click on the link and call it a few times to verify that it works fine.

With the application running, we are ready to start the chaos experiment. In Litmus-speak, this means creating the ChaosEngine resource. Create a file engine.yaml, open it in your favorite IDE, and paste the content below:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-killer-chaos
spec:
  annotationCheck: 'true'
  engineState: 'active'
  appinfo:
    applabel: 'app=hello-world'
    appkind: 'deployment'
  chaosServiceAccount: default
  monitoring: false
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: KILL_COUNT
              value: '1'
            - name: TOTAL_CHAOS_DURATION
              value: '60s'
            - name: CHAOS_INTERVAL
              value: '15s'

The ChaosEngine resource has three main sections:

  • appinfo: This tells the Litmus operator which application to target. You have to specify a label selector and the type of resource.
  • experiments: A list of experiments to run. In this case, we are running the Pod Delete experiment.
  • experiments.spec.components: The experiment-specific value overrides. In this case, we are telling the experiment to kill 1 pod over 60 seconds. The available values come from the ChaosExperiment resource.
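If you want to generate the same resource for several applications or experiments, the three sections can be assembled programmatically. The following is a minimal sketch that only mirrors the fields used in this post; the helper name build_chaos_engine and its parameters are hypothetical and not part of Litmus.

```python
import json


def build_chaos_engine(name, app_label, app_kind, experiment, env):
    """Build a ChaosEngine manifest with the sections used in this post."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name},
        "spec": {
            "annotationCheck": "true",
            "engineState": "active",
            # appinfo: which application to target.
            "appinfo": {"applabel": app_label, "appkind": app_kind},
            "chaosServiceAccount": "default",
            "monitoring": False,
            "jobCleanUpPolicy": "delete",
            # experiments: the list of experiments to run.
            "experiments": [
                {
                    "name": experiment,
                    "spec": {
                        # components: experiment-specific value overrides.
                        "components": {
                            "env": [
                                {"name": key, "value": str(value)}
                                for key, value in env.items()
                            ]
                        }
                    },
                }
            ],
        },
    }


engine = build_chaos_engine(
    name="pod-killer-chaos",
    app_label="app=hello-world",
    app_kind="deployment",
    experiment="pod-delete",
    env={"KILL_COUNT": 1, "TOTAL_CHAOS_DURATION": "60s", "CHAOS_INTERVAL": "15s"},
)
manifest_json = json.dumps(engine, indent=2)
```

Since kubectl accepts JSON manifests as well as YAML, manifest_json can be written to a file such as engine.json and applied the same way as the YAML version.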

Start the chaos experiment by creating the ChaosEngine resource with kubectl:

$ kubectl apply -f engine.yaml
chaosengine.litmuschaos.io/pod-killer-chaos created

Witness the Chaos

The experiment will kill one of our application’s pods. If you run the command below once the experiment has started, you’ll see how a random pod is killed and then automatically recreated:

$ kubectl get pod -l=app=hello-world
NAME                           READY   STATUS              RESTARTS   AGE
hello-world-75947547d4-2fcbc   1/1     Running             0          57m
hello-world-75947547d4-c6wsv   0/1     ContainerCreating   0          10s

While the experiment is running, keep refreshing the browser. Notice how the calls display different pod names but are never interrupted? That’s because our application is resilient to pod destruction 💪🏻!
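Refreshing the browser by hand works, but the same check can be scripted. Below is a rough sketch, assuming the application answers plain HTTP GET requests and includes the pod name in its response body. The helpers http_get and probe are hypothetical names, and you would need to supply your own application URL.

```python
import time
import urllib.request
from typing import Callable


def http_get(url: str) -> str:
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8", errors="replace")


def probe(fetch: Callable[[], str], attempts: int, interval: float = 1.0) -> dict:
    """Call fetch repeatedly, counting failures and distinct responses."""
    bodies = set()
    failures = 0
    for attempt in range(attempts):
        try:
            bodies.add(fetch())
        except Exception:
            # Any error (timeout, connection refused, HTTP error) is an interruption.
            failures += 1
        if attempt < attempts - 1:
            time.sleep(interval)
    return {
        "attempts": attempts,
        "failures": failures,
        "distinct_responses": len(bodies),
    }
```

Calling probe(lambda: http_get(your_url), attempts=60) while the experiment runs should report zero failures for a resilient application, and more than one distinct response if different pods answered.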

When an experiment is created, a ChaosResult resource will be created to hold the result of the experiment. The status.experimentstatus.verdict key is set to Awaited while the experiment is in progress. Once it finishes, it will change to either Pass or Fail.

$ kubectl describe chaosresult pod-killer-chaos-pod-delete
Name:         pod-killer-chaos-pod-delete
Namespace:    rberrelleza
Labels:       name=pod-killer-chaos-pod-delete
Annotations:  <none>
API Version:  litmuschaos.io/v1alpha1
Kind:         ChaosResult
Metadata:
  Creation Timestamp:  2020-08-05T21:14:05Z
  Generation:          5
  Resource Version:    165298631
  Self Link:           /apis/litmuschaos.io/v1alpha1/namespaces/rberrelleza/chaosresults/pod-killer-chaos-pod-delete
  UID:                 a7f50d28-1f14-4a03-9013-94e72d69eb72
Spec:
  Engine:      pod-killer-chaos
  Experiment:  pod-delete
Status:
  Experimentstatus:
    Fail Step:  N/A
    Phase:      Running
    Verdict:    Awaited
Events:
  Type    Reason   Age   From                     Message
  ----    ------   ----  ----                     -------
  Normal  Summary  45m   experiment-l0k004-5x2fj  pod-delete experiment has been Passed
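To check the verdict from a script rather than reading the describe output, you could parse the JSON form of the same resource. This is a sketch under assumptions: the field path status.experimentstatus.verdict is inferred from the output above, the function names are hypothetical, and fetch_chaos_result needs kubectl configured for your cluster.

```python
import json
import subprocess

# Verdicts that mean the experiment is over, per this post.
FINAL_VERDICTS = {"Pass", "Fail"}


def fetch_chaos_result(name: str) -> dict:
    """Fetch a ChaosResult as a dict by shelling out to kubectl."""
    completed = subprocess.run(
        ["kubectl", "get", "chaosresult", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)


def read_verdict(chaos_result: dict) -> str:
    """Return the verdict, treating a missing field as still Awaited (an assumption)."""
    status = chaos_result.get("status") or {}
    experiment_status = status.get("experimentstatus") or {}
    return experiment_status.get("verdict", "Awaited")


def is_finished(chaos_result: dict) -> bool:
    """True once the verdict has moved from Awaited to Pass or Fail."""
    return read_verdict(chaos_result) in FINAL_VERDICTS
```

For example, is_finished(fetch_chaos_result("pod-killer-chaos-pod-delete")) could be polled in a loop until it returns True.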

Extra Chaos

In a future post I’ll show how you can take your chaos testing to the next level and write your own application-specific experiments. Can’t wait? Karthik from MayaData wrote a pretty cool getting started guide. And it happens to use the okteto CLI as part of the dev flow. How cool is that?

Litmus has a monthly community call where the community gets together and talks about their cool use cases and needs. The team was nice enough to invite me to this month’s call to demo the workflow I showed you in this post. It’s a great place to talk with and learn from other practitioners.

Conclusion

In this post, we showed how you can deploy a replicable development environment that includes an application, the LitmusChaos operator, and your chaos experiment, all in one click. Then, we ran a chaos experiment, validating that our application is resilient to a pod failure.

This is a great example of how you can use Okteto to accelerate your entire team. One person configures the app with the chaos tools, and everyone else can create their own namespace on demand, deploy a pre-configured, chaos-ready development environment, and start running experiments without having to think twice about installation scripts or infrastructure configuration.

Let’s keep the conversation going! Join the Okteto and Litmus communities to talk more about Cloud Native development and Chaos Engineering.

Thanks to Karthik Satchitanand and Prithvi Raj for reading drafts of this.