Posted on August 20, 2020

Chaos Engineering with Litmus and Okteto Cloud

Cloud Native applications are, by definition, highly distributed, elastic, resistant to failure, and loosely coupled. That's easy to say, and even to diagram. But how do we validate that our applications will perform as expected under different failure conditions?

Enter Chaos Engineering. Chaos Engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions. It is a great tool to help us find weaknesses and misconfigurations in our services. It is particularly important for Cloud Native applications, which, due to their distributed and elastic nature, need to be resilient by default.

Litmus is a CNCF sandbox project for practicing Chaos Engineering in Cloud Native environments. Litmus provides a chaos-operator, a large set of chaos experiments in its hub, detailed documentation, a quick demo, and a friendly community. In this post, we'll show you how you can use Litmus and Okteto together to start chaos testing your applications in a few seconds.

Chaos Testing with Litmus

When chaos testing an application with LitmusChaos, there are four components that you'll need to keep in mind.

Chaos Operator

This is the core part of LitmusChaos. The operator is in charge of executing the experiments, and reporting the results once the experiment is finished.

You can install it directly from the command line, with the official Helm chart, or from the Okteto Cloud catalog.

Chaos Experiment

This is the chaos action that will be performed on your application. Experiments range from Kubernetes-specific actions, like deleting a pod or hogging the network, to application-specific actions, like randomly deleting an OpenEBS drive.

The LitmusChaos community maintains an online hub of chaos experiments.

Chaos Engine

The Chaos Engine is the link between the chaos experiment and the application under test. This is where you specify the parameters of your experiment, such as its duration and enable/disable policies (e.g., enabling or disabling monitoring), as well as information on how to find the targets of the experiment (typically, the application under test).

Application Under Test

This is the application that will be the "target" of the chaos experiment. Currently, LitmusChaos supports Deployments, StatefulSets and DaemonSets. Under the default configuration, you need to add the litmuschaos.io/chaos: "true" annotation to the resource so that the Chaos Operator can find it, and so that other applications are not affected.
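As a rough sketch of how a workload could be opted in, the snippet below wraps kubectl annotate in a small helper. The helper name enable_chaos is hypothetical, and the example assumes a Deployment called hello-world (the name used later in this post) in the namespace your kubectl context points to.

```shell
# Hypothetical helper: opt a Deployment in to chaos experiments by adding
# the annotation that the Chaos Operator looks for.
# $1 = name of the Deployment in the current namespace.
enable_chaos() {
  deployment="$1"
  kubectl annotate deployment "$deployment" litmuschaos.io/chaos="true" --overwrite
}

# Usage:
#   enable_chaos hello-world
```

The --overwrite flag keeps the command from failing if the annotation is already present.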

Prerequisites

To chaos-test your application you'll need:

The okteto CLI.
A free Okteto Cloud account.
kubectl configured to talk to Okteto Cloud.
Your favorite IDE or text editor.

Deploy your Chaos-ready Development Environment

You could install every component by hand. Instead, I'm taking advantage of Okteto's pre-configured development environments. Just click on the Develop on Okteto button below to deploy your chaos-ready development environment:

This will automatically deploy the following resources on your Okteto Cloud account:

The Litmus Chaos operator, from the Okteto catalog
The pod-delete chaos experiment
The application under test

[Screenshot: Development Environment deployed]

Chaos Test the Application

Now that we have our development environment, let's chaos test the application. For this example, we are using the traditional Hello World application, deployed with two replicas. Click on the link and call it a few times to verify that it works fine.

[Screenshot: The application running]

With the application running, we are ready to start the chaos experiment. In Litmus-speak, this means creating the ChaosEngine resource. Create a file engine.yaml, open it in your favorite IDE, and paste the content below:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-killer-chaos
spec:
  annotationCheck: 'true'
  engineState: 'active'
  appinfo:
    applabel: 'app=hello-world'
    appkind: 'deployment'
  chaosServiceAccount: default
  monitoring: false
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: KILL_COUNT
              value: '1'
            - name: TOTAL_CHAOS_DURATION
              value: '60s'
            - name: CHAOS_INTERVAL
              value: '15s'

The ChaosEngine resource has three main sections:

appinfo: This tells the Litmus operator which application to target. You have to specify a label selector and the type of resource.
experiments: A list of experiments to run. In this case, we are running the Pod Delete experiment.
experiments.spec.components: The experiment-specific value overrides. In this case, we are telling the experiment to kill one pod at a time, with a 15-second interval between kills, over a total of 60 seconds. The available values come from the ChaosExperiment resource.
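To see which values an experiment exposes before overriding them, you could list the defaults declared by the ChaosExperiment resource. This is a minimal sketch: the helper name experiment_defaults is hypothetical, and it assumes the defaults are stored under .spec.definition.env of the ChaosExperiment installed in your namespace.

```shell
# Hypothetical helper: print NAME=VALUE for each default environment
# variable declared by a ChaosExperiment in the current namespace.
# Assumes the defaults live under .spec.definition.env.
# $1 = name of the ChaosExperiment.
experiment_defaults() {
  experiment="$1"
  kubectl get chaosexperiment "$experiment" \
    -o jsonpath='{range .spec.definition.env[*]}{.name}={.value}{"\n"}{end}'
}

# Usage:
#   experiment_defaults pod-delete
```

Any variable listed there can be overridden in the env list of the ChaosEngine.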

Start the chaos experiment by creating the ChaosEngine resource with kubectl:

$ kubectl apply -f engine.yaml

chaosengine.litmuschaos.io/pod-killer-chaos created

Witness the Chaos

The experiment will kill one of our application's pods. If you run the command below once the experiment has started, you'll see how a random pod is killed and then automatically recreated:

$ kubectl get pod -l=app=hello-world

NAME                                 READY   STATUS              RESTARTS   AGE
hello-world-75947547d4-2fcbc         1/1     Running             0          57m
hello-world-75947547d4-c6wsv         0/1     ContainerCreating   0          10s

While the experiment is running, keep refreshing the browser. Notice how the calls display different pod names, but are never interrupted? That's because our application is resilient to pod destruction 💪🏻!
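If you'd rather not refresh the browser by hand, the same check can be scripted. The sketch below is only an illustration: probe_app is a hypothetical helper, and APP_URL stands for whatever URL Okteto Cloud assigned to your application.

```shell
# Hypothetical helper: call a URL repeatedly and report how many requests failed.
# $1 = URL of the application
# $2 = number of requests to send
# $3 = pause between requests, in seconds (default 1)
probe_app() {
  url="$1"
  total="$2"
  pause="${3:-1}"
  failures=0
  i=0
  while [ "$i" -lt "$total" ]; do
    # -f makes curl exit non-zero on HTTP errors; --max-time bounds each request.
    if ! curl -fsS -o /dev/null --max-time 2 "$url"; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  echo "$failures of $total requests failed"
}

# Usage (APP_URL is a placeholder for your application's URL):
#   probe_app "$APP_URL" 60
```

Running it for roughly the duration of the experiment gives a quick count of failed calls, which should be zero if the remaining replica keeps serving traffic.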

When an experiment is created, a ChaosResult resource is created to hold the result of the experiment. The status.experimentstatus.verdict key is set to Awaited while the experiment is in progress. Once it finishes, it will change to either Pass or Fail.

$ kubectl describe chaosresult pod-killer-chaos-pod-delete

Name:         pod-killer-chaos-pod-delete
Namespace:    rberrelleza
Labels:       name=pod-killer-chaos-pod-delete
Annotations:  <none>
API Version:  litmuschaos.io/v1alpha1
Kind:         ChaosResult
Metadata:
  Creation Timestamp:  2020-08-05T21:14:05Z
  Generation:          5
  Resource Version:    165298631
  Self Link:           /apis/litmuschaos.io/v1alpha1/namespaces/rberrelleza/chaosresults/pod-killer-chaos-pod-delete
  UID:                 a7f50d28-1f14-4a03-9013-94e72d69eb72
Spec:
  Engine:      pod-killer-chaos
  Experiment:  pod-delete
Status:
  Experimentstatus:
    Fail Step:  N/A
    Phase:      Running
    Verdict:    Awaited
Events:
  Type    Reason   Age   From                     Message
  ----    ------   ----  ----                     -------
  Normal  Summary  45m   experiment-l0k004-5x2fj  pod-delete experiment has been Passed
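
Instead of running describe repeatedly, you could poll for the verdict. This is a sketch under assumptions: wait_for_verdict is a hypothetical helper, and the JSONPath expression assumes the verdict is stored at .status.experimentstatus.verdict, mirroring the layout of the output above.

```shell
# Hypothetical helper: poll a ChaosResult until its verdict is no longer
# "Awaited", then print the final verdict.
# Assumes the verdict lives at .status.experimentstatus.verdict.
# $1 = name of the ChaosResult
# $2 = seconds to wait between polls (default 5)
wait_for_verdict() {
  result="$1"
  interval="${2:-5}"
  while :; do
    verdict=$(kubectl get chaosresult "$result" \
      -o jsonpath='{.status.experimentstatus.verdict}')
    # Stop as soon as a verdict other than "Awaited" is reported.
    if [ -n "$verdict" ] && [ "$verdict" != "Awaited" ]; then
      break
    fi
    sleep "$interval"
  done
  echo "$verdict"
}

# Usage:
#   wait_for_verdict pod-killer-chaos-pod-delete
```

Note that this sketch has no timeout: it keeps polling until a verdict is reported, so interrupt it if the ChaosResult never appears.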

Extra Chaos

In a future post I'll show how you can take your chaos testing to the next level and write your own application-specific experiments. Can't wait? Karthik from MayaData wrote a pretty cool getting started guide. And it happens to use the okteto CLI as part of the dev flow. How cool is that?

Litmus has a monthly community call where the community gets together and talks about their cool use cases and needs. The team was nice enough to invite me to this month's call to demo the workflow I showed you in this post. It's a great place to talk with and learn from other practitioners.

Conclusion

In this post, we showed how you can deploy a replicable development environment that includes an application, the LitmusChaos operator, and your chaos experiment, all in one click. Then, we ran a chaos experiment, validating that our application is resilient to a pod failure.

This is a great example of how you can use Okteto to accelerate your entire team. One person configures the app with the chaos tools, and everyone else can create their own namespace on demand, deploy a pre-configured, chaos-ready development environment, and start running experiments without having to think twice about installation scripts or infrastructure configuration.

Let's keep the conversation going! Join the Okteto and Litmus communities to talk more about Cloud Native development and Chaos Engineering.