<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Uma Mukkara</title>
    <description>The latest articles on Forem by Uma Mukkara (@umamukkara).</description>
    <link>https://forem.com/umamukkara</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F395404%2F38747198-924e-4a89-8f20-4473c7beeedb.jpeg</url>
      <title>Forem: Uma Mukkara</title>
      <link>https://forem.com/umamukkara</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/umamukkara"/>
    <language>en</language>
    <item>
      <title>Cloud native chaos engineering principles - Version 2</title>
      <dc:creator>Uma Mukkara</dc:creator>
      <pubDate>Mon, 25 Oct 2021 14:26:38 +0000</pubDate>
      <link>https://forem.com/umamukkara/cloud-native-chaos-engineering-principles-version-2-4kl9</link>
      <guid>https://forem.com/umamukkara/cloud-native-chaos-engineering-principles-version-2-4kl9</guid>
<description>&lt;p&gt;When we started with the &lt;a href="https://litmuschaos.io"&gt;Litmus project&lt;/a&gt;, we defined a subcategory of Chaos Engineering called &lt;strong&gt;Cloud Native Chaos Engineering&lt;/strong&gt; and set some architectural goals for building a generic stack around it. They are published &lt;a href="https://www.cncf.io/blog/2019/11/06/cloud-native-chaos-engineering-enhancing-kubernetes-application-resiliency/"&gt;here&lt;/a&gt;. As we spent more time with the Litmus community, and as new technologies such as GitOps evolved around cloud native, we updated the core chaos engineering principles around which Litmus has grown into a fully featured platform for practicing end-to-end chaos engineering on cloud native services and applications.&lt;/p&gt;

&lt;p&gt;The first version started with four principles: open source, chaos APIs/CRDs, plugins, and community chaos. We realised that the plugins principle is really about integration with other DevOps tools, which a good API can provide. As the chaos community evolved, we observed two additional patterns: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Observability and chaos engineering are closely related.&lt;/li&gt;
&lt;li&gt;Scaling and automating chaos engineering is an important aspect.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With these observations, we defined the following five cloud native chaos engineering principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open source
&lt;/li&gt;
&lt;li&gt;Community collaborated experiments&lt;/li&gt;
&lt;li&gt;Open API and chaos life cycle management&lt;/li&gt;
&lt;li&gt;Scaling and automating through GitOps&lt;/li&gt;
&lt;li&gt;Observability through generic chaos metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Open Source
&lt;/h2&gt;

&lt;p&gt;Cloud native communities and technologies have always revolved around open source. Chaos engineering frameworks that are open source benefit from the strong communities that build around them, which help make them more comprehensive, rugged, and feature-rich.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chaos Experiments as Building Blocks
&lt;/h2&gt;

&lt;p&gt;Chaos experiments need to be simple to use, highly flexible, and tunable. They have to be rugged, with little or no chance of producing false negatives or false positives. Chaos experiments are like Lego blocks: you can use them to build meaningful chaos workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Manageable Chaos Experiments and API
&lt;/h2&gt;

&lt;p&gt;Chaos engineering has to employ well-known software engineering practices. Managing chaos scenarios can quickly become complex: more team members get involved, changes happen frequently, and requirements are altered, so upgrading chaos experiments becomes common. The chaos engineering framework should make managing chaos experiments easy and simple, and it should be done the Kubernetes way. Developers and operators should think of chaos experiments as Kubernetes custom resources.&lt;/p&gt;
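&lt;p&gt;As an illustration, a minimal ChaosEngine custom resource tying a target application to a pod-delete experiment might look like the sketch below. The names, labels, and service account are hypothetical; consult the Litmus docs for the authoritative schema:&lt;/p&gt;

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos              # hypothetical engine name
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: app=nginx          # hypothetical target label
    appkind: deployment
  chaosServiceAccount: pod-delete-sa   # hypothetical service account
  experiments:
    - name: pod-delete           # experiment pulled from ChaosHub
```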

&lt;h2&gt;
  
  
  Scale Through GitOps
&lt;/h2&gt;

&lt;p&gt;Start with the low-hanging fruit: obvious and simple issues. As you fix them, chaos scenarios become more comprehensive and larger, and their number also increases. Chaos scenarios need to be automated, or triggered when a change is made to the application or the service. Tools around GitOps may be used to trigger chaos when a configuration change happens to either the application or the chaos experiments.&lt;/p&gt;
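&lt;p&gt;For example, a GitOps tool such as Argo CD can be pointed at a Git repository of chaos manifests, so that committing a change to an experiment (or to the application configuration) syncs and triggers the corresponding chaos run. A minimal sketch, assuming a hypothetical repository and path layout:&lt;/p&gt;

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-workflows                # hypothetical Application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/chaos-manifests   # hypothetical repo
    targetRevision: main
    path: experiments                  # directory of chaos manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: litmus
  syncPolicy:
    automated: {}                      # auto-sync: Git commits apply the change
```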

&lt;h2&gt;
  
  
  Open Observability
&lt;/h2&gt;

&lt;p&gt;Observability and chaos engineering go together when fixing issues related to reliability. Many observability stacks and systems are well developed and in active use. Introducing chaos engineering should not require a new observability system; rather, the chaos engineering context should fit nicely into the existing one. To do this, chaos metrics from the system where chaos is introduced are exported into the existing observability database, and the chaos context is painted onto the existing dashboards. The dashboard below uses red rectangles to depict the chaos periods, making the chaos context very clear to the user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6NTzq7cr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jmn7eek47bczmqlftgmk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6NTzq7cr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jmn7eek47bczmqlftgmk.png" alt="Chaos Engineering observability"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  LitmusChaos 2.0
&lt;/h2&gt;

&lt;p&gt;Litmus recently achieved 2.0 GA status with all the principles of cloud native chaos engineering implemented in it. Litmus makes the practice of chaos engineering easy. Get started with Litmus &lt;a href="https://docs.litmuschaos.io"&gt;here&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>cloudnative</category>
      <category>chaosengineering</category>
      <category>litmuschaos</category>
      <category>devops</category>
    </item>
    <item>
      <title>Introduction to LitmusChaos</title>
      <dc:creator>Uma Mukkara</dc:creator>
      <pubDate>Fri, 31 Jul 2020 12:20:22 +0000</pubDate>
      <link>https://forem.com/umamukkara/introduction-to-litmuschaos-4ibl</link>
      <guid>https://forem.com/umamukkara/introduction-to-litmuschaos-4ibl</guid>
<description>&lt;p&gt;&lt;em&gt;LitmusChaos&lt;/em&gt; is a CNCF sandbox project. Its mission is to help &lt;em&gt;Kubernetes SREs&lt;/em&gt; and developers find weaknesses in the Kubernetes platform and in applications running on Kubernetes by providing a complete &lt;em&gt;Chaos Engineering&lt;/em&gt; framework and associated chaos experiments. In this article, we will discuss:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why is resilience important for Kubernetes?&lt;/li&gt;
&lt;li&gt;How to achieve it using Litmus?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  State of Kubernetes
&lt;/h2&gt;

&lt;p&gt;Thanks to Kubernetes and the ecosystem built by CNCF, a common API for microservices orchestration has become a reality for developers. Kubernetes is believed to have crossed the chasm of the technology adoption cycle. As adoption continues to increase, we need more tools to ensure that adopting Kubernetes is seamless and stable. One area that Kubernetes developers and SREs need to focus on is resilience. As an SRE, how do I make sure that my application is resilient against possible failures? What process lets me tackle resilience? These are the questions we try to answer in this article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1fux19ningw5sz8ormcc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1fux19ningw5sz8ormcc.png" alt="Kubernetes crossing the chasm"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes needs tools and infrastructure to validate the resilience of the platform and applications running on the platform.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Definition of Resilience
&lt;/h3&gt;

&lt;p&gt;Resilience is the system’s ability to stay afloat when a fault occurs. Staying afloat means different things to different people under different circumstances. Whoever is looking at resilience will usually have a steady-state hypothesis for their system. If that steady state is regained after a fault occurs, then the system is said to be resilient against that fault. Many types of faults can occur, and they can happen in any sequence. Though you would want your system to be resilient against every fault, individually or in combination, all the time, it is practical to accept that this is far-fetched: no system is 100% resilient. &lt;/p&gt;
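&lt;p&gt;The steady-state hypothesis described above can be sketched in a few lines of code. This is a toy illustration, not part of Litmus; the probe, thresholds, and fault-injection hook are all hypothetical:&lt;/p&gt;

```python
def is_steady(latency_ms: float, error_rate: float) -> bool:
    """A sample steady-state hypothesis: p99 latency under 300 ms
    and error rate under 1% (thresholds are illustrative)."""
    return latency_ms < 300 and error_rate < 0.01

def verify_resilience(probe, inject_fault) -> bool:
    """One chaos check: confirm the steady state, inject a fault,
    then test whether the steady state is regained."""
    if not is_steady(*probe()):
        raise RuntimeError("system was not steady before chaos")
    inject_fault()
    return is_steady(*probe())  # resilient iff the hypothesis holds again

# Simulated system that stays healthy after the fault:
readings = iter([(120.0, 0.001), (150.0, 0.004)])
print(verify_resilience(lambda: next(readings), lambda: None))  # True
```

&lt;p&gt;A real run would replace the probe with live service measurements and the fault hook with an actual chaos injection.&lt;/p&gt;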

&lt;p&gt;In the context of Kubernetes, resilience is even more important because Kubernetes architecture works on the principle of reconciling to the desired state. The state of a resource inside Kubernetes can be changed by Kubernetes itself or by external means. &lt;/p&gt;

&lt;p&gt;Some examples of common faults on Kubernetes are shown below. Pod evictions and nodes going into the “not-ready” state are not uncommon in Kubernetes environments, but the steady-state hypothesis varies widely depending on where you look inside Kubernetes. If a pod is evicted and is quickly rescheduled on another node, then Kubernetes is resilient. Still, if the service that depends on this pod goes down or becomes slow, then that service is not resilient against the pod-eviction fault. In summary, resilience is context-sensitive, and improving it needs to be treated as a practice rather than a specific set of tasks. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh1yivkg6blrut81tfrvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh1yivkg6blrut81tfrvm.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Importance of Resilience
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6ykia9aje8544h6dj6yg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6ykia9aje8544h6dj6yg.png" alt="Kubernetes resilience stack"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above diagram shows the dependency stack of resilience for your application. At the bottom of the dependency stack is the physical infrastructure. Kubernetes runs on a variety of infrastructure, ranging from virtual machines to bare metal and combinations of them. The platform’s physical nature is a source of faults for the application that runs inside containers, shown at the tip of the diagram. The next layer of dependency is Kubernetes itself. Gone are the days when platform software such as Linux changed once a year; expect Kubernetes upgrades every quarter, at least for the next few years. Each upgrade has to be tested carefully to ensure that the upgraded software solves the expected problem without introducing any breaking scenarios. On top of Kubernetes, you have other services such as CoreDNS, Envoy, Istio, Prometheus, databases, and middleware, which are necessary for the functioning of your cloud-native environment. These cloud-native services also go through frequent upgrades. &lt;/p&gt;

&lt;p&gt;If you look at the above facts, your application resilience really depends more on the underlying stack than your application itself. It is possible that once your application is stabilized, the resilience of your service that runs on Kubernetes depends on other components and infrastructure more than 90% of the time. &lt;/p&gt;

&lt;p&gt;Thus it is important to verify your application resilience whenever a change has happened in the underlying stack. “Keep verifying” is the key. Robust testing before upgrades is not good enough, mainly because you cannot possibly consider all sorts of faults during upgrade testing. This introduces the concept of chaos engineering. The process of “continuously verifying if your service is resilient against faults” is called chaos engineering. For the reasons stated above, overall stack resilience has to be achieved, and chaos engineering against the entire stack must be practiced.&lt;/p&gt;

&lt;h3&gt;
  
  
  Achieving Resilience on Kubernetes
&lt;/h3&gt;

&lt;p&gt;Achieving resilience requires the practice of chaos engineering. SRE teams are prioritizing chaos engineering early in their move to Kubernetes. This is also known as the “Chaos First” principle, per Adrian Cockcroft.&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;Chaos first is a really important principle. It’s too hard to add resilience to a finished project. Needs to be a gate for deployment in the first place. &lt;a href="https://t.co/xi98y8JZpJ" rel="noopener noreferrer"&gt;https://t.co/xi98y8JZpJ&lt;/a&gt;&lt;/p&gt;— adrian cockcroft (&lt;a class="mentioned-user" href="https://dev.to/adrianco"&gt;@adrianco&lt;/a&gt;) &lt;a href="https://twitter.com/adrianco/status/1201703004014907392?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;December 3, 2019&lt;/a&gt;
&lt;/blockquote&gt; 

&lt;p&gt;Chaos engineering is more a practice and a discipline than a few tasks or steps, but the fundamental building block remains the following.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdoks78mwblilityfevj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdoks78mwblilityfevj3.png" alt="Chaos engineering principle"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this practice is to be adopted as a natural choice by SREs on Kubernetes, it must be done in a cloud-native way. The basic principle of being cloud native is being declarative. Doing chaos engineering openly, declaratively, and with the community in mind are some of the principles of “Cloud-Native Chaos Engineering.” I covered this topic in detail in a previous post &lt;a href="https://www.cncf.io/blog/2019/11/06/cloud-native-chaos-engineering-enhancing-kubernetes-application-resiliency/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fi9twbe4ewyolnty4ecrj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fi9twbe4ewyolnty4ecrj.png" alt="Cloud native chaos engineering "&gt;&lt;/a&gt;&lt;br&gt;
In summary, in cloud-native chaos engineering you have a set of chaos CRDs, and chaos experiments are available as custom resources. Chaos is managed using well-known declarative APIs.&lt;/p&gt;

&lt;h1&gt;
  
  
  Litmus Project introduction
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F57e5dyq5dm3kmo3o54or.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F57e5dyq5dm3kmo3o54or.png" alt="Chaos Engineering Litmus"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Litmus is a chaos engineering framework for Kubernetes. It provides a complete set of tools required by Kubernetes developers and SREs to carry out chaos tests easily and in a Kubernetes-native way. The project has “declarative chaos” as its fundamental design goal and keeps the community at the center of growing the chaos experiments. &lt;/p&gt;

&lt;p&gt;Litmus has the following components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chaos Operator&lt;/strong&gt;&lt;br&gt;
This operator is built using the Operator SDK framework and manages the lifecycle of a chaos experiment. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chaos CRDs&lt;/strong&gt;&lt;br&gt;
Primarily, there are three chaos CRDs: ChaosEngine, ChaosExperiment, and ChaosResult. Chaos is built, run, and managed using these CRDs. The ChaosEngine CRD ties the target application to ChaosExperiment CRs. When the operator runs an experiment, the result is stored in a ChaosResult CR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chaos experiments or the ChaosHub&lt;/strong&gt;&lt;br&gt;
Chaos experiments are the custom resources on Kubernetes. The YAML specifications for these custom resources are hosted at the public ChaosHub (&lt;a href="https://hub.litmuschaos.io" rel="noopener noreferrer"&gt;https://hub.litmuschaos.io&lt;/a&gt;). &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chaos Scheduler&lt;/strong&gt;&lt;br&gt;
Chaos scheduler supports the granular scheduling of chaos experiments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chaos metrics exporter&lt;/strong&gt;&lt;br&gt;
This is a Prometheus metrics exporter. Chaos metrics, such as the number and type of experiments and their results, are exported into Prometheus. These metrics are combined with target-application metrics to plot graphs that show the effect of chaos on the application’s service or performance.&lt;/p&gt;
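&lt;p&gt;An exported chaos metric is ultimately just a labeled sample line in the Prometheus text exposition format. The snippet below renders one such line; the metric and label names are illustrative, not the exporter’s exact output:&lt;/p&gt;

```python
def render_metric(name: str, labels: dict, value: float) -> str:
    """Render a single sample in the Prometheus text exposition format:
    name{label="value",...} sample_value"""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = render_metric(
    "litmuschaos_passed_experiments",          # illustrative metric name
    {"chaosresult_name": "nginx-pod-delete"},  # illustrative label
    1,
)
print(line)  # litmuschaos_passed_experiments{chaosresult_name="nginx-pod-delete"} 1
```

&lt;p&gt;Prometheus scrapes lines in this shape from the exporter, after which they can be graphed alongside application metrics.&lt;/p&gt;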

&lt;p&gt;&lt;strong&gt;Chaos events exporter&lt;/strong&gt;&lt;br&gt;
Litmus generates a chaos event for every chaos action that it takes. These chaos events are stored in etcd and later exported to an event receiver for correlating or debugging a service affected by chaos injection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Portal&lt;/strong&gt;&lt;br&gt;
Litmus Portal is a centralized web portal for creating, scheduling, and monitoring chaos workflows. A chaos workflow is a set of chaos experiments. Chaos workflows can be scheduled on remote Kubernetes clusters from the portal. SRE teams can share the portal while managing chaos through the portal. &lt;br&gt;
Note: Litmus Portal is currently under development.&lt;/p&gt;

&lt;h3&gt;
  
  
  Litmus use cases
&lt;/h3&gt;

&lt;p&gt;Chaos tests can be done anywhere in the DevOps cycle. The extent of chaos tests varies from CI pipelines to production. In development pipelines, you might use chaos tests specific to the applications being developed. As you move towards operations or production, you expect many failure scenarios against which you want to be resilient, so the number of chaos tests grows significantly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnvqjv92tuxe4b9bc158y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnvqjv92tuxe4b9bc158y.png" alt="Litmus use cases"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Typical use cases of Litmus include failure or chaos testing in CI pipelines, increased chaos testing in staging and production environments, Kubernetes upgrade certification, post-upgrade validation of services, and resilience benchmarking.&lt;/p&gt;

&lt;p&gt;We keep hearing from SREs that they typically see a lot of resistance to introducing chaos from both developers and management. In the practice of chaos engineering, starting with small chaos tests and showing the benefits to developers and management builds the credibility required initially. With time, the number of tests, and the resilience they bring, will also increase. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyfitfl9pvduodf168tcc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyfitfl9pvduodf168tcc.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Chaos engineering is a practice. As seen above, management buy-in and SRE confidence increase with time, and the chaos tests move into production. This process improves resilience metrics as well. &lt;/p&gt;

&lt;h3&gt;
  
  
  Litmus architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9ssfhzn099jzkh32fq86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9ssfhzn099jzkh32fq86.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Litmus architecture treats declarative chaos and scalability of chaos as two important design goals. The Chaos Operator watches for changes to ChaosEngine CRs and spins off a chaos experiment job for each experiment on a target. Multiple chaos jobs can run in parallel. The results are conveniently stored as CRs, so you don’t need to track an experiment for its result; results are always available in the Kubernetes etcd database. The chaos metrics scraper is also a deployment; it scrapes chaos metrics from the ChaosEngine and ChaosResult CRs in etcd. The above diagram shows Litmus chaos execution in one cluster. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fb1c6v2c9a3wz8x5p4zr0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fb1c6v2c9a3wz8x5p4zr0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For scalability and deeper chaos, a set of chaos experiments is put together into a workflow and executed through Argo Workflows. Chaos workflows are constructed and managed in the Litmus portal and run on the target Kubernetes cluster. The portal also includes intuitive chaos analytics and provides an easy experience for teams to develop new chaos experiments through a private ChaosHub called MyHub. &lt;/p&gt;

&lt;h3&gt;
  
  
  Security considerations
&lt;/h3&gt;

&lt;p&gt;Chaos is disruptive by design. One needs to be careful about who can introduce chaos and where. Litmus provides various configurations to control the chaos through policies. Some of them are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Annotations&lt;/strong&gt;: Annotations can be enabled at the application level. When they are enabled, which is the default setting, the target application needs to be annotated with “chaos=true”.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ServiceAccounts&lt;/strong&gt;: RBACs are configurable at each experiment level. Each experiment may require different permissions based on the type of chaos being introduced.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, Litmus can also be run in admin mode, where the chaos experiments themselves run in the Litmus namespace and a service account with admin privileges is attached to Litmus. This mode is recommended only if you are running Litmus for learning purposes or in pure dev environments. It is advised to run Litmus in namespace mode in staging and production environments. &lt;/p&gt;

&lt;h3&gt;
  
  
  Anatomy of a chaos experiment
&lt;/h3&gt;

&lt;p&gt;A chaos experiment in Litmus is designed to be flexible and extensible. Some of the tunables are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chaos parameters, such as the chaos interval and frequency&lt;/li&gt;
&lt;li&gt;The chaos library itself. Sometimes different chaos libraries can perform a particular chaos action, and you can choose the library you prefer. For example, Litmus supports native Litmus, PowerfulSeal, and Chaos Toolkit for doing a pod-delete. This model lets you adopt the Litmus framework while continuing to use chaos tooling you already have.&lt;/li&gt;
&lt;/ul&gt;
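&lt;p&gt;Such tunables are typically expressed as environment variables on the experiment entry inside the ChaosEngine spec. A minimal sketch (the values and the library choice are illustrative; check the experiment’s ChaosHub page for its actual variables):&lt;/p&gt;

```yaml
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION   # how long chaos runs, in seconds
            value: "30"
          - name: CHAOS_INTERVAL         # gap between successive pod deletions
            value: "10"
          - name: LIB                    # which chaos library executes the fault
            value: "litmus"              # e.g. litmus, powerfulseal
```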

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv6dl6rkgbtr31vbdfnhp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv6dl6rkgbtr31vbdfnhp.png" alt="Details of Litmus chaos experiment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Litmus experiment also consists of steady-state checks and post-chaos checks, which can be adjusted to your requirements when implementing chaos.&lt;/p&gt;

&lt;h3&gt;
  
  
  LitmusChaos experiments
&lt;/h3&gt;

&lt;p&gt;Chaos experiments are the core part of Litmus. We expect that these experiments are going to be continuously tuned and new ones added through ChaosHub. &lt;/p&gt;

&lt;p&gt;There is a group of experiments categorized as generic. These cover chaos for generic Kubernetes resources and physical components. &lt;br&gt;
The other group is application-specific chaos experiments, which cover chaos specific to an application’s logic. We encourage cloud-native application maintainers and developers to share their failure-path tests in ChaosHub so that their users can run the same experiments in production or staging through Litmus.&lt;/p&gt;

&lt;p&gt;ChaosHub currently has around 32 experiments. Some of the important ones are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Experiment Name&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Node graceful loss/maintenance (drain, eviction via taints)&lt;/td&gt;
&lt;td&gt;K8s Nodes forced into NotReady state via graceful eviction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node resource exhaustion (CPU, Memory Hog)&lt;/td&gt;
&lt;td&gt;Stresses the CPU &amp;amp; Memory of the K8s Node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node ungraceful loss (kubelet, docker service kill)&lt;/td&gt;
&lt;td&gt;K8s Nodes forced into NotReady state via ungraceful eviction due to loss of services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk Loss (EBS, GPD)&lt;/td&gt;
&lt;td&gt;Detaches EBS/GPD PVs or Backing stores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS Service Disruption&lt;/td&gt;
&lt;td&gt;DNS pods killed on the cluster, or the DNS service deleted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod Kill&lt;/td&gt;
&lt;td&gt;Random deletion of the pod(s) belonging to an application&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container Kill&lt;/td&gt;
&lt;td&gt;SIGKILL of an application pod’s container&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod Network Faults (Latency, Loss, Corruption, Duplication)&lt;/td&gt;
&lt;td&gt;Network packet faults resulting in lossy access to microservices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod Resource faults (CPU, Memory hog)&lt;/td&gt;
&lt;td&gt;Simulates resource utilization spikes in application pods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk fill (Ephemeral, Persistent)&lt;/td&gt;
&lt;td&gt;Fills up disk space of ephemeral or persistent storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenEBS&lt;/td&gt;
&lt;td&gt;Chaos on control and data plane components of OpenEBS, a containerized storage solution for K8s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Creating your own experiments
&lt;/h3&gt;

&lt;p&gt;ChaosHub provides ready-to-use experiments that can be tuned to your needs. These experiments cover your initial chaos engineering needs and help you get started with the practice. Soon you will have to develop LitmusChaos experiments specific to your application, and Litmus provides an easy-to-use SDK for that, available in Go, Python, and Ansible. Using this SDK, you can create the skeleton of a new experiment in a few steps and start adding your chaos logic. Adding your chaos logic and pre- and post-experiment checks makes your experiment complete and ready to be used on the Litmus infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring chaos
&lt;/h3&gt;

&lt;p&gt;The Litmus portal, which is under development, is adding many charts to help monitor chaos experiments and workflows and interpret their results. You can currently use the chaos metrics exported to Prometheus to plot chaos events right onto your existing application monitoring graphs. The diagram below shows an example of chaos injected into the microservices demo application “sock-shop”; the red areas are chaos injections.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsb4hotcba80pn5jk0g5h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsb4hotcba80pn5jk0g5h.png" alt="Chaos Monitoring with Litmus"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How to get started?
&lt;/h3&gt;

&lt;p&gt;You can create your first chaos in three simple steps. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Litmus through helm&lt;/li&gt;
&lt;li&gt;Choose your application and your chaos experiment from ChaosHub (for example a pod-delete experiment)&lt;/li&gt;
&lt;li&gt;Create a ChaosEngine manifest and run it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Read this blog for a quick-start experience with Litmus on a demo application: &lt;a href="https://dev.to/uditgaurav/get-started-with-litmuschaos-in-minutes-4ke1"&gt;https://dev.to/uditgaurav/get-started-with-litmuschaos-in-minutes-4ke1&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Roadmap
&lt;/h3&gt;

&lt;p&gt;The Litmus project roadmap is summarized at &lt;a href="https://github.com/litmuschaos/litmus/blob/master/ROADMAP.md" rel="noopener noreferrer"&gt;https://github.com/litmuschaos/litmus/blob/master/ROADMAP.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Monthly community meetings gather feedback from the community, which is used to prioritize the roadmap for the next month or quarter. Chaos workflow management and monitoring are among the features currently under active development. The long-term roadmap of Litmus is to add application-specific experiments to the hub to cover the entire CNCF landscape.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contributing to LitmusChaos
&lt;/h3&gt;

&lt;p&gt;The contributing guidelines are here: &lt;a href="https://github.com/litmuschaos/litmus/blob/master/CONTRIBUTING.md" rel="noopener noreferrer"&gt;https://github.com/litmuschaos/litmus/blob/master/CONTRIBUTING.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Litmus Portal is under active development, and so are many new chaos experiments. Visit our roadmap or the open issues to see if anything matches your interest. If not, no problem: join the #litmus channel on Kubernetes Slack and just say hello.&lt;/p&gt;

&lt;h3&gt;
  
  
  Community
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Monthly sync-up meetings happen on the third Wednesday of every month. Join to speak with the maintainers online, discuss how you can contribute, or seek prioritization of an issue or feature request. &lt;/li&gt;
&lt;li&gt;The project encourages open communication and governance. We have created Special Interest Groups (SIGs) so contributors can participate in the areas that interest them. See &lt;a href="https://github.com/orgs/litmuschaos/teams?query=SIG" rel="noopener noreferrer"&gt;https://github.com/orgs/litmuschaos/teams?query=SIG&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Community interactions happen in the #litmus channel on Kubernetes Slack. Join at &lt;a href="https://slack.litmuschaos.io" rel="noopener noreferrer"&gt;https://slack.litmuschaos.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Contributor interactions happen in the #litmus-dev channel on Kubernetes Slack. Join at &lt;a href="https://slack.litmuschaos.io" rel="noopener noreferrer"&gt;https://slack.litmuschaos.io&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Kubernetes SREs can achieve resilience improvements gradually by adopting the &lt;code&gt;Chaos-First&lt;/code&gt; principle and the cloud-native chaos engineering principles. Litmus is a framework and toolset that gets you started and carries this mission all the way through. &lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>sre</category>
      <category>chaosengineering</category>
      <category>litmuschaos</category>
    </item>
    <item>
      <title>How is Kubernetes used? Own Namespaces?</title>
      <dc:creator>Uma Mukkara</dc:creator>
      <pubDate>Sun, 19 Jul 2020 04:11:52 +0000</pubDate>
      <link>https://forem.com/umamukkara/how-kubernetes-is-used-own-namespaces-3jkl</link>
      <guid>https://forem.com/umamukkara/how-kubernetes-is-used-own-namespaces-3jkl</guid>
      <description>&lt;p&gt;Hey Kubernetes practitioners, one question that keeps coming up is how large teams use Kubernetes. Do you use namespaces as an ownership boundary in your teams? Do you share a Kubernetes cluster within your team by configuring hierarchical ownership policies? Can you share your experience?&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this question?
&lt;/h2&gt;

&lt;p&gt;The Litmus team is considering support for chaos at the namespace level, so that you can run the complete chaos infrastructure within a single namespace. &lt;/p&gt;

&lt;h2&gt;
  
  
  A possible scenario
&lt;/h2&gt;

&lt;p&gt;A Kubernetes cluster is set up on one of the managed cloud services such as EKS, GKE, AKS, or DOKS, so the team does not manage the cluster itself. The team has a set of SREs or admins with cluster-wide access through service accounts, who help administer the applications and the cluster. When a developer needs a Kubernetes environment, a new namespace is created, with a service account configured to give that developer access. The developer has enough levers within the namespace and gets the Kubernetes environment required for development.&lt;/p&gt;
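&lt;p&gt;A minimal sketch of how such a per-developer namespace might be wired up with standard Kubernetes RBAC (the namespace and account names are hypothetical; the built-in &lt;code&gt;edit&lt;/code&gt; ClusterRole is one common way to grant “enough levers” inside a single namespace):&lt;/p&gt;

```shell
# Namespace, a service account for the developer, and a RoleBinding that
# grants the built-in "edit" ClusterRole only within that namespace.
cat > dev-namespace.yaml <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: dev-alice
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: alice
  namespace: dev-alice
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-edit
  namespace: dev-alice
subjects:
  - kind: ServiceAccount
    name: alice
    namespace: dev-alice
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
EOF
echo "apply with: kubectl apply -f dev-namespace.yaml"
```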

&lt;p&gt;Are there scenarios in which you felt limited? Is this a common practice? Or are developers better off with their own small clusters?&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>kubernetes</category>
      <category>question</category>
      <category>litmuschaos</category>
    </item>
    <item>
      <title>LitmusChaos in CNCF Sandbox</title>
      <dc:creator>Uma Mukkara</dc:creator>
      <pubDate>Thu, 02 Jul 2020 15:06:45 +0000</pubDate>
      <link>https://forem.com/umamukkara/litmuschaos-in-cncf-sandbox-3j57</link>
      <guid>https://forem.com/umamukkara/litmuschaos-in-cncf-sandbox-3j57</guid>
      <description>&lt;p&gt;LitmusChaos was accepted as a CNCF sandbox project last week. The maintainers, the community, and the team are thrilled to join CNCF's larger ecosystem. What does this really mean for the Litmus project, its community, and its users? Cloud-native chaos engineering gets a boost toward broader community involvement. I have covered the journey of LitmusChaos and its future roadmap in the announcement blog &lt;a href="https://blog.mayadata.io/litmuschaos-is-now-a-cncf-sandbox-project"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;h1&gt;
  
  
  The project
&lt;/h1&gt;

&lt;p&gt;Well, the &lt;a href="https://github.com/litmuschaos/litmus"&gt;project&lt;/a&gt; is now under vendor-neutral governance, which is the point of joining CNCF. This allows large companies and end users to invest resources in the development of the project and build a larger community of users, so the project grows, perhaps faster. Of course, MayaData will continue to sponsor the project, but the maintainers actively seek more contributions from the community. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bIc79CEt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/a9udhdruzwpvss7jwaua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bIc79CEt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/a9udhdruzwpvss7jwaua.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Community
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://github.com/litmuschaos/litmus"&gt;LitmusChaos&lt;/a&gt; has two primary components: the engine that orchestrates declarative chaos, and ChaosHub. The project seeks contributions to both; however, experiments contributed to ChaosHub are crucial to growing the project's adoption. Adoption and contribution of new chaos experiments are cyclic in nature and reinforce each other.&lt;/p&gt;

&lt;h1&gt;
  
  
  Kubernetes SREs can help
&lt;/h1&gt;

&lt;p&gt;The primary persona for LitmusChaos is the SRE. Cloud-native SREs are moving toward the “chaos first” principle in their DevOps strategy, where plans for chaos engineering are devised before deployment and operations. With this shift, adopting chaos engineering becomes a question of “when,” not “if.” LitmusChaos helps SREs run a litmus test on the resilience of their Kubernetes implementation and of the applications running on it. We assume that both Kubernetes and the microservices/applications on it undergo rigorous testing before reaching the hands of SREs. However, each implementation differs in resource configuration, scale, load, and the combination of applications running on Kubernetes. This makes it necessary to litmus-test the resilience of the platform and applications periodically; in other words, to adopt the chaos-first principle.&lt;/p&gt;

&lt;p&gt;The Litmus project helps SREs jump-start their chaos journey by providing the required operator and experiments. There are already numerous Kubernetes resource-specific experiments, along with a few application-specific ones. You can introduce your first chaos experiment in minutes: it takes only a few minutes to go from installing Litmus to injecting a fault.&lt;/p&gt;

&lt;p&gt;SREs can use Litmus and its experiments, and develop new experiments seamlessly; an SDK is available in Ansible, Python, and Go. If a newly developed experiment is generic and useful to the community, they can contribute it back to ChaosHub. &lt;/p&gt;

&lt;h1&gt;
  
  
  What can you contribute other than chaos experiments?
&lt;/h1&gt;

&lt;p&gt;The project is under active development. The roadmap is kept up to date at &lt;a href="https://github.com/litmuschaos/litmus/blob/master/ROADMAP.md#in-progress-near-term"&gt;https://github.com/litmuschaos/litmus/blob/master/ROADMAP.md#in-progress-near-term&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Some of the experiments are being extended to the Python and Go SDKs. You can put your Python or Go skills to work and earn some open-source karma. &lt;/p&gt;

&lt;p&gt;I am also excited to say that the &lt;a href="https://github.com/litmuschaos/litmus/wiki/portal-design-spec"&gt;litmus portal&lt;/a&gt; is being developed, and it welcomes contributions from frontend developers. Are you interested in contributing with your ReactJS, GraphQL, database, GoLang, or Cypress skills? You are welcome: join Slack to talk to the team, or pick up an issue directly on GitHub. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wATLrSh4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/06dyagv6fhj6x2cu61r9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wATLrSh4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/06dyagv6fhj6x2cu61r9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;We are thrilled to have joined CNCF as a sandbox project and look forward to taking Litmus toward greater development and adoption. &lt;/p&gt;

&lt;p&gt;Join the Litmus community on &lt;a href="https://kubernetes.slack.com/messages/CNXNB0ZTN"&gt;Slack&lt;/a&gt; to ask any questions about getting started.&lt;/p&gt;

&lt;p&gt;Follow us on &lt;a href="https://github.com/litmuschaos/litmus"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>litmuschaos</category>
      <category>chaosengineering</category>
      <category>sre</category>
    </item>
    <item>
      <title>Chaos Engineering for cloud-native systems</title>
      <dc:creator>Uma Mukkara</dc:creator>
      <pubDate>Sun, 14 Jun 2020 08:32:37 +0000</pubDate>
      <link>https://forem.com/umamukkara/chaos-engineering-for-cloud-native-systems-2fjn</link>
      <guid>https://forem.com/umamukkara/chaos-engineering-for-cloud-native-systems-2fjn</guid>
      <description>&lt;p&gt;With great enthusiasm and excitement, I am writing my first post on the dev.to platform. My posts will mostly cover cloud-native data management and cloud-native Chaos Engineering: sometimes on existing technology and sometimes on future thoughts. I started my Kubernetes journey four years ago, when I pivoted from closed source to open source as the platform on which to build future technologies. Having co-created two open-source projects, OpenEBS and LitmusChaos, I can now say that the choice was the right one, and a brilliant one.&lt;/p&gt;

&lt;p&gt;Open source is the vehicle and platform for innovation; Kubernetes is the most recent example. We at MayaData started solving data management issues for Kubernetes in 2017 and quickly realized there is a common problem to be solved across all Kubernetes applications: how to realize the promise of Kubernetes, namely the agility of DevOps. Developers and SREs have understood the advantages of Kubernetes and are moving to microservices at an unprecedented rate. They need a framework or toolset that helps them make the transition to microservices quickly while keeping the resilience of the final deployment at acceptable levels, and the choice of framework needs to be architecturally correct. &lt;/p&gt;

&lt;p&gt;What's the answer?&lt;br&gt;
The answer is - adopt "&lt;em&gt;Chaos Engineering&lt;/em&gt;" from the beginning and as a first principle.&lt;/p&gt;




&lt;h2&gt;
  
  
  “Chaos First” principle
&lt;/h2&gt;

&lt;p&gt;There is so much written about Chaos Engineering these days, but one thing that stood out to me is the following from &lt;a href="https://dev.to/adrianco"&gt;Adrian Cockcroft&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DJmgRthD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/67i5h3sg0u1ojw8aeb4h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DJmgRthD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/67i5h3sg0u1ojw8aeb4h.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“Chaos first” is a great strategy for cloud-native teams in order to achieve resilience when your deployment scales up.&lt;/p&gt;




&lt;p&gt;Chaos Engineering has been the answer to achieving greater reliability in production systems. The cloud-native ecosystem has to adopt this discipline all the more because of the dynamism within its components and their sheer number. Most importantly, the components are loosely coupled, which is a requirement of microservices architecture. I recently wrote an &lt;a href="https://t.co/7sFJvvGh1A?amp=1"&gt;article on the CNCF blog&lt;/a&gt; about these principles and want to restate them here in my first post.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is Cloud-Native Chaos Engineering?
&lt;/h1&gt;

&lt;p&gt;It is Chaos Engineering done the cloud-native way, or the Kubernetes way. Here are the four principles that define how cloud-native your Chaos Engineering framework or practice is.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2DKRnL2P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ibyk5hxztgljr43oql3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2DKRnL2P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ibyk5hxztgljr43oql3t.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  4 Principles of a Cloud Native Chaos Engineering Framework
&lt;/h1&gt;

&lt;p&gt;These principles are for guidance; they help you choose a strategy for the Chaos Engineering stack or framework on your Kubernetes platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Open source
&lt;/h3&gt;

&lt;p&gt;The framework has to be completely open source, under the Apache 2.0 license, to encourage broader community participation and inspection. The number of applications moving to the Kubernetes platform is limitless. At such a scale, only an open chaos model will thrive and gain the required adoption.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. CRDs for Chaos Management
&lt;/h3&gt;

&lt;p&gt;The framework should have clearly defined CRDs for orchestrating chaos on Kubernetes. These CRDs act as standard APIs to provision and manage chaos in complex production environments, and they are the building blocks for chaos workflow orchestration.&lt;/p&gt;
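&lt;p&gt;As an illustration, a trimmed-down ChaosExperiment CR might look like the following. The field names follow the Litmus &lt;code&gt;v1alpha1&lt;/code&gt; API, but treat this as a sketch and consult the current docs for the authoritative schema; the image, args, and env values are placeholders.&lt;/p&gt;

```shell
# A minimal ChaosExperiment CR: the "definition" tells the chaos operator
# which container image to run and with what parameters.
cat > pod-delete-experiment.yaml <<'EOF'
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
spec:
  definition:
    scope: Namespaced
    image: litmuschaos/go-runner:latest
    command:
      - /bin/bash
    args:
      - -c
      - ./experiments -name pod-delete
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "15"
      - name: CHAOS_INTERVAL
        value: "5"
EOF
echo "wrote pod-delete-experiment.yaml"
```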

&lt;h3&gt;
  
  
  3. Extensible and pluggable
&lt;/h3&gt;

&lt;p&gt;One lesson from the success of cloud-native approaches is that their components can be swapped out relatively easily and new ones introduced as needed. Any standard chaos library or functionality developed by other open-source developers should be able to be integrated into, and orchestrated by, this pluggable framework.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Broad Community adoption
&lt;/h3&gt;

&lt;p&gt;Once we have the APIs, the Operator, and the plugin framework, we have all the ingredients needed for a common way of injecting chaos. The chaos is run against well-known infrastructure such as Kubernetes, applications such as databases, or other infrastructure components such as storage or networking. These chaos experiments can be reused, and a broad-based community helps identify and contribute other high-value scenarios. Hence, a Chaos Engineering framework should provide a central hub or forge where open-source chaos experiments are shared and collaboration via code is enabled.&lt;/p&gt;

&lt;p&gt;An example of a cloud-native Chaos Engineering framework is &lt;a href="https://github.com/litmuschaos/litmus"&gt;LitmusChaos&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Brief introduction to LitmusChaos
&lt;/h1&gt;

&lt;p&gt;LitmusChaos is a cloud-native Chaos Engineering framework for Kubernetes, and it is unique in fulfilling all four of the above principles. The Litmus community publishes its chaos experiments at &lt;a href="https://hub.litmuschaos.io"&gt;hub.litmuschaos.io&lt;/a&gt;. The hub contains chaos experiments for Kubernetes resources as well as for specific applications. Developers can bring their own chaos experiments as a Docker container image and easily use the Litmus framework to orchestrate and monitor chaos.&lt;/p&gt;

&lt;p&gt;I will write another detailed blog to introduce Litmus here to the Dev community.&lt;/p&gt;


&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Doing Chaos Engineering on Kubernetes is an important first step toward highly resilient deployments. A “Chaos First” strategy helps set the mindset of both developers and SREs. To practice Chaos Engineering in cloud-native environments, you need not write experiments from scratch; instead, choose an open-source framework with a well-defined API, plenty of well-tested experiments, and a good community around it. Cloud-native Chaos Engineering fits naturally into Kubernetes' scheme of things, and SREs often find it plug-and-play.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you? Join our community on &lt;a href="https://kubernetes.slack.com/messages/CNXNB0ZTN"&gt;Slack&lt;/a&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Contribute to LitmusChaos and share your feedback on &lt;a href="https://github.com/litmuschaos/litmus"&gt;GitHub&lt;/a&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  If you like LitmusChaos, become one of the many stargazers &lt;a href="https://github.com/litmuschaos/litmus/stargazers"&gt;here&lt;/a&gt;
&lt;/h3&gt;

</description>
      <category>chaosengineering</category>
      <category>litmuschaos</category>
      <category>kubernetes</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
