<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Uma Mukkara</title>
    <description>The latest articles on Forem by Uma Mukkara (@umamukkara).</description>
    <link>https://forem.com/umamukkara</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F395404%2F38747198-924e-4a89-8f20-4473c7beeedb.jpeg</url>
      <title>Forem: Uma Mukkara</title>
      <link>https://forem.com/umamukkara</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/umamukkara"/>
    <language>en</language>
    <item>
      <title>Cloud native chaos engineering principles - Version 2</title>
      <dc:creator>Uma Mukkara</dc:creator>
      <pubDate>Mon, 25 Oct 2021 14:26:38 +0000</pubDate>
      <link>https://forem.com/umamukkara/cloud-native-chaos-engineering-principles-version-2-4kl9</link>
      <guid>https://forem.com/umamukkara/cloud-native-chaos-engineering-principles-version-2-4kl9</guid>
<description>&lt;p&gt;When we started with the &lt;a href="https://litmuschaos.io"&gt;Litmus project&lt;/a&gt;, we defined a subcategory of Chaos Engineering called &lt;strong&gt;Cloud Native Chaos Engineering&lt;/strong&gt; and set some architectural goals for building a generic stack around it. They are published &lt;a href="https://www.cncf.io/blog/2019/11/06/cloud-native-chaos-engineering-enhancing-kubernetes-application-resiliency/"&gt;here&lt;/a&gt;. As we spent more time with the Litmus community, and as new technologies such as GitOps evolved around cloud native, we updated the core chaos engineering principles around which Litmus has grown into a fully featured platform for practicing end-to-end chaos engineering on cloud native services and applications.&lt;/p&gt;

&lt;p&gt;The first version started with four principles: open source, chaos APIs/CRDs, plugins, and community chaos. We realised that the plugins principle is really about integration with other DevOps tools, which a good API can provide. As the chaos community evolved, we observed two additional patterns: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Observability and chaos engineering are closely related.&lt;/li&gt;
&lt;li&gt;Scaling and automating chaos engineering is an important aspect.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With these observations, we defined the following five cloud native chaos engineering principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open source
&lt;/li&gt;
&lt;li&gt;Community collaborated experiments&lt;/li&gt;
&lt;li&gt;Open API and chaos life cycle management&lt;/li&gt;
&lt;li&gt;Scaling and automating through GitOps&lt;/li&gt;
&lt;li&gt;Observability through generic chaos metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Open Source
&lt;/h2&gt;

&lt;p&gt;Cloud native communities and technologies have always revolved around open source. Chaos engineering frameworks that are open source benefit from the strong communities that build around them, which help make them more comprehensive, rugged, and feature-rich.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chaos Experiments as Building Blocks
&lt;/h2&gt;

&lt;p&gt;Chaos experiments need to be simple to use, highly flexible, and tunable. They have to be rugged, with little or no chance of producing false negatives or false positives. Chaos experiments are like Lego blocks: you can use them to build meaningful chaos workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Manageable Chaos Experiments and API
&lt;/h2&gt;

&lt;p&gt;Chaos engineering has to employ well-known software engineering practices. Managing chaos scenarios can quickly become complex: more team members get involved, changes happen frequently, and requirements are altered, so upgrading chaos experiments becomes common. The chaos engineering framework should make managing chaos experiments easy and simple, and it should be done the Kubernetes way. Developers and operators should think of chaos experiments as Kubernetes custom resources.&lt;/p&gt;
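&lt;p&gt;As an illustration, a minimal ChaosEngine custom resource tying a target application to a pod-delete experiment might look like the sketch below. The names, labels, and service account are hypothetical; consult the Litmus docs for the authoritative schema:&lt;/p&gt;

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos              # hypothetical engine name
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: app=nginx          # hypothetical target label
    appkind: deployment
  chaosServiceAccount: pod-delete-sa   # hypothetical service account
  experiments:
    - name: pod-delete           # experiment pulled from ChaosHub
```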

&lt;h2&gt;
  
  
  Scale Through GitOps
&lt;/h2&gt;

&lt;p&gt;Start with the low-hanging fruit: obvious and simple issues. As you fix them, chaos scenarios become more comprehensive and larger, and their number also increases. Chaos scenarios need to be automated, or triggered when a change is made to the application or the service. Tools around GitOps may be used to trigger chaos when a configuration change happens to either the application or the chaos experiments.&lt;/p&gt;
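&lt;p&gt;For example, a GitOps tool such as Argo CD can be pointed at a Git repository of chaos manifests, so that committing a change to an experiment (or to the application configuration) syncs and triggers the corresponding chaos run. A minimal sketch, assuming a hypothetical repository and path layout:&lt;/p&gt;

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-workflows                # hypothetical Application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/chaos-manifests   # hypothetical repo
    targetRevision: main
    path: experiments                  # directory of chaos manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: litmus
  syncPolicy:
    automated: {}                      # auto-sync: Git commits apply the change
```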

&lt;h2&gt;
  
  
  Open Observability
&lt;/h2&gt;

&lt;p&gt;Observability and chaos engineering go together when fixing issues related to reliability. Many observability stacks and systems are well developed and in active use. Introducing chaos engineering should not require a new observability system; rather, the chaos engineering context should fit nicely into the existing one. To do this, chaos metrics from the system where chaos is introduced are exported into the existing observability database, and the chaos context is painted onto the existing dashboards. The dashboard below uses red rectangles to depict the chaos periods, making the chaos context very clear to the user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6NTzq7cr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jmn7eek47bczmqlftgmk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6NTzq7cr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jmn7eek47bczmqlftgmk.png" alt="Chaos Engineering observability"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  LitmusChaos 2.0
&lt;/h2&gt;

&lt;p&gt;Litmus recently achieved 2.0 GA status with all the principles of cloud native chaos engineering implemented in it. Litmus makes the practice of chaos engineering easy. Get started with Litmus &lt;a href="https://docs.litmuschaos.io"&gt;here&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>cloudnative</category>
      <category>chaosengineering</category>
      <category>litmuschaos</category>
      <category>devops</category>
    </item>
    <item>
      <title>Introduction to LitmusChaos</title>
      <dc:creator>Uma Mukkara</dc:creator>
      <pubDate>Fri, 31 Jul 2020 12:20:22 +0000</pubDate>
      <link>https://forem.com/umamukkara/introduction-to-litmuschaos-4ibl</link>
      <guid>https://forem.com/umamukkara/introduction-to-litmuschaos-4ibl</guid>
<description>&lt;p&gt;&lt;em&gt;LitmusChaos&lt;/em&gt; is a CNCF sandbox project. Its mission is to help &lt;em&gt;Kubernetes SREs&lt;/em&gt; and developers find weaknesses in the Kubernetes platform and in applications running on Kubernetes by providing a complete &lt;em&gt;Chaos Engineering&lt;/em&gt; framework and associated chaos experiments. In this article, we will discuss:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why is resilience important for Kubernetes?&lt;/li&gt;
&lt;li&gt;How to achieve it using Litmus?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  State of Kubernetes
&lt;/h2&gt;

&lt;p&gt;Thanks to Kubernetes and the ecosystem built by CNCF, a common API for microservices orchestration has become a reality for developers. Kubernetes is believed to have crossed the chasm of the technology adoption cycle. As adoption continues to increase, we need more tools to ensure that adopting Kubernetes is seamless and stable. One area that Kubernetes developers and SREs need to focus on is resilience. As an SRE, how do I make sure that my application is resilient against possible failures? What process lets me tackle resilience? These are the questions we try to answer in this article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1fux19ningw5sz8ormcc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1fux19ningw5sz8ormcc.png" alt="Kubernetes crossing the chasm"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes needs tools and infrastructure to validate the resilience of the platform and applications running on the platform.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Definition of Resilience
&lt;/h3&gt;

&lt;p&gt;Resilience is the system’s ability to stay afloat when a fault occurs. Staying afloat means different things to different people under different circumstances. Whoever is looking at resilience will usually have a steady-state hypothesis for their system. If that steady state is regained after a fault occurs, then the system is said to be resilient against that fault. Many types of faults can occur, and they can happen in any sequence. Though you would want your system to be resilient against every fault, individually or in combination, all the time, it is practical to accept that this is far-fetched: no system is 100% resilient. &lt;/p&gt;
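&lt;p&gt;The steady-state hypothesis described above can be sketched in a few lines of code. This is a toy illustration, not part of Litmus; the probe, thresholds, and fault-injection hook are all hypothetical:&lt;/p&gt;

```python
def is_steady(latency_ms: float, error_rate: float) -> bool:
    """A sample steady-state hypothesis: p99 latency under 300 ms
    and error rate under 1% (thresholds are illustrative)."""
    return latency_ms < 300 and error_rate < 0.01

def verify_resilience(probe, inject_fault) -> bool:
    """One chaos check: confirm the steady state, inject a fault,
    then test whether the steady state is regained."""
    if not is_steady(*probe()):
        raise RuntimeError("system was not steady before chaos")
    inject_fault()
    return is_steady(*probe())  # resilient iff the hypothesis holds again

# Simulated system that stays healthy after the fault:
readings = iter([(120.0, 0.001), (150.0, 0.004)])
print(verify_resilience(lambda: next(readings), lambda: None))  # True
```

&lt;p&gt;A real run would replace the probe with live service measurements and the fault hook with an actual chaos injection.&lt;/p&gt;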

&lt;p&gt;In the context of Kubernetes, resilience is even more important because Kubernetes architecture works on the principle of reconciling to the desired state. The state of a resource inside Kubernetes can be changed by Kubernetes itself or by external means. &lt;/p&gt;

&lt;p&gt;Some examples of common faults on Kubernetes are shown below. Pod evictions and nodes going into the “not-ready” state are not uncommon in Kubernetes environments, but the steady-state hypothesis varies widely depending on where you look inside Kubernetes. If a pod is evicted and is quickly rescheduled on another node, then Kubernetes is resilient. Still, if the service that depends on this pod goes down or becomes slow, then that service is not resilient against the pod-eviction fault. In summary, resilience is context-sensitive, and improving it needs to be treated as a practice rather than a specific set of tasks. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh1yivkg6blrut81tfrvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh1yivkg6blrut81tfrvm.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Importance of Resilience
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6ykia9aje8544h6dj6yg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6ykia9aje8544h6dj6yg.png" alt="Kubernetes resilience stack"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above diagram shows the dependency stack of resilience for your application. At the bottom of the dependency stack is the physical infrastructure. Kubernetes runs on a variety of infrastructure, ranging from virtual machines to bare metal and combinations of them. The platform’s physical nature is a source of faults for the application that runs inside containers, shown at the tip of the diagram. The next layer of dependency is Kubernetes itself. Gone are the days when platform software such as Linux changed once a year; expect Kubernetes upgrades every quarter, at least for the next few years. Each upgrade has to be tested carefully to ensure that the upgraded software solves the expected problem without introducing any breaking scenarios. On top of Kubernetes, you have other services such as CoreDNS, Envoy, Istio, Prometheus, databases, and middleware, which are necessary for the functioning of your cloud-native environment. These cloud-native services also go through frequent upgrades. &lt;/p&gt;

&lt;p&gt;If you look at the above facts, your application resilience really depends more on the underlying stack than your application itself. It is possible that once your application is stabilized, the resilience of your service that runs on Kubernetes depends on other components and infrastructure more than 90% of the time. &lt;/p&gt;

&lt;p&gt;Thus it is important to verify your application resilience whenever a change has happened in the underlying stack. “Keep verifying” is the key. Robust testing before upgrades is not good enough, mainly because you cannot possibly consider all sorts of faults during upgrade testing. This introduces the concept of chaos engineering. The process of “continuously verifying if your service is resilient against faults” is called chaos engineering. For the reasons stated above, overall stack resilience has to be achieved, and chaos engineering against the entire stack must be practiced.&lt;/p&gt;

&lt;h3&gt;
  
  
  Achieving Resilience on Kubernetes
&lt;/h3&gt;

&lt;p&gt;Achieving resilience requires the practice of chaos engineering. SRE teams are prioritizing chaos engineering early in their move to Kubernetes. This is also known as the “Chaos First” principle, per Adrian Cockcroft.&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;Chaos first is a really important principle. It’s too hard to add resilience to a finished project. Needs to be a gate for deployment in the first place. &lt;a href="https://t.co/xi98y8JZpJ" rel="noopener noreferrer"&gt;https://t.co/xi98y8JZpJ&lt;/a&gt;&lt;/p&gt;— adrian cockcroft (&lt;a class="mentioned-user" href="https://dev.to/adrianco"&gt;@adrianco&lt;/a&gt;) &lt;a href="https://twitter.com/adrianco/status/1201703004014907392?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;December 3, 2019&lt;/a&gt;
&lt;/blockquote&gt; 

&lt;p&gt;Chaos engineering is more a practice and a discipline than a few tasks or steps, but the fundamental building block remains the following.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdoks78mwblilityfevj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdoks78mwblilityfevj3.png" alt="Chaos engineering principle"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this practice is to be adopted as a natural choice by SREs on Kubernetes, it must be done in a cloud-native way. The basic principle of being cloud native is being declarative. Doing chaos engineering openly, declaratively, and with the community in mind are some of the principles of “Cloud-Native Chaos Engineering.” I covered this topic in detail in a previous post &lt;a href="https://www.cncf.io/blog/2019/11/06/cloud-native-chaos-engineering-enhancing-kubernetes-application-resiliency/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fi9twbe4ewyolnty4ecrj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fi9twbe4ewyolnty4ecrj.png" alt="Cloud native chaos engineering "&gt;&lt;/a&gt;&lt;br&gt;
In summary, in cloud-native chaos engineering you have a set of chaos CRDs, and chaos experiments are available as custom resources. Chaos is managed using well-known declarative APIs.&lt;/p&gt;

&lt;h1&gt;
  
  
  Litmus Project introduction
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F57e5dyq5dm3kmo3o54or.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F57e5dyq5dm3kmo3o54or.png" alt="Chaos Engineering Litmus"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Litmus is a chaos engineering framework for Kubernetes. It provides a complete set of tools required by Kubernetes developers and SREs to carry out chaos tests easily and in a Kubernetes-native way. The project has “declarative chaos” as its fundamental design goal and keeps the community at the center of growing the chaos experiments. &lt;/p&gt;

&lt;p&gt;Litmus has the following components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chaos Operator&lt;/strong&gt;&lt;br&gt;
This operator is built using the Operator SDK framework and manages the lifecycle of a chaos experiment. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chaos CRDs&lt;/strong&gt;&lt;br&gt;
Primarily, there are three chaos CRDs: ChaosEngine, ChaosExperiment, and ChaosResult. Chaos is built, run, and managed using these CRDs. The ChaosEngine CRD ties the target application to ChaosExperiment CRs. When the operator runs an experiment, the result is stored in a ChaosResult CR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chaos experiments or the ChaosHub&lt;/strong&gt;&lt;br&gt;
Chaos experiments are the custom resources on Kubernetes. The YAML specifications for these custom resources are hosted at the public ChaosHub (&lt;a href="https://hub.litmuschaos.io" rel="noopener noreferrer"&gt;https://hub.litmuschaos.io&lt;/a&gt;). &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chaos Scheduler&lt;/strong&gt;&lt;br&gt;
Chaos scheduler supports the granular scheduling of chaos experiments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chaos metrics exporter&lt;/strong&gt;&lt;br&gt;
This is a Prometheus metrics exporter. Chaos metrics, such as the number and type of experiments and their results, are exported into Prometheus. These metrics are combined with target-application metrics to plot graphs that show the effect of chaos on the application’s service or performance.&lt;/p&gt;
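&lt;p&gt;An exported chaos metric is ultimately just a labeled sample line in the Prometheus text exposition format. The snippet below renders one such line; the metric and label names are illustrative, not the exporter’s exact output:&lt;/p&gt;

```python
def render_metric(name: str, labels: dict, value: float) -> str:
    """Render a single sample in the Prometheus text exposition format:
    name{label="value",...} sample_value"""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = render_metric(
    "litmuschaos_passed_experiments",          # illustrative metric name
    {"chaosresult_name": "nginx-pod-delete"},  # illustrative label
    1,
)
print(line)  # litmuschaos_passed_experiments{chaosresult_name="nginx-pod-delete"} 1
```

&lt;p&gt;Prometheus scrapes lines in this shape from the exporter, after which they can be graphed alongside application metrics.&lt;/p&gt;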

&lt;p&gt;&lt;strong&gt;Chaos events exporter&lt;/strong&gt;&lt;br&gt;
Litmus generates a chaos event for every chaos action that it takes. These chaos events are stored in etcd and later exported to an event receiver for correlating or debugging a service affected by chaos injection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Portal&lt;/strong&gt;&lt;br&gt;
Litmus Portal is a centralized web portal for creating, scheduling, and monitoring chaos workflows. A chaos workflow is a set of chaos experiments. Chaos workflows can be scheduled on remote Kubernetes clusters from the portal. SRE teams can share the portal while managing chaos through the portal. &lt;br&gt;
Note: Litmus Portal is currently under development.&lt;/p&gt;

&lt;h3&gt;
  
  
  Litmus use cases
&lt;/h3&gt;

&lt;p&gt;Chaos tests can be done anywhere in the DevOps cycle. The extent of chaos tests varies from CI pipelines to production. In development pipelines, you might use chaos tests specific to the applications being developed. As you move towards operations or production, you expect many failure scenarios against which you want to be resilient, so the number of chaos tests grows significantly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnvqjv92tuxe4b9bc158y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnvqjv92tuxe4b9bc158y.png" alt="Litmus use cases"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Typical use cases of Litmus include failure or chaos testing in CI pipelines, increased chaos testing in staging and production environments, Kubernetes upgrade certification, post-upgrade validation of services, and resilience benchmarking.&lt;/p&gt;

&lt;p&gt;We keep hearing from SREs that they typically see a lot of resistance to introducing chaos from both developers and management. In the practice of chaos engineering, starting with small chaos tests and showing the benefits to developers and management builds the credibility required initially. With time, the number of tests, and the resilience they bring, will also increase. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyfitfl9pvduodf168tcc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyfitfl9pvduodf168tcc.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Chaos engineering is a practice. As seen above, management buy-in and SRE confidence increase with time, and the chaos tests move into production. This process improves resilience metrics as well. &lt;/p&gt;

&lt;h3&gt;
  
  
  Litmus architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9ssfhzn099jzkh32fq86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9ssfhzn099jzkh32fq86.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Litmus architecture treats declarative chaos and scalability of chaos as two important design goals. The Chaos Operator watches for changes to ChaosEngine CRs and spins off a chaos experiment job for each experiment on a target. Multiple chaos jobs can run in parallel. The results are conveniently stored as CRs, so you don’t need to track an experiment for its result; results are always available in the Kubernetes etcd database. The chaos metrics scraper is also a deployment; it scrapes chaos metrics from the ChaosEngine and ChaosResult CRs in etcd. The above diagram shows Litmus chaos execution in one cluster. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fb1c6v2c9a3wz8x5p4zr0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fb1c6v2c9a3wz8x5p4zr0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For scalability and deeper chaos, a set of chaos experiments is put together into a workflow and executed through Argo Workflows. Chaos workflows are constructed and managed in the Litmus portal and run on the target Kubernetes cluster. The portal also includes intuitive chaos analytics and provides an easy experience for teams to develop new chaos experiments through a private ChaosHub called MyHub. &lt;/p&gt;

&lt;h3&gt;
  
  
  Security considerations
&lt;/h3&gt;

&lt;p&gt;Chaos is disruptive by design. One needs to be careful about who can introduce chaos and where. Litmus provides various configurations to control the chaos through policies. Some of them are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Annotations&lt;/strong&gt;: Annotations can be enabled at the application level. When they are enabled, which is the default setting, the target application needs to be annotated with “chaos=true”.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ServiceAccounts&lt;/strong&gt;: RBACs are configurable at each experiment level. Each experiment may require different permissions based on the type of chaos being introduced.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, Litmus can also be run in admin mode, where the chaos experiments themselves run in the Litmus namespace and a service account with admin privileges is attached to Litmus. This mode is recommended only if you are running Litmus for learning purposes or in pure dev environments. It is advised to run Litmus in namespace mode in staging and production environments. &lt;/p&gt;

&lt;h3&gt;
  
  
  Anatomy of a chaos experiment
&lt;/h3&gt;

&lt;p&gt;A chaos experiment in Litmus is designed to be flexible and extensible. Some of the tunables are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chaos parameters, such as the chaos interval and frequency&lt;/li&gt;
&lt;li&gt;The chaos library itself. Sometimes different chaos libraries can perform a particular chaos action, and you can choose the library you prefer. For example, Litmus supports native Litmus, PowerfulSeal, and Chaos Toolkit for doing a pod-delete. This model lets you adopt the Litmus framework while continuing to use chaos tooling you already have.&lt;/li&gt;
&lt;/ul&gt;
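&lt;p&gt;Such tunables are typically expressed as environment variables on the experiment entry inside the ChaosEngine spec. A minimal sketch (the values and the library choice are illustrative; check the experiment’s ChaosHub page for its actual variables):&lt;/p&gt;

```yaml
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION   # how long chaos runs, in seconds
            value: "30"
          - name: CHAOS_INTERVAL         # gap between successive pod deletions
            value: "10"
          - name: LIB                    # which chaos library executes the fault
            value: "litmus"              # e.g. litmus, powerfulseal
```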

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv6dl6rkgbtr31vbdfnhp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv6dl6rkgbtr31vbdfnhp.png" alt="Details of Litmus chaos experiment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Litmus experiment also consists of steady-state checks and post-chaos checks, which can be adjusted to your requirements when implementing chaos.&lt;/p&gt;

&lt;h3&gt;
  
  
  LitmusChaos experiments
&lt;/h3&gt;

&lt;p&gt;Chaos experiments are the core part of Litmus. We expect that these experiments are going to be continuously tuned and new ones added through ChaosHub. &lt;/p&gt;

&lt;p&gt;There is a group of experiments categorized as generic. These cover chaos for generic Kubernetes resources and physical components. &lt;br&gt;
The other group is application-specific chaos experiments, which cover chaos specific to an application’s logic. We encourage cloud-native application maintainers and developers to share their failure-path tests in ChaosHub so that their users can run the same experiments in production or staging through Litmus.&lt;/p&gt;

&lt;p&gt;ChaosHub currently has around 32 experiments. Some of the important ones are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Experiment Name&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Node graceful loss/maintenance (drain, eviction via taints)&lt;/td&gt;
&lt;td&gt;K8s Nodes forced into NotReady state via graceful eviction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node resource exhaustion (CPU, Memory Hog)&lt;/td&gt;
&lt;td&gt;Stresses the CPU &amp;amp; Memory of the K8s Node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node ungraceful loss (kubelet, docker service kill)&lt;/td&gt;
&lt;td&gt;K8s Nodes forced into NotReady state via ungraceful eviction due to loss of services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk Loss (EBS, GPD)&lt;/td&gt;
&lt;td&gt;Detaches EBS/GPD PVs or Backing stores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS Service Disruption&lt;/td&gt;
&lt;td&gt;DNS pods killed on the cluster, or the DNS service deleted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod Kill&lt;/td&gt;
&lt;td&gt;Random deletion of the pod(s) belonging to an application&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container Kill&lt;/td&gt;
&lt;td&gt;SIGKILL of an application pod’s container&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod Network Faults (Latency, Loss, Corruption, Duplication)&lt;/td&gt;
&lt;td&gt;Network packet faults resulting in lossy access to microservices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod Resource faults (CPU, Memory hog)&lt;/td&gt;
&lt;td&gt;Simulates resource utilization spikes in application pods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk fill (Ephemeral, Persistent)&lt;/td&gt;
&lt;td&gt;Fills up disk space of ephemeral or persistent storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenEBS&lt;/td&gt;
&lt;td&gt;Chaos on control and data plane components of OpenEBS, a containerized storage solution for K8s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Creating your own experiments
&lt;/h3&gt;

&lt;p&gt;ChaosHub provides ready-to-use experiments that can be tuned to your needs. These experiments cover your initial chaos engineering needs and help you get started with the practice. Soon you will have to develop LitmusChaos experiments specific to your application, and Litmus provides an easy-to-use SDK for that, available in Go, Python, and Ansible. Using this SDK, you can create the skeleton of a new experiment in a few steps and start adding your chaos logic. Adding your chaos logic and pre- and post-experiment checks makes your experiment complete and ready to be used on the Litmus infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring chaos
&lt;/h3&gt;

&lt;p&gt;The Litmus portal, which is under development, is adding many charts to help monitor chaos experiments and workflows and interpret their results. You can currently use the chaos metrics exported to Prometheus to plot chaos events right onto your existing application monitoring graphs. The diagram below shows an example of chaos injected into the microservices demo application “sock-shop”; the red areas are chaos injections.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsb4hotcba80pn5jk0g5h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsb4hotcba80pn5jk0g5h.png" alt="Chaos Monitoring with Litmus"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How to get started?
&lt;/h3&gt;

&lt;p&gt;You can create your first chaos in three simple steps. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Litmus through helm&lt;/li&gt;
&lt;li&gt;Choose your application and your chaos experiment from ChaosHub (for example a pod-delete experiment)&lt;/li&gt;
&lt;li&gt;Create a ChaosEngine manifest and run it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Read this blog for a quick-start experience with Litmus on a demo application: &lt;a href="https://dev.to/uditgaurav/get-started-with-litmuschaos-in-minutes-4ke1"&gt;https://dev.to/uditgaurav/get-started-with-litmuschaos-in-minutes-4ke1&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Roadmap
&lt;/h3&gt;

&lt;p&gt;The Litmus project roadmap is summarized at &lt;a href="https://github.com/litmuschaos/litmus/blob/master/ROADMAP.md" rel="noopener noreferrer"&gt;https://github.com/litmuschaos/litmus/blob/master/ROADMAP.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Monthly community meetings gather feedback from the community, which is used to prioritize the roadmap for the next month or quarter. Chaos workflow management and monitoring are among the features currently under active development. The long-term roadmap of Litmus is to add application-specific experiments to the hub to cover the entire CNCF landscape.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contributing to LitmusChaos
&lt;/h3&gt;

&lt;p&gt;The contributing guidelines are here: &lt;a href="https://github.com/litmuschaos/litmus/blob/master/CONTRIBUTING.md" rel="noopener noreferrer"&gt;https://github.com/litmuschaos/litmus/blob/master/CONTRIBUTING.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Litmus Portal is under active development, and so are many new chaos experiments. Visit our roadmap or the open issues to see if anything matches your interest. If not, no problem: join the #litmus channel on Kubernetes Slack and just say hello.&lt;/p&gt;

&lt;h3&gt;
  
  
  Community
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Monthly sync-up meetings happen on the third Wednesday of every month. Join to speak with the maintainers online, discuss how you can contribute, or seek prioritization of an issue or feature request. &lt;/li&gt;
&lt;li&gt;The project encourages open communication and governance. We have created Special Interest Groups (SIGs) so contributors can participate in the areas that interest them. See &lt;a href="https://github.com/orgs/litmuschaos/teams?query=SIG" rel="noopener noreferrer"&gt;https://github.com/orgs/litmuschaos/teams?query=SIG&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Community interactions happen in the #litmus channel on Kubernetes Slack. Join at &lt;a href="https://slack.litmuschaos.io" rel="noopener noreferrer"&gt;https://slack.litmuschaos.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Contributor interactions happen in the #litmus-dev channel on Kubernetes Slack. Join at &lt;a href="https://slack.litmuschaos.io" rel="noopener noreferrer"&gt;https://slack.litmuschaos.io&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Kubernetes SREs can achieve resilience improvements gradually by adopting the &lt;code&gt;Chaos-First&lt;/code&gt; principle and the cloud-native chaos engineering principles. Litmus is a framework and toolset that gets you started and carries this mission all the way through. &lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>sre</category>
      <category>chaosengineering</category>
      <category>litmuschaos</category>
    </item>
    <item>
      <title>How is Kubernetes used? Own Namespaces?</title>
      <dc:creator>Uma Mukkara</dc:creator>
      <pubDate>Sun, 19 Jul 2020 04:11:52 +0000</pubDate>
      <link>https://forem.com/umamukkara/how-kubernetes-is-used-own-namespaces-3jkl</link>
      <guid>https://forem.com/umamukkara/how-kubernetes-is-used-own-namespaces-3jkl</guid>
      <description>&lt;p&gt;Hey Kubernetes practitioners, one question that keeps coming up is how large teams use Kubernetes. Do you use namespaces as an ownership boundary in your teams? Do you share a Kubernetes cluster within your team by configuring hierarchical ownership policies? Can you share your experience?&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this question?
&lt;/h2&gt;

&lt;p&gt;The Litmus team is considering support for chaos at the namespace level, so that you can run the complete chaos infrastructure within a single namespace. &lt;/p&gt;

&lt;h2&gt;
  
  
  A possible scenario
&lt;/h2&gt;

&lt;p&gt;A Kubernetes cluster is set up on one of the managed cloud services such as EKS, GKE, AKS, or DOKS, so the team does not manage the cluster itself. The team has a set of SREs or admins with cluster-wide access through service accounts, who help administer the applications and the cluster. When a developer needs a Kubernetes environment, a new namespace is created, with a service account configured to give that developer access. The developer has enough levers within the namespace and gets the Kubernetes environment required for development.&lt;/p&gt;
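&lt;p&gt;A minimal sketch of how such a per-developer namespace might be wired up with standard Kubernetes RBAC (the namespace and account names are hypothetical; the built-in &lt;code&gt;edit&lt;/code&gt; ClusterRole is one common way to grant “enough levers” inside a single namespace):&lt;/p&gt;

```shell
# Namespace, a service account for the developer, and a RoleBinding that
# grants the built-in "edit" ClusterRole only within that namespace.
cat > dev-namespace.yaml <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: dev-alice
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: alice
  namespace: dev-alice
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-edit
  namespace: dev-alice
subjects:
  - kind: ServiceAccount
    name: alice
    namespace: dev-alice
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
EOF
echo "apply with: kubectl apply -f dev-namespace.yaml"
```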

&lt;p&gt;Are there scenarios in which you felt limited? Is this a common practice? Or are developers better off with their own small clusters?&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>kubernetes</category>
      <category>question</category>
      <category>litmuschaos</category>
    </item>
    <item>
      <title>LitmusChaos in CNCF Sandbox</title>
      <dc:creator>Uma Mukkara</dc:creator>
      <pubDate>Thu, 02 Jul 2020 15:06:45 +0000</pubDate>
      <link>https://forem.com/umamukkara/litmuschaos-in-cncf-sandbox-3j57</link>
      <guid>https://forem.com/umamukkara/litmuschaos-in-cncf-sandbox-3j57</guid>
      <description>&lt;p&gt;LitmusChaos was accepted as a CNCF sandbox project last week. The maintainers, the community, and the team are thrilled to join CNCF's larger ecosystem. What does this really mean for the Litmus project, its community, and its users? Cloud-native chaos engineering gets a boost toward broader community involvement. I have covered the journey of LitmusChaos and its future roadmap in the announcement blog &lt;a href="https://blog.mayadata.io/litmuschaos-is-now-a-cncf-sandbox-project"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;h1&gt;
  
  
  The project
&lt;/h1&gt;

&lt;p&gt;Well, the &lt;a href="https://github.com/litmuschaos/litmus"&gt;project&lt;/a&gt; is now under vendor-neutral governance, which is the point of joining CNCF. This allows large companies and end users to invest resources in the development of the project and build a larger community of users, so the project grows, perhaps faster. Of course, MayaData will continue to sponsor the project, but the maintainers actively seek more contributions from the community. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bIc79CEt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/a9udhdruzwpvss7jwaua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bIc79CEt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/a9udhdruzwpvss7jwaua.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Community
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://github.com/litmuschaos/litmus"&gt;LitmusChaos&lt;/a&gt; has two primary components: the engine that orchestrates declarative chaos, and ChaosHub. The project seeks contributions to both; however, experiments contributed to ChaosHub are crucial to growing the project's adoption. Adoption and contribution of new chaos experiments are cyclic in nature and reinforce each other.&lt;/p&gt;

&lt;h1&gt;
  
  
  Kubernetes SREs can help
&lt;/h1&gt;

&lt;p&gt;The primary persona for LitmusChaos is the SRE. Cloud-native SREs are moving toward the “chaos first” principle in their DevOps strategy, where plans for chaos engineering are devised before deployment and operations. With this shift, adopting chaos engineering becomes a question of “when,” not “if.” LitmusChaos helps SREs run a litmus test on the resilience of their Kubernetes implementation and of the applications running on it. We assume that both Kubernetes and the microservices/applications on it undergo rigorous testing before reaching the hands of SREs. However, each implementation differs in resource configuration, scale, load, and the combination of applications running on Kubernetes. This makes it necessary to litmus-test the resilience of the platform and applications periodically; in other words, to adopt the chaos-first principle.&lt;/p&gt;

&lt;p&gt;The Litmus project helps SREs jump-start their chaos journey by providing the required operator and experiments. There are already numerous Kubernetes resource-specific experiments, along with a few application-specific ones. You can introduce your first chaos experiment in minutes: it takes only a few minutes to go from installing Litmus to injecting a fault.&lt;/p&gt;

&lt;p&gt;SREs can use Litmus and its experiments, and develop new experiments seamlessly; an SDK is available in Ansible, Python, and Go. If a newly developed experiment is generic and useful to the community, they can contribute it back to ChaosHub. &lt;/p&gt;

&lt;h1&gt;
  
  
  What can you contribute other than chaos experiments?
&lt;/h1&gt;

&lt;p&gt;The project is under active development. The roadmap is kept up to date at &lt;a href="https://github.com/litmuschaos/litmus/blob/master/ROADMAP.md#in-progress-near-term"&gt;https://github.com/litmuschaos/litmus/blob/master/ROADMAP.md#in-progress-near-term&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Some of the experiments are being extended to the Python and Go SDKs. You can put your Python or Go skills to work and earn some open-source karma. &lt;/p&gt;

&lt;p&gt;I am also excited to say that the &lt;a href="https://github.com/litmuschaos/litmus/wiki/portal-design-spec"&gt;litmus portal&lt;/a&gt; is being developed, and it welcomes contributions from frontend developers. Are you interested in contributing with your ReactJS, GraphQL, database, GoLang, or Cypress skills? You are welcome: join Slack to talk to the team, or pick up an issue directly on GitHub. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wATLrSh4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/06dyagv6fhj6x2cu61r9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wATLrSh4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/06dyagv6fhj6x2cu61r9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;We are thrilled to have joined CNCF as a sandbox project and look forward to taking Litmus toward greater development and adoption. &lt;/p&gt;

&lt;p&gt;Join the Litmus community on &lt;a href="https://kubernetes.slack.com/messages/CNXNB0ZTN"&gt;Slack&lt;/a&gt; to ask any questions about getting started.&lt;/p&gt;

&lt;p&gt;Follow us on &lt;a href="https://github.com/litmuschaos/litmus"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>litmuschaos</category>
      <category>chaosengineering</category>
      <category>sre</category>
    </item>
    <item>
      <title>Chaos Engineering for cloud-native systems</title>
      <dc:creator>Uma Mukkara</dc:creator>
      <pubDate>Sun, 14 Jun 2020 08:32:37 +0000</pubDate>
      <link>https://forem.com/umamukkara/chaos-engineering-for-cloud-native-systems-2fjn</link>
      <guid>https://forem.com/umamukkara/chaos-engineering-for-cloud-native-systems-2fjn</guid>
      <description>&lt;p&gt;With great enthusiasm and excitement, I am writing my first post on the dev.to platform. My posts will mostly cover cloud-native data management and cloud-native Chaos Engineering: sometimes on existing technology and sometimes on future thoughts. I started my Kubernetes journey four years ago, when I pivoted from closed source to open source as the platform on which to build future technologies. Having co-created two open-source projects, OpenEBS and LitmusChaos, I can now say that the choice was the right one, and a brilliant one.&lt;/p&gt;

&lt;p&gt;Open source is the vehicle and platform for innovation; Kubernetes is the most recent example. We at MayaData started solving data management issues for Kubernetes in 2017 and quickly realized there is a common problem to be solved across all Kubernetes applications: how to realize the promise of Kubernetes, namely the agility of DevOps. Developers and SREs have understood the advantages of Kubernetes and are moving to microservices at an unprecedented rate. They need a framework or toolset that helps them make the transition to microservices quickly while keeping the resilience of the final deployment at acceptable levels, and the choice of framework needs to be architecturally correct. &lt;/p&gt;

&lt;p&gt;What's the answer?&lt;br&gt;
The answer is - adopt "&lt;em&gt;Chaos Engineering&lt;/em&gt;" from the beginning and as a first principle.&lt;/p&gt;




&lt;h2&gt;
  
  
  “Chaos First” principle
&lt;/h2&gt;

&lt;p&gt;There is so much written about Chaos Engineering these days, but one thing that stood out to me is the following from &lt;a href="https://dev.to/adrianco"&gt;Adrian Cockcroft&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DJmgRthD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/67i5h3sg0u1ojw8aeb4h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DJmgRthD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/67i5h3sg0u1ojw8aeb4h.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“Chaos first” is a great strategy for cloud-native teams in order to achieve resilience when your deployment scales up.&lt;/p&gt;




&lt;p&gt;Chaos Engineering has been the answer to achieving greater reliability in production systems. The cloud-native ecosystem has to adopt this discipline all the more because of the dynamism within its components and their sheer number. Most importantly, the components are loosely coupled, which is a requirement of microservices architecture. I recently wrote an &lt;a href="https://t.co/7sFJvvGh1A?amp=1"&gt;article on the CNCF blog&lt;/a&gt; about these principles and want to restate them here in my first post.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is Cloud-Native Chaos Engineering?
&lt;/h1&gt;

&lt;p&gt;It is Chaos Engineering done the cloud-native way, or the Kubernetes way. Here are the four principles that define how cloud-native your Chaos Engineering framework or practice is.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2DKRnL2P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ibyk5hxztgljr43oql3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2DKRnL2P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ibyk5hxztgljr43oql3t.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  4 Principles of a Cloud Native Chaos Engineering Framework
&lt;/h1&gt;

&lt;p&gt;These principles are for guidance; they help you choose a strategy for the Chaos Engineering stack or framework on your Kubernetes platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Open source
&lt;/h3&gt;

&lt;p&gt;The framework has to be completely open source, under the Apache 2.0 license, to encourage broader community participation and inspection. The number of applications moving to the Kubernetes platform is limitless. At such a scale, only an open chaos model will thrive and gain the required adoption.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. CRDs for Chaos Management
&lt;/h3&gt;

&lt;p&gt;The framework should have clearly defined CRDs for orchestrating chaos on Kubernetes. These CRDs act as standard APIs to provision and manage chaos in complex production environments, and they are the building blocks for chaos workflow orchestration.&lt;/p&gt;
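&lt;p&gt;As an illustration, a trimmed-down ChaosExperiment CR might look like the following. The field names follow the Litmus &lt;code&gt;v1alpha1&lt;/code&gt; API, but treat this as a sketch and consult the current docs for the authoritative schema; the image, args, and env values are placeholders.&lt;/p&gt;

```shell
# A minimal ChaosExperiment CR: the "definition" tells the chaos operator
# which container image to run and with what parameters.
cat > pod-delete-experiment.yaml <<'EOF'
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
spec:
  definition:
    scope: Namespaced
    image: litmuschaos/go-runner:latest
    command:
      - /bin/bash
    args:
      - -c
      - ./experiments -name pod-delete
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "15"
      - name: CHAOS_INTERVAL
        value: "5"
EOF
echo "wrote pod-delete-experiment.yaml"
```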

&lt;h3&gt;
  
  
  3. Extensible and pluggable
&lt;/h3&gt;

&lt;p&gt;One lesson from the success of cloud-native approaches is that their components can be swapped out relatively easily and new ones introduced as needed. Any standard chaos library or functionality developed by other open-source developers should be able to be integrated into, and orchestrated by, this pluggable framework.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Broad Community adoption
&lt;/h3&gt;

&lt;p&gt;Once we have the APIs, the Operator, and the plugin framework, we have all the ingredients needed for a common way of injecting chaos. The chaos is run against well-known infrastructure such as Kubernetes, applications such as databases, or other infrastructure components such as storage or networking. These chaos experiments can be reused, and a broad-based community helps identify and contribute other high-value scenarios. Hence, a Chaos Engineering framework should provide a central hub or forge where open-source chaos experiments are shared and collaboration via code is enabled.&lt;/p&gt;

&lt;p&gt;An example of a cloud-native Chaos Engineering framework is &lt;a href="https://github.com/litmuschaos/litmus"&gt;LitmusChaos&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Brief introduction to LitmusChaos
&lt;/h1&gt;

&lt;p&gt;LitmusChaos is a cloud-native Chaos Engineering framework for Kubernetes, and it is unique in fulfilling all four of the above principles. The Litmus community publishes its chaos experiments at &lt;a href="https://hub.litmuschaos.io"&gt;hub.litmuschaos.io&lt;/a&gt;. The hub contains chaos experiments for Kubernetes resources as well as for specific applications. Developers can bring their own chaos experiments as a Docker container image and easily use the Litmus framework to orchestrate and monitor chaos.&lt;/p&gt;

&lt;p&gt;I will write another detailed blog to introduce Litmus here to the Dev community.&lt;/p&gt;


&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Doing Chaos Engineering on Kubernetes is an important first step toward highly resilient deployments. A “Chaos First” strategy helps set the mindset of both developers and SREs. To practice Chaos Engineering in cloud-native environments, you need not write experiments from scratch; instead, choose an open-source framework with a well-defined API, plenty of well-tested experiments, and a good community around it. Cloud-native Chaos Engineering fits naturally into Kubernetes' scheme of things, and SREs often find it plug-and-play.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you? Join our community on &lt;a href="https://kubernetes.slack.com/messages/CNXNB0ZTN"&gt;Slack&lt;/a&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Contribute to LitmusChaos and share your feedback on &lt;a href="https://github.com/litmuschaos/litmus"&gt;GitHub&lt;/a&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  If you like LitmusChaos, become one of the many stargazers &lt;a href="https://github.com/litmuschaos/litmus/stargazers"&gt;here&lt;/a&gt;
&lt;/h3&gt;

</description>
      <category>chaosengineering</category>
      <category>litmuschaos</category>
      <category>kubernetes</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
