<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: CWen</title>
    <description>The latest articles on Forem by CWen (@cwen).</description>
    <link>https://forem.com/cwen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F319215%2F73d5b3fd-f91a-450c-ab60-b5eab0b5eb31.jpeg</url>
      <title>Forem: CWen</title>
      <link>https://forem.com/cwen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/cwen"/>
    <language>en</language>
    <item>
      <title>Celebrating One Year of Chaos Mesh: Looking Back and Ahead</title>
      <dc:creator>CWen</dc:creator>
      <pubDate>Thu, 01 Apr 2021 08:59:31 +0000</pubDate>
      <link>https://forem.com/cwen/celebrating-one-year-of-chaos-mesh-looking-back-and-ahead-11mo</link>
      <guid>https://forem.com/cwen/celebrating-one-year-of-chaos-mesh-looking-back-and-ahead-11mo</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fcelebrating-one-year-of-chaos-mesh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fcelebrating-one-year-of-chaos-mesh.jpg" alt="Celebrating one year of Chaos Mesh"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's been a year since Chaos Mesh was first open-sourced on GitHub. Chaos Mesh started out as a mere fault injection tool and is now heading toward the goal of building a chaos engineering ecosystem. Meanwhile, the Chaos Mesh community has been built from scratch and helped &lt;a href="https://github.com/chaos-mesh/chaos-mesh" rel="noopener noreferrer"&gt;Chaos Mesh&lt;/a&gt; join CNCF as a Sandbox project.&lt;/p&gt;

&lt;p&gt;In this article, we will share with you how Chaos Mesh has grown and changed in the past year and also discuss its future goals and plans.&lt;/p&gt;

&lt;h2&gt;
  
  
  The project: thrive with a clear goal in mind
&lt;/h2&gt;

&lt;p&gt;In this past year, Chaos Mesh has grown at an impressive speed with the joint efforts of the community. From the very first version to the recently released &lt;a href="https://github.com/chaos-mesh/chaos-mesh/releases/tag/v1.1.0" rel="noopener noreferrer"&gt;v1.1.0&lt;/a&gt;, Chaos Mesh has been greatly improved in terms of functionality, ease of use, and security.&lt;/p&gt;

&lt;h3&gt;
  
  
  Functionality
&lt;/h3&gt;

&lt;p&gt;When first open-sourced, Chaos Mesh supported only three fault types: PodChaos, NetworkChaos, and &lt;a href="https://pingcap.com/blog/how-to-simulate-io-faults-at-runtime" rel="noopener noreferrer"&gt;IOChaos&lt;/a&gt;. Within only a year, Chaos Mesh has grown to perform all-around fault injection into the network, system clock, JVM applications, filesystems, operating systems, and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fchaos-mesh-functionalities.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fchaos-mesh-functionalities.jpg" alt="Chaos Mesh's functionalities"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Chaos Mesh's functionalities &lt;/p&gt;

&lt;p&gt;After continuous optimization, Chaos Mesh now provides a flexible scheduling mechanism, which enables users to better design their own chaos experiments. This laid the foundation for chaos orchestration.&lt;/p&gt;
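
&lt;p&gt;As a rough sketch, a scheduled experiment in the v1.x API looks like the following (the target namespace and the schedule here are placeholders for illustration, not a definitive configuration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one                   # affect one randomly chosen Pod
  selector:
    namespaces:
      - app-namespace         # illustrative target namespace
  duration: "30s"
  scheduler:
    cron: "@every 10m"        # repeat the experiment every 10 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;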

&lt;p&gt;In the meantime, we are happy to see that a number of users have started to &lt;a href="https://github.com/chaos-mesh/chaos-mesh/issues/1182" rel="noopener noreferrer"&gt;test Chaos Mesh on major cloud platforms&lt;/a&gt;, such as Amazon Web Services (AWS), Google Kubernetes Engine (GKE), Alibaba Cloud, and Tencent Cloud. We have continuously conducted compatibility testing and adaptation to support &lt;a href="https://github.com/chaos-mesh/chaos-mesh/pull/1330" rel="noopener noreferrer"&gt;fault injection for specific cloud platforms&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To better support Kubernetes native components and node-level failures, we developed &lt;a href="https://github.com/chaos-mesh/chaosd" rel="noopener noreferrer"&gt;Chaosd&lt;/a&gt;, which provides physical node-level fault injection. We're extensively testing and refining this feature for release within the next few months.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ease of use
&lt;/h3&gt;

&lt;p&gt;Ease of use has been one of the guiding principles of Chaos Mesh development since day one. You can deploy Chaos Mesh with a single command. The V1.0 release brought the long-awaited Chaos Dashboard, a one-stop web interface for orchestrating chaos experiments. You can define the scope of the chaos experiment, specify the type of chaos to inject, define scheduling rules, and observe the results of the chaos experiment—all in the same web interface with only a few clicks.&lt;/p&gt;
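
&lt;p&gt;For reference, a typical Helm-based deployment boils down to a few commands (chart values and the release namespace may differ in your environment):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Add the Chaos Mesh chart repository and install it into its own namespace
helm repo add chaos-mesh https://charts.chaos-mesh.org
kubectl create ns chaos-mesh
helm install chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-mesh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;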

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fchaos-mesh-dashboard.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fchaos-mesh-dashboard.jpg" alt="Chaos Mesh's dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Chaos Mesh's dashboard &lt;/p&gt;

&lt;p&gt;Prior to V1.0, many users reported being blocked by various configuration problems when injecting IOChaos faults. After intensive investigation and discussion, we abandoned the original sidecar implementation. Instead, we use chaos-daemon to dynamically enter the target Pod, which significantly simplifies the logic. This optimization makes dynamic I/O fault injection possible with Chaos Mesh, so users can focus solely on their experiments without worrying about additional configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;We have also improved the security of Chaos Mesh. It now provides a comprehensive set of selectors to control the scope of experiments and supports setting protected namespaces to shield important applications. What's more, namespace-scoped permissions allow users to limit the "blast radius" of a chaos experiment to a specific namespace.&lt;/p&gt;
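
&lt;p&gt;In an experiment definition, this scoping is expressed through the selector; a minimal sketch (the namespace and label values below are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  mode: one
  selector:
    namespaces:
      - staging               # only Pods in this namespace can be affected
    labelSelectors:
      "tier": "cache"         # narrow the target further by label
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;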

&lt;p&gt;In addition, Chaos Mesh directly reuses Kubernetes' native permission mechanism and supports permission verification on the Chaos Dashboard. This prevents other users' mistakes from causing your chaos experiments to fail or spin out of control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud native ecosystem: integrations and cooperations
&lt;/h2&gt;

&lt;p&gt;In July 2020, Chaos Mesh was successfully &lt;a href="https://pingcap.com/blog/announcing-chaos-mesh-as-a-cncf-sandbox-project" rel="noopener noreferrer"&gt;accepted as a CNCF Sandbox project&lt;/a&gt;. This shows that Chaos Mesh has received initial recognition from the cloud native community. At the same time, it means that Chaos Mesh has a clear mission: to promote the application of chaos engineering in the cloud native field and to cooperate with other cloud native projects so we can grow together.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grafana
&lt;/h3&gt;

&lt;p&gt;To further improve the observability of chaos experiments, we have included a separate &lt;a href="https://github.com/chaos-mesh/chaos-mesh-datasource" rel="noopener noreferrer"&gt;Grafana plug-in&lt;/a&gt; for Chaos Mesh, which allows users to directly display real-time chaos experiment information on the application monitoring panel. This way, users can simultaneously observe the running status of the application and the current chaos experiment information.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Actions
&lt;/h3&gt;

&lt;p&gt;To enable users to run chaos experiments even during the development phase, we developed the &lt;a href="https://github.com/chaos-mesh/chaos-mesh-action" rel="noopener noreferrer"&gt;chaos-mesh-action&lt;/a&gt; project, &lt;a href="https://pingcap.com/blog/chaos-mesh-action-integrate-chaos-engineering-into-your-ci" rel="noopener noreferrer"&gt;allowing Chaos Mesh to run in the workflow of GitHub Actions&lt;/a&gt;. This way, Chaos Mesh can easily be integrated into daily system development and testing. &lt;/p&gt;
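
&lt;p&gt;A hedged sketch of how such a workflow step might look: based on the chaos-mesh-action README, the experiment definition is passed in base64-encoded via an environment variable. The kind-cluster setup step and the secret name here are assumptions for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - name: Create a kind cluster
        uses: helm/kind-action@v1
      - name: Run chaos-mesh-action
        uses: chaos-mesh/chaos-mesh-action@master
        env:
          # Base64-encoded chaos experiment YAML, stored as a repository secret
          CFG_BASE64: ${{ secrets.CHAOS_CFG_BASE64 }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;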

&lt;h3&gt;
  
  
  TiPocket
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/pingcap/tipocket" rel="noopener noreferrer"&gt;TiPocket&lt;/a&gt; is &lt;a href="https://pingcap.com/blog/building-automated-testing-framework-based-on-chaos-mesh-and-argo" rel="noopener noreferrer"&gt;an automated test platform&lt;/a&gt; that integrates Chaos Mesh and Argo, a workflow engine designed for Kubernetes. TiPocket is designed to be a fully automated chaos engineering testing loop for &lt;a href="https://docs.pingcap.com/tidb/stable" rel="noopener noreferrer"&gt;TiDB&lt;/a&gt;, a distributed database. There are a number of steps when we conduct chaos experiments, including deploying applications, running workloads, injecting exceptions, and business checks. To fully automate these steps, Argo was integrated into TiPocket. Chaos Mesh provides rich fault injection, while Argo provides flexible orchestration and scheduling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftipocket.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftipocket.jpg" alt="TiPocket"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TiPocket &lt;/p&gt;

&lt;h2&gt;
  
  
  The community: built from the ground up
&lt;/h2&gt;

&lt;p&gt;Chaos Mesh is a community-driven project and cannot progress without an active, friendly, and open community. Since it was open-sourced, Chaos Mesh has quickly become one of the most eye-catching open-source projects in the chaos engineering world. Within a year, it has accumulated more than 3k stars on GitHub and attracted 70+ contributors. Adopters include Tencent Cloud, XPeng Motors, Dailymotion, NetEase Fuxi Lab, JuiceFS, APISIX, and Meituan. Looking back on the past year, the Chaos Mesh community was built from scratch and has laid the foundation for a transparent, open, friendly, and autonomous open source community.&lt;/p&gt;

&lt;h3&gt;
  
  
  Becoming part of the CNCF family
&lt;/h3&gt;

&lt;p&gt;Cloud native has been in the DNA of Chaos Mesh since the very beginning. Joining CNCF was a natural choice and marks a critical step toward becoming a vendor-neutral, open, and transparent open-source community. Aside from integration within the cloud native ecosystem, joining CNCF gives Chaos Mesh:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;More community and project exposure&lt;/p&gt;

&lt;p&gt;Collaborations with other projects and various cloud native community activities, such as Kubernetes Meetups and KubeCon, have presented us with great opportunities to communicate with the community. We are amazed at how the high-quality content produced by the community has played a positive and far-reaching role in promoting Chaos Mesh.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A more complete and open community framework&lt;/p&gt;

&lt;p&gt;CNCF provides a rather mature framework for open-source community operations. Under CNCF's guidance, we established our basic community framework, including a Code of Conduct, Contributing Guide, and Roadmap. We've also created our own channel, #project-chaos-mesh, under CNCF's Slack.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A friendly and supportive community
&lt;/h3&gt;

&lt;p&gt;The quality of the open source community determines whether our adopters and contributors are willing to stick around and get involved in the community for the long run. In this regard, we've been working hard on: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuously enriching the documentation and optimizing its structure. So far, we have developed a complete set of documentation for different audiences, including &lt;a href="https://chaos-mesh.org/docs/user_guides/installation/" rel="noopener noreferrer"&gt;a user guide&lt;/a&gt;, a &lt;a href="https://chaos-mesh.org/docs/development_guides/development_overview" rel="noopener noreferrer"&gt;developer guide&lt;/a&gt;, &lt;a href="https://chaos-mesh.org/docs/get_started/get_started_on_kind" rel="noopener noreferrer"&gt;quick start guides&lt;/a&gt;, &lt;a href="https://chaos-mesh.org/docs/use_cases/multi_data_centers" rel="noopener noreferrer"&gt;use cases&lt;/a&gt;, and &lt;a href="https://github.com/chaos-mesh/chaos-mesh/blob/master/CONTRIBUTING.md" rel="noopener noreferrer"&gt;a contributing guide&lt;/a&gt;. All are updated with each release. &lt;/li&gt;
&lt;li&gt;Working with the community to publish blog posts, tutorials, use cases, and chaos engineering practices. So far, we've produced 26 Chaos Mesh-related articles. Among them is an &lt;a href="https://chaos-mesh.org/interactiveTutorial" rel="noopener noreferrer"&gt;interactive tutorial&lt;/a&gt; published on O'Reilly's Katacoda site. These materials are a great complement to the documentation.&lt;/li&gt;
&lt;li&gt;Repurposing and amplifying videos and tutorials generated in community meetings, webinars, and meetups.&lt;/li&gt;
&lt;li&gt;Valuing and responding to community feedback and queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Looking ahead
&lt;/h2&gt;

&lt;p&gt;Google's recent global outage reminded us of the importance of system reliability and highlighted the value of chaos engineering. Liz Rice, CNCF TOC Chair, shared &lt;a href="https://twitter.com/CloudNativeFdn/status/1329863326428499971" rel="noopener noreferrer"&gt;The 5 technologies to watch in 2021&lt;/a&gt;, and chaos engineering is at the top of the list. We boldly predict that chaos engineering is about to enter a new stage. &lt;/p&gt;

&lt;p&gt;Chaos Mesh 2.0 is now in active development. It addresses community requirements such as an embedded workflow engine to support defining and managing more flexible chaos scenarios, application state checking mechanisms, and more detailed experiment reports. Follow along through the project &lt;a href="https://github.com/chaos-mesh/chaos-mesh/blob/master/ROADMAP.md" rel="noopener noreferrer"&gt;roadmap&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Last but not least
&lt;/h2&gt;

&lt;p&gt;Chaos Mesh has grown so much in the past year, yet it is still young, and we have just set sail toward our goal. We invite all of you to participate and help build the chaos engineering ecosystem together!&lt;/p&gt;

&lt;p&gt;If you are interested in Chaos Mesh and would like to help us improve it, you're welcome to join &lt;a href="https://cloud-native.slack.com/join/shared_invite/zt-fyy3b8up-qHeDNVqbz1j8HDY6g1cY4w#/" rel="noopener noreferrer"&gt;our Slack channel&lt;/a&gt; or submit your pull requests or issues to our &lt;a href="https://github.com/chaos-mesh/chaos-mesh" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>chaosengineering</category>
      <category>chaosmesh</category>
      <category>testing</category>
    </item>
    <item>
      <title>Chaos Mesh X Hacktoberfest 2020 - An Invitation to Open Source</title>
      <dc:creator>CWen</dc:creator>
      <pubDate>Thu, 15 Oct 2020 06:14:54 +0000</pubDate>
      <link>https://forem.com/cwen/chaos-mesh-x-hacktoberfest-2020-an-invitation-to-open-source-1ade</link>
      <guid>https://forem.com/cwen/chaos-mesh-x-hacktoberfest-2020-an-invitation-to-open-source-1ade</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchaos-mesh.org%2Fassets%2Fimages%2Fchaos-mesh-x-hacktoberfest-ef5cfeca3e10bfe176b916c75d46f468.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchaos-mesh.org%2Fassets%2Fimages%2Fchaos-mesh-x-hacktoberfest-ef5cfeca3e10bfe176b916c75d46f468.jpg" alt="Chaos-Mesh-X-Hacktoberfest-An-Invitation-to-Open-Source"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/chaos-mesh/chaos-mesh" rel="noopener noreferrer"&gt;Chaos Mesh&lt;/a&gt; is proud to be in &lt;a href="https://hacktoberfest.digitalocean.com/" rel="noopener noreferrer"&gt;Hacktoberfest 2020&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Hosted by DigitalOcean, Intel, and DEV, Hacktoberfest is an open source celebration open to everyone in our global community. This month-long (Oct 1 - Oct 31) event encourages everyone to help drive the growth of open source and make positive contributions to an ever-growing community, whether you’re an experienced developer or an open-source newbie learning to code. As long as you submit 4 PRs before Oct 31, you are eligible to claim a limited-edition T-shirt (70,000 in total, on a first-come-first-served basis)!  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchaos-mesh.org%2Fassets%2Fimages%2Fhacktoberfest-shirt-18cdaf9caef5ce5bd0d032f5e3ca4878.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchaos-mesh.org%2Fassets%2Fimages%2Fhacktoberfest-shirt-18cdaf9caef5ce5bd0d032f5e3ca4878.png" alt="Hacktoberfest T-shirt"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Open source is the spirit
&lt;/h2&gt;

&lt;p&gt;Chaos Mesh has always been a dedicated and firm advocate of open source from day one. In the 10 months since it was open-sourced on December 31, 2019, Chaos Mesh has already received 2.5k GitHub stars, with 59 contributors from multiple organizations, and it was accepted as a &lt;a href="https://www.cncf.io/sandbox-projects/" rel="noopener noreferrer"&gt;CNCF sandbox project&lt;/a&gt; in July 2020. The amazing growth of both the project and the community would not have been possible without our shared commitment to the open-source spirit. &lt;/p&gt;

&lt;p&gt;We hereby invite you to join us, starting with our handpicked issues, with proper mentoring and assistance along the way. We hope you will find the journey rewarding, inspiring, and, most of all, fun. &lt;/p&gt;

&lt;h2&gt;
  
  
  How can you participate
&lt;/h2&gt;

&lt;p&gt;We are all set up for Hacktoberfest: we have labeled &lt;a href="https://github.com/chaos-mesh/chaos-mesh/issues?q=is%3Aissue+is%3Aopen+label%3AHacktoberfest" rel="noopener noreferrer"&gt;suitable issues&lt;/a&gt; with “Hacktoberfest” and updated the &lt;a href="https://github.com/chaos-mesh/chaos-mesh/blob/master/CONTRIBUTING.md" rel="noopener noreferrer"&gt;Contributing Guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;How can you participate? It could not be easier with the following steps: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign up for &lt;a href="https://hacktoberfest.digitalocean.com/login" rel="noopener noreferrer"&gt;Hacktoberfest&lt;/a&gt; using your GitHub account between Oct 1 and Oct 31.&lt;/li&gt;
&lt;li&gt;Pick up an issue. Note that the issues are still being updated, and you don’t have to limit yourself to issues with the Hacktoberfest label; they only serve as a starting point. &lt;/li&gt;
&lt;li&gt;Start coding and submit your PRs. Again, a PR does not need to correspond to a labeled issue.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Our maintainers review your PRs. Once 4 or more of them have been merged or approved, they will be automatically counted on the Hacktoberfest side, and you will be eligible to claim your swag.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchaos-mesh.org%2Fassets%2Fimages%2FPR-count-db779eed7a85e5d45e8e20d9f8e5e78e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchaos-mesh.org%2Fassets%2Fimages%2FPR-count-db779eed7a85e5d45e8e20d9f8e5e78e.png" alt="PR count"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If your PRs are merged or approved but you haven’t seen the number reflected on Hacktoberfest, comment under your PR.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strive for quality and learning, not spam
&lt;/h2&gt;

&lt;p&gt;In the spirit of open source and Hacktoberfest, we welcome all contributions and honor only valid PRs. However, we do not encourage or tolerate spammy contributions, which not only waste our maintainers’ time but also hurt the integrity of the entire open source community. Spammy PRs will be labeled "invalid" or "spam" and closed. &lt;/p&gt;

&lt;p&gt;Happy hacking! But don’t hack alone. Join #project-chaos-mesh in the &lt;a href="https://join.slack.com/t/cloud-native/shared_invite/zt-fyy3b8up-qHeDNVqbz1j8HDY6g1cY4w" rel="noopener noreferrer"&gt;CNCF Slack&lt;/a&gt; to share your experience, provide your feedback on your experience, or let us help with any problem you have. &lt;/p&gt;

</description>
      <category>hacktoberfest</category>
      <category>chaosengineering</category>
      <category>chaosmesh</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Building an Automated Testing Framework Based on Chaos Mesh® and Argo</title>
      <dc:creator>CWen</dc:creator>
      <pubDate>Thu, 20 Aug 2020 11:07:34 +0000</pubDate>
      <link>https://forem.com/cwen/building-an-automated-testing-framework-based-on-chaos-mesh-and-argo-1lni</link>
      <guid>https://forem.com/cwen/building-an-automated-testing-framework-based-on-chaos-mesh-and-argo-1lni</guid>
      <description>&lt;p&gt;This article was originally published at &lt;a href="https://pingcap.com/blog/building-automated-testing-framework-based-on-chaos-mesh-and-argo" rel="noopener noreferrer"&gt;www.pingcap.com&lt;/a&gt; on Apr 20, 2020 &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fautomated-chaos-testing-framework.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fautomated-chaos-testing-framework.jpg" alt="automated-chaos-testing-framework"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/chaos-mesh/chaos-mesh" rel="noopener noreferrer"&gt;Chaos Mesh&lt;/a&gt;® is an open-source chaos engineering platform for Kubernetes. Although it provides rich capabilities to simulate abnormal system conditions, it still only solves a fraction of the Chaos Engineering puzzle. Besides fault injection, a full chaos engineering application consists of hypothesizing around defined steady states, running experiments in production, validating the system via test cases, and automating the testing. &lt;/p&gt;

&lt;p&gt;This article describes how we use &lt;a href="https://github.com/pingcap/tipocket" rel="noopener noreferrer"&gt;TiPocket&lt;/a&gt;, an automated testing framework to build a full Chaos Engineering testing loop for TiDB, our distributed database.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why do we need TiPocket?
&lt;/h2&gt;

&lt;p&gt;Before we can put a distributed system like &lt;a href="https://github.com/pingcap/tidb" rel="noopener noreferrer"&gt;TiDB&lt;/a&gt; into production, we have to ensure that it is robust enough for day-to-day use. For this reason, several years ago we introduced Chaos Engineering into our testing framework. In our testing framework, we:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Observe the normal metrics and develop our testing hypothesis.&lt;/li&gt;
&lt;li&gt;Inject a list of failures into TiDB.&lt;/li&gt;
&lt;li&gt;Run various test cases to verify TiDB in fault scenarios. &lt;/li&gt;
&lt;li&gt;Monitor and collect test results for analysis and diagnosis.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This sounds like a solid process, and we’ve used it for years. However, as TiDB evolves, the testing scale multiplies. We have multiple fault scenarios, against which dozens of test cases run in the Kubernetes testing cluster. Even with Chaos Mesh helping to inject failures, the remaining work can still be demanding—not to mention the challenge of automating the pipeline to make the testing scalable and efficient.&lt;/p&gt;

&lt;p&gt;This is why we built TiPocket, a fully automated testing framework based on Kubernetes and Chaos Mesh. Currently, we mainly use it to test TiDB clusters. However, because of TiPocket’s Kubernetes-friendly design and extensible interface, it can easily be extended to support other applications through Kubernetes’ create and delete logic. &lt;/p&gt;
&lt;h2&gt;
  
  
  How does it work?
&lt;/h2&gt;

&lt;p&gt;Based on the above requirements, we need an automatic workflow that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Injects chaos&lt;/li&gt;
&lt;li&gt;Verifies the impact of that chaos&lt;/li&gt;
&lt;li&gt;Automates the chaos pipeline&lt;/li&gt;
&lt;li&gt;Visualizes the results&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Injecting chaos - Chaos Mesh
&lt;/h3&gt;

&lt;p&gt;Fault injection is at the core of chaos testing. In a distributed database, faults can happen anytime, anywhere—from node crashes, network partitions, and file system failures to kernel panics. This is where Chaos Mesh comes in. &lt;/p&gt;

&lt;p&gt;Currently, TiPocket supports the following types of fault injection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Network&lt;/strong&gt;: Simulates network partitions, random packet loss, disorder, duplication, or delay of links.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Time skew&lt;/strong&gt;: Simulates clock skew of the container to be tested. &lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Kill&lt;/strong&gt;: Kills the specified pod, either randomly in a cluster or within a component (TiDB, TiKV, or Placement Driver (PD)). &lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;I/O&lt;/strong&gt;: Injects I/O delays into TiDB’s storage engine, TiKV, to identify I/O-related issues.&lt;/li&gt;
&lt;/ul&gt;
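
&lt;p&gt;As one concrete example, a clock-skew experiment can be sketched like this (the label selector is illustrative, and the offset format may vary across Chaos Mesh versions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew-example
spec:
  mode: one
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "tikv"   # illustrative target
  timeOffset: "-10m"          # shift the container clock back 10 minutes
  duration: "30s"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;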

&lt;p&gt;With fault injection handled, we need to think about verification. How do we make sure TiDB can survive these faults? &lt;/p&gt;
&lt;h2&gt;
  
  
  Verifying chaos impacts: test cases
&lt;/h2&gt;

&lt;p&gt;To validate how TiDB withstands chaos, we implemented dozens of test cases in TiPocket, combined with a variety of inspection tools. To give you an overview of how TiPocket verifies TiDB in the event of failures, consider the following test cases. These cases focus on SQL execution, transaction consistency, and transaction isolation. &lt;/p&gt;
&lt;h3&gt;
  
  
  Fuzz testing: SQLsmith
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/pingcap/tipocket/tree/master/pkg/go-sqlsmith" rel="noopener noreferrer"&gt;SQLsmith&lt;/a&gt; is a tool that generates random SQL queries. TiPocket creates a TiDB cluster and a MySQL instance.. The random SQL generated by SQLsmith is executed on TiDB and MySQL, and various faults are injected into the TiDB cluster to test. In the end, execution results are compared. If we detect inconsistencies, there are potential issues with our system.&lt;/p&gt;
&lt;h3&gt;
  
  
  Transaction consistency testing: Bank and Porcupine
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/pingcap/tipocket/tree/master/cmd/bank" rel="noopener noreferrer"&gt;Bank&lt;/a&gt; is a classical test case that simulates the transfer process in a banking system. Under snapshot isolation, all transfers must ensure that the total amount of all accounts must be consistent at every moment, even in the face of system failures. If there are inconsistencies in the total amount, there are potential issues with our system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/anishathalye/porcupine" rel="noopener noreferrer"&gt;Porcupine&lt;/a&gt; is a linearizability checker in Go built to test the correctness of distributed systems. It takes a sequential specification as executable Go code, along with a concurrent history, and it determines whether the history is linearizable with respect to the sequential specification. In TiPocket, we use the &lt;a href="https://github.com/pingcap/tipocket/tree/master/pkg/check/porcupine" rel="noopener noreferrer"&gt;Porcupine&lt;/a&gt; checker in multiple test cases to check whether TiDB meets the linearizability constraint.  &lt;/p&gt;
&lt;h3&gt;
  
  
  Transaction isolation testing: Elle
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/jepsen-io/elle" rel="noopener noreferrer"&gt;Elle&lt;/a&gt; is an inspection tool that verifies a database’s transaction isolation level. TiPocket integrates &lt;a href="https://github.com/pingcap/tipocket/tree/master/pkg/elle" rel="noopener noreferrer"&gt;go-elle&lt;/a&gt;, the Go implementation of the Elle inspection tool, to verify TiDB’s isolation level.&lt;/p&gt;

&lt;p&gt;These are just a few of the test cases TiPocket uses to verify TiDB’s accuracy and stability. For more test cases and verification methods, see our &lt;a href="https://github.com/pingcap/tipocket" rel="noopener noreferrer"&gt;source code&lt;/a&gt;.  &lt;/p&gt;
&lt;h2&gt;
  
  
  Automating the chaos pipeline - Argo
&lt;/h2&gt;

&lt;p&gt;Now that we have Chaos Mesh to inject faults, a TiDB cluster to test, and ways to validate TiDB, how can we automate the chaos testing pipeline? Two options come to mind: we could implement the scheduling functionality in TiPocket, or hand over the job to existing open-source tools. To make TiPocket more dedicated to the testing part of our workflow, we chose the open-source tools approach. This, plus our all-in-K8s design, led us directly to &lt;a href="https://github.com/argoproj/argo" rel="noopener noreferrer"&gt;Argo&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;Argo is a workflow engine designed for Kubernetes. It has long been open source and has received widespread attention and adoption. &lt;/p&gt;

&lt;p&gt;Argo provides several custom resource definitions (CRDs) for workflows. The most important ones are Workflow Template, Workflow, and Cron Workflow. Here is how Argo fits into TiPocket: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Workflow Template&lt;/strong&gt; is a template defined in advance for each test task. Parameters can be passed in when the test is running.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Workflow&lt;/strong&gt; schedules multiple workflow templates in different orders, which form the tasks to be executed. Argo also lets you add conditions, loops, and directed acyclic graphs (DAGs) in the pipeline.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Cron Workflow&lt;/strong&gt; lets you schedule a workflow like a cron job. It is perfectly suitable for scenarios where you want to run test tasks for a long time.&lt;/li&gt;
&lt;/ul&gt;
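
&lt;p&gt;For long-running test tasks, a Cron Workflow can be sketched roughly as follows (the schedule and names are illustrative, assuming a workflow template has been registered separately):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-bank-test     # illustrative name
spec:
  schedule: "0 2 * * *"       # run every night at 02:00
  workflowSpec:
    workflowTemplateRef:
      name: tipocket-bank     # reference a pre-registered workflow template
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;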

&lt;p&gt;The sample workflow for our predefined bank test is shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  entrypoint: call-tipocket-bank
  arguments:
    parameters:
      - name: ns
        value: tipocket-bank
      - name: nemesis
        value: random_kill,kill_pd_leader_5min,partition_one,subcritical_skews,big_skews,shuffle-leader-scheduler,shuffle-region-scheduler,random-merge-scheduler
  templates:
    - name: call-tipocket-bank
      steps:
        - - name: call-wait-cluster
            templateRef:
              name: wait-cluster
              template: wait-cluster
        - - name: call-tipocket-bank
            templateRef:
              name: tipocket-bank
              template: tipocket-bank
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, the workflow template takes a &lt;code&gt;nemesis&lt;/code&gt; parameter that defines the specific failures to inject. You can reuse the template to define multiple workflows that suit different test cases, which lets you add more customized failure injections to the flow. &lt;/p&gt;
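&lt;p&gt;Because the fault set is just a workflow parameter, one way to reuse the same template against a different fault mix is to override the parameter at submission time. The command below is a sketch; the file name is hypothetical.&lt;/p&gt;

```
# Submit the bank workflow, overriding the namespace and the nemesis list
argo submit bank-workflow.yaml \
  -p ns=tipocket-bank \
  -p nemesis=partition_one,big_skews
```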

&lt;p&gt;Besides &lt;a href="https://github.com/pingcap/tipocket/tree/master/argo/workflow" rel="noopener noreferrer"&gt;TiPocket’s&lt;/a&gt; sample workflows and templates, the design also allows you to add your own failure injection flows. Handling complicated logic with codable workflows makes Argo developer-friendly and an ideal choice for our scenarios.&lt;/p&gt;

&lt;p&gt;Now, our chaos experiment runs automatically. But what if the results do not meet our expectations? How do we locate the problem? TiDB saves a variety of monitoring information, which makes log collection essential for enabling observability in TiPocket.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualizing the results: Loki
&lt;/h2&gt;

&lt;p&gt;In cloud-native systems, observability is very important. Generally speaking, you can achieve observability through &lt;strong&gt;metrics&lt;/strong&gt;, &lt;strong&gt;logging&lt;/strong&gt;, and &lt;strong&gt;tracing&lt;/strong&gt;. TiPocket’s main test cases evaluate TiDB clusters, so metrics and logs are our default sources for locating issues.&lt;/p&gt;

&lt;p&gt;On Kubernetes, Prometheus is the de-facto standard for metrics. However, there is no common way for log collection. Solutions such as &lt;a href="https://en.wikipedia.org/wiki/Elasticsearch" rel="noopener noreferrer"&gt;Elasticsearch&lt;/a&gt;, &lt;a href="https://fluentbit.io/" rel="noopener noreferrer"&gt;Fluent Bit&lt;/a&gt;, and &lt;a href="https://www.elastic.co/kibana" rel="noopener noreferrer"&gt;Kibana&lt;/a&gt; perform well, but they may cause system resource contention and high maintenance costs. We decided to use &lt;a href="https://github.com/grafana/loki" rel="noopener noreferrer"&gt;Loki&lt;/a&gt;, the Prometheus-like log aggregation system from &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Prometheus processes TiDB’s monitoring information. Prometheus and Loki have a similar labeling system, so we can easily combine Prometheus monitoring metrics with the corresponding pod logs and query both with a similar language. Grafana also supports Loki dashboards, which means we can display metrics and logs side by side. Because Grafana is TiDB’s built-in monitoring component, Loki can reuse it directly.&lt;/p&gt;
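&lt;p&gt;To illustrate the shared labeling model, a Prometheus metric query and a Loki log query for the same pod can use the same selector. The label and pod names below are illustrative, not TiPocket's actual configuration.&lt;/p&gt;

```
# PromQL: container restarts of a PD pod (metric from kube-state-metrics)
kube_pod_container_status_restarts_total{namespace="tipocket-bank", pod="tidb-app-pd-0"}

# LogQL: error logs from the same pod, selected with the same labels
{namespace="tipocket-bank", pod="tidb-app-pd-0"} |= "error"
```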

&lt;h2&gt;
  
  
  Putting them all together: TiPocket
&lt;/h2&gt;

&lt;p&gt;Now, everything is ready. Here is a simplified diagram of TiPocket: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftipocket-architecture.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftipocket-architecture.jpg" alt="architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the Argo workflow manages all chaos experiments and test cases. Generally, a complete test cycle involves the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Argo creates a Cron Workflow, which defines the cluster to be tested, the faults to inject, the test case, and the duration of the task. If necessary, the Cron Workflow also lets you view case logs in real-time.
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fargo-workflow.jpg" alt="argo-workflow"&gt;
&lt;/li&gt;
&lt;li&gt;At the specified time, the Cron Workflow is triggered, and a separate TiPocket thread starts in the workflow. TiPocket sends TiDB-Operator the definition of the cluster to test. In turn, TiDB-Operator creates the target TiDB cluster, while Loki collects the related logs.&lt;/li&gt;
&lt;li&gt;Chaos Mesh injects faults in the cluster. &lt;/li&gt;
&lt;li&gt;Using the test cases mentioned above, the user validates the health of the system. Any test case failure leads to workflow failure in Argo, which triggers Alertmanager to send the result to the specified Slack channel. If the test cases complete normally, the cluster is cleared, and Argo stands by until the next test.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Falert-message.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Falert-message.jpg" alt="alert-message"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the complete TiPocket workflow. &lt;/p&gt;

&lt;h2&gt;
  
  
  Join us
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/pingcap/chaos-mesh" rel="noopener noreferrer"&gt;Chaos Mesh&lt;/a&gt;  and &lt;a href="https://github.com/pingcap/tipocket" rel="noopener noreferrer"&gt;TiPocket&lt;/a&gt; are both in active iterations. We have donated Chaos Mesh donated to &lt;a href="https://github.com/cncf/toc/pull/367" rel="noopener noreferrer"&gt;CNCF&lt;/a&gt;, and we look forward to more community members joining us in building a complete Chaos Engineering ecosystem. If this sounds interesting to you, check out our &lt;a href="https://chaos-mesh.org/" rel="noopener noreferrer"&gt;website&lt;/a&gt;, or join #chaos-mesh in &lt;a href="https://cloud-native.slack.com/archives/C018JJ686BS" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>chaosmesh</category>
      <category>chaosengineering</category>
      <category>argo</category>
      <category>testing</category>
    </item>
    <item>
      <title>Simulating Clock Skew in K8s Without Affecting Other Containers on the Node</title>
      <dc:creator>CWen</dc:creator>
      <pubDate>Wed, 22 Apr 2020 08:16:48 +0000</pubDate>
      <link>https://forem.com/cwen/simulating-clock-skew-in-k8s-without-affecting-other-containers-on-the-node-59oc</link>
      <guid>https://forem.com/cwen/simulating-clock-skew-in-k8s-without-affecting-other-containers-on-the-node-59oc</guid>
      <description>&lt;p&gt;This article was originally published at &lt;a href="https://pingcap.com/blog/simulating-clock-skew-in-k8s-without-affecting-other-containers-on-node/"&gt;www.pingcap.com&lt;/a&gt; on Apr 20, 2020 &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fM1Smb0I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/clock-sync-chaos-engineering-k8s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fM1Smb0I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/clock-sync-chaos-engineering-k8s.jpg" alt="Clock synchronization in distributed system"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pingcap/chaos-mesh"&gt;Chaos Mesh™&lt;/a&gt;, an easy-to-use, open-source, cloud-native chaos engineering platform for Kubernetes (K8s), has a new feature, TimeChaos, which simulates the &lt;a href="https://en.wikipedia.org/wiki/Clock_skew#On_a_network"&gt;clock skew&lt;/a&gt; phenomenon. Usually, when we modify clocks in a container, we want a &lt;a href="https://learning.oreilly.com/library/view/chaos-engineering/9781491988459/ch07.html"&gt;minimized blast radius&lt;/a&gt;, and we don't want the change to affect the other containers on the node. In reality, however, implementing this can be harder than you think. How does Chaos Mesh solve this problem? &lt;/p&gt;

&lt;p&gt;In this post, I'll describe how we hacked through different approaches of clock skew and how TimeChaos in Chaos Mesh enables time to swing freely in containers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simulating clock skew without affecting other containers on the node
&lt;/h2&gt;

&lt;p&gt;Clock skew refers to the time difference between clocks on nodes within a network. It might cause reliability problems in a distributed system, and it's a concern for designers and developers of complex distributed systems. For example, in a distributed SQL database, it's vital to maintain a synchronized local clock across nodes to achieve a consistent global snapshot and ensure the ACID properties for transactions.&lt;/p&gt;

&lt;p&gt;Currently, there are well-recognized &lt;a href="https://pingcap.com/blog/Time-in-Distributed-Systems/"&gt;solutions to synchronize clocks&lt;/a&gt;, but without proper testing, you can never be sure that your implementation is solid.&lt;/p&gt;

&lt;p&gt;Then how can we test global snapshot consistency in a distributed system? The answer is obvious: we can simulate clock skew to test whether distributed systems can keep a consistent global snapshot under abnormal clock conditions. Some testing tools support simulating clock skew in containers, but they have an impact on physical nodes. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pingcap/chaos-mesh/wiki/Time-Chaos"&gt;TimeChaos&lt;/a&gt; is a tool that &lt;strong&gt;simulates clock skew in containers to test how it impacts your application without affecting the whole node&lt;/strong&gt;. This way, we can precisely identify the potential consequences of clock skew and take measures accordingly. &lt;/p&gt;

&lt;h2&gt;
  
  
  Various approaches for simulating clock skew we've explored
&lt;/h2&gt;

&lt;p&gt;Reviewing the existing options, it is clear that they cannot be applied to Chaos Mesh, which runs on Kubernetes. Two common ways of simulating clock skew--changing the node clock directly and using the Jepsen framework--change the time for all processes on the node. Neither is acceptable for us: in a Kubernetes container, if we inject a clock skew error that affects the entire node, every other container on the same node is disturbed. Such a clumsy approach is not tolerable.&lt;/p&gt;

&lt;p&gt;Then how are we supposed to tackle this problem? The first idea that comes to mind is intercepting time calls in user space; failing that, we can look for a solution in the kernel using the &lt;a href="https://en.wikipedia.org/wiki/Berkeley_Packet_Filter"&gt;Berkeley Packet Filter&lt;/a&gt; (BPF).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;LD_PRELOAD&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;LD_PRELOAD&lt;/code&gt; is a Linux environment variable that lets you define which dynamic link library is loaded before the program execution. &lt;/p&gt;

&lt;p&gt;This variable has two advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can call our own functions without being aware of the source code.&lt;/li&gt;
&lt;li&gt;We can inject code into other programs to achieve specific purposes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For languages whose applications call the time functions in glibc, such as Rust and C, using &lt;code&gt;LD_PRELOAD&lt;/code&gt; is enough to simulate clock skew. But things are trickier for Golang, because it directly parses the virtual Dynamic Shared Object (&lt;a href="http://man7.org/linux/man-pages/man7/vdso.7.html"&gt;vDSO&lt;/a&gt;), a mechanism that speeds up system calls, to obtain the time function address. That means we can't simply use &lt;code&gt;LD_PRELOAD&lt;/code&gt; to intercept the glibc interface. Therefore, &lt;code&gt;LD_PRELOAD&lt;/code&gt; is not our solution. &lt;/p&gt;

&lt;h3&gt;
  
  
  Use BPF to modify the return value of &lt;code&gt;clock_gettime&lt;/code&gt; system call
&lt;/h3&gt;

&lt;p&gt;We also tried to filter the task &lt;a href="http://www.linfo.org/pid.html"&gt;process identification number&lt;/a&gt; (PID) with BPF. This way, we could simulate clock skew on a specified process and modify the return value of the &lt;code&gt;clock_gettime&lt;/code&gt; system call.&lt;/p&gt;

&lt;p&gt;This seemed like a good idea, but we encountered a problem: in most cases, vDSO speeds up &lt;code&gt;clock_gettime&lt;/code&gt;, so &lt;code&gt;clock_gettime&lt;/code&gt; doesn't actually make a system call. This approach didn't work, either. Oops.&lt;/p&gt;

&lt;p&gt;Thankfully, we determined that if the system kernel version is 4.18 or later, and if we use the &lt;a href="https://www.kernel.org/doc/html/latest/timers/hpet.html"&gt;HPET&lt;/a&gt; clock, &lt;code&gt;clock_gettime()&lt;/code&gt; gets time by making normal system calls instead of vDSO. We implemented &lt;a href="https://github.com/chaos-mesh/bpfki"&gt;a version of clock skew&lt;/a&gt; using this approach, and it works fine for Rust and C. As for Golang, the program can get the time right, but if we perform &lt;code&gt;sleep&lt;/code&gt; during the clock skew injection, the sleep operation is very likely to be blocked. Even after the injection is canceled, the system cannot recover. Thus, we have to give up this approach, too.&lt;/p&gt;

&lt;h2&gt;
  
  
  TimeChaos, our final hack
&lt;/h2&gt;

&lt;p&gt;From the previous section, we know that programs usually get the system time by calling &lt;code&gt;clock_gettime&lt;/code&gt;. In our case, &lt;code&gt;clock_gettime&lt;/code&gt; uses vDSO to speed up the calling process, so we cannot use &lt;code&gt;LD_PRELOAD&lt;/code&gt; to hack the &lt;code&gt;clock_gettime&lt;/code&gt; system calls. &lt;/p&gt;

&lt;p&gt;We figured out the cause; then what's the solution? Start with vDSO. If we can redirect the address of the &lt;code&gt;clock_gettime&lt;/code&gt; function in vDSO to a function we define, we can solve the problem.&lt;/p&gt;

&lt;p&gt;Easier said than done. To achieve this goal, we must tackle the following problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Know the user-mode address used by vDSO&lt;/li&gt;
&lt;li&gt;Know vDSO's kernel-mode address, if we want to modify the &lt;code&gt;clock_gettime&lt;/code&gt; function in vDSO from the kernel side&lt;/li&gt;
&lt;li&gt;Know how to modify vDSO data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, we need to peek inside vDSO. We can see the vDSO memory address in &lt;code&gt;/proc/pid/maps&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat /proc/pid/maps
...
7ffe53143000-7ffe53145000 r-xp 00000000 00:00 0                     [vdso]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last line is the vDSO information. The permissions of this memory region are &lt;code&gt;r-xp&lt;/code&gt;: readable and executable, but not writable. That means user mode cannot modify this memory, but we can use &lt;a href="http://man7.org/linux/man-pages/man2/ptrace.2.html"&gt;ptrace&lt;/a&gt; to bypass this restriction.&lt;/p&gt;

&lt;p&gt;Next, we use &lt;code&gt;gdb dump memory&lt;/code&gt; to export the vDSO and use &lt;code&gt;objdump&lt;/code&gt; to see what's inside. Here is what we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(gdb) dump memory vdso.so 0x00007ffe53143000 0x00007ffe53145000
$ objdump -T vdso.so
vdso.so:    file format elf64-x86-64
DYNAMIC SYMBOL TABLE:
ffffffffff700600  w  DF .text   0000000000000545  LINUX_2.6  clock_gettime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that the whole vDSO is like a &lt;code&gt;.so&lt;/code&gt; file, so we can parse it as an executable and linkable format (ELF) file. With this information, a basic workflow for implementing TimeChaos starts to take shape:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8MORyJlp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/timechaos-workflow.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8MORyJlp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/timechaos-workflow.jpg" alt="TimeChaos workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chart above shows the workflow of &lt;strong&gt;TimeChaos&lt;/strong&gt;, an implementation of clock skew in Chaos Mesh.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use ptrace to attach to the process with the specified PID, which stops it.&lt;/li&gt;
&lt;li&gt;Use ptrace to create a new mapping in the virtual address space of the calling process and use &lt;a href="https://linux.die.net/man/2/process_vm_writev"&gt;&lt;code&gt;process_vm_writev&lt;/code&gt;&lt;/a&gt; to write the &lt;code&gt;fake_clock_gettime&lt;/code&gt; function we defined into the memory space.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;process_vm_writev&lt;/code&gt; to write the specified parameters into &lt;code&gt;fake_clock_gettime&lt;/code&gt;. These parameters are the time we would like to inject, such as two hours backward or two days forward.&lt;/li&gt;
&lt;li&gt;Use ptrace to modify the &lt;code&gt;clock_gettime&lt;/code&gt; function in vDSO and redirect to the &lt;code&gt;fake_clock_gettime&lt;/code&gt; function.&lt;/li&gt;
&lt;li&gt;Use ptrace to detach the PID process.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you are interested in the details, see the &lt;a href="https://github.com/pingcap/chaos-mesh/blob/master/pkg/time/time_linux.go"&gt;Chaos Mesh GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simulating clock skew on a distributed SQL database
&lt;/h2&gt;

&lt;p&gt;Statistics speak volumes. Here we're going to try TimeChaos on &lt;a href="https://pingcap.com/docs/stable/overview/"&gt;TiDB&lt;/a&gt;, an open source, &lt;a href="https://en.wikipedia.org/wiki/NewSQL"&gt;NewSQL&lt;/a&gt;, distributed SQL database that supports &lt;a href="https://en.wikipedia.org/wiki/Hybrid_transactional/analytical_processing"&gt;Hybrid Transactional/Analytical Processing&lt;/a&gt; (HTAP) workloads, to see whether chaos testing really works.&lt;/p&gt;

&lt;p&gt;TiDB uses a centralized service, the Timestamp Oracle (TSO), to obtain a globally consistent version number and to ensure that the transaction version number increases monotonically. The TSO service is managed by the Placement Driver (PD) component. Therefore, we choose a random PD node and inject TimeChaos regularly, each time skewing its clock 600 seconds backward. Let's see if TiDB can meet the challenge.&lt;/p&gt;

&lt;p&gt;To better perform the testing, we use &lt;a href="https://github.com/cwen0/bank"&gt;bank&lt;/a&gt; as the workload. It simulates financial transfers in a banking system and is often used to verify the correctness of database transactions.&lt;/p&gt;
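&lt;p&gt;The idea behind the bank check can be sketched in two pieces of SQL (illustrative, not the workload's actual code): each transfer moves money inside a single transaction, and between rounds the total balance must be unchanged.&lt;/p&gt;

```
-- One transfer, executed atomically
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

-- Invariant checked between rounds: the sum never changes
SELECT SUM(balance) FROM accounts;
```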

&lt;p&gt;This is our test configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: pingcap.com/v1alpha1
kind: TimeChaos
metadata:
  name: time-skew-example
  namespace: tidb-demo
spec:
  mode: one
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "pd"
  timeOffset:
    sec: -600
  clockIds:
    - CLOCK_REALTIME
  duration: "10s"
  scheduler:
    cron: "@every 1m"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During this test, Chaos Mesh injects TimeChaos into a chosen PD Pod every minute, and each injection lasts 10 seconds. Within that duration, the time acquired by PD has a 600-second backward offset from the actual time. For further details, see the &lt;a href="https://github.com/pingcap/chaos-mesh/wiki/Time-Chaos"&gt;Chaos Mesh Wiki&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Let's create a TimeChaos experiment using the &lt;code&gt;kubectl apply&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f pd-time.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we can retrieve the PD log by the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs -n tidb-demo tidb-app-pd-0 | grep "system time jump backward"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[2020/03/24 09:06:23.164 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585041383060109693]
[2020/03/24 09:16:32.260 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585041992160476622]
[2020/03/24 09:20:32.059 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585042231960027622]
[2020/03/24 09:23:32.059 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585042411960079655]
[2020/03/24 09:25:32.059 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585042531963640321]
[2020/03/24 09:28:32.060 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585042711960148191]
[2020/03/24 09:33:32.063 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043011960517655]
[2020/03/24 09:34:32.060 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043071959942937]
[2020/03/24 09:35:32.059 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043131978582964]
[2020/03/24 09:36:32.059 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043191960687755]
[2020/03/24 09:38:32.060 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043311959970737]
[2020/03/24 09:41:32.060 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043491959970502]
[2020/03/24 09:45:32.061 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043731961304629]
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the log above, we see that every now and then, PD detects that the system time rolls back. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TimeChaos successfully simulates clock skew.&lt;/li&gt;
&lt;li&gt;PD can deal with the clock skew situation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's encouraging. But does TimeChaos affect services other than PD? We can check it out in the Chaos Dashboard:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c1ZtK0FB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-dashboard.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c1ZtK0FB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-dashboard.jpg" alt="Chaos Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's clear in the monitor that TimeChaos was injected every minute, and each injection lasted 10 seconds. What's more, TiDB was not affected by the injections: the bank program ran normally, and performance was not affected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try out Chaos Mesh
&lt;/h2&gt;

&lt;p&gt;As a cloud-native chaos engineering platform, Chaos Mesh features all-around &lt;a href="https://pingcap.com/blog/chaos-mesh-your-chaos-engineering-solution-for-system-resiliency-on-kubernetes/"&gt;fault injection methods for complex systems on Kubernetes&lt;/a&gt;, covering faults in Pods, the network, the file system, and even the kernel.&lt;/p&gt;

&lt;p&gt;Wanna have some hands-on experience in chaos engineering? Welcome to &lt;a href="https://github.com/pingcap/chaos-mesh"&gt;Chaos Mesh&lt;/a&gt;. This &lt;a href="https://pingcap.com/blog/run-first-chaos-experiment-in-ten-minutes/"&gt;10-minute tutorial&lt;/a&gt; will help you quickly get started with chaos engineering and run your first chaos experiment with Chaos Mesh.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>distributedsystems</category>
      <category>chaosengineering</category>
      <category>testing</category>
    </item>
    <item>
      <title>Run Your First Chaos Experiment in 10 Minutes</title>
      <dc:creator>CWen</dc:creator>
      <pubDate>Mon, 23 Mar 2020 05:50:34 +0000</pubDate>
      <link>https://forem.com/cwen/run-your-first-chaos-experiment-in-10-minutes-47fg</link>
      <guid>https://forem.com/cwen/run-your-first-chaos-experiment-in-10-minutes-47fg</guid>
      <description>&lt;p&gt;This article was originally published at &lt;a href="https://pingcap.com/blog/run-first-chaos-experiment-in-ten-minutes/"&gt;www.pingcap.com&lt;/a&gt; on Mar 18, 2020 &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QRNvY2Xo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/run-first-chaos-experiment-in-ten-minutes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QRNvY2Xo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/run-first-chaos-experiment-in-ten-minutes.png" alt="Run your first chaos experiment in 10 minutes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Chaos Engineering is a way to test a production software system's robustness by simulating unusual or disruptive conditions. For many people, however, the transition from learning Chaos Engineering to practicing it on their own systems is daunting. It sounds like one of those big ideas that require a fully-equipped team to plan ahead. Well, it doesn't have to be. To get started with chaos experimenting, you may be just one suitable platform away.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pingcap/chaos-mesh"&gt;Chaos Mesh&lt;/a&gt; is an &lt;strong&gt;easy-to-use&lt;/strong&gt;, open-source, cloud-native Chaos Engineering platform that orchestrates chaos in Kubernetes environments. This 10-minute tutorial will help you quickly get started with Chaos Engineering and run your first chaos experiment with Chaos Mesh.&lt;/p&gt;

&lt;p&gt;For more information about Chaos Mesh, refer to our &lt;a href="https://pingcap.com/blog/chaos-mesh-your-chaos-engineering-solution-for-system-resiliency-on-kubernetes/"&gt;previous article &lt;/a&gt;or the &lt;a href="https://github.com/pingcap/chaos-mesh"&gt;chaos-mesh project&lt;/a&gt; on GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  A preview of our little experiment
&lt;/h2&gt;

&lt;p&gt;Chaos experiments are similar to experiments we do in a science class. It's perfectly fine to simulate turbulent situations in a controlled environment. In our case here, we will be simulating network chaos on a small web application called &lt;a href="https://github.com/chaos-mesh/web-show"&gt;web-show&lt;/a&gt;. To visualize the chaos effect, web-show records the latency from its pod to the kube-controller pod (under the namespace of &lt;code&gt;kube-system&lt;/code&gt;) every 10 seconds.&lt;/p&gt;

&lt;p&gt;The following clip shows the process of installing Chaos Mesh, deploying web-show, and creating the chaos experiment within a few commands: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QeUB3MG8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://download.pingcap.com/images/blog/whole-process-of-chaos-experiment.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QeUB3MG8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://download.pingcap.com/images/blog/whole-process-of-chaos-experiment.gif" alt="The whole process of the chaos experiment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now it's your turn! It's time to get your hands dirty.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's get started!
&lt;/h2&gt;

&lt;p&gt;For our simple experiment, we use Kubernetes in Docker (&lt;a href="https://kind.sigs.k8s.io/"&gt;Kind&lt;/a&gt;) for Kubernetes development. Feel free to use &lt;a href="https://minikube.sigs.k8s.io/"&gt;Minikube&lt;/a&gt; or any existing Kubernetes cluster to follow along.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prepare the environment
&lt;/h3&gt;

&lt;p&gt;Before moving forward, make sure you have &lt;a href="https://git-scm.com/"&gt;Git&lt;/a&gt; and &lt;a href="https://www.docker.com/"&gt;Docker&lt;/a&gt; installed on your local computer, with Docker up and running. For macOS, it's recommended to allocate at least 6 CPU cores to Docker. For details, see &lt;a href="https://docs.docker.com/docker-for-mac/#advanced"&gt;Docker configuration for Mac&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Get Chaos Mesh:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/pingcap/chaos-mesh.git
&lt;span class="nb"&gt;cd &lt;/span&gt;chaos-mesh/
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Install Chaos Mesh with the &lt;code&gt;install.sh&lt;/code&gt; script:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./install.sh &lt;span class="nt"&gt;--local&lt;/span&gt; kind 
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;code&gt;install.sh&lt;/code&gt; is an automated shell script that checks your environment, installs Kind, launches Kubernetes clusters locally, and deploys Chaos Mesh. To see the detailed description of &lt;code&gt;install.sh&lt;/code&gt;, you can include the &lt;code&gt;--help&lt;/code&gt; option.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your local computer cannot pull images from &lt;code&gt;docker.io&lt;/code&gt; or &lt;code&gt;gcr.io&lt;/code&gt;, use the local gcr.io mirror and execute &lt;code&gt;./install.sh --local kind --docker-mirror&lt;/code&gt; instead.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set the system environment variable:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; ~/.bash_profile
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Depending on your network, these steps might take a few minutes.&lt;/li&gt;
&lt;li&gt;If you see an error message like this:&lt;/li&gt;
&lt;/ul&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;ERROR: failed to create cluster: failed to generate kubeadm config content: failed to get kubernetes version from node: failed to get file: &lt;span class="nb"&gt;command&lt;/span&gt; &lt;span class="s2"&gt;"docker exec --privileged kind-control-plane cat /kind/version"&lt;/span&gt; failed with error: &lt;span class="nb"&gt;exit &lt;/span&gt;status 1
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;increase the available resources for Docker on your local computer and execute the following command:&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;./install.sh &lt;span class="nt"&gt;--local&lt;/span&gt; kind &lt;span class="nt"&gt;--force-local-kube&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;

&lt;p&gt;When the process completes, you will see a message indicating that Chaos Mesh was installed successfully.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy the application
&lt;/h3&gt;

&lt;p&gt;The next step is to deploy an application for testing. Here we choose web-show, because it lets us directly observe the effects of network chaos. You can also deploy your own application for testing.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Deploy web-show with the &lt;code&gt;deploy.sh&lt;/code&gt; script:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Make sure you are in the Chaos Mesh directory &lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;examples/web-show &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; 
./deploy.sh 
&lt;/code&gt;&lt;/pre&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your local computer cannot pull images from &lt;code&gt;docker.io&lt;/code&gt;, use the local &lt;code&gt;gcr.io&lt;/code&gt; mirror and execute &lt;code&gt;./deploy.sh --docker-mirror&lt;/code&gt; instead. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Access the web-show application. From your web browser, go to &lt;code&gt;http://localhost:8081&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Create the chaos experiment
&lt;/h3&gt;

&lt;p&gt;Now that everything is ready, it's time to run your chaos experiment!&lt;/p&gt;

&lt;p&gt;Chaos Mesh uses &lt;a href="https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/"&gt;CustomResourceDefinitions&lt;/a&gt; (CRDs) to define chaos experiments. CRD objects are designed separately for different experiment scenarios, which keeps each object definition simple and focused. Currently, Chaos Mesh implements the PodChaos, NetworkChaos, IOChaos, TimeChaos, and KernelChaos objects, and we will support more fault injection types later.&lt;/p&gt;

&lt;p&gt;In this experiment, we are using &lt;a href="https://github.com/pingcap/chaos-mesh/blob/master/examples/web-show/network-delay.yaml"&gt;NetworkChaos&lt;/a&gt; for the chaos experiment. The NetworkChaos configuration file, written in YAML, is shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: pingcap.com/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-example
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      "app": "web-show"
  delay:
    latency: "10ms"
    correlation: "100"
    jitter: "0ms"
  duration: "30s"
  scheduler:
    cron: "@every 60s"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For detailed descriptions of NetworkChaos actions, see the &lt;a href="https://github.com/pingcap/chaos-mesh/wiki/Network-Chaos"&gt;Chaos Mesh wiki&lt;/a&gt;. In short, this configuration means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;target: &lt;code&gt;web-show&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;mission: inject a &lt;code&gt;10ms&lt;/code&gt; network delay every &lt;code&gt;60s&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;attack duration: &lt;code&gt;30s&lt;/code&gt; each time &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To start NetworkChaos, do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Run &lt;code&gt;network-delay.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Make sure you are in the chaos-mesh/examples/web-show directory&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; network-delay.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Access the web-show application. In your web browser, go to &lt;code&gt;http://localhost:8081&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;From the line graph, you can tell that there is a 10 ms network delay every 60 seconds.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
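&lt;p&gt;You can also inspect the running experiment from the command line. A sketch, assuming &lt;code&gt;kubectl&lt;/code&gt; is pointed at the kind cluster and the experiment was created in the &lt;code&gt;default&lt;/code&gt; namespace:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# list NetworkChaos objects and show the details of the experiment
kubectl get networkchaos
kubectl describe networkchaos network-delay-example
&lt;/code&gt;&lt;/pre&gt;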

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IWFFOUBG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/using-chaos-mesh-to-insert-delays-in-web-show.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IWFFOUBG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/using-chaos-mesh-to-insert-delays-in-web-show.png" alt="Using Chaos Mesh to insert delays in web-show"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations! You just stirred up a little bit of chaos. If you are intrigued and want to try out more chaos experiments with Chaos Mesh, check out &lt;a href="https://github.com/pingcap/chaos-mesh/tree/master/examples/web-show"&gt;examples/web-show&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Delete the chaos experiment
&lt;/h3&gt;

&lt;p&gt;Once you're finished testing, terminate the chaos experiment.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Delete &lt;code&gt;network-delay.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Make sure you are in the chaos-mesh/examples/web-show directory&lt;/span&gt;
kubectl delete &lt;span class="nt"&gt;-f&lt;/span&gt; network-delay.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Access the web-show application. From your web browser, go to &lt;code&gt;http://localhost:8081&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From the line graph, you can see the network latency level is back to normal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DwnXLlwQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/network-latency-level-is-back-to-normal.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DwnXLlwQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/network-latency-level-is-back-to-normal.png" alt="Network latency level is back to normal"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Delete Kubernetes clusters
&lt;/h3&gt;

&lt;p&gt;After you're done with the chaos experiment, execute the following command to delete the Kubernetes clusters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kind delete cluster &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;kind
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you encounter the &lt;code&gt;kind: command not found&lt;/code&gt; error, execute the &lt;code&gt;source ~/.bash_profile&lt;/code&gt; command first and then delete the Kubernetes clusters.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Cool! What's next?
&lt;/h2&gt;

&lt;p&gt;Congratulations on your first successful journey into Chaos Engineering. How does it feel? Chaos Engineering is easy, right? But perhaps Chaos Mesh isn't as easy to use as it could be: command-line operation is inconvenient, writing YAML files by hand is tedious, and checking experiment results is clumsy. Don't worry, Chaos Dashboard is on its way! Running chaos experiments from the web sure sounds exciting! If you'd like to help us build testing standards for cloud platforms or make Chaos Mesh better, we'd love to hear from you!&lt;/p&gt;

&lt;p&gt;If you find a bug or think something is missing, feel free to file an issue, open a pull request (PR), or join us on the #sig-chaos-mesh channel in the &lt;a href="https://pingcap.com/tidbslack"&gt;TiDB Community&lt;/a&gt; Slack workspace.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/pingcap/chaos-mesh"&gt;https://github.com/pingcap/chaos-mesh&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>chaosengineering</category>
      <category>testing</category>
      <category>docker</category>
    </item>
    <item>
      <title>Chaos Mesh - Your Chaos Engineering Solution for System Resiliency on Kubernetes</title>
      <dc:creator>CWen</dc:creator>
      <pubDate>Fri, 17 Jan 2020 12:57:47 +0000</pubDate>
      <link>https://forem.com/cwen/chaos-mesh-your-chaos-engineering-solution-for-system-resiliency-on-kubernetes-2571</link>
      <guid>https://forem.com/cwen/chaos-mesh-your-chaos-engineering-solution-for-system-resiliency-on-kubernetes-2571</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vltkz_3l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-engineering.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vltkz_3l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-engineering.png" alt="Chaos Engineering"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Chaos Mesh?
&lt;/h2&gt;

&lt;p&gt;In the world of distributed computing, faults can happen to your clusters unpredictably any time, anywhere. Traditionally we have unit tests and integration tests that guarantee a system is production ready, but these cover just the tip of the iceberg as clusters scale, complexities mount, and data volumes grow to petabyte levels. To better identify system vulnerabilities and improve resilience, Netflix invented &lt;a href="https://netflix.github.io/chaosmonkey/"&gt;Chaos Monkey&lt;/a&gt;, which injects various types of faults into infrastructure and business systems. This is how Chaos Engineering originated.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://pingcap.com/"&gt;PingCAP&lt;/a&gt;, we are facing the same problem while building &lt;a href="https://github.com/pingcap/tidb"&gt;TiDB&lt;/a&gt;, an open source distributed NewSQL database. Being fault tolerant, or resilient, holds especially true for us, because the most important asset for any database user, the data itself, is at stake. To ensure resilience, we started &lt;a href="https://pingcap.com/blog/chaos-practice-in-tidb/"&gt;practicing Chaos Engineering&lt;/a&gt; internally in our testing framework from a very early stage. However, as TiDB grew, so did the testing requirements. We realized that we needed a universal chaos testing platform, not just for TiDB, but also for other distributed systems. &lt;/p&gt;

&lt;p&gt;Therefore, we present to you Chaos Mesh, a cloud-native Chaos Engineering platform that orchestrates chaos experiments on Kubernetes environments. It's an open source project available at &lt;a href="https://github.com/pingcap/chaos-mesh"&gt;https://github.com/pingcap/chaos-mesh&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the following sections, I will share with you what Chaos Mesh is, how we design and implement it, and finally I will show you how you can use it in your environment. &lt;/p&gt;

&lt;h2&gt;
  
  
  What can Chaos Mesh do?
&lt;/h2&gt;

&lt;p&gt;Chaos Mesh is a versatile Chaos Engineering platform that features all-around fault injection methods for complex systems on Kubernetes, covering faults in Pod, network, file system, and even the kernel. &lt;/p&gt;

&lt;p&gt;Here is an example of how we use Chaos Mesh to locate a TiDB system bug. In this example, we simulate Pod downtime with our distributed storage engine (&lt;a href="https://pingcap.com/docs/stable/architecture/#tikv-server"&gt;TiKV&lt;/a&gt;) and observe changes in queries per second (QPS). Normally, if one TiKV node goes down, the QPS may experience a transient jitter before it returns to the pre-failure level. This is how we guarantee high availability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9uhJqYYW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-mesh-discovers-downtime-recovery-exceptions-in-tikv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9uhJqYYW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-mesh-discovers-downtime-recovery-exceptions-in-tikv.png" alt="Chaos Mesh discovers downtime recovery exceptions in TiKV"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see from the dashboard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;During the first two downtimes, the QPS returns to normal after about 1 minute.&lt;/li&gt;
&lt;li&gt;After the third downtime, however, the QPS takes much longer to recover—about 9 minutes. Such a long downtime is unexpected, and it would definitely impact online services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After some diagnosis, we found the TiDB cluster version under test (V3.0.1) had some tricky issues when handling TiKV downtimes. We resolved these issues in later versions.&lt;/p&gt;

&lt;p&gt;But Chaos Mesh can do a lot more than just simulate downtime. It also includes these fault injection methods: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pod-kill:&lt;/strong&gt; Simulates Kubernetes Pods being killed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pod-failure:&lt;/strong&gt; Simulates Kubernetes Pods being continuously unavailable &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;network-delay:&lt;/strong&gt; Simulates network delay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;network-loss:&lt;/strong&gt; Simulates network packet loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;network-duplication:&lt;/strong&gt; Simulates network packet duplication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;network-corrupt:&lt;/strong&gt; Simulates network packet corruption&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;network-partition:&lt;/strong&gt; Simulates network partition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I/O delay:&lt;/strong&gt; Simulates file system I/O delay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I/O errno:&lt;/strong&gt; Simulates file system I/O errors&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Design principles
&lt;/h2&gt;

&lt;p&gt;We built Chaos Mesh around three principles: it should be easy to use, scalable, and designed for Kubernetes. &lt;/p&gt;

&lt;h3&gt;
  
  
  Easy to use
&lt;/h3&gt;

&lt;p&gt;To be easy to use, Chaos Mesh must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Require no special dependencies, so that it can be deployed directly on Kubernetes clusters, including &lt;a href="https://github.com/kubernetes/minikube"&gt;Minikube&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Require no modification to the deployment logic of the system under test (SUT), so that chaos experiments can be performed in a production environment.&lt;/li&gt;
&lt;li&gt;Let users easily orchestrate fault injection behaviors in chaos experiments, easily view experiment status and results, and quickly roll back injected failures.&lt;/li&gt;
&lt;li&gt;Hide underlying implementation details so that users can focus on orchestrating the chaos experiments. &lt;/li&gt;
&lt;/ul&gt;
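&lt;p&gt;As a concrete example of the quick-rollback goal: because every experiment is a Kubernetes object, rolling back an injected failure is just a matter of deleting that object. A sketch, using a hypothetical experiment name:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# deleting the chaos object stops the fault injection and restores normal behavior
kubectl delete networkchaos network-delay-example
&lt;/code&gt;&lt;/pre&gt;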

&lt;h3&gt;
  
  
  Scalable
&lt;/h3&gt;

&lt;p&gt;Chaos Mesh should be scalable, so that we can "plug" new requirements into it conveniently without reinventing the wheel. Specifically, Chaos Mesh must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leverage existing implementations so that fault injection methods can be easily scaled.&lt;/li&gt;
&lt;li&gt;Easily integrate with other testing frameworks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Designed for Kubernetes
&lt;/h3&gt;

&lt;p&gt;In the container world, Kubernetes is the absolute leader. Its adoption has grown far beyond everyone's expectations, and it has won the container orchestration war. In essence, Kubernetes is an operating system for the cloud.&lt;/p&gt;

&lt;p&gt;TiDB is a cloud-native distributed database. Our internal automated testing platform was built on Kubernetes from the beginning. We run hundreds of TiDB clusters on Kubernetes every day for various experiments, including extensive chaos testing to simulate all kinds of failures or issues in a production environment. To support these chaos experiments, combining chaos with Kubernetes became a natural choice and principle for our implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  CustomResourceDefinitions design
&lt;/h2&gt;

&lt;p&gt;Chaos Mesh uses &lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/"&gt;CustomResourceDefinitions&lt;/a&gt; (CRD) to define chaos objects. In the Kubernetes realm, CRD is a mature solution for implementing custom resources, with abundant implementation cases and toolsets available. Using CRD makes Chaos Mesh naturally integrate with the Kubernetes ecosystem.&lt;/p&gt;

&lt;p&gt;Instead of defining all types of fault injections in a unified CRD object, we allow flexible and separate CRD objects for different types of fault injection. If we add a fault injection method that conforms to an existing CRD object, we scale directly based on this object; if it is a completely new method, we create a new CRD object for it. With this design, chaos object definitions and the logic implementation are extracted from the top level, which makes the code structure clearer. This approach also reduces the degree of coupling and the probability of errors. In addition, Kubernetes' &lt;a href="https://github.com/kubernetes-sigs/controller-runtime"&gt;controller-runtime&lt;/a&gt; is a great wrapper for implementing controllers. This saves us a lot of time because we don't have to repeatedly implement the same set of controllers for each CRD project.&lt;/p&gt;
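&lt;p&gt;For example, &lt;code&gt;pod-failure&lt;/code&gt; conforms to the existing PodChaos object, so it reuses the same CRD and only the &lt;code&gt;action&lt;/code&gt; (and its action-specific fields) changes. A sketch of such a spec, with illustrative values:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  action: pod-failure   # same PodChaos CRD, different action
  mode: one
  duration: "30s"
  selector:
    labelSelectors:
      "app": "my-app"   # hypothetical target label
&lt;/code&gt;&lt;/pre&gt;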

&lt;p&gt;Chaos Mesh implements the PodChaos, NetworkChaos, and IOChaos objects. The names clearly identify the corresponding fault injection types.&lt;/p&gt;

&lt;p&gt;For example, Pod crashing is a very common problem in a Kubernetes environment. Many native resource objects automatically handle such errors with typical actions such as creating a new Pod. But can our application really deal with such errors? What if the Pod won't start?&lt;/p&gt;

&lt;p&gt;With well-defined actions such as &lt;code&gt;pod-kill&lt;/code&gt;, PodChaos can help us pinpoint these kinds of issues more effectively. The PodChaos object uses the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pod-kill&lt;/span&gt;
 &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;one&lt;/span&gt;
 &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;tidb-cluster-demo&lt;/span&gt;
   &lt;span class="na"&gt;labelSelectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app.kubernetes.io/component"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tikv"&lt;/span&gt;
 &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="na"&gt;scheduler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@every&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2m"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;action&lt;/code&gt; attribute defines the specific error type to be injected. In this case, &lt;code&gt;pod-kill&lt;/code&gt; kills Pods randomly.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;selector&lt;/code&gt; attribute limits the chaos experiment to a specific scope. In this case, the scope is the TiKV Pods of the TiDB cluster in the &lt;code&gt;tidb-cluster-demo&lt;/code&gt; namespace.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;scheduler&lt;/code&gt; attribute defines the interval for each chaos fault action.&lt;/li&gt;
&lt;/ul&gt;
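&lt;p&gt;To observe such an experiment in action, you can watch the target Pods being killed and re-created. A sketch, assuming the namespace from the example above:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# watch TiKV Pods get killed and re-created every 2 minutes
kubectl get pods -n tidb-cluster-demo -w
&lt;/code&gt;&lt;/pre&gt;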

&lt;p&gt;For more details on CRD objects such as NetworkChaos and IOChaos, see the &lt;a href="https://github.com/pingcap/chaos-mesh"&gt;Chaos-mesh documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does Chaos Mesh work?
&lt;/h2&gt;

&lt;p&gt;With the CRD design settled, let's look at the big picture of how Chaos Mesh works. The following major components are involved: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;controller-manager&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Acts as the platform's "brain." It manages the life cycle of CRD objects and schedules chaos experiments. It has object controllers for scheduling CRD object instances, and the &lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/"&gt;admission-webhooks&lt;/a&gt; controller dynamically injects sidecar containers into Pods. &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;chaos-daemon&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Runs as a privileged DaemonSet that can operate on network devices and cgroups on the node.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;sidecar&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Runs as a special type of container that is dynamically injected into the target Pod by the admission-webhooks. For example, the &lt;code&gt;chaosfs&lt;/code&gt; sidecar container runs a fuse-daemon to hijack the I/O operation of the application container. &lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XwO6hjLm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-mesh-workflow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XwO6hjLm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-mesh-workflow.png" alt="Chaos Mesh workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is how these components streamline a chaos experiment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using a YAML file or the Kubernetes client, the user creates or updates chaos objects through the Kubernetes API server.&lt;/li&gt;
&lt;li&gt;Chaos Mesh watches the chaos objects through the API server and manages the lifecycle of chaos experiments via create, update, and delete events. In this process, controller-manager, chaos-daemon, and sidecar containers work together to inject errors.&lt;/li&gt;
&lt;li&gt;When admission-webhooks receives a Pod creation request, it dynamically modifies the Pod object; for example, it injects the sidecar container into the Pod.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Running chaos
&lt;/h2&gt;

&lt;p&gt;The above sections introduce how we design Chaos Mesh and how it works. Now let's get down to business and show you how to use Chaos Mesh. Note that the chaos testing time may vary depending on the complexity of the application to be tested and the test scheduling rules defined in the CRD. &lt;/p&gt;

&lt;h3&gt;
  
  
  Preparing the environment
&lt;/h3&gt;

&lt;p&gt;Chaos Mesh runs on Kubernetes v1.12 or later and is deployed and managed with Helm, a Kubernetes package management tool. Before you run Chaos Mesh, make sure that Helm is properly installed in the Kubernetes cluster. To set up the environment, do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Make sure you have a Kubernetes cluster. If you do, skip to step 2; otherwise, start one locally using the script provided by Chaos Mesh:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;// &lt;span class="nb"&gt;install &lt;/span&gt;kind 
curl &lt;span class="nt"&gt;-Lo&lt;/span&gt; ./kind https://github.com/kubernetes-sigs/kind/releases/download/v0.6.1/kind-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="nt"&gt;-amd64&lt;/span&gt;
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x ./kind
&lt;span class="nb"&gt;mv&lt;/span&gt; ./kind /some-dir-in-your-PATH/kind 

&lt;span class="c"&gt;# get script&lt;/span&gt;
git clone https://github.com/pingcap/chaos-mesh
&lt;span class="nb"&gt;cd &lt;/span&gt;chaos-mesh
&lt;span class="c"&gt;# start cluster&lt;/span&gt;
hack/kind-cluster-build.sh
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Starting Kubernetes clusters locally affects network-related fault injections.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the Kubernetes cluster is ready, use &lt;a href="https://helm.sh/"&gt;Helm&lt;/a&gt; and &lt;a href="https://kubernetes.io/docs/reference/kubectl/overview/"&gt;Kubectl&lt;/a&gt; to deploy Chaos Mesh:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/pingcap/chaos-mesh.git
&lt;span class="nb"&gt;cd &lt;/span&gt;chaos-mesh
&lt;span class="c"&gt;# create CRD resource&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; manifests/
// &lt;span class="nb"&gt;install &lt;/span&gt;chaos-mesh
helm &lt;span class="nb"&gt;install &lt;/span&gt;helm/chaos-mesh &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;chaos-mesh &lt;span class="nt"&gt;--namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;chaos-testing
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Wait until all components are installed, and check the installation status using:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;// check chaos-mesh status
kubectl get pods &lt;span class="nt"&gt;--namespace&lt;/span&gt; chaos-testing &lt;span class="nt"&gt;-l&lt;/span&gt; app.kubernetes.io/instance&lt;span class="o"&gt;=&lt;/span&gt;chaos-mesh
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;If the installation is successful, you can see all pods up and running. Now, time to play.&lt;/p&gt;

&lt;p&gt;You can run Chaos Mesh using a YAML definition or a Kubernetes API.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Running chaos using a YAML file
&lt;/h3&gt;

&lt;p&gt;You can define your own chaos experiments through the YAML file method, which provides a fast, convenient way to conduct chaos experiments after you deploy the application. To run chaos using a YAML file, follow the steps below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; For illustration purposes, we use TiDB as our system under test. You can use a target system of your choice, and modify the YAML file accordingly.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy a TiDB cluster named &lt;code&gt;chaos-demo-1&lt;/code&gt;. You can use &lt;a href="https://github.com/pingcap/tidb-operator"&gt;TiDB Operator&lt;/a&gt; to deploy TiDB. &lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create the YAML file named &lt;code&gt;kill-tikv.yaml&lt;/code&gt; and add the following content:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pingcap.com/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodChaos&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pod-kill-chaos-demo&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chaos-testing&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pod-kill&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;one&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;chaos-demo-1&lt;/span&gt;
    &lt;span class="na"&gt;labelSelectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app.kubernetes.io/component"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tikv"&lt;/span&gt;
  &lt;span class="na"&gt;scheduler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@every&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1m"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Save the file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To start chaos, &lt;code&gt;kubectl apply -f kill-tikv.yaml&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The following chaos experiment simulates the TiKV Pods being frequently killed in the &lt;code&gt;chaos-demo-1&lt;/code&gt; cluster:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OyWzX6ho--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://download.pingcap.com/images/blog/chaos-experiment-running.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OyWzX6ho--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://download.pingcap.com/images/blog/chaos-experiment-running.gif" alt="Chaos experiment running"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use a sysbench program to monitor the real-time QPS changes in the TiDB cluster. When errors are injected into the cluster, the QPS shows a drastic jitter, which means a specific TiKV Pod has been deleted and Kubernetes has re-created a new TiKV Pod.&lt;/p&gt;
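&lt;p&gt;A sketch of such a monitoring run; the connection parameters are placeholders, and the flags assume sysbench 1.0:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# run an OLTP workload against TiDB and report QPS once per second
sysbench oltp_read_write --mysql-host=&amp;lt;tidb-host&amp;gt; --mysql-port=4000 \
  --mysql-user=root --tables=16 --table-size=10000 \
  --report-interval=1 --time=600 run
&lt;/code&gt;&lt;/pre&gt;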

&lt;p&gt;For more YAML file examples, see &lt;a href="https://github.com/pingcap/chaos-mesh/tree/master/examples"&gt;https://github.com/pingcap/chaos-mesh/tree/master/examples&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running chaos using the Kubernetes API
&lt;/h3&gt;

&lt;p&gt;Chaos Mesh uses CRD to define chaos objects, so you can manipulate CRD objects directly through the Kubernetes API. This makes it very convenient to apply Chaos Mesh to your own applications with customized test scenarios and automated chaos experiments.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://github.com/pingcap/tipocket/tree/master/test-infra"&gt;test-infra&lt;/a&gt; project, we simulate potential errors in &lt;a href="https://github.com/pingcap/tipocket/blob/master/test-infra/tests/etcd/nemesis_test.go"&gt;etcd&lt;/a&gt; clusters on Kubernetes, including node restarts, network failures, and file system failures.&lt;/p&gt;

&lt;p&gt;The following is a Chaos Mesh sample script using the Kubernetes API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import (
    "context"

    "github.com/pingcap/chaos-mesh/api/v1alpha1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
  ...
  delay := &amp;amp;chaosv1alpha1.NetworkChaos{
        Spec: chaosv1alpha1.NetworkChaosSpec{...},
      }
      k8sClient := client.New(conf, client.Options{ Scheme: scheme.Scheme })
  k8sClient.Create(context.TODO(), delay)
      k8sClient.Delete(context.TODO(), delay)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What does the future hold?
&lt;/h2&gt;

&lt;p&gt;In this article, we introduced you to Chaos Mesh, our open source cloud-native Chaos Engineering platform. There are still many pieces in progress, with more details to unveil regarding the design, use cases, and development. Stay tuned. &lt;/p&gt;

&lt;p&gt;Open sourcing is just a starting point. In addition to the infrastructure-level chaos experiments introduced in previous sections, we are in the process of supporting a wider range of fault types at a finer granularity, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Injecting errors at the system call and kernel levels with the assistance of eBPF and other tools &lt;/li&gt;
&lt;li&gt;Injecting specific error types into the application function and statement levels by integrating &lt;a href="https://github.com/pingcap/failpoint"&gt;failpoint&lt;/a&gt;, which will cover scenarios that are otherwise impossible with conventional injection methods&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Moving forward, we will continuously improve the Chaos Mesh Dashboard, so that users can easily see if and how their online businesses are impacted by fault injections. In addition, our roadmap includes an easy-to-use fault orchestration interface. We're planning other cool features, such as Chaos Mesh Verifier and Chaos Mesh Cloud.&lt;/p&gt;

&lt;p&gt;If any of these sound interesting to you, join us in building a world class Chaos Engineering platform. May our applications dance in chaos on Kubernetes!&lt;/p&gt;

&lt;p&gt;If you find a bug or think something is missing, feel free to file an &lt;a href="https://github.com/pingcap/chaos-mesh/issues"&gt;issue&lt;/a&gt;, open a PR, or join us on the #sig-chaos-mesh channel in the &lt;a href="https://pingcap.com/tidbslack"&gt;TiDB Community&lt;/a&gt; Slack workspace.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/pingcap/chaos-mesh"&gt;https://github.com/pingcap/chaos-mesh&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>chaosengineering</category>
      <category>go</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
