<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: CWen</title>
    <description>The latest articles on Forem by CWen (@cwen).</description>
    <link>https://forem.com/cwen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F319215%2F73d5b3fd-f91a-450c-ab60-b5eab0b5eb31.jpeg</url>
      <title>Forem: CWen</title>
      <link>https://forem.com/cwen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/cwen"/>
    <language>en</language>
    <item>
      <title>Celebrating One Year of Chaos Mesh: Looking Back and Ahead</title>
      <dc:creator>CWen</dc:creator>
      <pubDate>Thu, 01 Apr 2021 08:59:31 +0000</pubDate>
      <link>https://forem.com/cwen/celebrating-one-year-of-chaos-mesh-looking-back-and-ahead-11mo</link>
      <guid>https://forem.com/cwen/celebrating-one-year-of-chaos-mesh-looking-back-and-ahead-11mo</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fcelebrating-one-year-of-chaos-mesh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fcelebrating-one-year-of-chaos-mesh.jpg" alt="Celebrating one year of Chaos Mesh"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's been a year since Chaos Mesh was first open-sourced on GitHub. Chaos Mesh started out as a mere fault injection tool and is now heading toward the goal of building a chaos engineering ecosystem. Meanwhile, the Chaos Mesh community has been built from scratch and helped &lt;a href="https://github.com/chaos-mesh/chaos-mesh" rel="noopener noreferrer"&gt;Chaos Mesh&lt;/a&gt; join CNCF as a Sandbox project.&lt;/p&gt;

&lt;p&gt;In this article, we will share with you how Chaos Mesh has grown and changed in the past year and also discuss its future goals and plans.&lt;/p&gt;

&lt;h2&gt;
  
  
  The project: thrive with a clear goal in mind
&lt;/h2&gt;

&lt;p&gt;In this past year, Chaos Mesh has grown at an impressive speed with the joint efforts of the community. From the very first version to the recently released &lt;a href="https://github.com/chaos-mesh/chaos-mesh/releases/tag/v1.1.0" rel="noopener noreferrer"&gt;v1.1.0&lt;/a&gt;, Chaos Mesh has been greatly improved in terms of functionality, ease of use, and security.&lt;/p&gt;

&lt;h3&gt;
  
  
  Functionality
&lt;/h3&gt;

&lt;p&gt;When first open-sourced, Chaos Mesh supported only three fault types: PodChaos, NetworkChaos, and &lt;a href="https://pingcap.com/blog/how-to-simulate-io-faults-at-runtime" rel="noopener noreferrer"&gt;IOChaos&lt;/a&gt;. Within only a year, Chaos Mesh has grown to perform all-around fault injection into the network, system clock, JVM applications, filesystems, operating systems, and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fchaos-mesh-functionalities.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fchaos-mesh-functionalities.jpg" alt="Chaos Mesh's functionalities"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Chaos Mesh's functionalities &lt;/p&gt;

&lt;p&gt;After continuous optimization, Chaos Mesh now provides a flexible scheduling mechanism, which enables users to better design their own chaos experiments. This laid the foundation for chaos orchestration.&lt;/p&gt;
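
&lt;p&gt;As a rough sketch, a scheduled experiment in the v1.x API looks like the following (the target namespace and the schedule here are placeholders for illustration, not a definitive configuration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one                   # affect one randomly chosen Pod
  selector:
    namespaces:
      - app-namespace         # illustrative target namespace
  duration: "30s"
  scheduler:
    cron: "@every 10m"        # repeat the experiment every 10 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;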

&lt;p&gt;In the meantime, we are happy to see that a number of users have started to &lt;a href="https://github.com/chaos-mesh/chaos-mesh/issues/1182" rel="noopener noreferrer"&gt;test Chaos Mesh on major cloud platforms&lt;/a&gt;, such as Amazon Web Services (AWS), Google Kubernetes Engine (GKE), Alibaba Cloud, and Tencent Cloud. We have continuously conducted compatibility testing and adaptation to support &lt;a href="https://github.com/chaos-mesh/chaos-mesh/pull/1330" rel="noopener noreferrer"&gt;fault injection for specific cloud platforms&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To better support Kubernetes native components and node-level failures, we developed &lt;a href="https://github.com/chaos-mesh/chaosd" rel="noopener noreferrer"&gt;Chaosd&lt;/a&gt;, which provides physical node-level fault injection. We're extensively testing and refining this feature for release within the next few months.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ease of use
&lt;/h3&gt;

&lt;p&gt;Ease of use has been one of the guiding principles of Chaos Mesh development since day one. You can deploy Chaos Mesh with a single command. The V1.0 release brought the long-awaited Chaos Dashboard, a one-stop web interface for orchestrating chaos experiments. You can define the scope of the chaos experiment, specify the type of chaos to inject, define scheduling rules, and observe the results of the chaos experiment—all in the same web interface with only a few clicks.&lt;/p&gt;
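
&lt;p&gt;For reference, a typical Helm-based deployment boils down to a few commands (chart values and the release namespace may differ in your environment):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Add the Chaos Mesh chart repository and install it into its own namespace
helm repo add chaos-mesh https://charts.chaos-mesh.org
kubectl create ns chaos-mesh
helm install chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-mesh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;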

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fchaos-mesh-dashboard.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fchaos-mesh-dashboard.jpg" alt="Chaos Mesh's dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Chaos Mesh's dashboard &lt;/p&gt;

&lt;p&gt;Prior to V1.0, many users reported being blocked by various configuration problems when injecting IOChaos faults. After intensive investigation and discussion, we abandoned the original sidecar implementation. Instead, we use chaos-daemon to dynamically enter the target Pod, which significantly simplifies the logic. This optimization makes dynamic I/O fault injection possible with Chaos Mesh, so users can focus solely on their experiments without worrying about additional configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;We have also improved the security of Chaos Mesh. It now provides a comprehensive set of selectors to control the scope of experiments and supports setting protected namespaces to shield important applications. What's more, namespace-scoped permissions allow users to limit the "blast radius" of a chaos experiment to a specific namespace.&lt;/p&gt;
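
&lt;p&gt;In an experiment definition, this scoping is expressed through the selector; a minimal sketch (the namespace and label values below are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  mode: one
  selector:
    namespaces:
      - staging               # only Pods in this namespace can be affected
    labelSelectors:
      "tier": "cache"         # narrow the target further by label
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;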

&lt;p&gt;In addition, Chaos Mesh directly reuses Kubernetes' native permission mechanism and supports permission verification on the Chaos Dashboard. This prevents other users' mistakes from causing your chaos experiments to fail or spin out of control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud native ecosystem: integrations and cooperations
&lt;/h2&gt;

&lt;p&gt;In July 2020, Chaos Mesh was successfully &lt;a href="https://pingcap.com/blog/announcing-chaos-mesh-as-a-cncf-sandbox-project" rel="noopener noreferrer"&gt;accepted as a CNCF Sandbox project&lt;/a&gt;. This shows that Chaos Mesh has received initial recognition from the cloud native community. At the same time, it means that Chaos Mesh has a clear mission: to promote the application of chaos engineering in the cloud native field and to cooperate with other cloud native projects so we can grow together.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grafana
&lt;/h3&gt;

&lt;p&gt;To further improve the observability of chaos experiments, we have included a separate &lt;a href="https://github.com/chaos-mesh/chaos-mesh-datasource" rel="noopener noreferrer"&gt;Grafana plug-in&lt;/a&gt; for Chaos Mesh, which allows users to directly display real-time chaos experiment information on the application monitoring panel. This way, users can simultaneously observe the running status of the application and the current chaos experiment information.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Actions
&lt;/h3&gt;

&lt;p&gt;To enable users to run chaos experiments even during the development phase, we developed the &lt;a href="https://github.com/chaos-mesh/chaos-mesh-action" rel="noopener noreferrer"&gt;chaos-mesh-action&lt;/a&gt; project, &lt;a href="https://pingcap.com/blog/chaos-mesh-action-integrate-chaos-engineering-into-your-ci" rel="noopener noreferrer"&gt;allowing Chaos Mesh to run in the workflow of GitHub Actions&lt;/a&gt;. This way, Chaos Mesh can easily be integrated into daily system development and testing. &lt;/p&gt;
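
&lt;p&gt;A hedged sketch of how such a workflow step might look: based on the chaos-mesh-action README, the experiment definition is passed in base64-encoded via an environment variable. The kind-cluster setup step and the secret name here are assumptions for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - name: Create a kind cluster
        uses: helm/kind-action@v1
      - name: Run chaos-mesh-action
        uses: chaos-mesh/chaos-mesh-action@master
        env:
          # Base64-encoded chaos experiment YAML, stored as a repository secret
          CFG_BASE64: ${{ secrets.CHAOS_CFG_BASE64 }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;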

&lt;h3&gt;
  
  
  TiPocket
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/pingcap/tipocket" rel="noopener noreferrer"&gt;TiPocket&lt;/a&gt; is &lt;a href="https://pingcap.com/blog/building-automated-testing-framework-based-on-chaos-mesh-and-argo" rel="noopener noreferrer"&gt;an automated test platform&lt;/a&gt; that integrates Chaos Mesh and Argo, a workflow engine designed for Kubernetes. TiPocket is designed to be a fully automated chaos engineering testing loop for &lt;a href="https://docs.pingcap.com/tidb/stable" rel="noopener noreferrer"&gt;TiDB&lt;/a&gt;, a distributed database. There are a number of steps when we conduct chaos experiments, including deploying applications, running workloads, injecting exceptions, and business checks. To fully automate these steps, Argo was integrated into TiPocket. Chaos Mesh provides rich fault injection, while Argo provides flexible orchestration and scheduling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftipocket.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftipocket.jpg" alt="TiPocket"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TiPocket &lt;/p&gt;

&lt;h2&gt;
  
  
  The community: built from the ground up
&lt;/h2&gt;

&lt;p&gt;Chaos Mesh is a community-driven project and cannot progress without an active, friendly, and open community. Since it was open-sourced, Chaos Mesh has quickly become one of the most eye-catching open-source projects in the chaos engineering world. Within a year, it has accumulated more than 3k stars on GitHub and attracted 70+ contributors. Adopters include Tencent Cloud, XPeng Motors, Dailymotion, NetEase Fuxi Lab, JuiceFS, APISIX, and Meituan. Looking back on the past year, the Chaos Mesh community was built from scratch and has laid the foundation for a transparent, open, friendly, and autonomous open source community.&lt;/p&gt;

&lt;h3&gt;
  
  
  Becoming part of the CNCF family
&lt;/h3&gt;

&lt;p&gt;Cloud native has been in the DNA of Chaos Mesh since the very beginning. Joining CNCF was a natural choice and marks a critical step toward becoming a vendor-neutral, open, and transparent open-source community. Aside from integration within the cloud native ecosystem, joining CNCF gives Chaos Mesh:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;More community and project exposure&lt;/p&gt;

&lt;p&gt;Collaborations with other projects and various cloud native community activities, such as Kubernetes Meetups and KubeCon, have presented us with great opportunities to communicate with the community. We are amazed at how the high-quality content produced by the community has played a positive and far-reaching role in promoting Chaos Mesh.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A more complete and open community framework&lt;/p&gt;

&lt;p&gt;CNCF provides a rather mature framework for open-source community operations. Under CNCF's guidance, we established our basic community framework, including a Code of Conduct, Contributing Guide, and Roadmap. We've also created our own channel, #project-chaos-mesh, under CNCF's Slack.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A friendly and supportive community
&lt;/h3&gt;

&lt;p&gt;The quality of the open source community determines whether our adopters and contributors are willing to stick around and get involved in the community for the long run. In this regard, we've been working hard on: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuously enriching the documentation and optimizing its structure. So far, we have developed a complete set of documentation for different audiences, including &lt;a href="https://chaos-mesh.org/docs/user_guides/installation/" rel="noopener noreferrer"&gt;a user guide&lt;/a&gt;, a &lt;a href="https://chaos-mesh.org/docs/development_guides/development_overview" rel="noopener noreferrer"&gt;developer guide&lt;/a&gt;, &lt;a href="https://chaos-mesh.org/docs/get_started/get_started_on_kind" rel="noopener noreferrer"&gt;quick start guides&lt;/a&gt;, &lt;a href="https://chaos-mesh.org/docs/use_cases/multi_data_centers" rel="noopener noreferrer"&gt;use cases&lt;/a&gt;, and &lt;a href="https://github.com/chaos-mesh/chaos-mesh/blob/master/CONTRIBUTING.md" rel="noopener noreferrer"&gt;a contributing guide&lt;/a&gt;. All are updated with each release. &lt;/li&gt;
&lt;li&gt;Working with the community to publish blog posts, tutorials, use cases, and chaos engineering practices. So far, we've produced 26 Chaos Mesh-related articles. Among them is an &lt;a href="https://chaos-mesh.org/interactiveTutorial" rel="noopener noreferrer"&gt;interactive tutorial&lt;/a&gt; published on O'Reilly's Katacoda site. These materials are a great complement to the documentation.&lt;/li&gt;
&lt;li&gt;Repurposing and amplifying videos and tutorials generated in community meetings, webinars, and meetups.&lt;/li&gt;
&lt;li&gt;Valuing and responding to community feedback and queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Looking ahead
&lt;/h2&gt;

&lt;p&gt;Google's recent global outage reminded us of the importance of system reliability and highlighted the value of chaos engineering. Liz Rice, CNCF TOC Chair, shared &lt;a href="https://twitter.com/CloudNativeFdn/status/1329863326428499971" rel="noopener noreferrer"&gt;The 5 technologies to watch in 2021&lt;/a&gt;, and chaos engineering is at the top of the list. We boldly predict that chaos engineering is about to enter a new stage. &lt;/p&gt;

&lt;p&gt;Chaos Mesh 2.0 is now in active development. It addresses community requirements such as an embedded workflow engine to support defining and managing more flexible chaos scenarios, application state checking mechanisms, and more detailed experiment reports. Follow along through the project &lt;a href="https://github.com/chaos-mesh/chaos-mesh/blob/master/ROADMAP.md" rel="noopener noreferrer"&gt;roadmap&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Last but not least
&lt;/h2&gt;

&lt;p&gt;Chaos Mesh has grown so much in the past year, yet it is still young, and we have just set sail toward our goal. We invite all of you to participate and help build the chaos engineering ecosystem together!&lt;/p&gt;

&lt;p&gt;If you are interested in Chaos Mesh and would like to help us improve it, you're welcome to join &lt;a href="https://cloud-native.slack.com/join/shared_invite/zt-fyy3b8up-qHeDNVqbz1j8HDY6g1cY4w#/" rel="noopener noreferrer"&gt;our Slack channel&lt;/a&gt; or submit your pull requests or issues to our &lt;a href="https://github.com/chaos-mesh/chaos-mesh" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>chaosengineering</category>
      <category>chaosmesh</category>
      <category>testing</category>
    </item>
    <item>
      <title>Chaos Mesh X Hacktoberfest 2020 - An Invitation to Open Source</title>
      <dc:creator>CWen</dc:creator>
      <pubDate>Thu, 15 Oct 2020 06:14:54 +0000</pubDate>
      <link>https://forem.com/cwen/chaos-mesh-x-hacktoberfest-2020-an-invitation-to-open-source-1ade</link>
      <guid>https://forem.com/cwen/chaos-mesh-x-hacktoberfest-2020-an-invitation-to-open-source-1ade</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchaos-mesh.org%2Fassets%2Fimages%2Fchaos-mesh-x-hacktoberfest-ef5cfeca3e10bfe176b916c75d46f468.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchaos-mesh.org%2Fassets%2Fimages%2Fchaos-mesh-x-hacktoberfest-ef5cfeca3e10bfe176b916c75d46f468.jpg" alt="Chaos-Mesh-X-Hacktoberfest-An-Invitation-to-Open-Source"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/chaos-mesh/chaos-mesh" rel="noopener noreferrer"&gt;Chaos Mesh&lt;/a&gt; is proud to be in &lt;a href="https://hacktoberfest.digitalocean.com/" rel="noopener noreferrer"&gt;Hacktoberfest 2020&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Hosted by DigitalOcean, Intel, and DEV, Hacktoberfest is an open source celebration open to everyone in our global community. This month-long (Oct 1 - Oct 31) event encourages everyone to help drive the growth of open source and make positive contributions to an ever-growing community, whether you’re an experienced developer or an open-source newbie learning to code. As long as you submit 4 PRs before Oct 31, you are eligible to claim a limited-edition T-shirt (70,000 in total, on a first-come-first-served basis)!  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchaos-mesh.org%2Fassets%2Fimages%2Fhacktoberfest-shirt-18cdaf9caef5ce5bd0d032f5e3ca4878.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchaos-mesh.org%2Fassets%2Fimages%2Fhacktoberfest-shirt-18cdaf9caef5ce5bd0d032f5e3ca4878.png" alt="Hacktoberfest T-shirt"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Open source is the spirit
&lt;/h2&gt;

&lt;p&gt;Chaos Mesh has always been a dedicated and firm advocate of open source from day one. In the 10 months since it was open-sourced on December 31, 2019, Chaos Mesh has already received 2.5k GitHub stars, with 59 contributors from multiple organizations, and it was accepted as a &lt;a href="https://www.cncf.io/sandbox-projects/" rel="noopener noreferrer"&gt;CNCF sandbox project&lt;/a&gt; in July 2020. The amazing growth of both the project and the community would not have been possible without our shared commitment to the open-source spirit. &lt;/p&gt;

&lt;p&gt;We hereby invite you to join us, starting with our handpicked issues, with proper mentoring and assistance along the way. We hope you will find the journey rewarding, inspiring, and, most of all, fun. &lt;/p&gt;

&lt;h2&gt;
  
  
  How can you participate
&lt;/h2&gt;

&lt;p&gt;We are all set up for Hacktoberfest: we have labeled &lt;a href="https://github.com/chaos-mesh/chaos-mesh/issues?q=is%3Aissue+is%3Aopen+label%3AHacktoberfest" rel="noopener noreferrer"&gt;suitable issues&lt;/a&gt; with “Hacktoberfest” and updated the &lt;a href="https://github.com/chaos-mesh/chaos-mesh/blob/master/CONTRIBUTING.md" rel="noopener noreferrer"&gt;Contributing Guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;How can you participate? It could not be easier with the following steps: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign up for &lt;a href="https://hacktoberfest.digitalocean.com/login" rel="noopener noreferrer"&gt;Hacktoberfest&lt;/a&gt; using your GitHub account between Oct 1 and Oct 31.&lt;/li&gt;
&lt;li&gt;Pick up an issue. Note that the issues are still being updated, and you don’t have to limit yourself to issues with the Hacktoberfest label; they only serve as a starting point. &lt;/li&gt;
&lt;li&gt;Start coding and submit your PRs. Again, a PR does not need to correspond to a labeled issue.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Our maintainers review your PRs. Once 4 or more of them have been merged or approved, they will be automatically counted on the Hacktoberfest side, and you will be eligible to claim your swag.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchaos-mesh.org%2Fassets%2Fimages%2FPR-count-db779eed7a85e5d45e8e20d9f8e5e78e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchaos-mesh.org%2Fassets%2Fimages%2FPR-count-db779eed7a85e5d45e8e20d9f8e5e78e.png" alt="PR count"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If your PRs are merged or approved but you haven’t seen the number reflected on Hacktoberfest, comment under your PR.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strive for quality and learning, not spam
&lt;/h2&gt;

&lt;p&gt;In the spirit of open source and Hacktoberfest, we welcome all contributions and honor only valid PRs. However, we do not encourage or tolerate spammy contributions, which not only waste our maintainers’ time but also hurt the integrity of the entire open source community. Spammy PRs will be labeled "invalid" or "spam" and closed. &lt;/p&gt;

&lt;p&gt;Happy hacking! But don’t hack alone. Join #project-chaos-mesh in the &lt;a href="https://join.slack.com/t/cloud-native/shared_invite/zt-fyy3b8up-qHeDNVqbz1j8HDY6g1cY4w" rel="noopener noreferrer"&gt;CNCF Slack&lt;/a&gt; to share your experience, provide your feedback on your experience, or let us help with any problem you have. &lt;/p&gt;

</description>
      <category>hacktoberfest</category>
      <category>chaosengineering</category>
      <category>chaosmesh</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Building an Automated Testing Framework Based on Chaos Mesh® and Argo</title>
      <dc:creator>CWen</dc:creator>
      <pubDate>Thu, 20 Aug 2020 11:07:34 +0000</pubDate>
      <link>https://forem.com/cwen/building-an-automated-testing-framework-based-on-chaos-mesh-and-argo-1lni</link>
      <guid>https://forem.com/cwen/building-an-automated-testing-framework-based-on-chaos-mesh-and-argo-1lni</guid>
      <description>&lt;p&gt;This article was originally published at &lt;a href="https://pingcap.com/blog/building-automated-testing-framework-based-on-chaos-mesh-and-argo" rel="noopener noreferrer"&gt;www.pingcap.com&lt;/a&gt; on Apr 20, 2020 &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fautomated-chaos-testing-framework.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fautomated-chaos-testing-framework.jpg" alt="automated-chaos-testing-framework"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/chaos-mesh/chaos-mesh" rel="noopener noreferrer"&gt;Chaos Mesh&lt;/a&gt;® is an open-source chaos engineering platform for Kubernetes. Although it provides rich capabilities to simulate abnormal system conditions, it still only solves a fraction of the Chaos Engineering puzzle. Besides fault injection, a full chaos engineering application consists of hypothesizing around defined steady states, running experiments in production, validating the system via test cases, and automating the testing. &lt;/p&gt;

&lt;p&gt;This article describes how we use &lt;a href="https://github.com/pingcap/tipocket" rel="noopener noreferrer"&gt;TiPocket&lt;/a&gt;, an automated testing framework to build a full Chaos Engineering testing loop for TiDB, our distributed database.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why do we need TiPocket?
&lt;/h2&gt;

&lt;p&gt;Before we can put a distributed system like &lt;a href="https://github.com/pingcap/tidb" rel="noopener noreferrer"&gt;TiDB&lt;/a&gt; into production, we have to ensure that it is robust enough for day-to-day use. For this reason, several years ago we introduced Chaos Engineering into our testing framework. In our testing framework, we:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Observe the normal metrics and develop our testing hypothesis.&lt;/li&gt;
&lt;li&gt;Inject a list of failures into TiDB.&lt;/li&gt;
&lt;li&gt;Run various test cases to verify TiDB in fault scenarios. &lt;/li&gt;
&lt;li&gt;Monitor and collect test results for analysis and diagnosis.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This sounds like a solid process, and we’ve used it for years. However, as TiDB evolves, the testing scale multiplies. We have multiple fault scenarios, against which dozens of test cases run in the Kubernetes testing cluster. Even with Chaos Mesh helping to inject failures, the remaining work can still be demanding—not to mention the challenge of automating the pipeline to make the testing scalable and efficient.&lt;/p&gt;

&lt;p&gt;This is why we built TiPocket, a fully automated testing framework based on Kubernetes and Chaos Mesh. Currently, we mainly use it to test TiDB clusters. However, because of TiPocket’s Kubernetes-friendly design and extensible interface, it can easily be extended to support other applications through Kubernetes’ create and delete logic. &lt;/p&gt;
&lt;h2&gt;
  
  
  How does it work?
&lt;/h2&gt;

&lt;p&gt;Based on the above requirements, we need an automatic workflow that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Injects chaos&lt;/li&gt;
&lt;li&gt;Verifies the impact of that chaos&lt;/li&gt;
&lt;li&gt;Automates the chaos pipeline&lt;/li&gt;
&lt;li&gt;Visualizes the results&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Injecting chaos - Chaos Mesh
&lt;/h3&gt;

&lt;p&gt;Fault injection is at the core of chaos testing. In a distributed database, faults can happen anytime, anywhere—from node crashes, network partitions, and file system failures to kernel panics. This is where Chaos Mesh comes in. &lt;/p&gt;

&lt;p&gt;Currently, TiPocket supports the following types of fault injection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Network&lt;/strong&gt;: Simulates network partitions, random packet loss, disorder, duplication, or delay of links.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Time skew&lt;/strong&gt;: Simulates clock skew of the container to be tested. &lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Kill&lt;/strong&gt;: Kills the specified pod, either randomly in a cluster or within a component (TiDB, TiKV, or Placement Driver (PD)). &lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;I/O&lt;/strong&gt;: Injects I/O delays into TiDB’s storage engine, TiKV, to identify I/O-related issues.&lt;/li&gt;
&lt;/ul&gt;
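
&lt;p&gt;As one concrete example, a clock-skew experiment can be sketched like this (the label selector is illustrative, and the offset format may vary across Chaos Mesh versions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew-example
spec:
  mode: one
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "tikv"   # illustrative target
  timeOffset: "-10m"          # shift the container clock back 10 minutes
  duration: "30s"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;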

&lt;p&gt;With fault injection handled, we need to think about verification. How do we make sure TiDB can survive these faults? &lt;/p&gt;
&lt;h2&gt;
  
  
  Verifying chaos impacts: test cases
&lt;/h2&gt;

&lt;p&gt;To validate how TiDB withstands chaos, we implemented dozens of test cases in TiPocket, combined with a variety of inspection tools. To give you an overview of how TiPocket verifies TiDB in the event of failures, consider the following test cases. These cases focus on SQL execution, transaction consistency, and transaction isolation. &lt;/p&gt;
&lt;h3&gt;
  
  
  Fuzz testing: SQLsmith
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/pingcap/tipocket/tree/master/pkg/go-sqlsmith" rel="noopener noreferrer"&gt;SQLsmith&lt;/a&gt; is a tool that generates random SQL queries. TiPocket creates a TiDB cluster and a MySQL instance.. The random SQL generated by SQLsmith is executed on TiDB and MySQL, and various faults are injected into the TiDB cluster to test. In the end, execution results are compared. If we detect inconsistencies, there are potential issues with our system.&lt;/p&gt;
&lt;h3&gt;
  
  
  Transaction consistency testing: Bank and Porcupine
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/pingcap/tipocket/tree/master/cmd/bank" rel="noopener noreferrer"&gt;Bank&lt;/a&gt; is a classical test case that simulates the transfer process in a banking system. Under snapshot isolation, all transfers must ensure that the total amount of all accounts must be consistent at every moment, even in the face of system failures. If there are inconsistencies in the total amount, there are potential issues with our system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/anishathalye/porcupine" rel="noopener noreferrer"&gt;Porcupine&lt;/a&gt; is a linearizability checker in Go built to test the correctness of distributed systems. It takes a sequential specification as executable Go code, along with a concurrent history, and it determines whether the history is linearizable with respect to the sequential specification. In TiPocket, we use the &lt;a href="https://github.com/pingcap/tipocket/tree/master/pkg/check/porcupine" rel="noopener noreferrer"&gt;Porcupine&lt;/a&gt; checker in multiple test cases to check whether TiDB meets the linearizability constraint.  &lt;/p&gt;
&lt;h3&gt;
  
  
  Transaction isolation testing: Elle
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/jepsen-io/elle" rel="noopener noreferrer"&gt;Elle&lt;/a&gt; is an inspection tool that verifies a database’s transaction isolation level. TiPocket integrates &lt;a href="https://github.com/pingcap/tipocket/tree/master/pkg/elle" rel="noopener noreferrer"&gt;go-elle&lt;/a&gt;, the Go implementation of the Elle inspection tool, to verify TiDB’s isolation level.&lt;/p&gt;

&lt;p&gt;These are just a few of the test cases TiPocket uses to verify TiDB’s accuracy and stability. For more test cases and verification methods, see our &lt;a href="https://github.com/pingcap/tipocket" rel="noopener noreferrer"&gt;source code&lt;/a&gt;.  &lt;/p&gt;
&lt;h2&gt;
  
  
  Automating the chaos pipeline - Argo
&lt;/h2&gt;

&lt;p&gt;Now that we have Chaos Mesh to inject faults, a TiDB cluster to test, and ways to validate TiDB, how can we automate the chaos testing pipeline? Two options come to mind: we could implement the scheduling functionality in TiPocket, or hand over the job to existing open-source tools. To make TiPocket more dedicated to the testing part of our workflow, we chose the open-source tools approach. This, plus our all-in-K8s design, led us directly to &lt;a href="https://github.com/argoproj/argo" rel="noopener noreferrer"&gt;Argo&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;Argo is a workflow engine designed for Kubernetes. It has long been open source and has received widespread attention and adoption. &lt;/p&gt;

&lt;p&gt;Argo provides several custom resource definitions (CRDs) for workflows. The most important ones are Workflow Template, Workflow, and Cron Workflow. Here is how Argo fits into TiPocket: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Workflow Template&lt;/strong&gt; is a template defined in advance for each test task. Parameters can be passed in when the test is running.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Workflow&lt;/strong&gt; schedules multiple workflow templates in different orders, which form the tasks to be executed. Argo also lets you add conditions, loops, and directed acyclic graphs (DAGs) in the pipeline.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Cron Workflow&lt;/strong&gt; lets you schedule a workflow like a cron job. It is perfectly suitable for scenarios where you want to run test tasks for a long time.&lt;/li&gt;
&lt;/ul&gt;
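
&lt;p&gt;For long-running test tasks, a Cron Workflow can be sketched roughly as follows (the schedule and names are illustrative, assuming a workflow template has been registered separately):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-bank-test     # illustrative name
spec:
  schedule: "0 2 * * *"       # run every night at 02:00
  workflowSpec:
    workflowTemplateRef:
      name: tipocket-bank     # reference a pre-registered workflow template
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;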

&lt;p&gt;The sample workflow for our predefined bank test is shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  entrypoint: call-tipocket-bank
  arguments:
    parameters:
      - name: ns
        value: tipocket-bank
      - name: nemesis
        value: random_kill,kill_pd_leader_5min,partition_one,subcritical_skews,big_skews,shuffle-leader-scheduler,shuffle-region-scheduler,random-merge-scheduler
  templates:
    - name: call-tipocket-bank
      steps:
        - - name: call-wait-cluster
            templateRef:
              name: wait-cluster
              template: wait-cluster
        - - name: call-tipocket-bank
            templateRef:
              name: tipocket-bank
              template: tipocket-bank
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, the workflow template takes a &lt;code&gt;nemesis&lt;/code&gt; parameter that defines the specific failures to inject. You can reuse the template to define multiple workflows that suit different test cases, which lets you add more customized failure injections to the flow. &lt;/p&gt;
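&lt;p&gt;Because the fault set is just a workflow parameter, one way to reuse the same template against a different fault mix is to override the parameter at submission time. The command below is a sketch; the file name is hypothetical.&lt;/p&gt;

```
# Submit the bank workflow, overriding the namespace and the nemesis list
argo submit bank-workflow.yaml \
  -p ns=tipocket-bank \
  -p nemesis=partition_one,big_skews
```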

&lt;p&gt;Besides &lt;a href="https://github.com/pingcap/tipocket/tree/master/argo/workflow" rel="noopener noreferrer"&gt;TiPocket’s&lt;/a&gt; sample workflows and templates, the design also allows you to add your own failure injection flows. Handling complicated logic with codable workflows makes Argo developer-friendly and an ideal choice for our scenarios.&lt;/p&gt;

&lt;p&gt;Now, our chaos experiment runs automatically. But what if the results do not meet our expectations? How do we locate the problem? TiDB saves a variety of monitoring information, which makes log collection essential for enabling observability in TiPocket.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualizing the results: Loki
&lt;/h2&gt;

&lt;p&gt;In cloud-native systems, observability is very important. Generally speaking, you can achieve observability through &lt;strong&gt;metrics&lt;/strong&gt;, &lt;strong&gt;logging&lt;/strong&gt;, and &lt;strong&gt;tracing&lt;/strong&gt;. TiPocket’s main test cases evaluate TiDB clusters, so metrics and logs are our default sources for locating issues.&lt;/p&gt;

&lt;p&gt;On Kubernetes, Prometheus is the de-facto standard for metrics. However, there is no common way for log collection. Solutions such as &lt;a href="https://en.wikipedia.org/wiki/Elasticsearch" rel="noopener noreferrer"&gt;Elasticsearch&lt;/a&gt;, &lt;a href="https://fluentbit.io/" rel="noopener noreferrer"&gt;Fluent Bit&lt;/a&gt;, and &lt;a href="https://www.elastic.co/kibana" rel="noopener noreferrer"&gt;Kibana&lt;/a&gt; perform well, but they may cause system resource contention and high maintenance costs. We decided to use &lt;a href="https://github.com/grafana/loki" rel="noopener noreferrer"&gt;Loki&lt;/a&gt;, the Prometheus-like log aggregation system from &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Prometheus processes TiDB’s monitoring information. Prometheus and Loki have a similar labeling system, so we can easily combine Prometheus monitoring metrics with the corresponding pod logs and query both with a similar language. Grafana also supports Loki dashboards, which means we can display metrics and logs side by side. Because Grafana is TiDB’s built-in monitoring component, Loki can reuse it directly.&lt;/p&gt;
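&lt;p&gt;To illustrate the shared labeling model, a Prometheus metric query and a Loki log query for the same pod can use the same selector. The label and pod names below are illustrative, not TiPocket's actual configuration.&lt;/p&gt;

```
# PromQL: container restarts of a PD pod (metric from kube-state-metrics)
kube_pod_container_status_restarts_total{namespace="tipocket-bank", pod="tidb-app-pd-0"}

# LogQL: error logs from the same pod, selected with the same labels
{namespace="tipocket-bank", pod="tidb-app-pd-0"} |= "error"
```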

&lt;h2&gt;
  
  
  Putting them all together: TiPocket
&lt;/h2&gt;

&lt;p&gt;Now, everything is ready. Here is a simplified diagram of TiPocket: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftipocket-architecture.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Ftipocket-architecture.jpg" alt="architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the Argo workflow manages all chaos experiments and test cases. Generally, a complete test cycle involves the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Argo creates a Cron Workflow, which defines the cluster to be tested, the faults to inject, the test case, and the duration of the task. If necessary, the Cron Workflow also lets you view case logs in real-time.
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Fargo-workflow.jpg" alt="argo-workflow"&gt;
&lt;/li&gt;
&lt;li&gt;At the specified time, the Cron Workflow is triggered, and a separate TiPocket thread starts in the workflow. TiPocket sends TiDB-Operator the definition of the cluster to test. In turn, TiDB-Operator creates the target TiDB cluster, while Loki collects the related logs.&lt;/li&gt;
&lt;li&gt;Chaos Mesh injects faults in the cluster. &lt;/li&gt;
&lt;li&gt;Using the test cases mentioned above, the user validates the health of the system. Any test case failure leads to workflow failure in Argo, which triggers Alertmanager to send the result to the specified Slack channel. If the test cases complete normally, the cluster is cleared, and Argo stands by until the next test.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Falert-message.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdownload.pingcap.com%2Fimages%2Fblog%2Falert-message.jpg" alt="alert-message"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the complete TiPocket workflow. &lt;/p&gt;

&lt;h2&gt;
  
  
  Join us
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/pingcap/chaos-mesh" rel="noopener noreferrer"&gt;Chaos Mesh&lt;/a&gt;  and &lt;a href="https://github.com/pingcap/tipocket" rel="noopener noreferrer"&gt;TiPocket&lt;/a&gt; are both in active iterations. We have donated Chaos Mesh donated to &lt;a href="https://github.com/cncf/toc/pull/367" rel="noopener noreferrer"&gt;CNCF&lt;/a&gt;, and we look forward to more community members joining us in building a complete Chaos Engineering ecosystem. If this sounds interesting to you, check out our &lt;a href="https://chaos-mesh.org/" rel="noopener noreferrer"&gt;website&lt;/a&gt;, or join #chaos-mesh in &lt;a href="https://cloud-native.slack.com/archives/C018JJ686BS" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>chaosmesh</category>
      <category>chaosengineering</category>
      <category>argo</category>
      <category>testing</category>
    </item>
    <item>
      <title>Simulating Clock Skew in K8s Without Affecting Other Containers on the Node</title>
      <dc:creator>CWen</dc:creator>
      <pubDate>Wed, 22 Apr 2020 08:16:48 +0000</pubDate>
      <link>https://forem.com/cwen/simulating-clock-skew-in-k8s-without-affecting-other-containers-on-the-node-59oc</link>
      <guid>https://forem.com/cwen/simulating-clock-skew-in-k8s-without-affecting-other-containers-on-the-node-59oc</guid>
      <description>&lt;p&gt;This article was originally published at &lt;a href="https://pingcap.com/blog/simulating-clock-skew-in-k8s-without-affecting-other-containers-on-node/"&gt;www.pingcap.com&lt;/a&gt; on Apr 20, 2020 &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fM1Smb0I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/clock-sync-chaos-engineering-k8s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fM1Smb0I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/clock-sync-chaos-engineering-k8s.jpg" alt="Clock synchronization in distributed system"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pingcap/chaos-mesh"&gt;Chaos Mesh™&lt;/a&gt;, an easy-to-use, open-source, cloud-native chaos engineering platform for Kubernetes (K8s), has a new feature, TimeChaos, which simulates the &lt;a href="https://en.wikipedia.org/wiki/Clock_skew#On_a_network"&gt;clock skew&lt;/a&gt; phenomenon. Usually, when we modify clocks in a container, we want a &lt;a href="https://learning.oreilly.com/library/view/chaos-engineering/9781491988459/ch07.html"&gt;minimized blast radius&lt;/a&gt;, and we don't want the change to affect the other containers on the node. In reality, however, implementing this can be harder than you think. How does Chaos Mesh solve this problem? &lt;/p&gt;

&lt;p&gt;In this post, I'll describe how we hacked through different approaches of clock skew and how TimeChaos in Chaos Mesh enables time to swing freely in containers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simulating clock skew without affecting other containers on the node
&lt;/h2&gt;

&lt;p&gt;Clock skew refers to the time difference between clocks on nodes within a network. It might cause reliability problems in a distributed system, and it's a concern for designers and developers of complex distributed systems. For example, in a distributed SQL database, it's vital to maintain a synchronized local clock across nodes to achieve a consistent global snapshot and ensure the ACID properties for transactions.&lt;/p&gt;

&lt;p&gt;Currently, there are well-recognized &lt;a href="https://pingcap.com/blog/Time-in-Distributed-Systems/"&gt;solutions to synchronize clocks&lt;/a&gt;, but without proper testing, you can never be sure that your implementation is solid.&lt;/p&gt;

&lt;p&gt;Then how can we test global snapshot consistency in a distributed system? The answer is obvious: we can simulate clock skew to test whether distributed systems can keep a consistent global snapshot under abnormal clock conditions. Some testing tools support simulating clock skew in containers, but they have an impact on physical nodes. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pingcap/chaos-mesh/wiki/Time-Chaos"&gt;TimeChaos&lt;/a&gt; is a tool that &lt;strong&gt;simulates clock skew in containers to test how it impacts your application without affecting the whole node&lt;/strong&gt;. This way, we can precisely identify the potential consequences of clock skew and take measures accordingly. &lt;/p&gt;

&lt;h2&gt;
  
  
  Various approaches for simulating clock skew we've explored
&lt;/h2&gt;

&lt;p&gt;Reviewing the existing options, it is clear that they cannot be applied to Chaos Mesh, which runs on Kubernetes. Two common ways of simulating clock skew--changing the node clock directly and using the Jepsen framework--change the time for all processes on the node. Neither is acceptable for us: in a Kubernetes container, if we inject a clock skew error that affects the entire node, every other container on the same node is disturbed. Such a clumsy approach is not tolerable.&lt;/p&gt;

&lt;p&gt;Then how are we supposed to tackle this problem? The first idea that comes to mind is intercepting time calls in user space; failing that, we can look for a solution in the kernel using the &lt;a href="https://en.wikipedia.org/wiki/Berkeley_Packet_Filter"&gt;Berkeley Packet Filter&lt;/a&gt; (BPF).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;LD_PRELOAD&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;LD_PRELOAD&lt;/code&gt; is a Linux environment variable that lets you define which dynamic link library is loaded before the program execution. &lt;/p&gt;

&lt;p&gt;This variable has two advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can call our own functions without being aware of the source code.&lt;/li&gt;
&lt;li&gt;We can inject code into other programs to achieve specific purposes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For languages whose applications call the time functions in glibc, such as Rust and C, using &lt;code&gt;LD_PRELOAD&lt;/code&gt; is enough to simulate clock skew. But things are trickier for Golang, because it directly parses the virtual Dynamic Shared Object (&lt;a href="http://man7.org/linux/man-pages/man7/vdso.7.html"&gt;vDSO&lt;/a&gt;), a mechanism that speeds up system calls, to obtain the time function address. That means we can't simply use &lt;code&gt;LD_PRELOAD&lt;/code&gt; to intercept the glibc interface. Therefore, &lt;code&gt;LD_PRELOAD&lt;/code&gt; is not our solution. &lt;/p&gt;

&lt;h3&gt;
  
  
  Use BPF to modify the return value of &lt;code&gt;clock_gettime&lt;/code&gt; system call
&lt;/h3&gt;

&lt;p&gt;We also tried to filter the task &lt;a href="http://www.linfo.org/pid.html"&gt;process identification number&lt;/a&gt; (PID) with BPF. This way, we could simulate clock skew on a specified process and modify the return value of the &lt;code&gt;clock_gettime&lt;/code&gt; system call.&lt;/p&gt;

&lt;p&gt;This seemed like a good idea, but we encountered a problem: in most cases, vDSO speeds up &lt;code&gt;clock_gettime&lt;/code&gt;, so &lt;code&gt;clock_gettime&lt;/code&gt; doesn't actually make a system call. This approach didn't work, either. Oops.&lt;/p&gt;

&lt;p&gt;Thankfully, we determined that if the system kernel version is 4.18 or later, and if we use the &lt;a href="https://www.kernel.org/doc/html/latest/timers/hpet.html"&gt;HPET&lt;/a&gt; clock, &lt;code&gt;clock_gettime()&lt;/code&gt; gets time by making normal system calls instead of vDSO. We implemented &lt;a href="https://github.com/chaos-mesh/bpfki"&gt;a version of clock skew&lt;/a&gt; using this approach, and it works fine for Rust and C. As for Golang, the program can get the time right, but if we perform &lt;code&gt;sleep&lt;/code&gt; during the clock skew injection, the sleep operation is very likely to be blocked. Even after the injection is canceled, the system cannot recover. Thus, we have to give up this approach, too.&lt;/p&gt;

&lt;h2&gt;
  
  
  TimeChaos, our final hack
&lt;/h2&gt;

&lt;p&gt;From the previous section, we know that programs usually get the system time by calling &lt;code&gt;clock_gettime&lt;/code&gt;. In our case, &lt;code&gt;clock_gettime&lt;/code&gt; uses vDSO to speed up the calling process, so we cannot use &lt;code&gt;LD_PRELOAD&lt;/code&gt; to hack the &lt;code&gt;clock_gettime&lt;/code&gt; system calls. &lt;/p&gt;

&lt;p&gt;We figured out the cause; then what's the solution? Start with vDSO. If we can redirect the address of the &lt;code&gt;clock_gettime&lt;/code&gt; function in vDSO to a function we define, we can solve the problem.&lt;/p&gt;

&lt;p&gt;Easier said than done. To achieve this goal, we must tackle the following problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Know the user-mode address used by vDSO&lt;/li&gt;
&lt;li&gt;Know vDSO's kernel-mode address, if we want to modify the &lt;code&gt;clock_gettime&lt;/code&gt; function in vDSO from the kernel side&lt;/li&gt;
&lt;li&gt;Know how to modify vDSO data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, we need to peek inside vDSO. We can see the vDSO memory address in &lt;code&gt;/proc/pid/maps&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat /proc/pid/maps
...
7ffe53143000-7ffe53145000 r-xp 00000000 00:00 0                     [vdso]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last line is the vDSO information. The permissions of this memory region are &lt;code&gt;r-xp&lt;/code&gt;: readable and executable, but not writable. That means user mode cannot modify this memory, but we can use &lt;a href="http://man7.org/linux/man-pages/man2/ptrace.2.html"&gt;ptrace&lt;/a&gt; to bypass this restriction.&lt;/p&gt;

&lt;p&gt;Next, we use &lt;code&gt;gdb dump memory&lt;/code&gt; to export the vDSO and use &lt;code&gt;objdump&lt;/code&gt; to see what's inside. Here is what we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(gdb) dump memory vdso.so 0x00007ffe53143000 0x00007ffe53145000
$ objdump -T vdso.so
vdso.so:    file format elf64-x86-64
DYNAMIC SYMBOL TABLE:
ffffffffff700600  w  DF .text   0000000000000545  LINUX_2.6  clock_gettime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that the whole vDSO is like a &lt;code&gt;.so&lt;/code&gt; file, so we can parse it as an executable and linkable format (ELF) file. With this information, a basic workflow for implementing TimeChaos starts to take shape:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8MORyJlp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/timechaos-workflow.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8MORyJlp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/timechaos-workflow.jpg" alt="TimeChaos workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chart above shows the workflow of &lt;strong&gt;TimeChaos&lt;/strong&gt;, an implementation of clock skew in Chaos Mesh.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use ptrace to attach to the process with the specified PID, which stops it.&lt;/li&gt;
&lt;li&gt;Use ptrace to create a new mapping in the virtual address space of the calling process and use &lt;a href="https://linux.die.net/man/2/process_vm_writev"&gt;&lt;code&gt;process_vm_writev&lt;/code&gt;&lt;/a&gt; to write the &lt;code&gt;fake_clock_gettime&lt;/code&gt; function we defined into the memory space.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;process_vm_writev&lt;/code&gt; to write the specified parameters into &lt;code&gt;fake_clock_gettime&lt;/code&gt;. These parameters are the time we would like to inject, such as two hours backward or two days forward.&lt;/li&gt;
&lt;li&gt;Use ptrace to modify the &lt;code&gt;clock_gettime&lt;/code&gt; function in vDSO and redirect to the &lt;code&gt;fake_clock_gettime&lt;/code&gt; function.&lt;/li&gt;
&lt;li&gt;Use ptrace to detach the PID process.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you are interested in the details, see the &lt;a href="https://github.com/pingcap/chaos-mesh/blob/master/pkg/time/time_linux.go"&gt;Chaos Mesh GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simulating clock skew on a distributed SQL database
&lt;/h2&gt;

&lt;p&gt;Statistics speak volumes. Here we're going to try TimeChaos on &lt;a href="https://pingcap.com/docs/stable/overview/"&gt;TiDB&lt;/a&gt;, an open source, &lt;a href="https://en.wikipedia.org/wiki/NewSQL"&gt;NewSQL&lt;/a&gt;, distributed SQL database that supports &lt;a href="https://en.wikipedia.org/wiki/Hybrid_transactional/analytical_processing"&gt;Hybrid Transactional/Analytical Processing&lt;/a&gt; (HTAP) workloads, to see whether chaos testing really works.&lt;/p&gt;

&lt;p&gt;TiDB uses a centralized service, the Timestamp Oracle (TSO), to obtain a globally consistent version number and to ensure that the transaction version number increases monotonically. The TSO service is managed by the Placement Driver (PD) component. Therefore, we choose a random PD node and inject TimeChaos regularly, each time skewing its clock 600 seconds backward. Let's see if TiDB can meet the challenge.&lt;/p&gt;

&lt;p&gt;To better perform the testing, we use &lt;a href="https://github.com/cwen0/bank"&gt;bank&lt;/a&gt; as the workload. It simulates financial transfers in a banking system and is often used to verify the correctness of database transactions.&lt;/p&gt;
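&lt;p&gt;The idea behind the bank check can be sketched in two pieces of SQL (illustrative, not the workload's actual code): each transfer moves money inside a single transaction, and between rounds the total balance must be unchanged.&lt;/p&gt;

```
-- One transfer, executed atomically
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

-- Invariant checked between rounds: the sum never changes
SELECT SUM(balance) FROM accounts;
```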

&lt;p&gt;This is our test configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: pingcap.com/v1alpha1
kind: TimeChaos
metadata:
  name: time-skew-example
  namespace: tidb-demo
spec:
  mode: one
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "pd"
  timeOffset:
    sec: -600
  clockIds:
    - CLOCK_REALTIME
  duration: "10s"
  scheduler:
    cron: "@every 1m"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During this test, Chaos Mesh injects TimeChaos into a chosen PD Pod every minute, and each injection lasts 10 seconds. Within that duration, the time acquired by PD has a 600-second backward offset from the actual time. For further details, see the &lt;a href="https://github.com/pingcap/chaos-mesh/wiki/Time-Chaos"&gt;Chaos Mesh Wiki&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Let's create a TimeChaos experiment using the &lt;code&gt;kubectl apply&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f pd-time.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we can retrieve the PD log by the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs -n tidb-demo tidb-app-pd-0 | grep "system time jump backward"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[2020/03/24 09:06:23.164 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585041383060109693]
[2020/03/24 09:16:32.260 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585041992160476622]
[2020/03/24 09:20:32.059 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585042231960027622]
[2020/03/24 09:23:32.059 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585042411960079655]
[2020/03/24 09:25:32.059 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585042531963640321]
[2020/03/24 09:28:32.060 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585042711960148191]
[2020/03/24 09:33:32.063 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043011960517655]
[2020/03/24 09:34:32.060 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043071959942937]
[2020/03/24 09:35:32.059 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043131978582964]
[2020/03/24 09:36:32.059 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043191960687755]
[2020/03/24 09:38:32.060 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043311959970737]
[2020/03/24 09:41:32.060 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043491959970502]
[2020/03/24 09:45:32.061 +00:00] [ERROR] [systime_mon.go:32] ["system time jump backward"] [last=1585043731961304629]
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the log above, we see that every now and then, PD detects that the system time rolls back. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TimeChaos successfully simulates clock skew.&lt;/li&gt;
&lt;li&gt;PD can deal with the clock skew situation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's encouraging. But does TimeChaos affect services other than PD? We can check it out in the Chaos Dashboard:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c1ZtK0FB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-dashboard.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c1ZtK0FB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-dashboard.jpg" alt="Chaos Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's clear in the monitor that TimeChaos was injected every minute, and each injection lasted 10 seconds. What's more, TiDB was not affected by the injections: the bank program ran normally, and performance was not affected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try out Chaos Mesh
&lt;/h2&gt;

&lt;p&gt;As a cloud-native chaos engineering platform, Chaos Mesh features all-around &lt;a href="https://pingcap.com/blog/chaos-mesh-your-chaos-engineering-solution-for-system-resiliency-on-kubernetes/"&gt;fault injection methods for complex systems on Kubernetes&lt;/a&gt;, covering faults in Pods, the network, the file system, and even the kernel.&lt;/p&gt;

&lt;p&gt;Wanna have some hands-on experience in chaos engineering? Welcome to &lt;a href="https://github.com/pingcap/chaos-mesh"&gt;Chaos Mesh&lt;/a&gt;. This &lt;a href="https://pingcap.com/blog/run-first-chaos-experiment-in-ten-minutes/"&gt;10-minute tutorial&lt;/a&gt; will help you quickly get started with chaos engineering and run your first chaos experiment with Chaos Mesh.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>distributedsystems</category>
      <category>chaosengineering</category>
      <category>testing</category>
    </item>
    <item>
      <title>Run Your First Chaos Experiment in 10 Minutes</title>
      <dc:creator>CWen</dc:creator>
      <pubDate>Mon, 23 Mar 2020 05:50:34 +0000</pubDate>
      <link>https://forem.com/cwen/run-your-first-chaos-experiment-in-10-minutes-47fg</link>
      <guid>https://forem.com/cwen/run-your-first-chaos-experiment-in-10-minutes-47fg</guid>
      <description>&lt;p&gt;This article was originally published at &lt;a href="https://pingcap.com/blog/run-first-chaos-experiment-in-ten-minutes/"&gt;www.pingcap.com&lt;/a&gt; on Mar 18, 2020 &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QRNvY2Xo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/run-first-chaos-experiment-in-ten-minutes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QRNvY2Xo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/run-first-chaos-experiment-in-ten-minutes.png" alt="Run your first chaos experiment in 10 minutes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Chaos Engineering is a way to test a production software system's robustness by simulating unusual or disruptive conditions. For many people, however, the transition from learning Chaos Engineering to practicing it on their own systems is daunting. It sounds like one of those big ideas that require a fully-equipped team to plan ahead. Well, it doesn't have to be. To get started with chaos experimenting, you may be just one suitable platform away.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pingcap/chaos-mesh"&gt;Chaos Mesh&lt;/a&gt; is an &lt;strong&gt;easy-to-use&lt;/strong&gt;, open-source, cloud-native Chaos Engineering platform that orchestrates chaos in Kubernetes environments. This 10-minute tutorial will help you quickly get started with Chaos Engineering and run your first chaos experiment with Chaos Mesh.&lt;/p&gt;

&lt;p&gt;For more information about Chaos Mesh, refer to our &lt;a href="https://pingcap.com/blog/chaos-mesh-your-chaos-engineering-solution-for-system-resiliency-on-kubernetes/"&gt;previous article &lt;/a&gt;or the &lt;a href="https://github.com/pingcap/chaos-mesh"&gt;chaos-mesh project&lt;/a&gt; on GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  A preview of our little experiment
&lt;/h2&gt;

&lt;p&gt;Chaos experiments are similar to experiments we do in a science class. It's perfectly fine to simulate turbulent situations in a controlled environment. In our case here, we will be simulating network chaos on a small web application called &lt;a href="https://github.com/chaos-mesh/web-show"&gt;web-show&lt;/a&gt;. To visualize the chaos effect, web-show records the latency from its pod to the kube-controller pod (under the namespace of &lt;code&gt;kube-system&lt;/code&gt;) every 10 seconds.&lt;/p&gt;

&lt;p&gt;The following clip shows the process of installing Chaos Mesh, deploying web-show, and creating the chaos experiment within a few commands: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QeUB3MG8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://download.pingcap.com/images/blog/whole-process-of-chaos-experiment.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QeUB3MG8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://download.pingcap.com/images/blog/whole-process-of-chaos-experiment.gif" alt="The whole process of the chaos experiment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now it's your turn! It's time to get your hands dirty.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's get started!
&lt;/h2&gt;

&lt;p&gt;For our simple experiment, we use Kubernetes in Docker (&lt;a href="https://kind.sigs.k8s.io/"&gt;Kind&lt;/a&gt;) for Kubernetes development. Feel free to use &lt;a href="https://minikube.sigs.k8s.io/"&gt;Minikube&lt;/a&gt; or any existing Kubernetes cluster to follow along.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prepare the environment
&lt;/h3&gt;

&lt;p&gt;Before moving forward, make sure you have &lt;a href="https://git-scm.com/"&gt;Git&lt;/a&gt; and &lt;a href="https://www.docker.com/"&gt;Docker&lt;/a&gt; installed on your local computer, with Docker up and running. For macOS, it's recommended to allocate at least 6 CPU cores to Docker. For details, see &lt;a href="https://docs.docker.com/docker-for-mac/#advanced"&gt;Docker configuration for Mac&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Get Chaos Mesh:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/pingcap/chaos-mesh.git
&lt;span class="nb"&gt;cd &lt;/span&gt;chaos-mesh/
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Install Chaos Mesh with the &lt;code&gt;install.sh&lt;/code&gt; script:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./install.sh &lt;span class="nt"&gt;--local&lt;/span&gt; kind 
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;code&gt;install.sh&lt;/code&gt; is an automated shell script that checks your environment, installs Kind, launches Kubernetes clusters locally, and deploys Chaos Mesh. To see the detailed description of &lt;code&gt;install.sh&lt;/code&gt;, you can include the &lt;code&gt;--help&lt;/code&gt; option.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your local computer cannot pull images from &lt;code&gt;docker.io&lt;/code&gt; or &lt;code&gt;gcr.io&lt;/code&gt;, use the local gcr.io mirror and execute &lt;code&gt;./install.sh --local kind --docker-mirror&lt;/code&gt; instead.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set the system environment variable:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; ~/.bash_profile
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Depending on your network, these steps might take a few minutes.&lt;/li&gt;
&lt;li&gt;If you see an error message like this:&lt;/li&gt;
&lt;/ul&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;ERROR: failed to create cluster: failed to generate kubeadm config content: failed to get kubernetes version from node: failed to get file: &lt;span class="nb"&gt;command&lt;/span&gt; &lt;span class="s2"&gt;"docker exec --privileged kind-control-plane cat /kind/version"&lt;/span&gt; failed with error: &lt;span class="nb"&gt;exit &lt;/span&gt;status 1
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;increase the available resources for Docker on your local computer and execute the following command:&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;./install.sh &lt;span class="nt"&gt;--local&lt;/span&gt; kind &lt;span class="nt"&gt;--force-local-kube&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;

&lt;p&gt;When the process completes, you will see a message indicating that Chaos Mesh was installed successfully.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy the application
&lt;/h3&gt;

&lt;p&gt;The next step is to deploy an application for testing. Here we choose web-show, because it lets us directly observe the effects of network chaos. You can also deploy your own application for testing.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Deploy web-show with the &lt;code&gt;deploy.sh&lt;/code&gt; script:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Make sure you are in the Chaos Mesh directory &lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;examples/web-show &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; 
./deploy.sh 
&lt;/code&gt;&lt;/pre&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your local computer cannot pull images from &lt;code&gt;docker.io&lt;/code&gt;, use the local &lt;code&gt;gcr.io&lt;/code&gt; mirror and execute &lt;code&gt;./deploy.sh --docker-mirror&lt;/code&gt; instead. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Access the web-show application. From your web browser, go to &lt;code&gt;http://localhost:8081&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Create the chaos experiment
&lt;/h3&gt;

&lt;p&gt;Now that everything is ready, it's time to run your chaos experiment!&lt;/p&gt;

&lt;p&gt;Chaos Mesh uses &lt;a href="https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/"&gt;CustomResourceDefinitions&lt;/a&gt; (CRDs) to define chaos experiments. CRD objects are designed separately for different experiment scenarios, which keeps each object definition simple and focused. Currently, Chaos Mesh implements the PodChaos, NetworkChaos, IOChaos, TimeChaos, and KernelChaos objects, and we will support more fault injection types later.&lt;/p&gt;

&lt;p&gt;In this experiment, we are using &lt;a href="https://github.com/pingcap/chaos-mesh/blob/master/examples/web-show/network-delay.yaml"&gt;NetworkChaos&lt;/a&gt; for the chaos experiment. The NetworkChaos configuration file, written in YAML, is shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: pingcap.com/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-example
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      "app": "web-show"
  delay:
    latency: "10ms"
    correlation: "100"
    jitter: "0ms"
  duration: "30s"
  scheduler:
    cron: "@every 60s"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For detailed descriptions of NetworkChaos actions, see the &lt;a href="https://github.com/pingcap/chaos-mesh/wiki/Network-Chaos"&gt;Chaos Mesh wiki&lt;/a&gt;. In short, this configuration means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;target: &lt;code&gt;web-show&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;mission: inject a &lt;code&gt;10ms&lt;/code&gt; network delay every &lt;code&gt;60s&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;attack duration: &lt;code&gt;30s&lt;/code&gt; each time &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To start NetworkChaos, do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Run &lt;code&gt;network-delay.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Make sure you are in the chaos-mesh/examples/web-show directory&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; network-delay.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Access the web-show application. In your web browser, go to &lt;code&gt;http://localhost:8081&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;From the line graph, you can tell that there is a 10 ms network delay every 60 seconds.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
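&lt;p&gt;You can also inspect the running experiment from the command line. A sketch, assuming &lt;code&gt;kubectl&lt;/code&gt; is pointed at the kind cluster and the experiment was created in the &lt;code&gt;default&lt;/code&gt; namespace:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# list NetworkChaos objects and show the details of the experiment
kubectl get networkchaos
kubectl describe networkchaos network-delay-example
&lt;/code&gt;&lt;/pre&gt;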

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IWFFOUBG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/using-chaos-mesh-to-insert-delays-in-web-show.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IWFFOUBG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/using-chaos-mesh-to-insert-delays-in-web-show.png" alt="Using Chaos Mesh to insert delays in web-show"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations! You just stirred up a little bit of chaos. If you are intrigued and want to try out more chaos experiments with Chaos Mesh, check out &lt;a href="https://github.com/pingcap/chaos-mesh/tree/master/examples/web-show"&gt;examples/web-show&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Delete the chaos experiment
&lt;/h3&gt;

&lt;p&gt;Once you're finished testing, terminate the chaos experiment.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Delete &lt;code&gt;network-delay.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Make sure you are in the chaos-mesh/examples/web-show directory&lt;/span&gt;
kubectl delete &lt;span class="nt"&gt;-f&lt;/span&gt; network-delay.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Access the web-show application. From your web browser, go to &lt;code&gt;http://localhost:8081&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From the line graph, you can see the network latency level is back to normal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DwnXLlwQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/network-latency-level-is-back-to-normal.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DwnXLlwQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/network-latency-level-is-back-to-normal.png" alt="Network latency level is back to normal"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Delete Kubernetes clusters
&lt;/h3&gt;

&lt;p&gt;After you're done with the chaos experiment, execute the following command to delete the Kubernetes clusters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kind delete cluster &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;kind
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you encounter the &lt;code&gt;kind: command not found&lt;/code&gt; error, execute the &lt;code&gt;source ~/.bash_profile&lt;/code&gt; command first and then delete the Kubernetes clusters.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Cool! What's next?
&lt;/h2&gt;

&lt;p&gt;Congratulations on your first successful journey into Chaos Engineering. How does it feel? Chaos Engineering is easy, right? But perhaps Chaos Mesh isn't as easy to use as it could be: command-line operation is inconvenient, writing YAML files by hand is tedious, and checking experiment results is clumsy. Don't worry, Chaos Dashboard is on its way! Running chaos experiments from the web sure sounds exciting! If you'd like to help us build testing standards for cloud platforms or make Chaos Mesh better, we'd love to hear from you!&lt;/p&gt;

&lt;p&gt;If you find a bug or think something is missing, feel free to file an issue, open a pull request (PR), or join us on the #sig-chaos-mesh channel in the &lt;a href="https://pingcap.com/tidbslack"&gt;TiDB Community&lt;/a&gt; Slack workspace.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/pingcap/chaos-mesh"&gt;https://github.com/pingcap/chaos-mesh&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>chaosengineering</category>
      <category>testing</category>
      <category>docker</category>
    </item>
    <item>
      <title>Chaos Mesh - Your Chaos Engineering Solution for System Resiliency on Kubernetes</title>
      <dc:creator>CWen</dc:creator>
      <pubDate>Fri, 17 Jan 2020 12:57:47 +0000</pubDate>
      <link>https://forem.com/cwen/chaos-mesh-your-chaos-engineering-solution-for-system-resiliency-on-kubernetes-2571</link>
      <guid>https://forem.com/cwen/chaos-mesh-your-chaos-engineering-solution-for-system-resiliency-on-kubernetes-2571</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vltkz_3l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-engineering.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vltkz_3l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-engineering.png" alt="Chaos Engineering"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Chaos Mesh?
&lt;/h2&gt;

&lt;p&gt;In the world of distributed computing, faults can happen to your clusters unpredictably any time, anywhere. Traditionally we have unit tests and integration tests that guarantee a system is production ready, but these cover just the tip of the iceberg as clusters scale, complexities mount, and data volumes grow to petabyte levels. To better identify system vulnerabilities and improve resilience, Netflix invented &lt;a href="https://netflix.github.io/chaosmonkey/"&gt;Chaos Monkey&lt;/a&gt;, which injects various types of faults into infrastructure and business systems. This is how Chaos Engineering originated.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://pingcap.com/"&gt;PingCAP&lt;/a&gt;, we are facing the same problem while building &lt;a href="https://github.com/pingcap/tidb"&gt;TiDB&lt;/a&gt;, an open source distributed NewSQL database. Being fault tolerant, or resilient, holds especially true for us, because the most important asset for any database user, the data itself, is at stake. To ensure resilience, we started &lt;a href="https://pingcap.com/blog/chaos-practice-in-tidb/"&gt;practicing Chaos Engineering&lt;/a&gt; internally in our testing framework from a very early stage. However, as TiDB grew, so did the testing requirements. We realized that we needed a universal chaos testing platform, not just for TiDB, but also for other distributed systems. &lt;/p&gt;

&lt;p&gt;Therefore, we present to you Chaos Mesh, a cloud-native Chaos Engineering platform that orchestrates chaos experiments on Kubernetes environments. It's an open source project available at &lt;a href="https://github.com/pingcap/chaos-mesh"&gt;https://github.com/pingcap/chaos-mesh&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the following sections, I will share with you what Chaos Mesh is, how we design and implement it, and finally I will show you how you can use it in your environment. &lt;/p&gt;

&lt;h2&gt;
  
  
  What can Chaos Mesh do?
&lt;/h2&gt;

&lt;p&gt;Chaos Mesh is a versatile Chaos Engineering platform that features all-around fault injection methods for complex systems on Kubernetes, covering faults in Pod, network, file system, and even the kernel. &lt;/p&gt;

&lt;p&gt;Here is an example of how we use Chaos Mesh to locate a TiDB system bug. In this example, we simulate Pod downtime with our distributed storage engine (&lt;a href="https://pingcap.com/docs/stable/architecture/#tikv-server"&gt;TiKV&lt;/a&gt;) and observe changes in queries per second (QPS). Normally, if one TiKV node goes down, the QPS may experience a transient jitter before it returns to the pre-failure level. This is how we guarantee high availability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9uhJqYYW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-mesh-discovers-downtime-recovery-exceptions-in-tikv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9uhJqYYW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-mesh-discovers-downtime-recovery-exceptions-in-tikv.png" alt="Chaos Mesh discovers downtime recovery exceptions in TiKV"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see from the dashboard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;During the first two downtimes, the QPS returns to normal after about 1 minute.&lt;/li&gt;
&lt;li&gt;After the third downtime, however, the QPS takes much longer to recover—about 9 minutes. Such a long downtime is unexpected, and it would definitely impact online services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After some diagnosis, we found the TiDB cluster version under test (V3.0.1) had some tricky issues when handling TiKV downtimes. We resolved these issues in later versions.&lt;/p&gt;

&lt;p&gt;But Chaos Mesh can do a lot more than just simulate downtime. It also includes these fault injection methods: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pod-kill:&lt;/strong&gt; Simulates Kubernetes Pods being killed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pod-failure:&lt;/strong&gt; Simulates Kubernetes Pods being continuously unavailable &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;network-delay:&lt;/strong&gt; Simulates network delay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;network-loss:&lt;/strong&gt; Simulates network packet loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;network-duplication:&lt;/strong&gt; Simulates network packet duplication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;network-corrupt:&lt;/strong&gt; Simulates network packet corruption&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;network-partition:&lt;/strong&gt; Simulates network partition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I/O delay:&lt;/strong&gt; Simulates file system I/O delay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I/O errno:&lt;/strong&gt; Simulates file system I/O errors&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Design principles
&lt;/h2&gt;

&lt;p&gt;We built Chaos Mesh around three principles: it should be easy to use, scalable, and designed for Kubernetes. &lt;/p&gt;

&lt;h3&gt;
  
  
  Easy to use
&lt;/h3&gt;

&lt;p&gt;To be easy to use, Chaos Mesh must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Require no special dependencies, so that it can be deployed directly on Kubernetes clusters, including &lt;a href="https://github.com/kubernetes/minikube"&gt;Minikube&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Require no modification to the deployment logic of the system under test (SUT), so that chaos experiments can be performed in a production environment.&lt;/li&gt;
&lt;li&gt;Let users easily orchestrate fault injection behaviors in chaos experiments, easily view experiment status and results, and quickly roll back injected failures.&lt;/li&gt;
&lt;li&gt;Hide underlying implementation details so that users can focus on orchestrating the chaos experiments. &lt;/li&gt;
&lt;/ul&gt;
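&lt;p&gt;As a concrete example of the quick-rollback goal: because every experiment is a Kubernetes object, rolling back an injected failure is just a matter of deleting that object. A sketch, using a hypothetical experiment name:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# deleting the chaos object stops the fault injection and restores normal behavior
kubectl delete networkchaos network-delay-example
&lt;/code&gt;&lt;/pre&gt;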

&lt;h3&gt;
  
  
  Scalable
&lt;/h3&gt;

&lt;p&gt;Chaos Mesh should be scalable, so that we can "plug" new requirements into it conveniently without reinventing the wheel. Specifically, Chaos Mesh must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leverage existing implementations so that fault injection methods can be easily scaled.&lt;/li&gt;
&lt;li&gt;Easily integrate with other testing frameworks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Designed for Kubernetes
&lt;/h3&gt;

&lt;p&gt;In the container world, Kubernetes is the absolute leader. Its adoption has grown far beyond everyone's expectations, and it has won the container orchestration war. In essence, Kubernetes is an operating system for the cloud.&lt;/p&gt;

&lt;p&gt;TiDB is a cloud-native distributed database. Our internal automated testing platform was built on Kubernetes from the beginning. We run hundreds of TiDB clusters on Kubernetes every day for various experiments, including extensive chaos testing to simulate all kinds of failures or issues in a production environment. To support these chaos experiments, combining chaos with Kubernetes became a natural choice and principle for our implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  CustomResourceDefinitions design
&lt;/h2&gt;

&lt;p&gt;Chaos Mesh uses &lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/"&gt;CustomResourceDefinitions&lt;/a&gt; (CRD) to define chaos objects. In the Kubernetes realm, CRD is a mature solution for implementing custom resources, with abundant implementation cases and toolsets available. Using CRD makes Chaos Mesh naturally integrate with the Kubernetes ecosystem.&lt;/p&gt;

&lt;p&gt;Instead of defining all types of fault injections in a unified CRD object, we allow flexible and separate CRD objects for different types of fault injection. If we add a fault injection method that conforms to an existing CRD object, we scale directly based on this object; if it is a completely new method, we create a new CRD object for it. With this design, chaos object definitions and the logic implementation are extracted from the top level, which makes the code structure clearer. This approach also reduces the degree of coupling and the probability of errors. In addition, Kubernetes' &lt;a href="https://github.com/kubernetes-sigs/controller-runtime"&gt;controller-runtime&lt;/a&gt; is a great wrapper for implementing controllers. This saves us a lot of time because we don't have to repeatedly implement the same set of controllers for each CRD project.&lt;/p&gt;
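&lt;p&gt;For example, &lt;code&gt;pod-failure&lt;/code&gt; conforms to the existing PodChaos object, so it reuses the same CRD and only the &lt;code&gt;action&lt;/code&gt; (and its action-specific fields) changes. A sketch of such a spec, with illustrative values:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  action: pod-failure   # same PodChaos CRD, different action
  mode: one
  duration: "30s"
  selector:
    labelSelectors:
      "app": "my-app"   # hypothetical target label
&lt;/code&gt;&lt;/pre&gt;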

&lt;p&gt;Chaos Mesh implements the PodChaos, NetworkChaos, and IOChaos objects. The names clearly identify the corresponding fault injection types.&lt;/p&gt;

&lt;p&gt;For example, Pod crashing is a very common problem in a Kubernetes environment. Many native resource objects automatically handle such errors with typical actions such as creating a new Pod. But can our application really deal with such errors? What if the Pod won't start?&lt;/p&gt;

&lt;p&gt;With well-defined actions such as &lt;code&gt;pod-kill&lt;/code&gt;, PodChaos can help us pinpoint these kinds of issues more effectively. The PodChaos object uses the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pod-kill&lt;/span&gt;
 &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;one&lt;/span&gt;
 &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;tidb-cluster-demo&lt;/span&gt;
   &lt;span class="na"&gt;labelSelectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app.kubernetes.io/component"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tikv"&lt;/span&gt;
 &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="na"&gt;scheduler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@every&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2m"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;action&lt;/code&gt; attribute defines the specific error type to be injected. In this case, &lt;code&gt;pod-kill&lt;/code&gt; kills Pods randomly.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;selector&lt;/code&gt; attribute limits the chaos experiment to a specific scope. In this case, the scope is the TiKV Pods of the TiDB cluster in the &lt;code&gt;tidb-cluster-demo&lt;/code&gt; namespace.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;scheduler&lt;/code&gt; attribute defines the interval for each chaos fault action.&lt;/li&gt;
&lt;/ul&gt;
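&lt;p&gt;To observe such an experiment in action, you can watch the target Pods being killed and re-created. A sketch, assuming the namespace from the example above:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# watch TiKV Pods get killed and re-created every 2 minutes
kubectl get pods -n tidb-cluster-demo -w
&lt;/code&gt;&lt;/pre&gt;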

&lt;p&gt;For more details on CRD objects such as NetworkChaos and IOChaos, see the &lt;a href="https://github.com/pingcap/chaos-mesh"&gt;Chaos-mesh documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does Chaos Mesh work?
&lt;/h2&gt;

&lt;p&gt;With the CRD design settled, let's look at the big picture of how Chaos Mesh works. The following major components are involved: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;controller-manager&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Acts as the platform's "brain." It manages the life cycle of CRD objects and schedules chaos experiments. It has object controllers for scheduling CRD object instances, and the &lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/"&gt;admission-webhooks&lt;/a&gt; controller dynamically injects sidecar containers into Pods. &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;chaos-daemon&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Runs as a privileged DaemonSet that can operate on network devices and cgroups on the node.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;sidecar&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Runs as a special type of container that is dynamically injected into the target Pod by the admission-webhooks. For example, the &lt;code&gt;chaosfs&lt;/code&gt; sidecar container runs a fuse-daemon to hijack the I/O operation of the application container. &lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XwO6hjLm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-mesh-workflow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XwO6hjLm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://download.pingcap.com/images/blog/chaos-mesh-workflow.png" alt="Chaos Mesh workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is how these components streamline a chaos experiment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using a YAML file or the Kubernetes client, the user creates or updates chaos objects through the Kubernetes API server.&lt;/li&gt;
&lt;li&gt;Chaos Mesh watches the chaos objects through the API server and manages the lifecycle of chaos experiments via create, update, and delete events. In this process, controller-manager, chaos-daemon, and sidecar containers work together to inject errors.&lt;/li&gt;
&lt;li&gt;When admission-webhooks receives a Pod creation request, it dynamically modifies the Pod object; for example, it injects the sidecar container into the Pod.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Running chaos
&lt;/h2&gt;

&lt;p&gt;The above sections introduce how we design Chaos Mesh and how it works. Now let's get down to business and show you how to use Chaos Mesh. Note that the chaos testing time may vary depending on the complexity of the application to be tested and the test scheduling rules defined in the CRD. &lt;/p&gt;

&lt;h3&gt;
  
  
  Preparing the environment
&lt;/h3&gt;

&lt;p&gt;Chaos Mesh runs on Kubernetes v1.12 or later and is deployed and managed with Helm, a Kubernetes package management tool. Before you run Chaos Mesh, make sure that Helm is properly installed in the Kubernetes cluster. To set up the environment, do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Make sure you have a Kubernetes cluster. If you do, skip to step 2; otherwise, start one locally using the script provided by Chaos Mesh:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;// &lt;span class="nb"&gt;install &lt;/span&gt;kind 
curl &lt;span class="nt"&gt;-Lo&lt;/span&gt; ./kind https://github.com/kubernetes-sigs/kind/releases/download/v0.6.1/kind-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="nt"&gt;-amd64&lt;/span&gt;
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x ./kind
&lt;span class="nb"&gt;mv&lt;/span&gt; ./kind /some-dir-in-your-PATH/kind 

&lt;span class="c"&gt;# get script&lt;/span&gt;
git clone https://github.com/pingcap/chaos-mesh
&lt;span class="nb"&gt;cd &lt;/span&gt;chaos-mesh
&lt;span class="c"&gt;# start cluster&lt;/span&gt;
hack/kind-cluster-build.sh
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Starting Kubernetes clusters locally affects network-related fault injections.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the Kubernetes cluster is ready, use &lt;a href="https://helm.sh/"&gt;Helm&lt;/a&gt; and &lt;a href="https://kubernetes.io/docs/reference/kubectl/overview/"&gt;Kubectl&lt;/a&gt; to deploy Chaos Mesh:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/pingcap/chaos-mesh.git
&lt;span class="nb"&gt;cd &lt;/span&gt;chaos-mesh
&lt;span class="c"&gt;# create CRD resource&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; manifests/
// &lt;span class="nb"&gt;install &lt;/span&gt;chaos-mesh
helm &lt;span class="nb"&gt;install &lt;/span&gt;helm/chaos-mesh &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;chaos-mesh &lt;span class="nt"&gt;--namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;chaos-testing
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Wait until all components are installed, and check the installation status using:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;// check chaos-mesh status
kubectl get pods &lt;span class="nt"&gt;--namespace&lt;/span&gt; chaos-testing &lt;span class="nt"&gt;-l&lt;/span&gt; app.kubernetes.io/instance&lt;span class="o"&gt;=&lt;/span&gt;chaos-mesh
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;If the installation is successful, you can see all pods up and running. Now, time to play.&lt;/p&gt;

&lt;p&gt;You can run Chaos Mesh using a YAML definition or a Kubernetes API.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Running chaos using a YAML file
&lt;/h3&gt;

&lt;p&gt;You can define your own chaos experiments through the YAML file method, which provides a fast, convenient way to conduct chaos experiments after you deploy the application. To run chaos using a YAML file, follow the steps below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; For illustration purposes, we use TiDB as our system under test. You can use a target system of your choice, and modify the YAML file accordingly.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy a TiDB cluster named &lt;code&gt;chaos-demo-1&lt;/code&gt;. You can use &lt;a href="https://github.com/pingcap/tidb-operator"&gt;TiDB Operator&lt;/a&gt; to deploy TiDB. &lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create the YAML file named &lt;code&gt;kill-tikv.yaml&lt;/code&gt; and add the following content:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pingcap.com/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodChaos&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pod-kill-chaos-demo&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chaos-testing&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pod-kill&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;one&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;chaos-demo-1&lt;/span&gt;
    &lt;span class="na"&gt;labelSelectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app.kubernetes.io/component"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tikv"&lt;/span&gt;
  &lt;span class="na"&gt;scheduler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@every&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1m"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Save the file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To start chaos, &lt;code&gt;kubectl apply -f kill-tikv.yaml&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The following chaos experiment simulates the TiKV Pods being frequently killed in the &lt;code&gt;chaos-demo-1&lt;/code&gt; cluster:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OyWzX6ho--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://download.pingcap.com/images/blog/chaos-experiment-running.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OyWzX6ho--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://download.pingcap.com/images/blog/chaos-experiment-running.gif" alt="Chaos experiment running"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use a sysbench program to monitor the real-time QPS changes in the TiDB cluster. When errors are injected into the cluster, the QPS shows a drastic jitter, which means a specific TiKV Pod has been deleted and Kubernetes has re-created a new TiKV Pod.&lt;/p&gt;
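&lt;p&gt;A sketch of such a monitoring run; the connection parameters are placeholders, and the flags assume sysbench 1.0:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# run an OLTP workload against TiDB and report QPS once per second
sysbench oltp_read_write --mysql-host=&amp;lt;tidb-host&amp;gt; --mysql-port=4000 \
  --mysql-user=root --tables=16 --table-size=10000 \
  --report-interval=1 --time=600 run
&lt;/code&gt;&lt;/pre&gt;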

&lt;p&gt;For more YAML file examples, see &lt;a href="https://github.com/pingcap/chaos-mesh/tree/master/examples"&gt;https://github.com/pingcap/chaos-mesh/tree/master/examples&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running chaos using the Kubernetes API
&lt;/h3&gt;

&lt;p&gt;Chaos Mesh uses CRD to define chaos objects, so you can manipulate CRD objects directly through the Kubernetes API. This makes it very convenient to apply Chaos Mesh to your own applications with customized test scenarios and automated chaos experiments.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://github.com/pingcap/tipocket/tree/master/test-infra"&gt;test-infra&lt;/a&gt; project, we simulate potential errors in &lt;a href="https://github.com/pingcap/tipocket/blob/master/test-infra/tests/etcd/nemesis_test.go"&gt;etcd&lt;/a&gt; clusters on Kubernetes, including node restarts, network failures, and file system failures.&lt;/p&gt;

&lt;p&gt;The following is a Chaos Mesh sample script using the Kubernetes API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import (
    "context"

    "github.com/pingcap/chaos-mesh/api/v1alpha1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
  ...
  delay := &amp;amp;chaosv1alpha1.NetworkChaos{
        Spec: chaosv1alpha1.NetworkChaosSpec{...},
      }
      k8sClient := client.New(conf, client.Options{ Scheme: scheme.Scheme })
  k8sClient.Create(context.TODO(), delay)
      k8sClient.Delete(context.TODO(), delay)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What does the future hold?
&lt;/h2&gt;

&lt;p&gt;In this article, we introduced you to Chaos Mesh, our open source cloud-native Chaos Engineering platform. There are still many pieces in progress, with more details to unveil regarding the design, use cases, and development. Stay tuned. &lt;/p&gt;

&lt;p&gt;Open sourcing is just a starting point. In addition to the infrastructure-level chaos experiments introduced in previous sections, we are in the process of supporting a wider range of fault types at a finer granularity, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Injecting errors at the system call and kernel levels with the assistance of eBPF and other tools &lt;/li&gt;
&lt;li&gt;Injecting specific error types into the application function and statement levels by integrating &lt;a href="https://github.com/pingcap/failpoint"&gt;failpoint&lt;/a&gt;, which will cover scenarios that are otherwise impossible with conventional injection methods&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Moving forward, we will continuously improve the Chaos Mesh Dashboard, so that users can easily see if and how their online businesses are impacted by fault injections. In addition, our roadmap includes an easy-to-use fault orchestration interface. We're planning other cool features, such as Chaos Mesh Verifier and Chaos Mesh Cloud.&lt;/p&gt;

&lt;p&gt;If any of these sound interesting to you, join us in building a world class Chaos Engineering platform. May our applications dance in chaos on Kubernetes!&lt;/p&gt;

&lt;p&gt;If you find a bug or think something is missing, feel free to file an &lt;a href="https://github.com/pingcap/chaos-mesh/issues"&gt;issue&lt;/a&gt;, open a PR, or join us on the #sig-chaos-mesh channel in the &lt;a href="https://pingcap.com/tidbslack"&gt;TiDB Community&lt;/a&gt; Slack workspace.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/pingcap/chaos-mesh"&gt;https://github.com/pingcap/chaos-mesh&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>chaosengineering</category>
      <category>go</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
