<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Christopher Kujawa</title>
    <description>The latest articles on Forem by Christopher Kujawa (@chriskujawa).</description>
    <link>https://forem.com/chriskujawa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F838153%2Fc98bdc83-8a65-4ec6-a956-d9dca8e6ec06.jpg</url>
      <title>Forem: Christopher Kujawa</title>
      <link>https://forem.com/chriskujawa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/chriskujawa"/>
    <language>en</language>
    <item>
      <title>Why batching matters: Real-world example of performance</title>
      <dc:creator>Christopher Kujawa</dc:creator>
      <pubDate>Thu, 16 Jan 2025 21:20:38 +0000</pubDate>
      <link>https://forem.com/chriskujawa/why-batching-matters-real-world-example-of-performance-3m49</link>
      <guid>https://forem.com/chriskujawa/why-batching-matters-real-world-example-of-performance-3m49</guid>
      <description>&lt;h3&gt;
  
  
  Why batching matters: Real-world example of performance. The interaction of latency and throughput.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncvc1qieabxezky9klfh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncvc1qieabxezky9klfh.png" alt="Woodpile — Photo by Christopher Kujawa" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Have you ever had to collect a lot of items and move them from A to B? I’m pretty sure you have.&lt;/p&gt;

&lt;p&gt;It might be putting dishes into the dishwasher, gathering clothes for the laundry, moving house, or even rearranging some firewood, as I did.&lt;/p&gt;

&lt;p&gt;With every task, every day, we naturally improve our performance, for example by taking more objects at a time (batching) or taking a shorter route (reducing latency), and so reduce the total time the work takes. In this blog post, I want to bring you closer to the theory behind that.&lt;/p&gt;

&lt;p&gt;In my work as a software engineer at Camunda, where I do a lot of benchmarking, we talk about performance frequently. Terms like latency and throughput are used all day. It is not always clear or tangible for everyone what they mean, how they interact with each other, and especially how important the right batching is.&lt;/p&gt;

&lt;p&gt;As I’m interested in such topics personally and professionally, I would like to share a real-world situation I had and use it as an example to explain latency, throughput, and batching. My hope is that after reading this post you will have a better understanding of the interaction between latency and throughput and of why batching matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  A real-world example
&lt;/h3&gt;

&lt;p&gt;A while ago, I had a real-life challenge (or ambition?). I had the glorious idea to rearrange the firewood in my garden.&lt;/p&gt;

&lt;p&gt;The wood was next to my shed (place A) and I wanted to move it next to the entrance of the property (place B).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F938%2F0%2AzziCpIwjwqCVmSqV" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F938%2F0%2AzziCpIwjwqCVmSqV" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I started, I picked up two or three logs at a time and walked from place A to place B. I quickly realized that this was quite inefficient and would take ages (or at least it felt that way to me).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ymxufztuq1qkugxiqrd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ymxufztuq1qkugxiqrd.png" alt="Wood with wheelbarrow — Photo by Christopher Kujawa" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I realized that I had a wheelbarrow, so I started to fill it and walked back and forth with it until I was done. It felt (and was) far more efficient. While doing this, I thought that it was actually a great example of batching (and of finding the right limits). The idea for this blog post was born.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example in depth
&lt;/h3&gt;

&lt;p&gt;Let’s unfold the scenario and explain the example in more detail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F906%2F0%2AcaUiZm8EOGfbgJR5" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F906%2F0%2AcaUiZm8EOGfbgJR5" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have the place A, where the old woodpile is located. We want to move all the logs to place B (the new place). The way from A to B takes us around 20 seconds.&lt;/p&gt;

&lt;p&gt;For simplicity, we take this as constant. Imagine we are a robot that always walks at the same speed :). In reality, this is not true (especially if you work with software and networks).&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;There are some good definitions of latency out there. I don’t want to replace them; I just want to bring you closer to the topic.&lt;/p&gt;

&lt;p&gt;In our example, latency is the time it takes to pick up &lt;strong&gt;one&lt;/strong&gt; log at place A, walk to place B, and put it down there. This takes us 20 seconds.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Latency: A -&amp;gt; B = 20 s&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means the latency for moving one log is 20 seconds.&lt;/p&gt;

&lt;p&gt;For latency, lower values are better and higher values are worse. We always want to decrease the latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Throughput
&lt;/h3&gt;

&lt;p&gt;Throughput is how many logs we can move per unit of time. In our example, that is one (or more) per 20 seconds. Throughput is normally measured per second, which gives us:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Throughput = Amount of objects / latency&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For our example, this would mean: &lt;strong&gt;1/20 log/s = 0.05 log/s.&lt;/strong&gt; Our throughput is &lt;strong&gt;0.05 log/s&lt;/strong&gt;. In other words, we can move 0.05 log per second from A to B.&lt;/p&gt;

&lt;p&gt;Unlike latency, throughput is something we want to increase: higher values are better. If the latency goes down, the throughput goes up (as you can see in the formula above).&lt;/p&gt;

&lt;h3&gt;
  
  
  Batching
&lt;/h3&gt;

&lt;p&gt;Based on the formula above, we can see that if we increase the number of logs we move at once, we increase the throughput.&lt;/p&gt;

&lt;p&gt;This means when we start batching, we can increase the throughput.&lt;/p&gt;

&lt;p&gt;This is what I naturally did in the described scenario: taking more than one log and collecting them in my arms (this is our batch).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1023%2F0%2AIM5PB1RiSinXs6Dz" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1023%2F0%2AIM5PB1RiSinXs6Dz" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we take three logs at a time this would mean &lt;strong&gt;3/20 log/s = 0.15 log/s.&lt;/strong&gt; With that, we tripled the throughput! But this is &lt;strong&gt;only true if the batching itself is free.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I’m sure you have been in this situation at least once, when collecting dishes, clothes, or whatever. It takes some time to collect them and to carry and hold more of them (adding more to the batch). We call this time the &lt;strong&gt;delay&lt;/strong&gt; before you start the actual task of carrying them over.&lt;/p&gt;

&lt;p&gt;This delay is added to the actual latency of every item/log we collect. For simplicity let us say every log added to our batch takes &lt;strong&gt;1 second&lt;/strong&gt;. They are heavy, you have to pick them up, put them in your collection, etc.&lt;/p&gt;

&lt;p&gt;This means latency is now a function:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;latency(batch size) = 20s + batch size * 1s&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbztuohq6x5q3v8jj3m3w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbztuohq6x5q3v8jj3m3w.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we come back to our example of three logs instead of one, our latency is now &lt;strong&gt;23 seconds.&lt;/strong&gt; It takes 23 seconds for a log to move from A to B, because it first needs to be put into the batch, the batch needs to be filled up to its &lt;strong&gt;limit&lt;/strong&gt; (three logs), and only then is it moved. This is the &lt;strong&gt;maximum latency&lt;/strong&gt; in our case. The last log in the batch might have lower latency, but the maximum is 23 s.&lt;/p&gt;

&lt;p&gt;As our latency and batch size have changed, our throughput in consequence changed as well and is now: &lt;strong&gt;3/23 log/s ~ 0.130 log/s&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We can see a significant throughput &lt;strong&gt;increase&lt;/strong&gt; of &lt;strong&gt;160%&lt;/strong&gt;, to about 2.6 times the previous value (0.13 log/s vs. 0.05 log/s before), while the latency increased by only &lt;strong&gt;15%&lt;/strong&gt; (23 s vs. 20 s before).&lt;/p&gt;

&lt;p&gt;As I mentioned above I used a wheelbarrow, so we could increase the batch size even further, maybe to 10–15 logs.&lt;/p&gt;

&lt;p&gt;Batch size &lt;strong&gt;10 logs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency: 20 s + 10 logs * 1 s = 30s -&amp;gt; &lt;strong&gt;50% increase&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Throughput: 10/30 = ⅓ ~ 0.333 log/s -&amp;gt; &lt;strong&gt;567% increase&lt;/strong&gt; (about 6.7 times the original 0.05 log/s)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Batch size &lt;strong&gt;15 logs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency: 20s + 15 logs * 1 s = 35s -&amp;gt; &lt;strong&gt;75% increase&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Throughput: 15 / 35 = 3/7 ~ 0.429 log/s -&amp;gt; &lt;strong&gt;757% increase&lt;/strong&gt; (about 8.6 times the original 0.05 log/s)
&lt;/li&gt;
&lt;/ul&gt;
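
&lt;p&gt;The numbers above can be reproduced with a few lines of Python. This is only a small sketch: the constants are the ones assumed in the example (a 20 second walk and 1 second of delay per log), and the names &lt;code&gt;latency&lt;/code&gt;, &lt;code&gt;throughput&lt;/code&gt;, and &lt;code&gt;BASELINE&lt;/code&gt; are chosen here for illustration.&lt;/p&gt;

```python
WALK_SECONDS = 20.0          # time to walk from A to B, as assumed in the example
SECONDS_PER_LOG = 1.0        # delay for adding one log to the batch, as assumed
BASELINE = 1 / WALK_SECONDS  # one log per walk without batching delay: 0.05 log/s


def latency(batch_size):
    """Maximum latency of a log: fill the whole batch, then walk."""
    return WALK_SECONDS + batch_size * SECONDS_PER_LOG


def throughput(batch_size):
    """Logs moved per second for a given batch size."""
    return batch_size / latency(batch_size)


for size in (3, 10, 15):
    factor = throughput(size) / BASELINE
    print(f"batch={size:2d} latency={latency(size):.0f}s "
          f"throughput={throughput(size):.3f} log/s ({factor:.2f}x baseline)")
# batch= 3 latency=23s throughput=0.130 log/s (2.61x baseline)
# batch=10 latency=30s throughput=0.333 log/s (6.67x baseline)
# batch=15 latency=35s throughput=0.429 log/s (8.57x baseline)
```

&lt;p&gt;Expressing the gain as a factor of the baseline avoids mixing up "times as much" with "percent more".&lt;/p&gt;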

&lt;h3&gt;
  
  
  Total Execution Time
&lt;/h3&gt;

&lt;p&gt;Depending on your scenario/use case you might have to look at different metrics and tune them accordingly.&lt;/p&gt;

&lt;p&gt;Sometimes the latency of a single object is more important than the throughput of many; sometimes it is the total execution time that matters.&lt;/p&gt;

&lt;p&gt;In the software world, you often have an endless stream of data to process. In the physical world this is different: the data or objects are limited. Like my woodpile (luckily).&lt;/p&gt;

&lt;p&gt;To calculate the total execution time needed to move everything from A to B, we can use the following formula:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;total execution time = total amount / batch size * latency(batch size)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Taking one log
&lt;/h4&gt;

&lt;p&gt;Let’s say we have 200 logs in the woodpile, which we want to move. If we would take one log at a time it would take us:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;200 logs * 20 s = 4000 s; 4000 s / 60 ≈ 66.67 min.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After a little more than one hour we would be done with moving the woodpile.&lt;/p&gt;

&lt;h4&gt;
  
  
  Batch size three
&lt;/h4&gt;

&lt;p&gt;If we had increased the batch size to three, then this would mean:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;200 logs / 3 logs * 23 s ≈ 1533.33 s; 1533.33 s / 60 ≈ 25.56 min.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We would be done after around 25 minutes when taking three logs at a time.&lt;/p&gt;

&lt;h4&gt;
  
  
  Batch size ten
&lt;/h4&gt;

&lt;p&gt;With our wheelbarrow and taking ten logs:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;200 logs / 10 logs * 30 s = 600 s; 600 s / 60 = 10 min.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We would be done after 10 minutes.&lt;/p&gt;
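
&lt;p&gt;As a rough sketch, the formula can be written down in Python. The function names are made up for illustration. Two things are assumptions on top of the text: the second function rounds up to whole trips (the formula in the text allows fractional trips), and both functions charge the 1 second delay for a batch of one as well, whereas the one-log calculation above uses the plain 20 second walk.&lt;/p&gt;

```python
WALK_SECONDS = 20.0      # time to walk from A to B, as assumed in the example
SECONDS_PER_LOG = 1.0    # delay for adding one log to the batch, as assumed


def latency(batch_size):
    return WALK_SECONDS + batch_size * SECONDS_PER_LOG


def total_time_formula(total_logs, batch_size):
    """Formula from the text: total amount / batch size * latency(batch size)."""
    return total_logs / batch_size * latency(batch_size)


def total_time_whole_trips(total_logs, batch_size):
    """Assumption: the last, partially filled batch still needs its own walk."""
    full_trips, remainder = divmod(total_logs, batch_size)
    seconds = full_trips * latency(batch_size)
    if remainder:
        seconds += latency(remainder)
    return seconds


for size in (3, 10):
    print(f"batch={size:2d} formula={total_time_formula(200, size) / 60:.2f} min "
          f"whole trips={total_time_whole_trips(200, size) / 60:.2f} min")
# batch= 3 formula=25.56 min whole trips=25.67 min
# batch=10 formula=10.00 min whole trips=10.00 min
```

&lt;p&gt;The two variants only differ when the total is not a multiple of the batch size, and even then the difference is small compared to the gain from batching.&lt;/p&gt;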

&lt;h4&gt;
  
  
  This is why batching matters
&lt;/h4&gt;

&lt;p&gt;We learn this naturally as kids. If we take more at once and walk less, we are faster.&lt;/p&gt;

&lt;p&gt;It is &lt;strong&gt;important to note&lt;/strong&gt; that the total execution time behaves differently when the latency, which is a function of the batch size, grows non-linearly. It’s not uncommon for the latency to grow faster than the batch size, which leads to the so-called latency/throughput tradeoff.&lt;/p&gt;
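
&lt;p&gt;To make this tradeoff tangible, here is a purely hypothetical variation of the example in Python. It assumes that handling a batch gets disproportionately harder as it grows, modelled by a made-up quadratic term (&lt;code&gt;CROWDING_FACTOR&lt;/code&gt;). Neither the term nor its value comes from the woodpile; they only illustrate the shape of the curve.&lt;/p&gt;

```python
WALK_SECONDS = 20.0      # time to walk from A to B, as assumed in the example
SECONDS_PER_LOG = 1.0    # linear delay per log, as assumed in the example
CROWDING_FACTOR = 0.05   # hypothetical: extra seconds growing with the square of the batch size


def latency(batch_size):
    return (WALK_SECONDS
            + batch_size * SECONDS_PER_LOG
            + CROWDING_FACTOR * batch_size ** 2)


def throughput(batch_size):
    return batch_size / latency(batch_size)


# Search a range of batch sizes for the one with the highest throughput.
best = max(range(1, 61), key=throughput)
print(f"best batch size: {best}")

for size in (10, 20, 40, 60):
    print(f"batch={size:2d} latency={latency(size):5.1f}s "
          f"throughput={throughput(size):.3f} log/s")
```

&lt;p&gt;With these made-up numbers the throughput peaks at a batch size of 20 and falls again for larger batches, while the latency keeps rising. Past that point a bigger batch makes both metrics worse.&lt;/p&gt;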

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;We have seen that latency and batching influence throughput and the total execution time. They are closely connected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AO6ZGSF_JG-lkbctp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AO6ZGSF_JG-lkbctp" width="800" height="276"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Interaction of latency and throughput&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It is always a tradeoff, and it always depends on the situation. Finding the right balance between batch size, throughput, and good latency is an art.&lt;/p&gt;

&lt;p&gt;In software systems acting on requests, we need to keep the balance between being responsive, having an acceptable latency, and a good throughput (reacting/processing multiple requests at the same time). This means it doesn’t make sense to batch all requests forever and send them at once. Luckily we can parallelize more in software systems, which is a way to compensate for high latency.&lt;/p&gt;

&lt;p&gt;This is not easily doable in a real-world scenario, except if you have a big family or a lot of friends to ask for help. We are limited by physical laws (or to be specific to our example there is no more room in our wheelbarrow).&lt;/p&gt;

&lt;p&gt;As we have seen, batching can and will introduce delays and impact the latency of an individual object, such as a single log. In our example this is not an issue, as the important metric is the total execution time and the latency grows linearly with the batch size.&lt;/p&gt;

&lt;p&gt;There are situations where latency grows non-linearly with the batch size, which causes more trouble. In general, if we reduce the latency, or make sure that its growth with the batch size stays close to linear, we can improve the throughput.&lt;/p&gt;

&lt;p&gt;I hope this gave you some insights into how latency and throughput interact with each other and why batching matters.&lt;/p&gt;

&lt;p&gt;Thanks for reading this far. Let me know what you think and share your stories. :)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks to&lt;/em&gt; &lt;a href="https://github.com/lenaschoenburg" rel="noopener noreferrer"&gt;&lt;em&gt;Lena Schoenburg&lt;/em&gt;&lt;/a&gt; &lt;em&gt;and&lt;/em&gt; &lt;a href="https://github.com/entangled90" rel="noopener noreferrer"&gt;&lt;em&gt;Carlo Sana&lt;/em&gt;&lt;/a&gt; &lt;em&gt;for reviewing this post.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>example</category>
      <category>benchmark</category>
      <category>performance</category>
      <category>latency</category>
    </item>
    <item>
      <title>Zeebe Debug and Inspection tool</title>
      <dc:creator>Christopher Kujawa</dc:creator>
      <pubDate>Wed, 23 Aug 2023 14:43:58 +0000</pubDate>
      <link>https://forem.com/chriskujawa/zeebe-debug-and-inspection-tool-a5j</link>
      <guid>https://forem.com/chriskujawa/zeebe-debug-and-inspection-tool-a5j</guid>
      <description>&lt;p&gt;Have you ever had the case of an incident and didn’t know what this thing you’re running in production was actually doing or how it ended up in that state?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1adensjt4mysinmqy1fr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1adensjt4mysinmqy1fr.jpeg" alt="Crashed airplane" width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Photo from &lt;a href="https://unsplash.com/de/@jonathangallegos?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Jonathan Gallegos&lt;/a&gt; on &lt;a href="https://unsplash.com/de/fotos/MkAfH2N4l5g?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;With Zeebe (the &lt;a href="https://docs.camunda.io/docs/components/zeebe/zeebe-overview/" rel="noopener noreferrer"&gt;process automation engine powering Camunda Platform 8&lt;/a&gt;) we let our customers’ businesses fly. But what if the thing that makes the business fly breaks? Just as after an airplane crash, you need something to read the flight recorder on board.&lt;/p&gt;

&lt;p&gt;Today I want to introduce you to a tool we created for Zeebe in order to read this “flight recorder” (state) and support us in our incidents. In the past, if &lt;a href="https://camunda.com/platform/zeebe/" rel="noopener noreferrer"&gt;Zeebe&lt;/a&gt; ran into processing problems, there was no way to find out the last processing state. If no exporter was configured, or the exporters hadn’t exported for a while, it was even worse, since it was not clear what the last internal engine state was.&lt;/p&gt;

&lt;p&gt;To shed some light into the dark, we built a tool called &lt;a href="https://github.com/Zelldon/zdb/" rel="noopener noreferrer"&gt;zdb&lt;/a&gt; (Zeebe Debugger). &lt;a href="https://github.com/Zelldon/zdb/" rel="noopener noreferrer"&gt;Zdb&lt;/a&gt; is a Java (17) CLI tool to inspect the internal state and log of a Zeebe partition. It was kicked off during the &lt;a href="https://camunda.com/blog/2020/09/highlights-from-the-summer-hackdays-2020/#:~:text=Zeebe%20Event%20Log%20debug%20and%20inspection%20tool" rel="noopener noreferrer"&gt;Camunda Summer Hackdays in 2020&lt;/a&gt; (by &lt;a href="https://github.com/korthout" rel="noopener noreferrer"&gt;Nico Korthout&lt;/a&gt;, &lt;a href="https://github.com/deepthidevaki" rel="noopener noreferrer"&gt;Deepthi Akkoorath&lt;/a&gt;, and &lt;a href="https://github.com/zelldon" rel="noopener noreferrer"&gt;Christopher Kujawa&lt;/a&gt;) and has been maintained and developed by &lt;a href="https://github.com/zelldon" rel="noopener noreferrer"&gt;me&lt;/a&gt; since then. It has now reached version &lt;a href="https://github.com/Zelldon/zdb/releases/tag/1.8.0" rel="noopener noreferrer"&gt;1.8.0&lt;/a&gt;, which brings new features (printing and filtering the log in a nicer way).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;zdb&lt;/code&gt; allows us to find the root cause, create fixes, and be prepared for the next incident (since failures always happen eventually). We use it in many of our incidents when we need to take a look at the current state of Zeebe, and also when investigating certain bugs. With &lt;code&gt;zdb&lt;/code&gt;, we finally know what Zeebe was doing and how it got into that state.&lt;/p&gt;

&lt;p&gt;In the end, the goal is always to get our customers flying again and keep them there.&lt;/p&gt;




&lt;p&gt;In this blog post, I want to show you some examples of how we have used &lt;code&gt;zdb&lt;/code&gt; in the past, to give you some inspiration on how it might help you.&lt;/p&gt;

&lt;p&gt;Note: The output of zdb will always be JSON, which allows us to pipe it into &lt;a href="https://jqlang.github.io/jq/" rel="noopener noreferrer"&gt;jq&lt;/a&gt;, such that we can have nicer and filterable output. This is also used in our examples below.&lt;/p&gt;

&lt;h3&gt;
  
  
  General statistics
&lt;/h3&gt;

&lt;p&gt;Often when you start working on an incident you need to get a first overview or understanding of what the state generally contains (depending on the problems of course). Here &lt;code&gt;zdb&lt;/code&gt; can show you statistics of how many key-value pairs are stored in the internal state (in different column families).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;zdb state &lt;span class="nt"&gt;--path&lt;/span&gt; &amp;lt;path-to-runtime-or-snapshot&amp;gt; | jq
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"DEFAULT"&lt;/span&gt;: 1,
  &lt;span class="s2"&gt;"KEY"&lt;/span&gt;: 1,
  &lt;span class="s2"&gt;"PROCESS_VERSION"&lt;/span&gt;: 3,
  &lt;span class="s2"&gt;"PROCESS_CACHE"&lt;/span&gt;: 3,
  &lt;span class="s2"&gt;"PROCESS_CACHE_BY_ID_AND_VERSION"&lt;/span&gt;: 3,
  &lt;span class="s2"&gt;"PROCESS_CACHE_DIGEST_BY_ID"&lt;/span&gt;: 3,
  &lt;span class="s2"&gt;"ELEMENT_INSTANCE_PARENT_CHILD"&lt;/span&gt;: 6,
  &lt;span class="s2"&gt;"ELEMENT_INSTANCE_KEY"&lt;/span&gt;: 6,
  &lt;span class="s2"&gt;"ELEMENT_INSTANCE_CHILD_PARENT"&lt;/span&gt;: 6,
  &lt;span class="s2"&gt;"VARIABLES"&lt;/span&gt;: 12,
  &lt;span class="s2"&gt;"TIMERS"&lt;/span&gt;: 2,
  &lt;span class="s2"&gt;"TIMER_DUE_DATES"&lt;/span&gt;: 2,
  &lt;span class="s2"&gt;"JOBS"&lt;/span&gt;: 1,
  &lt;span class="s2"&gt;"JOB_STATES"&lt;/span&gt;: 1,
  &lt;span class="s2"&gt;"JOB_DEADLINES"&lt;/span&gt;: 1,
  &lt;span class="s2"&gt;"MESSAGE_START_EVENT_SUBSCRIPTION_BY_NAME_AND_KEY"&lt;/span&gt;: 1,
  &lt;span class="s2"&gt;"MESSAGE_START_EVENT_SUBSCRIPTION_BY_KEY_AND_NAME"&lt;/span&gt;: 1,
  &lt;span class="s2"&gt;"EVENT_SCOPE"&lt;/span&gt;: 3,
  &lt;span class="s2"&gt;"EXPORTER"&lt;/span&gt;: 2
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An experienced Zeebe engineer or power user can see here already how many processes have been deployed, how many instances, jobs, timers, messages, etc. have been created and are currently in the state. This often helps to determine where to look next.&lt;/p&gt;

&lt;p&gt;For example, if we see that there are incidents for process instances in the state, and the reported failure (ongoing incident) is about process instances not progressing, we would next check the &lt;a href="https://github.com/Zelldon/zdb#inspect-incidents" rel="noopener noreferrer"&gt;open incidents&lt;/a&gt; in the state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Restoring BPMN models
&lt;/h3&gt;

&lt;p&gt;There are cases where you might lose your models, or you just want to find out which model is currently deployed and actually executed. Here &lt;code&gt;zdb&lt;/code&gt; can help.&lt;/p&gt;

&lt;p&gt;First, you can print all deployed process model metadata (it will show information like process definition key, version, and name).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;zdb process list &lt;span class="nt"&gt;--path&lt;/span&gt; &amp;lt;path-to-runtime-or-snapshot&amp;gt; | jq
&lt;span class="o"&gt;[&lt;/span&gt;
  &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"bpmnProcessId"&lt;/span&gt;: &lt;span class="s2"&gt;"benchmark"&lt;/span&gt;,
    &lt;span class="s2"&gt;"resourceName"&lt;/span&gt;: &lt;span class="s2"&gt;"bpmn/one_task.bpmn"&lt;/span&gt;,
    &lt;span class="s2"&gt;"processDefinitionKey"&lt;/span&gt;: 2251799813685363,
    &lt;span class="s2"&gt;"version"&lt;/span&gt;: 1
  &lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"bpmnProcessId"&lt;/span&gt;: &lt;span class="s2"&gt;"timerProcess"&lt;/span&gt;,
    &lt;span class="s2"&gt;"resourceName"&lt;/span&gt;: &lt;span class="s2"&gt;"bpmn/timerProcess.bpmn"&lt;/span&gt;,
    &lt;span class="s2"&gt;"processDefinitionKey"&lt;/span&gt;: 2251799813685249,
    &lt;span class="s2"&gt;"version"&lt;/span&gt;: 1
  &lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"bpmnProcessId"&lt;/span&gt;: &lt;span class="s2"&gt;"msg_one_task"&lt;/span&gt;,
    &lt;span class="s2"&gt;"resourceName"&lt;/span&gt;: &lt;span class="s2"&gt;"bpmn/msg_one_task.bpmn"&lt;/span&gt;,
    &lt;span class="s2"&gt;"processDefinitionKey"&lt;/span&gt;: 2251799813685581,
    &lt;span class="s2"&gt;"version"&lt;/span&gt;: 1
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
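
&lt;p&gt;If you prefer a script over &lt;code&gt;jq&lt;/code&gt;, the same JSON can be processed with a few lines of Python. This is only a sketch: the helper name &lt;code&gt;keys_for&lt;/code&gt; is made up, and the sample is a shortened copy of the output shown above rather than something read from &lt;code&gt;zdb&lt;/code&gt; directly.&lt;/p&gt;

```python
import json

# Shortened copy of the `zdb process list` output shown above.
SAMPLE = """
[
  {"bpmnProcessId": "benchmark", "resourceName": "bpmn/one_task.bpmn",
   "processDefinitionKey": 2251799813685363, "version": 1},
  {"bpmnProcessId": "timerProcess", "resourceName": "bpmn/timerProcess.bpmn",
   "processDefinitionKey": 2251799813685249, "version": 1}
]
"""


def keys_for(process_list_json, bpmn_process_id):
    """Return the process definition keys of all entries with the given process id."""
    return [entry["processDefinitionKey"]
            for entry in json.loads(process_list_json)
            if entry["bpmnProcessId"] == bpmn_process_id]


print(keys_for(SAMPLE, "benchmark"))  # [2251799813685363]
```

&lt;p&gt;A key found this way is what the commands that take a process definition key expect as their argument.&lt;/p&gt;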



&lt;p&gt;With a specific process definition key, we can print the complete process entity. Piping it here to &lt;code&gt;jq&lt;/code&gt; allows us to filter for the resource, and the &lt;code&gt;--raw-output&lt;/code&gt; option returns us the resource string without quotes. We can then direct the output to a file and have the model restored (you can open it with for example the &lt;a href="https://camunda.com/download/modeler/" rel="noopener noreferrer"&gt;Camunda Modeler&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;zdb process entity 2251799813686656 &lt;span class="nt"&gt;--path&lt;/span&gt; &amp;lt;path-to-runtime-or-snapshot&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
| jq &lt;span class="nt"&gt;--raw-output&lt;/span&gt; &lt;span class="s1"&gt;'.resource'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; model.bpmn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xzi86a5sbcsujse5y0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xzi86a5sbcsujse5y0d.png" alt="BPMN model" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Restored Model&lt;/sup&gt;&lt;/center&gt;

&lt;h3&gt;
  
  
  Instances for a specific model
&lt;/h3&gt;

&lt;p&gt;Sometimes you’re interested in process instances of a specific process model.&lt;/p&gt;

&lt;p&gt;You might have deployed a broken model and want to cancel all of the existing instances (that happened to us), but first, you need to find out all the keys of such instances.&lt;/p&gt;

&lt;p&gt;You can use the following to print all instances for a certain process definition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;zdb process instances 2251799813685363 &lt;span class="nt"&gt;--path&lt;/span&gt; &amp;lt;path-to-runtime-or-snapshot&amp;gt; | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Printing the log
&lt;/h3&gt;

&lt;p&gt;One of our most used &lt;code&gt;zdb&lt;/code&gt; features is printing the entire log (default: as JSON).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;zdb log print &lt;span class="nt"&gt;--path&lt;/span&gt; &amp;lt;path-to-log&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the newest version (v1.8.0), &lt;code&gt;zdb&lt;/code&gt; supports some built-in filters, like filtering for the process instance key. This means only records that correspond to a certain process instance are printed. Furthermore, we can limit the output now, with &lt;code&gt;--fromPosition&lt;/code&gt; and &lt;code&gt;--toPosition&lt;/code&gt;. You can read more about it &lt;a href="https://github.com/Zelldon/zdb#print-log" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;JSON is not the only supported output format. &lt;code&gt;zdb&lt;/code&gt; can print the log in &lt;a href="https://graphviz.org/doc/info/lang.html" rel="noopener noreferrer"&gt;dot format&lt;/a&gt; as well, which allows tracing commands.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;zdb log print &lt;span class="nt"&gt;--format&lt;/span&gt; dot &lt;span class="nt"&gt;--path&lt;/span&gt; &amp;lt;path-to-log&amp;gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; output.dot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;a href="https://graphviz.org/" rel="noopener noreferrer"&gt;Graphviz&lt;/a&gt; you can visualize such dot files easily:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dot &lt;span class="nt"&gt;-Tsvg&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; output.svg output.dot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1z2olshfq2ryq018d6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1z2olshfq2ryq018d6l.png" alt="Trace" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Trace of log&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;Printing and investigating the log is interesting because not every command can be applied, and commands that are not applied are not reflected in the state. There can be many reasons: some commands are rejected due to wrong user input, a wrong process instance state, etc. These commands and their rejections are still part of the log (if compaction hasn’t happened yet) and can give you some interesting insights.&lt;/p&gt;

&lt;p&gt;I hope this small introduction and these examples gave you some inspiration on how you can use &lt;code&gt;zdb&lt;/code&gt; in your next incident or investigation related to Zeebe. If you want to know more, check out the &lt;a href="https://github.com/Zelldon/zdb" rel="noopener noreferrer"&gt;GitHub repository.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>debugging</category>
      <category>camunda</category>
      <category>zeebe</category>
      <category>incidentresponse</category>
    </item>
    <item>
      <title>Drinking Our Champagne: Chaos Experiments with Zeebe against Zeebe</title>
      <dc:creator>Christopher Kujawa</dc:creator>
      <pubDate>Thu, 10 Aug 2023 19:27:58 +0000</pubDate>
      <link>https://forem.com/chriskujawa/drinking-our-champagne-chaos-experiments-with-zeebe-against-zeebe-4gmm</link>
      <guid>https://forem.com/chriskujawa/drinking-our-champagne-chaos-experiments-with-zeebe-against-zeebe-4gmm</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwy0f8lkymbni05rupcgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwy0f8lkymbni05rupcgr.png" alt="drinking our champagne" width="768" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Image by Camunda on &lt;a href="https://camunda.com/blog/2023/08/automate-chaos-experiments/" rel="noopener noreferrer"&gt;Blog&lt;/a&gt;&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;At Camunda we have a mantra: &lt;a href="https://page.camunda.com/wp-automate-any-process-anywhere" rel="noopener noreferrer"&gt;Automate Any Process, Anywhere&lt;/a&gt;. Additionally, we’ll often say &lt;a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food" rel="noopener noreferrer"&gt;“eat your own dog food,” or “drink your own champagne.”&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two years ago, I wrote an article about how we can use Zeebe to orchestrate our chaos experiments; I called it: &lt;a href="https://zeebe-io.github.io/zeebe-chaos/2021/04/03/bpmn-meets-chaos-engineering/" rel="noopener noreferrer"&gt;BPMN meets chaos engineering&lt;/a&gt;. That was the result of a hack day project, in which I worked alongside my colleague &lt;a href="//mailto:philipp.ossler@camunda.com"&gt;Philipp Ossler&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since then, a lot of things have changed. We made a lot of improvements to our tooling, like creating our own chaos toolkit &lt;a href="https://medium.com/@zelldon91/zbchaos-a-new-fault-injection-tool-for-zeebe-cbda56c5ba8d" rel="noopener noreferrer"&gt;zbchaos&lt;/a&gt; to make it easier to run chaos experiments against Zeebe (which reached &lt;a href="https://github.com/zeebe-io/zeebe-chaos/releases/tag/zbchaos-v1.0.0" rel="noopener noreferrer"&gt;v1.0&lt;/a&gt;), improving the BPMN models in use, adding more experiments to it, etc.&lt;/p&gt;

&lt;p&gt;Today, I want to take a closer look and show you how we automate and orchestrate our chaos experiments with Zeebe against Zeebe. After reading this you will see how beneficial it is to use Zeebe as your chaos experiment orchestrator.&lt;/p&gt;

&lt;p&gt;You can use this knowledge in order to orchestrate your own chaos experiments, set up your own QA test suite or use Zeebe as your CI/CD framework. The use cases are endless. We will show you how you leverage the observability of the Camunda Platform stack and how it can help you to understand what is currently executed or where issues may lie.&lt;/p&gt;

&lt;p&gt;But first, let’s start with some basics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chaos engineering and experiments
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Chaos Engineering is the discipline of experimenting on a system&lt;/em&gt;&lt;br&gt;
&lt;em&gt;in order to build confidence in the system’s capability to withstand turbulent conditions in production.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://principlesofchaos.org/" rel="noopener noreferrer"&gt;&lt;em&gt;https://principlesofchaos.org/&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One of the principles of chaos engineering is automating defined experiments to ensure that no regression is introduced into the system at a later stage.&lt;/p&gt;

&lt;p&gt;A chaos experiment consists of multiple stages; three are important for automation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verification of the steady state hypothesis&lt;/li&gt;
&lt;li&gt;Running actions to introduce chaos&lt;/li&gt;
&lt;li&gt;Verification of the steady state hypothesis (that it still holds or has recovered)&lt;/li&gt;
&lt;/ul&gt;
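
&lt;p&gt;The three stages above can be sketched in a few lines of Python. This is only a minimal illustration, not how zbchaos or our process models are implemented: &lt;code&gt;run_experiment&lt;/code&gt;, &lt;code&gt;is_healthy&lt;/code&gt; and &lt;code&gt;restart_follower&lt;/code&gt; are hypothetical names, and the “cluster” is just a dictionary.&lt;/p&gt;

```python
from typing import Callable, List


def run_experiment(
    probes: List[Callable[[], bool]],
    actions: List[Callable[[], None]],
) -> bool:
    """Run one chaos experiment; return True if the steady state held."""
    # Stage 1: verify the steady-state hypothesis before touching the system.
    if not all(probe() for probe in probes):
        return False  # the system was not healthy to begin with, so skip the chaos

    # Stage 2: run the actions that introduce chaos.
    for action in actions:
        action()

    # Stage 3: verify the hypothesis again (it still holds or has recovered).
    return all(probe() for probe in probes)


# Hypothetical stand-ins for a real cluster: a health flag and a harmless action.
cluster = {"healthy": True}
log = []


def is_healthy() -> bool:
    log.append("probe")
    return cluster["healthy"]


def restart_follower() -> None:
    log.append("action")


result = run_experiment([is_healthy], [restart_follower])
```

&lt;p&gt;The same probes run before and after the chaos, which is what makes the experiment usable as an automated regression check.&lt;/p&gt;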

&lt;p&gt;These steps can also be cast into a BPMN model, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ANDW1DlnvzUACfL1_" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ANDW1DlnvzUACfL1_" alt="chaos experiment in BPMN" width="800" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Chaos experiment in BPMN&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;That is the backbone of our chaos experiment orchestration. Let’s take a closer look at the process models we designed and use now to automate and orchestrate our chaos experiments.&lt;/p&gt;

&lt;h3&gt;
  
  
  BPMN meets chaos engineering
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;If you are interested in the resources take a look at the corresponding GitHub repository&lt;/em&gt; &lt;a href="https://github.com/zeebe-io/zeebe-chaos/" rel="noopener noreferrer"&gt;&lt;em&gt;zeebe-io/zeebe-chaos/&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Chaos toolkit
&lt;/h3&gt;

&lt;p&gt;The first process model is called: “chaosToolkit” because it bundles all chaos experiments together. It reads the specifications of all existing chaos experiments (the specification for each experiment is stored in a JSON file, which we will see later) and executes them one by one &lt;a href="https://docs.camunda.io/docs/next/components/modeler/bpmn/multi-instance/" rel="noopener noreferrer"&gt;via a sequential multi-instance&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For readers with knowledge of BPMN, be aware that in earlier versions of Zeebe it was not possible to&lt;/em&gt; &lt;a href="https://docs.camunda.io/docs/next/components/modeler/bpmn/error-events/" rel="noopener noreferrer"&gt;&lt;em&gt;transfer variables with BPMN errors&lt;/em&gt;&lt;/a&gt;&lt;em&gt;, which is why we used return values of CallActivities and later interrupted the SubProcess.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AivI0P9qEi70mSaDN" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AivI0P9qEi70mSaDN" alt="BPMN Model: Chaostoolkit" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;BPMN Model: ChaosToolkit&lt;/sup&gt;&lt;/center&gt;
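
&lt;p&gt;In plain code, the idea behind this model is roughly the following loop. This is a hedged sketch: the directory layout (one folder per experiment, each with an &lt;code&gt;experiment.json&lt;/code&gt;) and the function names &lt;code&gt;load_specifications&lt;/code&gt; and &lt;code&gt;run_all&lt;/code&gt; are assumptions made for illustration. In the real setup, Zeebe’s sequential multi-instance does the iteration.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def load_specifications(directory: Path) -> List[Dict]:
    """Read every experiment.json below the directory, in a stable order."""
    return [
        json.loads(path.read_text())
        for path in sorted(directory.rglob("experiment.json"))
    ]


def run_all(specs: List[Dict], run_one: Callable[[Dict], bool]) -> Dict[str, bool]:
    """Run experiments one after another, like a sequential multi-instance."""
    results = {}
    for spec in specs:  # the next experiment starts only after the previous one finished
        results[spec["title"]] = run_one(spec)
    return results


# Demo with two hypothetical specifications written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("follower-restart", "leader-restart"):
        folder = Path(tmp) / name
        folder.mkdir()
        (folder / "experiment.json").write_text(json.dumps({"title": name}))
    specs = load_specifications(Path(tmp))

# The runner is a placeholder that reports success for every experiment.
results = run_all(specs, lambda spec: True)
```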

&lt;h3&gt;
  
  
  Chaos experiment
&lt;/h3&gt;

&lt;p&gt;The second BPMN model describes a single chaos experiment, which is why it is called “chaosExperiment”. It has similarities (the different stages) to the simplified version above.&lt;/p&gt;

&lt;p&gt;Here we see the three stages again: verification of the steady state, introducing chaos, and verification of the steady state once more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ANT0ajNgQU80BtM0R" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ANT0ajNgQU80BtM0R" alt="BPMN Model chaos experiment" width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;BPMN Model: Chaos Experiment&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;All of the &lt;a href="https://docs.camunda.io/docs/next/components/modeler/bpmn/call-activities/" rel="noopener noreferrer"&gt;call activities&lt;/a&gt; above are delegated to the third BPMN model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Action
&lt;/h3&gt;

&lt;p&gt;The third model is the most generic one. It will execute any action, which is defined in the process instance payload. The payload will be a chaos experiment specification. The specification can also contain timeouts and pause times which are reflected in the model as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AoQrdqTsM_9z2_cwe" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AoQrdqTsM_9z2_cwe" alt="BPMN Model action" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;BPMN Model: Action&lt;/sup&gt;&lt;/center&gt;
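
&lt;p&gt;To illustrate what “execute any action with a timeout and pauses” can mean, here is a small Python sketch. It is not the zbchaos implementation. The &lt;code&gt;provider&lt;/code&gt;, &lt;code&gt;arguments&lt;/code&gt; and &lt;code&gt;timeout&lt;/code&gt; keys mirror the specification shown later in this article; the &lt;code&gt;pauses&lt;/code&gt; key with &lt;code&gt;before&lt;/code&gt; and &lt;code&gt;after&lt;/code&gt;, the rule that exit code 0 means success, and the function name &lt;code&gt;run_action&lt;/code&gt; are assumptions. The demo uses the Python interpreter itself as a harmless stand-in for a real command.&lt;/p&gt;

```python
import subprocess
import sys
import time
from typing import Dict


def run_action(action: Dict) -> bool:
    """Run one action as a child process; True if it exited with code 0 in time."""
    provider = action["provider"]
    pauses = action.get("pauses", {})  # assumed shape: {"before": seconds, "after": seconds}

    time.sleep(pauses.get("before", 0))
    try:
        completed = subprocess.run(
            [provider["path"], *provider.get("arguments", [])],
            timeout=provider.get("timeout"),  # seconds; None disables the limit
            capture_output=True,
        )
        succeeded = completed.returncode == 0
    except subprocess.TimeoutExpired:
        succeeded = False  # the command took longer than the timeout allows
    time.sleep(pauses.get("after", 0))
    return succeeded


# A command that exits immediately, well within its timeout.
ok = run_action({
    "type": "probe",
    "name": "exits cleanly",
    "provider": {"type": "process", "path": sys.executable,
                 "arguments": ["-c", "pass"], "timeout": 30},
})

# A command that sleeps longer than its timeout and is therefore reported as failed.
too_slow = run_action({
    "type": "action",
    "name": "sleeps too long",
    "provider": {"type": "process", "path": sys.executable,
                 "arguments": ["-c", "import time; time.sleep(5)"], "timeout": 0.5},
})
```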

&lt;h3&gt;
  
  
  Specification
&lt;/h3&gt;

&lt;p&gt;As we have seen, the BPMN process models are quite generic and all of them are enlivened via a chaos experiment specification.&lt;/p&gt;

&lt;p&gt;The chaos experiment specification is based on the &lt;a href="https://github.com/open-chaos/openchaos" rel="noopener noreferrer"&gt;OpenChaos initiative&lt;/a&gt; and the &lt;a href="https://chaostoolkit.org/reference/api/experiment/#conventions-used-in-this-document" rel="noopener noreferrer"&gt;Chaos Toolkit specification&lt;/a&gt;. We reused this specification so that we can also run these experiments with the Chaos Toolkit itself (to run them locally).&lt;/p&gt;

&lt;p&gt;An example is the following &lt;strong&gt;experiment.json&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0.1.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Zeebe follower restart non-graceful experiment"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Zeebe should be fault-tolerant. Zeebe should be able to handle followers terminations."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"contributions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"reliability"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"availability"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"steady-state-hypothesis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Zeebe is alive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"probes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"All pods should be ready"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"probe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"tolerance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"process"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"zbchaos"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"verify"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"readiness"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"timeout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;900&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Can deploy process model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"probe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"tolerance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"process"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"zbchaos"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"deploy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"process"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"timeout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;900&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Should be able to create process instances on partition 1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"probe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"tolerance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"process"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"zbchaos"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"verify"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"instance-creation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--partitionId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"timeout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;900&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Terminate follower of partition 1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"process"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"zbchaos"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"terminate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"broker"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--role"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FOLLOWER"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--partitionId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"rollbacks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first key-value pairs describe the experiment itself. The &lt;strong&gt;steady-state-hypothesis&lt;/strong&gt; and its content describe the verification stage. All of the probes inside the &lt;strong&gt;steady-state-hypothesis&lt;/strong&gt; are executed as actions in our third process model.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;method&lt;/strong&gt; section describes the chaos that should be introduced into the system. In this case, it consists of one action: non-gracefully restarting a follower (&lt;a href="https://docs.camunda.io/docs/next/reference/glossary/#follower" rel="noopener noreferrer"&gt;a broker that is not the leader of a Zeebe partition&lt;/a&gt;) by terminating it.&lt;/p&gt;

&lt;p&gt;I don’t want to go into much detail about the specification itself, but you can find several examples of the experiments we have already defined &lt;a href="https://github.com/zeebe-io/zeebe-chaos/tree/main/go-chaos/internal/chaos-experiments" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
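
&lt;p&gt;To make the mapping from specification to stages more concrete, the following Python sketch turns an abbreviated copy of the &lt;strong&gt;experiment.json&lt;/strong&gt; above (one probe instead of three) into an ordered list of command lines. The helper names &lt;code&gt;command_line&lt;/code&gt; and &lt;code&gt;execution_plan&lt;/code&gt; are hypothetical, and the sketch only builds the commands without running them.&lt;/p&gt;

```python
import json
from typing import Dict, List

# Abbreviated copy of the experiment.json above (one probe instead of three).
SPEC = json.loads("""
{
    "title": "Zeebe follower restart non-graceful experiment",
    "steady-state-hypothesis": {
        "title": "Zeebe is alive",
        "probes": [
            {
                "name": "All pods should be ready",
                "type": "probe",
                "tolerance": 0,
                "provider": {
                    "type": "process",
                    "path": "zbchaos",
                    "arguments": ["verify", "readiness"],
                    "timeout": 900
                }
            }
        ]
    },
    "method": [
        {
            "type": "action",
            "name": "Terminate follower of partition 1",
            "provider": {
                "type": "process",
                "path": "zbchaos",
                "arguments": ["terminate", "broker", "--role", "FOLLOWER", "--partitionId", "1"]
            }
        }
    ],
    "rollbacks": []
}
""")


def command_line(step: Dict) -> str:
    """Join the provider path and its arguments into one command line."""
    provider = step["provider"]
    return " ".join([provider["path"], *provider["arguments"]])


def execution_plan(spec: Dict) -> List[str]:
    """Order the commands as: verify, introduce chaos, verify again."""
    probes = [command_line(p) for p in spec["steady-state-hypothesis"]["probes"]]
    actions = [command_line(a) for a in spec["method"]]
    return probes + actions + probes


plan = execution_plan(SPEC)
```

&lt;p&gt;For this specification, the plan starts and ends with &lt;code&gt;zbchaos verify readiness&lt;/code&gt;, with the &lt;code&gt;terminate broker&lt;/code&gt; command in between.&lt;/p&gt;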

&lt;h3&gt;
  
  
  Automation
&lt;/h3&gt;

&lt;p&gt;Let’s imagine we have a Zeebe cluster against which we want to run the experiments. We call it the Zeebe target.&lt;/p&gt;

&lt;p&gt;As mentioned earlier, the specification is based on the &lt;a href="https://chaostoolkit.org/reference/api/experiment/#conventions-used-in-this-document" rel="noopener noreferrer"&gt;chaos toolkit&lt;/a&gt;. This means we can (if we have &lt;strong&gt;zbchaos&lt;/strong&gt; and &lt;strong&gt;chaos toolkit&lt;/strong&gt; installed) run it &lt;strong&gt;locally&lt;/strong&gt; via &lt;code&gt;chaos run experiment.json&lt;/code&gt;. If Zeebe is installed in Kubernetes and we have the right Kubernetes context set, this works with &lt;strong&gt;zbchaos&lt;/strong&gt; out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zeebe Testbench
&lt;/h3&gt;

&lt;p&gt;But we can also orchestrate that with Zeebe itself, using a different Zeebe cluster that we call Zeebe Testbench.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Aij5itHhRr_4ctewI" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Aij5itHhRr_4ctewI" alt="Chaos experiment orchestration" width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Chaos experiment orchestration&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;Our Zeebe Testbench cluster is in charge of orchestrating the chaos experiments. In this setup, &lt;strong&gt;zbchaos&lt;/strong&gt; is a &lt;a href="https://docs.camunda.io/docs/next/components/concepts/job-workers/" rel="noopener noreferrer"&gt;job worker&lt;/a&gt; and executes all actions, for example, verifying the health of the cluster or of a node, terminating a node, creating a network partition, etc. We have seen in the chaos experiment specification above that all actions and probes reference &lt;strong&gt;zbchaos&lt;/strong&gt; and specify subcommands. These are executed regardless of whether &lt;strong&gt;zbchaos&lt;/strong&gt; is used directly as a CLI tool or as a job worker. This means that if you execute the chaos specification with the &lt;strong&gt;chaos toolkit&lt;/strong&gt;, it will run the &lt;strong&gt;zbchaos&lt;/strong&gt; CLI. If you orchestrate the experiments with Zeebe, the &lt;strong&gt;zbchaos&lt;/strong&gt; workers will handle the specific actions.&lt;/p&gt;

&lt;p&gt;From the outside, we deploy the previously mentioned chaos models to Zeebe Testbench. This can happen during the setup of the Zeebe Testbench cluster (or whenever the models change). New instances can be created either by us locally (e.g. via &lt;a href="https://docs.camunda.io/docs/next/apis-tools/cli-client/" rel="noopener noreferrer"&gt;zbctl&lt;/a&gt;, or any other client), via a timer, or by our GitHub Actions.&lt;/p&gt;

&lt;p&gt;With our &lt;a href="https://github.com/camunda/zeebe/blob/main/.github/workflows/qa-testbench.yaml" rel="noopener noreferrer"&gt;GitHub Actions&lt;/a&gt;, it is fairly easy to trigger a new Testbench run, which includes all chaos experiments and some other tests.&lt;/p&gt;
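
&lt;p&gt;One way to picture “same subcommands, two entry points” is a shared dispatcher, sketched below in Python. Everything here is hypothetical and is not taken from the zbchaos code base: the registry, the handler names, and especially the assumption that a job’s variables carry the &lt;code&gt;provider&lt;/code&gt; object from the specification.&lt;/p&gt;

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: subcommand path -> handler. Not the real zbchaos internals.
HANDLERS: Dict[Tuple[str, ...], Callable[[List[str]], str]] = {}


def command(*path: str):
    """Register a handler under a subcommand path such as ("verify", "readiness")."""
    def register(handler):
        HANDLERS[path] = handler
        return handler
    return register


@command("verify", "readiness")
def verify_readiness(flags: List[str]) -> str:
    return "checked readiness"


@command("terminate", "broker")
def terminate_broker(flags: List[str]) -> str:
    return "terminated broker with " + " ".join(flags)


def dispatch(arguments: List[str]) -> str:
    """Shared entry point: the first two arguments select the handler."""
    return HANDLERS[tuple(arguments[:2])](arguments[2:])


def run_from_cli(argv: List[str]) -> str:
    # CLI mode: the process is started with the arguments from the specification.
    return dispatch(argv)


def run_from_job(variables: Dict) -> str:
    # Worker mode (assumption): the same arguments arrive as job variables.
    return dispatch(variables["provider"]["arguments"])


args = ["terminate", "broker", "--role", "FOLLOWER", "--partitionId", "1"]
cli_result = run_from_cli(args)
job_result = run_from_job({"provider": {"arguments": args}})
```

&lt;p&gt;Because both entry points end up in the same handler, an experiment behaves the same whether it is run locally or orchestrated by Zeebe.&lt;/p&gt;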

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F986%2F0%2AW2bn-pyKguZtsc71" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F986%2F0%2AW2bn-pyKguZtsc71" alt="Zeebe Testbench run" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Zeebe Testbench run&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;To make this even better, we have automation that creates the Zeebe target cluster automatically. That can happen before each &lt;strong&gt;chaosToolkit&lt;/strong&gt; execution. This allows us to always start with a clean state, since errors might otherwise be hard to reproduce, and to avoid wasting resources when no experiment is running.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run chaos experiments regularly
&lt;/h3&gt;

&lt;p&gt;We run our chaos experiments regularly. This means we create a &lt;strong&gt;chaosToolkit&lt;/strong&gt; process instance every day and execute all chaos experiments against a new Zeebe target cluster. These process instances are created by the GitHub Actions mentioned earlier. This lets us integrate the experiments more tightly into our CI, which we also use for releases, meaning that we can run such tests before every release.&lt;/p&gt;

&lt;p&gt;You can find the related GitHub Actions workflows here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/blob/main/.github/workflows/testbench.yaml" rel="noopener noreferrer"&gt;workflows/testbench.yaml&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/blob/main/.github/workflows/qa-testbench.yaml" rel="noopener noreferrer"&gt;workflows/qa-testbench.yaml&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/blob/main/.github/workflows/daily-qa.yml" rel="noopener noreferrer"&gt;workflows/daily-qa.yml&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether an experiment fails or all of them succeed, we are notified in Slack with the help of a &lt;a href="https://docs.camunda.io/docs/next/components/connectors/out-of-the-box-connectors/slack/" rel="noopener noreferrer"&gt;Slack Connector&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This happens outside of the &lt;strong&gt;chaosToolkit&lt;/strong&gt; process, which is itself wrapped by other, larger process models that automate the remaining parts: as mentioned before, creating clusters, sending notifications, deleting clusters, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;p&gt;With &lt;a href="https://camunda.com/platform/operate/" rel="noopener noreferrer"&gt;Operate&lt;/a&gt;, you can observe a currently running chaos experiment: which cluster it targets, which experiment and action it is currently executing, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Ae3HQKu0vIrWTFbch" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Ae3HQKu0vIrWTFbch" alt="Operate: Running QA" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Operate: Running QA (ChaosToolkit process)&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;In the screenshot above, we can see a currently running chaosToolkit process instance. We can observe how many experiments have been executed (on the left in the “&lt;strong&gt;Instance History&lt;/strong&gt;”, highlighted in green) and how many we still need to process (based on the variables).&lt;/p&gt;

&lt;p&gt;Furthermore, we can see in the &lt;strong&gt;Variables&lt;/strong&gt; tab (with the red border) what type of experiment is currently being executed: “Zeebe should be fault-tolerant. We expect that Zeebe can handle non-graceful leader restarts”. And there is even more to dive into.&lt;/p&gt;

&lt;p&gt;If we dig deeper into the currently running experiment (we can do that by following the &lt;a href="https://docs.camunda.io/docs/components/modeler/bpmn/call-activities/" rel="noopener noreferrer"&gt;call-activity&lt;/a&gt; link), we can see that we are in the verification stage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A3MKa1P8eLoBQy6Go" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A3MKa1P8eLoBQy6Go" alt="Operate: Running chaos experiment" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Operate: Running chaos experiment&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;We are in the verification stage, after the chaos has been introduced (highlighted in green). We can also see which chaos action has been executed, in this case (highlighted in red): “Terminate leader of partition two non-gracefully”.&lt;/p&gt;

&lt;p&gt;When following the call activity again, we see which verification is currently being executed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AYf0yYLkRAkZF44aV" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AYf0yYLkRAkZF44aV" alt="Operate: Running action" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Operate: Running action&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;We are verifying that all pods are ready again after the leader of partition two has been terminated. This information can be extracted from the variables (highlighted in red).&lt;/p&gt;

&lt;p&gt;As Operate keeps the history of a process, we can also take a look at past experiments. You can check and verify which actions have been executed and which chaos has been introduced.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AuEfOIoLNQHSK3qMw" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AuEfOIoLNQHSK3qMw" alt="Operate: Past chaos experiment" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Operate: Past chaos experiment&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;You can see a large history of executed chaos experiments, actions, and several other details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AHuD-_rvhRJNekT0h" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AHuD-_rvhRJNekT0h" alt="Operate: Past action runs" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Operate: Past action runs&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;This high degree of observability is important if something fails. Here you will see directly at which stage your experiment failed, what was executed before, etc. The &lt;a href="https://docs.camunda.io/docs/next/components/concepts/incidents/" rel="noopener noreferrer"&gt;incident&lt;/a&gt; message (depending on the worker) can also include a helpful note about why a stage failed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drink your own champagne
&lt;/h3&gt;

&lt;p&gt;This setup might sound a bit complex at first, but once you understand the generic approach, it actually isn’t. In contrast to scripting, the BPMN automation greatly benefits observability. Furthermore, with this approach, we are still able to execute our experiments locally (which helps with development and debugging) and to automate them via our Zeebe Testbench cluster. It is fairly easy to use and to execute new QA runs on demand. We drink our own champagne, which helps us to improve our overall system, and that is actually the biggest benefit of this setup.&lt;/p&gt;

&lt;p&gt;It just feels good to use our own product to automate our own processes. We can sit in the driver’s seat of the car we build and ship, feel what our users feel, and can improve based on that. It allows us to find bugs/issues earlier on, improve metrics and other observability measures, and build up confidence that our system can handle certain failure scenarios and situations.&lt;/p&gt;

&lt;p&gt;I hope this was helpful and showed you a bit of what you can do with Zeebe. As I mentioned at the start, the use cases and possibilities for Zeebe are endless, and the whole Camunda Platform stack supports that pretty well.&lt;/p&gt;

&lt;p&gt;—&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks to&lt;/em&gt; &lt;a href="//mailto:christina.ausley@camunda.com"&gt;&lt;em&gt;Christina Ausley&lt;/em&gt;&lt;/a&gt;&lt;em&gt;,&lt;/em&gt; &lt;a href="//mailto:deepthi.akkoorath@camunda.com"&gt;&lt;em&gt;Deepthi Akkoorath&lt;/em&gt;&lt;/a&gt; &lt;em&gt;and&lt;/em&gt; &lt;a href="//mailto:sebastian.bathke@camunda.com"&gt;&lt;em&gt;Sebastian Bathke&lt;/em&gt;&lt;/a&gt; &lt;em&gt;for reviewing this blog post.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ci</category>
      <category>automation</category>
      <category>camunda</category>
      <category>qa</category>
    </item>
    <item>
      <title>Looks like bitnami/elasticsearch-curator is gone</title>
      <dc:creator>Christopher Kujawa</dc:creator>
      <pubDate>Wed, 28 Jun 2023 08:16:07 +0000</pubDate>
      <link>https://forem.com/chriskujawa/looks-like-bitnamielasticsearch-curator-is-gone-5ef9</link>
      <guid>https://forem.com/chriskujawa/looks-like-bitnamielasticsearch-curator-is-gone-5ef9</guid>
      <description>&lt;p&gt;As you may have noticed, since last week (mid-June 2023) the old &lt;a href="https://hub.docker.com/r/bitnami/elasticsearch-curator/tags" rel="noopener noreferrer"&gt;bitnami/elasticsearch-curator&lt;/a&gt; image has been removed from Docker Hub, which caused several issues with our Elasticsearch installations.&lt;/p&gt;

&lt;p&gt;Since I haven’t seen an announcement anywhere (which was kind of a surprise), I want to briefly summarize what we did to overcome this. Maybe it helps others as well.&lt;/p&gt;

&lt;p&gt;We had quite a hard time with our benchmarks and clusters, since Elasticsearch was filling up and caused several issues. Normally we track the throughput and latency of several clusters in one dashboard, which didn’t look healthy at the end of last week.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvebn9al8ieryblxj6wo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvebn9al8ieryblxj6wo.png" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We quickly realized that the curator cronjobs were no longer running and were crash-looping because the image was no longer available.&lt;/p&gt;

&lt;p&gt;You can also reproduce this via:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker pull bitnami/elasticsearch-curator:5.8.4
Error response from daemon: pull access denied &lt;span class="k"&gt;for &lt;/span&gt;bitnami/elasticsearch-curator, repository does not exist or may require &lt;span class="s1"&gt;'docker login'&lt;/span&gt;: denied: requested access to the resource is denied
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It turned out that the image has been renamed to &lt;a href="https://hub.docker.com/r/bitnami/elasticsearch-curator-archived" rel="noopener noreferrer"&gt;https://hub.docker.com/r/bitnami/elasticsearch-curator-archived&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker pull bitnami/elasticsearch-curator-archived:5.8.4
5.8.4: Pulling from bitnami/elasticsearch-curator-archived
Digest: sha256:46c98206dfaef81705d9397bd3d962d1505c8cfe9437f86ea0258d5cbef89e7f
Status: Downloaded newer image &lt;span class="k"&gt;for &lt;/span&gt;bitnami/elasticsearch-curator-archived:5.8.4
docker.io/bitnami/elasticsearch-curator-archived:5.8.4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you use helm charts (like we do) and have a cronjob defined in them, simply upgrading the charts will not help. The reason is that &lt;a href="https://github.com/helm/helm/issues/7725#issuecomment-1038280907" rel="noopener noreferrer"&gt;jobs, cronjobs, etc. are immutable&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You have to delete the job/cronjob, then do a helm upgrade with &lt;code&gt;--reuse-values&lt;/code&gt; and set the right curator image.&lt;/p&gt;
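
&lt;p&gt;A minimal sketch of the deletion step, assuming the namespace is stored in &lt;code&gt;$ns&lt;/code&gt; as in the script further below. The variable &lt;code&gt;cronjobName&lt;/code&gt; and its value are hypothetical placeholders, so look up the real name in your cluster first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List the cronjobs in the namespace to find the curator cronjob
kubectl get cronjobs --namespace "$ns"
# Placeholder: replace with the curator cronjob name from the list above
cronjobName="YOUR CURATOR CRONJOB"
# Delete the cronjob so that the helm upgrade can recreate it with the new image
kubectl delete cronjob "$cronjobName" --namespace "$ns"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;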

&lt;p&gt;If you use our charts (camunda-platform-helm) you have to set &lt;code&gt;--set camunda-platform.retentionPolicy.image.repository=bitnami/elasticsearch-curator-archived&lt;/code&gt; to use the new curator image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$releaseName&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &amp;lt;YOUR-CHART&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--reuse-values&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
 &lt;span class="nt"&gt;--set&lt;/span&gt; camunda-platform.retentionPolicy.image.repository&lt;span class="o"&gt;=&lt;/span&gt;bitnami/elasticsearch-curator-archived
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
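
&lt;p&gt;To check that the recreated cronjob really references the new image, something like the following should work. This is only a sketch: &lt;code&gt;cronjobName&lt;/code&gt; and &lt;code&gt;ns&lt;/code&gt; are placeholder variables for the curator cronjob and its namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print the container image(s) configured in the cronjob's pod template
kubectl get cronjob "$cronjobName" --namespace "$ns" \
  --output jsonpath='{.spec.jobTemplate.spec.template.spec.containers[*].image}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;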



&lt;p&gt;As an alternative, you can recreate the installed helm releases. To do so, you can use the following script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# we had several clusters to fix so this was part of a loop&lt;/span&gt;
&lt;span class="nv"&gt;ns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"YOUR NAMESPACE"&lt;/span&gt; 
&lt;span class="c"&gt;# release name is in our case the namespace name&lt;/span&gt;
&lt;span class="nv"&gt;releaseName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ns&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="c"&gt;# Get the values for the installed chart release (to reuse them)&lt;/span&gt;
&lt;span class="nv"&gt;values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;helm get values &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$releaseName&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--namespace&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ns&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; yaml&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# You can store the values into a separate file to be on the safe side&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$values&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ns&lt;/span&gt;&lt;span class="s2"&gt;-values.yaml"&lt;/span&gt;
&lt;span class="c"&gt;# ...&lt;/span&gt;
&lt;span class="c"&gt;# You could either set the curator image now here in the values&lt;/span&gt;
&lt;span class="c"&gt;# or set it directly on installation&lt;/span&gt;
&lt;span class="c"&gt;# ...&lt;/span&gt;
&lt;span class="c"&gt;# Uninstall the chart&lt;/span&gt;
helm uninstall &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$releaseName&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--namespace&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ns&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="c"&gt;# Install the chart again, injecting the saved values via stdin&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$values&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | helm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$releaseName&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &amp;lt;YOUR-CHART&amp;gt; &lt;span class="nt"&gt;--namespace&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ns&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--values&lt;/span&gt; -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another thing we ran into was that Elasticsearch was in some cases so full that it didn’t recover on its own, and the curator was not able to free up space. Later we found out that you need to increase the disk size a bit, so that Elasticsearch can recover and the curator can free up space.&lt;/p&gt;
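
&lt;p&gt;How exactly you resize the disk depends on your setup. As a hedged sketch for Kubernetes (not necessarily what we did): the following assumes Elasticsearch is reachable on &lt;code&gt;localhost:9200&lt;/code&gt; (for example via a port-forward), that the StorageClass allows volume expansion, and that &lt;code&gt;pvcName&lt;/code&gt; and the size &lt;code&gt;64Gi&lt;/code&gt; are example values you need to adapt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Show the disk usage per Elasticsearch node
curl --silent "http://localhost:9200/_cat/allocation?v"
# Request a larger volume for one Elasticsearch data PVC (example size)
kubectl patch pvc "$pvcName" --namespace "$ns" \
  --patch '{"spec":{"resources":{"requests":{"storage":"64Gi"}}}}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;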

&lt;p&gt;If you ask why this change happened (renaming of the docker image), I haven’t found any resources for it. The current assumption is that the curator is deprecated with elasticsearch 8 and &lt;a href="https://discuss.elastic.co/t/curator-and-elasticsearch-8/316550" rel="noopener noreferrer"&gt;likely will not work anymore with 8.x&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But if you still use a lower version, you might still want or need to use the curator so I hope this will help someone.&lt;/p&gt;

</description>
      <category>bitnami</category>
      <category>elasticsearch</category>
      <category>helm</category>
      <category>curator</category>
    </item>
    <item>
      <title>Zeebe, or How I learned To Stop Worrying And Love Batching</title>
      <dc:creator>Christopher Kujawa</dc:creator>
      <pubDate>Sat, 04 Mar 2023 09:44:55 +0000</pubDate>
      <link>https://forem.com/camunda/zeebe-or-how-i-learned-to-stop-worrying-and-love-batching-4l8p</link>
      <guid>https://forem.com/camunda/zeebe-or-how-i-learned-to-stop-worrying-and-love-batching-4l8p</guid>
      <description>&lt;h3&gt;
  
  
  Zeebe, or How I learned To Stop Worrying And Love Batch Processing
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Hi, I’m Chris, Senior Software Engineer at Camunda. I have worked at Camunda for around seven years and on the Zeebe project for almost six years, and was recently part of a hackday effort to improve Zeebe’s process execution latency.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the past, several users reported that the process execution latency of &lt;a href="https://camunda.com/platform/zeebe/" rel="noopener noreferrer"&gt;Zeebe&lt;/a&gt;, our cloud-native workflow and decision engine for &lt;a href="https://camunda.com/platform/" rel="noopener noreferrer"&gt;Camunda Platform 8&lt;/a&gt;, is sometimes sub-optimal. Some reports said that the latency between certain tasks in a process model is too high, others that the overall process instance execution latency is too high. This can of course be heavily affected by the hardware used and by configurations that don’t fit certain use cases, but we also knew we had something to improve.&lt;/p&gt;

&lt;p&gt;At the beginning of this year and after almost three years of COVID-19, we finally sat together in a meeting room with whiteboards to improve the situation for our users. We called this the performance hackdays. It was a nice, interesting, and fruitful experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basics
&lt;/h3&gt;

&lt;p&gt;To dive deeper into what we tried and why, we first need to elaborate on what process instance execution latency means, and what influences it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F903%2F0%2AWelxptAuI7RWHgMw" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F903%2F0%2AWelxptAuI7RWHgMw" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image above is a &lt;a href="https://docs.camunda.io/docs/next/components/concepts/processes/" rel="noopener noreferrer"&gt;process model&lt;/a&gt;, from which we can create an instance. Such an instance is executed from the start event to the end event; the time this takes is the process execution latency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Since Zeebe is a complex distributed system, where the process engine is based on a distributed streaming platform, there are several factors that influence the process execution latency. During our performance hackdays, we tried to sum up all potential factors and to find the bottlenecks we can improve. In the following post, I will summarize them briefly and on a high level.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Stream processing
&lt;/h4&gt;

&lt;p&gt;To execute such a process model, as we have seen above, Zeebe uses a concept called &lt;a href="https://docs.camunda.io/docs/next/components/zeebe/technical-concepts/internal-processing/#stateful-stream-processing" rel="noopener noreferrer"&gt;stream processing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Each element in the process &lt;a href="https://docs.camunda.io/docs/next/components/zeebe/technical-concepts/internal-processing/#state-machines" rel="noopener noreferrer"&gt;has a specific lifecycle&lt;/a&gt;, which is divided into the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F402%2F0%2AKAV3RA8qCxr3lLtK" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F402%2F0%2AKAV3RA8qCxr3lLtK" alt="BPMN Elements Lifecycle" width="402" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;BPMN Elements Lifecycle divided into Command/Events&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;A command requests a state change of a certain element, and an event confirms that state change. Termination can happen when elements are canceled, either internally by events or externally by users.&lt;/p&gt;

&lt;p&gt;Commands drive the execution of a process instance. When Zeebe’s stream processor processes a command, state changes are applied (e.g. process instances are modified). Such modifications are confirmed via follow-up events. To split the execution into smaller pieces, not only are follow-up events produced, but also follow-up commands. All of these follow-up records are persisted. Later, the follow-up commands are further processed by the stream processor to continue the instance execution. The idea behind that is that these small chunks of processing should help to achieve high concurrency by alternating execution of different instances on the same partition.&lt;/p&gt;

&lt;h4&gt;
  
  
  Persistence
&lt;/h4&gt;

&lt;p&gt;Before a new command on a partition can be processed, it must be replicated to a quorum (typically a majority) of nodes. This procedure is called commit. Committing ensures a record is durable, even in case of complete data loss on an individual broker. The exact semantics of &lt;a href="https://docs.camunda.io/docs/components/zeebe/technical-concepts/clustering/#commit" rel="noopener noreferrer"&gt;committing&lt;/a&gt; are defined by the &lt;a href="https://raft.github.io/" rel="noopener noreferrer"&gt;Raft protocol&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AYJh5jPKwTFXkQL8b" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AYJh5jPKwTFXkQL8b" alt="CommitDocs" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Source: https://docs.camunda.io/docs/components/zeebe/technical-concepts/clustering/#commit&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;Committing such records is affected by network latency, since the records have to be sent over the wire, and by disk latency, since we need to persist the records on disk on a quorum of nodes before we can mark them as committed.&lt;/p&gt;

&lt;h4&gt;
  
  
  State
&lt;/h4&gt;

&lt;p&gt;Zeebe’s state is stored in &lt;a href="https://rocksdb.org/" rel="noopener noreferrer"&gt;RocksDB&lt;/a&gt;, which is a key-value store. RocksDB persists data on disk with a &lt;a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree" rel="noopener noreferrer"&gt;log-structured merge tree&lt;/a&gt; (LSM Tree) and is made for fast storage environments.&lt;/p&gt;

&lt;p&gt;The state contains information about deployed process models and current process instance executions. It is separated per partition, which means a RocksDB instance exists per partition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance hackdays
&lt;/h3&gt;

&lt;p&gt;When we started with the performance hackdays, we already had the necessary infrastructure to run benchmarks for our improvements. We made heavy use of the &lt;a href="https://github.com/camunda-community-hub/camunda-8-benchmark" rel="noopener noreferrer"&gt;Camunda Platform 8 benchmark toolkit&lt;/a&gt; maintained by &lt;a href="https://github.com/falko" rel="noopener noreferrer"&gt;Falko Menge&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Furthermore, we run weekly benchmarks (the so-called medic benchmarks) where we test for throughput, latency, and general stability. These benchmarks run for four weeks to detect potential bugs, memory leaks, performance and other regressions, and more as early as possible. This, all the infrastructure around it (like &lt;a href="https://github.com/camunda/zeebe/tree/main/monitor" rel="noopener noreferrer"&gt;Grafana dashboards&lt;/a&gt;), and the knowledge about how our system performs were invaluable for making such great progress during our hackdays.&lt;/p&gt;

&lt;h4&gt;
  
  
  Measurement
&lt;/h4&gt;

&lt;p&gt;We measured our results continuously, which is necessary to see whether you are on the right track. For every small proof of concept (POC), we ran a new benchmark:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AyYeZs1hXlB_CPNbf" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AyYeZs1hXlB_CPNbf" alt="Screenshot of benchmarks over the week" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Screenshot of benchmarks over the week&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;In our benchmark, we used a process based on some user requirements:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AByPWm21YnUTKM4ol" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AByPWm21YnUTKM4ol" alt="Benchmark Process" width="800" height="177"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Benchmark Process&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;Our target was a throughput of around 500 process instances per second (PI/s), with a process execution latency of under one second per process instance for the 99th percentile (p99). P99 means that 99% of all process instances should be executed in under one second.&lt;/p&gt;

&lt;p&gt;The benchmarks have been executed in the Google Kubernetes Engine. For each broker node, we assigned one &lt;a href="https://cloud.google.com/compute/docs/general-purpose-machines" rel="noopener noreferrer"&gt;&lt;strong&gt;n2-standard-8&lt;/strong&gt;&lt;/a&gt; node to reduce the influence of other pods running on the same node.&lt;/p&gt;

&lt;p&gt;Each broker pod had the following configuration:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkat9kpcuvxj72ow4nd2e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkat9kpcuvxj72ow4nd2e.png" alt="Benchmark Config" width="264" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Benchmark configuration&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;There were also some other configurations we played around with during our different experiments, but the above were the general ones. We had eight brokers running, which gives us the following partition distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./partitionDistribution.sh 8 24 4
Distribution:
P\N| N 0| N 1| N 2| N 3| N 4| N 5| N 6| N 7
P 0| L | F | F | F | - | - | - | -  
P 1| - | L | F | F | F | - | - | -  
P 2| - | - | L | F | F | F | - | -  
P 3| - | - | - | L | F | F | F | -  
P 4| - | - | - | - | L | F | F | F  
P 5| F | - | - | - | - | L | F | F  
P 6| F | F | - | - | - | - | L | F  
P 7| F | F | F | - | - | - | - | L  
P 8| L | F | F | F | - | - | - | -  
P 9| - | L | F | F | F | - | - | -  
P 10| - | - | L | F | F | F | - | -  
P 11| - | - | - | L | F | F | F | -  
P 12| - | - | - | - | L | F | F | F  
P 13| F | - | - | - | - | L | F | F  
P 14| F | F | - | - | - | - | L | F  
P 15| F | F | F | - | - | - | - | L  
P 16| L | F | F | F | - | - | - | -  
P 17| - | L | F | F | F | - | - | -  
P 18| - | - | L | F | F | F | - | -  
P 19| - | - | - | L | F | F | F | -  
P 20| - | - | - | - | L | F | F | F  
P 21| F | - | - | - | - | L | F | F  
P 22| F | F | - | - | - | - | L | F  
P 23| F | F | F | - | - | - | - | L
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each broker node had 12 partitions assigned. We used a replication factor of four because we wanted to mimic the &lt;a href="https://camunda.com/blog/2022/06/how-to-achieve-geo-redundancy-with-zeebe/" rel="noopener noreferrer"&gt;geo redundancy&lt;/a&gt; setup of some of our users, who had certain process execution latency requirements. Geo redundancy introduces network latency into the system by default, and we wanted to reduce the influence of such network latency on the process execution latency. To make the benchmark a bit more realistic, we used &lt;a href="https://chaos-mesh.org/" rel="noopener noreferrer"&gt;Chaos Mesh&lt;/a&gt; to introduce a network latency of 35ms between two brokers, resulting in a round-trip time (RTT) of 70ms.&lt;/p&gt;
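
&lt;p&gt;As a quick sanity check, the numbers of this setup can be derived from the distribution above. The quorum line assumes the usual majority quorum of Raft:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Partition replicas in total = 24 partitions * 4 replicas = 96
Replicas per broker = 96 / 8 brokers = 12
Leaders per broker = 24 partitions / 8 brokers = 3 (when evenly balanced)
Quorum per partition = floor(4 / 2) + 1 = 3 nodes
RTT between two brokers = 2 * 35ms = 70ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;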

&lt;p&gt;To run with an evenly distributed partition leadership, we used the &lt;a href="https://docs.camunda.io/docs/next/self-managed/zeebe-deployment/operations/rebalancing/" rel="noopener noreferrer"&gt;partitioning rebalancing API&lt;/a&gt;, which Zeebe provides.&lt;/p&gt;

&lt;h4&gt;
  
  
  Theory
&lt;/h4&gt;

&lt;p&gt;Based on the benchmark process model above, we considered the impact of commands and events on the process model (and also in general).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AH9mAOR1ftRGIQHFZ" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AH9mAOR1ftRGIQHFZ" alt="WhiteboardSession" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Whiteboard session: Drawing commands/events&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;We calculated that around 30 commands are necessary to execute the process instance from start to end.&lt;/p&gt;

&lt;p&gt;We tried to summarize what affects the processing latency and came to the following formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PEL = X * Commit Latency + Y * Processing Latency + OH
PEL - Process Execution Latency
OH - Overhead, which we haven't considered (e.g. Jobs * Job Completion Latency)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we started, &lt;strong&gt;X&lt;/strong&gt; and &lt;strong&gt;Y&lt;/strong&gt; were equal, but the idea was to change these factors independently, which is why we split them up. The other latencies were based on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Commit Latency = Network Latency + Append Latency
Network Latency = 2 * request duration
Append Latency = Write to Disk + Flush
Processing Latency = Processing Command (apply state changes) 
                   + Commit Transaction (RocksDB) 
                   + execute side effects
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
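
&lt;p&gt;To get a feeling for the numbers, we can plug the benchmark values from above into these formulas. This is only a rough sketch under stated assumptions: X is set to the roughly 30 commands we counted (one commit per command), every commit is assumed to pay the full 70ms RTT, and append latency, processing latency, and overhead are ignored:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Network Latency = 2 * 35ms = 70ms
Commit Latency = 70ms + Append Latency, so at least 70ms
X * Commit Latency = 30 * 70ms = 2100ms, at the very least
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Under these assumptions, the commit part alone would already be above the one-second goal, which illustrates why the factor X matters so much.&lt;/p&gt;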



&lt;p&gt;Below is a picture of our whiteboard session, where we discussed potential influences and which potential solutions could mitigate which factors:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfgk9w5j65ihzxkzskcf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfgk9w5j65ihzxkzskcf.png" alt="DiscussionInfluenceFactors" width="800" height="934"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Whiteboard session: Discussing potential factors and influences&lt;/sup&gt;&lt;/center&gt;

&lt;h3&gt;
  
  
  Proofs of concept
&lt;/h3&gt;

&lt;p&gt;Based on the formula, it became clearer to us what might affect the process execution latency and where it might make sense to change or reduce time. For example, reducing the append latency reduces the commit latency and, in turn, the process execution latency. Additionally, reducing the factor of how often the commit latency is applied will strongly affect the result.&lt;/p&gt;

&lt;h4&gt;
  
  
  Append and commit latency
&lt;/h4&gt;

&lt;p&gt;Before we started with the performance hackdays, there was already one configuration present which we built &lt;a href="https://github.com/camunda/zeebe/pull/5576" rel="noopener noreferrer"&gt;more than two years ago&lt;/a&gt; and made available as an experimental feature: &lt;a href="https://github.com/camunda/zeebe/blob/8.1.0/broker/src/main/java/io/camunda/zeebe/broker/system/configuration/ExperimentalCfg.java#L26" rel="noopener noreferrer"&gt;the disabling of the raft flush&lt;/a&gt;. We have seen several users applying it to reach certain performance targets, but it comes at a cost. It is not safe to use, since on fail-over certain guarantees of Raft no longer apply.&lt;/p&gt;

&lt;p&gt;As part of the hackdays, we were interested in similar performance, but with more safety. This is why we tried several other possibilities and compared them with disabling the flush completely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flush improvement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In one of our POCs, we tried to flush on another thread. This gave similar performance to completely disabling the flush, but it also has similar safety issues. Combining the async flush with awaiting its completion before committing brought back the safety, but also the old (base) performance. This was no solution.&lt;/p&gt;

&lt;p&gt;Implementing a batch flush (flush only after a configured threshold), running it in a separate thread, and waiting for its completion degraded the performance. However, we again had better safety than with disabling the flush.&lt;/p&gt;

&lt;p&gt;We thought about flushing async in a batch, without waiting for commit and making this configurable. This would allow users to trade safety versus performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write improvement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We had a deeper look into system calls such as &lt;a href="https://man7.org/linux/man-pages/man2/madvise.2.html" rel="noopener noreferrer"&gt;madvise&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Zeebe stores its log in a segmented journal which is memory mapped at runtime. The OS manages what is in memory at any time via the page cache, but does not know the application itself. The &lt;strong&gt;madvise&lt;/strong&gt; system call allows us to provide hints to the OS on when to read/write/evict pages.&lt;/p&gt;

&lt;p&gt;The idea was to provide hints to reduce memory churn/page faults and reduce I/O.&lt;/p&gt;

&lt;p&gt;We tested with &lt;strong&gt;MADV_SEQUENTIAL&lt;/strong&gt;, hinting that we will access the file sequentially and a more aggressive read-ahead should be performed (while previous pages can be dropped sooner).&lt;/p&gt;

&lt;p&gt;Based on our benchmarks, we didn’t see much difference under low/mid load. However, read I/O was greatly reduced under high load. We also saw slightly increased write I/O throughput under high load due to reduced IOPS contention. Overall, there was only a small improvement in throughput/latency. Surprisingly, it still showed a similar number of page faults as before.&lt;/p&gt;

&lt;h4&gt;
  
  
  Reduce transaction commits
&lt;/h4&gt;

&lt;p&gt;Based on our formula above, we can see that the processing latency is affected by the RocksDB write and transaction commit duration. This means reducing one of these could benefit the processing latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State directory separation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Zeebe stores the current state (runtime) and snapshots in different folders on disk (under the same parent). When a Zeebe broker restarts, we recreate the state (runtime) every time from a snapshot. This is to avoid having data in the state which might not have been committed yet.&lt;/p&gt;

&lt;p&gt;This means we don’t necessarily need to keep the state (runtime) on disk, and RocksDB does a lot of IO-heavy work which might not be necessary. The idea was to separate the state directory in a way that it can be separately mounted (in Kubernetes) such that we can run RocksDB in &lt;a href="https://www.kernel.org/doc/html/v5.18/filesystems/tmpfs.html" rel="noopener noreferrer"&gt;tmpfs&lt;/a&gt;, for example.&lt;/p&gt;

&lt;p&gt;Based on our benchmarks, only p30 and lower have been improved with this POC:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AeXdsseoUCYkzRRWK" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AeXdsseoUCYkzRRWK" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disable WAL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RocksDB has a write-ahead log to be crash resistant. This is not necessary for us, since we recreate the state every time. We considered disabling it; we will see later in this post what influence it has. It is a &lt;a href="https://github.com/camunda/zeebe/blob/8.1.0/dist/src/main/config/broker.standalone.yaml.template#L757" rel="noopener noreferrer"&gt;single configuration&lt;/a&gt;, which is easy to change.&lt;/p&gt;

&lt;h4&gt;
  
  
  Processing of uncommitted
&lt;/h4&gt;

&lt;p&gt;We mentioned earlier that we have thought about changing the factor of how many commits influence the overall calculation. What if we process commands already, even if they are not committed yet, and only send results to the user if the commit of the commands is done?&lt;/p&gt;

&lt;p&gt;We worked on a POC to implement uncommitted processing, but it was a bit more complex than we thought due to the buffering of requests, etc. This is why we didn’t find a good solution during our hackdays. We still ran a benchmark to verify how it would behave:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AqO7KyPxNx_tDjkAc" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AqO7KyPxNx_tDjkAc" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results were quite interesting and promising, but we considered them a bit too good. A production-ready implementation might behave differently, since we have to consider more edge cases.&lt;/p&gt;

&lt;h4&gt;
  
  
  Batch processing
&lt;/h4&gt;

&lt;p&gt;Part of another POC we did was something we called &lt;strong&gt;batch processing.&lt;/strong&gt; The implementation was rather easy.&lt;/p&gt;

&lt;p&gt;The idea was to process the follow-up commands directly and continue the execution of an instance until no more follow-up commands are produced. This normally means we have reached a wait state, like a service task. Camunda Platform 7 users will know this behavior, &lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/transactions-in-processes/#wait-states" rel="noopener noreferrer"&gt;as this is the Camunda Platform 7 default&lt;/a&gt;. The result was promising as well:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AROVY8b46DvI3MhZZ" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AROVY8b46DvI3MhZZ" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our example process model above, this would reduce the factor of commit latencies from ~30 commands to 15, which is significant. The best IO you can do, however, is no IO.&lt;/p&gt;
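
&lt;p&gt;To make the effect concrete, here is a small back-of-the-envelope calculation. It is only a sketch: it assumes the commit-related share of the process execution latency is simply the number of commits multiplied by the commit latency, and the 10 ms commit latency is an invented value, not a measurement from our benchmarks.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Commit-related share of the latency: number of commits times commit latency.
# Sketch only: the function name and the 10 ms value are assumptions.
commit_share_ms() {
  echo $(( $1 * $2 ))
}

commit_latency_ms=10
echo "before: $(commit_share_ms 30 "$commit_latency_ms") ms"
echo "after:  $(commit_share_ms 15 "$commit_latency_ms") ms"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With these assumed numbers, the commit-related share drops from 300 ms to 150 ms, independent of everything else that contributes to the latency.&lt;/p&gt;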

&lt;h4&gt;
  
  
  Combining the POCs
&lt;/h4&gt;

&lt;p&gt;By combining several POCs, we reached our target line, which showed us that it is possible and gave us good insights into where to invest in order to improve our system further in the future.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AxzdFc_SWUJEtcm6w" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AxzdFc_SWUJEtcm6w" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The changes did not just improve the overall latency of the system. In our weekly benchmarks, we had to increase the load because the system was able to reach higher throughput. Previously, we reached ~133 process instances per second (PI/s) on average over three partitions; now we reach 163 PI/s on average, while also reducing the latency by a factor of 2.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next
&lt;/h3&gt;

&lt;p&gt;In the last weeks, we took several ideas from the hackdays to implement some production-ready solutions for Zeebe 8.2. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/issues/11455" rel="noopener noreferrer"&gt;Disabling WAL per default&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/issues/11416" rel="noopener noreferrer"&gt;Implement batch processing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/issues/11423" rel="noopener noreferrer"&gt;Make disabling raft flush more safe&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/issues/11494" rel="noopener noreferrer"&gt;Direct message correlation on the same partition&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We plan to work on some more like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/issues/6044" rel="noopener noreferrer"&gt;Make state directory configurable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/issues/11377" rel="noopener noreferrer"&gt;Advise OS on mmap usage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/issues/11488" rel="noopener noreferrer"&gt;Configurable raft flush interval&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You can expect some better performance with the 8.2 release; I’m really looking forward to April! :)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Thanks to all participants of the hackdays for the great and fun collaboration, and to our manager (&lt;a href="https://github.com/megglos" rel="noopener noreferrer"&gt;Sebastian Bathke&lt;/a&gt;) who made this possible. It was a really nice experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Participants (alphabetically sorted):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Zelldon" rel="noopener noreferrer"&gt;Christopher Zell (myself)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/deepthidevaki" rel="noopener noreferrer"&gt;Deepthi Devaki Akkoorath&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/falko" rel="noopener noreferrer"&gt;Falko Menge&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/npepinpe" rel="noopener noreferrer"&gt;Nicolas Pepin-Perreault&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/oleschoenburg" rel="noopener noreferrer"&gt;Ole Schönburg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/romansmirnov" rel="noopener noreferrer"&gt;Roman Smirnov&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/megglos" rel="noopener noreferrer"&gt;Sebastian Bathke&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Thanks to all the reviewers of this blog post:&lt;/em&gt; &lt;a href="https://github.com/cmausley" rel="noopener noreferrer"&gt;&lt;em&gt;Christina Ausley&lt;/em&gt;&lt;/a&gt;, &lt;a href="https://github.com/deepthidevaki" rel="noopener noreferrer"&gt;&lt;em&gt;Deepthi Devaki Akkoorath&lt;/em&gt;&lt;/a&gt;, &lt;a href="https://github.com/npepinpe" rel="noopener noreferrer"&gt;&lt;em&gt;Nicolas Pepin-Perreault&lt;/em&gt;&lt;/a&gt;, &lt;a href="https://github.com/oleschoenburg" rel="noopener noreferrer"&gt;&lt;em&gt;Ole Schönburg&lt;/em&gt;&lt;/a&gt;&lt;em&gt;,&lt;/em&gt; &lt;a href="https://github.com/saig0" rel="noopener noreferrer"&gt;&lt;em&gt;Philipp Ossler&lt;/em&gt;&lt;/a&gt; &lt;em&gt;and&lt;/em&gt; &lt;a href="https://github.com/megglos" rel="noopener noreferrer"&gt;&lt;em&gt;Sebastian Bathke&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>benchmark</category>
      <category>camunda</category>
      <category>zeebe</category>
      <category>performance</category>
    </item>
    <item>
      <title>Zbchaos — A new fault injection tool for Zeebe</title>
      <dc:creator>Christopher Kujawa</dc:creator>
      <pubDate>Thu, 15 Sep 2022 12:16:14 +0000</pubDate>
      <link>https://forem.com/camunda/zbchaos-a-new-fault-injection-tool-for-zeebe-4cin</link>
      <guid>https://forem.com/camunda/zbchaos-a-new-fault-injection-tool-for-zeebe-4cin</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3200%2F0%2ACqGpxPfdlWhgIHm9" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3200%2F0%2ACqGpxPfdlWhgIHm9" alt="Photo by [Brett Jordan](https://unsplash.com/@brett_jordan?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText) on[ Unsplash](https://unsplash.com/s/photos/chaos?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText)" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;&lt;sup&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@brett_jordan?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Brett Jordan&lt;/a&gt; on&lt;a href="https://unsplash.com/s/photos/chaos?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt; Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;During Summer Hackdays 2022, I worked on a project called “Zeebe chaos” (&lt;strong&gt;zbchaos&lt;/strong&gt;), a fault injection CLI tool. This allows us engineers to more easily run chaos experiments against Zeebe, build up confidence in the system’s capabilities, and discover potential weaknesses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To understand this blog post, it is useful to have a certain understanding of &lt;a href="https://kubernetes.io/docs/concepts/overview/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt; and &lt;a href="https://camunda.com/platform/zeebe/" rel="noopener noreferrer"&gt;Zeebe&lt;/a&gt; itself.&lt;/p&gt;
&lt;h2&gt;
  
  
  Summer Hackdays:
&lt;/h2&gt;

&lt;p&gt;Hackdays are a regular event at Camunda, where people from different departments (engineering, consulting, DevRel, etc.) work together on new ideas, pet projects, and more.&lt;/p&gt;

&lt;p&gt;Often, the results are quite impressive and are also presented in the following CamundaCon. For example, check out the agenda of this year’s &lt;a href="https://www.camundacon.com/agenda-day-2/" rel="noopener noreferrer"&gt;CamundaCon 2022&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Check out previous Summer Hackdays here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://camunda.com/blog/2020/09/highlights-from-the-summer-hackdays-2020/" rel="noopener noreferrer"&gt;Summer Hackdays 2020&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=YCG9yHry1ks&amp;amp;ab_channel=Camunda" rel="noopener noreferrer"&gt;Summer Hackdays 2019&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Zeebe chaos CLI
&lt;/h2&gt;

&lt;p&gt;Working on the &lt;a href="https://docs.camunda.io/docs/components/zeebe/zeebe-overview/" rel="noopener noreferrer"&gt;Zeebe project&lt;/a&gt; is not only about engineering a distributed system or a process engine, it is also about testing, benchmarking, and experimenting with our capabilities.&lt;/p&gt;

&lt;p&gt;We run regular chaos experiments against Zeebe to build up confidence in our system and to determine whether we have weaknesses in certain areas. In the past, we have written &lt;a href="https://github.com/zeebe-io/zeebe-chaos/tree/main/chaos-workers/chaos-experiments/scripts" rel="noopener noreferrer"&gt;many bash scripts&lt;/a&gt; to inject faults (chaos). We wanted to replace them with better tooling: a new CLI. This allows us to make it more maintainable, but also lowers the barrier for others to experiment with the system.&lt;/p&gt;

&lt;p&gt;The CLI targets Kubernetes, as this is our recommended environment for Camunda Platform 8 Self-Managed, and the environment our own SaaS offering runs on.&lt;/p&gt;

&lt;p&gt;The tool builds upon our existing &lt;a href="https://helm.camunda.io/" rel="noopener noreferrer"&gt;Helm charts&lt;/a&gt;, which are normally used to deploy Zeebe within Kubernetes.&lt;/p&gt;
&lt;h3&gt;
  
  
  Requirements
&lt;/h3&gt;

&lt;p&gt;To use the CLI you need to have access to a Kubernetes cluster, and have our Camunda Platform 8 Helm charts deployed. &lt;a href="https://docs.camunda.io/docs/self-managed/platform-deployment/kubernetes-helm/#installing-the-camunda-helm-chart-in-a-cloud-environment" rel="noopener noreferrer"&gt;Additionally, feel free to try out Camunda Platform 8 Self-Managed&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Chaos Engineering:
&lt;/h2&gt;

&lt;p&gt;You might be wondering why we need this fault injection CLI tool or what this “chaos” stands for. It comes from chaos engineering, a practice we introduced back in 2019 to the Zeebe Project.&lt;/p&gt;

&lt;p&gt;Chaos Engineering was defined by the &lt;a href="https://principlesofchaos.org/" rel="noopener noreferrer"&gt;Principles of Chaos&lt;/a&gt;. It should help to build confidence in the system's capabilities and find potential weaknesses through regular chaos experiments. We define and execute such experiments regularly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://page.camunda.com/cclive-zell-chaosengineeringmeetszeebe" rel="noopener noreferrer"&gt;Take a look at my talk at CamundaCon 2020.2 to get to know more about Chaos Engineering at Camunda (and Zeebe)&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Chaos experiments
&lt;/h3&gt;

&lt;p&gt;As mentioned, we regularly write and run new chaos experiments to build up confidence in our system and uncover weaknesses. The first thing you have to do for your chaos experiment is to define a hypothesis that you want to prove. For example, processing should still be possible after a node goes down. Based on the hypothesis, you know what kind of property or steady state you want to verify before and after injecting faults into the system.&lt;/p&gt;

&lt;p&gt;A chaos experiment consists of three phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verify the steady state.&lt;/li&gt;
&lt;li&gt;Inject chaos.&lt;/li&gt;
&lt;li&gt;Verify the steady state.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For each of these phases, the &lt;strong&gt;zbchaos&lt;/strong&gt; CLI provides certain features outlined below.&lt;/p&gt;
&lt;h4&gt;
  
  
  Verify steady state
&lt;/h4&gt;

&lt;p&gt;In the steady state phase, we want to verify certain properties of the system, like invariants, etc.&lt;/p&gt;

&lt;p&gt;One of the first things we typically want to check is the Zeebe topology. With &lt;strong&gt;zbchaos&lt;/strong&gt; you can run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;zbchaos topology
0 |LEADER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt; |FOLLOWER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt; |LEADER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt;
1 |FOLLOWER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt; |LEADER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt; |FOLLOWER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt;
2 |FOLLOWER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt; |FOLLOWER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt; |FOLLOWER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zbchaos will do all the necessary magic for you: it finds a Zeebe gateway, sets up a port-forward, requests the topology, and prints it in a compact format. This makes the chaos engineer’s life much easier.&lt;/p&gt;

&lt;p&gt;Another basic check is verifying the readiness of all deployed Zeebe components. To achieve this, we can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;zbchaos verify readiness
All Zeebe nodes are running.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This verifies the status of the Zeebe Broker Pods and of the Zeebe Gateway deployment. If one of these is not ready yet, it will loop and not return before they are ready. This is beneficial in automation scripts.&lt;/p&gt;

&lt;p&gt;After you have verified the general health and readiness of the system, you also need to verify whether the system is working functionally. This is also called “verifying the steady state.” This can be achieved by:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;zbchaos verify steady-state &lt;span class="nt"&gt;--partitionId&lt;/span&gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command checks that a process model can be deployed and a process instance can be started for the specified partition. As you cannot influence the partition for new process instances, process instances are started in a loop until that partition is hit. If you don’t specify the &lt;strong&gt;partitionId&lt;/strong&gt;, partition one is used.&lt;/p&gt;

&lt;h4&gt;
  
  
  Inject chaos
&lt;/h4&gt;

&lt;p&gt;After we verify our steady state, we want to inject faults or chaos into our system, and afterward check our steady state again. The &lt;strong&gt;zbchaos&lt;/strong&gt; CLI already provides several possibilities to inject faults, outlined below.&lt;/p&gt;

&lt;p&gt;Before we step through how we can inject failures, we need to understand what kind of components a Zeebe cluster consists of and what the architecture looks like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2810%2F0%2AbzpYhhsYYz4ATUpL" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2810%2F0%2AbzpYhhsYYz4ATUpL" alt="[https://docs.camunda.io/assets/images/zeebe-gateway-overview-2c9e101330b27687016509acef12725f.png](https://docs.camunda.io/assets/images/zeebe-gateway-overview-2c9e101330b27687016509acef12725f.png)" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have two types of nodes: the broker, and the gateway.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://docs.camunda.io/docs/components/zeebe/technical-concepts/architecture/#brokers" rel="noopener noreferrer"&gt;broker&lt;/a&gt; is a node that does the processing work. It can participate in one or more Zeebe partitions (&lt;a href="https://docs.camunda.io/docs/components/zeebe/technical-concepts/partitions/" rel="noopener noreferrer"&gt;internally each partition is a raft group, which can consist of one or more nodes&lt;/a&gt;). A broker can have different roles for each partition (Leader, Follower, etc.)&lt;/p&gt;

&lt;p&gt;For more details about the replication, check our &lt;a href="https://docs.camunda.io/docs/components/zeebe/technical-concepts/partitions/#replication" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and the &lt;a href="https://raft.github.io/" rel="noopener noreferrer"&gt;raft documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Zeebe gateway is the contact point to the Zeebe cluster to which clients connect. Clients send commands to the gateway and the gateway is in charge of distributing the commands to the partition leaders. This depends on the command type of course. &lt;a href="https://docs.camunda.io/docs/self-managed/zeebe-gateway-deployment/zeebe-gateway/" rel="noopener noreferrer"&gt;For more details, check out the documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By default, the Zeebe gateways are replicated if Camunda Platform 8 Self-Managed is installed via our &lt;a href="http://helm.camunda.io" rel="noopener noreferrer"&gt;Helm charts&lt;/a&gt;, which makes it interesting to also experiment with the gateways.&lt;/p&gt;

&lt;h5&gt;
  
  
  Shutdown nodes
&lt;/h5&gt;

&lt;p&gt;With &lt;strong&gt;zbchaos&lt;/strong&gt; we can shut down brokers (gracefully and non-gracefully) which have a specific role and take part in a specific partition. This is quite useful in experimenting, since we often want to terminate or restart brokers based on their participation and role (e.g. terminate the Leader of partition X or restart all followers of partition Y).&lt;/p&gt;

&lt;h6&gt;
  
  
  Graceful
&lt;/h6&gt;

&lt;p&gt;A graceful restart can be initiated like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;zbchaos restart &lt;span class="nt"&gt;-h&lt;/span&gt;
Restarts a Zeebe broker with a certain role and given partition.

    Usage:
    zbchaos restart &lt;span class="o"&gt;[&lt;/span&gt;flags]

    Flags:
      &lt;span class="nt"&gt;-h&lt;/span&gt;, &lt;span class="nt"&gt;--help&lt;/span&gt; &lt;span class="nb"&gt;help &lt;/span&gt;&lt;span class="k"&gt;for &lt;/span&gt;restart
      &lt;span class="nt"&gt;--partitionId&lt;/span&gt; int Specify the &lt;span class="nb"&gt;id &lt;/span&gt;of the partition &lt;span class="o"&gt;(&lt;/span&gt;default 1&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="nt"&gt;--role&lt;/span&gt; string Specify the partition role &lt;span class="o"&gt;[&lt;/span&gt;LEADER, FOLLOWER, INACTIVE] &lt;span class="o"&gt;(&lt;/span&gt;default "LEADER"&lt;span class="o"&gt;)&lt;/span&gt;

    Global Flags:
    &lt;span class="nt"&gt;-v&lt;/span&gt;, &lt;span class="nt"&gt;--verbose&lt;/span&gt; verbose output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sends a Kubernetes &lt;strong&gt;delete&lt;/strong&gt; command to the pod which takes part in the specific partition and has the specific role. This is based on the current Zeebe topology, provided by the Zeebe gateway. All of this is handled by the &lt;strong&gt;zbchaos&lt;/strong&gt; toolkit. The chaos engineer doesn’t need to find this information manually.&lt;/p&gt;

&lt;h6&gt;
  
  
  Non-graceful
&lt;/h6&gt;

&lt;p&gt;The termination of the broker is similar to the graceful restart. It sends a delete to the specific Kubernetes Pod and &lt;a href="https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/" rel="noopener noreferrer"&gt;sets the grace period to zero&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;zbchaos terminate &lt;span class="nt"&gt;-h&lt;/span&gt;
Terminates a Zeebe broker with a certain role and given partition.

    Usage:
      zbchaos terminate &lt;span class="o"&gt;[&lt;/span&gt;flags]
      zbchaos terminate &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;command&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;

    Available Commands:
      gateway Terminates a Zeebe gateway

    Flags:
      &lt;span class="nt"&gt;-h&lt;/span&gt;, &lt;span class="nt"&gt;--help&lt;/span&gt; &lt;span class="nb"&gt;help &lt;/span&gt;&lt;span class="k"&gt;for &lt;/span&gt;terminate
      &lt;span class="nt"&gt;--nodeId&lt;/span&gt; int Specify the nodeId of the Broker &lt;span class="o"&gt;(&lt;/span&gt;default &lt;span class="nt"&gt;-1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="nt"&gt;--partitionId&lt;/span&gt; int Specify the &lt;span class="nb"&gt;id &lt;/span&gt;of the partition &lt;span class="o"&gt;(&lt;/span&gt;default 1&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="nt"&gt;--role&lt;/span&gt; string Specify the partition role &lt;span class="o"&gt;[&lt;/span&gt;LEADER, FOLLOWER] &lt;span class="o"&gt;(&lt;/span&gt;default "LEADER"&lt;span class="o"&gt;)&lt;/span&gt;

    Global Flags:
    &lt;span class="nt"&gt;-v&lt;/span&gt;, &lt;span class="nt"&gt;--verbose&lt;/span&gt; verbose output

    Use "zbchaos terminate &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;command&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="nt"&gt;--help&lt;/span&gt;" &lt;span class="k"&gt;for &lt;/span&gt;more information about a command.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h6&gt;
  
  
  Gateway
&lt;/h6&gt;

&lt;p&gt;Both commands above target the Zeebe brokers. Sometimes, it is also interesting to target the Zeebe gateway. For that, we can just append the &lt;strong&gt;gateway&lt;/strong&gt; subcommand to the &lt;strong&gt;restart&lt;/strong&gt; or &lt;strong&gt;terminate&lt;/strong&gt; command.&lt;/p&gt;
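
&lt;p&gt;The three phases of an experiment can be chained into one small script. The following is a minimal sketch rather than an official part of the tool: the function names and the &lt;code&gt;ZBCHAOS&lt;/code&gt; variable are hypothetical, it only uses the commands and flags shown above, and it assumes that &lt;strong&gt;zbchaos&lt;/strong&gt; exits with a non-zero status when a step fails.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch only: ZBCHAOS and the function names are hypothetical helpers.
# Override ZBCHAOS to point at a different binary (or a stub for dry runs).
ZBCHAOS="${ZBCHAOS:-zbchaos}"

# Phases 1 and 3: check readiness and the functional steady state.
verify_steady_state() {
  "$ZBCHAOS" verify readiness || return 1
  "$ZBCHAOS" verify steady-state --partitionId "$1" || return 1
}

# Runs the whole experiment against the partition given as first argument.
run_experiment() {
  verify_steady_state "$1" || return 1
  # Phase 2: non-gracefully terminate the leader of that partition.
  "$ZBCHAOS" terminate --role LEADER --partitionId "$1" || return 1
  verify_steady_state "$1"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;run_experiment 2&lt;/code&gt; would check partition two, terminate its leader, and check it again. Each step returns early on failure, so a broken steady state before the fault injection stops the experiment instead of producing a misleading result.&lt;/p&gt;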

&lt;h4&gt;
  
  
  Disconnect brokers
&lt;/h4&gt;

&lt;p&gt;It is not only interesting to experiment with graceful and non-graceful restarts, but it is also interesting to experiment with network issues. This kind of fault uncovers other interesting weaknesses (bugs).&lt;/p&gt;

&lt;p&gt;With the &lt;strong&gt;zbchaos&lt;/strong&gt; CLI, it is possible to disconnect different brokers. We can specify in which partition they participate and what kind of role they have. These network partitions can also be set up in one direction if the &lt;strong&gt;--one-direction&lt;/strong&gt; flag is used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ zbchaos disconnect -h
Disconnect Zeebe nodes, uses sub-commands to disconnect leaders, followers, etc.

    Usage:
     zbchaos disconnect [command]

    Available Commands:
     brokers Disconnect Zeebe Brokers

    Flags:
     -h, --help help for disconnect

    Global Flags:
     -v, --verbose verbose output

    Use "zbchaos disconnect [command] --help" for more information about a command.

$ zbchaos disconnect brokers -h
Disconnect Zeebe Brokers with a given partition and role.

    Usage:
     zbchaos disconnect brokers [flags]

    Flags:
     --broker1NodeId int Specify the nodeId of the first Broker (default -1)
     --broker1PartitionId int Specify the partition id of the first Broker (default 1)
     --broker1Role string Specify the partition role [LEADER, FOLLOWER] of the first Broker (default "LEADER")
     --broker2NodeId int Specify the nodeId of the second Broker (default -1)
     --broker2PartitionId int Specify the partition id of the second Broker (default 2)
     --broker2Role string Specify the partition role [LEADER, FOLLOWER] of the second Broker (default "LEADER")
     -h, --help help for brokers
     --one-direction Specify whether the network partition should be setup only in one direction (asymmetric)

    Global Flags:
     -v, --verbose verbose output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The network partition will be established with &lt;a href="https://man7.org/linux/man-pages/man8/ip-route.8.html" rel="noopener noreferrer"&gt;ip route tables&lt;/a&gt;, which are installed on the specific broker pods.&lt;/p&gt;

&lt;p&gt;Right now this is only supported for the brokers, but hopefully, we will add support for the gateways soon as well.&lt;/p&gt;

&lt;p&gt;To connect the brokers again, the following can be used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;zbchaos connect brokers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This removes the ip routes on all pods again.&lt;/p&gt;
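
&lt;p&gt;A network partition experiment can be scripted in the same spirit. Again, this is only a sketch under assumptions: the function name and the &lt;code&gt;ZBCHAOS&lt;/code&gt; variable are hypothetical, the flags are the ones from the help output above, and it assumes &lt;strong&gt;zbchaos&lt;/strong&gt; returns a non-zero exit status on failure.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch only: ZBCHAOS and the function name are hypothetical helpers.
ZBCHAOS="${ZBCHAOS:-zbchaos}"

# Disconnects the leader from a follower of the given partition (one direction),
# verifies the steady state, and always reconnects the brokers afterward.
network_partition_experiment() {
  partition="$1"

  # Phase 1: the steady state must hold before we inject the fault.
  "$ZBCHAOS" verify steady-state --partitionId "$partition" || return 1

  # Phase 2: set up an asymmetric network partition.
  "$ZBCHAOS" disconnect brokers \
    --broker1PartitionId "$partition" --broker1Role LEADER \
    --broker2PartitionId "$partition" --broker2Role FOLLOWER \
    --one-direction || return 1

  # Phase 3: remember the result, but clean up before reporting it.
  status=0
  "$ZBCHAOS" verify steady-state --partitionId "$partition" || status=$?
  "$ZBCHAOS" connect brokers
  return "$status"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The result of the last verification is stored instead of returned immediately, so the brokers are reconnected even when the hypothesis does not hold. Otherwise, a failed experiment would leave the cluster partitioned for the next run.&lt;/p&gt;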

&lt;h3&gt;
  
  
  Other features
&lt;/h3&gt;

&lt;p&gt;All the described commands support a verbose flag, which allows the user to determine what kind of action is done, how it connects to the cluster, and more.&lt;/p&gt;

&lt;p&gt;For all of the commands, a bash-completion can be generated via &lt;code&gt;zbchaos completion&lt;/code&gt;, which is very handy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outcome and future
&lt;/h2&gt;

&lt;p&gt;In general, I was quite happy with the outcome of Summer Hackdays 2022, and it was a lot of fun to build and use this tool already. I was finally able to spend some more time writing Go code, and especially a Go CLI. I learned to use the Kubernetes Go client and how to write Go tests with fakes for the Kubernetes API, which was quite interesting. You can take a look at the tests &lt;a href="https://github.com/zeebe-io/zeebe-chaos/blob/main/go-chaos/internal/pods_test.go" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/zeebe-io/zeebe-chaos/tree/main/go-chaos" rel="noopener noreferrer"&gt;Code of the CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/zeebe-io/zeebe-chaos/releases" rel="noopener noreferrer"&gt;Releases&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zeebe-io.github.io/zeebe-chaos/2022/08/31/Message-Correlation-after-Network-Partition/" rel="noopener noreferrer"&gt;Example usage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We plan to extend the CLI in the future and use it in our upcoming experiments.&lt;/p&gt;

&lt;p&gt;For example, I recently did a new chaos day, a day I use to run new experiments, and &lt;a href="https://zeebe-io.github.io/zeebe-chaos/2022/08/31/Message-Correlation-after-Network-Partition/" rel="noopener noreferrer"&gt;wrote a post about it&lt;/a&gt;. For that post, I extended the CLI with features like sending messages to certain partitions.&lt;/p&gt;

&lt;p&gt;At some point, we want to use the functionality within our automated chaos experiments as Zeebe workers and replace our old bash scripts.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Thanks to Christina Ausley and Bernd Ruecker for reviewing this post :)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>chaosengineering</category>
      <category>kubernetes</category>
      <category>opensource</category>
      <category>camunda</category>
    </item>
    <item>
      <title>Advanced Test Practices For Helm Charts</title>
      <dc:creator>Christopher Kujawa</dc:creator>
      <pubDate>Mon, 28 Mar 2022 12:15:13 +0000</pubDate>
      <link>https://forem.com/camunda/advanced-test-practices-for-helm-charts-57gp</link>
      <guid>https://forem.com/camunda/advanced-test-practices-for-helm-charts-57gp</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh06xj8o9mw60xz8ahmc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh06xj8o9mw60xz8ahmc.jpeg" alt="Photo by Joseph Barrientos on Unsplash" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;&lt;sup&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@jbcreate_?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Joseph Barrientos&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/ship?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;I’m excited to share the detailed learnings and experiences I had along my journey of finding a good way to write automated tests for Helm charts. At the end of this blog post, I’ll present the solution we’re currently using, which meets all our requirements.&lt;/p&gt;

&lt;p&gt;I’m a distributed systems engineer working on the &lt;a href="https://github.com/camunda-cloud/zeebe" rel="noopener noreferrer"&gt;Camunda Zeebe&lt;/a&gt; project, which is part of &lt;a href="https://camunda.com/products/cloud/" rel="noopener noreferrer"&gt;Camunda Cloud&lt;/a&gt;. I’m highly interested in SRE topics, so I started maintaining the Helm charts for Camunda Cloud.&lt;/p&gt;

&lt;p&gt;Please be aware that these are my personal experiences and might be a bit subjective, but I try to be as objective as possible.&lt;/p&gt;
&lt;h3&gt;
  
  
  How it Began
&lt;/h3&gt;

&lt;p&gt;We started with the community-maintained Helm charts for Zeebe and &lt;a href="https://camunda.com/products/cloud/" rel="noopener noreferrer"&gt;Camunda Cloud-related tools&lt;/a&gt;, like Tasklist and Operate. This project lacked support and had stability issues.&lt;/p&gt;

&lt;p&gt;In the past, the charts were often broken, sometimes because we had added a new feature or property, and sometimes because a property had never been used before and was hidden behind a condition. We wanted to avoid that and give users a better experience.&lt;/p&gt;

&lt;p&gt;In early 2022, we at Camunda wanted to create some new Helm charts, based on the old ones we had. The new Helm charts needed to be officially supported by Camunda. In order to do that with a clear conscience, we wanted to add some automated tests to the charts.&lt;/p&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;In order to understand this blog post you should have some knowledge of the following topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://helm.sh/" rel="noopener noreferrer"&gt;Helm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://go.dev/" rel="noopener noreferrer"&gt;Golang&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Helm Testing. What is the Issue?
&lt;/h3&gt;

&lt;p&gt;Testing in the Helm world is, I would say, not as well evolved as it should be. Some tools exist, but they lack usability or need too much boilerplate code. Sometimes it’s not really clear how to use them or how to write the tests.&lt;/p&gt;

&lt;p&gt;Some posts around that topic already exist, but there aren’t many. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://faun.pub/helm-charts-testing-2091a63a83af" rel="noopener noreferrer"&gt;&lt;em&gt;Kubernetes Helm Charts Testing&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/58288784/helm-test-best-practices" rel="noopener noreferrer"&gt;&lt;em&gt;Helm Test Best Practices&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This one really helped us, &lt;a href="https://blog.gruntwork.io/automated-testing-for-kubernetes-and-helm-charts-using-terratest-a4ddc4e67344" rel="noopener noreferrer"&gt;&lt;em&gt;Automated Testing for Kubernetes and Helm Charts using Terratest.&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It explains how to test Helm charts with &lt;a href="https://terratest.gruntwork.io/" rel="noopener noreferrer"&gt;Terratest&lt;/a&gt;, a framework to write tests for Helm charts, and other Kubernetes-related things.&lt;/p&gt;

&lt;p&gt;We compared Terratest, writing golden file tests (here’s a &lt;a href="https://medium.com/@jarifibrahim/golden-files-why-you-should-use-them-47087ec994bf" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; about why you should use them), and using &lt;a href="https://github.com/helm/chart-testing" rel="noopener noreferrer"&gt;Chart Testing (CT)&lt;/a&gt;. You can find the details in this &lt;a href="https://github.com/camunda/camunda-cloud-helm/issues/125" rel="noopener noreferrer"&gt;GitHub issue&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This issue contains a comparison between the test tools, as well as some subjective field reports, which I wrote during the testing. It helped me to make some decisions.&lt;/p&gt;
&lt;h3&gt;
  
  
  What and How to Test
&lt;/h3&gt;

&lt;p&gt;First of all, we separated our tests into two parts, with different targets and goals.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Template tests (unit tests)&lt;/strong&gt;  —  Which verify the general structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration tests&lt;/strong&gt;  —  Which verify whether we can install the charts and use them.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Template tests
&lt;/h3&gt;

&lt;p&gt;With the template tests, we want to verify the general structure. This includes whether the output is valid YAML, whether the default values stay unchanged, and whether they are set at all.&lt;/p&gt;

&lt;p&gt;For template tests, we combine both golden files and Terratest. Generally speaking, golden files store the expected output of a certain command or the expected response to a specific request. In our case, the golden files contain the rendered manifests, which are the output of &lt;strong&gt;helm template&lt;/strong&gt;. This allows you to verify that the default values are set and change only in a controlled manner, which reduces the burden of writing many tests.&lt;/p&gt;

&lt;p&gt;If we want to verify specific properties (or conditions), we can use the direct property tests with Terratest. We will come to that again later.&lt;/p&gt;

&lt;p&gt;This allows us to use one tool (Terratest) and separate the tests per manifest, such as a test for Zeebe statefulset, the Zeebe gateway deployment, etc. The tests can be easily run via command line or IDE, and CI.&lt;/p&gt;
&lt;h3&gt;
  
  
  Integration tests
&lt;/h3&gt;

&lt;p&gt;With the integration tests we want to test for two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Whether the charts can be deployed to Kubernetes, and are accepted by the K8s API.&lt;/li&gt;
&lt;li&gt;Whether the services are running and can work with each other.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Other things, like broken templates, incorrectly set values, etc., are caught by the tests above.&lt;/p&gt;

&lt;p&gt;So to turn it around, here are potential failure cases we can find with such tests:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Specifications that are in the wrong place (look like valid yaml), but aren’t accepted by the K8s API.&lt;/li&gt;
&lt;li&gt;Services that aren’t becoming ready because of configuration errors, and they can’t reach each other.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first case could also be solved with other tools that validate manifests against the K8s API, but the second one could not.&lt;/p&gt;

&lt;p&gt;In order to write the integration tests, we tried out the Chart Testing tool and Terratest. We chose Terratest over Chart Testing. If you want to know why, read the next section; otherwise, you can simply skip it.&lt;/p&gt;
&lt;h4&gt;
  
  
  Chart testing
&lt;/h4&gt;

&lt;p&gt;While trying to write the tests using Chart Testing, we encountered several issues that made the tool difficult to use, and the tests difficult to maintain.&lt;/p&gt;

&lt;p&gt;For example, the options to configure the testing process seem rather limited  —  see &lt;a href="https://github.com/helm/chart-testing/blob/main/doc/ct_install.md" rel="noopener noreferrer"&gt;CT Install documentation&lt;/a&gt; for available options. In particular, during the Helm install phase, our tests deploy a lot of components (Elasticsearch, Zeebe) that take ages to become ready. However, Chart Testing times out by default after three minutes, and we didn’t find a way to adjust this type of setting. As such, we actually were never able to run a successful test using the &lt;strong&gt;ct&lt;/strong&gt;  CLI.&lt;/p&gt;

&lt;p&gt;Another painful point was the way the tests are shipped, executed, and eventually how results are reported. The Chart Testing tool wraps, simply speaking, the Helm CLI, which means it’ll run the &lt;strong&gt;helm install&lt;/strong&gt; and &lt;strong&gt;helm test&lt;/strong&gt; command. To be executed using the &lt;strong&gt;helm test&lt;/strong&gt; command, the tests have to be configured and deployed as part of the Helm chart. This means the tests have to be embedded inside a Docker image, which might not be super practical, and the Helm chart also needs to be modified to ship with the additional tests settings.&lt;/p&gt;

&lt;p&gt;If the tests fail in CI and you want to reproduce the failure, you need the &lt;strong&gt;ct&lt;/strong&gt; CLI locally, and you have to run &lt;strong&gt;ct install&lt;/strong&gt; to redeploy the whole Helm chart and run the tests. When the tests fail, the complete logs of all the containers are printed, which can be a large amount of data to inspect. We found it difficult to iterate on the tests, and quite cumbersome to debug them when they were failing.&lt;/p&gt;

&lt;p&gt;All the reasons above pushed us to use Terratest (see next section) to write the tests. The benefit here is that we have one tool for both unit and integration tests, and more control over it. It makes it easy to run and debug the tests. In general, the tests were also quite simple to write, and the failures were easy to understand.&lt;/p&gt;

&lt;p&gt;For more information regarding this, please check the comments in the &lt;a href="https://github.com/camunda/camunda-cloud-helm/issues/125#issuecomment-1034827550" rel="noopener noreferrer"&gt;GitHub issue&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Helm Chart Tests In Practice
&lt;/h3&gt;

&lt;p&gt;In the following section, I would like to present how we use Terratest, and what our new tests for our Helm charts look like.&lt;/p&gt;
&lt;h4&gt;
  
  
  Golden files test
&lt;/h4&gt;

&lt;p&gt;We wrote a base test, which renders given Helm templates and compares them against golden files. The golden files can be generated via a separate flag. The golden files are tracked in git, which allows us to see changes easily via a git diff. This means if we change any defaults, we can directly see the resulting rendered manifests. These tests ensure that the Helm chart templates render correctly and the output of the templates changes in a controlled manner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Golden Base&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;golden&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"flag"&lt;/span&gt;
    &lt;span class="s"&gt;"io/ioutil"&lt;/span&gt;

    &lt;span class="s"&gt;"regexp"&lt;/span&gt;

    &lt;span class="s"&gt;"github.com/gruntwork-io/terratest/modules/helm"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gruntwork-io/terratest/modules/k8s"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/stretchr/testify/suite"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"update-golden"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"update golden test output files"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;TemplateGoldenTest&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Suite&lt;/span&gt;
    &lt;span class="n"&gt;ChartPath&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Release&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Namespace&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;GoldenFileName&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Templates&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;SetValues&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;TemplateGoldenTest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;TestContainerGoldenTestDefaults&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;helm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;KubectlOptions&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;k8s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewKubectlOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;SetValues&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetValues&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;helm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RenderTemplate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChartPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Release&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Templates&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;regex&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;regexp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MustCompile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;`\s+helm.sh/chart:\s+.*`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;bytes&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReplaceAll&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;goldenFile&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"golden/"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GoldenFileName&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;".golden.yaml"&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ioutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;goldenFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0644&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Require&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Golden file was not writable"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ioutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;goldenFile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// then&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Require&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Golden file doesn't exist or was not readable"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Require&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
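
&lt;p&gt;One detail of the base test that is easy to miss is the regular expression, which removes the &lt;strong&gt;helm.sh/chart&lt;/strong&gt; label from the rendered output before the comparison. Presumably this keeps the golden files stable when only the chart version changes; that reading is an assumption on my side. The following sketch applies the same pattern with the standard library only. The &lt;code&gt;render&lt;/code&gt; function is a hypothetical stand-in for the output of &lt;strong&gt;helm template&lt;/strong&gt;.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"regexp"
)

// chartLabel uses the same pattern as the base test: it matches the whole
// "helm.sh/chart: ..." label line, including the whitespace in front of it.
var chartLabel = regexp.MustCompile(`\s+helm.sh/chart:\s+.*`)

// normalize drops the chart label, so two renderings that differ only in the
// chart version compare as equal.
func normalize(manifest string) string {
	return chartLabel.ReplaceAllString(manifest, "")
}

// render is a hypothetical stand-in for the output of "helm template".
func render(chartVersion string) string {
	return "metadata:\n" +
		"  labels:\n" +
		"    app: zeebe\n" +
		"    helm.sh/chart: zeebe-" + chartVersion + "\n" +
		"    app.kubernetes.io/name: zeebe\n"
}

func main() {
	before := normalize(render("1.0.0"))
	after := normalize(render("1.0.1"))

	// Both renderings are identical once the version-dependent label is gone.
	fmt.Println(before == after)
	fmt.Print(before)
}
```

&lt;p&gt;This should print &lt;code&gt;true&lt;/code&gt;, followed by the manifest without the chart label. All other labels are left untouched, so the golden files still catch any other change.&lt;/p&gt;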



&lt;p&gt;The base test allows us to easily add/write new golden file tests for each of our sub charts. For example, we have the following test for the Zeebe sub-chart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;zeebe&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"path/filepath"&lt;/span&gt;
    &lt;span class="s"&gt;"strings"&lt;/span&gt;
    &lt;span class="s"&gt;"testing"&lt;/span&gt;

    &lt;span class="s"&gt;"camunda-cloud-helm/charts/ccsm-helm/test/golden"&lt;/span&gt;

    &lt;span class="s"&gt;"github.com/gruntwork-io/terratest/modules/random"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/stretchr/testify/require"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/stretchr/testify/suite"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestGoldenDefaultsTemplate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parallel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;chartPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"../../"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;templateNames&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"serviceaccount"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"statefulset"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"configmap"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;templateNames&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;golden&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TemplateGoldenTest&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;ChartPath&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chartPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Release&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"ccsm-helm-test"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"ccsm-helm-"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToLower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UniqueId&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
            &lt;span class="n"&gt;GoldenFileName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Templates&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"charts/zeebe/templates/"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;".yaml"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we test the Zeebe resources &lt;strong&gt;service&lt;/strong&gt;, &lt;strong&gt;serviceaccount&lt;/strong&gt;, &lt;strong&gt;statefulset&lt;/strong&gt;, and &lt;strong&gt;configmap&lt;/strong&gt; with default values against the golden files. Here are the &lt;a href="https://github.com/camunda/camunda-platform-helm/tree/main/charts/camunda-platform/test/zeebe/golden" rel="noopener noreferrer"&gt;golden files&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Property test
&lt;/h3&gt;

&lt;p&gt;As described above, sometimes, we want to test specific properties, like conditions in our templates. Here it’s easier to write specific Terratest tests.&lt;/p&gt;

&lt;p&gt;We do that for each manifest, such as the &lt;strong&gt;statefulset&lt;/strong&gt;, and name the test file accordingly, for example &lt;strong&gt;statefulset_test.go&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In such a Go test file, we have a base structure, which looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;statefulSetTest&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Suite&lt;/span&gt;
    &lt;span class="n"&gt;chartPath&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;release&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;templates&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestStatefulSetTemplate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parallel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;chartPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"../../"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;statefulSetTest&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;chartPath&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chartPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;release&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"ccsm-helm-test"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"ccsm-helm-"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToLower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UniqueId&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
        &lt;span class="n"&gt;templates&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"charts/zeebe/templates/statefulset.yaml"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we want to test a condition in our templates that looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- if .Values.priorityClassName&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
      &lt;span class="na"&gt;priorityClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.priorityClassName | quote&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
      &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- end&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we can add a test for it to the &lt;strong&gt;statefulset_test.go&lt;/strong&gt; file. That would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;statefulSetTest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;TestContainerSetPriorityClassName&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// given&lt;/span&gt;
    &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;helm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;SetValues&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"zeebe.priorityClassName"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"PRIO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;KubectlOptions&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;k8s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewKubectlOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// when&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;helm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RenderTemplate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chartPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;release&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;templates&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;statefulSet&lt;/span&gt; &lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatefulSet&lt;/span&gt;
    &lt;span class="n"&gt;helm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnmarshalK8SYaml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;statefulSet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// then&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Require&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PRIO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;statefulSet&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Template&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PriorityClassName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this test, we set the &lt;strong&gt;priorityClassName&lt;/strong&gt; to a custom value like “PRIO”, render the template, and verify that the object (statefulset) contains that value.&lt;/p&gt;
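
&lt;p&gt;To see what the conditional does on its own, here is a minimal sketch that renders the same snippet with Go’s standard &lt;strong&gt;text/template&lt;/strong&gt; package, which Helm’s template language builds on. This is only an illustration, not how Terratest or Helm render a chart: &lt;strong&gt;renderSpec&lt;/strong&gt; is a made-up helper, and since &lt;strong&gt;quote&lt;/strong&gt; is not built into text/template, &lt;strong&gt;strconv.Quote&lt;/strong&gt; is assumed as a stand-in.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"text/template"
)

// specTemplate mirrors the conditional from the statefulset template above.
const specTemplate = `spec:
  {{- if .Values.priorityClassName }}
  priorityClassName: {{ .Values.priorityClassName | quote }}
  {{- end }}
`

// renderSpec is a hypothetical helper that renders specTemplate with the
// given value. Helm provides "quote" itself; strconv.Quote is a stand-in.
func renderSpec(priorityClassName string) (string, error) {
	funcs := template.FuncMap{"quote": strconv.Quote}
	tmpl, err := template.New("spec").Funcs(funcs).Parse(specTemplate)
	if err != nil {
		return "", err
	}

	// Helm exposes chart values under ".Values"; a nested map imitates that.
	data := map[string]interface{}{
		"Values": map[string]interface{}{"priorityClassName": priorityClassName},
	}

	out := new(strings.Builder)
	if err := tmpl.Execute(out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	// One render with the value set, one with it left empty.
	for _, value := range []string{"PRIO", ""} {
		rendered, err := renderSpec(value)
		if err != nil {
			panic(err)
		}
		fmt.Printf("priorityClassName=%q renders:\n%s\n", value, rendered)
	}
}
```

&lt;p&gt;With a value set, the rendered output contains the quoted &lt;strong&gt;priorityClassName&lt;/strong&gt; line. With an empty value, only &lt;strong&gt;spec:&lt;/strong&gt; remains, which is the case a second template test could cover.&lt;/p&gt;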

&lt;h3&gt;
  
  
  Integration test
&lt;/h3&gt;

&lt;p&gt;Terratest allows us to write not only template tests, but also real integration tests. This means we can access a Kubernetes cluster, create namespaces, install the Helm chart, and verify certain properties.&lt;/p&gt;

&lt;p&gt;I’ll only present the basic setup here, since covering everything would go beyond the scope of this post. If you’re interested in what our integration test looks like, &lt;a href="https://github.com/camunda/camunda-cloud-helm/blob/main/charts/camunda-platform/test/integration/integration_test.go" rel="noopener noreferrer"&gt;check this out&lt;/a&gt;. There we set up the namespaces, install the Helm charts, and test each service we deploy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic Setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;//go:build integration&lt;/span&gt;
&lt;span class="c"&gt;// +build integration&lt;/span&gt;

&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;integration&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"os"&lt;/span&gt;
    &lt;span class="s"&gt;"path/filepath"&lt;/span&gt;
    &lt;span class="s"&gt;"strings"&lt;/span&gt;
    &lt;span class="s"&gt;"time"&lt;/span&gt;

    &lt;span class="s"&gt;"context"&lt;/span&gt;
    &lt;span class="s"&gt;"testing"&lt;/span&gt;

    &lt;span class="s"&gt;"github.com/gruntwork-io/terratest/modules/helm"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gruntwork-io/terratest/modules/k8s"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gruntwork-io/terratest/modules/random"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/stretchr/testify/require"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/stretchr/testify/suite"&lt;/span&gt;
    &lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="s"&gt;"k8s.io/apimachinery/pkg/apis/meta/v1"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;integrationTest&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Suite&lt;/span&gt;
    &lt;span class="n"&gt;chartPath&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;release&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;kubeOptions&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;k8s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;KubectlOptions&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestIntegration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;chartPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"../../"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;createNamespaceName&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;kubeOptions&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;k8s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewKubectlOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"gke_&amp;lt;project&amp;gt;_europe-west1-b_&amp;lt;project-name&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;integrationTest&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;chartPath&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chartPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;release&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"zeebe-cluster-helm-it"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;kubeOptions&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;kubeOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to the properties test above, we have a base structure that sets up the test environment for the integration tests. It lets us specify the target Kubernetes cluster via kubeOptions.&lt;/p&gt;

&lt;p&gt;To separate the integration tests from the normal unit tests, we use &lt;a href="https://www.digitalocean.com/community/tutorials/customizing-go-binaries-with-build-tags" rel="noopener noreferrer"&gt;Go build tags&lt;/a&gt;. The first lines above define the tag &lt;strong&gt;integration&lt;/strong&gt;, so these tests are only compiled and run when that tag is passed, for example via &lt;strong&gt;go test -tags integration ./.../integration&lt;/strong&gt;. The &lt;strong&gt;//go:build&lt;/strong&gt; line is the current syntax; the &lt;strong&gt;// +build&lt;/strong&gt; line keeps Go versions before 1.17 working.&lt;/p&gt;

&lt;p&gt;We create the Kubernetes namespace name either randomly (using a &lt;a href="https://terratest.gruntwork.io/docs/testing-best-practices/namespacing/" rel="noopener noreferrer"&gt;helper from Terratest&lt;/a&gt;) or based on the Git commit SHA when the test is triggered by a GitHub Action. We’ll get to that later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;truncateString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;shortenStr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;str&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;shortenStr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;shortenStr&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;createNamespaceName&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="c"&gt;// if triggered by a github action the environment variable is set&lt;/span&gt;
   &lt;span class="c"&gt;// we use it to better identify the test&lt;/span&gt;
   &lt;span class="n"&gt;commitSHA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LookupEnv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"GITHUB_SHA"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"ccsm-helm-"&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;exist&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToLower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UniqueId&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;commitSHA&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;

   &lt;span class="c"&gt;// max namespace length is 63 characters&lt;/span&gt;
   &lt;span class="c"&gt;// https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;truncateString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;63&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://pkg.go.dev/github.com/stretchr/testify/suite" rel="noopener noreferrer"&gt;Go testify suite&lt;/a&gt; allows us to run functions before and after a test, which we use to create and delete a namespace.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;integrationTest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;SetupTest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;k8s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreateNamespace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kubeOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;integrationTest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;TearDownTest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;k8s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DeleteNamespace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kubeOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The example integration test is fairly simple: we install the Helm charts with default values and wait until all pods are available. For that, we can use the helpers Terratest offers, for example the one &lt;a href="https://github.com/gruntwork-io/terratest/blob/master/modules/k8s/pod.go#L107" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;integrationTest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;TestServicesEnd2End&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="c"&gt;// given&lt;/span&gt;
   &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;helm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;KubectlOptions&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kubeOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;

   &lt;span class="c"&gt;// when&lt;/span&gt;
   &lt;span class="n"&gt;helm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Install&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chartPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;release&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="c"&gt;// then&lt;/span&gt;
   &lt;span class="c"&gt;// await that all ccsm related pods become ready&lt;/span&gt;
   &lt;span class="n"&gt;pods&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;k8s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListPods&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kubeOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListOptions&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;LabelSelector&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"app=camunda-cloud-self-managed"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

   &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;pods&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;k8s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitUntilPodAvailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kubeOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As written above, our actual integration test is far more complex, but this should give you a good idea of what you can do. Since Terratest is written in Go, we can write all our tests in Go, use mechanisms like build tags, and use Go libraries like &lt;a href="https://github.com/stretchr/testify" rel="noopener noreferrer"&gt;testify&lt;/a&gt;. Terratest makes it easy to access the Kubernetes API, run Helm commands like &lt;strong&gt;install&lt;/strong&gt;, and validate the outcome. I really appreciate the verbosity: the rendered Helm templates are printed to standard output when the tests run, which helps with debugging. After implementing the integration tests, we were quite satisfied with the result and with writing the tests as code, in contrast to the separate abstraction around the tests that you would have with the Chart Testing tool.&lt;/p&gt;

&lt;p&gt;After creating such integration tests, we of course wanted to automate them. We did that with GitHub Actions (see next section).&lt;/p&gt;

&lt;h3&gt;
  
  
  Automation
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1q1sjthlrzta8by9cw3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1q1sjthlrzta8by9cw3.png" alt="GithubActions" width="489" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As written above, we automate our tests via GitHub Actions. For the normal tests, this is quite simple: &lt;a href="https://github.com/camunda/camunda-cloud-helm/blob/main/.github/workflows/go.yml" rel="noopener noreferrer"&gt;here&lt;/a&gt; is an example of how we run our normal template tests.&lt;/p&gt;

&lt;p&gt;It becomes more interesting for integration tests, where you want to connect to an external Kubernetes cluster. Since we use GKE, we use the corresponding GitHub Actions to authenticate with &lt;a href="https://github.com/google-github-actions/auth" rel="noopener noreferrer"&gt;Google Cloud&lt;/a&gt; and get the &lt;a href="https://github.com/google-github-actions/get-gke-credentials" rel="noopener noreferrer"&gt;credentials&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Follow this &lt;a href="https://github.com/google-github-actions/auth#setting-up-workload-identity-federation" rel="noopener noreferrer"&gt;guide&lt;/a&gt; to set up the needed &lt;a href="https://cloud.google.com/iam/docs/workload-identity-federation" rel="noopener noreferrer"&gt;workload identity federation&lt;/a&gt;. This is the recommended way to authenticate with Google Cloud resources from outside, and it replaces the old usage of service account keys. Workload identity federation lets you access resources directly, using a &lt;a href="https://cloud.google.com/iam/docs/creating-short-lived-service-account-credentials" rel="noopener noreferrer"&gt;short-lived access token&lt;/a&gt;, and eliminates the maintenance and security burden associated with service account keys.&lt;/p&gt;

&lt;p&gt;After setting up the workload identity federation, using it in GitHub Actions is fairly simple.&lt;/p&gt;

&lt;p&gt;As an example, we use the following in our GitHub Actions workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add "id-token" with the intended permissions.&lt;/span&gt;
&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;read'&lt;/span&gt;
  &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;write'&lt;/span&gt;

&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v3&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;auth'&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Authenticate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Google&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Cloud'&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v0'&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;workload_identity_provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;‘&amp;lt;Workload Identity Provider resource name&amp;gt;’&lt;/span&gt;
    &lt;span class="na"&gt;service_account&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;‘&amp;lt;service-account-name&amp;gt;@&amp;lt;project-id&amp;gt;.iam.gserviceaccount.com’&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;get-credentials'&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Get&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;GKE&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;credentials'&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/get-gke-credentials@v0'&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cluster_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;‘&amp;lt;cluster-name&amp;gt;’&lt;/span&gt;
    &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;europe-west1-b'&lt;/span&gt;
&lt;span class="c1"&gt;# The KUBECONFIG env var is automatically exported and picked up by kubectl.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;check-credentials'&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Check&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;credentials'&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kubectl&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;auth&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;can-i&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;create&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;deployment'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is based on the examples of &lt;a href="https://github.com/google-github-actions/auth" rel="noopener noreferrer"&gt;google-github-actions/auth&lt;/a&gt; and &lt;a href="https://github.com/google-github-actions/get-gke-credentials" rel="noopener noreferrer"&gt;google-github-actions/get-gke-credentials&lt;/a&gt;. The last step checks the credentials to verify that we have sufficient permissions to create a &lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/authorization/#checking-api-access" rel="noopener noreferrer"&gt;deployment&lt;/a&gt;, which is necessary for our integration tests.&lt;/p&gt;

&lt;p&gt;After this, you just need to install Helm and Go in your GitHub Actions job. To run the integration tests, execute go test with the integration build tag (described above). We use a &lt;a href="https://github.com/camunda/camunda-platform-helm/blob/main/Makefile" rel="noopener noreferrer"&gt;Makefile&lt;/a&gt; for that. Take a look at the &lt;a href="https://github.com/camunda/camunda-platform-helm/blob/main/.github/workflows/go-it.yaml" rel="noopener noreferrer"&gt;full GitHub action&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Last Words
&lt;/h3&gt;

&lt;p&gt;We are now quite satisfied with the new approach and tests. Writing such tests has allowed us to detect several issues in our Helm charts, which is quite rewarding. It’s fun to write and execute them (the template tests are quite fast), and it always gives us good feedback.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qt2rkc9ul5ufgctqpnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qt2rkc9ul5ufgctqpnl.png" alt="Running the template tests in GoLand" width="800" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;&lt;em&gt;Running the template tests in GoLand&lt;/em&gt;&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;Side note: what I really like about Terratest is not only its functionality and how easy it is to write tests, but also its verbosity. On each run, the complete template is printed, which is quite helpful. In addition, when a test fails, it’s clear where the issue is.&lt;/p&gt;

&lt;p&gt;I hope this knowledge and the examples above help you. Feel free to &lt;a href="https://github.com/Zelldon" rel="noopener noreferrer"&gt;contact me&lt;/a&gt; or &lt;a href="https://twitter.com/ChristopherZell" rel="noopener noreferrer"&gt;tweet me&lt;/a&gt; if you have any thoughts to share or better ideas on how to test Helm charts. :)&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;&lt;em&gt;Thanks to &lt;a href="mailto:ahmed.abouzaid@camunda.com"&gt;Ahmed AbouZaid&lt;/a&gt;, &lt;a href="mailto:jonathan.ballet@camunda.com"&gt;Jonathan Ballet&lt;/a&gt; and &lt;a href="mailto:brittany.des-vignes@camunda.com"&gt;Brittany des Vignes&lt;/a&gt; for reviewing this post.&lt;/em&gt;&lt;/sup&gt;&lt;/center&gt;

</description>
      <category>testing</category>
      <category>camunda</category>
      <category>githubactions</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
