Forem: Christopher Kujawa

Why batching matters: Real-world example of performance

Christopher Kujawa — Thu, 16 Jan 2025 21:20:38 +0000

The interaction of latency and throughput.

Have you ever had to collect a lot of items and move them from A to B in real life? I’m pretty sure you have.

It may be something like putting dishes into the dishwasher, gathering clothes for the laundry, moving house, or even rearranging some firewood, as I did.

With every task, every day, we naturally improve our performance, for example by taking more objects at a time (batching) or taking a shorter route (reducing latency), and thereby reduce the total time the work takes. In this blog post, I want to bring you closer to the theory behind that.

In my work as a software engineer at Camunda, where I do a lot of benchmarking, we talk about performance frequently. Terms like latency and throughput are used all day. Sometimes it is not clear or tangible for everyone what they mean and how they interact with each other, and especially how important the right batching is.

As I’m interested in such topics personally and professionally, I would like to share a real-world situation I had and use it as an example to explain latency, throughput, and batching. My ambition with this blog post is that you have a better understanding of the interaction between latency and throughput and why batching matters, after reading this post.

A real-world example

A while ago, I had a real-life challenge (or ambition?). I had the glorious idea to rearrange the firewood in my garden.

The wood was next to my shed (place A) and I wanted to move it next to the entrance of the property (place B).

When I started, I picked up two or three logs at a time and walked from place A to place B. I quickly realized that this was quite inefficient and would take ages (at least it felt that way to me).

I remembered that I had a wheelbarrow, so I started to fill it and walked back and forth with it until I was done. It felt (and was) far more efficient. While doing this, I thought that it is actually a great example of batching (and of finding the right limits). The idea for this blog post was born.

Example in depth

Let’s unfold the scenario and explain the example in more detail.

We have the place A, where the old woodpile is located. We want to move all the logs to place B (the new place). The way from A to B takes us around 20 seconds.

For simplicity, we take this as constant. Imagine we are a robot which walks all the time very fast, with the same speed :). In reality, this is not true (especially if you work with software and networks).

Latency

There are some good definitions of latency out there. I don’t want to replace them; I just want to bring you closer to the topic.

When we talk about latency in our example, we mean the time it takes to pick up one log at place A, walk to place B, and put it down there. This takes us 20 seconds.

Latency: A -> B = 20 s

This means the latency for moving one log is 20 seconds.

Note that for latency, lower values are better and higher values are worse. We always want to decrease latency.

Throughput

Throughput is how many logs we can move in a given unit of time. In our example, that is one (or more) per 20 seconds. Throughput is normally measured per second, so in our case:

Throughput = Amount of objects / latency

For our example, this would mean: 1/20 log/s = 0.05 log/s. Our throughput is 0.05 log/s. In other words, we can move 0.05 log per second from A to B.

Unlike latency, throughput is a value we want to increase: higher values are better. Conversely, if the latency goes down, the throughput goes up (as you can see in the formula above).

Batching

Based on the formula above, we can see that changing the number of logs we move per trip changes the throughput.

This means when we start batching, we can increase the throughput.

This is what I naturally did in the described scenario: taking more than one log and collecting them in my arms (this is our batch).

If we take three logs at a time this would mean 3/20 log/s = 0.15 log/s. With that, we tripled the throughput! But this is only true if the batching itself is free.

I claim that you have been in this situation at least once already, when collecting dishes, clothes, or whatever. It takes some time to collect them and to carry and hold more of them (adding more to the batch). We call this the delay before you start the actual task of carrying them over.

This delay is added to the actual latency of every item/log we collect. For simplicity let us say every log added to our batch takes 1 second. They are heavy, you have to pick them up, put them in your collection, etc.

This means latency is now a function:

latency(batch size) = 20s + batch size * 1s

If we come back to our example of three logs instead of one, our latency is now 23 seconds. It takes 23 seconds for a log to move from A to B, because it first needs to be put into the batch, the batch needs to be filled up to its limit (three logs), and then it is moved. This is the maximum latency in our case. The last log in the batch might have a lower latency, but the maximum is 23 s.

As our latency and batch size have changed, our throughput in consequence changed as well and is now: 3/23 log/s ~ 0.130 log/s

We can see a significant throughput increase of about 160%, to roughly 2.6 times the original (0.13 log/s vs. 0.05 log/s before), while the latency increased by only 15% (23 s vs. 20 s before).

As I mentioned above I used a wheelbarrow, so we could increase the batch size even further, maybe to 10–15 logs.

Batch size 10 logs

Latency: 20 s + 10 logs * 1 s = 30s -> 50% increase
Throughput: 10/30 = ⅓ ~ 0.333 log/s -> 567% increase (about 6.7 times the baseline)

Batch size 15 logs

Latency: 20s + 15 logs * 1 s = 35s -> 75% increase
Throughput: 15 / 35 = 3/7 ~ 0.429 log/s -> 757% increase (about 8.6 times the baseline)
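The numbers above can be reproduced with a few lines of Python. This is only a sketch of the formulas from this post; the constant and function names are my own, and the 1 second per log is the simplifying assumption made earlier.

```python
TRAVEL_SECONDS = 20.0  # walking time from A to B (from the post)
PICKUP_SECONDS = 1.0   # assumed delay per log added to the batch (from the post)


def latency(batch_size: int) -> float:
    """Maximum time in seconds for a log to get from A to B."""
    return TRAVEL_SECONDS + batch_size * PICKUP_SECONDS


def throughput(batch_size: int) -> float:
    """Logs moved per second when carrying batch_size logs per trip."""
    return batch_size / latency(batch_size)


if __name__ == "__main__":
    baseline = 1 / TRAVEL_SECONDS  # one log, no pickup delay: 0.05 log/s
    for size in (3, 10, 15):
        ratio = throughput(size) / baseline
        print(f"batch={size:2d}  latency={latency(size):.0f} s  "
              f"throughput={throughput(size):.3f} log/s  ({ratio:.1f}x baseline)")
```

Note that the latency grows by a constant 1 s per log, while the throughput gain per additional log gets smaller and smaller.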

Total Execution Time

Depending on your scenario/use case you might have to look at different metrics and tune them accordingly.

Sometimes the latency of a single object is more important than the throughput of many; sometimes it is the total execution time that is important.

In the software world, you often have endless data which you have to process and work on. In the physical world, this is different: the objects are limited. Like my woodpile (luckily).

To calculate the total execution time we need to move everything from A to B, we can use the following formula:

total execution time = total amount / batch size * latency(batch size)

Taking one log

Let’s say we have 200 logs in the woodpile that we want to move. If we took one log at a time (and, as in the baseline above, ignore the pickup delay), it would take us:

200 logs * 20 s = 4000 s ≈ 66.7 min

After around one hour we would be done with moving the woodpile.

Batch size three

If we had increased the batch size to three, then this would mean:

200 logs / 3 logs * 23 s ≈ 1533.33 s ≈ 25.6 min

We would be done after around 25 minutes when taking three logs at a time.

Batch size ten

With our wheelbarrow and taking ten logs:

200 logs / 10 logs * 30 s = 600 s = 10 min

We would be done after 10 minutes.
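The total execution time can be sketched in Python as well. The first function is the formula from this post; the second one is a variant I added that only counts whole trips, since in practice the last trip is just partly filled. Note that the one-log figure of 4000 s above uses the plain 20 s walk; with the 1 s pickup delay applied to single logs too, the formula gives 4200 s.

```python
TRAVEL_SECONDS = 20.0  # walking time from A to B (from the post)
PICKUP_SECONDS = 1.0   # assumed delay per log added to the batch (from the post)


def latency(batch_size: int) -> float:
    """Time in seconds for one trip carrying batch_size logs."""
    return TRAVEL_SECONDS + batch_size * PICKUP_SECONDS


def total_time_formula(total_logs: int, batch_size: int) -> float:
    """The post's formula: total amount / batch size * latency(batch size)."""
    return total_logs / batch_size * latency(batch_size)


def total_time_whole_trips(total_logs: int, batch_size: int) -> float:
    """Variant with whole trips only; the last trip carries the remainder."""
    full_trips, remainder = divmod(total_logs, batch_size)
    seconds = full_trips * latency(batch_size)
    if remainder:
        seconds += latency(remainder)
    return seconds


if __name__ == "__main__":
    for size in (1, 3, 10):
        seconds = total_time_formula(200, size)
        print(f"batch={size:2d}  total={seconds:7.1f} s  ({seconds / 60:.1f} min)")
```

For 200 logs and a batch size of three, the whole-trip variant comes to 1540 s instead of roughly 1533 s, so the simple formula is a good approximation here.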

This is why batching matters

We learn this naturally as kids. If we take more at once and walk less, we are faster.

It is important to note that the total execution time will behave differently when the latency, which is a function of batch size, grows significantly/non-linearly. It’s not uncommon for the latency to grow faster than the batch size, leading to the so-called latency/throughput tradeoff.
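To make this tradeoff tangible, here is a small sketch with a latency function that grows faster than the batch size. The quadratic term (0.05 s times the batch size squared) is purely hypothetical and not measured anywhere; think of an overloaded wheelbarrow that gets harder to push with every additional log.

```python
def latency(batch_size: int) -> float:
    """20 s walk + 1 s per log (both from the post) + a hypothetical quadratic penalty."""
    return 20.0 + 1.0 * batch_size + 0.05 * batch_size ** 2


def total_time(total_logs: int, batch_size: int) -> float:
    """Total execution time using the post's formula."""
    return total_logs / batch_size * latency(batch_size)


def best_batch_size(total_logs: int, candidates) -> int:
    """Batch size among the candidates with the lowest total execution time."""
    return min(candidates, key=lambda size: total_time(total_logs, size))


if __name__ == "__main__":
    for size in (5, 10, 20, 40, 60):
        print(f"batch={size:2d}  total={total_time(200, size):6.1f} s")
    print("best batch size:", best_batch_size(200, range(1, 61)))
```

With this made-up penalty, bigger is no longer always better: beyond a certain batch size, every trip becomes so slow that the total execution time goes up again.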

Conclusion

We have seen that latency and batching influence throughput and the total execution time. They all interact with each other.

Interaction of latency and throughput

It is always a tradeoff. It always depends on the situation. It is the art of finding the right balance between batch size, throughput, and good latency.

In software systems acting on requests, we need to keep the balance between being responsive, having an acceptable latency, and a good throughput (reacting/processing multiple requests at the same time). This means it doesn’t make sense to batch all requests forever and send them at once. Luckily we can parallelize more in software systems, which is a way to compensate for high latency.

This is not easily doable in a real-world scenario, except if you have a big family or a lot of friends to ask for help. We are limited by physical laws (or to be specific to our example there is no more room in our wheelbarrow).
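The effect of parallelizing can be added to the throughput formula with one extra factor. This sketch assumes that the workers (or helping friends) are completely independent and never get in each other's way, which is an idealization; the `workers` parameter is my own addition to the formula from this post.

```python
def latency(batch_size: int) -> float:
    """20 s walk + 1 s per log in the batch (both from the post)."""
    return 20.0 + batch_size * 1.0


def throughput(batch_size: int, workers: int = 1) -> float:
    """Logs per second for independent workers each carrying batch_size logs."""
    return workers * batch_size / latency(batch_size)


if __name__ == "__main__":
    print(f"1 wheelbarrow, 15 logs:  {throughput(15):.3f} log/s at {latency(15):.0f} s latency")
    print(f"4 helpers, 3 logs each:  {throughput(3, workers=4):.3f} log/s at {latency(3):.0f} s latency")
```

Under these assumptions, four helpers with small batches reach a higher throughput than one large batch, while each log still has the lower latency of the small batch.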

As we have seen, batching can and will introduce delays and impact the latency of an individual object, like the time spent adding a log to the batch. In our example, this is not an issue, as the important metric is the total execution time and the latency grows linearly with the batch size.

There are situations where latency grows non-linearly with the batch size, which causes more trouble. In general, if we reduce the latency or make sure that its growth with the batch size stays close to linear, we can improve the throughput.

I hope this gave you some insights into how latency and throughput interact with each other and why batching matters.

Thanks for reading so far. Let me know what you think and share your stories. Thank you. :)

Thanks to Lena Schoenburg and Carlo Sana for reviewing this post.

Zeebe Debug and Inspection tool

Christopher Kujawa — Wed, 23 Aug 2023 14:43:58 +0000

Have you ever had the case of an incident and didn’t know what this thing you’re running in production was actually doing or how it ended up in that state?

^{Photo from Jonathan Gallegos on Unsplash}

With Zeebe (the process automation engine powering Camunda Platform 8) we let our customers’ businesses fly. But what if the thing that makes the business fly breaks? Just as after an airplane crash, you need something to read the flight recorder on board.

Today I want to introduce you to a tool we created for Zeebe to read this “flight recorder” (the state) and support us during incidents. In the past, if Zeebe ran into processing problems, there was no way to find out the last processing state. If no exporter was configured, or the exporters hadn’t exported for a while, it was even worse, since it was not clear what the last internal engine state was.

To shed some light into the dark, we built a tool called zdb (Zeebe Debugger). zdb is a Java (17) CLI tool to inspect the internal state and log of a Zeebe partition. It was kicked off during the Camunda Summer Hackdays in 2020 (by Nico Korthout, Deepthi Akkoorath, and Christopher Kujawa) and has been maintained and developed by me since then. It has now reached version 1.8.0, which adds new features (printing and filtering the log in a nicer way).

zdb allows us to find the root cause, create fixes, and be prepared for the next incident (since failures always happen eventually). We use it in many of our incidents when we need to take a look at the current state of Zeebe, and also when investigating certain bugs. With zdb, we finally know what Zeebe was doing and how it got into that state.

In the end, the goal is always to get our customers flying again and keep them there.

In the following blog post, I want to show you some examples of how we used zdb in the past to give you some inspiration on how it might help you.

Note: The output of zdb will always be JSON, which allows us to pipe it into jq, such that we can have nicer and filterable output. This is also used in our examples below.

General statistics

Often when you start working on an incident you need to get a first overview or understanding of what the state generally contains (depending on the problems of course). Here zdb can show you statistics of how many key-value pairs are stored in the internal state (in different column families).

$ zdb state --path <path-to-runtime-or-snapshot> | jq
{
  "DEFAULT": 1,
  "KEY": 1,
  "PROCESS_VERSION": 3,
  "PROCESS_CACHE": 3,
  "PROCESS_CACHE_BY_ID_AND_VERSION": 3,
  "PROCESS_CACHE_DIGEST_BY_ID": 3,
  "ELEMENT_INSTANCE_PARENT_CHILD": 6,
  "ELEMENT_INSTANCE_KEY": 6,
  "ELEMENT_INSTANCE_CHILD_PARENT": 6,
  "VARIABLES": 12,
  "TIMERS": 2,
  "TIMER_DUE_DATES": 2,
  "JOBS": 1,
  "JOB_STATES": 1,
  "JOB_DEADLINES": 1,
  "MESSAGE_START_EVENT_SUBSCRIPTION_BY_NAME_AND_KEY": 1,
  "MESSAGE_START_EVENT_SUBSCRIPTION_BY_KEY_AND_NAME": 1,
  "EVENT_SCOPE": 3,
  "EXPORTER": 2
}

An experienced Zeebe engineer or power user can see here already how many processes have been deployed, how many instances, jobs, timers, messages, etc. have been created and are currently in the state. This often helps to determine where to look next.
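Since the output is plain JSON, it can also be post-processed with a few lines of Python instead of jq. The following sketch sorts the column families by their number of entries; the sample data is an excerpt of the output above, and the function name is my own.

```python
import json

# Excerpt of the `zdb state` output shown above.
SAMPLE = '{"VARIABLES": 12, "ELEMENT_INSTANCE_KEY": 6, "TIMERS": 2, "JOBS": 1}'


def largest_column_families(stats: dict, top: int = 3) -> list:
    """Return (name, count) pairs of the column families with the most entries."""
    return sorted(stats.items(), key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    for name, count in largest_column_families(json.loads(SAMPLE)):
        print(f"{count:6d}  {name}")
```

In a real investigation you would feed it the complete zdb output (for example read from a file) rather than the hard-coded excerpt.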

For example, if we see that there are incidents in the state and the reported failure (the ongoing incident) is about process instances not progressing, we would next check the open incidents in the state.

Restoring BPMN models

There are cases where you might lose your models, or you just want to find out which model is currently deployed or actually executed. Here zdb can help.

First, you can print all deployed process model metadata (it will show information like process definition key, version, and name).

zdb process list --path <path-to-runtime-or-snapshot> | jq
[
  {
    "bpmnProcessId": "benchmark",
    "resourceName": "bpmn/one_task.bpmn",
    "processDefinitionKey": 2251799813685363,
    "version": 1
  },
  {
    "bpmnProcessId": "timerProcess",
    "resourceName": "bpmn/timerProcess.bpmn",
    "processDefinitionKey": 2251799813685249,
    "version": 1
  },
  {
    "bpmnProcessId": "msg_one_task",
    "resourceName": "bpmn/msg_one_task.bpmn",
    "processDefinitionKey": 2251799813685581,
    "version": 1
  }
]
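If the list is long, picking the right key by hand gets tedious. As a small sketch (the helper name is hypothetical, and the data is an excerpt of the output above), the process definition keys for a given BPMN process ID can be looked up like this:

```python
import json

# Excerpt of the `zdb process list` output shown above.
PROCESS_LIST = """
[
  {"bpmnProcessId": "benchmark", "resourceName": "bpmn/one_task.bpmn",
   "processDefinitionKey": 2251799813685363, "version": 1},
  {"bpmnProcessId": "timerProcess", "resourceName": "bpmn/timerProcess.bpmn",
   "processDefinitionKey": 2251799813685249, "version": 1}
]
"""


def keys_for(processes: list, bpmn_process_id: str) -> list:
    """Return the process definition keys of all versions of a process."""
    return [p["processDefinitionKey"] for p in processes
            if p["bpmnProcessId"] == bpmn_process_id]


if __name__ == "__main__":
    print(keys_for(json.loads(PROCESS_LIST), "benchmark"))
```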

With a specific process definition key, we can print the complete process entity. Piping it to jq allows us to filter for the resource, and the --raw-output option returns the resource string without quotes. We can then redirect the output to a file and have the model restored (you can open it, for example, with the Camunda Modeler).

zdb process entity 2251799813686656 --path <path-to-runtime-or-snapshot> \
| jq --raw-output '.resource' > model.bpmn

^{Restored Model}

Instances for a specific model

Sometimes you’re interested in process instances of a specific process model.

You might have deployed a broken model and want to cancel all of the existing instances (that happened to us), but first, you need to find out all the keys of such instances.

You can use the following to print all instances for a certain process definition.

zdb process instances 2251799813685363 --path <path-to-runtime-or-snapshot> | jq

Printing the log

One of our most used zdb features is printing the entire log (default: as JSON).

zdb log print --path <path-to-log>

With the newest version (v1.8.0), zdb supports some built-in filters, like filtering for the process instance key. This means only records that correspond to a certain process instance are printed. Furthermore, we can limit the output now, with --fromPosition and --toPosition. You can read more about it here.

JSON is not the only supported output format. zdb can print the log in dot format as well, which allows tracing commands.

zdb log print --format dot --path <path-to-log> > output.dot

With Graphviz you can visualize such dot files easily:

dot -Tsvg -o output.svg output.dot

^{Trace of log}

Printing and investigating the log is interesting because not all commands can be applied, and those that cannot are not reflected in the state. There can be many reasons: some commands are rejected due to wrong user input, a wrong process instance state, etc. These commands and their rejections are still part of the log (if compaction hasn’t happened yet) and can give you some interesting insights.

I hope this small introduction and examples gave you some inspiration on how you can use zdb on your next potential incident or investigation related to Zeebe. If you want to know more, check out the GitHub repository.

Drinking Our Champagne: Chaos Experiments with Zeebe against Zeebe

Christopher Kujawa — Thu, 10 Aug 2023 19:27:58 +0000

^{Image by Camunda on Blog}

At Camunda we have a mantra: Automate Any Process, Anywhere. Additionally, we’ll often say “eat your own dog food,” or “drink your own champagne.”

Two years ago, I wrote an article about how we can use Zeebe to orchestrate our chaos experiments; I called it: BPMN meets chaos engineering. That was the result of a hack day project, in which I worked alongside my colleague Philipp Ossler.

Since then, a lot of things have changed. We made a lot of improvements to our tooling, like creating our own chaos toolkit zbchaos to make it easier to run chaos experiments against Zeebe (which reached v1.0), improving the BPMN models in use, adding more experiments to it, etc.

Today, I want to take a closer look and show you how we automate and orchestrate our chaos experiments with Zeebe against Zeebe. After reading this you will see how beneficial it is to use Zeebe as your chaos experiment orchestrator.

You can use this knowledge in order to orchestrate your own chaos experiments, set up your own QA test suite or use Zeebe as your CI/CD framework. The use cases are endless. We will show you how you leverage the observability of the Camunda Platform stack and how it can help you to understand what is currently executed or where issues may lie.

But first, let’s start with some basics.

Chaos engineering and experiments

Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system’s capability to withstand turbulent conditions in production.
https://principlesofchaos.org/

One of the principles of chaos engineering is automating defined experiments to ensure that no regression is introduced into the system at a later stage.

A chaos experiment consists of multiple stages; three are important for automation:

Verification of the steady state hypothesis
Running actions to introduce chaos
Verification of the steady state hypothesis (that it still holds or has recovered)
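In code, these three stages could be sketched as follows. This is only an illustration of the control flow, not how our tooling is implemented; the `ToyCluster` class is a made-up stand-in for a real system.

```python
from typing import Callable, Sequence


def run_experiment(verify_steady_state: Callable[[], bool],
                   actions: Sequence[Callable[[], None]]) -> str:
    """Run the three stages of a chaos experiment and report the outcome."""
    # Stage 1: the hypothesis must hold before we introduce any chaos.
    if not verify_steady_state():
        return "aborted"
    # Stage 2: run the actions that introduce chaos.
    for action in actions:
        action()
    # Stage 3: the hypothesis must still hold (or the system must have recovered).
    if not verify_steady_state():
        return "failed"
    return "passed"


class ToyCluster:
    """Hypothetical stand-in for a real cluster."""

    def __init__(self, self_healing: bool, pods: int = 3) -> None:
        self.expected_pods = pods
        self.ready_pods = pods
        self.self_healing = self_healing

    def all_pods_ready(self) -> bool:
        return self.ready_pods == self.expected_pods

    def terminate_pod(self) -> None:
        self.ready_pods -= 1
        if self.self_healing:
            self.ready_pods += 1  # the pod is rescheduled right away


if __name__ == "__main__":
    cluster = ToyCluster(self_healing=True)
    print(run_experiment(cluster.all_pods_ready, [cluster.terminate_pod]))
```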

These steps can also be cast into a BPMN model, as shown below:

^{Chaos experiment in BPMN}

That is the backbone of our chaos experiment orchestration. Let’s take a closer look at the process models we designed and use now to automate and orchestrate our chaos experiments.

BPMN meets chaos engineering

If you are interested in the resources take a look at the corresponding GitHub repository zeebe-io/zeebe-chaos/.

Chaos toolkit

The first process model is called: “chaosToolkit” because it bundles all chaos experiments together. It reads the specifications of all existing chaos experiments (the specification for each experiment is stored in a JSON file, which we will see later) and executes them one by one via a sequential multi-instance.

For readers with knowledge of BPMN, be aware that in earlier versions of Zeebe it was not possible to transfer variables with BPMN errors, which is why we used return values of CallActivities and later interrupted the SubProcess.

^{BPMN Model: ChaosToolkit}

Chaos experiment

The second BPMN model describes a single chaos experiment, which is why it is called “chaosExperiment”. It has similarities (the different stages) to the simplified version above.

Here we see the three stages, verification, introducing chaos, and verification of the steady state again.

^{BPMN Model: Chaos Experiment}

All of the call activities above are delegated to the third BPMN model.

Action

The third model is the most generic one. It will execute any action, which is defined in the process instance payload. The payload will be a chaos experiment specification. The specification can also contain timeouts and pause times which are reflected in the model as well.

^{BPMN Model: Action}

Specification

As we have seen, the BPMN process models are quite generic and all of them are enlivened via a chaos experiment specification.

The chaos experiment specification is based on the OpenChaos initiative and the Chaos Toolkit specification. We reused this specification so that we can also run these experiments locally with the Chaos Toolkit.

An example is the following experiment.json

{
    "version": "0.1.0",
    "title": "Zeebe follower restart non-graceful experiment",
    "description": "Zeebe should be fault-tolerant. Zeebe should be able to handle followers terminations.",
    "contributions": {
        "reliability": "high",
        "availability": "high"
    },
    "steady-state-hypothesis": {
        "title": "Zeebe is alive",
        "probes": [
            {
                "name": "All pods should be ready",
                "type": "probe",
                "tolerance": 0,
                "provider": {
                    "type": "process",
                    "path": "zbchaos",
                    "arguments": ["verify", "readiness"],
                    "timeout": 900
                }
            },
            {
                "name": "Can deploy process model",
                "type": "probe",
                "tolerance": 0,
                "provider": {
                    "type": "process",
                    "path": "zbchaos",
                    "arguments": ["deploy", "process"],
                    "timeout": 900
                }
            },
            {
                "name": "Should be able to create process instances on partition 1",
                "type": "probe",
                "tolerance": 0,
                "provider": {
                    "type": "process",
                    "path": "zbchaos",
                    "arguments": ["verify", "instance-creation", "--partitionId", "1"],
                    "timeout": 900
                }
            }
        ]
    },
    "method": [
        {
            "type": "action",
            "name": "Terminate follower of partition 1",
            "provider": {
                "type": "process",
                "path": "zbchaos",
                "arguments": ["terminate", "broker", "--role", "FOLLOWER", "--partitionId", "1"]
            }
        }
    ],
    "rollbacks": []
}

The first key-value pairs describe the experiment itself. The steady-state-hypothesis and its content describe the verification stage. All of the probes inside the steady-state-hypothesis are executed as actions in our third process model.

The method object describes the chaos that should be introduced into the system. In this case, it consists of one action: restarting a follower (a broker which is not the leader of the given Zeebe partition).
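To illustrate how the specification maps to the three stages, the following sketch reads a shortened version of the experiment.json above and lists the command lines in the order they would run: probes, then the method, then the probes again. It only builds the command strings and does not execute anything; it is not how the Chaos Toolkit or zbchaos process the file internally, and the function names are my own.

```python
import json
import shlex

# Shortened version of the experiment.json shown above.
SPEC = json.loads("""
{
  "steady-state-hypothesis": {
    "probes": [
      {"name": "All pods should be ready",
       "provider": {"type": "process", "path": "zbchaos",
                    "arguments": ["verify", "readiness"], "timeout": 900}}
    ]
  },
  "method": [
    {"name": "Terminate follower of partition 1",
     "provider": {"type": "process", "path": "zbchaos",
                  "arguments": ["terminate", "broker", "--role", "FOLLOWER",
                                "--partitionId", "1"]}}
  ]
}
""")


def command_line(step: dict) -> str:
    """Build the command line of a probe or action from its provider."""
    provider = step["provider"]
    return shlex.join([provider["path"], *provider["arguments"]])


def planned_commands(spec: dict) -> list:
    """Commands in execution order: verify, introduce chaos, verify again."""
    probes = [command_line(p) for p in spec["steady-state-hypothesis"]["probes"]]
    actions = [command_line(a) for a in spec["method"]]
    return probes + actions + probes


if __name__ == "__main__":
    for command in planned_commands(SPEC):
        print(command)
```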

I don’t want to go into much detail about the specification itself, but you can find several examples of the experiments we have already defined here.

Automation

Let’s imagine we have a Zeebe cluster which we want to run the experiments against. We call it Zeebe target.

As mentioned earlier, the specification is based on the chaos toolkit. This means we can (if we have zbchaos and chaos toolkit installed) run it locally via chaos run experiment.json. If Zeebe is installed in Kubernetes and we have the right Kubernetes context set, this would work with zbchaos out of the box.

Zeebe Testbench

But we can also orchestrate that with Zeebe itself, using a different Zeebe cluster that we call Zeebe Testbench.

^{Chaos experiment orchestration}

Our Zeebe Testbench cluster is in charge of orchestrating the chaos experiments. zbchaos is a job worker in this case and executes all actions, for example verifying the health of the cluster or of a node, terminating a node, creating a network partition, etc. We have seen in the chaos experiment specification above that all actions and probes reference zbchaos and specify subcommands. These are executed no matter whether zbchaos is used as a CLI tool directly or as a job worker. This means that if you execute the chaos specification with the chaos toolkit, it will execute the zbchaos CLI. If you orchestrate the experiments with Zeebe, the zbchaos workers will handle the specific actions.

From outside we are deploying the previously mentioned chaos models in Zeebe Testbench. This can happen on the setup of the Zeebe Testbench cluster (or when something changes on the models). New instances can be created either by us locally (e.g. via zbctl, or any other client), via a Timer, or by our GitHub actions.

With our GitHub actions, it is fairly easy to trigger a new Testbench run, which includes all chaos experiments, and some other tests.

^{Zeebe Testbench run}

To make this even better, we have automation to create the Zeebe target cluster automatically. That can happen before each chaosToolkit execution. This allows us to always start with a clean state (otherwise, errors might be hard to reproduce) and to avoid wasting resources when no experiment is running.

Run chaos experiments regularly

We run our chaos experiments regularly. This means we create a chaosToolkit process instance every day and execute all chaos experiments against a new Zeebe target cluster. These process instances are created by the GitHub actions mentioned earlier. This allows us to integrate the experiments into our CI, which we also use for releases, meaning that we can run such tests before every release.

You can find the related GitHub action here:

If an experiment fails or all succeed, we are notified in Slack with the help of a Slack Connector.

This happens outside of the chaosToolkit process, which is itself wrapped in other, larger process models that automate other parts, such as the creation of clusters, notifications, and the deletion of clusters mentioned before.

Benefits

Observability

With Operate, you can observe a currently running chaos experiment, what cluster it targets, what experiment and action it is currently executing, etc.

^{Operate: Running QA (ChaosToolkit process)}

In the screenshot above, we can see a currently running chaosToolkit process instance. We can observe how many experiments have been executed (on the left in the “Instance History”, highlighted in green) and how many we still need to process (based on Variables).

Furthermore, we can see in the Variables tab (with the red border) what type of experiment we currently execute: “Zeebe should be fault-tolerant. We expect that Zeebe can handle non-graceful leader restarts”, and there is even more to dive into.

If we dig deeper into the currently running experiment (we can do that by following the call-activity link), we can see that we are in the verification stage.

^{Operate: Running chaos experiment}

We are in the verification after the chaos has been introduced (highlighted in green). We can investigate which chaos action has been executed, like here (highlighted in red): “Terminate leader of partition two non-gracefully”.

When following the call activity again we see which verification is currently executed.

^{Operate: Running action}

We are verifying that all pods are ready again after the leader of partition two has been terminated. This information can be extracted from the variables (highlighted in red).

As Operate keeps the history of a process, we can also take a look at past experiments. You can check and verify which actions have been executed and which chaos has been introduced.

^{Operate: Past chaos experiment}

You can see a large history of executed chaos experiments, actions, and several other details.

^{Operate: Past action runs}

This high degree of observability is important if something fails. Here you will see directly at which stage your experiment failed, what was executed before, etc. The incident message (depending on the worker) can also include a helpful note about why a stage failed.

Drink your own champagne

This setup might sound a bit complex at first, but once you understand the generic approach, it actually isn’t. In contrast to scripting it, the BPMN automation greatly benefits observability. Furthermore, with this approach, we are still able to execute our experiments locally (which helps with development and debugging) and are able to automate them via our Zeebe Testbench cluster. It is fairly easy to use and to execute new QA runs on demand. We drink our own champagne, which helps us to improve our overall system, and that is actually the biggest benefit of this setup.

It just feels good to use our own product to automate our own processes. We can sit in the driver’s seat of the car we build and ship, feel what our users feel, and can improve based on that. It allows us to find bugs/issues earlier on, improve metrics and other observability measures, and build up confidence that our system can handle certain failure scenarios and situations.

I hope this was helpful and showed you a bit of what you can do with Zeebe. As I mentioned at the start, the use cases and possibilities for Zeebe are endless, and the whole Camunda Platform stack supports that pretty well.

—

Thanks to Christina Ausley, Deepthi Akkoorath and Sebastian Bathke for reviewing this blog post.

Looks like bitnami/elasticsearch-curator is gone

Christopher Kujawa — Wed, 28 Jun 2023 08:16:07 +0000

Maybe you have noticed that since last week (mid-June 2023) the old bitnami/elasticsearch-curator image has been removed from DockerHub, which causes several issues with our elasticsearch installation.

Since I haven’t seen an announcement anywhere (which was kind of a surprise), I just want to briefly summarize what we did to overcome this; maybe it helps others as well.

We had quite a hard time with our benchmarks and clusters, since elasticsearch was filling up and caused several issues. Normally we track the throughput and latency of several clusters in one dashboard, which didn’t look healthy at the end of last week.

We quickly realized that the curator cronjobs were no longer running and were crash looping because the images were no longer available.

You can also reproduce this via:

$ docker pull bitnami/elasticsearch-curator:5.8.4
Error response from daemon: pull access denied for bitnami/elasticsearch-curator, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

It turned out that the image has been renamed to https://hub.docker.com/r/bitnami/elasticsearch-curator-archived

$ docker pull bitnami/elasticsearch-curator-archived:5.8.4
5.8.4: Pulling from bitnami/elasticsearch-curator-archived
Digest: sha256:46c98206dfaef81705d9397bd3d962d1505c8cfe9437f86ea0258d5cbef89e7f
Status: Downloaded newer image for bitnami/elasticsearch-curator-archived:5.8.4
docker.io/bitnami/elasticsearch-curator-archived:5.8.4

If you use helm charts (like we do) and have a cronjob defined in the helm charts, upgrading the charts alone will not help. The reason is that jobs, cronjobs, etc. are immutable.

You have to delete the job/cronjob and do a helm upgrade with --reuse-values and set the right curator image.

If you use our charts (camunda-platform-helm) you have to set --set camunda-platform.retentionPolicy.image.repository=bitnami/elasticsearch-curator-archived to use the new curator image.

helm upgrade \
 "$releaseName" <YOUR-CHART> \
 --reuse-values \
 --set camunda-platform.retentionPolicy.image.repository=bitnami/elasticsearch-curator-archived

As an alternative, you can recreate the installed helm releases. To do so, you can use the following script:

# we had several clusters to fix so this was part of a loop
ns="YOUR NAMESPACE" 
# release name is in our case the namespace name
releaseName="$ns"
# Get the values for the installed chart release (to reuse them)
values=$(helm get values "$releaseName" --namespace "$ns" -o yaml)

# You can store the values into a separate file to be on the safe side
echo "$values" > "$ns-values.yaml"
# ...
# You could either set the curator image now here in the values
# or set it directly on installation
# ...
# Uninstall the chart
helm uninstall "$releaseName" --namespace "$ns"
# Install the chart inject values via stdin
echo "$values" | helm install "$releaseName" <YOUR-CHART> --namespace "$ns" --values -

Another thing we ran into was that Elasticsearch was in some cases so full that it didn’t even recover and was not able to free up space (with curator). Later we found out that you need to increase the disk size a bit, so that Elasticsearch can recover and curator can free up space.

If you ask why this change happened (renaming of the docker image), I haven’t found any resources for it. The current assumption is that the curator is deprecated with elasticsearch 8 and likely will not work anymore with 8.x.

But if you still use a lower version, you might still want or need to use the curator so I hope this will help someone.

Zeebe, or How I learned To Stop Worrying And Love Batching

Christopher Kujawa — Sat, 04 Mar 2023 09:44:55 +0000

Zeebe, or How I learned To Stop Worrying And Love Batch Processing

Hi, I’m Chris, Senior Software Engineer at Camunda. I have worked for around seven years at Camunda and on the Zeebe project for almost six years, and was recently part of a hackday effort to improve Zeebe’s process execution latency.

In the past, we have heard several reports from users where they have described that the process execution latency of Zeebe, our cloud-native workflow decision engine for Camunda Platform 8, is sometimes sub-optimal. Some of the reports raised that the latency between certain tasks in a process model is too high, others that the general process instance execution latency is too high. This of course can also be highly affected by the used hardware and wrong configurations for certain use cases, but we also know we have something to improve.

At the beginning of this year and after almost three years of COVID-19, we finally sat together in a meeting room with whiteboards to improve the situation for our users. We called that performance hackdays. It was a nice, interesting, and fruitful experience.

Basics

To dive deeper into what we tried and why, we first need to elaborate on what process instance execution latency means, and what influences it.

The image above is a process model, from which we can create an instance. The execution of such an instance goes from the start event to the end event; the time this takes is the process execution latency.

Since Zeebe is a complex distributed system, where the process engine is based on a distributed streaming platform, there are several influencing factors for the process execution latency. During our performance hackdays, we tried to sum up all potential factors and find several bottlenecks which we can improve. In the following post, I will try to summarize them on a high level and mention each briefly.

Stream processing

To execute such a process model, as we have seen above, Zeebe uses a concept called stream processing.

Each element in the process has a specific lifecycle, which is divided into the following:

^{BPMN Elements Lifecycle divided into Command/Events}

A command asks to change the state of a certain element, and an event confirms the state change. Termination can happen when elements are canceled, either internally by events or externally by users.

Commands drive the execution of a process instance. When Zeebe’s stream processor processes a command, state changes are applied (e.g. process instances are modified). Such modifications are confirmed via follow-up events. To split the execution into smaller pieces, not only are follow-up events produced, but also follow-up commands. All of these follow-up records are persisted. Later, the follow-up commands are further processed by the stream processor to continue the instance execution. The idea behind that is that these small chunks of processing should help to achieve high concurrency by alternating execution of different instances on the same partition.

Persistence

Before a new command on a partition can be processed, it must be replicated to a quorum (typically majority) of nodes. This procedure is called commit. Committing ensures a record is durable, even in case of complete data loss on an individual broker. The exact semantics of committing are defined by the raft protocol.

^{Source: https://docs.camunda.io/docs/components/zeebe/technical-concepts/clustering/#commit}

Committing of such records can be affected by network latency, for sending the records over the wire. But also by disk latency since we need to persist the records on disk on a quorum of nodes before we can mark the records as committed.
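To make the quorum idea concrete, here is a minimal Python sketch. It is not Zeebe's implementation; the function names are made up for illustration, and it simplifies by ignoring the leader's own append time. It shows how large a majority quorum is and why the commit latency is driven by the slowest follower that is still needed for the quorum.

```python
def quorum_size(replication_factor: int) -> int:
    """Smallest majority of a raft group with the given number of members."""
    return replication_factor // 2 + 1


def quorum_commit_latency_ms(follower_ack_ms: list[float]) -> float:
    """Time until a quorum has acknowledged a record (illustrative only).

    The leader already has the record, so it only waits for the
    (quorum - 1) fastest follower acknowledgements.
    """
    members = len(follower_ack_ms) + 1  # followers plus the leader
    needed = quorum_size(members) - 1
    if needed == 0:
        return 0.0
    return sorted(follower_ack_ms)[needed - 1]


print(quorum_size(3))  # 2
print(quorum_size(4))  # 3
# Three followers: the third, very slow follower does not delay the commit.
print(quorum_commit_latency_ms([70.0, 75.0, 300.0]))  # 75.0
```

With a replication factor of four, three nodes have to persist a record before it counts as committed, so one slow follower can be tolerated without affecting the commit latency.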

State

Zeebe’s state is stored in RocksDB, which is a key-value store. RocksDB persists data on disk with a log-structured merge tree (LSM Tree) and is made for fast storage environments.

The state contains information about deployed process models and current process instance executions. It is separated per partition, which means a RocksDB instance exists per partition.

Performance hackdays

When we started with the performance hackdays, we already had necessary infrastructure to run benchmarks for our improvements. We made heavy use of the Camunda Platform 8 benchmark toolkit maintained by Falko Menge.

Furthermore, we run weekly benchmarks (the so-called medic benchmarks) where we test for throughput, latency, and general stability. Benchmarks are run for four weeks to detect potential bugs, regressions, memory leaks, performance regressions, and more as early as possible. This, all the infrastructure around it (like Grafana dashboards), and the knowledge about how our system performs were invaluable for making such great progress during our hackdays.

Measurement

We measured our results continuously, and this is necessary to see if you are on the right track. For every small proof of concept (POC), we ran a new benchmark:

^{Screenshot of benchmarks over the week}

In our benchmark, we used a process based on some user requirements:

^{Benchmark Process}

Our target was a throughput of around 500 process instances per second (PI/s), with a process execution latency under one second at the 99th percentile (p99). In other words, 99% of all process instance executions should complete in under one second.
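As a small illustration of what a p99 target means, the following Python sketch computes a nearest-rank percentile over a list of latencies. The sample values are invented for the example and are not benchmark results; real dashboards typically derive percentiles from histograms instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented latencies in milliseconds: 98 fast runs and two slow ones.
latencies_ms = [200] * 98 + [950, 4000]
p99 = percentile(latencies_ms, 99)
print(p99, p99 < 1000)  # 950 True
```

Note that the single 4000ms outlier does not violate the p99 goal, which is exactly why a percentile target is more forgiving than a maximum.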

The benchmarks have been executed in the Google Kubernetes Engine. For each broker node, we assigned one n2-standard-8 node to reduce the influence of other pods running on the same node.

Each broker pod had the following configuration:

^{Benchmark configuration}

There were also some other configurations we played around with during our different experiments, but the above were the general ones. We had eight brokers running, which gives us the following partition distribution:

$ ./partitionDistribution.sh 8 24 4
Distribution:
P\N| N 0| N 1| N 2| N 3| N 4| N 5| N 6| N 7
P 0| L | F | F | F | - | - | - | -  
P 1| - | L | F | F | F | - | - | -  
P 2| - | - | L | F | F | F | - | -  
P 3| - | - | - | L | F | F | F | -  
P 4| - | - | - | - | L | F | F | F  
P 5| F | - | - | - | - | L | F | F  
P 6| F | F | - | - | - | - | L | F  
P 7| F | F | F | - | - | - | - | L  
P 8| L | F | F | F | - | - | - | -  
P 9| - | L | F | F | F | - | - | -  
P 10| - | - | L | F | F | F | - | -  
P 11| - | - | - | L | F | F | F | -  
P 12| - | - | - | - | L | F | F | F  
P 13| F | - | - | - | - | L | F | F  
P 14| F | F | - | - | - | - | L | F  
P 15| F | F | F | - | - | - | - | L  
P 16| L | F | F | F | - | - | - | -  
P 17| - | L | F | F | F | - | - | -  
P 18| - | - | L | F | F | F | - | -  
P 19| - | - | - | L | F | F | F | -  
P 20| - | - | - | - | L | F | F | F  
P 21| F | - | - | - | - | L | F | F  
P 22| F | F | - | - | - | - | L | F  
P 23| F | F | F | - | - | - | - | L

Each broker node had 12 partitions assigned. We used a replication factor of four because we wanted to mimic the geo redundancy of some of our users, who had certain process execution latency requirements. Geo redundancy introduces network latency into the system by default. We wanted to reduce the influence of such network latency on the process execution latency. To make it a bit more realistic, we used Chaos Mesh to introduce a network latency of 35ms between two brokers, resulting in a round-trip time (RTT) of 70ms.
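The distribution printed above follows a simple round-robin pattern, which can be reproduced with a few lines of Python. This is a sketch inferred from the printed table, not the actual partitionDistribution.sh script; the function name is hypothetical. The leader column is only the initial layout, since the real leader is determined by election.

```python
def distribute(nodes: int, partitions: int, replication_factor: int):
    """Round-robin layout: partition p starts on node p % nodes (marked L)
    and is replicated to the next (replication_factor - 1) nodes (marked F)."""
    layout = []
    for p in range(partitions):
        row = ["-"] * nodes
        row[p % nodes] = "L"
        for offset in range(1, replication_factor):
            row[(p + offset) % nodes] = "F"
        layout.append(row)
    return layout


layout = distribute(nodes=8, partitions=24, replication_factor=4)
for p, row in enumerate(layout):
    print(f"P{p:>2}| " + " | ".join(row))

# Count how many partitions (any role) and how many leaders each node holds.
per_node = [sum(row[n] != "-" for row in layout) for n in range(8)]
leaders_per_node = [sum(row[n] == "L" for row in layout) for n in range(8)]
print(per_node)          # [12, 12, 12, 12, 12, 12, 12, 12]
print(leaders_per_node)  # [3, 3, 3, 3, 3, 3, 3, 3]
```

The count confirms the 12 partitions per broker: 24 partitions times a replication factor of 4, spread over 8 nodes.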

To run with an evenly distributed partition leadership, we used the partitioning rebalancing API, which Zeebe provides.

Theory

Based on the benchmark process model above, we considered the impact of commands and events on the process model (and also in general).

^{Whiteboard session: Drawing commands/events}

We calculated that around 30 commands are necessary to execute the process instance from start to end.

We tried to summarize what affects the process execution latency and came to the following formula:

PEL = X * Commit Latency + Y * Processing Latency + OH
PEL - Process Execution Latency
OH - Overhead, which we haven't considered (e.g. Jobs * Job Completion Latency)

When we started, X and Y were equal, but the idea was to change factors. This is why we split them up. The other latencies were based on:

Commit Latency = Network Latency + Append Latency
Network Latency = 2 * request duration
Append Latency = Write to Disk + Flush
Processing Latency = Processing Command (apply state changes) 
                   + Commit Transaction (RocksDB) 
                   + execute side effects
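To get a feeling for the formula, here is a back-of-the-envelope calculation in Python. It uses the numbers mentioned in this post (around 30 commands, 35ms injected one-way network latency) and assumes, purely for illustration, that every commit pays the full round trip and that append latency, processing latency, and overhead are zero. The function names are hypothetical.

```python
def commit_latency_ms(request_ms, append_ms):
    """Commit Latency = Network Latency + Append Latency,
    with Network Latency = 2 * request duration."""
    network_ms = 2 * request_ms
    return network_ms + append_ms


def process_execution_latency_ms(x, y, commit_ms, processing_ms, overhead_ms=0.0):
    """PEL = X * Commit Latency + Y * Processing Latency + OH."""
    return x * commit_ms + y * processing_ms + overhead_ms


# Assumption: only the injected network latency counts, everything else is free.
commit = commit_latency_ms(request_ms=35, append_ms=0)
baseline = process_execution_latency_ms(x=30, y=30, commit_ms=commit, processing_ms=0)
print(commit, baseline)  # 70 2100
```

Under these simplified assumptions, the network part alone would already add up to about 2.1 seconds, which is above a one-second goal. This is why reducing the factor X is at least as interesting as making a single commit faster.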

Below is a picture of our whiteboard session, where we discussed potential influences and what potential solution could mitigate which factor:

^{Whiteboard session: Discussion potential factors and influences}

Proof of concepts

Based on the formula, it was a bit more clear to us what might affect the process execution latency and where it might make sense to change or reduce time. For example, reducing the append latency affects commit latency and will affect process execution latency. Additionally, reducing the factor of how often commit latency is applied will highly affect the result.

Append and commit latency

Before we started with the performance hackdays, there was one configuration already present, which we built more than two years ago and made available as an experimental feature: disabling the raft flush. We have seen several users applying it to reach certain performance targets, but it comes at a cost. It is not safe to use, since on fail-over certain guarantees of raft no longer apply.

As part of the hackdays, we were interested in similar performance, but with more safety. This is why we tried several other possibilities, and also compared them with disabling the flush completely.

Flush improvement

In one of our POCs, we tried to flush on another thread. This gave a similar performance to completely disabling the flush, but it also has similar safety issues. Combining the async flush with awaiting its completion before committing brought back the old (base) performance along with the safety, so this was no solution.

Implementing a batch flush (flushing only after a configured threshold), running it in a separate thread, and waiting for its completion degraded the performance. However, we again had better safety than with the flush disabled.

We thought about flushing async in a batch, without waiting for commit and making this configurable. This would allow users to trade safety versus performance.

Write improvement

We had a deeper look into system calls such as madvise.

Zeebe stores its log in a segmented journal, which is memory-mapped at runtime. The OS manages what is in memory at any time via the page cache, but it knows nothing about the application itself. The madvise system call allows us to provide hints to the OS on when to read/write/evict pages.

The idea was to provide hints to reduce memory churn/page faults and reduce I/O.

We tested with MADV_SEQUENTIAL, hinting that we will access the file sequentially and that a more aggressive read-ahead should be performed (while previous pages can be dropped sooner).
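For readers who have never used madvise, this is roughly what such a hint looks like. The sketch uses Python's mmap module only to illustrate the system call; it is not how Zeebe (a Java application) issues the hint, and the temporary file merely stands in for a journal segment. The hint is skipped on platforms where it is not available.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

with tempfile.NamedTemporaryFile() as f:
    # Stand-in for a journal segment: 16 pages of random bytes.
    f.write(os.urandom(16 * PAGE))
    f.flush()
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as segment:
        advised = False
        # madvise and MADV_SEQUENTIAL only exist on some platforms (e.g. Linux).
        if hasattr(segment, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
            segment.madvise(mmap.MADV_SEQUENTIAL)
            advised = True
        # Read the mapping front to back, page by page.
        total = 0
        for offset in range(0, len(segment), PAGE):
            total += len(segment[offset:offset + PAGE])

print(advised, total == 16 * PAGE)
```

The hint does not change what the program reads, only how the kernel is allowed to prefetch and evict pages, which matches the small effect we measured.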

Based on our benchmarks, we didn’t see much difference under low/mid load. However, read I/O was greatly reduced under high load. We also saw slightly increased write I/O throughput under high load due to reduced IOPS contention. In general, there was only a small improvement in throughput/latency. Surprisingly, it still showed similar page faults as before.

Reduce transaction commits

Based on our formula above, we can see that the processing latency is affected by the RocksDB write and transaction commit duration. This means reducing one of these could benefit the processing latency.

State directory separation

Zeebe stores the current state (runtime) and snapshots in different folders on disk (under the same parent). When a Zeebe broker restarts, we recreate the state (runtime) every time from a snapshot. This is to avoid having data in the state which might not have been committed yet.

This means we don’t necessarily need to keep the state (runtime) on disk, and RocksDB does a lot of IO-heavy work which might not be necessary. The idea was to separate the state directory in a way that it can be separately mounted (in Kubernetes) such that we can run RocksDB in tmpfs, for example.

Based on our benchmarks, only p30 and lower have been improved with this POC:

Disable WAL

RocksDB has a write-ahead log (WAL) to be crash resistant. This is not necessary for us, since we recreate the state every time. We considered disabling it; we will see later in this post what influence it has. It is a single configuration, which is easy to change.

Processing of uncommitted

We mentioned earlier that we have thought about changing the factor of how many commits influence the overall calculation. What if we process commands already, even if they are not committed yet, and only send results to the user if the commit of the commands is done?

We worked on a POC to implement uncommitted processing, but it was a bit more complex than we thought due to the buffering of requests, etc. This is why we didn’t find a good solution during our hackdays. We still ran a benchmark to verify how it would behave:

The results were quite interesting and promising, but we considered them a bit too good. The production-ready implementation might behave differently, since we have to consider more edge cases.

Batch processing

Part of another POC we did was something we called batch processing. The implementation was rather easy.

The idea was to process the follow-up commands directly and continue the execution of an instance until no more follow-up commands are produced. This normally means we have reached a wait state, like a service task. Camunda Platform 7 users will know this behavior, as this is the Camunda Platform 7 default. The result was promising as well:

In our example process model above, this would reduce the factor of commit latencies from ~30 commands to 15, which is significant. The best IO you can do, however, is no IO.
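The effect of batch processing on the number of commit waits can be shown with a toy model in Python. The command names and the follow-up graph below are hypothetical and far simpler than Zeebe's real record lifecycle, so the numbers are not those of the benchmark process. In both modes all follow-up records are still persisted; the difference is only whether processing waits for each of them to be committed.

```python
from collections import deque

# Hypothetical command graph: each command lists the follow-up commands it produces.
FOLLOW_UPS = {
    "CREATE_INSTANCE": ["ACTIVATE_START_EVENT"],
    "ACTIVATE_START_EVENT": ["COMPLETE_START_EVENT"],
    "COMPLETE_START_EVENT": ["ACTIVATE_TASK"],
    "ACTIVATE_TASK": [],  # wait state: a worker has to complete the job
    "COMPLETE_JOB": ["COMPLETE_TASK"],
    "COMPLETE_TASK": ["ACTIVATE_END_EVENT"],
    "ACTIVATE_END_EVENT": ["COMPLETE_END_EVENT"],
    "COMPLETE_END_EVENT": [],
}
# Commands that arrive from outside (client or worker).
EXTERNAL_COMMANDS = ["CREATE_INSTANCE", "COMPLETE_JOB"]


def count_commit_waits(external_commands, batching):
    """Count how often processing has to wait for a commit.

    Without batching, every command (external or follow-up) waits for its
    own commit. With batching, only the external command waits; its
    follow-ups are processed directly until a wait state is reached.
    """
    waits = 0
    for external in external_commands:
        pending = deque([external])
        first = True
        while pending:
            command = pending.popleft()
            if first or not batching:
                waits += 1
            first = False
            pending.extend(FOLLOW_UPS[command])
    return waits


print(count_commit_waits(EXTERNAL_COMMANDS, batching=False))  # 8
print(count_commit_waits(EXTERNAL_COMMANDS, batching=True))   # 2
```

In terms of the formula above, batching lowers the factor X while leaving the latency of a single commit untouched.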

Combining the POCs

By combining several POCs, we reached our target line which showed us that it is possible and gave us some good insights on where to invest in order to improve our system further in the future.

The improvements did not just improve the overall latency of the system. In our weekly benchmarks, we had to increase the load because the system was able to reach a higher throughput. Previously, we reached ~133 process instances per second (PI/s) on average over three partitions; now we reach 163 PI/s on average, while also reducing the latency by a factor of 2.

In the last weeks, we took several ideas from the hackdays to implement some production-ready solutions for Zeebe 8.2. For example:

We plan to work on some more like:

You can expect some better performance with the 8.2 release; I’m really looking forward to April! :)

Thanks to all participants of the hackdays for the great and fun collaboration, and to our manager (Sebastian Bathke) who made this possible. It was a really nice experience.

Participants (alphabetically sorted):

Thanks to all the reviewers of this blog post: Christina Ausley, Deepthi Devaki Akkoorath, Nicolas Pepin-Perreault, Ole Schönburg, Philipp Ossler, and Sebastian Bathke.

Zbchaos — A new fault injection tool for Zeebe

Christopher Kujawa — Thu, 15 Sep 2022 12:16:14 +0000

^{Photo by Brett Jordan on Unsplash}

During Summer Hackdays 2022, I worked on a project called “Zeebe chaos” (zbchaos), a fault injection CLI tool. This allows us engineers to more easily run chaos experiments against Zeebe, build up confidence in the system’s capabilities, and discover potential weaknesses.

Requirements

To understand this blog post, it is useful to have a certain understanding of Kubernetes and Zeebe itself.

Summer Hackdays:

Hackdays are a regular event at Camunda, where people from different departments (engineering, consulting, DevRel, etc.) work together on new ideas, pet projects, and more.

Often, the results are quite impressive and are also presented in the following CamundaCon. For example, check out the agenda of this year’s CamundaCon 2022.

Check out previous Summer Hackdays here:

Zeebe chaos CLI

Working on the Zeebe project is not only about engineering a distributed system or a process engine, it is also about testing, benchmarking, and experimenting with our capabilities.

We run regular chaos experiments against Zeebe to build up confidence in our system and to determine whether we have weaknesses in certain areas. In the past, we have written many bash scripts to inject faults (chaos). We wanted to replace them with better tooling: a new CLI. This allows us to make it more maintainable, but also lowers the barrier for others to experiment with the system.

The CLI targets Kubernetes, as this is our recommended environment for Camunda Platform 8 Self-Managed, and the environment our own SaaS offering runs on.

The tool builds upon our existing Helm charts, which are normally used to deploy Zeebe within Kubernetes.

Requirements

To use the CLI you need to have access to a Kubernetes cluster, and have our Camunda Platform 8 Helm charts deployed. Additionally, feel free to try out Camunda Platform 8 Self-Managed.

Chaos Engineering:

You might be wondering why we need this fault injection CLI tool or what this “chaos” stands for. It comes from chaos engineering, a practice we introduced back in 2019 to the Zeebe Project.

Chaos Engineering was defined by the Principles of Chaos. It should help to build confidence in the system's capabilities and find potential weaknesses through regular chaos experiments. We define and execute such experiments regularly.

Take a look at my talk at CamundaCon 2020.2 to get to know more about Chaos Engineering at Camunda (and Zeebe).

Chaos experiments

As mentioned, we regularly write and run new chaos experiments to build up confidence in our system and uncover weaknesses. The first thing you have to do for your chaos experiment is to define a hypothesis that you want to prove. For example, processing should still be possible after a node goes down. Based on the hypothesis, you know what kind of property or steady state you want to verify before and after injecting faults into the system.

A chaos experiment consists of three phases:

Verify the steady state.
Inject chaos.
Verify the steady state.

For each of these phases, the zbchaos CLI provides certain features outlined below.

Verify steady state

In the steady state phase, we want to verify certain properties of the system, like invariants, etc.

One of the first things we typically want to check is the Zeebe topology. With zbchaos you can run:

$ zbchaos topology
0 |LEADER (HEALTHY) |FOLLOWER (HEALTHY) |LEADER (HEALTHY)
1 |FOLLOWER (HEALTHY) |LEADER (HEALTHY) |FOLLOWER (HEALTHY)
2 |FOLLOWER (HEALTHY) |FOLLOWER (HEALTHY) |FOLLOWER (HEALTHY)

Zbchaos will do all the necessary magic for you: it finds a Zeebe gateway, sets up a port-forward, requests the topology, and prints it in a compact format. This makes the chaos engineer’s life much easier.

Another basic check is verifying the readiness of all deployed Zeebe components. To achieve this, we can use:

$ zbchaos verify readiness
All Zeebe nodes are running.

This verifies the status of the Zeebe broker pods and of the Zeebe gateway deployment. If one of these is not ready yet, it will loop and not return before they are ready. This is beneficial in automation scripts.

After you have verified the general health and readiness of the system, you also need to verify whether the system is working functionally. This is also called “verifying the steady state.” This can be achieved by:

$ zbchaos verify steady-state --partitionId 2

This command checks that a process model can be deployed and a process instance can be started for the specified partition. As you cannot influence the partition for new process instances, process instances are started in a loop until that partition is hit. If you don’t specify the partitionId, partition one is used.

Inject chaos

After we verify our steady state, we want to inject faults or chaos into our system, and afterward check our steady state again. The zbchaos CLI already provides several ways to inject faults, outlined below.

Before we step through how we can inject failures, we need to understand what kind of components a Zeebe cluster consists of and what the architecture looks like.

We have two types of nodes: the broker, and the gateway.

A broker is a node that does the processing work. It can participate in one or more Zeebe partitions (internally each partition is a raft group, which can consist of one or more nodes). A broker can have different roles for each partition (Leader, Follower, etc.).

For more details about the replication, check our documentation and the raft documentation.

The Zeebe gateway is the contact point to the Zeebe cluster to which clients connect. Clients send commands to the gateway and the gateway is in charge of distributing the commands to the partition leaders. This depends on the command type of course. For more details, check out the documentation.

By default, the Zeebe gateways are replicated if Camunda Platform 8 Self-Managed is installed via our Helm charts, which makes it interesting to also experiment with the gateways.

Shutdown nodes

With zbchaos, we can shut down brokers (gracefully and non-gracefully) which have a specific role and take part in a specific partition. This is quite useful in experimenting, since we often want to terminate or restart brokers based on their participation and role (e.g. terminate the leader of partition X or restart all followers of partition Y).

Graceful

A graceful restart can be initiated like this:

$ zbchaos restart -h
Restarts a Zeebe broker with a certain role and given partition.

    Usage:
      zbchaos restart [flags]

    Flags:
      -h, --help help for restart
      --partitionId int Specify the id of the partition (default 1)
      --role string Specify the partition role [LEADER, FOLLOWER, INACTIVE] (default "LEADER")

    Global Flags:
      -v, --verbose verbose output

This sends a Kubernetes delete command to the pod which takes part in the specific partition and has the specific role. This is based on the current Zeebe topology, provided by the Zeebe gateway. All of this is handled by the zbchaos toolkit. The chaos engineer doesn’t need to find this information manually.

Non-graceful

Similar to the graceful restart is the termination of the broker. It will send a delete to the specific Kubernetes Pod, and will set the --gracePeriod to zero.

$ zbchaos terminate -h
Terminates a Zeebe broker with a certain role and given partition.

    Usage:
      zbchaos terminate [flags]
      zbchaos terminate [command]

    Available Commands:
      gateway Terminates a Zeebe gateway

    Flags:
      -h, --help help for terminate
      --nodeId int Specify the nodeId of the Broker (default -1)
      --partitionId int Specify the id of the partition (default 1)
      --role string Specify the partition role [LEADER, FOLLOWER] (default "LEADER")

    Global Flags:
    -v, --verbose verbose output

    Use "zbchaos terminate [command] --help" for more information about a command.

Gateway

Both commands above target the Zeebe brokers. Sometimes, it is also interesting to target the Zeebe gateway. For that, we can just append the gateway subcommand to the restart or terminate command.
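The three phases of an experiment can be combined with the commands shown so far. The following Python sketch only assembles the zbchaos invocations that appear in this post and runs them in order; the function names are hypothetical, and the dry run at the end prints the commands instead of executing them, so it works without a cluster.

```python
import subprocess


def experiment_commands(partition_id: int, role: str = "LEADER"):
    """Return the zbchaos invocations for: verify, inject, verify again."""
    steady_state = [
        ["zbchaos", "verify", "readiness"],
        ["zbchaos", "verify", "steady-state", "--partitionId", str(partition_id)],
    ]
    inject = [
        ["zbchaos", "terminate", "--role", role, "--partitionId", str(partition_id)],
    ]
    return steady_state + inject + steady_state


def run_experiment(partition_id: int, role: str = "LEADER", run=subprocess.run):
    """Run all phases in order; with subprocess.run, a failing command
    raises and stops the experiment."""
    for command in experiment_commands(partition_id, role):
        run(command, check=True)


# Dry run: print the commands instead of executing them.
run_experiment(2, run=lambda command, check: print(" ".join(command)))
```

Passing the runner as a parameter keeps the sketch testable without a cluster; calling run_experiment(2) without the run argument would execute the real CLI.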

Disconnect brokers

It is not only interesting to experiment with graceful and non-graceful restarts, but it is also interesting to experiment with network issues. This kind of fault uncovers other interesting weaknesses (bugs).

With the zbchaos CLI, it is possible to disconnect different brokers. We can specify in which partition they participate and what kind of role they have. These network partitions can also be set up in one direction if the --one-direction flag is used.

$ zbchaos disconnect -h
Disconnect Zeebe nodes, uses sub-commands to disconnect leaders, followers, etc.

    Usage:
     zbchaos disconnect [command]

    Available Commands:
     brokers Disconnect Zeebe Brokers

    Flags:
     -h, --help help for disconnect

    Global Flags:
     -v, --verbose verbose output

    Use "zbchaos disconnect [command] --help" for more information about a command.

$ zbchaos disconnect brokers -h
Disconnect Zeebe Brokers with a given partition and role.

    Usage:
     zbchaos disconnect brokers [flags]

    Flags:
     --broker1NodeId int Specify the nodeId of the first Broker (default -1)
     --broker1PartitionId int Specify the partition id of the first Broker (default 1)
     --broker1Role string Specify the partition role [LEADER, FOLLOWER] of the first Broker (default "LEADER")
     --broker2NodeId int Specify the nodeId of the second Broker (default -1)
     --broker2PartitionId int Specify the partition id of the second Broker (default 2)
     --broker2Role string Specify the partition role [LEADER, FOLLOWER] of the second Broker (default "LEADER")
     -h, --help help for brokers
     --one-direction Specify whether the network partition should be setup only in one direction (asymmetric)

    Global Flags:
     -v, --verbose verbose output

The network partition will be established with ip route tables, which are installed on the specific broker pods.

Right now this is only supported for the brokers, but hopefully, we will add support for the gateways soon as well.

To connect the brokers again, the following can be used:

$ zbchaos connect brokers

This removes the ip routes on all pods again.
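To illustrate how a network partition can be built from ip routes, here is a small Python sketch that generates the commands for two pods. It assumes that "unreachable" routes are used, which is one common way to do this with iproute2; the exact commands zbchaos runs may differ, and the function names and IP addresses are made up for the example.

```python
def partition_commands(ip_a: str, ip_b: str, one_direction: bool = False):
    """Return (pod_ip, command) pairs that cut the connection between two pods.

    Assumption: an 'unreachable' route is added on a pod for the other
    pod's IP. With one_direction=True only the first pod gets a route,
    which makes the partition asymmetric.
    """
    commands = [(ip_a, ["ip", "route", "add", "unreachable", ip_b])]
    if not one_direction:
        commands.append((ip_b, ["ip", "route", "add", "unreachable", ip_a]))
    return commands


def heal_commands(ip_a: str, ip_b: str, one_direction: bool = False):
    """Return the commands that remove the routes again."""
    return [
        (pod, ["ip", "route", "del", "unreachable", command[-1]])
        for pod, command in partition_commands(ip_a, ip_b, one_direction)
    ]


for pod, command in partition_commands("10.0.0.1", "10.0.0.2"):
    print(f"on {pod}: {' '.join(command)}")
```

In the symmetric case both pods drop traffic to each other; in the asymmetric case the second pod can still send packets to the first, but it never gets an answer back.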

Other features

All the described commands support a verbose flag, which allows the user to determine what kind of action is done, how it connects to the cluster, and more.

For all of the commands, a bash-completion can be generated via zbchaos completion, which is very handy.

Outcome and future

In general, I was quite happy with the outcome of Summer Hackdays 2022, and it was a lot of fun to build and use this tool already. I was able to finally spend some more time writing Go code, and especially a Go CLI. I learned to use the Kubernetes go-client and how to write Go tests with fakes for the Kubernetes API, which was quite interesting. You can take a look at the tests here.

We plan to extend the CLI in the future and use it in our upcoming experiments.

For example, I recently did a new chaos day, a day I use to run new experiments, and wrote a post about it. For that post, I extended the CLI with features like sending messages to certain partitions.

At some point, we want to use the functionality within our automated chaos experiments as Zeebe workers and replace our old bash scripts.

Thanks to Christina Ausley and Bernd Ruecker for reviewing this post :)

Advanced Test Practices For Helm Charts

Christopher Kujawa — Mon, 28 Mar 2022 12:15:13 +0000

^{Photo by Joseph Barrientos on Unsplash}

I’m excited to share below the detailed learnings and experiences I had along my journey of finding a good way to write automated tests for Helm charts. At the end of this blog post, I’ll present to you the current solution we’re using, which meets all our requirements.

I’m a distributed systems engineer working on the Camunda Zeebe project, which is part of Camunda Cloud. I’m highly interested in SRE topics, so I started maintaining the Helm charts for Camunda Cloud.

Please, be aware that these are my personal experiences and might be a bit subjective, but I try to be as objective as possible.

How it Began

We started with the community-maintained Helm charts for Zeebe and Camunda Cloud-related tools, like Tasklist and Operate. This project lacked support and had stability issues.

In the past, we often had issues with the charts being broken, sometimes because we added a new feature or property, or because a property had never been used before and was hidden by a condition. We wanted to avoid that and give the users a better experience.

In early 2022, we at Camunda wanted to create some new Helm charts, based on the old ones we had. The new Helm charts needed to be officially supported by Camunda. In order to do that with a clear conscience, we wanted to add some automated tests to the charts.

Prerequisites

In order to understand this blog post you should have some knowledge of the following topics:

Helm Testing. What is the Issue?

Testing in the Helm world is, I would say, not as well evolved as it should be. Some tools exist, but they lack usability or need too much boilerplate code. Sometimes it’s not really clear how to use them or how to write tests with them.

Some posts around that topic already exist, but there aren’t many. For example:

This one really helped us, Automated Testing for Kubernetes and Helm Charts using Terratest.

It explains how to test Helm charts with Terratest, a framework to write tests for Helm charts, and other Kubernetes-related things.

We did a comparison of Terratest, writing golden file tests (here’s a blog post about why you should use them), and using Chart Testing (CT). You can find the details in this GitHub issue.

This issue contains a comparison between the test tools, as well as some subjective field reports, which I wrote during the testing. It helped me to make some decisions.

What and How to Test

First of all, we separated our tests into two parts, with different targets and goals.

Template tests (unit tests) — Which verify the general structure.
Integration tests — Which verify whether we can install the charts and use them.

Template tests

With the template tests, we want to verify the general structure. This includes whether the output is valid YAML, whether the default values remain unchanged, and whether they are set at all.

For template tests, we combine both golden files and Terratest. Generally speaking, golden files store the expected output of a certain command or the response to a specific request. In our case, the golden files contain the rendered manifests, which are the output of helm template. This allows you to verify that the default values are set and changed only in a controlled manner, which reduces the burden of writing many tests.
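The golden file technique itself is independent of the test framework. Our tests are written in Go with Terratest; the following is only a minimal Python sketch of the idea, with a hypothetical helper name and an inline manifest standing in for the output of helm template.

```python
import difflib
import tempfile
from pathlib import Path


def check_golden(rendered: str, golden_path: Path, update: bool = False) -> list[str]:
    """Compare rendered output with the stored golden file.

    Returns a unified diff (empty if both are equal). With update=True,
    the golden file is (re)written instead, which is how intended
    changes get accepted.
    """
    if update:
        golden_path.write_text(rendered)
        return []
    expected = golden_path.read_text()
    return list(difflib.unified_diff(
        expected.splitlines(), rendered.splitlines(),
        fromfile=str(golden_path), tofile="rendered", lineterm=""))


with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "statefulset.golden.yaml"
    # Stand-in for the output of `helm template`.
    manifest = "kind: StatefulSet\nspec:\n  replicas: 3\n"
    check_golden(manifest, golden, update=True)  # accept the current output
    unchanged = check_golden(manifest, golden)
    changed = check_golden(manifest.replace("3", "1"), golden)

print(unchanged == [])
print("\n".join(changed))
```

A test fails whenever the diff is not empty, so an accidental change of a default value shows up as a readable diff, while an intended change is accepted by regenerating the golden file and reviewing it.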

If we want to verify specific properties (or conditions), we can use the direct property tests with Terratest. We will come to that again later.

This allows us to use one tool (Terratest) and separate the tests per manifest, such as a test for the Zeebe statefulset, the Zeebe gateway deployment, etc. The tests can easily be run via the command line, an IDE, or CI.

Integration tests

With the integration tests we want to test for two things:

Whether the charts can be deployed to Kubernetes, and are accepted by the K8s API.
Whether the services are running and can work with each other.

Other things, like broken templates, incorrectly set values, etc., are caught by the tests above.

So to turn it around, here are potential failure cases we can find with such tests:

Specifications that are in the wrong place (look like valid yaml), but aren’t accepted by the K8s API.
Services that aren’t becoming ready because of configuration errors, and they can’t reach each other.

The first case could also be covered by other tools that validate manifests against the K8s API, but the second one could not.

In order to write the integration tests, we tried out the Chart Testing tool and Terratest. We chose Terratest over Chart Testing. If you want to know why, read the next section; otherwise, you can skip it.

Chart testing

While trying to write the tests using Chart Testing, we encountered several issues that made the tool difficult to use, and the tests difficult to maintain.

For example, the options to configure the testing process seem rather limited — see CT Install documentation for available options. In particular, during the Helm install phase, our tests deploy a lot of components (Elasticsearch, Zeebe) that take ages to become ready. However, Chart Testing times out by default after three minutes, and we didn’t find a way to adjust this type of setting. As such, we actually were never able to run a successful test using the ct CLI.

Another painful point was the way the tests are shipped, executed, and eventually how results are reported. The Chart Testing tool wraps, simply speaking, the Helm CLI, which means it’ll run the helm install and helm test command. To be executed using the helm test command, the tests have to be configured and deployed as part of the Helm chart. This means the tests have to be embedded inside a Docker image, which might not be super practical, and the Helm chart also needs to be modified to ship with the additional tests settings.

If the tests fail in CI and you want to reproduce the failure, you need the ct CLI locally and have to run ct install to redeploy the whole Helm chart and run the tests. When the tests fail, the complete logs of all the containers are printed, which can be a large amount of data to inspect. We found it difficult to iterate on the tests, and quite cumbersome to debug them when they were failing.

All the reasons above pushed us to use Terratest (see next section) to write the tests. The benefit here is that we have one tool for both kinds of tests (unit and integration), and more control over it. It makes it easy to run and debug the tests. In general, the tests were also quite simple to write, and the failures were easy to understand.

For more information regarding this, please check the comments in the GitHub issue.

Helm Chart Tests In Practice

In the following section, I would like to present how we use Terratest, and what our new tests for our Helm charts look like.

Golden files test

We wrote a base test, which renders given Helm templates and compares them against golden files. The golden files can be generated via a separate flag. The golden files are tracked in git, which allows us to see changes easily via a git diff. This means if we change any defaults, we can directly see the resulting rendered manifests. These tests ensure that the Helm chart templates render correctly and the output of the templates changes in a controlled manner.

Golden Base

package golden

import (
    "flag"
    "io/ioutil"

    "regexp"

    "github.com/gruntwork-io/terratest/modules/helm"
    "github.com/gruntwork-io/terratest/modules/k8s"
    "github.com/stretchr/testify/suite"
)

var update = flag.Bool("update-golden", false, "update golden test output files")

type TemplateGoldenTest struct {
    suite.Suite
    ChartPath string
    Release string
    Namespace string
    GoldenFileName string
    Templates []string
    SetValues map[string]string
}

func (s *TemplateGoldenTest) TestContainerGoldenTestDefaults() {
    options := &helm.Options{
        KubectlOptions: k8s.NewKubectlOptions("", "", s.Namespace),
        SetValues: s.SetValues,
    }
    output := helm.RenderTemplate(s.T(), options, s.ChartPath, s.Release, s.Templates)
    regex := regexp.MustCompile(`\s+helm.sh/chart:\s+.*`)
    bytes := regex.ReplaceAll([]byte(output), []byte(""))
    output = string(bytes)

    goldenFile := "golden/" + s.GoldenFileName + ".golden.yaml"

    if *update {
        err := ioutil.WriteFile(goldenFile, bytes, 0644)
        s.Require().NoError(err, "Golden file was not writable")
    }

    expected, err := ioutil.ReadFile(goldenFile)

    // then
    s.Require().NoError(err, "Golden file doesn't exist or was not readable")
    s.Require().Equal(string(expected), output)
}
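To make the regular expression in the base test more tangible, here is a small standalone sketch that applies the same pattern to a made-up manifest fragment. My reading is that the helm.sh/chart label is stripped because it typically contains the chart version, which would otherwise change every golden file on each version bump; that motivation is an assumption. The helper name stripChartLabel and the sample manifest are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// chartLabel matches the helm.sh/chart label line, including the whitespace
// (and the newline) in front of it. It is the same pattern as in the base test.
var chartLabel = regexp.MustCompile(`\s+helm.sh/chart:\s+.*`)

// stripChartLabel removes every helm.sh/chart label line from a rendered manifest.
func stripChartLabel(manifest string) string {
	return chartLabel.ReplaceAllString(manifest, "")
}

func main() {
	// Made-up manifest fragment, for illustration only.
	manifest := "metadata:\n  labels:\n    app: zeebe\n    helm.sh/chart: zeebe-0.0.1\n    team: demo\n"
	fmt.Print(stripChartLabel(manifest))
}
```

Because the pattern starts with \s+, the line break in front of the label is removed as well, so no empty line is left behind in the golden file.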

The base test allows us to easily add/write new golden file tests for each of our sub charts. For example, we have the following test for the Zeebe sub-chart:

package zeebe

import (
    "path/filepath"
    "strings"
    "testing"

    "camunda-cloud-helm/charts/ccsm-helm/test/golden"

    "github.com/gruntwork-io/terratest/modules/random"
    "github.com/stretchr/testify/require"
    "github.com/stretchr/testify/suite"
)

func TestGoldenDefaultsTemplate(t *testing.T) {
    t.Parallel()

    chartPath, err := filepath.Abs("../../")
    require.NoError(t, err)
    templateNames := []string{"service", "serviceaccount", "statefulset", "configmap"}

    for _, name := range templateNames {
        suite.Run(t, &golden.TemplateGoldenTest{
            ChartPath: chartPath,
            Release: "ccsm-helm-test",
            Namespace: "ccsm-helm-" + strings.ToLower(random.UniqueId()),
            GoldenFileName: name,
            Templates: []string{"charts/zeebe/templates/" + name + ".yaml"},
        })
    }
}

Here, we test the Zeebe resources service, serviceaccount, statefulset, and configmap with default values against golden files. Here are the golden files.

Property test:

As described above, sometimes, we want to test specific properties, like conditions in our templates. Here it’s easier to write specific Terratest tests.

We do that for each manifest, like the statefulset, and name the file accordingly, for example statefulset_test.go.

In such a Go test file, we have a base structure, which looks like this:

type statefulSetTest struct {
    suite.Suite
    chartPath string
    release string
    namespace string
    templates []string
}

func TestStatefulSetTemplate(t *testing.T) {
    t.Parallel()

    chartPath, err := filepath.Abs("../../")
    require.NoError(t, err)

    suite.Run(t, &statefulSetTest{
        chartPath: chartPath,
        release: "ccsm-helm-test",
        namespace: "ccsm-helm-" + strings.ToLower(random.UniqueId()),
        templates: []string{"charts/zeebe/templates/statefulset.yaml"},
    })
}

If we want to test a condition in our templates, which looks like this:

spec:
      {{- if .Values.priorityClassName }}
      priorityClassName: {{ .Values.priorityClassName | quote }}
      {{- end }}

Then, we can easily add such tests to the statefulset_test.go file. That would look like this:

func (s *statefulSetTest) TestContainerSetPriorityClassName() {
    // given
    options := &helm.Options{
        SetValues: map[string]string{
            "zeebe.priorityClassName": "PRIO",
        },
        KubectlOptions: k8s.NewKubectlOptions("", "", s.namespace),
    }

    // when
    output := helm.RenderTemplate(s.T(), options, s.chartPath, s.release, s.templates)
    // v1 refers to the k8s.io/api/apps/v1 package
    var statefulSet v1.StatefulSet
    helm.UnmarshalK8SYaml(s.T(), output, &statefulSet)

    // then
    s.Require().Equal("PRIO", statefulSet.Spec.Template.Spec.PriorityClassName)
}

In this test, we set the priorityClassName to a custom value like “PRIO”, render the template, and verify that the object (statefulset) contains that value.
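Helm templates are built on Go's text/template package, so the behavior of such a condition can be sketched without Helm at all. The following is a minimal illustration, not how Helm or Terratest render charts: the quote function is a hand-written stand-in for the one Helm provides, and the render helper and the template name are made up for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// snippet mirrors the conditional from the statefulset template.
const snippet = `spec:
  {{- if .Values.priorityClassName }}
  priorityClassName: {{ .Values.priorityClassName | quote }}
  {{- end }}
`

// render executes the snippet with the given values, similar in spirit to
// rendering a chart with a set of values.
func render(values map[string]any) (string, error) {
	funcs := template.FuncMap{
		// Hand-written stand-in for the quote function that Helm offers.
		"quote": func(v any) string { return fmt.Sprintf("%q", v) },
	}
	tmpl, err := template.New("statefulset").Funcs(funcs).Parse(snippet)
	if err != nil {
		return "", err
	}
	var out bytes.Buffer
	if err := tmpl.Execute(&out, map[string]any{"Values": values}); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	withValue, _ := render(map[string]any{"priorityClassName": "PRIO"})
	withoutValue, _ := render(map[string]any{})
	fmt.Printf("%q\n%q\n", withValue, withoutValue)
}
```

A property test covers exactly these two branches: with the value set, the field shows up in the rendered manifest, and without it the field is omitted.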

Integration test

Terratest allows us to write not only template tests, but also real integration tests. This means we can access a Kubernetes cluster, create namespaces, install the Helm chart, and verify certain properties.

I’ll only present the basic setup here, since otherwise, it would go too far. If you’re interested in what our integration test looks like, check this out. Here we set up the namespaces, install the Helm charts, and test each service we deploy.

Basic Setup:

//go:build integration
// +build integration

package integration

import (
    "os"
    "path/filepath"
    "strings"
    "testing"
    "time"

    "github.com/gruntwork-io/terratest/modules/helm"
    "github.com/gruntwork-io/terratest/modules/k8s"
    "github.com/gruntwork-io/terratest/modules/random"
    "github.com/stretchr/testify/require"
    "github.com/stretchr/testify/suite"
    // v1 provides the ListOptions used when listing pods
    v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type integrationTest struct {
    suite.Suite
    chartPath string
    release string
    namespace string
    kubeOptions *k8s.KubectlOptions
}

func TestIntegration(t *testing.T) {
    chartPath, err := filepath.Abs("../../")
    require.NoError(t, err)

    namespace := createNamespaceName()
    kubeOptions := k8s.NewKubectlOptions("gke_<project>_europe-west1-b_<project-name>", "", namespace)

    suite.Run(t, &integrationTest{
        chartPath: chartPath,
        release: "zeebe-cluster-helm-it",
        namespace: namespace,
        kubeOptions: kubeOptions,
    })
}

Similar to the property test above, we have a base structure that allows us to write the integration tests and set up the test environment. It allows us to specify the target Kubernetes cluster via kubeOptions.

In order to separate the integration tests from the normal unit tests, we use Go build tags. The first lines above define the tag integration, which means these tests only run when the tag is set, via go test -tags integration ./.../integration.

We create the Kubernetes namespace name either randomly (using a helper from Terratest) or based on the git commit, if triggered as a GitHub action. We’ll get to that later.

func truncateString(str string, num int) string {
   shortenStr := str
   if len(str) > num {
      shortenStr = str[0:num]
   }
   return shortenStr
}

func createNamespaceName() string {
   // if triggered by a github action the environment variable is set
   // we use it to better identify the test
   commitSHA, exist := os.LookupEnv("GITHUB_SHA")
   namespace := "ccsm-helm-"
   if !exist {
      namespace += strings.ToLower(random.UniqueId())
   } else {
      namespace += commitSHA
   }

   // max namespace length is 63 characters
   // https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names
   return truncateString(namespace, 63)
}
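Since the namespace name is assembled from several parts and then truncated, it can be useful to check that the result is still a valid DNS label. The sketch below is not part of our test code; it uses the label pattern from the Kubernetes naming documentation linked above, and the function name isValidNamespaceName is made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// dnsLabel is the RFC 1123 label pattern documented by Kubernetes:
// lowercase alphanumerics or '-', starting and ending with an
// alphanumeric character.
var dnsLabel = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// isValidNamespaceName reports whether name could be used as a namespace name.
func isValidNamespaceName(name string) bool {
	return len(name) <= 63 && dnsLabel.MatchString(name)
}

func main() {
	fmt.Println(isValidNamespaceName("ccsm-helm-abc123")) // lowercase, ends alphanumeric
	fmt.Println(isValidNamespaceName("ccsm-helm-"))       // ends with '-'
	fmt.Println(isValidNamespaceName("CCSM-helm-abc123")) // contains uppercase letters
}
```

This also shows why the random ID is passed through strings.ToLower in createNamespaceName: uppercase characters are not allowed in a namespace name.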

The testify suite allows us to run functions before and after each test, which we use to create and delete a namespace.

func (s *integrationTest) SetupTest() {
   k8s.CreateNamespace(s.T(), s.kubeOptions, s.namespace)
}

func (s *integrationTest) TearDownTest() {
   k8s.DeleteNamespace(s.T(), s.kubeOptions, s.namespace)
}

The example integration test is fairly simple: we install the Helm charts with default values and wait until all pods are available. For that, we can use some of the helpers that Terratest offers.

func (s *integrationTest) TestServicesEnd2End() {
   // given
   options := &helm.Options{
      KubectlOptions: s.kubeOptions,
   }

   // when
   helm.Install(s.T(), options, s.chartPath, s.release)

   // then
   // await that all ccsm related pods become ready
   pods := k8s.ListPods(s.T(), s.kubeOptions, v1.ListOptions{LabelSelector: "app=camunda-cloud-self-managed"})

   for _, pod := range pods {
      k8s.WaitUntilPodAvailable(s.T(), s.kubeOptions, pod.Name, 100, 1*time.Second)
   }
}

As written above, our actual integration test is far more complex, but this should give you a good idea of what you can do. Since Terratest is written in Go, we can write all our tests in Go, use mechanics like build tags, and use Go libraries like testify. Terratest makes it easy to access the Kubernetes API, run Helm commands like install, and validate the outcome. I really appreciate the verbosity, since the rendered Helm templates are also printed to standard out when running the tests, which helps to debug them. After implementing the integration tests, we were quite satisfied with the result and with the test coding approach, which stands in contrast to the separate abstraction around the tests that you would have with the Chart Testing tool.

After creating such integration tests, we, of course, wanted to automate them. We did that with GitHub Actions (see next section).

Automation

As written above, we automate our tests via GitHub Actions. For normal tests, this is quite simple. You can find an example here of how we run our normal template tests.

It becomes more interesting for integration tests, where you want to connect to an external Kubernetes cluster. Since we use GKE, we also use the corresponding GitHub actions to authenticate with Google Cloud and get the credentials.

Follow this guide to set up the needed workload identity federation. This is the recommended way to authenticate with Google Cloud resources from outside, and it replaces the old usage of service account keys. Workload identity federation lets you access resources directly, using a short-lived access token, and eliminates the maintenance and security burden associated with service account keys.

After setting up the workload identity federation, the usage in GitHub Actions is fairly simple.

As an example, we use the following in our GitHub action:

# Add "id-token" with the intended permissions.
permissions:
  contents: 'read'
  id-token: 'write'

steps:
- uses: actions/checkout@v3
- id: 'auth'
  name: 'Authenticate to Google Cloud'
  uses: 'google-github-actions/auth@v0'
  with:
    workload_identity_provider: '<Workload Identity Provider resource name>'
    service_account: '<service-account-name>@<project-id>.iam.gserviceaccount.com'
- id: 'get-credentials'
  name: 'Get GKE credentials'
  uses: 'google-github-actions/get-gke-credentials@v0'
  with:
    cluster_name: '<cluster-name>'
    location: 'europe-west1-b'
# The KUBECONFIG env var is automatically exported and picked up by kubectl.
- id: 'check-credentials'
  name: 'Check credentials'
  run: 'kubectl auth can-i create deployment'

This is based on the examples of google-github-actions/auth and google-github-actions/get-gke-credentials. Checking the credentials is the last step; it verifies whether we have enough permissions to create a deployment, which is necessary for our integration tests.

After this, you just need to install Helm and Go in your GitHub action container. In order to run the integration test, you can execute go test with the integration build tag (described above). We use a Makefile for that. Take a look at the full GitHub action.

Last Words

We are now quite satisfied with the new approach and tests. Writing such tests has allowed us to detect several issues in our Helm charts, which is quite rewarding. It’s fun to write and execute them (the template tests are quite fast), and it always gives us good feedback.

Figure: Running the template tests in GoLand

Side note: what I really like about Terratest is not only the functionality and how easy it is to write the tests, but also its verbosity. On each run, the complete template is printed, which is quite helpful. In addition, on an error, it’s clear where the issue is.

I hope this knowledge and the examples above help you. Feel free to contact me or tweet me if you have any thoughts to share or better ideas on how to test Helm charts. :)

Thanks to Ahmed AbouZaid, Jonathan Ballet, and Brittany des Vignes for reviewing this post.
