<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Camunda</title>
    <description>The latest articles on Forem by Camunda (@camunda).</description>
    <link>https://forem.com/camunda</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F4009%2F5f19c07c-1c70-4bb2-9d51-dc8f742d6eb7.png</url>
      <title>Forem: Camunda</title>
      <link>https://forem.com/camunda</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/camunda"/>
    <language>en</language>
    <item>
      <title>From Insights to Action: Harnessing Camunda Optimize for Effective Development</title>
      <dc:creator>Samantha Holstine</dc:creator>
      <pubDate>Mon, 26 Feb 2024 13:00:00 +0000</pubDate>
      <link>https://forem.com/camunda/from-insights-to-action-harnessing-camunda-optimize-for-effective-development-5bd2</link>
      <guid>https://forem.com/camunda/from-insights-to-action-harnessing-camunda-optimize-for-effective-development-5bd2</guid>
      <description>&lt;p&gt;As a new Camunda user, I am actively learning the wide toolset the product has to offer and understanding its impact on my new role at Camunda. While navigating through the &lt;a href="https://academy.camunda.com/" rel="noopener noreferrer"&gt;Camunda Academy&lt;/a&gt; tutorials, my curiosity prompted me to explore all of Camunda’s components, leading me to discover &lt;a href="https://docs.camunda.io/optimize/components/what-is-optimize/?&amp;amp;utm_medium=social&amp;amp;utm_source=youtube&amp;amp;utm_content=video&amp;amp;utm_term=devrel-sam-optimize-devs" rel="noopener noreferrer"&gt;Optimize&lt;/a&gt;. As a developer, I was hesitant at first to learn more about a feature that is primarily designed for business logic, but immediately saw how developers can leverage its features to enhance their development process and gain valuable information. This blog post will guide you through my findings as I learn more about Camunda and Optimize, using a simple &lt;a href="https://camunda.com/blog/2024/01/creating-engaging-trivia-experiences-developers-perspective/" rel="noopener noreferrer"&gt;trivia game&lt;/a&gt; as a practical illustration.&lt;/p&gt;

&lt;h2&gt;Error and Incident Optimization&lt;/h2&gt;

&lt;p&gt;One of the key features of Optimize is its ability to help developers optimize their processes to produce fewer errors in the long term. By leveraging &lt;a href="https://docs.camunda.io/optimize/components/userguide/creating-reports/#creating-a-single-report/?&amp;amp;utm_medium=social&amp;amp;utm_source=youtube&amp;amp;utm_content=video&amp;amp;utm_term=devrel-sam-optimize-devs" rel="noopener noreferrer"&gt;premade reports&lt;/a&gt;, developers can identify incident hotspots and bottlenecks using heatmaps.&lt;/p&gt;

&lt;p&gt;When adding a “Locate incident hotspots on a heatmap” template to an Optimize dashboard, developers can produce a visual representation of incidents based on both the resolution duration—how long it takes to resolve a specific issue—and by count—how many incidents occurred. By hovering over any task on the heatmap, developers can quickly analyze the duration and frequency of incidents. Additionally, filters can be applied to narrow down the data, such as by date and time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rvcyhwhdoq0wheiptva.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rvcyhwhdoq0wheiptva.png" alt="Trivia-timespent-optimize" width="800" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our trivia game example, the heatmap can display the average time each game spends at a specific task, in this case getting a hint from OpenAI. When such a task is backed by a specific piece of code, developers can quickly pinpoint where issues may lie, such as a bug in the code or a service that is down.&lt;/p&gt;

&lt;p&gt;Another useful template in Optimize is the “locating bottlenecks” report. This heatmap showcases the average time spent on each part of a process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbt3fueyset6r69p630n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbt3fueyset6r69p630n.png" alt="Trivia-bottlenecks-optimize" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, in the trivia game, answering questions and displaying messages took the longest amount of time. By analyzing this data in their product, developers can identify potential areas of improvement or automation. This information can help optimize the code and improve the overall user experience.&lt;/p&gt;

&lt;h2&gt;Creating Custom Reports&lt;/h2&gt;

&lt;p&gt;In addition to pre-built templates, Optimize allows developers to create their own custom reports. With a blank report, developers have the freedom to display any data they desire. For instance, the number of categories played in the trivia game can be showcased.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnxzmutetjj3zlv1lfr1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnxzmutetjj3zlv1lfr1.png" alt="Trivia-categories-optimize" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s not surprising to see that science/nature was our top category at a developer conference! Custom reports like this provide a comprehensive overview of the process’s performance and user preferences, as well as visibility into specific variables so you can verify they’re producing the correct values.&lt;/p&gt;

&lt;h2&gt;In-depth Process Analysis&lt;/h2&gt;

&lt;p&gt;Optimize offers an &lt;a href="https://docs.camunda.io/optimize/components/userguide/process-analysis/task-analysis/?&amp;amp;utm_medium=social&amp;amp;utm_source=youtube&amp;amp;utm_content=video&amp;amp;utm_term=devrel-sam-optimize-devs" rel="noopener noreferrer"&gt;analysis tab&lt;/a&gt; that provides additional insights into data and performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgz54ydtytjjf8jnl47ns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgz54ydtytjjf8jnl47ns.png" alt="Trivia-task-analysis-optimize" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://docs.camunda.io/optimize/components/userguide/process-analysis/task-analysis/" rel="noopener noreferrer"&gt;“Task Analysis” tab&lt;/a&gt;, developers can explore heatmaps that highlight outliers, meaning process instances that deviate from the average, which helps identify potential issues or inefficiencies in the process. From there, instance IDs and further details help developers dig into problems with specific tasks or parts of the process. Task analysis is also useful for verifying that new versions of a process run similarly, as well as for tracking improvements between versions. Similarly, the &lt;a href="https://docs.camunda.io/optimize/components/userguide/process-analysis/branch-analysis/" rel="noopener noreferrer"&gt;“Branch Analysis” tab&lt;/a&gt; provides the probability of instances following a desired path.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1b4mxzemsc0dc4pdvcr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1b4mxzemsc0dc4pdvcr.png" alt="Trivia-branch-analysis-optimize" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, in the trivia game, developers can analyze how many players who received hints were able to reach the winning path. These insights can guide further improvements and optimizations.&lt;/p&gt;

&lt;p&gt;Optimize also offers a &lt;a href="https://docs.camunda.io/optimize/components/userguide/additional-features/ml-dataset/" rel="noopener noreferrer"&gt;machine learning-ready data set&lt;/a&gt;, enabling developers to export and analyze their data using machine learning techniques. This means the data is already formatted and structured to align with machine learning algorithms to make predictions for future instances based on existing instances. The machine learning-ready data set simplifies the process of integrating Optimize with machine learning models and allows for &lt;a href="https://camunda.com/blog/2023/05/ai-powered-process-optimization-ml-ready-dataset/" rel="noopener noreferrer"&gt;more advanced analysis and predictions&lt;/a&gt;.&lt;/p&gt;
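&lt;p&gt;To make that concrete, here is a toy sketch of the idea: one row per finished process instance, and a prediction for a new instance based on the most similar existing one. The columns and numbers are hypothetical (not Optimize’s actual export schema), and a naive nearest-neighbour rule stands in for a real machine learning model:&lt;/p&gt;

```java
// Illustrative only: the columns {durationMs, retries, hadIncident} and all
// numbers are hypothetical, and 1-nearest-neighbour stands in for a real model.
import java.util.Arrays;
import java.util.Comparator;

public class IncidentPredictor {

  // Stand-in for a data set exported from Optimize: one row per process instance.
  static final int[][] EXPORTED = {
      {1200, 0, 0}, {300, 0, 0}, {5000, 3, 1}, {450, 1, 0}, {7000, 4, 1}, {600, 0, 0},
  };

  // Predict whether a new instance will hit an incident, based on the nearest row.
  static int predictIncident(int durationMs, int retries) {
    int[] nearest = Arrays.stream(EXPORTED)
        .min(Comparator.comparingLong((int[] r) -&gt; squaredDistance(r, durationMs, retries)))
        .orElseThrow();
    return nearest[2];
  }

  static long squaredDistance(int[] row, int durationMs, int retries) {
    long dd = row[0] - durationMs;
    long dr = row[1] - retries;
    return dd * dd + dr * dr;
  }

  public static void main(String[] args) {
    // A slow, retry-heavy instance sits closest to the incident rows.
    System.out.println(predictIncident(4000, 2)); // prints 1
  }
}
```

&lt;p&gt;A real setup would train a proper model on the exported data set instead, but the workflow is the same: export, train, predict.&lt;/p&gt;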

&lt;h2&gt;Let’s Review&lt;/h2&gt;

&lt;p&gt;No matter what type of developer you are—whether you’re an enterprise professional, a startup enthusiast, an engineering manager, or just starting your coding journey—Optimize offers a diverse range of insights that cater to your specific needs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Efficient Issue Resolution:&lt;/strong&gt; By utilizing Optimize’s error and incident optimization features, developers can identify and address bottlenecks and issues efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive Performance Views:&lt;/strong&gt; Custom reports and dashboards provide a comprehensive view of the application’s performance and user interactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In-Depth Analysis for Improvement:&lt;/strong&gt; The in-depth analysis features enable developers to gain deeper insights into the data and identify areas for further improvement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, Optimize empowers developers to improve their code, improve user experiences, and drive better business outcomes. For a visual exploration of these capabilities, don’t forget to check out our &lt;a href="https://www.youtube.com/watch?v=yQ2QThYdnvA" rel="noopener noreferrer"&gt;video overview&lt;/a&gt;. Feel free to engage in discussions and share your experiences in the Optimize forum category—it’s a great platform for further insights and community collaboration. For additional resources and detailed guides, explore the following links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://academy.camunda.com/" rel="noopener noreferrer"&gt;Camunda Academy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.camunda.io/optimize/components/what-is-optimize/?&amp;amp;utm_medium=social&amp;amp;utm_source=youtube&amp;amp;utm_content=video&amp;amp;utm_term=devrel-sam-optimize-devs" rel="noopener noreferrer"&gt;Optimize Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Presentation: &lt;a href="https://page.camunda.com/camundacon-2023-optimize-re-introduced" rel="noopener noreferrer"&gt;Optimize Re-Introduced – Why starting with Optimize was never that easy —  3 Use Cases you have to know&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Video: &lt;a href="https://www.youtube.com/watch?v=yQ2QThYdnvA" rel="noopener noreferrer"&gt;How Developers Can Make the Most of Camunda Optimize&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The post &lt;a href="https://camunda.com/blog/2024/02/insights-to-action-camunda-optimize-effective-development/" rel="noopener noreferrer"&gt;From Insights to Action: Harnessing Camunda Optimize for Effective Development&lt;/a&gt; appeared first on &lt;a href="https://camunda.com" rel="noopener noreferrer"&gt;Camunda&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>gettingstarted</category>
      <category>camunda</category>
      <category>processorchestration</category>
      <category>processautomation</category>
    </item>
    <item>
      <title>Pro-code, Low-code, and the Role of Camunda</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Mon, 11 Dec 2023 14:15:33 +0000</pubDate>
      <link>https://forem.com/camunda/pro-code-low-code-and-the-role-of-camunda-2k2j</link>
      <guid>https://forem.com/camunda/pro-code-low-code-and-the-role-of-camunda-2k2j</guid>
      <description>&lt;h4&gt;Pro-code is our heart and soul, but people and processes are diverse. Our optional low-code features support more use cases without getting in the way of pro-code developers.&lt;/h4&gt;

&lt;p&gt;Developers regularly ask me about Camunda’s product strategy. Especially around the Camunda 8 launch, they raised concerns that we “forgot our roots” or “abandoned our developer-friendliness” — the exact attributes that developers love us for. They presume that we “jumped on the low-code train” instead, because we now have funding and need to “chase the big dollars.” As a developer at heart myself, I can tell you that nothing is further from the truth, so let me explain our strategy in this post.&lt;/p&gt;

&lt;p&gt;Here is the &lt;strong&gt;TL;DR&lt;/strong&gt;: We will stay 100% developer-friendly, and pro-code is our heart and soul (or bread and butter if you prefer). But the people who create process solutions are diverse, as are the processes that need to be automated. So for some use cases low-code does make sense, and it is great to be able to support those cases. But low-code features in Camunda are optional and do not get in the way of pro-code developers.&lt;/p&gt;

&lt;p&gt;For example, your worker code can become a reusable &lt;a href="https://docs.camunda.io/docs/components/connectors/introduction-to-connectors/"&gt;&lt;strong&gt;Connector&lt;/strong&gt;&lt;/a&gt; (or be replaced by an out-of-the-box one) that is configured in the BPMN model using element templates. But you don’t have to use that and can just stay in your development environment to code your way forward. This flexibility allows you to use Camunda for a wide variety of use cases, which prevents business departments from being forced into shaky low-code solutions just because IT lacks resources.&lt;/p&gt;

&lt;p&gt;But step by step…&lt;/p&gt;

&lt;h3&gt;Camunda 8 loves developers&lt;/h3&gt;

&lt;p&gt;First of all, Camunda 8 focuses on the developer experience in the same way as, or even more strongly than, former Camunda versions. The whole point of providing Camunda as a product was to break out of unwieldy, huge BPM or low-code suites that are simply impossible to use in professional software engineering projects (see &lt;a href="https://blog.bernd-ruecker.com/camunda-closes-100-million-series-b-funding-round-to-automate-any-process-anywhere-c82013bdaacf#8635"&gt;&lt;strong&gt;the Camunda story here&lt;/strong&gt;&lt;/a&gt; for example). This hasn’t changed. The heart of Camunda is about bringing process orchestration into the professional software developer’s toolbelt.&lt;/p&gt;

&lt;p&gt;Especially with Camunda 8, we put a lot of focus on providing an excellent developer experience and a great programming model. And we now also extend that beyond the Java ecosystem. We might still have to do some homework here and there (for example, getting the Spring integration to a supported product component in 2024), but it is very close to what we always had. Let me give you some short examples (you can find &lt;a href="https://github.com/berndruecker/customer-onboarding-camunda-8-springboot"&gt;&lt;strong&gt;working code on GitHub&lt;/strong&gt;&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Writing worker code (aka Java Delegates):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/60c54cb9c45dec8d12b937ffa07a7b65/href"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/60c54cb9c45dec8d12b937ffa07a7b65/href"&gt;https://medium.com/media/60c54cb9c45dec8d12b937ffa07a7b65/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the Spring Boot Starter as Maven dependency:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/8b6f8d2cae1868bd4ae7778f3603127e/href"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/8b6f8d2cae1868bd4ae7778f3603127e/href"&gt;https://medium.com/media/8b6f8d2cae1868bd4ae7778f3603127e/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Writing a JUnit test case (with an in-memory engine):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/c1e5a8e01021edfd2b0e9c9585576971/href"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/c1e5a8e01021edfd2b0e9c9585576971/href"&gt;https://medium.com/media/c1e5a8e01021edfd2b0e9c9585576971/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The only real change from Camunda version 7 to 8 is that the orchestration engine (or workflow engine if you prefer that term) runs as a separate Java process. So the above Spring Boot Starter actually starts a client that connects to the engine, not the whole engine itself. I wrote about why this is a huge advantage in &lt;a href="https://dev.to/camunda/moving-from-embedded-to-remote-workflow-engines-5f9h"&gt;&lt;strong&gt;moving from embedded to remote workflow engines&lt;/strong&gt;&lt;/a&gt;. Summarized, it is about isolating your code from the engine’s code and simplifying your overall solution project (think about optimizing the engine configuration or resolving third-party dependency version incompatibilities).&lt;/p&gt;

&lt;p&gt;The adjusted architecture without a relational database allows us to continuously look at scalability and performance and make big leaps with Camunda 8, enabling use cases we could not tackle with Camunda 7 (e.g. many thousands of process instances per second, geo-redundant active/active data centers, etc.).&lt;/p&gt;

&lt;p&gt;A common misconception is that you have to use our cloud/SaaS offering, but this is not true. You can &lt;a href="https://docs.camunda.io/docs/next/self-managed/about-self-managed/"&gt;&lt;strong&gt;run the engine self-managed as well&lt;/strong&gt;&lt;/a&gt; and there are different options to do that. The SaaS offering is an additional possibility you can leverage, freeing you from thinking about how to run and operate Camunda, but it is up to you if you want to make use of it.&lt;/p&gt;

&lt;p&gt;This is a general recurring theme in Camunda 8: We added more possibilities you can leverage to make your own life easier — but we do not force anyone to use them.&lt;/p&gt;

&lt;p&gt;The prime example of new possibilities are our low-code accelerators (e.g. &lt;a href="https://camunda.com/platform/modeler/connectors/"&gt;&lt;strong&gt;Connectors&lt;/strong&gt;&lt;/a&gt;). Let’s quickly dive into why we do low-code next before touching on how Connectors can help more concretely.&lt;/p&gt;

&lt;h3&gt;Existing customers adopt Camunda for many use cases&lt;/h3&gt;

&lt;p&gt;We learned from our customers that they want to use Camunda for a wide variety of use cases. Many of the use cases are core end-to-end business processes, like customer onboarding, order fulfillment, claim settlement, payment processing, trading, or the like.&lt;/p&gt;

&lt;p&gt;But customers also need to automate simpler processes. Those processes are less complex, less critical, and typically less valuable, but they are still there, and automating them has a return on investment or is simply necessary to fulfill customer expectations. Good examples are master data changes (e.g. address or bank account data), bank transfer limits, annual mileage reports for insurers, delay compensation, and so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ud1dBwf8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/922/0%2AXymXq_jGVngKBkI0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ud1dBwf8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/922/0%2AXymXq_jGVngKBkI0.png" alt="" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the past, organizations often did not consider using Camunda for those processes, as they could not set up and staff software development projects for simpler, less critical processes.&lt;/p&gt;

&lt;p&gt;And the non-functional requirements for those simpler process automation solutions differ. While the super-critical, highly complex use cases are always implemented with the help of the IT team, to make sure the quality meets the expectations for this kind of solution and everything runs smoothly, the use cases on the lower end of that spectrum don’t have to comply with the same requirements. If they are down, it might not be the end of the world. If they get hacked, it might not be headline news. If there are weird bugs, it might just be annoying. So it is probably OK to apply a different approach to create solutions for these less critical processes.&lt;/p&gt;

&lt;h3&gt;Categorizing use cases&lt;/h3&gt;

&lt;p&gt;The important thing is to make a conscious choice and not apply the wrong approach for the process at hand. What we have seen working successfully is to categorize use cases and place them into three buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Red:&lt;/strong&gt; Processes are mission critical for the organization. They are also complex to automate and probably need to operate at scale. Performance and information security can be very relevant, and regulatory requirements might need to be fulfilled. Often we talk about core end-to-end business processes here, but sometimes other processes might be that critical as well. For these use cases you need to do professional software engineering using industry best practices like version control, automated testing, continuous integration, and continuous delivery. The organization wants to apply some governance, for example around which tools can be used and what best practices need to be applied.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Yellow:&lt;/strong&gt; Processes are less critical, but the organization’s operations would still be seriously affected if there are problems. So you need to apply a healthy level of governance, but accept that solutions are not created with the same quality as for red use cases, mostly because you simply have a shortage of software developers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Green:&lt;/strong&gt; Simple automations, often very local to one business unit or even an individual. These are often quick fixes stitched together to make one’s life a bit easier, but the overall organization might not even notice if they break apart. For those uncritical use cases, the organization can afford to leave a lot of freedom to people, so typically there is no governance or quality assurance applied.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While the red use cases are traditionally done with Camunda, and the green use cases are traditionally done with Office-like tooling or low-code solutions (like Airtable or Zapier), the yellow bucket gets interesting. This is a long tail of processes that all need to be automated with a fair level of governance, quality assurance, and information security.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QYEOD3qL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Abo2COBvqC4x5fz9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QYEOD3qL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Abo2COBvqC4x5fz9h.png" alt="" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We already know organizations using Camunda for those yellow use cases. In order to do this and to ease solution creation, they developed low-code tooling on top of Camunda. A prime example is &lt;a href="https://camunda.com/blog/2018/07/camunda-days-nyc-goldman-sachs-workflow-platform/"&gt;&lt;strong&gt;Goldman Sachs, who built a quite extensive platform based on Camunda 7&lt;/strong&gt;&lt;/a&gt; (side note: they also &lt;a href="https://camunda.com/blog/2022/03/why-goldman-sachs-built-a-brand-new-platform/"&gt;&lt;strong&gt;talk about a differentiation between core banking use cases and the long tail of simpler processes across the firm in later presentations&lt;/strong&gt;&lt;/a&gt;). Speaking to those customers, we found a recurring theme and used this feedback to design product extensions that those organizations could have used off the shelf (had they existed when they started). And we designed this solution to not get in the way of professional software developers when implementing red use cases around critical core processes.&lt;/p&gt;

&lt;p&gt;I am not going into too much detail around all of those low-code accelerators in this post, but it is mostly about &lt;a href="https://camunda.com/platform/modeler/connectors/"&gt;&lt;strong&gt;Connectors&lt;/strong&gt;&lt;/a&gt;, rich forms, data handling, the out-of-the-box experience of tools like Tasklist, and browser-based tooling.&lt;/p&gt;

&lt;p&gt;For me it is important to re-emphasize the pattern mentioned earlier: Those accelerators are an offer — you don’t have to use them. And if you look deeper, those accelerators are not mystic black boxes. A Connector, for example, is “just” a reusable &lt;a href="https://docs.camunda.io/docs/components/concepts/job-workers/"&gt;&lt;strong&gt;job worker&lt;/strong&gt;&lt;/a&gt; with a focused properties panel (if you are interested in code, check out any of our &lt;a href="https://github.com/camunda/connectors/tree/main/connectors"&gt;&lt;strong&gt;existing out-of-the-box Connectors&lt;/strong&gt;&lt;/a&gt;), whereas the property panel can even be &lt;a href="https://github.com/camunda/connectors/blob/main/connectors/http/rest/pom.xml#L86"&gt;&lt;strong&gt;generated from Java code&lt;/strong&gt;&lt;/a&gt;. &lt;a href="https://marketplace.camunda.com/"&gt;&lt;strong&gt;Camunda Marketplace&lt;/strong&gt;&lt;/a&gt; helps you to make this reusable piece of functionality discoverable. Existing Connectors are available in their source and can be extended if needed.&lt;/p&gt;
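&lt;p&gt;To make the “just a reusable job worker” point tangible, a minimal custom outbound Connector looks roughly like this, assuming the Connector SDK. The name and type values are illustrative:&lt;/p&gt;

```java
// Hypothetical sketch using the Camunda Connector SDK; verify annotation
// and interface details against the SDK version you use.
import io.camunda.connector.api.annotation.OutboundConnector;
import io.camunda.connector.api.outbound.OutboundConnectorContext;
import io.camunda.connector.api.outbound.OutboundConnectorFunction;

@OutboundConnector(name = "My Connector", type = "io.example:my-connector:1")
public class MyConnectorFunction implements OutboundConnectorFunction {

  @Override
  public Object execute(OutboundConnectorContext context) {
    // Read the input the user configured in the properties panel,
    // call the external system, and return the result to the process.
    return java.util.Map.of("status", "done");
  }
}
```

&lt;p&gt;The element template for the properties panel then just describes the inputs this class expects.&lt;/p&gt;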

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--atMI6_-O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AwegtI1zLFS3BjXBK.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--atMI6_-O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AwegtI1zLFS3BjXBK.png" alt="" width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Democratization and acceleration by Connectors&lt;/h3&gt;

&lt;p&gt;There are two main motivations to use Connectors.&lt;/p&gt;

&lt;p&gt;Software developers might simply become more productive by using them, and this is what we call &lt;strong&gt;acceleration&lt;/strong&gt;. For example, it might simply be quicker to use a Twilio Connector instead of figuring out the REST API for sending an SMS and how it is best called from Java. As mentioned, if this is not true for you, e.g. because you have an internal library that hides the complexity of using Twilio, this is great; just keep using that. Also, when you want to write more JUnit tests, it might be simpler to write integration code in Java yourself. This is fine! You are not forced to use Connectors; they are an offer, and if they make your life easier, use them.&lt;/p&gt;

&lt;p&gt;The other more important advantage is that it allows a more diverse set of people to take part in solution creation, which is referred to as &lt;strong&gt;democratization&lt;/strong&gt;. So for example, a tech-savvy business person could probably stitch together a simpler process using Connectors, even if they cannot write any programming code. Remember, we are talking about the long tail of simpler processes (yellow) here.&lt;/p&gt;

&lt;p&gt;A powerful pattern then is that software developers &lt;strong&gt;enable&lt;/strong&gt; other roles within the organization. One way of doing this can be to have a &lt;a href="https://dev.to/mary_grace/how-do-you-create-and-grow-a-center-of-excellence-4719-temp-slug-4306740"&gt;&lt;strong&gt;Center of Excellence&lt;/strong&gt;&lt;/a&gt; where custom Connectors are built specifically shaped around the needs of the organization. And those Connectors are then used by other roles to stitch together the processes. One big advantage is that your IT team has control over how Connectors are built and used, allowing them to enforce important governance rules, e.g. around information security or secret handling (something which is a huge problem with typical low code solutions).&lt;/p&gt;

&lt;p&gt;You could also mix different roles in one team creating a solution: the developer focuses on the technical problems to set up Connectors properly, while more business-minded people concentrate on the process model. And of course there are many nuances in the middle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--djVADiWA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AR97S7d25pSpbGKd_.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--djVADiWA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AR97S7d25pSpbGKd_.png" alt="" width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is comparable to a situation we know from software vendors embedding Camunda into their software for customization. Their software product then typically comes with a default process model, and consultants can customize the processes to end-customer needs within certain limits the software team built in.&lt;/p&gt;

&lt;h3&gt;Avoiding the danger zone when doing vendor rationalization and tool harmonization&lt;/h3&gt;

&lt;p&gt;Many organizations currently try to reduce the number of vendors and tools they are using. This is understandable on many levels, but it is very risky if the different non-functional requirements of green, yellow, and red processes are ignored.&lt;/p&gt;

&lt;p&gt;For example, procurement departments might not want to have multiple process automation tools. But for them, the difference between Camunda and a low-code vendor is not very tangible as they both automate processes.&lt;/p&gt;

&lt;p&gt;For red use cases, customers can still easily argue why they cannot use a low-code tool, because those tools simply don’t fit into professional software development approaches. But for yellow use cases, this gets much harder to argue. This can lead to a situation where low-code tools made for green use cases are applied to yellow ones. That might work for very simple yellow processes, but it can easily become risky if processes get too complex, or simply if requirements around stability, resilience, maintainability, scalability, or information security rise over time. This is why I consider this a big danger zone for companies to be in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--siEBx48h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A5rwLp4uJ6SWGvHXe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--siEBx48h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A5rwLp4uJ6SWGvHXe.png" alt="" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Camunda’s low-code acceleration features allow you to use Camunda in more yellow use cases, as you don’t have to involve software developers for everything. But if non-functional requirements rise, you can always fulfill them with Camunda, as it is built for red use cases as well. As an example, you could start adding automated tests whenever the solution becomes too shaky. Or you could scale operations if you face unexpectedly high demand (think of flight cancellations around the Covid pandemic: this was a yellow use case for airlines, but processing them efficiently became highly important basically overnight).&lt;/p&gt;

&lt;p&gt;To summarize: it’s better to target yellow use cases with a pro-code solution like Camunda, whose added low-code acceleration layers you can use but don’t have to. This prevents risky situations with low-code solutions that cannot cope with rising non-functional requirements.&lt;/p&gt;

&lt;p&gt;And to link back to our product strategy: with Camunda 8 we worked hard to enable even “redder” use cases (through improved performance, scalability, and resilience) and, at the same time, more yellow use cases. So you can go further left (red) and further right (yellow) simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;In today’s post I re-emphasized that Camunda is and will remain developer-friendly. Pro-code (red) use cases are our bread-and-butter business, and honestly, those are super exciting use cases where we can play to our strengths. This is strategically highly relevant, even if you might see a lot of marketing messaging around low-code acceleration at the moment.&lt;/p&gt;

&lt;p&gt;Those low-code accelerators also allow building less complex solutions (yellow), where other roles typically take part in solution creation (democratization, acceleration, and enablement). This helps you reduce the risk that using the wrong tool for a yellow use case ends up in the headline news.&lt;/p&gt;

&lt;p&gt;You can read more about our &lt;a href="https://dev.to/mary_grace/camundas-vision-for-low-code-59c1-temp-slug-270223"&gt;&lt;strong&gt;vision for low-code here&lt;/strong&gt;&lt;/a&gt;, or if you’re curious about how our Connectors work, feel free to check out &lt;a href="https://docs.camunda.io/docs/components/connectors/introduction-to-connectors/"&gt;&lt;strong&gt;our docs to learn more&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>processautomation</category>
      <category>camunda</category>
      <category>workflow</category>
      <category>lowcode</category>
    </item>
    <item>
      <title>Prioritizing Our DevRel Backlog with Form Builder, DMN, and a Process Model</title>
      <dc:creator>Mary Thengvall</dc:creator>
      <pubDate>Mon, 22 May 2023 19:32:16 +0000</pubDate>
      <link>https://forem.com/camunda/prioritizing-our-devrel-backlog-with-form-builder-dmn-and-a-process-model-42oj</link>
      <guid>https://forem.com/camunda/prioritizing-our-devrel-backlog-with-form-builder-dmn-and-a-process-model-42oj</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fite8djojb17cgel4x9lb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fite8djojb17cgel4x9lb.png" alt="A process model that shows the beginning of a prioritization table" width="800" height="418"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Developer Relations teams typically have a wide variety of skills to meet community members where they are and help them get where they need to be. These skills can range from programming, creating content, and giving technical presentations, to mentoring community members on the best ways to use the product, running community programs, and communicating feedback to internal teams on ways to improve the product experience. But when the backlog grows faster than tasks can be completed, how do we prioritize the work that’s in front of us, ensuring we’re working on the most impactful items?&lt;/p&gt;

&lt;p&gt;I recently faced this predicament on my own team here at Camunda. I needed to not only help my team decide which projects to prioritize but also help our coworkers understand how (and why) we respond to new requests.&lt;/p&gt;

&lt;p&gt;While I could have written and published an internal strategy doc that listed our reasoning, information overload sets in quickly, and I’d rather not ask Camunda employees (Camundi) to read yet another page in our internal wiki every time they have a request. Instead, I turned to DMN and Camunda Platform 8’s new features to build a process model with an integrated form. While it’s not yet perfect, it’s already benefiting my team! How did I get here?&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Identify the criteria for prioritization
&lt;/h2&gt;

&lt;p&gt;I chose to start with a &lt;a href="https://docs.camunda.io/docs/guides/create-decision-tables-using-dmn/"&gt;DMN table&lt;/a&gt;, which would be the foundation of the model. Before I could build this table, however, I had to identify the criteria we would use to prioritize the DevRel team’s work. Many DevRel teams prioritize tasks based on impact on the developer community, alignment with the company’s goals, or level of effort.&lt;/p&gt;

&lt;p&gt;In our case, the assumption is that the work being submitted to this form is independent of the projects we’ve already taken on as a team for the quarter. This allowed me to keep our criteria simple: alignment with the team goals and the urgency of the task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QZUbtyHM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/devrel-priorities-dmn-camunda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QZUbtyHM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/devrel-priorities-dmn-camunda.png" alt="A DMN table showing how Timing and 2023 Goals feed into our DevRel Priorities" width="800" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Create a decision table
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0_t0H_eF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/decision-table-camunda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0_t0H_eF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/decision-table-camunda.png" alt="Decision-table-camunda" width="398" height="752"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With these criteria defined, it’s time to create the DMN table. These tables map the input values (timing and goals) to the output value (priority) based on a set of rules. A nice perk of building the prioritization rules into this table is that as our criteria change over time, I can update the table and the new rules will take effect immediately. Because the table is versioned, I can revert to a previous iteration at any time if necessary.&lt;/p&gt;

&lt;p&gt;The first thing I had to define was the input data that the decision table uses to process requests. In my case, I used the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; 2023 Goals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expression:&lt;/strong&gt; goals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type:&lt;/strong&gt; string&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predefined Values:&lt;/strong&gt; “healthy C8 community” “successful C8 community” “C8 community onboarding” “other”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: While filling in predefined values is optional, I found it helpful when populating the decision rules.&lt;/p&gt;

&lt;p&gt;I set the second input (Timing) with fairly general ranges: this week, month, or quarter, and next week, month, or quarter, as well as no specific timeframe.&lt;/p&gt;

&lt;p&gt;I kept the output value very straightforward: yes or no, with a possible exception that could be raised to me if necessary. Lastly, I used the &lt;a href="https://docs.camunda.io/docs/components/best-practices/modeling/choosing-the-dmn-hit-policy/#single-result-decision-tables"&gt;hit policy “first”&lt;/a&gt;, which evaluates the rules from top to bottom and stops when a match is found.&lt;/p&gt;

&lt;p&gt;The end result is a decision matrix that allows us to easily filter company requests that meet these specific goals and aren’t urgent (e.g. can be completed this month, next month, or next quarter). Anything that falls into the “this week” or “next week” timing is likely going to be a no, unless it’s a very high-priority task that also aligns with our goals; in that case, the request is flagged as a possible exception in need of review. The outcome is a fairly straightforward model that outlines when we can prioritize requests and when we’ll need to either reconsider them at a later time or simply say no.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JW9NwAkM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/decision-matrix-camunda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JW9NwAkM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/decision-matrix-camunda.png" alt="A decision matrix that shows our goals, timing possibilities, and whether we can help." width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;
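&lt;p&gt;To make the “first” hit policy concrete, here is a minimal Python sketch (not Camunda’s implementation, and the rule values are illustrative stand-ins for the table above): rules are checked top to bottom and the first match wins.&lt;/p&gt;

```python
# Toy model of a DMN decision table using the "first" hit policy.
# Each rule: (goals predicate, timing predicate, output). Values are illustrative.
URGENT = ("this week", "next week")
RULES = [
    (lambda g: g == "other", lambda t: True, "no"),
    (lambda g: g != "other", lambda t: t in URGENT, "possible exception"),
    (lambda g: g != "other", lambda t: True, "yes"),
]

def ability_to_help(goals, timing):
    for goal_ok, timing_ok, output in RULES:
        if goal_ok(goals) and timing_ok(timing):
            return output  # "first" hit policy: stop at the first matching rule
    return "no"           # default if no rule matches
```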

&lt;h2&gt;
  
  
  Step 3: Create a form to populate the decision table
&lt;/h2&gt;

&lt;p&gt;I repurposed a simple Google form we’ve used for years, using the &lt;a href="https://docs.camunda.io/docs/guides/utilizing-forms/"&gt;Camunda Form Builder&lt;/a&gt; so I could integrate it with my decision table.&lt;/p&gt;

&lt;p&gt;Once the form was created, I made sure the &lt;em&gt;key&lt;/em&gt; for the questions around the goals and the timing matched up with the &lt;em&gt;expression&lt;/em&gt; in my DMN table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9z2qdKw9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/form-builder-camunda-combined-1-1024x839.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9z2qdKw9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/form-builder-camunda-combined-1-1024x839.jpg" alt="" width="800" height="655"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Integrate the form and decision table into a process model
&lt;/h2&gt;

&lt;h3&gt;
  
  
  An idea is submitted &amp;amp; evaluated
&lt;/h3&gt;

&lt;p&gt;The next step was to create a process model using &lt;a href="https://docs.camunda.io/docs/components/modeler/web-modeler/launch-cloud-modeler/"&gt;Web Modeler&lt;/a&gt;. This model represents the process of prioritizing tasks, including collecting the criteria via the form, applying the decision table to determine the priority of the task, and communicating the decision to the appropriate Camundi. Let’s take a look at the current model:&lt;/p&gt;

&lt;p&gt;The first section of the model includes the completion of the form, the DMN table which helps us prioritize the task, and an automated Slack message (using our &lt;a href="https://docs.camunda.io/docs/components/connectors/out-of-the-box-connectors/slack/"&gt;Slack Connector&lt;/a&gt;) that notifies the DevRel team a new request has been submitted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wd5x3Oc_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/devrel-prioritization-process-model-camunda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wd5x3Oc_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/devrel-prioritization-process-model-camunda.png" alt="Devrel-prioritization-process-model-camunda" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I connected the form to the model by copying the JSON from the code editor in the form builder and pasting it into the properties panel of the user task “Complete DevRel Request Form.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3l9RLYug--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/form-builder-task-json.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3l9RLYug--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/form-builder-task-json.png" alt="Form-builder-task-json" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1vYKBHQB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/user-task-json.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1vYKBHQB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/user-task-json.png" alt="User-task-json" width="719" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I then added a Business Rule task and connected it to the DMN table I created by associating the following fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I copied the ID for the &lt;em&gt;Decision&lt;/em&gt; from the Decision Requirements Diagram (DRD) view and pasted it in the &lt;em&gt;Called Decision: Decision ID&lt;/em&gt; field for the Business Rule task.&lt;/li&gt;
&lt;li&gt;In the &lt;em&gt;Result Variable&lt;/em&gt; field, I pasted the output name from my DMN table (&lt;em&gt;abilityToHelp&lt;/em&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It was important to me to minimize blockers (including myself) for these requests and better enable my team to take action, so I wanted to make sure the entire team would be alerted whenever someone filled out the form. Using the &lt;a href="https://docs.camunda.io/docs/components/connectors/out-of-the-box-connectors/slack/"&gt;Slack Connector&lt;/a&gt;, I set up an alert to go directly to our team channel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HduWt_x0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/slack-connector-camunda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HduWt_x0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/slack-connector-camunda.png" alt="Slack-connector-camunda" width="708" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next iteration will include a link to the specific request, in addition to the results of the DMN table, so it will be easy to see at a glance whether additional insight is needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The decision is validated
&lt;/h3&gt;

&lt;p&gt;The next step in our process is to validate the decisions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--srhcSS0e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/validate-decision.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--srhcSS0e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/validate-decision.png" alt="Validate-decision" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are two decision gateways here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Do you need help making a decision?&lt;/strong&gt; The DMN table automatically determines this step, moving requests directly to &lt;em&gt;Look at automated decision&lt;/em&gt; if the answer is a clear &lt;em&gt;yes&lt;/em&gt; or &lt;em&gt;no.&lt;/em&gt; If the answer from the DMN table is &lt;em&gt;abilityToHelp = “possible exception,”&lt;/em&gt; the request is flagged and goes down the &lt;em&gt;Escalate to Manager&lt;/em&gt; route.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do you agree with the decision?&lt;/strong&gt; When I was first designing this process, our Community Manager Maria Alcantara made the excellent observation that there may be times when users disagree with the automated decision. If this is the case, requests should be escalated to the manager.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Right now, looking at the automated decision as well as making the decision are user tasks that have to be managed within &lt;a href="https://docs.camunda.io/docs/components/tasklist/introduction-to-tasklist/"&gt;Tasklist&lt;/a&gt;. If users need to escalate to manager in either case, they can type in a variable &lt;em&gt;agree = false&lt;/em&gt; for the first question and &lt;em&gt;doThing = false&lt;/em&gt; for the second. In the next iteration of this model, I’d like to have a Slack integration that allows us to say yes or no to both of these questions in order to move forward seamlessly.&lt;/p&gt;

&lt;p&gt;The end result of all these decisions is that we have a clear path forward: we’re either going to tackle this project or not.&lt;/p&gt;

&lt;h3&gt;
  
  
  The outcome is communicated
&lt;/h3&gt;

&lt;p&gt;Feedback loops are important to us at Camunda, so I wanted to make sure no matter what the decision was, there was a follow-up with the person who requested help.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hIu0vE03--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/communicate-outcome.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hIu0vE03--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2023/05/communicate-outcome.png" alt="Communicate-outcome" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rather than listing every Camundi in the Requester DMN table, I chose to include variants of our team members’ names: first name and first + last name, in lowercase as well as camelcase, since DMN tables are case-sensitive. The output is &lt;em&gt;groupCamundi&lt;/em&gt; with either the value of &lt;em&gt;devrelTeam&lt;/em&gt; or &lt;em&gt;colleague&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;These final tasks are all user tasks, but again, there are opportunities to include automation here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a Trello card in our task board based on the form.&lt;/li&gt;
&lt;li&gt;Send a refusal message to the individual who submitted the request.&lt;/li&gt;
&lt;li&gt;Notify the requester when the task is moved to the &lt;code&gt;done&lt;/code&gt; column in our task board.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We could likely even turn the form into a Slack bot that then pings the appropriate team member. In short, there are all sorts of possible iterations here, which we’ll definitely explore as we roll this out company-wide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Be more productive
&lt;/h2&gt;

&lt;p&gt;I have high hopes that this process model will continue to help us prioritize our work more effectively, ensuring we focus our efforts on the tasks that will have the greatest impact on our community members as well as Camunda company goals. Additionally, by streamlining the prioritization process, we are able to complete tasks more quickly and efficiently, improving our overall productivity.&lt;/p&gt;

&lt;p&gt;While this prioritization of tasks might seem like a relatively small and perhaps insignificant issue compared to the other items on our plate, this DMN table, form, and process model will serve as the foundation for future team endeavors and resource planning. Here’s to solving “molehills” before they turn into mountains!&lt;/p&gt;

&lt;p&gt;What have you created with Camunda lately? Let us know over &lt;a href="https://forum.camunda.io/c/contributions/14"&gt;in our forum&lt;/a&gt;. I’d love to hear how process models have made your day-to-day work easier!&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://camunda.com/blog/2023/05/prioritizing-devrel-backlog-form-builder-dmn-process-model/"&gt;Prioritizing Our DevRel Backlog with Form Builder, DMN, and a Process Model&lt;/a&gt; appeared first on &lt;a href="https://camunda.com"&gt;Camunda&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devrel</category>
      <category>camunda</category>
      <category>bpmn</category>
      <category>automation</category>
    </item>
    <item>
      <title>Zeebe, or How I learned To Stop Worrying And Love Batching</title>
      <dc:creator>Christopher Kujawa</dc:creator>
      <pubDate>Sat, 04 Mar 2023 09:44:55 +0000</pubDate>
      <link>https://forem.com/camunda/zeebe-or-how-i-learned-to-stop-worrying-and-love-batching-4l8p</link>
      <guid>https://forem.com/camunda/zeebe-or-how-i-learned-to-stop-worrying-and-love-batching-4l8p</guid>
      <description>&lt;h3&gt;
  
  
  Zeebe, or How I learned To Stop Worrying And Love Batch Processing
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Hi, I’m Chris, Senior Software Engineer at Camunda. I have worked at Camunda for around seven years, almost six of them on the Zeebe project, and was recently part of a hackday effort to improve Zeebe’s process execution latency.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the past, we have heard several reports from users describing that the process execution latency of &lt;a href="https://camunda.com/platform/zeebe/" rel="noopener noreferrer"&gt;Zeebe&lt;/a&gt;, our cloud-native workflow and decision engine for &lt;a href="https://camunda.com/platform/" rel="noopener noreferrer"&gt;Camunda Platform 8&lt;/a&gt;, is sometimes sub-optimal. Some reports said that the latency between certain tasks in a process model is too high; others that the overall process instance execution latency is too high. This can, of course, also be heavily affected by the hardware used and by configurations that are wrong for a given use case, but we also know we have something to improve.&lt;/p&gt;

&lt;p&gt;At the beginning of this year, after almost three years of COVID-19, we finally sat together in a meeting room with whiteboards to improve the situation for our users. We called these our performance hackdays. It was a nice, interesting, and fruitful experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basics
&lt;/h3&gt;

&lt;p&gt;To dive deeper into what we tried and why, we first need to elaborate on what process instance execution latency means, and what influences it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F903%2F0%2AWelxptAuI7RWHgMw" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F903%2F0%2AWelxptAuI7RWHgMw" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image above is a &lt;a href="https://docs.camunda.io/docs/next/components/concepts/processes/" rel="noopener noreferrer"&gt;process model&lt;/a&gt;, from which we can create an instance. The execution of such an instance will go from the start to the end event; this is the process execution latency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Since Zeebe is a complex distributed system, where the process engine is based on a distributed streaming platform, there are several influencing factors for the process execution latency. During our performance hackdays, we tried to sum up all potential factors and find several bottlenecks which we can improve. In the following post, I will try to summarize this on a high level and mention them shortly.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Stream processing
&lt;/h4&gt;

&lt;p&gt;To execute such a process model, as we have seen above, Zeebe uses a concept called &lt;a href="https://docs.camunda.io/docs/next/components/zeebe/technical-concepts/internal-processing/#stateful-stream-processing" rel="noopener noreferrer"&gt;stream processing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Each element in the process &lt;a href="https://docs.camunda.io/docs/next/components/zeebe/technical-concepts/internal-processing/#state-machines" rel="noopener noreferrer"&gt;has a specific lifecycle&lt;/a&gt;, which is divided into the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F402%2F0%2AKAV3RA8qCxr3lLtK" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F402%2F0%2AKAV3RA8qCxr3lLtK" alt="BPMN Elements Lifecycle" width="402" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;BPMN Elements Lifecycle divided into Command/Events&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;A command requests a state change for a certain element, and an event confirms that state change. Termination can happen when elements are canceled, either internally by events or externally by users.&lt;/p&gt;

&lt;p&gt;Commands drive the execution of a process instance. When Zeebe’s stream processor processes a command, state changes are applied (e.g. process instances are modified). Such modifications are confirmed via follow-up events. To split the execution into smaller pieces, not only are follow-up events produced, but also follow-up commands. All of these follow-up records are persisted. Later, the follow-up commands are further processed by the stream processor to continue the instance execution. The idea behind that is that these small chunks of processing should help to achieve high concurrency by alternating execution of different instances on the same partition.&lt;/p&gt;
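&lt;p&gt;The loop described above can be sketched as a toy model in Python (illustrative only, not Zeebe’s actual code): processing one command appends a confirming follow-up event and, if the instance is not finished, a follow-up command, so records from different instances interleave on the same partition.&lt;/p&gt;

```python
from collections import deque

NEXT = {"start": "task", "task": "end", "end": None}  # toy process model

def run(stream, log):
    # stream holds commands; log stands in for the persisted follow-up events
    while stream:
        instance, element = stream.popleft()          # command: activate this element
        log.append(("ACTIVATED", instance, element))  # follow-up event confirms the change
        follow_up = NEXT[element]
        if follow_up:                                 # follow-up command continues execution later
            stream.append((instance, follow_up))

log = []
run(deque([("pi-1", "start"), ("pi-2", "start")]), log)
# records of pi-1 and pi-2 alternate on the partition, instead of one
# instance running from start to end before the other begins
```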

&lt;h4&gt;
  
  
  Persistence
&lt;/h4&gt;

&lt;p&gt;Before a new command on a partition can be processed, it must be replicated to a quorum (typically majority) of nodes. This procedure is called commit. Committing ensures a record is durable, even in case of complete data loss on an individual broker. The exact semantics of &lt;a href="https://docs.camunda.io/docs/components/zeebe/technical-concepts/clustering/#commit" rel="noopener noreferrer"&gt;committing&lt;/a&gt; are defined by the &lt;a href="https://raft.github.io/" rel="noopener noreferrer"&gt;raft protocol&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AYJh5jPKwTFXkQL8b" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AYJh5jPKwTFXkQL8b" alt="CommitDocs" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Source: https://docs.camunda.io/docs/components/zeebe/technical-concepts/clustering/#commit&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;Committing such records can be affected by network latency, since the records must be sent over the wire, and by disk latency, since the records must be persisted on disk on a quorum of nodes before they can be marked as committed.&lt;/p&gt;
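&lt;p&gt;The quorum rule can be illustrated with a small sketch (illustrative only, not the actual raft implementation): a record position counts as committed once a majority of the partition’s replicas have persisted it.&lt;/p&gt;

```python
def is_committed(position, persisted_positions):
    """A position is committed once a quorum of replicas has persisted it."""
    quorum = len(persisted_positions) // 2 + 1
    acked = sum(1 for p in persisted_positions if p >= position)
    return acked >= quorum

# 3 replicas: leader and one follower at position 10, one follower lagging at 7
committed = is_committed(10, [10, 10, 7])  # True: 2 of 3 replicas reached position 10
```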

&lt;h4&gt;
  
  
  State
&lt;/h4&gt;

&lt;p&gt;Zeebe’s state is stored in &lt;a href="https://rocksdb.org/" rel="noopener noreferrer"&gt;RocksDB&lt;/a&gt;, which is a key-value store. RocksDB persists data on disk with a &lt;a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree" rel="noopener noreferrer"&gt;log-structured merge tree&lt;/a&gt; (LSM Tree) and is made for fast storage environments.&lt;/p&gt;

&lt;p&gt;The state contains information about deployed process models and current process instance executions. It is separated per partition, which means a RocksDB instance exists per partition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance hackdays
&lt;/h3&gt;

&lt;p&gt;When we started with the performance hackdays, we already had the necessary infrastructure to run benchmarks for our improvements. We made heavy use of the &lt;a href="https://github.com/camunda-community-hub/camunda-8-benchmark" rel="noopener noreferrer"&gt;Camunda Platform 8 benchmark toolkit&lt;/a&gt; maintained by &lt;a href="https://github.com/falko" rel="noopener noreferrer"&gt;Falko Menge&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Furthermore, we run weekly benchmarks (the so-called medic benchmarks) where we test for throughput, latency, and general stability. These benchmarks run for four weeks to detect potential bugs, regressions, memory leaks, performance regressions, and more as early as possible. This, all the infrastructure around it (like &lt;a href="https://github.com/camunda/zeebe/tree/main/monitor" rel="noopener noreferrer"&gt;Grafana dashboards&lt;/a&gt;), and the knowledge of how our system performs were invaluable in making such great progress during our hackdays.&lt;/p&gt;

&lt;h4&gt;
  
  
  Measurement
&lt;/h4&gt;

&lt;p&gt;We measured our results continuously; this is necessary to see whether you are on the right track. For every small proof of concept (POC), we ran a new benchmark:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AyYeZs1hXlB_CPNbf" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AyYeZs1hXlB_CPNbf" alt="Screenshot of benchmarks over the week" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Screenshot of benchmarks over the week&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;In our benchmark, we used a process based on some user requirements:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AByPWm21YnUTKM4ol" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AByPWm21YnUTKM4ol" alt="Benchmark Process" width="800" height="177"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Benchmark Process&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;Our target was a throughput of around 500 process instances per second (PI/s), with a process execution latency for one process instance under one second at the 99th percentile (p99), meaning 99% of all process instance executions should complete in under one second.&lt;/p&gt;
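&lt;p&gt;As a quick illustration of what that p99 goal means, a handful of slow runs is enough to break it even when the vast majority of instances are fast. A plain-Python nearest-rank percentile sketch (not our benchmark tooling):&lt;/p&gt;

```python
import math

def p99(latencies_ms):
    # Nearest-rank method: sort the samples and take the value at rank
    # ceil(0.99 * n), using 1-indexed ranks.
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

fast, slow = [120.0] * 98, [1500.0] * 2
result = p99(fast + slow)  # two slow runs out of 100 push p99 to 1500.0 ms
```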

&lt;p&gt;The benchmarks were executed on Google Kubernetes Engine. For each broker, we assigned one dedicated &lt;a href="https://cloud.google.com/compute/docs/general-purpose-machines" rel="noopener noreferrer"&gt;&lt;strong&gt;n2-standard-8&lt;/strong&gt;&lt;/a&gt; node to reduce the influence of other pods running on the same node.&lt;/p&gt;

&lt;p&gt;Each broker pod had the following configuration:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkat9kpcuvxj72ow4nd2e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkat9kpcuvxj72ow4nd2e.png" alt="Benchmark Config" width="264" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Benchmark configuration&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;There were also some other configurations we played around with during our different experiments, but the above were the general ones. We had eight brokers running, which gives us the following partition distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./partitionDistribution.sh 8 24 4
Distribution:
P\N| N 0| N 1| N 2| N 3| N 4| N 5| N 6| N 7
P 0| L | F | F | F | - | - | - | -  
P 1| - | L | F | F | F | - | - | -  
P 2| - | - | L | F | F | F | - | -  
P 3| - | - | - | L | F | F | F | -  
P 4| - | - | - | - | L | F | F | F  
P 5| F | - | - | - | - | L | F | F  
P 6| F | F | - | - | - | - | L | F  
P 7| F | F | F | - | - | - | - | L  
P 8| L | F | F | F | - | - | - | -  
P 9| - | L | F | F | F | - | - | -  
P 10| - | - | L | F | F | F | - | -  
P 11| - | - | - | L | F | F | F | -  
P 12| - | - | - | - | L | F | F | F  
P 13| F | - | - | - | - | L | F | F  
P 14| F | F | - | - | - | - | L | F  
P 15| F | F | F | - | - | - | - | L  
P 16| L | F | F | F | - | - | - | -  
P 17| - | L | F | F | F | - | - | -  
P 18| - | - | L | F | F | F | - | -  
P 19| - | - | - | L | F | F | F | -  
P 20| - | - | - | - | L | F | F | F  
P 21| F | - | - | - | - | L | F | F  
P 22| F | F | - | - | - | - | L | F  
P 23| F | F | F | - | - | - | - | L
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each broker node had 12 partitions assigned. We used a replication factor of four because we wanted to mimic the &lt;a href="https://camunda.com/blog/2022/06/how-to-achieve-geo-redundancy-with-zeebe/" rel="noopener noreferrer"&gt;geo redundancy&lt;/a&gt; setup of some of our users, who had certain process execution latency requirements. Geo redundancy introduces network latency into the system by default, and we wanted to reduce the influence of that network latency on the process execution latency. To make it a bit more realistic, we used &lt;a href="https://chaos-mesh.org/" rel="noopener noreferrer"&gt;Chaos Mesh&lt;/a&gt; to introduce a network latency of 35ms between two brokers, resulting in a round-trip time (RTT) of 70ms.&lt;/p&gt;
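&lt;p&gt;The round-robin scheme behind the distribution table above can be sketched in a few lines of Python (a re-derivation of what the script prints, not the script itself): each partition is led by its number modulo the node count, and the remaining replicas go to the following nodes:&lt;/p&gt;

```python
def partition_distribution(nodes: int, partitions: int, replication: int):
    """Round-robin placement: partition p is led by node p % nodes,
    with followers on the next (replication - 1) nodes."""
    rows = []
    for p in range(partitions):
        replicas = [(p + i) % nodes for i in range(replication)]
        row = ["L" if n == replicas[0] else ("F" if n in replicas else "-")
               for n in range(nodes)]
        rows.append(row)
    return rows

# The 8-node, 24-partition, replication-4 benchmark setup from above:
dist = partition_distribution(8, 24, 4)
print(" ".join(dist[0]))  # L F F F - - - -  (partition 0)
print(" ".join(dist[5]))  # F - - - - L F F  (partition 5)

# Every node ends up with 24 * 4 / 8 = 12 partitions:
per_node = [sum(row[n] != "-" for row in dist) for n in range(8)]
assert per_node == [12] * 8
```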

&lt;p&gt;To run with an evenly distributed partition leadership, we used the &lt;a href="https://docs.camunda.io/docs/next/self-managed/zeebe-deployment/operations/rebalancing/" rel="noopener noreferrer"&gt;partition rebalancing API&lt;/a&gt; that Zeebe provides.&lt;/p&gt;

&lt;h4&gt;
  
  
  Theory
&lt;/h4&gt;

&lt;p&gt;Based on the benchmark process model above, we considered the impact of commands and events on the process model (and also in general).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AH9mAOR1ftRGIQHFZ" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AH9mAOR1ftRGIQHFZ" alt="WhiteboardSession" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Whiteboard session: Drawing commands/events&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;We calculated that around 30 commands are necessary to execute the process instance from start to end.&lt;/p&gt;

&lt;p&gt;We tried to summarize what affects the processing latency and came to the following formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PEL = X * Commit Latency + Y * Processing Latency + OH
PEL - Process Execution Latency
OH - Overhead, which we haven't considered (e.g. Jobs * Job Completion Latency)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we started, &lt;strong&gt;X&lt;/strong&gt; and &lt;strong&gt;Y&lt;/strong&gt; were equal, but the idea was to change these factors, which is why we split them up. The other latencies break down as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Commit Latency = Network Latency + Append Latency
Network Latency = 2 * request duration
Append Latency = Write to Disk + Flush
Processing Latency = Processing Command (apply state changes) 
                   + Commit Transaction (RocksDB) 
                   + execute side effects
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
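&lt;p&gt;To make the formula concrete, here is a small back-of-the-envelope calculation. All numbers are illustrative assumptions (only the 35ms one-way network latency comes from the Chaos Mesh setup above), not measured values:&lt;/p&gt;

```python
# Illustrative assumption-based numbers, in seconds:
network_latency = 2 * 0.035    # 2 * request duration; 35ms one-way as in the setup above
append_latency = 0.005         # assumed: write to disk + flush
commit_latency = network_latency + append_latency

processing_latency = 0.001     # assumed: apply state + commit RocksDB tx + side effects

x = y = 30                     # ~30 commands per instance; X == Y before any optimization
pel = x * commit_latency + y * processing_latency  # overhead (OH) ignored here

print(f"estimated PEL: {pel:.2f}s")  # estimated PEL: 2.28s
```

Under these assumed numbers, the commit latency term dominates, which is why reducing either the commit latency itself or the factor **X** promises the biggest gains.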



&lt;p&gt;Below is a picture of our whiteboard session, where we discussed potential influences and which potential solutions could mitigate which factors:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfgk9w5j65ihzxkzskcf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfgk9w5j65ihzxkzskcf.png" alt="DiscussionInfluenceFactors" width="800" height="934"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;&lt;sup&gt;Whiteboard session: Discussing potential factors and influences&lt;/sup&gt;&lt;/center&gt;

&lt;h3&gt;
  
  
  Proof of concepts
&lt;/h3&gt;

&lt;p&gt;Based on the formula, it became clearer to us what might affect the process execution latency and where it might make sense to invest. For example, reducing the append latency reduces the commit latency and thus the process execution latency. Additionally, reducing the factor of how often the commit latency is incurred will strongly affect the result.&lt;/p&gt;

&lt;h4&gt;
  
  
  Append and commit latency
&lt;/h4&gt;

&lt;p&gt;Before we started with the performance hackdays, there was one configuration already present which we built &lt;a href="https://github.com/camunda/zeebe/pull/5576" rel="noopener noreferrer"&gt;more than two years ago&lt;/a&gt; and made available via an experimental feature: &lt;a href="https://github.com/camunda/zeebe/blob/8.1.0/broker/src/main/java/io/camunda/zeebe/broker/system/configuration/ExperimentalCfg.java#L26" rel="noopener noreferrer"&gt;disabling the Raft flush&lt;/a&gt;. We have seen several users apply it to reach certain performance targets, but it comes at a cost: it is not safe to use, since on fail-over certain Raft guarantees no longer apply.&lt;/p&gt;

&lt;p&gt;As part of the hackdays, we were interested in similar performance but with more safety. This is why we tried several other possibilities and compared them with disabling the flush completely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flush improvement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In one of our POCs, we tried to flush on another thread. This gave similar performance to completely disabling the flush, but it also had similar safety issues. Combining the async flush with awaiting its completion before committing brought back the old (base) performance along with the safety, so this was no solution either.&lt;/p&gt;

&lt;p&gt;Implementing a batch flush (flushing only after a configured threshold), running it in a separate thread, and waiting for its completion degraded the performance. However, we again had better safety than with disabling the flush.&lt;/p&gt;

&lt;p&gt;We thought about flushing asynchronously in batches, without the commit waiting for the flush, and making this configurable. This would allow users to trade safety against performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write improvement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We had a deeper look into system calls such as &lt;a href="https://man7.org/linux/man-pages/man2/madvise.2.html" rel="noopener noreferrer"&gt;madvise&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Zeebe stores its log in a segmented journal which is memory-mapped at runtime. The OS manages what is in memory at any time via the page cache, but it does not know the application's access patterns. The &lt;strong&gt;madvise&lt;/strong&gt; system call allows us to give the OS hints on when to read/write/evict pages.&lt;/p&gt;

&lt;p&gt;The idea was to provide hints to reduce memory churn/page faults and reduce I/O.&lt;/p&gt;

&lt;p&gt;We tested with &lt;strong&gt;MADV_SEQUENTIAL&lt;/strong&gt;, hinting that we will access the file sequentially, so a more aggressive read-ahead should be performed (while previously read pages can be dropped sooner).&lt;/p&gt;
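&lt;p&gt;The same hint can be demonstrated from user space. For example, Python’s &lt;code&gt;mmap&lt;/code&gt; module exposes &lt;code&gt;madvise&lt;/code&gt; on Linux; this is just a minimal demonstration of the system call on a throwaway file, not Zeebe’s Java implementation:&lt;/p&gt;

```python
import mmap
import os
import tempfile

# Create a small file standing in for a journal segment.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (mmap.PAGESIZE * 4))
    path = f.name

fd = os.open(path, os.O_RDONLY)
mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)  # memory-map the whole file

# MADV_SEQUENTIAL is Linux-specific: hint sequential access so the kernel
# reads ahead aggressively and can drop already-read pages sooner.
if hasattr(mmap, "MADV_SEQUENTIAL"):
    mm.madvise(mmap.MADV_SEQUENTIAL)

data = mm[:]  # a sequential read that benefits from the hint
assert len(data) == mmap.PAGESIZE * 4

mm.close()
os.close(fd)
os.unlink(path)
```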

&lt;p&gt;Based on our benchmarks, we didn’t see much difference under low/mid load. Under high load, however, read I/O was greatly reduced, and we saw slightly increased write I/O throughput due to reduced IOPS contention. Overall there was only a small improvement in throughput/latency, and surprisingly, we still observed similar page fault counts as before.&lt;/p&gt;

&lt;h4&gt;
  
  
  Reduce transaction commits
&lt;/h4&gt;

&lt;p&gt;Based on our formula above, we can see that the processing latency is affected by the RocksDB write and transaction commit duration. This means reducing one of these could benefit the processing latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State directory separation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Zeebe stores the current state (runtime) and snapshots in different folders on disk (under the same parent). When a Zeebe broker restarts, we recreate the state (runtime) from a snapshot every time. This avoids having data in the state which might not have been committed yet.&lt;/p&gt;

&lt;p&gt;This means we don’t necessarily need to keep the state (runtime) on disk, where RocksDB does a lot of I/O-heavy work that might not be necessary. The idea was to separate the state directory so that it can be mounted separately (in Kubernetes), such that we can run RocksDB in &lt;a href="https://www.kernel.org/doc/html/v5.18/filesystems/tmpfs.html" rel="noopener noreferrer"&gt;tmpfs&lt;/a&gt;, for example.&lt;/p&gt;
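&lt;p&gt;In Kubernetes, such a split could look roughly like the following pod spec fragment. This is a hypothetical sketch: the volume name and mount path are illustrative, not the actual Zeebe layout or Helm chart. A memory-backed &lt;code&gt;emptyDir&lt;/code&gt; is tmpfs, and losing it on pod restart is fine here because Zeebe rebuilds the runtime from a snapshot anyway:&lt;/p&gt;

```yaml
# Hypothetical sketch: only the runtime state directory lives on tmpfs,
# while snapshots stay on the persistent volume.
volumes:
  - name: runtime-state
    emptyDir:
      medium: Memory        # tmpfs; contents vanish on pod restart
containers:
  - name: zeebe
    volumeMounts:
      - name: runtime-state
        mountPath: /usr/local/zeebe/data/runtime   # illustrative path
```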

&lt;p&gt;Based on our benchmarks, only p30 and lower were improved with this POC:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AeXdsseoUCYkzRRWK" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AeXdsseoUCYkzRRWK" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disable WAL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RocksDB has a write-ahead log (WAL) to be crash resistant. Since we recreate the state from a snapshot every time, this is not necessary for us. We considered disabling it; we will see later in this post what influence that has. It is a &lt;a href="https://github.com/camunda/zeebe/blob/8.1.0/dist/src/main/config/broker.standalone.yaml.template#L757" rel="noopener noreferrer"&gt;single configuration&lt;/a&gt;, which is easy to change.&lt;/p&gt;

&lt;h4&gt;
  
  
  Processing of uncommitted
&lt;/h4&gt;

&lt;p&gt;We mentioned earlier that we thought about changing the factor of how many commits influence the overall calculation. What if we processed commands even if they are not committed yet, and only sent results to the user once the commit of those commands is done?&lt;/p&gt;

&lt;p&gt;We worked on a POC to implement processing of uncommitted commands, but it was a bit more complex than we thought due to the buffering of requests, etc., so we didn’t find a good solution during our hackdays. We still ran a benchmark to verify how it would behave:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AqO7KyPxNx_tDjkAc" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AqO7KyPxNx_tDjkAc" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results were quite interesting and promising, but we considered them a bit too good to be true. The production-ready implementation might behave differently, since we have to consider more edge cases.&lt;/p&gt;

&lt;h4&gt;
  
  
  Batch processing
&lt;/h4&gt;

&lt;p&gt;Part of another POC we did was something we called &lt;strong&gt;batch processing.&lt;/strong&gt; The implementation was rather easy.&lt;/p&gt;

&lt;p&gt;The idea was to process the follow-up commands directly and continue the execution of an instance until no more follow-up commands are produced. This normally means we have reached a wait state, like a service task. Camunda Platform 7 users will know this behavior, &lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/transactions-in-processes/#wait-states" rel="noopener noreferrer"&gt;as this is the Camunda Platform 7 default&lt;/a&gt;. The result was promising as well:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AROVY8b46DvI3MhZZ" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AROVY8b46DvI3MhZZ" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our example process model above, this would reduce the number of commit latencies incurred from ~30 commands to ~15, which is significant. The best I/O, of course, is no I/O.&lt;/p&gt;
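&lt;p&gt;Plugging illustrative numbers into the latency formula from the Theory section shows why halving the commit factor matters so much (all values are assumptions, not measurements):&lt;/p&gt;

```python
# Illustrative assumption-based numbers, in seconds (not measured values):
commit_latency = 0.075       # assumed: network round trip + append
processing_latency = 0.001   # assumed per-command processing cost

# Without batch processing, every one of the ~30 commands incurs a commit:
pel_before = 30 * commit_latency + 30 * processing_latency

# Batch processing executes follow-up commands until a wait state,
# roughly halving the number of commits (~30 -> ~15):
pel_after = 15 * commit_latency + 30 * processing_latency

print(f"{pel_before:.3f}s -> {pel_after:.3f}s")
assert pel_after < pel_before
```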

&lt;h4&gt;
  
  
  Combining the POCs
&lt;/h4&gt;

&lt;p&gt;By combining several POCs, we reached our target line, which showed us that it is possible and gave us good insights into where to invest in order to improve our system further in the future.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AxzdFc_SWUJEtcm6w" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AxzdFc_SWUJEtcm6w" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The improvements did not just reduce the overall latency of the system. In our weekly benchmarks, we had to increase the load because the system was able to reach a higher throughput. Before, we reached ~133 process instances per second (PI/s) on average over three partitions; now we reach 163 PI/s on average, while also reducing the latency by a factor of two.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next
&lt;/h3&gt;

&lt;p&gt;In the last weeks, we took several ideas from the hackdays to implement some production-ready solutions for Zeebe 8.2. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/issues/11455" rel="noopener noreferrer"&gt;Disabling WAL per default&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/issues/11416" rel="noopener noreferrer"&gt;Implement batch processing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/issues/11423" rel="noopener noreferrer"&gt;Make disabling raft flush more safe&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/issues/11494" rel="noopener noreferrer"&gt;Direct message correlation on the same partition&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We plan to work on some more like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/issues/6044" rel="noopener noreferrer"&gt;Make state directory configurable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/issues/11377" rel="noopener noreferrer"&gt;Advise OS on mmap usage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/issues/11488" rel="noopener noreferrer"&gt;Configurable raft flush interval&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You can expect some better performance with the 8.2 release; I’m really looking forward to April! :)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Thanks to all participants of the hackdays for the great and fun collaboration, and to our manager (&lt;a href="https://github.com/megglos" rel="noopener noreferrer"&gt;Sebastian Bathke&lt;/a&gt;) who made this possible. It was a really nice experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Participants (alphabetically sorted):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Zelldon" rel="noopener noreferrer"&gt;Christopher Zell (myself)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/deepthidevaki" rel="noopener noreferrer"&gt;Deepthi Devaki Akkoorath&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/falko" rel="noopener noreferrer"&gt;Falko Menge&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/npepinpe" rel="noopener noreferrer"&gt;Nicolas Pepin-Perreault&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/oleschoenburg" rel="noopener noreferrer"&gt;Ole Schönburg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/romansmirnov" rel="noopener noreferrer"&gt;Roman Smirnov&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/megglos" rel="noopener noreferrer"&gt;Sebastian Bathke&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Thanks to all the reviewers of this blog post:&lt;/em&gt; &lt;a href="https://github.com/cmausley" rel="noopener noreferrer"&gt;&lt;em&gt;Christina Ausley&lt;/em&gt;&lt;/a&gt;, &lt;a href="https://github.com/deepthidevaki" rel="noopener noreferrer"&gt;&lt;em&gt;Deepthi Devaki Akkoorath&lt;/em&gt;&lt;/a&gt;, &lt;a href="https://github.com/npepinpe" rel="noopener noreferrer"&gt;&lt;em&gt;Nicolas Pepin-Perreault&lt;/em&gt;&lt;/a&gt;, &lt;a href="https://github.com/oleschoenburg" rel="noopener noreferrer"&gt;&lt;em&gt;Ole Schönburg&lt;/em&gt;&lt;/a&gt;&lt;em&gt;,&lt;/em&gt; &lt;a href="https://github.com/saig0" rel="noopener noreferrer"&gt;&lt;em&gt;Philipp Ossler&lt;/em&gt;&lt;/a&gt; &lt;em&gt;and&lt;/em&gt; &lt;a href="https://github.com/megglos" rel="noopener noreferrer"&gt;&lt;em&gt;Sebastian Bathke&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>benchmark</category>
      <category>camunda</category>
      <category>zeebe</category>
      <category>performance</category>
    </item>
    <item>
      <title>Camunda's Hacktoberfest 2022</title>
      <dc:creator>Mia Moore</dc:creator>
      <pubDate>Tue, 04 Oct 2022 10:47:21 +0000</pubDate>
      <link>https://forem.com/camunda/camunda-1jdd</link>
      <guid>https://forem.com/camunda/camunda-1jdd</guid>
      <description>&lt;p&gt;Happy Hacktoberfest, everyone!&lt;/p&gt;

&lt;p&gt;We are so excited to be participating in Hacktoberfest for the third time this year. At &lt;a href="http://www.camunda.com"&gt;Camunda&lt;/a&gt;, we believe that open source can help unlock the full potential of process automation. We’re passionate about automating processes, creating easy-to-use products, and collaborating with our community members.&lt;/p&gt;

&lt;p&gt;For every &lt;a href="https://camunda.com/hacktoberfest2022/"&gt;Camunda challenge completion&lt;/a&gt;, we will make a donation to One Tree Planted which plants trees across the globe focusing on areas in need of habitat rehabilitation. &lt;/p&gt;

&lt;p&gt;If you complete the challenge you will also have the option of choosing to receive the limited edition Camunda x Hacktoberfest 2022 t-shirt. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc89hx7ocousozezrrl6v.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc89hx7ocousozezrrl6v.jpg" alt="Camunda x Hacktoberfest 2022 t-shirt" width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: Documentation improvements are always welcome!&lt;/p&gt;

&lt;p&gt;Get the whole lowdown at our blog post here - &lt;a href="https://camunda.com/hacktoberfest2022/"&gt;Hacktoberfest 2022 at Camunda&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Or review the following repos for Hacktoberfest issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/orgs/bpmn-io/repositories?q=hacktoberfest&amp;amp;type=all&amp;amp;language=&amp;amp;sort="&gt;BPMN.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/camunda/zeebe/labels/hacktoberfest"&gt;Zeebe&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/orgs/camunda-community-hub/repositories?q=hacktoberfest&amp;amp;type=all&amp;amp;language=&amp;amp;sort="&gt;Participating Camunda Community Hub projects&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy hacking!&lt;/p&gt;

</description>
      <category>hacktoberfest</category>
      <category>opensource</category>
      <category>javascript</category>
      <category>java</category>
    </item>
    <item>
      <title>Zbchaos — A new fault injection tool for Zeebe</title>
      <dc:creator>Christopher Kujawa</dc:creator>
      <pubDate>Thu, 15 Sep 2022 12:16:14 +0000</pubDate>
      <link>https://forem.com/camunda/zbchaos-a-new-fault-injection-tool-for-zeebe-4cin</link>
      <guid>https://forem.com/camunda/zbchaos-a-new-fault-injection-tool-for-zeebe-4cin</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3200%2F0%2ACqGpxPfdlWhgIHm9" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3200%2F0%2ACqGpxPfdlWhgIHm9" alt="Photo by [Brett Jordan](https://unsplash.com/@brett_jordan?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText) on[ Unsplash](https://unsplash.com/s/photos/chaos?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText)" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;&lt;sup&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@brett_jordan?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Brett Jordan&lt;/a&gt; on&lt;a href="https://unsplash.com/s/photos/chaos?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt; Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/center&gt;

&lt;p&gt;During Summer Hackdays 2022, I worked on a project called “Zeebe chaos” (&lt;strong&gt;zbchaos&lt;/strong&gt;), a fault injection CLI tool. This allows us engineers to more easily run chaos experiments against Zeebe, build up confidence in the system’s capabilities, and discover potential weaknesses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To understand this blog post, it is useful to have a certain understanding of &lt;a href="https://kubernetes.io/docs/concepts/overview/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt; and &lt;a href="https://camunda.com/platform/zeebe/" rel="noopener noreferrer"&gt;Zeebe&lt;/a&gt; itself.&lt;/p&gt;
&lt;h2&gt;
  
  
  Summer Hackdays:
&lt;/h2&gt;

&lt;p&gt;Hackdays are a regular event at Camunda, where people from different departments (engineering, consulting, DevRel, etc.) work together on new ideas, pet projects, and more.&lt;/p&gt;

&lt;p&gt;Often, the results are quite impressive and are also presented in the following CamundaCon. For example, check out the agenda of this year’s &lt;a href="https://www.camundacon.com/agenda-day-2/" rel="noopener noreferrer"&gt;CamundaCon 2022&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Check out previous Summer Hackdays here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://camunda.com/blog/2020/09/highlights-from-the-summer-hackdays-2020/" rel="noopener noreferrer"&gt;Summer Hackdays 2020&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=YCG9yHry1ks&amp;amp;ab_channel=Camunda" rel="noopener noreferrer"&gt;Summer Hackdays 2019&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Zeebe chaos CLI
&lt;/h2&gt;

&lt;p&gt;Working on the &lt;a href="https://docs.camunda.io/docs/components/zeebe/zeebe-overview/" rel="noopener noreferrer"&gt;Zeebe project&lt;/a&gt; is not only about engineering a distributed system or a process engine, it is also about testing, benchmarking, and experimenting with our capabilities.&lt;/p&gt;

&lt;p&gt;We run regular chaos experiments against Zeebe to build up confidence in our system and to determine whether we have weaknesses in certain areas. In the past, we have written &lt;a href="https://github.com/zeebe-io/zeebe-chaos/tree/main/chaos-workers/chaos-experiments/scripts" rel="noopener noreferrer"&gt;many bash scripts&lt;/a&gt; to inject faults (chaos). We wanted to replace them with better tooling: a new CLI. This makes the tooling more maintainable and also lowers the barrier for others to experiment with the system.&lt;/p&gt;

&lt;p&gt;The CLI targets Kubernetes, as this is our recommended environment for Camunda Platform 8 Self-Managed, and the environment our own SaaS offering runs on.&lt;/p&gt;

&lt;p&gt;The tool builds upon our existing &lt;a href="https://helm.camunda.io/" rel="noopener noreferrer"&gt;Helm charts&lt;/a&gt;, which are normally used to deploy Zeebe within Kubernetes.&lt;/p&gt;
&lt;h3&gt;
  
  
  Requirements
&lt;/h3&gt;

&lt;p&gt;To use the CLI you need to have access to a Kubernetes cluster, and have our Camunda Platform 8 Helm charts deployed. &lt;a href="https://docs.camunda.io/docs/self-managed/platform-deployment/kubernetes-helm/#installing-the-camunda-helm-chart-in-a-cloud-environment" rel="noopener noreferrer"&gt;Additionally, feel free to try out Camunda Platform 8 Self-Managed&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Chaos Engineering:
&lt;/h2&gt;

&lt;p&gt;You might be wondering why we need this fault injection CLI tool, or what this “chaos” stands for. It comes from chaos engineering, a practice we introduced to the Zeebe project back in 2019.&lt;/p&gt;

&lt;p&gt;Chaos engineering is defined by the &lt;a href="https://principlesofchaos.org/" rel="noopener noreferrer"&gt;Principles of Chaos&lt;/a&gt;. It helps build confidence in the system's capabilities and find potential weaknesses through chaos experiments, which we define and execute regularly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://page.camunda.com/cclive-zell-chaosengineeringmeetszeebe" rel="noopener noreferrer"&gt;Take a look at my talk at CamundaCon 2020.2 to get to know more about Chaos Engineering at Camunda (and Zeebe)&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Chaos experiments
&lt;/h3&gt;

&lt;p&gt;As mentioned, we regularly write and run new chaos experiments to build up confidence in our system and uncover weaknesses. The first thing you have to do for a chaos experiment is define a hypothesis that you want to prove; for example, that processing is still possible after a node goes down. Based on the hypothesis, you know what kind of property or steady state you want to verify before and after injecting faults into the system.&lt;/p&gt;

&lt;p&gt;A chaos experiment consists of three phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verify the steady state.&lt;/li&gt;
&lt;li&gt;Inject chaos.&lt;/li&gt;
&lt;li&gt;Verify the steady state.&lt;/li&gt;
&lt;/ol&gt;
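&lt;p&gt;The three phases above can be sketched as a tiny generic harness. The callables here are placeholders for illustration, not real zbchaos APIs:&lt;/p&gt;

```python
# Generic sketch of the three-phase chaos experiment structure described above.
def run_chaos_experiment(verify_steady_state, inject_chaos):
    verify_steady_state()   # 1. the system behaves normally before the experiment
    inject_chaos()          # 2. e.g. take a broker down, add network latency
    verify_steady_state()   # 3. the system recovered; the hypothesis still holds

run_chaos_experiment(
    lambda: print("steady state verified"),
    lambda: print("chaos injected"),
)
```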

&lt;p&gt;For each of these phases, the &lt;strong&gt;zbchaos&lt;/strong&gt; CLI provides certain features outlined below.&lt;/p&gt;
&lt;h4&gt;
  
  
  Verify steady state
&lt;/h4&gt;

&lt;p&gt;In the steady state phase, we want to verify certain properties of the system, like invariants, etc.&lt;/p&gt;

&lt;p&gt;One of the first things we typically want to check is the Zeebe topology. With &lt;strong&gt;zbchaos&lt;/strong&gt; you can run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;zbchaos topology
0 |LEADER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt; |FOLLOWER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt; |LEADER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt;
1 |FOLLOWER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt; |LEADER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt; |FOLLOWER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt;
2 |FOLLOWER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt; |FOLLOWER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt; |FOLLOWER &lt;span class="o"&gt;(&lt;/span&gt;HEALTHY&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zbchaos does all the necessary magic for you: finding a Zeebe gateway, doing a port-forward, requesting the topology, and printing it in a compact format. This makes the chaos engineer’s life much easier.&lt;/p&gt;

&lt;p&gt;Another basic check is verifying the readiness of all deployed Zeebe components. To achieve this, we can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;zbchaos verify readiness
All Zeebe nodes are running.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This verifies the status of the Zeebe broker pods and the Zeebe gateway deployment. If one of these is not ready yet, it loops and does not return until they are ready, which is beneficial in automation scripts.&lt;/p&gt;

&lt;p&gt;After you have verified the general health and readiness of the system, you also need to verify whether the system is working functionally. This is also called “verifying the steady state.” This can be achieved by:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;zbchaos verify steady-state — partitionId 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command checks that a process model can be deployed and that a process instance can be started on the specified partition. Since you cannot influence which partition a new process instance lands on, process instances are started in a loop until that partition is hit. If you don’t specify the &lt;strong&gt;partitionId&lt;/strong&gt;, partition one is used.&lt;/p&gt;

&lt;h4&gt;
  
  
  Inject chaos
&lt;/h4&gt;

&lt;p&gt;After we verify our steady state, we want to inject faults or chaos into our system and afterward check the steady state again. The &lt;strong&gt;zbchaos&lt;/strong&gt; CLI already provides several possibilities to inject faults, outlined below.&lt;/p&gt;

&lt;p&gt;Before we step through how we can inject failures, we need to understand what kind of components a Zeebe cluster consists of and what the architecture looks like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2810%2F0%2AbzpYhhsYYz4ATUpL" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2810%2F0%2AbzpYhhsYYz4ATUpL" alt="[https://docs.camunda.io/assets/images/zeebe-gateway-overview-2c9e101330b27687016509acef12725f.png](https://docs.camunda.io/assets/images/zeebe-gateway-overview-2c9e101330b27687016509acef12725f.png)" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have two types of nodes: the broker, and the gateway.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://docs.camunda.io/docs/components/zeebe/technical-concepts/architecture/#brokers" rel="noopener noreferrer"&gt;broker&lt;/a&gt; is a node that does the processing work. It can participate in one or more Zeebe partitions (&lt;a href="https://docs.camunda.io/docs/components/zeebe/technical-concepts/partitions/" rel="noopener noreferrer"&gt;internally each partition is a raft group, which can consist of one or more nodes&lt;/a&gt;). A broker can have different roles for each partition (Leader, Follower, etc.)&lt;/p&gt;

&lt;p&gt;For more details about the replication, check our &lt;a href="https://docs.camunda.io/docs/components/zeebe/technical-concepts/partitions/#replication" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and the &lt;a href="https://raft.github.io/" rel="noopener noreferrer"&gt;raft documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Zeebe gateway is the contact point to the Zeebe cluster to which clients connect. Clients send commands to the gateway and the gateway is in charge of distributing the commands to the partition leaders. This depends on the command type of course. &lt;a href="https://docs.camunda.io/docs/self-managed/zeebe-gateway-deployment/zeebe-gateway/" rel="noopener noreferrer"&gt;For more details, check out the documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By default, the Zeebe gateways are replicated when Camunda Platform 8 Self-Managed is installed via our &lt;a href="http://helm.camunda.io" rel="noopener noreferrer"&gt;Helm charts&lt;/a&gt;, which makes it interesting to also experiment with the gateways.&lt;/p&gt;

&lt;h5&gt;
  
  
  Shutdown nodes
&lt;/h5&gt;

&lt;p&gt;With &lt;strong&gt;zbchaos&lt;/strong&gt; we can shut down brokers (gracefully and non-gracefully) that have a specific role and take part in a specific partition. This is quite useful when experimenting, since we often want to terminate or restart brokers based on their partition participation and role (e.g., terminate the leader of partition X or restart all followers of partition Y).&lt;/p&gt;

&lt;h6&gt;
  
  
  Graceful
&lt;/h6&gt;

&lt;p&gt;A graceful restart can be initiated like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;zbchaos restart &lt;span class="nt"&gt;-h&lt;/span&gt;
Restarts a Zeebe broker with a certain role and given partition.

    Usage:
    zbchaos restart &lt;span class="o"&gt;[&lt;/span&gt;flags]

    Flags:
      &lt;span class="nt"&gt;-h&lt;/span&gt;, &lt;span class="nt"&gt;--help&lt;/span&gt; &lt;span class="nb"&gt;help &lt;/span&gt;&lt;span class="k"&gt;for &lt;/span&gt;restart
      &lt;span class="nt"&gt;--partitionId&lt;/span&gt; int Specify the &lt;span class="nb"&gt;id &lt;/span&gt;of the partition &lt;span class="o"&gt;(&lt;/span&gt;default 1&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="nt"&gt;--role&lt;/span&gt; string Specify the partition role &lt;span class="o"&gt;[&lt;/span&gt;LEADER, FOLLOWER, INACTIVE] &lt;span class="o"&gt;(&lt;/span&gt;default “LEADER”&lt;span class="o"&gt;)&lt;/span&gt;

    Global Flags:
    &lt;span class="nt"&gt;-v&lt;/span&gt;, — verbose verbose output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sends a Kubernetes &lt;strong&gt;delete&lt;/strong&gt; command to the pod that takes part in the specific partition and has the specific role. This is based on the current Zeebe topology, provided by the Zeebe gateway. All of this is handled by the &lt;strong&gt;zbchaos&lt;/strong&gt; toolkit; the chaos engineer doesn’t need to find this information manually.&lt;/p&gt;
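&lt;p&gt;For example, combining the flags from the help text above, a targeted restart of a follower of partition 3 could look like this (the partition id is just an example):&lt;/p&gt;

```shell
# Gracefully restart the broker that is currently FOLLOWER for partition 3
zbchaos restart --partitionId 3 --role FOLLOWER
```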

&lt;h6&gt;
  
  
  Non-graceful
&lt;/h6&gt;

&lt;p&gt;Similar to the graceful restart is the termination of a broker. It will send a delete to the specific Kubernetes pod, but will &lt;a href="https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/" rel="noopener noreferrer"&gt;set the &lt;strong&gt;--gracePeriod&lt;/strong&gt; to zero&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;zbchaos terminate &lt;span class="nt"&gt;-h&lt;/span&gt;
Terminates a Zeebe broker with a certain role and given partition.

    Usage:
      zbchaos terminate &lt;span class="o"&gt;[&lt;/span&gt;flags]
      zbchaos terminate &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;command&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;

    Available Commands:
      gateway Terminates a Zeebe gateway

    Flags:
      &lt;span class="nt"&gt;-h&lt;/span&gt;, &lt;span class="nt"&gt;--help&lt;/span&gt; &lt;span class="nb"&gt;help &lt;/span&gt;&lt;span class="k"&gt;for &lt;/span&gt;terminate
      &lt;span class="nt"&gt;--nodeId&lt;/span&gt; int Specify the nodeId of the Broker &lt;span class="o"&gt;(&lt;/span&gt;default &lt;span class="nt"&gt;-1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="nt"&gt;--partitionId&lt;/span&gt; int Specify the &lt;span class="nb"&gt;id &lt;/span&gt;of the partition &lt;span class="o"&gt;(&lt;/span&gt;default 1&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="nt"&gt;--role&lt;/span&gt; string Specify the partition role &lt;span class="o"&gt;[&lt;/span&gt;LEADER, FOLLOWER] &lt;span class="o"&gt;(&lt;/span&gt;default “LEADER”&lt;span class="o"&gt;)&lt;/span&gt;

    Global Flags:
    &lt;span class="nt"&gt;-v&lt;/span&gt;, &lt;span class="nt"&gt;--verbose&lt;/span&gt; verbose output

    Use “zbchaos terminate &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;command&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="nt"&gt;--help&lt;/span&gt;” &lt;span class="k"&gt;for &lt;/span&gt;more information about a command.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h6&gt;
  
  
  Gateway
&lt;/h6&gt;

&lt;p&gt;Both commands above target the Zeebe brokers. Sometimes, it is also interesting to target the Zeebe gateway. For that, we can just append the &lt;strong&gt;gateway&lt;/strong&gt; subcommand to the &lt;strong&gt;restart&lt;/strong&gt; or &lt;strong&gt;terminate&lt;/strong&gt; command.&lt;/p&gt;
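&lt;p&gt;For example, to kill one of the gateway pods (the &lt;strong&gt;gateway&lt;/strong&gt; subcommand is listed in the help text above):&lt;/p&gt;

```shell
# Terminate a Zeebe gateway pod instead of a broker
zbchaos terminate gateway
```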

&lt;h4&gt;
  
  
  Disconnect brokers
&lt;/h4&gt;

&lt;p&gt;It is not only interesting to experiment with graceful and non-graceful restarts; experimenting with network issues is also interesting. This kind of fault uncovers other interesting weaknesses (bugs).&lt;/p&gt;

&lt;p&gt;With the &lt;strong&gt;zbchaos&lt;/strong&gt; CLI, it is possible to disconnect different brokers. We can specify which partition they participate in and which role they have. These network partitions can also be set up in only one direction if the &lt;strong&gt;--one-direction&lt;/strong&gt; flag is used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ zbchaos disconnect -h
Disconnect Zeebe nodes, uses sub-commands to disconnect leaders, followers, etc.

    Usage:
     zbchaos disconnect [command]

    Available Commands:
     brokers Disconnect Zeebe Brokers

    Flags:
     -h, — help help for disconnect

    Global Flags:
     -v, — verbose verbose output

    Use “zbchaos disconnect [command] — help” for more information about a command.
    [zell ~/ cluster: zeebe-cluster ns:zell-chaos]$ zbchaos disconnect brokers -h
    Disconnect Zeebe Brokers with a given partition and role.

    Usage:
     zbchaos disconnect brokers [flags]

    Flags:
     — broker1NodeId int Specify the nodeId of the first Broker (default -1)
     — broker1PartitionId int Specify the partition id of the first Broker (default 1)
     — broker1Role string Specify the partition role [LEADER, FOLLOWER] of the first Broker (default “LEADER”)
     — broker2NodeId int Specify the nodeId of the second Broker (default -1)
     — broker2PartitionId int Specify the partition id of the second Broker (default 2)
     — broker2Role string Specify the partition role [LEADER, FOLLOWER] of the second Broker (default “LEADER”)
     -h, — help help for brokers
     — one-direction Specify whether the network partition should be setup only in one direction (asymmetric)

    Global Flags:
     -v, — verbose verbose output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
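&lt;p&gt;Putting these flags together, an asymmetric disconnect between the leaders of partitions 1 and 2 could look like this (the partition ids are just examples):&lt;/p&gt;

```shell
# Disconnect the leader of partition 1 from the leader of partition 2,
# in one direction only
zbchaos disconnect brokers --broker1PartitionId 1 --broker1Role LEADER \
  --broker2PartitionId 2 --broker2Role LEADER --one-direction
```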



&lt;p&gt;The network partition is established via &lt;a href="https://man7.org/linux/man-pages/man8/ip-route.8.html" rel="noopener noreferrer"&gt;ip routes&lt;/a&gt;, which are installed on the specific broker pods.&lt;/p&gt;

&lt;p&gt;Right now this is only supported for the brokers, but hopefully, we will add support for the gateways soon as well.&lt;/p&gt;

&lt;p&gt;To connect the brokers again, the following can be used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;zbchaos connect brokers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This removes the ip routes on all pods again.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other features
&lt;/h3&gt;

&lt;p&gt;All the described commands support a verbose flag, which lets the user see which actions are performed, how the CLI connects to the cluster, and more.&lt;/p&gt;

&lt;p&gt;For all of the commands, a bash-completion can be generated via &lt;code&gt;zbchaos completion&lt;/code&gt;, which is very handy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outcome and future
&lt;/h2&gt;

&lt;p&gt;In general, I was quite happy with the outcome of Summer Hackdays 2022, and it was a lot of fun to build and use this tool. I was finally able to spend some more time writing Go code, especially a Go CLI. I learned to use the Kubernetes go-client and how to write Go tests with fakes for the Kubernetes API, which was quite interesting. You can take a look at the tests &lt;a href="https://github.com/zeebe-io/zeebe-chaos/blob/main/go-chaos/internal/pods_test.go" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/zeebe-io/zeebe-chaos/tree/main/go-chaos" rel="noopener noreferrer"&gt;Code of the CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/zeebe-io/zeebe-chaos/releases" rel="noopener noreferrer"&gt;Releases&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zeebe-io.github.io/zeebe-chaos/2022/08/31/Message-Correlation-after-Network-Partition/" rel="noopener noreferrer"&gt;Example usage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We plan to extend the CLI in the future and use it in our upcoming experiments.&lt;/p&gt;

&lt;p&gt;For example, I recently did a new chaos day, a day I use to run new experiments, and &lt;a href="https://zeebe-io.github.io/zeebe-chaos/2022/08/31/Message-Correlation-after-Network-Partition/" rel="noopener noreferrer"&gt;wrote a post about it&lt;/a&gt;. For that chaos day, I extended the CLI with features like sending messages to certain partitions.&lt;/p&gt;

&lt;p&gt;At some point, we want to use the functionality within our automated chaos experiments as Zeebe workers and replace our old bash scripts.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Thanks to Christina Ausley and Bernd Ruecker for reviewing this post :)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>chaosengineering</category>
      <category>kubernetes</category>
      <category>opensource</category>
      <category>camunda</category>
    </item>
    <item>
      <title>A Technical Sneak Peek into Camunda’s Connector Architecture</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Wed, 03 Aug 2022 13:19:56 +0000</pubDate>
      <link>https://forem.com/camunda/a-technical-sneak-peek-into-camundas-connector-architecture-5dib</link>
      <guid>https://forem.com/camunda/a-technical-sneak-peek-into-camundas-connector-architecture-5dib</guid>
      <description>&lt;h4&gt;
  
  
  What is a connector? What does the code for a connector look like? And how can connectors be operated in various scenarios?
&lt;/h4&gt;

&lt;p&gt;When Camunda Platform 8 launched earlier this year, we announced connectors and provided some preview connectors available in our SaaS offering, such as &lt;a href="https://docs.camunda.io/docs/components/modeler/web-modeler/connectors/available-connectors/sendgrid/"&gt;sending an email using SendGrid&lt;/a&gt;, &lt;a href="https://docs.camunda.io/docs/components/modeler/web-modeler/connectors/available-connectors/rest/"&gt;invoking a REST API&lt;/a&gt;, or &lt;a href="https://docs.camunda.io/docs/components/modeler/web-modeler/connectors/available-connectors/slack/"&gt;sending a message to &lt;/a&gt;&lt;a href="https://docs.camunda.io/docs/components/modeler/web-modeler/connectors/available-connectors/slack/"&gt;Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since then, many people have asked us what a connector is, how such a connector is developed, and how it can be used in Self-Managed. We haven’t yet published much information on the technical architecture of connectors as it is still under development, but at the same time, I totally understand that perhaps you want to know more to feel as excited as me about connectors.&lt;/p&gt;

&lt;p&gt;In this blog post, I’ll briefly share what a connector is made of, how the code for a connector roughly looks, and how connectors can be operated in various scenarios. Note that the information is a preview, and details are subject to change.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a connector?
&lt;/h3&gt;

&lt;p&gt;A connector is a component that talks to a third-party system via an API and thus allows orchestrating that system via Camunda (or let that system influence Camunda’s orchestration).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--43J5ZFOx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AGd46_wJBm4m3oRnz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--43J5ZFOx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AGd46_wJBm4m3oRnz.png" alt="" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The connector consists of a bit of programming code needed to talk to the third-party system and some UI parts hooked into Camunda Modeler.&lt;/p&gt;

&lt;p&gt;This is pretty generic, I know. Let’s get a bit more concrete and differentiate types of connectors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Outbound connectors&lt;/strong&gt;: Something needs to happen in the third-party system when a process reaches a service task. For example, calling a REST endpoint or publishing a message to Slack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inbound connectors&lt;/strong&gt;: Something needs to happen within the workflow engine because of an external event in the third-party system. For example, because a Slack message was published or a REST endpoint was called. Inbound connectors can be of three different kinds:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Webhook&lt;/strong&gt;: An HTTP endpoint is made available to the outside which, when called, can start a process instance, for example.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subscription&lt;/strong&gt;: A subscription is opened on the third-party system, like a message broker or Apache Kafka, and new entries are received and correlated, for example, to a waiting process instance in Camunda.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polling&lt;/strong&gt;: Some external API needs to be regularly queried for new entries, such as a drop folder on Google Drive or an FTP server.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Outbound example
&lt;/h3&gt;

&lt;p&gt;Let’s briefly look at one outbound connector: &lt;a href="https://docs.camunda.io/docs/components/modeler/web-modeler/connectors/available-connectors/rest/"&gt;the REST connector&lt;/a&gt;. You can define a couple of properties, like which URL to invoke using which HTTP method. This is configured via Web Modeler, which basically means those properties end up in the XML of the BPMN process model. The translation of the UI to the XML is done by the &lt;a href="https://docs.camunda.io/docs/components/modeler/desktop-modeler/element-templates/about-templates/"&gt;element template mechanism&lt;/a&gt;. This makes connectors convenient to use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RrMGKhUw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1003/0%2AD_QzJibIqs0LWe9m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RrMGKhUw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1003/0%2AD_QzJibIqs0LWe9m.png" alt="" width="800" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, code is also required to really perform the outbound call. The overall Camunda Platform 8 integration framework provides a software development kit (SDK) to program such a connector against. Simplified, an outbound REST connector provides an execute method that is called whenever a process instance needs to invoke the connector; a context is provided with all input data, configuration, and an abstraction for the secret store.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/e960514728b5646935597f26a1d141c2/href"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/e960514728b5646935597f26a1d141c2/href"&gt;https://medium.com/media/e960514728b5646935597f26a1d141c2/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now there needs to be some glue code calling this function whenever a process instance reaches the respective service task. This is the job of the connector runtime. This runtime registers &lt;a href="https://docs.camunda.io/docs/components/concepts/job-workers/"&gt;job workers&lt;/a&gt; with Zeebe and calls the outbound connector function whenever there are new jobs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kXUMdQG0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AZVVn4M0SDa2Cq6pL.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kXUMdQG0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AZVVn4M0SDa2Cq6pL.png" alt="" width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This connector runtime is independent of the concrete connector code being executed. In fact, a connector runtime can handle multiple connectors at the same time. Therefore, a connector brings its own metadata so that the runtime knows how to call it.&lt;/p&gt;

&lt;p&gt;With this, we’ve built a Spring Boot-based runtime that can discover all outbound connectors on the classpath and register the required job workers. This makes it super easy to test a single connector, as you can run it locally, but you can also stitch together a Spring Boot application with all the connectors you want to run in your Camunda Platform 8 Self-Managed installation.&lt;/p&gt;

&lt;p&gt;At the same time, we have also built a connector runtime for our own SaaS offering, running in Google Cloud. While we also run a generic, Java-based connector runtime there, all outbound connectors themselves are deployed as Google Cloud Functions. Secrets are handled by Google Cloud Secret Manager in this case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YdvWRqIh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AzPpJuqZo_SF9ULOS.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YdvWRqIh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AzPpJuqZo_SF9ULOS.png" alt="" width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The great thing here is that the connector code itself does not know anything about the environment it runs in, making connectors available in the whole Camunda Platform 8 ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inbound example
&lt;/h3&gt;

&lt;p&gt;Compared to outbound, inbound is a very different beast. An inbound connector needs to open up an HTTP endpoint, create a subscription, or start polling. It might even require some kind of state, for example, to remember what was already polled. Exceptions in a connector should be visible to an operator, even if there is no process instance to pinpoint them to.&lt;/p&gt;

&lt;p&gt;We are currently designing and validating the architecture on this end, so consider it in flux. Still, some of the primitives from outbound connectors will also hold true for inbound:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parameters can be configured via the Modeler UI and stored in the BPMN process.&lt;/li&gt;
&lt;li&gt;The core connector code will be runnable in different environments.&lt;/li&gt;
&lt;li&gt;Metadata will be provided so that the connector runtime can easily pick up new connectors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A prototypical connector receiving AMQP messages (e.g., from RabbitMQ) opens a subscription and correlates each incoming message to the workflow engine.&lt;/p&gt;

&lt;p&gt;And here is the related visualization:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WydsXs1m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AbD7c8nTSoZveRt9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WydsXs1m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AbD7c8nTSoZveRt9w.png" alt="" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Status and next steps
&lt;/h3&gt;

&lt;p&gt;Currently, only a fraction of what we work on is publicly visible. Therefore, there are currently some limitations on connectors in Camunda Platform version 8.0, mainly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The SDK for connectors is not open to the public simply because we need to finalize some things first, as we want to avoid people building connectors that need to be changed later on.&lt;/li&gt;
&lt;li&gt;The code of &lt;a href="https://docs.camunda.io/docs/components/modeler/web-modeler/connectors/available-connectors/"&gt;existing connectors (REST, SendGrid, and Slack)&lt;/a&gt; is not available and cannot be run on Self-Managed environments yet.&lt;/li&gt;
&lt;li&gt;The UI support is only available within Web Modeler, not yet within Desktop Modeler.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are working on all of these areas and plan to release the connector SDK later this year. We can then provide sources and binaries to existing connectors to run them in Self-Managed environments or to understand their inner workings. Along with the SDK, we plan to release connector templates that allow you to easily design the UI attributes and parameters required for your connector and provide you with the ability to share the connector template with your project team.&lt;/p&gt;

&lt;p&gt;At the same time, we are also working on providing more out-of-the-box connectors (like the &lt;a href="https://docs.camunda.io/docs/components/modeler/web-modeler/connectors/available-connectors/slack/"&gt;Slack connector&lt;/a&gt; that was just released last week) and making them available open source. We are also in touch with partners who are eager to provide connectors to the Camunda ecosystem. As a result, we plan to offer some kind of exchange where you can easily see which connectors are available, their guarantees, and their limitations.&lt;/p&gt;

&lt;p&gt;Still, the whole connector architecture is built to allow everybody to build their own connectors. In particular, this enables you to build private connectors for your own legacy systems that can be reused across your organization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;The main building block for implementing connectors is our SDK for inbound and outbound connectors, whereas inbound connectors can be based on webhooks, subscriptions, or polling. This allows writing connector code that is independent of the connector runtime so that you can leverage connectors in the Camunda SaaS offering and your own Self-Managed environment.&lt;/p&gt;

&lt;p&gt;At the same time, connector templates will allow a great modeling experience when using connectors within your own models. We are making great progress, and you can expect to see more later this year. Exciting times ahead!&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>connector</category>
      <category>camunda</category>
      <category>bpmn</category>
      <category>connectivity</category>
    </item>
    <item>
      <title>Why Process Orchestration Needs Advanced Workflow Patterns</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Mon, 01 Aug 2022 19:23:19 +0000</pubDate>
      <link>https://forem.com/camunda/why-process-orchestration-needs-advanced-workflow-patterns-4e9f</link>
      <guid>https://forem.com/camunda/why-process-orchestration-needs-advanced-workflow-patterns-4e9f</guid>
<description>&lt;p&gt;Life is seldom a straight line, and the same is true for processes. Therefore, you must be able to accurately express everything happening in your business processes for proper end-to-end process orchestration. This requires &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/"&gt;workflow patterns&lt;/a&gt; that go beyond basic control flow patterns (like sequence or condition). If your orchestration tool does not provide these advanced workflow patterns, you will experience confusion amongst developers, need to implement time-consuming workarounds, and end up with confusing models. Let’s explore why these advanced workflow patterns matter by examining an example in today’s blog post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initial process example
&lt;/h3&gt;

&lt;p&gt;Let’s assume you’re processing incoming orders of hand-crafted goods to be shipped individually. Each order consists of many different order positions, which you want to work on in parallel with your team to save time and deliver quicker. However, while your team is working on the order, the customer is still able to cancel, and in that case, you need to be able to revoke any deliveries that have been scheduled already. A quick drawing on the whiteboard yields the following sketch of this example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nGBaTT9f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Anv9u3COivmUgd6-y" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nGBaTT9f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Anv9u3COivmUgd6-y" alt="" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an executable process model for this use case. I will first show you a possible process using &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/concepts-amazon-states-language.html"&gt;ASL (Amazon States Language)&lt;/a&gt; and AWS Step Functions, and secondly with Camunda Platform and &lt;a href="https://camunda.com/bpmn/"&gt;BPMN (Business Process Model and Notation)&lt;/a&gt; to illustrate the differences between these underlying workflow languages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modeling using AWS Step Functions
&lt;/h3&gt;

&lt;p&gt;The following model is created using ASL, which is part of AWS Step Functions and, as such, a bespoke language. Let’s look at the resulting diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_wK29BZJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/488/0%2AfMwSEFYnpCBB57JR" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_wK29BZJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/488/0%2AfMwSEFYnpCBB57JR" alt="" width="488" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To discuss it, I will use &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/"&gt;workflow patterns&lt;/a&gt;, which are a proven set of patterns you will need to express any workflow.&lt;/p&gt;

&lt;p&gt;The good news is that ASL supports a workflow pattern called “&lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#dynamic-parallel-branches"&gt;dynamic parallel branches&lt;/a&gt;,” which allows parallelizing the execution of the order positions. This is good; otherwise, we would need to start multiple workflow instances for the order positions and do all the synchronization by hand.&lt;/p&gt;
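&lt;p&gt;In ASL, this pattern maps to a &lt;code&gt;Map&lt;/code&gt; state that iterates over an array in the input; the state names and the items path below are illustrative only:&lt;/p&gt;

```json
"DeliverOrderPositions": {
  "Type": "Map",
  "ItemsPath": "$.orderPositions",
  "Iterator": {
    "StartAt": "DeliverPosition",
    "States": {
      "DeliverPosition": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "End": true }
    }
  },
  "End": true
}
```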

&lt;p&gt;But this is where things get complicated. ASL does not offer &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#external-messagesevents"&gt;reactions to external messages&lt;/a&gt;; thus, you cannot interrupt your running workflow instance when an external event happens, like the customer canceling their order. Therefore, you need a workaround. One possibility is to use a parallel branch that waits for the cancellation event alongside the multi-instance tasks, marked with (1) in the illustration above.&lt;/p&gt;

&lt;p&gt;When implementing that wait state around cancellation, you will undoubtedly miss a proper &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#correlation-mechanisms"&gt;correlation mechanism&lt;/a&gt;, as you cannot easily correlate events from the outside to the running workflow instance. Instead, you could leverage the &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/connect-to-resource.html#connect-wait-example"&gt;task token&lt;/a&gt; generated by AWS and keep it in an external data store so that you can locate the correct task token for a given order id. This means you have to implement a bespoke message correlation mechanism yourself, including persistence, as &lt;a href="https://aws.amazon.com/blogs/compute/integrating-aws-step-functions-callbacks-and-external-systems/"&gt;described in Integrating AWS Step Functions callbacks and external systems&lt;/a&gt;.&lt;/p&gt;
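&lt;p&gt;The linked AWS article describes a durable implementation; the following is only a minimal in-memory sketch of such a bespoke correlation mechanism, with all names and the store itself being hypothetical:&lt;/p&gt;

```python
# Hypothetical sketch of a bespoke task-token correlation mechanism.
# A real implementation would persist the mapping in a durable store
# (e.g. DynamoDB, as in the linked AWS article), not in a dict.

class TaskTokenStore:
    """Maps a business key (order id) to a Step Functions task token."""

    def __init__(self):
        self._tokens = {}

    def save(self, order_id: str, task_token: str) -> None:
        # Called when the workflow enters its wait state and hands out a token.
        self._tokens[order_id] = task_token

    def correlate(self, order_id: str) -> str:
        # Called when an external event (e.g. a cancellation) arrives;
        # returns the token needed to resume the waiting workflow instance.
        return self._tokens.pop(order_id)


store = TaskTokenStore()
store.save("order-42", "example-task-token")

# On cancellation, look up the token and resume the workflow, e.g. via boto3:
# sfn.send_task_success(taskToken=token, output='{"canceled": true}')
token = store.correlate("order-42")
```

&lt;p&gt;The point is not the code itself but that you own this plumbing, including its persistence and failure handling.&lt;/p&gt;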

&lt;p&gt;When the cancellation message comes in, the workflow advances along that workaround path and needs to raise an error so all order delivery tasks are canceled and the process can move on directly to cancellation, marked with (2) in the illustration above.&lt;/p&gt;

&lt;p&gt;But even in the desired case where the order is not canceled, you need to leverage an error, marked with (3) in the illustration above. This is necessary to interrupt the task waiting for the cancellation message.&lt;/p&gt;

&lt;p&gt;You need a similar workaround again when you want to wait for payment but stop waiting after a specified timeout. To do this, you start a timer in parallel, marked with (4), and use an error to stop it later, marked with (5).&lt;/p&gt;

&lt;p&gt;Note that when you configure this wait state, you might get the feeling you are misusing Step Functions, as the wait time is configured in seconds, meaning you have to enter a big number (864,000 seconds) to wait ten days.&lt;/p&gt;
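&lt;p&gt;Since ASL is JSON, this boils down to a Wait state with a hard-coded number of seconds. A small sketch (the state names are illustrative; only the Wait state type and its Seconds field are part of ASL):&lt;/p&gt;

```python
import json

# Ten days expressed in seconds, as an ASL Wait state requires:
TEN_DAYS_IN_SECONDS = 10 * 24 * 60 * 60  # 864000

wait_state = {
    "WaitForPayment": {            # illustrative state name
        "Type": "Wait",            # ASL's built-in Wait state type
        "Seconds": TEN_DAYS_IN_SECONDS,
        "Next": "PaymentReceived"  # illustrative successor state
    }
}
print(json.dumps(wait_state, indent=2))
```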

&lt;p&gt;Of course, you could also implement your requirements differently. For example, you might implement all order cancellation logic entirely outside of the process model and just terminate the running order fulfillment instance via API. But note that by doing so, you will lose a lot of visibility into what happens in your process, not only during design time but also during operations or improvement endeavors.&lt;/p&gt;

&lt;p&gt;Additionally, you distribute logic that belongs together all over the place (the step function, application code, etc.). For example, a change in order fulfillment might mean you have to rethink your cancellation procedure, a dependency that is obvious only if cancellation is part of the model.&lt;/p&gt;

&lt;p&gt;To summarize, the lack of advanced workflow patterns requires workarounds, which are not only hard to implement but also make the model hard to understand, weakening the value proposition of an orchestration engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modeling with BPMN
&lt;/h3&gt;

&lt;p&gt;Now let’s contrast this with modeling using the ISO standard &lt;a href="https://camunda.com/bpmn/"&gt;BPMN&lt;/a&gt; within Camunda:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3G17nq-v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/960/0%2ATOBpVRbQLddwstAT" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3G17nq-v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/960/0%2ATOBpVRbQLddwstAT" alt="" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This model is directly executable on engines that support BPMN, like Camunda. As you can see, BPMN supports all the required advanced workflow patterns, which not only makes this process easy to model but also yields a very understandable diagram.&lt;/p&gt;

&lt;p&gt;Let’s briefly call out the workflow patterns (besides the basics like &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#sequence"&gt;sequence&lt;/a&gt;, &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#conditions-ifthen"&gt;condition&lt;/a&gt;, and &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#wait"&gt;wait&lt;/a&gt;) that helped to make this process so easy to implement:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#dynamic-parallel-branches"&gt;Dynamic parallel branches&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#external-messagesevents"&gt;Reacting to external message events&lt;/a&gt; with &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#correlation-mechanisms"&gt;correlation mechanisms&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#time-based"&gt;Reacting to time-based events&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This model is perfectly suited for discussing the process with various stakeholders, and can further be shown in &lt;a href="https://camunda.com/platform/operate/"&gt;technical operations&lt;/a&gt; (e.g., if some process instance gets stuck) or &lt;a href="https://camunda.com/platform/optimize/"&gt;business analysis&lt;/a&gt; (e.g., to understand which orders are canceled most and in which state of the process execution). Below is a sample screenshot of the operations tooling showing a process instance with six order items, where one raised an incident. You can see how easy it gets to dive into potential operational problems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dKeiRNu3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AiNhmDAAP3Zevndet" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dKeiRNu3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AiNhmDAAP3Zevndet" alt="" width="800" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Let’s not let history repeat itself!
&lt;/h3&gt;

&lt;p&gt;I remember one of my projects using the workflow engine JBoss jBPM 3.x back in 2009. I was in Switzerland for a couple of weeks, sorting out exception scenarios and describing patterns for dealing with them. Looking back, this was hard because jBPM 3 lacked a lot of essential workflow patterns, especially around the &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/"&gt;reaction to events&lt;/a&gt; or &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#error-scopes"&gt;error scopes&lt;/a&gt;, concepts I did not know about back then. In case you enjoy nostalgic pictures as much as I do, this is a model from back then:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Guvf098N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/492/0%2AMCZWGd96sxir5YM2" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Guvf098N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/492/0%2AMCZWGd96sxir5YM2" alt="" width="492" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m happy to see that BPMN removed the need for all of those workarounds, which created a lot of frustration among developers. Additionally, the improved visualization really allowed me to discuss process models with a larger group of people with various experience levels and backgrounds in process orchestration.&lt;/p&gt;

&lt;p&gt;Interestingly enough, many modern workflow or orchestration engines lack the advanced workflow patterns described above. Often, this comes with the promise of being simpler than BPMN. But in reality, this claimed simplicity means essential patterns are missing. Hence, if you follow the development of these modeling languages over time, you will see that they add patterns once in a while, and whenever such a tool is successful, it almost inevitably ends up with a language complexity comparable to BPMN, but in a proprietary way. As a result, process models in those languages are typically harder to understand.&lt;/p&gt;

&lt;p&gt;At the same time, developing a workflow language is very hard, so chances are high that vendors will take a long time to develop proper pattern support. I personally don’t understand the motivation for going down this road, as the knowledge about workflow patterns is readily available, and BPMN implements them in an industry-proven way, even as an ISO standard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;The reality of a business process requires advanced workflow patterns. If a product does not natively support them, its users will need to create technical workarounds, as you could see in the example earlier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ASL lacked essential patterns and required complex workarounds.&lt;/li&gt;
&lt;li&gt;BPMN supports all required patterns and produces a very comprehensible model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Emulating advanced patterns with basic constructs and/or programming code, as necessary for ASL, means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your development takes longer.&lt;/li&gt;
&lt;li&gt;Your solution might come with technical weaknesses, like limited scalability or observability.&lt;/li&gt;
&lt;li&gt;You cannot use the executable process model as a communication vehicle for business and IT.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To summarize, ensure you use an orchestration product that supports all important workflow patterns, such as Camunda, which uses BPMN as its workflow language.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>workflowautomation</category>
      <category>workflowengine</category>
      <category>processorchestration</category>
      <category>orchestration</category>
    </item>
    <item>
      <title>How to Achieve Geo-redundancy with Zeebe</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Tue, 28 Jun 2022 14:14:35 +0000</pubDate>
      <link>https://forem.com/camunda/how-to-achieve-geo-redundancy-with-zeebe-4hma</link>
      <guid>https://forem.com/camunda/how-to-achieve-geo-redundancy-with-zeebe-4hma</guid>
      <description>&lt;p&gt;Camunda Platform 8 reinvented the way an orchestration and workflow engine works. We applied modern distributed system concepts and can now even allow geo-redundant workloads, often referred to as multi-region active-active clusters. Using this technology, organizations can build resilient systems that can withstand disasters in the form of a complete data center outage.&lt;/p&gt;

&lt;p&gt;For example, a recent customer project at a big financial institution connected a data center in Europe with one in the United States, and this did not affect their throughput, meaning they can still run the same number of process instances per second. But before talking about multi-region setups and performance, let’s disassemble this fascinating topic step by step in today’s blog post.&lt;/p&gt;

&lt;p&gt;Many thanks to our &lt;a href="https://github.com/falko"&gt;distributed systems guru, Falko&lt;/a&gt;, for providing a ton of input on this topic, and to my great &lt;a href="https://twitter.com/nele_lea"&gt;colleague Nele&lt;/a&gt; for helping to get everything in order in this post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hang on — geo-redundant? Multi-region? Active-active?
&lt;/h3&gt;

&lt;p&gt;First, let’s quickly explain some important basic terminology we are going to use in this post:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Geo-redundancy&lt;/strong&gt; (also referred to as geo-replication): We want to replicate data in a geographically distant second data center. This means even a massive disaster like a full data center going down will not result in any data loss. For some use cases, this becomes the de-facto standard as most businesses simply cannot risk losing any data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-region&lt;/strong&gt;: Most organizations deploy to public clouds, and public cloud providers call each of their data centers a region. So in essence, deploying to two different regions ensures those deployments end up in separate data centers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability zones&lt;/strong&gt;: A data center, or a region, is separated into availability zones. Those zones are physically separated, meaning an outage caused by technical failures is limited to one zone. Still, all zones of a region are geographically located in one data center.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Active-active&lt;/strong&gt;: When replicating data to a second machine, you could simply copy the data there, just to have it when disaster strikes. This is called a passive backup. Today, most use cases strive for the so-called active-active scenario instead, where data is actively processed on both machines. This makes sure you can efficiently use the provisioned hardware (and not keep a passive backup machine idle all the time).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zeebe&lt;/strong&gt;: The workflow engine within Camunda Platform 8.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So let’s rephrase what we want to look at today: How to run a multi-region active-active Zeebe cluster (which then is automatically geo-redundant and geo-replicated). That’s a mouthful!&lt;/p&gt;

&lt;h3&gt;
  
  
  Resilience levels
&lt;/h3&gt;

&lt;p&gt;Firstly, do you really need multi-region redundancy? To understand this better, let’s sketch the levels of resilience you can achieve:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clustering&lt;/strong&gt;: You build a cluster of nodes in one zone. You can withstand hardware or software failures of individual nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-zone&lt;/strong&gt;: You distribute nodes across multiple zones, increasing availability as you can now withstand an outage of a full zone. Zone outages are very rare.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-region&lt;/strong&gt;: You distribute nodes across multiple regions, meaning geographically distributed data centers. You will likely never experience an outage of a full region, as this happens only under exceptional circumstances.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So while most normal projects are totally fine with clustering, the sweet spot is multi-zone. Assuming you run on Kubernetes provided by one of the Hyperscalers, multi-zone is easy to set up and thus does not cause a lot of effort or costs. At the same time, it provides an availability that is more than sufficient for most use cases. Only if you really need to push this availability and need to withstand epic disasters do you need to go for multi-region deployments. I typically see this with big financial or telecommunication companies. That said, there might also be other drivers besides availability for a multi-region setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Locality: Having a cluster spanning multiple regions, clients can talk to the nodes closest to them. This can decrease network latencies.&lt;/li&gt;
&lt;li&gt;Migration: When you need to migrate to another region at your cloud provider, you might want to gradually take workloads over and run both regions in parallel for some time to avoid any downtimes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In today’s blog post, we want to unwrap Zeebe’s basic architecture to support any of those resilience scenarios, quickly describe a multi-zone setup, and also turn our attention to multi-region, simply because it is possible and we are regularly asked about it. Finally, we’ll explain how Zeebe scales and how we can turn any of those scenarios into an active-active deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Replication in Zeebe
&lt;/h3&gt;

&lt;p&gt;To understand how we can achieve resilience in Zeebe, you first need to understand how Zeebe does replication. Zeebe uses distributed consensus — more specifically the &lt;a href="https://raft.github.io/"&gt;Raft Consensus Algorithm&lt;/a&gt; — for replication. There is an awesome &lt;a href="http://thesecretlivesofdata.com/raft/"&gt;visual explanation of the Raft Consensus Algorithm&lt;/a&gt; available online, so I will not go into all the details here. The basic idea is that there is a &lt;strong&gt;single leader&lt;/strong&gt; and &lt;strong&gt;a set of followers&lt;/strong&gt;. The most common setup is to have one leader and two followers, and you’ll see why soon.&lt;/p&gt;

&lt;p&gt;When the Zeebe brokers start up, they elect a leader. Only the leader is allowed to write data. The data written by the leader is replicated to all followers. Only after a successful replication is the data considered committed and can be processed by Zeebe (this is explained in more detail in &lt;a href="https://dev.to/berndruecker/how-we-built-a-highly-scalable-distributed-state-machine-10hd"&gt;how we built a highly scalable distributed state machine&lt;/a&gt;). In essence, all (committed) data is guaranteed to exist on the leader and all followers all the time.&lt;/p&gt;

&lt;p&gt;There is one important property you can configure for your Zeebe cluster — the &lt;strong&gt;replication factor&lt;/strong&gt;. A replication factor of three means data is available three times, on the leader as well as replicated to two followers, as indicated in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V9vcf3Tf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/298/0%2AkwGslgf06ZWiyZVy" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V9vcf3Tf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/298/0%2AkwGslgf06ZWiyZVy" alt="" width="298" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A derived property is what is called the quorum. This is the number of nodes required to hold so-called elections. Those elections are necessary for the Zeebe cluster to select who is the leader and who is a follower. To elect a leader, at least round_down(replication factor / 2) + 1 nodes need to be available. In the above example, this means round_down(3/2)+1 = 2 nodes are needed to reach a quorum.&lt;/p&gt;

&lt;p&gt;So a cluster with a replication factor of three can process data if at least two nodes are available. This number of nodes is also needed to consider something committed in Zeebe.&lt;/p&gt;

&lt;p&gt;The replication factor of three is the most common, as it gives you a good compromise of the number of replicas (additional hardware costs) and availability (I can tolerate losing one node).&lt;/p&gt;
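&lt;p&gt;The quorum arithmetic above can be sketched in a few lines of Python:&lt;/p&gt;

```python
def quorum(replication_factor: int) -> int:
    """Minimum number of nodes needed to elect a leader (and to commit)."""
    return replication_factor // 2 + 1  # round_down(rf / 2) + 1

for rf in (3, 4, 5):
    tolerated_failures = rf - quorum(rf)
    print(f"replication factor {rf}: quorum {quorum(rf)}, "
          f"tolerates {tolerated_failures} node failure(s)")
# replication factor 3: quorum 2, tolerates 1 node failure(s)
# replication factor 4: quorum 3, tolerates 1 node failure(s)
# replication factor 5: quorum 3, tolerates 2 node failure(s)
```

&lt;p&gt;Note that an even replication factor of four still only tolerates one node failure, which is one reason odd replication factors are the usual choice.&lt;/p&gt;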

&lt;h3&gt;
  
  
  A sample failure scenario
&lt;/h3&gt;

&lt;p&gt;With this in mind, let’s quickly run through a failure scenario, where one node crashes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u2Ex7ik---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/756/0%2Ab2h4rQyWAmq-8ptR" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u2Ex7ik---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/756/0%2Ab2h4rQyWAmq-8ptR" alt="" width="756" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One node crashing will not affect the cluster at all, as it still can reach a quorum. Thus, it can elect a new leader and continue working. You should simply replace or restart that node as soon as possible to keep an appropriate level of redundancy.&lt;/p&gt;

&lt;p&gt;Note that every Zeebe cluster with a configured replication factor has basic resilience built in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-zone Zeebe clusters
&lt;/h3&gt;

&lt;p&gt;When running on Kubernetes in a public cloud, you can easily push availability further by distributing the Zeebe nodes across different availability zones. To do this, you can leverage &lt;a href="https://kubernetes.io/docs/setup/best-practices/multiple-zones/"&gt;multi-zone clusters in Kubernetes&lt;/a&gt;. For example, in Google Cloud (GCP) this would mean &lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/types-of-clusters"&gt;regional clusters&lt;/a&gt; (mind the confusing wording: a regional cluster is spread across &lt;em&gt;multiple zones&lt;/em&gt; within &lt;em&gt;one region&lt;/em&gt;). Then, you can set &lt;a href="https://kubernetes.io/docs/setup/best-practices/multiple-zones/#node-behavior"&gt;a constraint that your Zeebe nodes, running as a stateful set, all run in different zones from each other&lt;/a&gt;. Et voilà, you added multi-zone replication:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JcnFUCiJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/286/0%2AcgS1H7N0QUi0I3Cr" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JcnFUCiJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/286/0%2AcgS1H7N0QUi0I3Cr" alt="" width="286" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the Zeebe perspective, the scenario of a zone outage is now really the same as the one of a node outage. You can also run more than three Zeebe nodes, as we will discuss later in this post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-region Zeebe clusters
&lt;/h3&gt;

&lt;p&gt;As multi-zone replication was so easy, let’s also look at something technically more challenging (reminding ourselves that not many use cases actually require it): multi-region clusters.&lt;/p&gt;

&lt;p&gt;You might have guessed it by now — the logic is basically the same. You distribute your three Zeebe nodes to three different regions. But unfortunately, this is nothing Kubernetes does out of the box for you, at least not yet. There is so much going on in this area that I expect new possibilities to emerge soon (just naming &lt;a href="https://linkerd.io/2.11/tasks/multicluster-using-statefulsets/"&gt;Linkerd’s multi-cluster communication with StatefulSets&lt;/a&gt; as an example).&lt;/p&gt;

&lt;p&gt;In our customer project, this was not a showstopper, as we went with the following procedure, which proved to work well:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spin up three Kubernetes clusters in different regions (calling them “west”, “central”, and “east” here for brevity).&lt;/li&gt;
&lt;li&gt;Set up DNS forwarding between those clusters (see solution #3 of &lt;a href="https://www.youtube.com/watch?v=az4BvMfYnLY"&gt;Cockroach running a distributed system across Kubernetes Clusters&lt;/a&gt;) and add the proper firewall rules so that the clusters can talk to each other.&lt;/li&gt;
&lt;li&gt;Create a Zeebe node in every cluster using tweaked &lt;a href="https://docs.camunda.io/docs/self-managed/platform-deployment/kubernetes-helm/"&gt;Helm charts&lt;/a&gt;. Those tweaks made sure to calculate and set the &lt;a href="https://docs.camunda.io/docs/self-managed/zeebe-deployment/operations/setting-up-a-cluster/"&gt;Zeebe broker ids&lt;/a&gt; correctly (which is mathematically easy, but a lot of fun to do in shell scripts;-)). This will lead to “west-zeebe-0” being node 0, “central-zeebe-0” being 1, and “east-zeebe-0” being 2.&lt;/li&gt;
&lt;/ol&gt;
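&lt;p&gt;The exact scripts are not part of this post, but the broker-id arithmetic can be reconstructed hypothetically: interleave pod indices across the regions so that the ids come out as listed above. A sketch (the region names and their ordering are assumptions):&lt;/p&gt;

```python
# Hypothetical reconstruction of the broker-id calculation: the region list
# must be identical, in the same order, in all three Kubernetes clusters.
REGIONS = ["west", "central", "east"]

def broker_id(region: str, pod_index: int) -> int:
    # Interleave pod indices across regions:
    # west-zeebe-0 -> 0, central-zeebe-0 -> 1, east-zeebe-0 -> 2,
    # west-zeebe-1 -> 3, and so on.
    return pod_index * len(REGIONS) + REGIONS.index(region)

for pod in range(2):
    for region in REGIONS:
        print(f"{region}-zeebe-{pod} -> node id {broker_id(region, pod)}")
```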

&lt;p&gt;Honestly, those scripts are not ready to be shared without hand-holding, but if you plan to set up a multi-region cluster, &lt;a href="https://camunda.com/contact/"&gt;please simply reach out&lt;/a&gt; and we can discuss your scenario and assist.&lt;/p&gt;

&lt;p&gt;Note that we set up as many regions as we have replicas. This is by design, as the whole setup becomes rather simple if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The number of nodes is a multiple of your replication factor (in our example 3, 6, 9, …).&lt;/li&gt;
&lt;li&gt;The nodes can be equally distributed among regions (in our example 3 regions for 3, 6, or 9 nodes).&lt;/li&gt;
&lt;/ul&gt;
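&lt;p&gt;Expressed as a small check (the function name is ours), the two constraints look like this:&lt;/p&gt;

```python
def is_simple_setup(nodes: int, replication_factor: int, regions: int) -> bool:
    """Check the two constraints that keep a multi-region setup simple."""
    multiple_of_replication = nodes % replication_factor == 0
    evenly_distributed = nodes % regions == 0
    return multiple_of_replication and evenly_distributed

# With replication factor 3 and 3 regions, 3, 6, or 9 nodes work; 4 does not.
for nodes in (3, 4, 6, 9):
    print(nodes, is_simple_setup(nodes, replication_factor=3, regions=3))
```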

&lt;h3&gt;
  
  
  Running Zeebe in exactly two data centers
&lt;/h3&gt;

&lt;p&gt;Let’s discuss a common objection at this point: we don’t want to run in three data centers; we want to run in exactly two! My hypothesis is that this stems from a time when organizations operated their own data centers, which typically meant there were only two data centers available. However, this changed a lot with the move to public cloud providers.&lt;/p&gt;

&lt;p&gt;Truthfully, it is actually harder to run a replicated Zeebe cluster spanning two data centers than spanning three. This is because of the replication factor and the quorum arithmetic using multiples, as you saw above. So in a world dominated by public cloud providers, where it is not a big deal to utilize another region, we would simply recommend replicating to three data centers.&lt;/p&gt;

&lt;p&gt;Nevertheless, in the customer scenario, there was the requirement to run Zeebe in two regions, so we quickly want to sketch how this can be done. To that end, we run four nodes, two in every region. This allows one node to go down while still guaranteeing a copy of all data in both regions. Three nodes would not be enough to deal with an outage of a whole region.&lt;/p&gt;

&lt;p&gt;The following image illustrates our concrete setup:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_lW65p9z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/380/0%2AYUvNYbjqu_HTDC70" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_lW65p9z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/380/0%2AYUvNYbjqu_HTDC70" alt="" width="380" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is one key difference to the three-region scenario: when you lose one region, an operator needs to jump in and take manual action. When two nodes are missing, the cluster no longer has a quorum (remember: round_down(4/2) + 1 = 3) and cannot process data, as visualized in the following diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xlsqSaXo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2ASkz8ae4yYZjaH1OA" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xlsqSaXo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2ASkz8ae4yYZjaH1OA" alt="" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To get your cluster back to work, you need to add one more (empty) cluster node with &lt;a href="https://docs.camunda.io/docs/self-managed/zeebe-deployment/operations/setting-up-a-cluster/"&gt;the Zeebe node id&lt;/a&gt; of the original node 3 (at the time of writing, the cluster size of Zeebe is fixed and cannot be increased on the fly, which is why you cannot simply add new nodes). The cluster automatically copies the data to this new node and can elect a new leader, so the cluster is back online.&lt;/p&gt;

&lt;p&gt;Adding this node is deliberately a manual step to avoid a so-called &lt;a href="https://en.wikipedia.org/wiki/Split-brain_(computing)"&gt;split-brain situation&lt;/a&gt;. Assume the network link between region one and region two goes down. Each data center is still operating but thinks the other region is down. There is no easy way for an automated algorithm within one of the regions to decide to start new nodes while ensuring that not both regions do so. This is why the decision is pushed to a human operator. As losing a whole region is really rare, this is tolerable. Please note again that this is only necessary in the two-region scenario, not when using three regions (as those still have a quorum when one region is missing).&lt;/p&gt;

&lt;p&gt;When the region comes back, you can start node 4 again, and then replace the new node 3 with the original one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uW01_wl1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/744/0%2AIJvf3zwcPblbKy-C" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uW01_wl1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/744/0%2AIJvf3zwcPblbKy-C" alt="" width="744" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The bottom line is that using two regions is possible, but more complex than simply using three regions. Whenever you are not really constrained by the number of physical data centers available to you (like with public cloud providers), we recommend choosing a thoughtful number of regions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scaling workloads using partitions
&lt;/h3&gt;

&lt;p&gt;So far, we have simplified things a little bit. We were not building true active-active clusters, as followers do not do any work other than replicating. Also, we did not really scale Zeebe. Let’s look at this next.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.camunda.io/docs/components/zeebe/technical-concepts/partitions/"&gt;Zeebe uses so-called partitions for scaling&lt;/a&gt;, as further explained in &lt;a href="https://dev.to/berndruecker/how-we-built-a-highly-scalable-distributed-state-machine-10hd"&gt;how we built a highly scalable distributed state machine&lt;/a&gt;. In the above examples, we looked at exactly one partition. In reality, a Camunda Platform 8 installation runs multiple partitions. The exact number depends on your load requirements, but it should reflect what was described above about multiples.&lt;/p&gt;

&lt;p&gt;So a replication factor of three means we might run 12 partitions on six nodes, or 18 partitions on six nodes, for example. Now, the leaders and followers of the various partitions are distributed across the Zeebe nodes, ensuring each node is not only a follower but also a leader for some of the partitions. This way, every node also does “real work”.&lt;/p&gt;

&lt;p&gt;The following picture illustrates this, where P1–P12 stand for the various partitions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ren5nsRZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1012/0%2A_UGWyHjD-w9eWzRU" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ren5nsRZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1012/0%2A_UGWyHjD-w9eWzRU" alt="" width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, there is a round-robin pattern behind distributing leaders and their followers to the nodes. We can leverage this pattern to guarantee geo-redundancy by also assigning the nodes to the various data centers in a clever round-robin fashion. As you can see above, for partition P1 the leader is in region 2 and the followers are in regions 1 and 3, so every data center has a copy of the data, as described earlier. The same is true for all other partitions. An outage will not harm the Zeebe cluster’s capability to process data. The following illustration shows what happens if region 3 goes down; the partitions only need to elect some new leaders:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z6chCMkT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1011/0%2AgsMWSjZ9vbeCN7H1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z6chCMkT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1011/0%2AgsMWSjZ9vbeCN7H1" alt="" width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  And how does geo-redundancy affect performance?
&lt;/h3&gt;

&lt;p&gt;Finally, let’s also have a quick look at how multi-region setups affect the performance and throughput of Zeebe. The elephant in the room is of course that network latency between geographically separate data centers is unavoidable. Especially if you plan for epic disasters, your locations should not be too close. And if you want to ensure geographic locality, you might even want various data centers close to the majority of your customers, which might simply mean you will work with data centers all over the world. In our recent customer example, we used one GCP region in London and one in the US, Northern Virginia to be precise. The latency between those data centers is estimated at roughly 80ms (according to &lt;a href="https://geekflare.com/google-cloud-latency/"&gt;https://geekflare.com/google-cloud-latency/&lt;/a&gt;), but latencies between other region pairs can go up to a couple of hundred milliseconds.&lt;/p&gt;

&lt;p&gt;Spoiler alert: This is not at all a problem for Zeebe and does not affect throughput.&lt;/p&gt;

&lt;p&gt;To add some spice to this, let’s quickly look at why this is a problem in most architectures. For example, in Camunda Platform 7 (the predecessor of the current Camunda Platform 8), we used a relational database and database transactions to store the workflow engine state. In this architecture, replication needs to happen as part of the transaction (at least if we need certain consistency guarantees, which we do), resulting in transactions that take a long time. This has two consequences: conflicts between transactions become more likely, for example because two requests want to correlate something to the same BPMN process instance; and typical resource pools for transactions or database connections might end up being exhausted in high-load scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QxPy7t9b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/966/0%2AHAJ9Ll_bCoPKSa9o" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QxPy7t9b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/966/0%2AHAJ9Ll_bCoPKSa9o" alt="" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In summary, running Camunda Platform 7 geographically distributed is possible, but especially under high load, it poses challenges.&lt;/p&gt;

&lt;p&gt;With the Camunda Platform 8 architecture, the engine does not leverage any database transaction. Instead, it uses ring buffers to queue pending work, and waiting for I/O, such as replication reporting success, neither blocks any resource nor causes any contention in the engine. This is described in more detail in &lt;a href="https://dev.to/berndruecker/how-we-built-a-highly-scalable-distributed-state-machine-10hd"&gt;how we built a highly scalable distributed state machine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Long story short: Our experiments clearly supported the hypothesis that geo-redundant replication does not affect throughput. Of course, processing every request will have higher latency. Or to put it in other words, your process cycle times will increase, as the network latency is still there. However, it only influences that one number in a very predictable way. In the customer scenario, a process that typically takes around 30 seconds was delayed by a couple of seconds in total, which was not a problem at all. We have not even started to optimize for replication latency, but we have a lot of ideas.&lt;/p&gt;
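&lt;p&gt;As a back-of-the-envelope model: throughput stays constant, while each replicated commit adds roughly one cross-region round trip to the process cycle time. The 30-second cycle time and ~80ms round trip below are the numbers from the customer scenario above; the number of replicated commits per process is purely illustrative.&lt;/p&gt;

```java
// Back-of-the-envelope latency model: geo-replication leaves throughput
// untouched but adds roughly one cross-region round trip per replicated
// commit to the process cycle time. The commit count is illustrative.
class CycleTimeModel {

    static double cycleTimeSeconds(double baseSeconds, int replicatedCommits,
                                   double roundTripMillis) {
        return baseSeconds + replicatedCommits * roundTripMillis / 1000.0;
    }

    public static void main(String[] args) {
        // A ~30s process with 25 replicated commits over a ~80ms
        // London <-> Northern Virginia round trip:
        System.out.println(cycleTimeSeconds(30.0, 25, 80.0) + " s"); // prints "32.0 s"
    }
}
```

&lt;p&gt;A couple of seconds on top of a 30-second process, which matches what we observed.&lt;/p&gt;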

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;In this post, you saw that Zeebe can easily be geo-replicated. The sweet spot is a replication factor of three and replication across three data centers; in public cloud speak, this means three different regions. Geo-replication will of course add latency, but it does not affect throughput. Still, you might not even need such a high degree of availability and may be happy to run in multiple availability zones of your data center or cloud provider. As this is built into Kubernetes, it is very easy to achieve.&lt;/p&gt;

&lt;p&gt;Please reach out to us if you have any questions, specific scenarios, or simply want to share great success stories!&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>georedundancy</category>
      <category>resilience</category>
      <category>multiregion</category>
      <category>highavailability</category>
    </item>
    <item>
      <title>What to do When You Can’t Quickly Migrate to Camunda 8</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Wed, 25 May 2022 19:03:31 +0000</pubDate>
      <link>https://forem.com/camunda/what-to-do-when-you-cant-quickly-migrate-to-camunda-8-j61</link>
      <guid>https://forem.com/camunda/what-to-do-when-you-cant-quickly-migrate-to-camunda-8-j61</guid>
      <description>&lt;h4&gt;
  
  
  Managing a brownfield when you simply don’t have a green one
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://dev.to/mary_grace/camunda-platform-80-released-whats-new-l51-temp-slug-6751642"&gt;With Camunda Platform 8 out of the door&lt;/a&gt; now, I’ve been having frequent discussions around migration. Many of them go along the lines of: “We are invested in Camunda 7, including a lot of best practices, project templates, and even code artifacts. We can’t quickly migrate to Camunda 8, so what should we do now?” I call this a &lt;a href="https://en.wikipedia.org/wiki/Brownfield_(software_development)" rel="noopener noreferrer"&gt;brownfield&lt;/a&gt;. If you are in this situation, this blog post is for you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Greenfield recommendation
&lt;/h3&gt;

&lt;p&gt;But let’s start with the easy things first. Let’s assume you just entered the world of process automation and orchestration with Camunda, and you’re starting from scratch. In this case, we strongly recommend starting with Camunda 8 right away, for example, using &lt;a href="https://docs.camunda.io/docs/components/best-practices/architecture/deciding-about-your-stack/#the-greenfield-stack" rel="noopener noreferrer"&gt;the Java greenfield stack&lt;/a&gt;: Java, Spring Boot, Spring Zeebe, and Camunda Platform 8 — SaaS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can’t use Camunda 8 just yet?
&lt;/h3&gt;

&lt;p&gt;But there are some edge cases where you might not want to use Camunda 8 right away. The typical reasons include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can’t leverage &lt;a href="https://camunda.com/get-started" rel="noopener noreferrer"&gt;Camunda 8 — SaaS&lt;/a&gt;, but also don’t have Kubernetes at your disposal to &lt;a href="https://docs.camunda.io/docs/self-managed/overview/" rel="noopener noreferrer"&gt;install the platform self-managed&lt;/a&gt;. While installing &lt;a href="https://docs.camunda.io/docs/self-managed/platform-deployment/local/" rel="noopener noreferrer"&gt;Camunda 8 on bare-metal or VMs&lt;/a&gt; is possible, it is also not super straightforward and might not be your choice if you have to set up many engines in a big organization that embraces microservices. Of course, you could probably leverage existing Infrastructure as Code (IaC) toolchains to ease this task (like Terraform or Ansible).&lt;/li&gt;
&lt;li&gt;You are missing a concrete feature because Camunda 8 needs to catch up on feature parity. The prime examples are around &lt;a href="https://docs.camunda.io/docs/components/modeler/bpmn/bpmn-coverage/" rel="noopener noreferrer"&gt;BPMN elements like compensation or conditions&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You stick to a principle of not running x.0 software versions in production (while I do see the point here, I want to add that I don’t think this applies to Camunda 8.0: it is technically a Camunda Cloud 1.4 release, with quite a few users already running it in production).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Independent of the exact reason, this means that you should start on a greenfield with Camunda 7. It’s worth repeating that this should be an exception. In this case, the recommendation is to start with the latest &lt;a href="https://docs.camunda.io/docs/components/best-practices/architecture/deciding-about-your-stack-c7/#the-java-greenfield-stack" rel="noopener noreferrer"&gt;Camunda 7 greenfield stack&lt;/a&gt;: Camunda Run as a remote engine via Docker and &lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/external-tasks/" rel="noopener noreferrer"&gt;External Tasks&lt;/a&gt;. If you code in Java, your process solution stack will be Java, Spring Boot, and the &lt;a href="https://github.com/camunda-community-hub/camunda-engine-rest-client-java/" rel="noopener noreferrer"&gt;Camunda REST Client&lt;/a&gt;. If you program in other languages, you should simply leverage the &lt;a href="https://docs.camunda.org/manual/latest/reference/rest/" rel="noopener noreferrer"&gt;REST API&lt;/a&gt;. This is conceptually pretty close to a Camunda 8 architecture. Let’s call it the &lt;strong&gt;external task approach&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There is one downside of this stack, though — the Java developer experience is not as great as it is with Camunda 8. &lt;a href="https://dev.to/mary_grace/moving-from-embedded-to-remote-workflow-engines-1p35-temp-slug-1715226"&gt;Historically, Camunda users preferred embedded engines&lt;/a&gt; using &lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/delegation-code/#java-delegate" rel="noopener noreferrer"&gt;Java Delegates&lt;/a&gt;. This stack offers a great experience for Java developers. Camunda Run does not offer that same level of developer experience, even though it has improved over the years. While this is normally not a real problem, it might decrease developer motivation around Camunda projects. So if this is a real problem in your context, it is worth going with the greenfield stack from some years ago: Java, Spring Boot, &lt;a href="https://github.com/camunda/camunda-bpm-platform/tree/master/spring-boot-starter" rel="noopener noreferrer"&gt;Camunda Spring Boot Starter&lt;/a&gt;, and &lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/delegation-code/#java-delegate" rel="noopener noreferrer"&gt;Java Delegates&lt;/a&gt;. This stack is also &lt;a href="https://docs.camunda.io/docs/guides/migrating-from-camunda-platform-7/#process-solutions-using-spring-boot" rel="noopener noreferrer"&gt;mentioned as the example in our migration guide&lt;/a&gt;, as it is by far the most common Camunda 7 stack you’ll meet in the wild. Let’s call this the &lt;strong&gt;Java Delegate approach&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I see both approaches as valid choices. But, of course, if you start with Camunda 7 now, you need to think ahead and prepare for a future Camunda 8 migration. This is where the approaches differ; with Java Delegates, you have a harder time making sure to stick to what we call &lt;em&gt;Clean Delegates&lt;/em&gt;, as Java Delegates technically allow pretty dirty hacks. But there will be more on this later in this blog post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Greenfield recommendation summary
&lt;/h3&gt;

&lt;p&gt;So let’s quickly recap our recommendations so far:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use Camunda Platform 8 — SaaS.&lt;/li&gt;
&lt;li&gt;If this is not possible, use Camunda Platform 8 — Self-Managed.&lt;/li&gt;
&lt;li&gt;If this is not possible, use Camunda Platform 7 Run and the external task approach.&lt;/li&gt;
&lt;li&gt;If this is not possible, use Camunda Platform 7 Spring Boot Starter, but implement &lt;em&gt;Clean Delegates.&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F371%2F1%2AkrQbPLkFELWeLSu1a0NsBQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F371%2F1%2AkrQbPLkFELWeLSu1a0NsBQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Brownfields
&lt;/h3&gt;

&lt;p&gt;Now let’s turn our attention back to the brownfield companies. In such situations, the company already uses Camunda 7 and will not migrate overnight to Camunda 8 (&lt;a href="https://docs.camunda.io/docs/guides/migrating-from-camunda-platform-7/#when-to-migrate" rel="noopener noreferrer"&gt;which neither makes sense nor is necessary&lt;/a&gt;). In an ideal world, you would simply start new projects with Camunda 8 and migrate your existing projects step by step over time. But often, it is not that easy.&lt;/p&gt;

&lt;p&gt;For example, your company might have invested a lot of effort into integrating Camunda 7 into its ecosystem. This goes far beyond the code of one process solution but includes best practices, examples, code snippets, reusable connectors, and many more. In such cases, you might still want to start new projects with Camunda 7 until you have a clear idea (and budget) of how to migrate all of those things.&lt;/p&gt;

&lt;p&gt;Or your project is already in flight and is better finished with Camunda 7. Or an initiative pops up to extend an existing Camunda 7 process solution, and you cannot make the migration to Camunda 8 part of that endeavor.&lt;/p&gt;

&lt;p&gt;In those cases, the typical question is, “Should we keep doing what we are doing, or should we quickly try to change our architecture to get closer to Camunda 8 already?”&lt;/p&gt;

&lt;p&gt;The short answer is to &lt;strong&gt;keep doing what you are doing&lt;/strong&gt;. This will make migration efforts easier at a later point in time, as you will have one common architecture to migrate. If you adjust your Camunda 7 architecture now, you might end up with two different architecture blueprints you need to migrate. Both external task and Java delegate approaches are OK!&lt;/p&gt;

&lt;p&gt;But you should make sure to establish some practices as quickly as possible that will ease migration projects later on. Those are described in the rest of this post. While external tasks might enforce some practices, &lt;em&gt;Clean Delegates&lt;/em&gt; are equally easy (or sometimes even easier) to migrate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practices to ease migration
&lt;/h3&gt;

&lt;p&gt;In order to implement Camunda 7 process solutions that can be easily migrated, you should stick to the following rules (that are good development practices you should follow anyway), which will be explained in more detail later:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Implement what we call &lt;em&gt;Clean Delegates&lt;/em&gt; — concentrate on reading and writing process variables, plus business logic delegation. Data transformations will mostly be done as part of your delegate (and especially not as listeners, as mentioned below). Separate your actual business logic from the delegates and all Camunda APIs. Avoid accessing the BPMN model and invoking Camunda APIs within your delegates.&lt;/li&gt;
&lt;li&gt;Don’t use listeners or Spring beans in expressions to do data transformations via Java code.&lt;/li&gt;
&lt;li&gt;Don’t rely on an ACID transaction manager spanning multiple steps or resources.&lt;/li&gt;
&lt;li&gt;Don’t expose Camunda API (REST or Java) to other services or front-end applications.&lt;/li&gt;
&lt;li&gt;Use primitive variable types or JSON payloads only (no XML or serialized Java objects).&lt;/li&gt;
&lt;li&gt;Use simple expressions or plug-in FEEL. FEEL is the only supported expression language in Camunda 8. JSONPath is also relatively easy to translate to FEEL. Avoid using special variables in expressions, e.g., &lt;code&gt;execution&lt;/code&gt; or &lt;code&gt;task&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use your own user interface or Camunda Forms; the other form mechanisms are not supported out-of-the-box in Camunda 8.&lt;/li&gt;
&lt;li&gt;Avoid using any implementation classes from Camunda; generally, those with &lt;code&gt;*.impl.*&lt;/code&gt; in their package name.&lt;/li&gt;
&lt;li&gt;Avoid using engine plugins.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the moment, it might also be good to check the &lt;a href="https://docs.camunda.io/docs/components/modeler/bpmn/bpmn-coverage/" rel="noopener noreferrer"&gt;BPMN elements supported in Camunda 8&lt;/a&gt;, but this gap will most likely be closed soon.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/delegation-code/#execution-listener" rel="noopener noreferrer"&gt;Execution Listeners&lt;/a&gt; and &lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/delegation-code/#execution-listener" rel="noopener noreferrer"&gt;Task Listeners&lt;/a&gt; are areas in Camunda 8 that are still under discussion. Currently, those use cases need to be solved slightly differently. Depending on your use case, the following Camunda 8 features can be used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input and output mappings using FEEL&lt;/li&gt;
&lt;li&gt;Tasklist API&lt;/li&gt;
&lt;li&gt;History API&lt;/li&gt;
&lt;li&gt;Exporters&lt;/li&gt;
&lt;li&gt;Client interceptors&lt;/li&gt;
&lt;li&gt;Gateway interceptors&lt;/li&gt;
&lt;li&gt;Job workers on user tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I expect to soon have a solution in Camunda 8 for most of the problems that listeners solve. Still, it might be good practice to use as few listeners as possible, and especially don’t use them for data mapping as described below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clean Delegates
&lt;/h3&gt;

&lt;p&gt;With Java Delegates and the workflow engine being embedded as a library, projects can do dirty hacks in their code. Casting to implementation classes? No problem. Using a ThreadLocal or trusting a specific transaction manager implementation? Yeah, possible. Calling complex Spring beans hidden behind a simple JUEL (Java Unified Expression Language) expression? Well, you guessed it — doable!&lt;/p&gt;

&lt;p&gt;Those hacks are the real showstoppers for migration, as they simply cannot be migrated to Camunda 8. In fact, &lt;a href="https://dev.to/camunda/moving-from-embedded-to-remote-workflow-engines-5f9h"&gt;Camunda 8 increased isolation intentionally&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So you should concentrate on what a Java Delegate is intended to do:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read variables from the process and potentially manipulate or transform that data to be used by your business logic.&lt;/li&gt;
&lt;li&gt;Delegate to business logic — this is where Java Delegates got their name from. In a perfect world, you would simply issue a call to your business code in another Spring bean or remote service.&lt;/li&gt;
&lt;li&gt;Transform the results of that business logic into variables you write into the process.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s an example of an ideal JavaDelegate:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/14969fcfb5a201a3928afe44e1905193/href" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/14969fcfb5a201a3928afe44e1905193/href" rel="noopener noreferrer"&gt;https://medium.com/media/14969fcfb5a201a3928afe44e1905193/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should never cast to Camunda implementation classes, use any ThreadLocal object, or influence the transaction manager in any way. Furthermore, Java Delegates should always be stateless and not store any data in their fields.&lt;/p&gt;

&lt;p&gt;The resulting delegate can be easily migrated to a Camunda 8 API, or simply be reused by the &lt;a href="https://github.com/camunda-community-hub/camunda-7-to-8-migration/" rel="noopener noreferrer"&gt;adapter provided in this migration community extension&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  No transaction managers
&lt;/h3&gt;

&lt;p&gt;You &lt;a href="https://dev.to/camunda/achieving-consistency-without-transaction-managers-55kc"&gt;should not trust ACID transaction managers to glue together the workflow engine with your business code&lt;/a&gt;. Instead, you need to embrace eventual consistency and make every service task its own transactional step. If you are familiar with Camunda 7 lingo, this means that all BPMN elements will be &lt;code&gt;async=true&lt;/code&gt;. A process solution that relies on five service tasks being executed within one ACID transaction, probably rolling back in case of an error, will make migration challenging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don’t expose Camunda API
&lt;/h3&gt;

&lt;p&gt;You should try to apply the &lt;a href="https://en.wikipedia.org/wiki/Information_hiding" rel="noopener noreferrer"&gt;information hiding principle&lt;/a&gt; and not expose too much of the Camunda API to other parts of your application.&lt;/p&gt;

&lt;p&gt;In the above example, you should not hand over an execution context to your CrmFacade, which is hopefully intuitive anyway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_// DO NOT DO THIS!_

crmFacade.createCustomer(execution);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same holds true when a new order is placed and your order fulfillment process should be started. Instead of the front end calling the Camunda API to start a process instance, you are better off providing your own endpoint that translates between the inbound REST call and Camunda, for example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/45e3c7d97407716db6e53cfe7875e412/href" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/45e3c7d97407716db6e53cfe7875e412/href" rel="noopener noreferrer"&gt;https://medium.com/media/45e3c7d97407716db6e53cfe7875e412/href&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Use primitive variable types or JSON
&lt;/h3&gt;

&lt;p&gt;Camunda 7 provides quite flexible ways to add data to your process. For example, you could add Java objects that would be stored in serialized form. Java serialization is brittle and also ties you to the Java runtime environment. Another possibility was magically transforming those objects on the fly to XML using Camunda Spin. It turned out this was black magic and led to regular problems, which is why Camunda 8 does not offer it anymore. Instead, you should do any transformation within your code before talking to Camunda. Camunda 8 only takes JSON as a payload, which naturally includes primitive values.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://gist.github.com/berndruecker/dbc22c3bb92719be40d41bc9cbbb88d6" rel="noopener noreferrer"&gt;above example&lt;/a&gt;, you can see that Jackson was used in the delegate for JSON to Java mapping:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/14969fcfb5a201a3928afe44e1905193/href" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/14969fcfb5a201a3928afe44e1905193/href" rel="noopener noreferrer"&gt;https://medium.com/media/14969fcfb5a201a3928afe44e1905193/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This way, you have full control over what is happening, and such code is also easily migratable. The overall complexity is even lower, as Jackson is well known to Java developers — a de facto standard with a lot of best practices and recipes available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple expressions and FEEL
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.camunda.io/docs/components/modeler/feel/what-is-feel/" rel="noopener noreferrer"&gt;Camunda 8 uses FEEL as its expression language&lt;/a&gt;. There are big advantages to this decision. Not only are the expression languages between BPMN and DMN harmonized, but also the language is really powerful for typical expressions. One of my favorite examples is the following onboarding demo we regularly show. A decision table will hand back a list of possible risks, whereas every risk has a severity indicator (yellow, red) and a description.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A1roHQ2SpVDuhjdnV" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A1roHQ2SpVDuhjdnV"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result of this decision is then used in the process to make a routing decision:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ACHnRfoCIVQDPEm6Y" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ACHnRfoCIVQDPEm6Y"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To unwrap the DMN result in Camunda 7, you could write some Java code and attach that to a listener when leaving the DMN task (this is already an anti-pattern for migration as you will read next). This code is not super readable:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/9a32a3c2263763b436ceee8e71fc237b/href" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/9a32a3c2263763b436ceee8e71fc237b/href" rel="noopener noreferrer"&gt;https://medium.com/media/9a32a3c2263763b436ceee8e71fc237b/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With FEEL, you can evaluate that data structure directly and have an expression on the “red” path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;= some risk in riskLevels satisfies risk = "red"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Isn’t this a great expression? If you think so, and you have such use cases, you can even hook in FEEL as the scripting language in Camunda 7 today (as explained in &lt;a href="https://camunda.com/blog/2018/07/dmn-scripting/" rel="noopener noreferrer"&gt;Scripting with DMN inside BPMN&lt;/a&gt; or &lt;a href="https://camunda.com/blog/2020/05/camunda-bpm-user-task-assignment-based-on-a-dmn-decision-table/" rel="noopener noreferrer"&gt;User Task Assignment based on a DMN Decision Table&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;But the more common situation is that you will keep using JUEL in Camunda 7. If you write simple expressions, they can be easily migrated automatically, as you can see in &lt;a href="https://github.com/camunda-community-hub/camunda-7-to-8-migration/blob/main/modeler-plugin-7-to-8-converter/client/JuelToFeelConverter.test.js" rel="noopener noreferrer"&gt;the test case&lt;/a&gt; of the &lt;a href="https://github.com/camunda-community-hub/camunda-7-to-8-migration" rel="noopener noreferrer"&gt;migration community extension&lt;/a&gt;. You should avoid more complex expressions if possible. Very often, a good workaround to achieve this is to adjust the output mapping of your Java Delegate to prepare data in a form that allows for easy expressions.&lt;/p&gt;

&lt;p&gt;You should definitely avoid hooking in Java code during an expression evaluation. The above listener processing the DMN result was one example of this. But a more diabolical example could be the following expression in Camunda 7:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#{ dmnResultChecker.check( riskDMNresult ) }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;dmnResultChecker&lt;/code&gt; is a Spring bean that can contain arbitrary Java logic, possibly even calling some remote service to check whether we currently accept yellow risks or not (sorry, this is not a good example). Such code cannot be executed within Camunda 8 FEEL expressions, and the logic needs to be moved elsewhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  Camunda Forms
&lt;/h3&gt;

&lt;p&gt;Finally, while Camunda 7 supports &lt;a href="https://docs.camunda.org/manual/latest/user-guide/task-forms/" rel="noopener noreferrer"&gt;different types of task forms&lt;/a&gt;, Camunda 8 only supports &lt;a href="https://docs.camunda.io/docs/guides/utilizing-forms/#configuration" rel="noopener noreferrer"&gt;Camunda Forms&lt;/a&gt; (which will be extended over time). If you rely on other form types, you either need to turn them into Camunda Forms or use a bespoke tasklist that still supports those forms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;In today’s blog post, I wanted to show you which path to take if Camunda 8 is not yet an option for you. In summary, it’s best you keep doing what you’re already doing. This normally means leveraging the external task approach or the Java Delegate approach. Both options are OK.&lt;/p&gt;

&lt;p&gt;With Java Delegates, you have to be very mindful to avoid hacks that will hinder a migration to Camunda 8. This article sketched the practices you should stick to in order to make migration easier whenever you want to do it, which is mostly about writing clean delegates, sticking to common architecture best practices, using primitive values or JSON, and writing simple expressions.&lt;/p&gt;

&lt;p&gt;As always, I am happy to hear your feedback or &lt;a href="https://forum.camunda.io/" rel="noopener noreferrer"&gt;discuss any questions you might have&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/" rel="noopener noreferrer"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/" rel="noopener noreferrer"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/" rel="noopener noreferrer"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com" rel="noopener noreferrer"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>camunda8</category>
      <category>migration</category>
      <category>camunda</category>
    </item>
    <item>
      <title>How Open is Camunda Platform 8?</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Wed, 25 May 2022 14:59:19 +0000</pubDate>
      <link>https://forem.com/camunda/how-open-is-camunda-platform-8-196e</link>
      <guid>https://forem.com/camunda/how-open-is-camunda-platform-8-196e</guid>
      <description>&lt;p&gt;With &lt;a href="https://dev.to/mary_grace/camunda-platform-80-released-whats-new-l51-temp-slug-6751642"&gt;Camunda Platform 8 being available to the public&lt;/a&gt;, we regularly answer questions about our open source strategy and the licenses for its various components. Let’s sort this out in today’s blog post by looking at the specifics of the components, sketching a path to put Camunda 8 into production without the need to pay us any money, and the difference between open source and source-available licenses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Component overview
&lt;/h3&gt;

&lt;p&gt;Let’s look at the various &lt;a href="https://docs.camunda.io/docs/components/"&gt;components that make up Camunda Platform 8&lt;/a&gt;. The following illustration colors the components according to their license:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Green&lt;/strong&gt; : Open source license.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Green stripes&lt;/strong&gt; : Source-available license (for the curious, the difference between open source and source-available is explained below; for most people, there is no real difference).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blue&lt;/strong&gt; : This software is available but only free for non-production use. If you want to put these components into production, you will need to buy a license (via enterprise subscription) from Camunda.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red&lt;/strong&gt; : This software is only available within Camunda Platform 8 — SaaS and can’t be run self-managed. &lt;strong&gt;Note:&lt;/strong&gt; This is subject to change, and some of the red components should turn blue over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VIowm1Dx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/831/0%2AOU2RxW1hNr3mReuF.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VIowm1Dx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/831/0%2AOU2RxW1hNr3mReuF.png" alt="" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The short summary is that you can run everything &lt;strong&gt;green&lt;/strong&gt; (including green stripes) as &lt;a href="https://docs.camunda.io/docs/self-managed/overview/"&gt;self-managed&lt;/a&gt; in production without needing a license. The green components are open source, as coined by the &lt;a href="https://opensource.org/licenses"&gt;Open Source Initiative&lt;/a&gt;. The striped components use a source-available license. Regarding Zeebe, this is the &lt;a href="https://camunda.com/blog/2019/07/introducing-zeebe-community-license-1-0/"&gt;Zeebe Community License v1.0&lt;/a&gt;. It is based on the very liberal open source &lt;a href="https://opensource.org/licenses/MIT"&gt;MIT license&lt;/a&gt; but with one restriction — users are &lt;strong&gt;not&lt;/strong&gt; allowed to use the components for providing a commercial workflow service in the cloud. This is typically not a limitation for any of our existing customers, users, or prospects. If you want to know more about open source licensing, visit &lt;a href="https://camunda.com/blog/2019/07/zeebe-community-license/"&gt;Why We Created The Zeebe Community License&lt;/a&gt; and &lt;a href="https://camunda.com/legal/terms/cloud-terms-and-conditions/zeebe-license-overview-and-faq/"&gt;Zeebe License Overview and FAQ&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Furthermore, you can run all the &lt;strong&gt;blue&lt;/strong&gt; components during development and testing. This not only allows you to try them out but will help you with your development efforts. If you want to keep using them while going into production, you will need to buy a license from Camunda. Later in this blog post, I will explain how you can go live without those components, as there is a possible path.&lt;/p&gt;

&lt;p&gt;Now, let’s quickly look at a typical question in this context: why are the blue boxes not available for production, even in a limited version?&lt;/p&gt;

&lt;h3&gt;
  
  
  Why free for non-production and not open core?
&lt;/h3&gt;

&lt;p&gt;With Camunda Platform 7, we have an open core model where parts of the components are available open source, and the full feature set is only available to you if you buy an enterprise subscription. So for example, the basic tier of Camunda Cockpit allows you to see running instances in open source, but only the Enterprise Edition of Camunda Cockpit shows the historical data and provides the full feature set.&lt;/p&gt;

&lt;p&gt;While this looks good at first glance, it actually adds a lot of friction and confusion for our users. First, they have to understand the feature differences in detail. Second, most people even miss that there is a more powerful version of Cockpit available, leading them to redevelop features that are already there. And finally, even if the customer’s team requires the power of the Enterprise Edition of Cockpit, selling the license is hard in situations where decision-makers might not care enough about the daily friction of operations to spend the money. In other words, our power users often want an Enterprise license and have a good business case for it but are still let down by their decision-makers.&lt;/p&gt;

&lt;p&gt;This is why we made the whole model radically simpler. You can have all the tools with all the features during development without any fluff. Everything is easily accessible (&lt;a href="https://hub.docker.com/u/camunda"&gt;available on DockerHub&lt;/a&gt;, for example), can help you learn Camunda, and speed up development. For example, Camunda Operate (the Cockpit equivalent in Camunda 8) helps you to understand what’s going on in your workflow engine, especially when you are new and start developing.&lt;/p&gt;

&lt;p&gt;You will only need to buy a license when you put it into production. But the argument for the Enterprise Edition is now very simple to understand — without it, you can’t use those productivity tools. So far, our users are actually pretty happy about that change, as it makes it easier for them to ask for the necessary budget.&lt;/p&gt;

&lt;p&gt;If for whatever reason your company doesn’t want to pay for the Enterprise Edition, there is still a way to production, as described below. However, it is less convenient and involves more work for you. Whether this is worth saving the subscription money is your company’s decision.&lt;/p&gt;

&lt;p&gt;We believe this model has a very good balance of interests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, you can easily start developing process solutions with Camunda Platform 8, but also run serious workloads in production with a completely source-available stack.&lt;/li&gt;
&lt;li&gt;Second, there is sufficient motivation to pay for the additional software, which guarantees that Camunda will stick around.&lt;/li&gt;
&lt;li&gt;Third, this allows Camunda to stay focused and continue to invest in great software and the community.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How SaaS changes the game
&lt;/h3&gt;

&lt;p&gt;So far, we’ve talked about self-managed installations. For many people, this still seems to be the default assumption. They want to download and run the software, but this is changing. When you really think about it, you don’t want software — you want the service or feature the software delivers. This is what cloud and SaaS (software as a service) provide. With Camunda 8, we introduced our &lt;a href="https://camunda.com/pricing/"&gt;own SaaS offering&lt;/a&gt;, where you can consume it completely in the cloud.&lt;/p&gt;

&lt;p&gt;Now, this changes one important aspect — you have to be clear whether you’re searching for open source or something that is &lt;strong&gt;free to use&lt;/strong&gt;. And most people actually search for the latter, which can also be delivered without open source.&lt;/p&gt;

&lt;p&gt;So with Camunda Platform 8 — SaaS, the equivalent of a Community Edition is a free tier, where users can use the service (within certain boundaries) without generating any bills. As I’m writing this blog post, we are working to extend our free tier with Camunda 8. The current situation is that you can already have a &lt;a href="https://camunda.com/pricing/"&gt;&lt;strong&gt;free plan for modeling&lt;/strong&gt;&lt;/a&gt; use cases. And we are &lt;strong&gt;working on a free tier to support execution&lt;/strong&gt; use cases, but still have to work out some details. In contrast to providing a Community Edition for download, every running cluster in the cloud adds up on our own GCP bill, so we have to be diligent about it.&lt;/p&gt;

&lt;p&gt;In general, I expect a big mindset shift over the next few years in this regard. Users will mostly consume SaaS services, and having a free tier will be more important to them than software being open source.&lt;/p&gt;

&lt;p&gt;At this point, I want to add one important side note — our SaaS focus does not mean that our open source commitment will weaken; quite the contrary. We have a big group of passionate people in our community who work miracles for us, and we continuously increase our investment in the community.&lt;/p&gt;

&lt;p&gt;Camunda 8 has all the key ingredients to make a vital &lt;a href="https://camunda.com/developers/community/"&gt;open source community&lt;/a&gt; work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The source code for core components is available.&lt;/li&gt;
&lt;li&gt;Code, issues, and discussions live in the open on GitHub. The frequent pull requests to our documentation are great examples of this.&lt;/li&gt;
&lt;li&gt;Extension points allow community contributions.&lt;/li&gt;
&lt;li&gt;Frequent meetups, talks, and blog posts.&lt;/li&gt;
&lt;li&gt;A great developer relations team that deeply cares about the community.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A path to production with source-available software
&lt;/h3&gt;

&lt;p&gt;Let’s come back to self-managed software and sketch a path to production that neither requires a commercial license nor breaks any license agreements. For production, this basically comes down to using only the source-available parts of Camunda 8:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--43Kc_BDo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/797/0%2AbtxqJ5cln3p0KHbC.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--43Kc_BDo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/797/0%2AbtxqJ5cln3p0KHbC.png" alt="" width="797" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, you will need to find solutions to replace the tools you cannot use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tasklist&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You will need to implement your own task management solution based on using workers subscribing to Zeebe &lt;a href="https://docs.camunda.io/docs/components/modeler/bpmn/user-tasks/"&gt;as described in the docs&lt;/a&gt;. That also means you have to build your own persistence to allow task queries, as the &lt;a href="https://docs.camunda.io/docs/apis-clients/tasklist-api/overview/"&gt;Tasklist API&lt;/a&gt; is part of the Tasklist component and is not free for production use.&lt;/p&gt;
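&lt;p&gt;To give you an idea of the scope, here is a minimal sketch of the kind of task store you would have to build yourself: a job worker adds every activated user task, and your own front-end queries it. All names are illustrative and not a Camunda API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public class SimpleTaskStore {

  public static class Task {
    final long key;
    final String assignee;
    final String name;

    public Task(long key, String assignee, String name) {
      this.key = key;
      this.assignee = assignee;
      this.name = name;
    }
  }

  private final Map&amp;lt;Long, Task&amp;gt; tasks = new ConcurrentHashMap&amp;lt;&amp;gt;();

  // called from the job worker when a user task is activated
  public void add(Task task) {
    tasks.put(task.key, task);
  }

  // called when the task is completed, so it disappears from queries
  public void complete(long key) {
    tasks.remove(key);
  }

  // the kind of query the Tasklist API would otherwise answer for you
  public List&amp;lt;Task&amp;gt; byAssignee(String assignee) {
    return tasks.values().stream()
        .filter(t -&amp;gt; t.assignee.equals(assignee))
        .collect(Collectors.toList());
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a real system you would back this with a database instead of an in-memory map, but the shape of the work is the same.&lt;/p&gt;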

&lt;p&gt;&lt;strong&gt;Operate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Operate is the component you will miss most, as you typically want to gain a clear understanding of what is going on in your workflow engine and take corrective actions.&lt;/p&gt;

&lt;p&gt;For looking at data, you can access it in Elastic (check the &lt;a href="https://github.com/camunda/zeebe/tree/main/exporters/elasticsearch-exporter"&gt;Elastic Exporter&lt;/a&gt; for details), leverage the &lt;a href="https://docs.camunda.io/docs/self-managed/zeebe-deployment/operations/metrics/"&gt;metrics&lt;/a&gt;, or build your &lt;a href="https://docs.camunda.io/docs/components/zeebe/technical-concepts/architecture/#exporters"&gt;own exporters&lt;/a&gt; to push it to some data storage component that is convenient for you. Exporters can also filter or pre-process data on the fly. It is worth noting that the &lt;a href="https://docs.camunda.io/docs/apis-clients/operate-api/"&gt;Operate data pre-processing logic backing the History API&lt;/a&gt; is part of Operate and not free for production use.&lt;/p&gt;

&lt;p&gt;For influencing process instances (like canceling them), you can use the existing &lt;a href="https://docs.camunda.io/docs/apis-clients/grpc/"&gt;Zeebe API&lt;/a&gt;, which is also exposed as the &lt;a href="https://docs.camunda.io/docs/apis-clients/cli-client/"&gt;command-line tool zbctl&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This flexibility allows you to hook functionality into your own front-ends. Of course, this takes effort, but it is definitely possible, and we know of users who have done it. As already mentioned, you should contrast that effort with the cost of the license.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimize&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Optimize is hard to replace because it goes quite deep into process-based analytics, which is hard to build on your own. If you can’t use Optimize, the closest you might get to it is by adding your &lt;a href="https://docs.camunda.io/docs/components/zeebe/technical-concepts/architecture/#exporters"&gt;own exporters&lt;/a&gt; to push the data to an existing general-purpose BI (Business Intelligence), DWH (Data Warehouse), or data lake solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In this blog post, I wanted to make it very clear which components of the Camunda 8 stack are open source (or source-available) and which are not free for production use. I gave some pointers for going into production with a pure source-available stack, but also tried to explain the effort that might require; that effort is, of course, the upselling potential the company needs. I hope this was understandable, and I’m happy to discuss it in &lt;a href="https://forum.camunda.io/"&gt;the Camunda forum&lt;/a&gt; in case there are open questions.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>opensource</category>
      <category>camunda</category>
    </item>
    <item>
      <title>Implementing My Fire Service Notification System with Camunda Platform 8</title>
      <dc:creator>Thomas Heinrichs</dc:creator>
      <pubDate>Thu, 21 Apr 2022 23:00:00 +0000</pubDate>
      <link>https://forem.com/camunda/implementing-my-fire-service-notification-system-with-camunda-platform-8-3op2</link>
      <guid>https://forem.com/camunda/implementing-my-fire-service-notification-system-with-camunda-platform-8-3op2</guid>
      <description>&lt;p&gt;When working as a volunteer in the fire brigade, you can be called for service at any given moment — no matter the time or day. In my village, I’m alerted about 80 to 90 times a year. If an emergency happens, it’s important to be fast. You need to leave the house and get into your car right away to show up at the fire station in time. This is even more important if the alert gives you an indication that lives are in danger. &lt;/p&gt;

&lt;p&gt;Since the pandemic started, the work habits of many people have changed. One significant change is that working from home has become the new normal. That’s a great change for the fire brigade since more people are now accessible in case of an emergency. But what does this mean for the actual firefighters working from home?  &lt;/p&gt;

&lt;p&gt;Usually, they don’t have the opportunity to properly sign off from work when called for duty. That’s why I came up with the idea to build a fire service notification system using BPMN and Camunda Platform 8 that automatically informs all relevant stakeholders as soon as an emergency happens. &lt;/p&gt;

&lt;h2&gt;
  
  
  Starting with a Process
&lt;/h2&gt;

&lt;p&gt;Before starting any implementation, I always visualize an ideal process for what I want to build. A benefit of this approach is that I can reuse the model later as the basis for direct execution by a process engine. &lt;/p&gt;

&lt;p&gt;To visualize the process for the fire service notification system, I am going to use the Business Process Model and Notation (BPMN) 2.0 standard. For those who don’t know it yet, BPMN provides the capability to model processes in a graphical notation and execute the modeled processes. With &lt;a href="https://modeler.cloud.camunda.io/"&gt;Camunda Platform 8 Modeler&lt;/a&gt;, I can now collaboratively design this process by following the standard. Check out &lt;a href="https://modeler.cloud.camunda.io/share/360212b5-ab81-439b-9c22-48cf1238985d"&gt;my model&lt;/a&gt; in this tool and see the diagram I created below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jyOWLayy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fire-Service-Notifications-in-Camunda-Platform-8-image-1-1-1024x251.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jyOWLayy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fire-Service-Notifications-in-Camunda-Platform-8-image-1-1-1024x251.png" alt="" width="800" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let’s quickly go over the fire service notification process: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First of all, a message event is sent after I press a physical buzzer. &lt;/li&gt;
&lt;li&gt;This will trigger a business rule task that decides which stakeholders to notify — my family or co-workers. That should prevent my co-workers from being alerted that I am on fire service in the middle of the night. &lt;/li&gt;
&lt;li&gt;Afterward, I’ll note the starting time and date of the emergency. &lt;/li&gt;
&lt;li&gt;Then, all relevant stakeholders will be notified, depending on my decision in step 2. The cool thing is that I can parallelize sending out my messages via Slack, SMS, and Mail. &lt;/li&gt;
&lt;li&gt;I’m now halfway through the process! It will wait and only continue when I’m back from duty and trigger the buzzer again.
&lt;/li&gt;
&lt;li&gt;The time spent on service needs to be calculated before alerting all relevant stakeholders that I am back to work again. (In Germany it’s quite important for the employer to have this piece of information in order to get compensated by the government.)&lt;/li&gt;
&lt;li&gt;Before the process ends, all relevant parties will be notified that I am back again. Of course, this will be parallelized, and who gets notified will depend on the time and date.&lt;/li&gt;
&lt;li&gt;Lastly, the end event signifies that the notifications have been successfully sent.&lt;/li&gt;
&lt;/ol&gt;
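&lt;p&gt;Step 6 boils down to simple date arithmetic. Here is a small sketch of how the time spent on service can be computed with &lt;em&gt;java.time&lt;/em&gt; (the class and variable names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import java.time.Duration;
import java.time.LocalDateTime;

public class ServiceTime {

  // difference between the captured start and end of the fire service,
  // which is what the employer needs for compensation
  public static long minutesOnService(LocalDateTime start, LocalDateTime end) {
    return Duration.between(start, end).toMinutes();
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;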

&lt;h2&gt;
  
  
  Let’s Talk About Decisions
&lt;/h2&gt;

&lt;p&gt;As mentioned in the previous section, I need to use a business rule task to decide whether I’m going to notify work-related stakeholders or not — depending on the time and day. Using the DMN standard makes it possible to easily create this decision without adding too much complexity to the overall process. This allows it to be easily understood and modified by non-coders, which is beneficial for me when I need to explain this to my family. &lt;/p&gt;

&lt;p&gt;In my case, I went with the decision in the model below. I need two input parameters for the time and weekday of the emergency. This determines whether to notify all stakeholders or just my family. The &lt;em&gt;Notification Scope&lt;/em&gt; maps to a process variable and is used in the exclusive gateways to make a decision.&lt;/p&gt;

&lt;p&gt;For example, if an emergency happens on Wednesday at 2 p.m., I’m going to notify “all” stakeholders.&lt;/p&gt;
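&lt;p&gt;To illustrate the logic behind the table, here is the same decision expressed as plain Java. The 9:00–17:00 weekday window is only an assumed example; the real cut-off times live in the DMN table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import java.time.DayOfWeek;
import java.time.LocalTime;

public class NotificationScope {

  static final LocalTime WORK_START = LocalTime.of(9, 0);
  static final LocalTime WORK_END = LocalTime.of(17, 0);

  public static String decide(DayOfWeek day, LocalTime time) {
    // weekends are always family-only
    if (day == DayOfWeek.SATURDAY || day == DayOfWeek.SUNDAY) {
      return "family";
    }
    // outside working hours, co-workers are left alone as well
    if (time.isBefore(WORK_START) || !time.isBefore(WORK_END)) {
      return "family";
    }
    return "all";
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;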

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DVodslyR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fireservice-Notfiication-System-for-Camunda-Platform-8-image-2-1024x218.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DVodslyR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fireservice-Notfiication-System-for-Camunda-Platform-8-image-2-1024x218.png" alt="" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out this &lt;a href="https://camunda.com/dmn/"&gt;DMN tutorial&lt;/a&gt; to learn more about the benefits of the DMN standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  It’s Coding Time!
&lt;/h2&gt;

&lt;p&gt;The process and decision are set — now it’s time to code the solution. I will use a workflow engine because it can directly execute the models from above. Now, you may ask yourself, “why use a process engine at all?” The easy answer is: because it gives you more flexibility! Adding steps to the process doesn’t affect your already existing code. It also helps you gain transparency into what your software is doing at a certain point in time. &lt;/p&gt;

&lt;p&gt;I’m going to use &lt;a href="https://camunda.com/products/cloud/"&gt;Camunda Platform 8 SaaS&lt;/a&gt; as an orchestrator. By using this SaaS solution, I don’t need to take care of hosting a workflow engine on my own hardware. It also provides me with all the tools I need to operate and analyze my process. With my process and decision models deployed, I can now focus on writing a Spring Boot application that contains code I need on top of the process model — basically, some glue code to integrate with an SMTP server, Slack, and Twilio.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I’ll begin with creating a new Spring Boot project and adding the &lt;a href="https://github.com/camunda-community-hub/spring-zeebe"&gt;Spring-Zeebe dependency&lt;/a&gt;, which encapsulates the logic to connect to the engine. It also makes sure that I’m properly authenticated while establishing a connection to the remote workflow engine. To do so, I’ll add this Maven dependency to my ‘&lt;em&gt;pom.xml&lt;/em&gt;’:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;dependency&amp;gt;
  &amp;lt;groupId&amp;gt;io.camunda&amp;lt;/groupId&amp;gt;
  &amp;lt;artifactId&amp;gt;spring-zeebe-starter&amp;lt;/artifactId&amp;gt;
  &amp;lt;version&amp;gt;1.3.4&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Then, I’ll implement ‘&lt;em&gt;ZeebeWorker&lt;/em&gt;’ inside my main class. Besides using the ‘&lt;em&gt;@SpringBootApplication&lt;/em&gt;’ annotation, I also need ‘&lt;em&gt;@EnableZeebeClient&lt;/em&gt;’. I can write a worker, as shown below. I’ll add the ‘&lt;em&gt;@ZeebeWorker&lt;/em&gt;’ annotation and specify the connection to the service task in the BPMN model by the task type:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@ZeebeWorker(type = "capture_time_worker")
public void handleJob_capture_time(final JobClient client, final ActivatedJob job) {
  String time = java.time.LocalTime.now().toString(); // call business logic to get the current time
  client.newCompleteCommand(job.getKey())
         .variables("{\"startingTime\":"+ "\""+ time +"\"}")
         .send()
   .exceptionally( throwable -&amp;gt; { throw new RuntimeException("Could not complete job " + job, throwable); });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;&lt;p&gt;This code snippet was used in my first service task that sets the starting time of the fire service. I can call whichever business logic I’d  like and set variables within the ‘&lt;em&gt;newCompleteCommand&lt;/em&gt;’. The variables can be received by using ‘job.getVariablesAsMap().get(“”)’. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I need some more workers for sending an email, posting a Slack update, sending an SMS, and calculating the time difference between the beginning and end of the fire service. These look very similar to what we have seen above, and just differ in terms of the business logic/variables retrieved and passed to the process instance. For example, the code for sending an email could look like &lt;a href="https://www.baeldung.com/java-email#sending-a-plain-text-and-an-html-email"&gt;this&lt;/a&gt; and will be called from the worker.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, I’m good to test since all these things have been implemented.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the Process
&lt;/h2&gt;

&lt;p&gt;For the sake of simplicity, I’m not going to discuss how to build an IoT buzzer. For my purposes, I’ve chosen a pre-built WiFi button from mystorm. It’s battery-powered, magnetic, and fits perfectly into my apartment. Since the button is programmable, it can easily call and start my process instance by making an HTTP call. Below you can see a picture of it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gqt9eE5c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fire-Service-Notifications-for-Camunda-Platform-8-image-3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gqt9eE5c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fire-Service-Notifications-for-Camunda-Platform-8-image-3.jpg" alt="" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To test this process, I’m going to start an instance and hand over some variables (e.g., email, SMS, and Slack recipients, as well as the name of the person who is leaving for fire service). This can be easily achieved by using this &lt;a href="https://github.com/NPDeehan/camunda8-visual-helper"&gt;visual helper tool&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Having started a process instance, I have the option to check on the instance’s lifecycle by using &lt;a href="https://camunda.com/products/cloud/operate/"&gt;Operate&lt;/a&gt;. This tool provides real-time visibility to monitor, analyze, and resolve problems. This is also great if something abnormal occurs. For example, imagine Twilio gives me an exception; Operate will show me this problem as visualized in the picture below. The tool gives me the ability to check the stack trace and do some lightweight troubleshooting right away. If I needed to fix something in my code base, I could retrigger the process from this monitoring tool.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ofv9dw9K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fire-Service-Notification-System-in-Camunda-Paltform-8-image-4-942x1024.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ofv9dw9K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fire-Service-Notification-System-in-Camunda-Paltform-8-image-4-942x1024.png" alt="" width="800" height="870"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image below demonstrates how the instance would look if everything has been executed properly. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jQlriPNs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fire-Service-Notification-System-for-Camunda-Platform-8-image-5-924x1024.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jQlriPNs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fire-Service-Notification-System-for-Camunda-Platform-8-image-5-924x1024.png" alt="" width="800" height="887"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another great feature in Operate includes checking on all the variables of your process instance. That gives you powerful insight and is a nice way to change their values if they’re causing havoc. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n4DVMc47--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fire-Service-Notification-Service-for-Camunda-Platform-8-image-6-1024x321.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n4DVMc47--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fire-Service-Notification-Service-for-Camunda-Platform-8-image-6-1024x321.png" alt="" width="800" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;Since the process is working as expected, the first milestone has been achieved. Designing the process and developing the integrations was rather straightforward using Camunda Platform 8 SaaS. Here are some of the notifications sent to various channels below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tKgDXxN2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fire-Service-Notification-System-for-Camunda-Platform-8-image-8-1024x145.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tKgDXxN2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fire-Service-Notification-System-for-Camunda-Platform-8-image-8-1024x145.png" alt="" width="800" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iIKXLBfG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fire-Service-Notification-for-Camunda-Platform-8-image-7-1024x751.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iIKXLBfG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://camunda.com/wp-content/uploads/2022/04/Fire-Service-Notification-for-Camunda-Platform-8-image-7-1024x751.png" alt="" width="800" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even though this is not a typical use case for Camunda Platform 8, I’ll be running a few process instances this year. Nevertheless, it’s interesting to play around with this technology and demonstrate its potential in such a way. And who knows, maybe I’ll onboard some fellow firefighters to this tool as well. In such a case, I’m confident that Camunda Platform 8 can handle the load. &lt;/p&gt;

&lt;p&gt;In addition, the workflow engine provided me with a lot of flexibility during development. During operation, I made use of automatic retry cycles that made sure my employer got the message. This automation will prove its benefits once an actual emergency happens. Feel free to check out the source code on &lt;a href="https://github.com/Hafflgav/firebrigade-notification-system"&gt;GitHub&lt;/a&gt; to create a similar automation for your own needs. &lt;/p&gt;

&lt;p&gt;An interesting follow-up to this blog post would be analyzing this process in &lt;a href="https://camunda.com/products/camunda-platform/optimize/"&gt;Camunda Optimize&lt;/a&gt;, a tool for creating reports and analyzing processes. Maybe I can find some interesting correlations between the type of emergency and its duration. Stay tuned! &lt;/p&gt;

&lt;p&gt;To see the talk I gave about this topic during the Camunda Community Summit ’22, check out the on-demand content. Many other interesting talks are also available for free, so it is definitely worth checking out. Follow me on Twitter if you want to stay updated on upcoming events and workshops. &lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://camunda.com/blog/2022/04/implementing-my-fire-service-notification-system-with-camunda-platform-8/"&gt;Implementing My Fire Service Notification System with Camunda Platform 8&lt;/a&gt; appeared first on &lt;a href="https://camunda.com"&gt;Camunda&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>processautomation</category>
      <category>processorchestration</category>
      <category>bpmn</category>
    </item>
  </channel>
</rss>
