<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Bernd Ruecker</title>
    <description>The latest articles on Forem by Bernd Ruecker (@berndruecker).</description>
    <link>https://forem.com/berndruecker</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F200744%2Fb935d053-ca34-4875-a022-66b5025b1512.jpeg</url>
      <title>Forem: Bernd Ruecker</title>
      <link>https://forem.com/berndruecker</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/berndruecker"/>
    <language>en</language>
    <item>
      <title>Pro-code, Low-code, and the Role of Camunda</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Mon, 11 Dec 2023 14:15:33 +0000</pubDate>
      <link>https://forem.com/camunda/pro-code-low-code-and-the-role-of-camunda-2k2j</link>
      <guid>https://forem.com/camunda/pro-code-low-code-and-the-role-of-camunda-2k2j</guid>
      <description>&lt;h4&gt;
  
  
  Pro-code is our heart and soul, but people and processes are diverse. Our optional low-code features support more use cases without getting in the way of pro-code developers.
&lt;/h4&gt;

&lt;p&gt;Developers regularly ask me about Camunda’s product strategy. Especially around the Camunda 8 launch, they raised concerns that we “forgot our roots” or “abandoned our developer-friendliness” — the exact attributes that developers love us for. They presume that we “jumped on the low-code train” instead, because we now have funding and need to “chase the big dollars.” As a developer at heart myself, I can tell you that nothing could be further from the truth, so let me explain our strategy in this post.&lt;/p&gt;

&lt;p&gt;Here is the &lt;strong&gt;TL;DR&lt;/strong&gt;: We will stay 100% developer-friendly, and pro-code is our heart and soul (or bread and butter, if you prefer). But the people who create process solutions are diverse, as are the processes that need to be automated. So for some use cases low-code does make sense, and it is great to be able to support those cases. But low-code features in Camunda are optional and do not get in the way of pro-code developers.&lt;/p&gt;

&lt;p&gt;For example, your worker code can become a reusable &lt;a href="https://docs.camunda.io/docs/components/connectors/introduction-to-connectors/"&gt;&lt;strong&gt;Connector&lt;/strong&gt;&lt;/a&gt; (or be replaced by an out-of-the-box one) that is configured in the BPMN model using element templates. But you don’t have to use that and can just stay in your development environment to code your way forward. This flexibility allows you to use Camunda for a wide variety of use cases, which prevents business departments from being forced into shaky low-code solutions just because IT lacks resources.&lt;/p&gt;

&lt;p&gt;But step by step…&lt;/p&gt;

&lt;h3&gt;
  
  
  Camunda 8 loves developers
&lt;/h3&gt;

&lt;p&gt;First of all, Camunda 8 focuses on the developer experience in the same way as former Camunda versions — or even more strongly. The whole point of providing Camunda as a product was to break out of unwieldy, huge BPM or low-code suites that are simply impossible to use in professional software engineering projects (see &lt;a href="https://blog.bernd-ruecker.com/camunda-closes-100-million-series-b-funding-round-to-automate-any-process-anywhere-c82013bdaacf#8635"&gt;&lt;strong&gt;the Camunda story here&lt;/strong&gt;&lt;/a&gt; for example). This hasn’t changed. The heart of Camunda is about bringing process orchestration into the professional software developer’s toolbelt.&lt;/p&gt;

&lt;p&gt;Especially with Camunda 8, we put a lot of focus on providing an excellent developer experience and a great programming model. And we now also extend that beyond the Java ecosystem. We might still have to do some homework here and there (for example, making the Spring integration a supported product component in 2024) — but it is very close to what we always had. Let me give you some short examples (you can find &lt;a href="https://github.com/berndruecker/customer-onboarding-camunda-8-springboot"&gt;&lt;strong&gt;working code on GitHub&lt;/strong&gt;&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Writing worker code (aka Java Delegates):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/60c54cb9c45dec8d12b937ffa07a7b65/href"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/60c54cb9c45dec8d12b937ffa07a7b65/href"&gt;https://medium.com/media/60c54cb9c45dec8d12b937ffa07a7b65/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the Spring Boot Starter as Maven dependency:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/8b6f8d2cae1868bd4ae7778f3603127e/href"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/8b6f8d2cae1868bd4ae7778f3603127e/href"&gt;https://medium.com/media/8b6f8d2cae1868bd4ae7778f3603127e/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Writing a JUnit test case (with an in-memory engine):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/c1e5a8e01021edfd2b0e9c9585576971/href"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/c1e5a8e01021edfd2b0e9c9585576971/href"&gt;https://medium.com/media/c1e5a8e01021edfd2b0e9c9585576971/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The only real change from Camunda version 7 to 8 is that the orchestration engine (or workflow engine if you prefer that term) runs as a separate Java process. So the above Spring Boot Starter actually starts a client that connects to the engine, not the whole engine itself. I wrote about why this is a huge advantage in &lt;a href="https://dev.to/camunda/moving-from-embedded-to-remote-workflow-engines-5f9h"&gt;&lt;strong&gt;moving from embedded to remote workflow engines&lt;/strong&gt;&lt;/a&gt;. Summarized, it is about isolating your code from the engine’s code and simplifying your overall solution project (think about optimizing the engine configuration or resolving third-party dependency version incompatibilities).&lt;/p&gt;

&lt;p&gt;The adjusted architecture without a relational database allows us to continuously look at scalability and performance and make big leaps with Camunda 8, enabling use cases we could not tackle with Camunda 7 (e.g. many thousands of process instances per second, geo-redundant active/active data centers, etc.).&lt;/p&gt;

&lt;p&gt;A common misconception is that you have to use our cloud/SaaS offering, but this is not true. You can &lt;a href="https://docs.camunda.io/docs/next/self-managed/about-self-managed/"&gt;&lt;strong&gt;run the engine self-managed as well&lt;/strong&gt;&lt;/a&gt; and there are different options to do that. The SaaS offering is an additional possibility you can leverage, freeing you from thinking about how to run and operate Camunda, but it is up to you if you want to make use of it.&lt;/p&gt;

&lt;p&gt;This is a general recurring theme in Camunda 8: We added more possibilities you can leverage to make your own life easier — but we do not force anyone to use them.&lt;/p&gt;

&lt;p&gt;The prime example of new possibilities are our low-code accelerators (e.g. &lt;a href="https://camunda.com/platform/modeler/connectors/"&gt;&lt;strong&gt;Connectors&lt;/strong&gt;&lt;/a&gt;). Let’s quickly dive into why we do low-code next before touching on how Connectors can help more concretely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Existing customers adopt Camunda for many use cases
&lt;/h3&gt;

&lt;p&gt;We learned from our customers that they want to use Camunda for a wide variety of use cases. Many of the use cases are core end-to-end business processes, like customer onboarding, order fulfillment, claim settlement, payment processing, trading, or the like.&lt;/p&gt;

&lt;p&gt;But customers also need to automate simpler processes. Those processes are less complex, less critical, and typically less valuable, but still those processes are there, and automating them has a return on investment or is simply necessary to fulfill customer expectations. Good examples are around master data changes (e.g. address or bank account data), bank transfer limits, annual mileage reports for insurers, delay compensation, and so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ud1dBwf8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/922/0%2AXymXq_jGVngKBkI0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ud1dBwf8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/922/0%2AXymXq_jGVngKBkI0.png" alt="" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the past, organizations often did not consider using Camunda for those processes, as they could not set up and staff software development projects for simpler, less critical processes.&lt;/p&gt;

&lt;p&gt;And the non-functional requirements for those simpler process automation solutions differ. The super-critical, highly complex use cases are always implemented with the help of the IT team, to make sure the quality meets the expectations for this kind of solution and everything runs smoothly. The use cases on the lower end of that spectrum, however, don’t have to comply with the same requirements. If they are down, it might not be the end of the world. If they get hacked, it might not be headline news. If there are weird bugs, it might just be annoying. So it is probably OK to apply a different approach to create solutions for these less critical processes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Categorizing use cases
&lt;/h3&gt;

&lt;p&gt;The important thing is to make a conscious choice and not apply the wrong approach for the process at hand. What we have seen working successfully is to categorize use cases and place them into three buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Red&lt;/strong&gt; : Processes are mission critical for the organization. They are also complex to automate and probably need to operate at scale. Performance and information security can be very relevant, and regulatory requirements might need to be fulfilled. Often we talk about core end-to-end business processes here, but sometimes also other processes might be that critical. For these use cases you need to do professional software engineering using industry best practices like version control, automated testing, continuous integration and continuous delivery. The organization wants to apply some governance, for example around which tools can be used and what best practices need to be applied.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Yellow&lt;/strong&gt; : Processes are less critical, but the organization’s operations would still be seriously affected if there are problems. So you need to apply a healthy level of governance, but you also need to accept that solutions will not be created with the same quality as red use cases, mostly because you simply have a shortage of software developers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Green&lt;/strong&gt; : Simple automations, often very local to one business unit or even an individual. These are often quick fixes stitched together to make one’s life a bit easier, and the overall organization might not even notice if they break apart. For those uncritical use cases, the organization can afford to leave people a lot of freedom, so typically there is no governance or quality assurance applied.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While the red use cases are traditionally done with Camunda, and the green use cases are traditionally done with Office-like tooling or low-code solutions (like Airtable or Zapier), the yellow bucket gets interesting. And this is a long tail of processes that all need to be automated with a fair level of governance, quality assurance, and information security.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QYEOD3qL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Abo2COBvqC4x5fz9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QYEOD3qL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Abo2COBvqC4x5fz9h.png" alt="" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We already know organizations using Camunda for those yellow use cases. In order to do this and to ease solution creation, they developed low-code tooling on top of Camunda. A prime example is &lt;a href="https://camunda.com/blog/2018/07/camunda-days-nyc-goldman-sachs-workflow-platform/"&gt;&lt;strong&gt;Goldman Sachs, who built a quite extensive platform based on Camunda 7&lt;/strong&gt;&lt;/a&gt; (side note: they also &lt;a href="https://camunda.com/blog/2022/03/why-goldman-sachs-built-a-brand-new-platform/"&gt;&lt;strong&gt;talk about a differentiation between core banking use cases and the long tail of simpler processes across the firm in later presentations&lt;/strong&gt;&lt;/a&gt;). Speaking to those customers, we found a recurring theme and used this feedback to design product extensions that those organizations could have used off-the-shelf (had they been available when they started). And we designed this solution not to get in the way of professional software developers when implementing red use cases around critical core processes.&lt;/p&gt;

&lt;p&gt;I am not going into too much detail around all of those low-code accelerators in this post, but it is mostly around &lt;a href="https://camunda.com/platform/modeler/connectors/"&gt;&lt;strong&gt;Connectors&lt;/strong&gt;&lt;/a&gt;, rich forms, data handling, the out-of-the-box experience of tools like Tasklist, and browser-based tooling.&lt;/p&gt;

&lt;p&gt;For me it is important to re-emphasize the pattern mentioned earlier: Those accelerators are an offer — you don’t have to use them. And if you look deeper, those accelerators are not mystic black boxes. A Connector, for example, is “just” a reusable &lt;a href="https://docs.camunda.io/docs/components/concepts/job-workers/"&gt;&lt;strong&gt;job worker&lt;/strong&gt;&lt;/a&gt; with a focused properties panel (if you are interested in code, check out any of our &lt;a href="https://github.com/camunda/connectors/tree/main/connectors"&gt;&lt;strong&gt;existing out-of-the-box Connectors&lt;/strong&gt;&lt;/a&gt;), whereas the property panel can even be &lt;a href="https://github.com/camunda/connectors/blob/main/connectors/http/rest/pom.xml#L86"&gt;&lt;strong&gt;generated from Java code&lt;/strong&gt;&lt;/a&gt;. &lt;a href="https://marketplace.camunda.com/"&gt;&lt;strong&gt;Camunda Marketplace&lt;/strong&gt;&lt;/a&gt; helps you to make this reusable piece of functionality discoverable. Existing Connectors are available in their source and can be extended if needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--atMI6_-O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AwegtI1zLFS3BjXBK.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--atMI6_-O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AwegtI1zLFS3BjXBK.png" alt="" width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Democratization and acceleration by Connectors
&lt;/h3&gt;

&lt;p&gt;There are two main motivations to use Connectors.&lt;/p&gt;

&lt;p&gt;Software developers might simply become more productive by using them; this is what we call &lt;strong&gt;acceleration&lt;/strong&gt;. For example, it might simply be quicker to use a Twilio Connector instead of figuring out the REST API for sending an SMS and how it is best called from Java. If this is not true for you (for example, because you already have an internal library that hides the complexity of using Twilio), that is great: just keep using it. Also, when you want to write more JUnit tests, it might be simpler to write the integration code in Java yourself. This is fine! You are not forced to use Connectors; they are an offer, and if they make your life easier, use them.&lt;/p&gt;

&lt;p&gt;The other more important advantage is that it allows a more diverse set of people to take part in solution creation, which is referred to as &lt;strong&gt;democratization&lt;/strong&gt;. So for example, a tech-savvy business person could probably stitch together a simpler process using Connectors, even if they cannot write any programming code. Remember, we are talking about the long tail of simpler processes (yellow) here.&lt;/p&gt;

&lt;p&gt;A powerful pattern then is that software developers &lt;strong&gt;enable&lt;/strong&gt; other roles within the organization. One way of doing this can be to have a &lt;a href="https://dev.to/mary_grace/how-do-you-create-and-grow-a-center-of-excellence-4719-temp-slug-4306740"&gt;&lt;strong&gt;Center of Excellence&lt;/strong&gt;&lt;/a&gt; where custom Connectors are built that are specifically shaped around the needs of the organization. Those Connectors are then used by other roles to stitch together the processes. One big advantage is that your IT team has control over how Connectors are built and used, allowing them to enforce important governance rules, e.g. around information security or secret handling (something which is a huge problem with typical low-code solutions).&lt;/p&gt;

&lt;p&gt;You could also mix different roles in one team creating a solution: developers focus on the technical problems and set up Connectors properly, while more business-minded people concentrate on the process model. And of course there are many nuances in between.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--djVADiWA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AR97S7d25pSpbGKd_.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--djVADiWA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AR97S7d25pSpbGKd_.png" alt="" width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is comparable to a situation we know from software vendors embedding Camunda into their software for customization. Their software product then typically comes with a default process model, and consultants can customize the processes to end-customer needs within certain limits the software team built in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Avoiding the danger zone when doing vendor rationalization and tool harmonization
&lt;/h3&gt;

&lt;p&gt;Many organizations currently try to reduce the number of vendors and tools they are using. This is understandable on many levels, but it is very risky if the different non-functional requirements of green, yellow, and red processes are ignored.&lt;/p&gt;

&lt;p&gt;For example, procurement departments might not want to have multiple process automation tools. But for them, the difference between Camunda and a low-code vendor is not very tangible as they both automate processes.&lt;/p&gt;

&lt;p&gt;For red use cases, customers can still easily argue why they cannot use a low-code tool, because those tools simply don’t fit into professional software development approaches. But for yellow use cases, this gets much more complicated to argue. This can lead to a situation where low-code tools, made for green use cases, are applied to yellow ones. This might work for very simple yellow processes, but can easily become risky if processes get too complex, or simply if requirements around stability, resilience, maintainability, scalability, or information security rise over time. This is why I consider this a big danger zone for companies to be in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--siEBx48h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A5rwLp4uJ6SWGvHXe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--siEBx48h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A5rwLp4uJ6SWGvHXe.png" alt="" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Camunda’s low-code acceleration features allow you to use Camunda in more yellow use cases, as you don’t have to involve software developers for everything. But if non-functional requirements rise, you can always fulfill those with Camunda, as it is built for red use cases as well. Just as an example, you could start adding automated tests whenever the solution starts to be too shaky. Or you could scale operations if you face unexpectedly high demand (think of flight cancellations around the Covid pandemic — this was a yellow use case for airlines, but it became highly important to be able to process them efficiently basically overnight).&lt;/p&gt;

&lt;p&gt;To summarize: It’s better to target yellow use cases with a pro-code solution like Camunda with added low-code acceleration layers that you can use, but don’t have to. This prevents risky situations with low-code solutions that cannot cope with rising non-functional requirements.&lt;/p&gt;

&lt;p&gt;And to link back to our product strategy: With Camunda 8 we worked hard to allow even “redder” use cases (because of improved performance, scalability, and resilience), as well as more yellow use cases at the same time. So you can go further left (red) and right (yellow) at the same time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;In today’s post I re-emphasized that Camunda is and will be developer-friendly. Pro-code (red) use cases are our bread-and-butter business, and honestly those are super exciting use cases where we can play to our strengths. This is strategically highly relevant, even if you might see a lot of marketing messaging around low-code accelerators at the moment.&lt;/p&gt;

&lt;p&gt;Those low-code accelerators allow building less complex solutions (yellow) too, where typically other roles take part in solution creation (democratization, acceleration, and enablement). This helps you reduce the risk that using the wrong tool for yellow use cases ends up making headline news.&lt;/p&gt;

&lt;p&gt;You can read more about our &lt;a href="https://dev.to/mary_grace/camundas-vision-for-low-code-59c1-temp-slug-270223"&gt;&lt;strong&gt;vision for low-code here&lt;/strong&gt;&lt;/a&gt;, or if you’re curious about how our Connectors work, feel free to check out &lt;a href="https://docs.camunda.io/docs/components/connectors/introduction-to-connectors/"&gt;&lt;strong&gt;our docs to learn more&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>processautomation</category>
      <category>camunda</category>
      <category>workflow</category>
      <category>lowcode</category>
    </item>
    <item>
      <title>A Technical Sneak Peek into Camunda’s Connector Architecture</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Wed, 03 Aug 2022 13:19:56 +0000</pubDate>
      <link>https://forem.com/camunda/a-technical-sneak-peek-into-camundas-connector-architecture-5dib</link>
      <guid>https://forem.com/camunda/a-technical-sneak-peek-into-camundas-connector-architecture-5dib</guid>
      <description>&lt;h4&gt;
  
  
  What is a connector? What does the code for a connector look like? And how can connectors be operated in various scenarios?
&lt;/h4&gt;

&lt;p&gt;When Camunda Platform 8 launched earlier this year, we announced connectors and provided some preview connectors available in our SaaS offering, such as &lt;a href="https://docs.camunda.io/docs/components/modeler/web-modeler/connectors/available-connectors/sendgrid/"&gt;sending an email using SendGrid&lt;/a&gt;, &lt;a href="https://docs.camunda.io/docs/components/modeler/web-modeler/connectors/available-connectors/rest/"&gt;invoking a REST API&lt;/a&gt;, or &lt;a href="https://docs.camunda.io/docs/components/modeler/web-modeler/connectors/available-connectors/slack/"&gt;sending a message to Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since then, many people have asked us what a connector is, how such a connector is developed, and how it can be used in Self-Managed. We haven’t yet published much information on the technical architecture of connectors, as it is still under development. At the same time, I totally understand that you may want to know more, to feel as excited about connectors as I am.&lt;/p&gt;

&lt;p&gt;In this blog post, I’ll briefly share what a connector is made of, how the code for a connector roughly looks, and how connectors can be operated in various scenarios. Note that the information is a preview, and details are subject to change.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a connector?
&lt;/h3&gt;

&lt;p&gt;A connector is a component that talks to a third-party system via an API and thus allows orchestrating that system via Camunda (or lets that system influence Camunda’s orchestration).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--43J5ZFOx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AGd46_wJBm4m3oRnz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--43J5ZFOx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AGd46_wJBm4m3oRnz.png" alt="" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The connector consists of a bit of programming code needed to talk to the third-party system and some UI parts hooked into Camunda Modeler.&lt;/p&gt;

&lt;p&gt;This is pretty generic, I know. Let’s get a bit more concrete and differentiate types of connectors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Outbound connectors&lt;/strong&gt; : Something needs to happen in the third-party system if a process reaches a service task. For example, calling a REST endpoint or publishing some message to Slack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inbound connectors&lt;/strong&gt; : Something needs to happen within the workflow engine because of an external event in the third-party system. For example, because a Slack message was published or a REST endpoint is called. Inbound connectors can be of three different kinds:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Webhook&lt;/strong&gt; : An HTTP endpoint is made available to the outside, which when called, can start a process instance, for example.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subscription&lt;/strong&gt; : A subscription is opened on the third-party system, like messaging or Apache Kafka, and new entries are then received and correlated to a waiting process instance in Camunda, for example.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polling&lt;/strong&gt; : Some external API needs to be regularly queried for new entries, such as a drop folder on Google Drive or FTP.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Outbound example
&lt;/h3&gt;

&lt;p&gt;Let’s briefly look at one outbound connector: &lt;a href="https://docs.camunda.io/docs/components/modeler/web-modeler/connectors/available-connectors/rest/"&gt;the REST connector&lt;/a&gt;. You can define a couple of properties, like which URL to invoke using which HTTP method. This is configured via Web Modeler, which basically means those properties end up in the XML of the BPMN process model. The translation of the UI to the XML is done by the &lt;a href="https://docs.camunda.io/docs/components/modeler/desktop-modeler/element-templates/about-templates/"&gt;element template mechanism&lt;/a&gt;. This makes connectors convenient to use.&lt;/p&gt;
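&lt;p&gt;Under the hood, such an element template is a JSON file that maps fields in the properties panel to task configuration in the BPMN XML. A heavily trimmed sketch (the id, name, and connector type shown are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "name": "REST Task",
  "id": "my.rest.task",
  "appliesTo": ["bpmn:ServiceTask"],
  "properties": [
    {
      "type": "Hidden",
      "value": "io.example:my-rest-connector:1",
      "binding": { "type": "zeebe:taskDefinition:type" }
    },
    {
      "label": "URL",
      "type": "String",
      "binding": { "type": "zeebe:input", "name": "url" }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;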

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RrMGKhUw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1003/0%2AD_QzJibIqs0LWe9m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RrMGKhUw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1003/0%2AD_QzJibIqs0LWe9m.png" alt="" width="800" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now there is also code required to really do the outbound call. The overall Camunda Platform 8 integration framework provides a software development kit (SDK) to program such a connector against. Simplified, an outbound REST connector provides an execute method that is called whenever a process instance needs to invoke the connector, and a context is provided with all input data, configuration, and an abstraction for the secret store.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/e960514728b5646935597f26a1d141c2/href"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/e960514728b5646935597f26a1d141c2/href"&gt;https://medium.com/media/e960514728b5646935597f26a1d141c2/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now there needs to be some glue code calling this function whenever a process instance reaches the respective service task. This is the job of the connector runtime. This runtime registers &lt;a href="https://docs.camunda.io/docs/components/concepts/job-workers/"&gt;job workers&lt;/a&gt; with Zeebe and calls the outbound connector function whenever there are new jobs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kXUMdQG0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AZVVn4M0SDa2Cq6pL.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kXUMdQG0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AZVVn4M0SDa2Cq6pL.png" alt="" width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This connector runtime is independent of the concrete connector code executed. In fact, a connector runtime can handle multiple connectors at the same time. Therefore, a connector brings its own metadata:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/9e273b9ad78b7042779ee2d46ff74f84/href"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/9e273b9ad78b7042779ee2d46ff74f84/href"&gt;https://medium.com/media/9e273b9ad78b7042779ee2d46ff74f84/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this, we’ve built a Spring Boot-based runtime that can discover all outbound connectors on the classpath and register the required job workers. This makes it super easy to test a single connector, as you can run it locally, but you can also stitch together a Spring Boot application with all the connectors you want to run in your Camunda Platform 8 Self-Managed installation.&lt;/p&gt;

&lt;p&gt;At the same time, we have also built a connector runtime for our own SaaS offering, running in Google Cloud. While we also run a generic, Java-based connector runtime, all outbound connectors themselves are deployed as Google Functions. Secrets are handled by the Google Cloud Security Manager in this case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YdvWRqIh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AzPpJuqZo_SF9ULOS.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YdvWRqIh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AzPpJuqZo_SF9ULOS.png" alt="" width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The great thing here is that the connector code itself does not know anything about the environment it runs in, making connectors available in the whole Camunda Platform 8 ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inbound example
&lt;/h3&gt;

&lt;p&gt;Having talked about outbound, let’s turn to inbound, which is a very different beast. An inbound connector needs to open up an HTTP endpoint, create a subscription, or start polling. It might even require some kind of state, for example, to remember what was already polled. Exceptions in a connector should be visible to an operator, even if there is no process instance to pin them to.&lt;/p&gt;

&lt;p&gt;We are currently designing and validating the architecture on this end, so consider it in flux. Still, some of the primitives from outbound connectors will also hold true for inbound:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parameters can be configured via the Modeler UI and stored in the BPMN process.&lt;/li&gt;
&lt;li&gt;The core connector code will be runnable in different environments.&lt;/li&gt;
&lt;li&gt;Metadata will be provided so that the connector runtime can easily pick up new connectors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A prototypical connector receiving AMQP messages (e.g., from RabbitMQ) looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/677d6ea9714af94cd4ca42873c0e9841/href"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/677d6ea9714af94cd4ca42873c0e9841/href"&gt;https://medium.com/media/677d6ea9714af94cd4ca42873c0e9841/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here is the related visualization:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WydsXs1m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AbD7c8nTSoZveRt9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WydsXs1m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AbD7c8nTSoZveRt9w.png" alt="" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Status and next steps
&lt;/h3&gt;

&lt;p&gt;Currently, only a fraction of what we work on is publicly visible. Therefore, there are some limitations on connectors in Camunda Platform version 8.0, mainly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The SDK for connectors is not open to the public simply because we need to finalize some things first, as we want to avoid people building connectors that need to be changed later on.&lt;/li&gt;
&lt;li&gt;The code of &lt;a href="https://docs.camunda.io/docs/components/modeler/web-modeler/connectors/available-connectors/"&gt;existing connectors (REST, SendGrid, and Slack)&lt;/a&gt; is not available and cannot be run on Self-Managed environments yet.&lt;/li&gt;
&lt;li&gt;The UI support is only available within Web Modeler, not yet within Desktop Modeler.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are working on all of these areas and plan to release the connector SDK later this year. We can then provide sources and binaries for existing connectors so that you can run them in Self-Managed environments or understand their inner workings. Along with the SDK, we plan to release connector templates that allow you to easily design the UI attributes and parameters required for your connector and share the connector template with your project team.&lt;/p&gt;

&lt;p&gt;At the same time, we are also working on providing more out-of-the-box connectors (like the &lt;a href="https://docs.camunda.io/docs/components/modeler/web-modeler/connectors/available-connectors/slack/"&gt;Slack connector&lt;/a&gt; that was just released last week) and making them available as open source. We are also in touch with partners who are eager to provide connectors to the Camunda ecosystem. As a result, we plan to offer some kind of exchange where you can easily see which connectors are available, along with their guarantees and limitations.&lt;/p&gt;

&lt;p&gt;Still, the whole connector architecture is built to allow everybody to build their own connectors. In particular, this enables you to build private connectors for your own legacy systems that can be reused across your organization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;The main building block for implementing connectors is our SDK for inbound and outbound connectors, whereas inbound connectors can be based on webhooks, subscriptions, or polling. This allows writing connector code that is independent of the connector runtime, so that you can leverage connectors in the Camunda SaaS offering as well as in your own Self-Managed environment.&lt;/p&gt;

&lt;p&gt;At the same time, connector templates will allow a great modeling experience when using connectors within your own models. We are making great progress, and you can expect to see more later this year. Exciting times ahead!&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>connector</category>
      <category>camunda</category>
      <category>bpmn</category>
      <category>connectivity</category>
    </item>
    <item>
      <title>Why Process Orchestration Needs Advanced Workflow Patterns</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Mon, 01 Aug 2022 19:23:19 +0000</pubDate>
      <link>https://forem.com/camunda/why-process-orchestration-needs-advanced-workflow-patterns-4e9f</link>
      <guid>https://forem.com/camunda/why-process-orchestration-needs-advanced-workflow-patterns-4e9f</guid>
      <description>&lt;p&gt;Life is seldom a straight line, and the same is true for processes. Therefore, you must be able to accurately express all the things happening in your business processes for proper end-to-end process orchestration. This requires &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/"&gt;workflow patterns&lt;/a&gt; that go beyond basic control flow patterns (like sequence or condition). If your orchestration tool does not provide those advanced workflow patterns, you will experience confusion amongst developers, you will need to implement time-consuming workarounds, and you will end up with confusing models. Let’s explore this by examining an example of why these advanced workflow patterns matter in today’s blog post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initial process example
&lt;/h3&gt;

&lt;p&gt;Let’s assume you’re processing incoming orders of hand-crafted goods to be shipped individually. Each order consists of many different order positions, which you want to work on in parallel with your team to save time and deliver quicker. However, while your team is working on the order, the customer is still able to cancel, and in that case, you need to be able to revoke any deliveries that have been scheduled already. A quick drawing on the whiteboard yields the following sketch of this example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nGBaTT9f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Anv9u3COivmUgd6-y" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nGBaTT9f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Anv9u3COivmUgd6-y" alt="" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s create an executable process model for this use case. I will first show you a possible process using &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/concepts-amazon-states-language.html"&gt;ASL (Amazon States Language)&lt;/a&gt; and AWS Step Functions, and then one with Camunda Platform and &lt;a href="https://camunda.com/bpmn/"&gt;BPMN (Business Process Model and Notation)&lt;/a&gt; to illustrate the differences between these underlying workflow languages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modeling using AWS Step Functions
&lt;/h3&gt;

&lt;p&gt;The following model is created using ASL, which is part of AWS Step Functions and, as such, a bespoke language. Let’s look at the resulting diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_wK29BZJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/488/0%2AfMwSEFYnpCBB57JR" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_wK29BZJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/488/0%2AfMwSEFYnpCBB57JR" alt="" width="488" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To discuss it, I will use &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/"&gt;workflow patterns&lt;/a&gt;, a proven catalog of the patterns needed to express any workflow.&lt;/p&gt;

&lt;p&gt;The good news is that ASL supports a workflow pattern called “&lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#dynamic-parallel-branches"&gt;dynamic parallel branches&lt;/a&gt;,” which allows parallelizing the execution of the order positions. This is good; otherwise, we would need to start multiple workflow instances for the order positions and do all the synchronization by hand.&lt;/p&gt;

&lt;p&gt;But this is where things get complicated. ASL does not offer &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#external-messagesevents"&gt;reactions to external messages&lt;/a&gt;; thus, you cannot interrupt your running workflow instance when an external event happens, such as the customer canceling their order. Therefore, you need a workaround. One possibility is to add a parallel branch that waits for the cancellation event while the multiple instance tasks execute, marked with (1) in the illustration above.&lt;/p&gt;

&lt;p&gt;When implementing that wait state around cancellation, you will undoubtedly miss a proper &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#correlation-mechanisms"&gt;correlation mechanism&lt;/a&gt;, as you cannot easily correlate events from the outside to the running workflow instance. Instead, you could leverage the &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/connect-to-resource.html#connect-wait-example"&gt;task token&lt;/a&gt; generated by AWS and keep it in an external data store so that you can locate the correct task token for a given order id. This means you have to implement a bespoke message correlation mechanism yourself, including persistence, as &lt;a href="https://aws.amazon.com/blogs/compute/integrating-aws-step-functions-callbacks-and-external-systems/"&gt;described in Integrating AWS Step Functions callbacks and external systems&lt;/a&gt;.&lt;/p&gt;
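&lt;p&gt;A minimal Python sketch of such a bespoke correlation store follows. The stub stands in for the boto3 Step Functions client, whose send_task_success call takes the task token and a JSON output; the store itself and all names are assumptions of this sketch, and a real implementation would need durable persistence:&lt;/p&gt;

```python
import json

# Sketch of the bespoke correlation mechanism described above: persist the
# Step Functions task token under a business key (the order id) so a later
# cancellation event can be routed to the right waiting workflow instance.

class TaskTokenStore:
    def __init__(self):
        self._tokens = {}  # order id to task token; durable in real life

    def register(self, order_id, task_token):
        self._tokens[order_id] = task_token

    def correlate_cancellation(self, order_id, sfn_client):
        # look up the waiting workflow instance for this order and resume it
        token = self._tokens.pop(order_id)
        sfn_client.send_task_success(
            taskToken=token,
            output=json.dumps({"canceled": True}),
        )

class StubSfnClient:
    # stands in for boto3.client("stepfunctions") in this local sketch
    def __init__(self):
        self.calls = []

    def send_task_success(self, taskToken, output):
        self.calls.append((taskToken, output))

store = TaskTokenStore()
store.register("order-4711", "token-abc")
client = StubSfnClient()
store.correlate_cancellation("order-4711", client)
```

&lt;p&gt;All of this plumbing is exactly what a correlation mechanism built into the workflow language would make unnecessary.&lt;/p&gt;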

&lt;p&gt;When the cancellation message comes in, the workflow advances along that workaround path and needs to raise an error so that all order delivery tasks are canceled and the process can move on directly to cancellation, marked with (2) in the above illustration.&lt;/p&gt;

&lt;p&gt;But even in the desired case that the order does not get canceled, you need to raise an error. This is marked with (3) in the illustration above and is necessary to interrupt the task waiting for the cancellation message.&lt;/p&gt;

&lt;p&gt;You need to use a similar workaround again when you want to wait for payment but stop waiting after a specified timeout. Therefore, you will start a timer in parallel, marked with (4), and use an error to stop it later, marked with (5).&lt;/p&gt;

&lt;p&gt;Note that when you configure the wait state, it might feel like you are misusing Step Functions, as you configure the time in plain seconds, meaning you have to enter a big number (864,000 seconds) to wait ten days.&lt;/p&gt;
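&lt;p&gt;The wait-time arithmetic is trivial but easy to get wrong when it has to be spelled out in raw seconds:&lt;/p&gt;

```python
# ASL wait states are configured in plain seconds, so a ten-day timeout
# has to be spelled out as a large number:
ten_days_in_seconds = 10 * 24 * 60 * 60
print(ten_days_in_seconds)  # 864000
```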

&lt;p&gt;Of course, you could also implement your requirements differently. For example, you might implement all order cancelation logic entirely outside of the process model and just terminate the running order fulfillment instance via API. But note that by doing so, you will lose a lot of visibility around what happens in your process, not only during design time but also during operations or improvement endeavors.&lt;/p&gt;

&lt;p&gt;Additionally, you distribute logic that belongs together all over the place (step function, code, etc.). For example, a change in order fulfillment might mean you have to rethink your cancellation procedure, which is obvious if cancellation is part of the model.&lt;/p&gt;

&lt;p&gt;To summarize, the lack of advanced workflow patterns requires workarounds, which are not only hard to implement but also make the model hard to understand, and thus weaken the value proposition of an orchestration engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modeling with BPMN
&lt;/h3&gt;

&lt;p&gt;Now let’s contrast this with modeling using the ISO standard &lt;a href="https://camunda.com/bpmn/"&gt;BPMN&lt;/a&gt; within Camunda:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3G17nq-v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/960/0%2ATOBpVRbQLddwstAT" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3G17nq-v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/960/0%2ATOBpVRbQLddwstAT" alt="" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This model is directly executable on engines that support BPMN, like Camunda. As you can see, BPMN supports all the required advanced workflow patterns, which not only makes this process easy to model but also yields a very understandable model.&lt;/p&gt;

&lt;p&gt;Let’s briefly call out the workflow patterns (besides the basics like &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#sequence"&gt;sequence&lt;/a&gt;, &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#conditions-ifthen"&gt;condition&lt;/a&gt;, and &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#wait"&gt;wait&lt;/a&gt;) that helped to make this process so easy to implement:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#dynamic-parallel-branches"&gt;Dynamic parallel branches&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#external-messagesevents"&gt;Reacting to external message events&lt;/a&gt; with &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#correlation-mechanisms"&gt;correlation mechanisms&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#time-based"&gt;Reacting to time-based events&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This model is well suited to discussing the process with various stakeholders, and can further be shown in &lt;a href="https://camunda.com/platform/operate/"&gt;technical operations&lt;/a&gt; (e.g., if some process instance gets stuck) or &lt;a href="https://camunda.com/platform/optimize/"&gt;business analysis&lt;/a&gt; (e.g., to understand which orders are canceled most and in which state of the process execution). Below is a sample screenshot of the operations tooling showing a process instance with six order items, where one raised an incident. You can see how easy it gets to dive into potential operational problems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dKeiRNu3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AiNhmDAAP3Zevndet" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dKeiRNu3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AiNhmDAAP3Zevndet" alt="" width="800" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Let’s not let history repeat itself!
&lt;/h3&gt;

&lt;p&gt;I remember one of my projects using the workflow engine JBoss jBPM 3.x back in 2009. I was in Switzerland for a couple of weeks, sorting out exception scenarios and describing patterns on how to deal with those. Looking back, this was hard because jBPM 3 lacked a lot of essential workflow patterns, especially around the &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/"&gt;reaction to events&lt;/a&gt; or &lt;a href="https://docs.camunda.io/docs/components/concepts/workflow-patterns/#error-scopes"&gt;error scopes&lt;/a&gt;, which I did not know back then. In case you enjoy nostalgic pictures as much as I do, this is a model from back then:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Guvf098N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/492/0%2AMCZWGd96sxir5YM2" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Guvf098N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/492/0%2AMCZWGd96sxir5YM2" alt="" width="492" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m happy to see that BPMN removed the need for all of those workarounds, which created a lot of frustration among developers. Additionally, the improved visualization really allowed me to discuss process models with a larger group of people with various experience levels and backgrounds in process orchestration.&lt;/p&gt;

&lt;p&gt;Interestingly enough, many modern workflow or orchestration engines lack the advanced workflow patterns described above. Often, this comes with the promise of being simpler than BPMN. But in reality, claims of simplicity mean they lack essential patterns. Hence, if you follow the development of these modeling languages over time, you will see that they add patterns once in a while, and whenever such a tool is successful, it almost inevitably ends up with a language complexity comparable to BPMN but in a proprietary way. As a result, process models in those languages are typically harder to understand.&lt;/p&gt;

&lt;p&gt;At the same time, developing a workflow language is very hard, so chances are high that vendors will take a long time to develop proper pattern support. I personally don’t understand the motivation for going proprietary, as the knowledge about workflow patterns is readily available, and BPMN implements it in an industry-proven way, even as an ISO standard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;The reality of a business process requires advanced workflow patterns. If a product does not natively support them, its users will need to create technical workarounds, as you could see in the example earlier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ASL lacked patterns and required complex workarounds.&lt;/li&gt;
&lt;li&gt;BPMN supports all required patterns and produces a very comprehensible model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Emulating advanced patterns with basic constructs and/or programming code, as necessary for ASL, means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your development takes longer.&lt;/li&gt;
&lt;li&gt;Your solution might come with technical weaknesses, like limited scalability or observability.&lt;/li&gt;
&lt;li&gt;You cannot use the executable process model as a communication vehicle for business and IT.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To summarize, ensure you use an orchestration product that supports all important workflow patterns, such as Camunda, which uses BPMN as its workflow language.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>workflowautomation</category>
      <category>workflowengine</category>
      <category>processorchestration</category>
      <category>orchestration</category>
    </item>
    <item>
      <title>How to Achieve Geo-redundancy with Zeebe</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Tue, 28 Jun 2022 14:14:35 +0000</pubDate>
      <link>https://forem.com/camunda/how-to-achieve-geo-redundancy-with-zeebe-4hma</link>
      <guid>https://forem.com/camunda/how-to-achieve-geo-redundancy-with-zeebe-4hma</guid>
      <description>&lt;p&gt;Camunda Platform 8 reinvented the way an orchestration and workflow engine works. We applied modern distributed system concepts and can now even allow geo-redundant workloads, often referred to as multi-region active-active clusters. Using this technology, organizations can build resilient systems that can withstand disasters in the form of a complete data center outage.&lt;/p&gt;

&lt;p&gt;For example, a recent customer project at a big financial institution connected a data center in Europe with one in the United States, and this did not affect their throughput, meaning they can still run the same number of process instances per second. But before talking about multi-region setups and performance, let’s disassemble this fascinating topic step-by-step in today’s blog post.&lt;/p&gt;

&lt;p&gt;Many thanks to our &lt;a href="https://github.com/falko"&gt;distributed systems guru, Falko&lt;/a&gt;, for providing a ton of input about this topic, and my great &lt;a href="https://twitter.com/nele_lea"&gt;colleague Nele&lt;/a&gt; for helping to get everything in order in this post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hang on — geo-redundant? Multi-region? Active-active?
&lt;/h3&gt;

&lt;p&gt;First, let’s quickly explain some important basic terminology we are going to use in this post:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Geo-redundancy&lt;/strong&gt; (also referred to as geo-replication): We want to replicate data in a geographically distant second data center. This means even a massive disaster like a full data center going down will not result in any data loss. For some use cases, this becomes the de-facto standard as most businesses simply cannot risk losing any data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-region&lt;/strong&gt; : Most organizations deploy to public clouds, and the public cloud providers call their different data centers regions. So in essence, deploying to two different regions makes sure those deployments end up in separate data centers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability zones&lt;/strong&gt; : A data center, or a region, is separated into availability zones. Those zones are physically separated, meaning an outage because of technical failures is limited to one zone. Still, all zones of a region are geographically located in one data center.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Active-active&lt;/strong&gt; : When replicating data to a second machine, you could simply copy the data there, just to have it when disaster strikes. This is called a passive backup. Today, most use cases strive for the so-called active-active scenario instead, where data is actively processed on both machines. This makes sure you can efficiently use the provisioned hardware (and not keep a passive backup machine idle all the time).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zeebe&lt;/strong&gt; : The workflow engine within Camunda Platform 8.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So let’s rephrase what we want to look at today: How to run a multi-region active-active Zeebe cluster (which then is automatically geo-redundant and geo-replicated). That’s a mouthful!&lt;/p&gt;

&lt;h3&gt;
  
  
  Resilience levels
&lt;/h3&gt;

&lt;p&gt;Firstly, do you really need multi-region redundancy? To understand this better, let’s sketch the levels of resilience you can achieve:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clustering&lt;/strong&gt; : You build a cluster of nodes in one zone. You can withstand hardware or software failures of individual nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-zone&lt;/strong&gt; : You distribute nodes into multiple zones, increasing availability as you can now withstand an outage of a full zone. Zone outages are very rare.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-region&lt;/strong&gt; : You distribute nodes into multiple regions, meaning geographically distributed data centers. You will likely never experience an outage of a full region, as this might only happen because of exceptional circumstances.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So while most normal projects are totally fine with clustering, the sweet spot is multi-zone. Assuming you run on Kubernetes provided by one of the hyperscalers, multi-zone is easy to set up and thus does not cause a lot of effort or cost. At the same time, it provides an availability that is more than sufficient for most use cases. Only if you really need to push availability further and withstand epic disasters do you need to go for multi-region deployments. I typically see this with big financial or telecommunication companies. That said, there might also be other drivers besides availability for a multi-region setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Locality: With a cluster spanning multiple regions, clients can talk to the nodes closest to them. This can decrease network latencies.&lt;/li&gt;
&lt;li&gt;Migration: When you need to migrate to another region at your cloud provider, you might want to gradually take workloads over and run both regions in parallel for some time to avoid any downtimes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In today’s blog post, we want to unwrap Zeebe’s basic architecture to support any of those resilience scenarios, quickly describe a multi-zone setup, and also turn our attention to multi-region, simply because it is possible and we are regularly asked about it. Finally, we’ll explain how Zeebe scales and how we can turn any of those scenarios into an active-active deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Replication in Zeebe
&lt;/h3&gt;

&lt;p&gt;To understand how we can achieve resilience in Zeebe, you first need to understand how Zeebe does replication. Zeebe uses distributed consensus — more specifically the &lt;a href="https://raft.github.io/"&gt;Raft Consensus Algorithm&lt;/a&gt; — for replication. There is an awesome &lt;a href="http://thesecretlivesofdata.com/raft/"&gt;visual explanation of the Raft Consensus Algorithm&lt;/a&gt; available online, so I will not go into all the details here. The basic idea is that there is a &lt;strong&gt;single leader&lt;/strong&gt; and &lt;strong&gt;a set of followers&lt;/strong&gt;. The most common setup is to have one leader and two followers, and you’ll see why soon.&lt;/p&gt;

&lt;p&gt;When the Zeebe brokers start up, they elect a leader. Only the leader is allowed to write data. The data written by the leader is replicated to all followers. Only after a successful replication is the data considered committed and can be processed by Zeebe (this is explained in more detail in &lt;a href="https://dev.to/berndruecker/how-we-built-a-highly-scalable-distributed-state-machine-10hd"&gt;how we built a highly scalable distributed state machine&lt;/a&gt;). In essence, all (committed) data is guaranteed to exist on the leader and all followers all the time.&lt;/p&gt;

&lt;p&gt;There is one important property you can configure for your Zeebe cluster — the &lt;strong&gt;replication factor&lt;/strong&gt;. A replication factor of three means data is available three times, on the leader as well as replicated to two followers, as indicated in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V9vcf3Tf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/298/0%2AkwGslgf06ZWiyZVy" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V9vcf3Tf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/298/0%2AkwGslgf06ZWiyZVy" alt="" width="298" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A derived property is what is called the quorum. This is the number of nodes required to hold so-called elections. Those elections are necessary for the Zeebe cluster to select who is the leader and who is a follower. To elect a leader, at least round_down(replication factor / 2) + 1 nodes need to be available. In the above example, this means round_down(3/2)+1 = 2 nodes are needed to reach a quorum.&lt;/p&gt;

&lt;p&gt;So a cluster with a replication factor of three can process data if at least two nodes are available. This number of nodes is also needed to consider something committed in Zeebe.&lt;/p&gt;
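&lt;p&gt;The quorum formula and the number of tolerable node failures can be written down in a few lines of Python:&lt;/p&gt;

```python
def quorum(replication_factor):
    # round_down(replication factor / 2) + 1 nodes are needed to elect
    # a leader and to consider data committed
    return replication_factor // 2 + 1

def tolerated_failures(replication_factor):
    # nodes that can fail while the cluster keeps processing
    return replication_factor - quorum(replication_factor)

print(quorum(3), tolerated_failures(3))  # 2 1
```

&lt;p&gt;So with a replication factor of three, two nodes form a quorum and one node may fail; with a replication factor of five, three nodes form a quorum and two may fail.&lt;/p&gt;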

&lt;p&gt;A replication factor of three is the most common, as it gives you a good compromise between the number of replicas (additional hardware cost) and availability (you can tolerate losing one node).&lt;/p&gt;

&lt;h3&gt;
  
  
  A sample failure scenario
&lt;/h3&gt;

&lt;p&gt;With this in mind, let’s quickly run through a failure scenario, where one node crashes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u2Ex7ik---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/756/0%2Ab2h4rQyWAmq-8ptR" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u2Ex7ik---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/756/0%2Ab2h4rQyWAmq-8ptR" alt="" width="756" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One node crashing will not affect the cluster at all, as it still can reach a quorum. Thus, it can elect a new leader and continue working. You should simply replace or restart that node as soon as possible to keep an appropriate level of redundancy.&lt;/p&gt;

&lt;p&gt;Note that every Zeebe cluster with a configured replication factor has basic resilience built in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-zone Zeebe clusters
&lt;/h3&gt;

&lt;p&gt;When running on Kubernetes in a public cloud, you can easily push availability further by distributing the different Zeebe nodes to different availability zones. To do so, you can leverage &lt;a href="https://kubernetes.io/docs/setup/best-practices/multiple-zones/"&gt;multi-zone clusters in Kubernetes&lt;/a&gt;. For example, in Google Cloud (GCP) this means &lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/types-of-clusters"&gt;regional clusters&lt;/a&gt; (mind the confusing wording: a regional cluster is spread across &lt;em&gt;multiple zones&lt;/em&gt; within &lt;em&gt;one region&lt;/em&gt;). Then, you can set &lt;a href="https://kubernetes.io/docs/setup/best-practices/multiple-zones/#node-behavior"&gt;a constraint that your Zeebe nodes, running as a stateful set, all run in different zones from each other&lt;/a&gt;. Et voilà, you have added multi-zone replication:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JcnFUCiJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/286/0%2AcgS1H7N0QUi0I3Cr" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JcnFUCiJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/286/0%2AcgS1H7N0QUi0I3Cr" alt="" width="286" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the Zeebe perspective, a zone outage now looks exactly like a node outage. You can also run more than three Zeebe nodes, as we will discuss later in this post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-region Zeebe clusters
&lt;/h3&gt;

&lt;p&gt;As multi-zone replication was so easy, let’s also look at something technically more challenging (reminding ourselves that not many use cases actually require it): multi-region clusters.&lt;/p&gt;

&lt;p&gt;You might have guessed it by now — the logic is basically the same. You distribute your three Zeebe nodes across three different regions. Unfortunately, this is nothing Kubernetes does out of the box for you, at least not yet. There is so much going on in this area that I expect new possibilities to emerge soon (&lt;a href="https://linkerd.io/2.11/tasks/multicluster-using-statefulsets/"&gt;Linkerd’s multi-cluster communication with StatefulSets&lt;/a&gt; is just one example).&lt;/p&gt;

&lt;p&gt;In our customer project, this was not a showstopper, as we went with the following procedure, which proved to work well:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spin up three Kubernetes clusters in different regions (calling them “west”, “central”, and “east” here for brevity).&lt;/li&gt;
&lt;li&gt;Set up DNS forwarding between those clusters (see solution #3 of &lt;a href="https://www.youtube.com/watch?v=az4BvMfYnLY"&gt;Cockroach running a distributed system across Kubernetes Clusters&lt;/a&gt;) and add the proper firewall rules so that the clusters can talk to each other.&lt;/li&gt;
&lt;li&gt;Create a Zeebe node in every cluster using tweaked &lt;a href="https://docs.camunda.io/docs/self-managed/platform-deployment/kubernetes-helm/"&gt;Helm charts&lt;/a&gt;. Those tweaks made sure to calculate and set the &lt;a href="https://docs.camunda.io/docs/self-managed/zeebe-deployment/operations/setting-up-a-cluster/"&gt;Zeebe broker ids&lt;/a&gt; correctly (which is mathematically easy, but a lot of fun to do in shell scripts ;-)). This leads to “west-zeebe-0” being node 0, “central-zeebe-0” being node 1, and “east-zeebe-0” being node 2.&lt;/li&gt;
&lt;/ol&gt;
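&lt;p&gt;To illustrate step 3, the broker id assignment can be modeled in a few lines. This is a sketch of the mapping only (our actual implementation lived in the tweaked Helm charts and shell scripts mentioned above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class BrokerIdDemo {
    // regionIndex: 0 = "west", 1 = "central", 2 = "east";
    // localIndex: ordinal of the pod within its region ("...-zeebe-0", "...-zeebe-1", ...)
    static int brokerId(int regionIndex, int localIndex, int regionCount) {
        return localIndex * regionCount + regionIndex;
    }

    public static void main(String[] args) {
        System.out.println(brokerId(0, 0, 3)); // west-zeebe-0    -&gt; node 0
        System.out.println(brokerId(1, 0, 3)); // central-zeebe-0 -&gt; node 1
        System.out.println(brokerId(2, 0, 3)); // east-zeebe-0    -&gt; node 2
        System.out.println(brokerId(0, 1, 3)); // west-zeebe-1    -&gt; node 3 (six-node cluster)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Interleaving the ids across regions like this is what later allows the round-robin partition distribution to place every replica set across all regions.&lt;/p&gt;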

&lt;p&gt;Honestly, those scripts are not ready to be shared without hand-holding, but if you plan to set up a multi-region cluster, &lt;a href="https://camunda.com/contact/"&gt;please simply reach out&lt;/a&gt; and we can discuss your scenario and assist.&lt;/p&gt;

&lt;p&gt;Note that we set up as many regions as we have replicas. This is by design, as the whole setup becomes rather simple if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The number of nodes is a multiple of your replication factor (in our example 3, 6, 9, …).&lt;/li&gt;
&lt;li&gt;The nodes can be equally distributed among regions (in our example 3 regions for 3, 6, or 9 nodes).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Running Zeebe in exactly two data centers
&lt;/h3&gt;

&lt;p&gt;Let’s discuss a common objection at this point: we don’t want to run in three data centers, we want to run in exactly two! My hypothesis is that this stems from a time when organizations operated their own data centers, which typically meant only two data centers were available. However, this has changed a lot with the move to public cloud providers.&lt;/p&gt;

&lt;p&gt;Truthfully, it is actually harder to run a replicated Zeebe cluster spanning two data centers than one spanning three. This follows from the quorum logic and the multiples rule described above. So in a world dominated by public cloud providers, where it is not a big deal to utilize another region, we simply recommend replicating to three data centers.&lt;/p&gt;

&lt;p&gt;Nevertheless, in the customer scenario there was a requirement to run Zeebe in exactly two regions, so let’s quickly sketch how this can be done. We run four nodes, two in every region. This allows one node to go down while still guaranteeing a copy of all data in both regions; three nodes would not be enough to deal with the outage of a whole region.&lt;/p&gt;

&lt;p&gt;The following image illustrates our concrete setup:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_lW65p9z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/380/0%2AYUvNYbjqu_HTDC70" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_lW65p9z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/380/0%2AYUvNYbjqu_HTDC70" alt="" width="380" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is one key difference from the three-region scenario: when you lose one region, an operator will need to jump in and take manual action. With two nodes missing, the cluster no longer has a quorum (remember: replication factor 4 / 2 + 1 = 3) and cannot process data, as visualized in the following diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xlsqSaXo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2ASkz8ae4yYZjaH1OA" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xlsqSaXo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2ASkz8ae4yYZjaH1OA" alt="" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To get your cluster back to work, you need to add one more (empty) cluster node with &lt;a href="https://docs.camunda.io/docs/self-managed/zeebe-deployment/operations/setting-up-a-cluster/"&gt;the Zeebe node id&lt;/a&gt; of the original node 3 (at the time of writing, the cluster size of Zeebe is fixed and cannot be increased on the fly, which is why you cannot simply add new nodes). The cluster automatically copies the data to this new node and can elect a new leader, so the cluster is back online.&lt;/p&gt;

&lt;p&gt;Adding this node is deliberately a manual step to avoid a so-called &lt;a href="https://en.wikipedia.org/wiki/Split-brain_(computing)"&gt;split-brain situation&lt;/a&gt;. Assume that the network link between region one and region two goes down. Each data center is still operating but thinks the other region is down. There is no easy way for an automated algorithm within one of the regions to decide that it should start new nodes while avoiding that both regions do so. This is why the decision is pushed to a human operator. As losing a whole region is really rare, this is tolerable. Please note again that this is only necessary in the two-region scenario, not when using three regions (as those still have a quorum when one region is missing).&lt;/p&gt;

&lt;p&gt;When the region comes back, you can start node 4 again, and then replace the new node 3 with the original one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uW01_wl1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/744/0%2AIJvf3zwcPblbKy-C" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uW01_wl1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/744/0%2AIJvf3zwcPblbKy-C" alt="" width="744" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The bottom line is that using two regions is possible, but more complex than simply using three. Whenever you are not really constrained by the number of physical data centers available to you (as with public cloud providers), we recommend simply going with three regions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scaling workloads using partitions
&lt;/h3&gt;

&lt;p&gt;So far, we have simplified things a little bit. We were not building real active-active clusters, as followers do no work other than replicating. Also, we did not really scale Zeebe. Let’s look at this next.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.camunda.io/docs/components/zeebe/technical-concepts/partitions/"&gt;Zeebe uses so-called partitions for scaling&lt;/a&gt;, as further explained in &lt;a href="https://dev.to/berndruecker/how-we-built-a-highly-scalable-distributed-state-machine-10hd"&gt;how we built a highly scalable distributed state machine&lt;/a&gt;. In the above examples, we looked at exactly one partition. In reality, a Camunda Platform 8 installation runs multiple partitions. The exact number depends on your load requirements, but it should reflect what was described above about multiples.&lt;/p&gt;

&lt;p&gt;So a replication factor of three means we might run 12 partitions on six nodes, or 18 partitions on six nodes, for example. Now, leaders and followers of the various partitions are distributed onto the various Zeebe nodes, making sure those nodes are not only followers but also leaders for some of the partitions. This way, every node will also do “real work”.&lt;/p&gt;

&lt;p&gt;The following picture illustrates this, where P1 — P12 stand for the various partitions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ren5nsRZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1012/0%2A_UGWyHjD-w9eWzRU" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ren5nsRZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1012/0%2A_UGWyHjD-w9eWzRU" alt="" width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is a round-robin pattern behind distributing leaders and their followers to the nodes. We can leverage this pattern to guarantee geo-redundancy by also assigning the nodes to the various data centers round-robin. As you can see above, for P1 the leader is in region 2 and the followers are in regions 1 and 3, so every data center has a copy of the data, as described earlier. The same holds for all other partitions. An outage will therefore not harm the capability of the Zeebe cluster to process data. The following illustration shows what happens if region 3 goes down; the partitions only need to elect some new leaders:&lt;/p&gt;
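&lt;p&gt;This distribution can be modeled in a few lines. The sketch below assumes six nodes interleaved across three regions (node id modulo 3) and a replication factor of three; it only illustrates why each region ends up with a copy of every partition and is not Zeebe’s actual scheduling code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class PartitionDistributionDemo {
    public static void main(String[] args) {
        int nodes = 6; // spread round-robin over 3 regions, so region = node % 3
        for (int partition = 0; partition &amp;lt; 12; partition++) {
            int leader = partition % nodes;
            // with replication factor 3, the two followers sit on the
            // next nodes in the ring and therefore in the two other regions
            int follower1 = (leader + 1) % nodes;
            int follower2 = (leader + 2) % nodes;
            System.out.println("P" + (partition + 1)
                + ": leader in region " + (leader % 3)
                + ", followers in regions " + (follower1 % 3) + " and " + (follower2 % 3));
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;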

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z6chCMkT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1011/0%2AgsMWSjZ9vbeCN7H1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z6chCMkT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1011/0%2AgsMWSjZ9vbeCN7H1" alt="" width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  And how does geo-redundancy affect performance?
&lt;/h3&gt;

&lt;p&gt;Finally, let’s also have a quick look at how multi-region setups affect the performance and throughput of Zeebe. The elephant in the room is of course that network latency between geographically separate data centers is unavoidable. Especially if you plan for epic disasters, your locations should not be too close together. Or if you want to ensure geographic locality, you might even want various data centers close to the majority of your customers, which might simply mean you will work with data centers all over the world. In our recent customer example, we used one GCP region in London and one in the US (Northern Virginia, to be precise). The latency between those data centers is estimated at roughly 80 ms (according to &lt;a href="https://geekflare.com/google-cloud-latency/"&gt;https://geekflare.com/google-cloud-latency/&lt;/a&gt;), but latencies between other region pairs can go up to a couple of hundred milliseconds.&lt;/p&gt;

&lt;p&gt;Spoiler alert: This is not at all a problem for Zeebe and does not affect throughput.&lt;/p&gt;

&lt;p&gt;To add some spice to this, let’s quickly look at why this is a problem in most architectures. For example, in Camunda Platform 7 (the predecessor of the current Camunda Platform 8), we used a relational database and database transactions to store the workflow engine state. In this architecture, replication needs to happen as part of the transaction (at least if we need certain consistency guarantees, which we do), resulting in transactions that take a long time. This causes two problems: first, conflicts between transactions become more likely, for example because two requests want to correlate something to the same BPMN process instance. Second, typical resource pools for transactions or database connections might end up being exhausted in high-load scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QxPy7t9b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/966/0%2AHAJ9Ll_bCoPKSa9o" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QxPy7t9b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/966/0%2AHAJ9Ll_bCoPKSa9o" alt="" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In summary, running Camunda Platform 7 geographically distributed is possible, but especially under high load, it bears challenges.&lt;/p&gt;

&lt;p&gt;With the Camunda Platform 8 architecture, the engine does not leverage any database transactions. Instead, it uses ring buffers to queue work. Waiting for IO, like the replication reporting success, does not block any resources and does not cause any contention in the engine. This is described in more detail in &lt;a href="https://dev.to/berndruecker/how-we-built-a-highly-scalable-distributed-state-machine-10hd"&gt;how we built a highly scalable distributed state machine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Long story short: Our experiments clearly supported the hypothesis that geo-redundant replication does not affect throughput. Of course, processing every request will have higher latency. Or to put it in other words, your process cycle times will increase, as the network latency is still there. However, it only influences that one number, in a very predictable way. In the customer scenario, a process that typically takes around 30 seconds was delayed by a couple of seconds in total, which was not a problem at all. We have not even started to optimize for replication latency yet, but we have a lot of ideas.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;In this post, you saw that Zeebe can easily be geo-replicated. The sweet spot is a replication factor of three with replication across three data centers. In public cloud speak, this means three different regions. Geo-replication will of course add latency, but it does not affect throughput. Still, you might not even need such a high degree of availability and may be happy to run in multiple availability zones of your data center or cloud provider. As this is built into Kubernetes, it is very easy to achieve.&lt;/p&gt;

&lt;p&gt;Please reach out to us if you have any questions, specific scenarios, or simply want to share great success stories!&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>georedundancy</category>
      <category>resilience</category>
      <category>multiregion</category>
      <category>highavailability</category>
    </item>
    <item>
      <title>What to do When You Can’t Quickly Migrate to Camunda 8</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Wed, 25 May 2022 19:03:31 +0000</pubDate>
      <link>https://forem.com/camunda/what-to-do-when-you-cant-quickly-migrate-to-camunda-8-j61</link>
      <guid>https://forem.com/camunda/what-to-do-when-you-cant-quickly-migrate-to-camunda-8-j61</guid>
      <description>&lt;h4&gt;
  
  
  Managing a brownfield when you simply don’t have a green one
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://dev.to/mary_grace/camunda-platform-80-released-whats-new-l51-temp-slug-6751642"&gt;With Camunda Platform 8 out of the door&lt;/a&gt; now, I’ve been having frequent discussions around migration. Many of them go along the lines of: “We are invested in Camunda 7, including a lot of best practices, project templates, and even code artifacts. We can’t quickly migrate to Camunda 8, so what should we do now?” I call this a &lt;a href="https://en.wikipedia.org/wiki/Brownfield_(software_development)" rel="noopener noreferrer"&gt;brownfield&lt;/a&gt;. If you are in this situation, this blog post is for you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Greenfield recommendation
&lt;/h3&gt;

&lt;p&gt;But let’s start with the easy things first. Let’s assume you just entered the world of process automation and orchestration with Camunda, and you’re starting from scratch. In this case, we strongly recommend starting with Camunda 8 right away, for example, using &lt;a href="https://docs.camunda.io/docs/components/best-practices/architecture/deciding-about-your-stack/#the-greenfield-stack" rel="noopener noreferrer"&gt;the Java greenfield stack&lt;/a&gt;: Java, Spring Boot, Spring Zeebe, and Camunda Platform 8 — SaaS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can’t use Camunda 8 just yet?
&lt;/h3&gt;

&lt;p&gt;But there are some edge cases where you might not want to use Camunda 8 right away. The typical reasons include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can’t leverage &lt;a href="https://camunda.com/get-started" rel="noopener noreferrer"&gt;Camunda 8 — SaaS&lt;/a&gt;, but also don’t have Kubernetes at your disposal to &lt;a href="https://docs.camunda.io/docs/self-managed/overview/" rel="noopener noreferrer"&gt;install the platform self-managed&lt;/a&gt;. While installing &lt;a href="https://docs.camunda.io/docs/self-managed/platform-deployment/local/" rel="noopener noreferrer"&gt;Camunda 8 on bare-metal or VMs&lt;/a&gt; is possible, it is also not super straightforward and might not be your choice if you have to set up many engines in a big organization that embraces microservices. Of course, you could probably leverage existing Infrastructure as Code (IaC) toolchains to ease this task (like Terraform or Ansible).&lt;/li&gt;
&lt;li&gt;You are missing a concrete feature because Camunda 8 needs to catch up on feature parity. The prime examples are around &lt;a href="https://docs.camunda.io/docs/components/modeler/bpmn/bpmn-coverage/" rel="noopener noreferrer"&gt;BPMN elements like compensation or conditions&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You stick to a principle of not running x.0 software versions in production (while I do see the point here, I want to add that I don’t think this applies to Camunda 8.0: it is technically a Camunda Cloud 1.4 release, with quite a few users already running it in production).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Independent of the exact reason, this means that you should start on a greenfield with Camunda 7. It’s worth repeating that this should be an exception. In this case, the recommendation is to start with the latest &lt;a href="https://docs.camunda.io/docs/components/best-practices/architecture/deciding-about-your-stack-c7/#the-java-greenfield-stack" rel="noopener noreferrer"&gt;Camunda 7 greenfield stack&lt;/a&gt;: Camunda Run as a remote engine via Docker and &lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/external-tasks/" rel="noopener noreferrer"&gt;External Tasks&lt;/a&gt;. If you code in Java, your process solution stack will be Java, Spring Boot, and the &lt;a href="https://github.com/camunda-community-hub/camunda-engine-rest-client-java/" rel="noopener noreferrer"&gt;Camunda REST Client&lt;/a&gt;. If you program in other languages, you should simply leverage the &lt;a href="https://docs.camunda.org/manual/latest/reference/rest/" rel="noopener noreferrer"&gt;REST API&lt;/a&gt;. This is conceptually pretty close to a Camunda 8 architecture. Let’s call it the &lt;strong&gt;external task approach&lt;/strong&gt;.&lt;/p&gt;
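&lt;p&gt;To give you an idea of the programming model, here is a minimal sketch of an external task worker using the Java external task client. The endpoint URL, topic name, and business logic are illustrative, and the snippet requires the &lt;code&gt;camunda-external-task-client&lt;/code&gt; dependency:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import java.util.Collections;

import org.camunda.bpm.client.ExternalTaskClient;

public class RetrievePaymentWorker {

    static Object chargeCreditCard(String orderId) {
        return "payment-for-" + orderId; // stand-in for your real business logic
    }

    public static void main(String[] args) {
        ExternalTaskClient.create()
            .baseUrl("http://localhost:8080/engine-rest") // Camunda Run default REST endpoint
            .asyncResponseTimeout(20000)                  // enable long polling
            .build()
            .subscribe("retrieve-payment")                // topic of the service task
            .handler((task, taskService) -&gt; {
                // read input, delegate to business logic, complete with result variables
                String orderId = task.getVariable("orderId");
                taskService.complete(task,
                    Collections.singletonMap("paymentId", chargeCreditCard(orderId)));
            })
            .open();
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Conceptually, this is the same model as a Camunda 8 job worker, which is exactly why this stack is comparatively easy to migrate.&lt;/p&gt;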

&lt;p&gt;There is one downside of this stack, though — the Java developer experience is not as great as it is with Camunda 8. &lt;a href="https://dev.to/mary_grace/moving-from-embedded-to-remote-workflow-engines-1p35-temp-slug-1715226"&gt;Historically, Camunda users preferred embedded engines&lt;/a&gt; using &lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/delegation-code/#java-delegate" rel="noopener noreferrer"&gt;Java Delegates&lt;/a&gt;. This stack offers a great experience for Java developers. Camunda Run does not offer that same level of developer experience, even though it has improved over the years. While this is normally not a real problem, it might decrease developer motivation around Camunda projects. So if this is a real problem in your context, it is worth going with the greenfield stack from some years ago: Java, Spring Boot, &lt;a href="https://github.com/camunda/camunda-bpm-platform/tree/master/spring-boot-starter" rel="noopener noreferrer"&gt;Camunda Spring Boot Starter&lt;/a&gt;, and &lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/delegation-code/#java-delegate" rel="noopener noreferrer"&gt;Java Delegates&lt;/a&gt;. This stack is also &lt;a href="https://docs.camunda.io/docs/guides/migrating-from-camunda-platform-7/#process-solutions-using-spring-boot" rel="noopener noreferrer"&gt;mentioned as the example in our migration guide&lt;/a&gt;, as it is by far the most common Camunda 7 stack you’ll meet in the wild. Let’s call this the &lt;strong&gt;Java Delegate approach&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I see both approaches as valid choices. But, of course, if you start with Camunda 7 now, you need to think ahead and prepare for a future Camunda 8 migration. This is where the approaches differ; with Java Delegates, you have a harder time making sure to stick to what we call &lt;em&gt;Clean Delegates&lt;/em&gt;, as Java Delegates technically allow pretty dirty hacks. But there will be more on this later in this blog post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Greenfield recommendation summary
&lt;/h3&gt;

&lt;p&gt;So let’s quickly recap our recommendations so far:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use Camunda Platform 8 — SaaS.&lt;/li&gt;
&lt;li&gt;If this is not possible, use Camunda Platform 8 — Self-Managed.&lt;/li&gt;
&lt;li&gt;If this is not possible, use Camunda Platform 7 Run and the external task approach.&lt;/li&gt;
&lt;li&gt;If this is not possible, use Camunda Platform 7 Spring Boot Starter, but implement &lt;em&gt;Clean Delegates.&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F371%2F1%2AkrQbPLkFELWeLSu1a0NsBQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F371%2F1%2AkrQbPLkFELWeLSu1a0NsBQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Brownfields
&lt;/h3&gt;

&lt;p&gt;Now let’s turn our attention back to the brownfield companies. In such situations, the company already uses Camunda 7 and will not migrate overnight to Camunda 8 (&lt;a href="https://docs.camunda.io/docs/guides/migrating-from-camunda-platform-7/#when-to-migrate" rel="noopener noreferrer"&gt;which neither makes sense nor is necessary&lt;/a&gt;). In an ideal world, you would simply start new projects with Camunda 8 and migrate your existing projects step by step over time. But often, it is not that easy.&lt;/p&gt;

&lt;p&gt;For example, your company might have invested a lot of effort into integrating Camunda 7 into its ecosystem. This goes far beyond the code of one process solution but includes best practices, examples, code snippets, reusable connectors, and many more. In such cases, you might still want to start new projects with Camunda 7 until you have a clear idea (and budget) of how to migrate all of those things.&lt;/p&gt;

&lt;p&gt;Or your project is already in-flight and will be finished better with Camunda 7. Or an initiative pops up to extend an existing Camunda 7 process solution, and you cannot make the migration to Camunda 8 part of that endeavor.&lt;/p&gt;

&lt;p&gt;In those cases, the typical question is, “Should we keep doing what we are doing, or should we quickly try to change our architecture to get closer to Camunda 8 already?”&lt;/p&gt;

&lt;p&gt;The short answer is to &lt;strong&gt;keep doing what you are doing&lt;/strong&gt;. This will make migration efforts easier at a later point in time, as you will have one common architecture to migrate. If you adjust your Camunda 7 architecture now, you might end up with two different architecture blueprints you need to migrate. Both external task and Java delegate approaches are OK!&lt;/p&gt;

&lt;p&gt;But you should make sure to establish some practices as quickly as possible that will ease migration projects later on. Those are described in the rest of this post. While external tasks might enforce some practices, &lt;em&gt;Clean Delegates&lt;/em&gt; are equally easy (or sometimes even easier) to migrate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practices to ease migration
&lt;/h3&gt;

&lt;p&gt;In order to implement Camunda 7 process solutions that can be easily migrated, you should stick to the following rules (that are good development practices you should follow anyway), which will be explained in more detail later:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Implement what we call &lt;em&gt;Clean Delegates&lt;/em&gt; — concentrate on reading and writing process variables, plus business logic delegation. Data transformations will mostly be done as part of your delegate (and especially not as listeners, as mentioned below). Separate your actual business logic from the delegates and all Camunda APIs. Avoid accessing the BPMN model and invoking Camunda APIs within your delegates.&lt;/li&gt;
&lt;li&gt;Don’t use listeners or Spring beans in expressions to do data transformations via Java code.&lt;/li&gt;
&lt;li&gt;Don’t rely on an ACID transaction manager spanning multiple steps or resources.&lt;/li&gt;
&lt;li&gt;Don’t expose Camunda API (REST or Java) to other services or front-end applications.&lt;/li&gt;
&lt;li&gt;Use primitive variable types or JSON payloads only (no XML or serialized Java objects).&lt;/li&gt;
&lt;li&gt;Use simple expressions or plug in FEEL. FEEL is the only supported expression language in Camunda 8. JSONPath is also relatively easy to translate to FEEL. Avoid using special variables in expressions, e.g., &lt;code&gt;execution&lt;/code&gt; or &lt;code&gt;task&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use your own user interface or Camunda Forms; the other form mechanisms are not supported out-of-the-box in Camunda 8.&lt;/li&gt;
&lt;li&gt;Avoid using any implementation classes from Camunda; generally, those with *.impl.* in their package name.&lt;/li&gt;
&lt;li&gt;Avoid using engine plugins.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the moment, it might also be good to check the &lt;a href="https://docs.camunda.io/docs/components/modeler/bpmn/bpmn-coverage/" rel="noopener noreferrer"&gt;BPMN elements supported in Camunda 8&lt;/a&gt;, but this gap will most likely be closed soon.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/delegation-code/#execution-listener" rel="noopener noreferrer"&gt;Execution Listeners&lt;/a&gt; and &lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/delegation-code/#execution-listener" rel="noopener noreferrer"&gt;Task Listeners&lt;/a&gt; are areas in Camunda 8 that are still under discussion. Currently, those use cases need to be solved slightly differently. Depending on your use case, the following Camunda 8 features can be used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input and output mappings using FEEL&lt;/li&gt;
&lt;li&gt;Tasklist API&lt;/li&gt;
&lt;li&gt;History API&lt;/li&gt;
&lt;li&gt;Exporters&lt;/li&gt;
&lt;li&gt;Client interceptors&lt;/li&gt;
&lt;li&gt;Gateway interceptors&lt;/li&gt;
&lt;li&gt;Job workers on user tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I expect to soon have a solution in Camunda 8 for most of the problems that listeners solve. Still, it might be good practice to use as few listeners as possible, and especially don’t use them for data mapping as described below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clean Delegates
&lt;/h3&gt;

&lt;p&gt;With Java Delegates and the workflow engine being embedded as a library, projects can do dirty hacks in their code. Casting to implementation classes? No problem. Using a ThreadLocal or trusting a specific transaction manager implementation? Yeah, possible. Calling complex Spring beans hidden behind a simple JUEL (Java unified expression language) expression? Well, you guessed it — doable!&lt;/p&gt;

&lt;p&gt;Those hacks are the real show stoppers for migration, as they simply cannot be migrated to Camunda 8. Actually, &lt;a href="https://dev.to/camunda/moving-from-embedded-to-remote-workflow-engines-5f9h"&gt;Camunda 8 increased isolation intentionally&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So you should concentrate on what a Java Delegate is intended to do:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read variables from the process and potentially manipulate or transform that data to be used by your business logic.&lt;/li&gt;
&lt;li&gt;Delegate to business logic — this is where Java Delegates got their name from. In a perfect world, you would simply issue a call to your business code in another Spring bean or remote service.&lt;/li&gt;
&lt;li&gt;Transform the results of that business logic into variables you write into the process.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;An ideal Java Delegate does nothing beyond these three steps.&lt;/p&gt;

&lt;p&gt;And you should never cast to Camunda implementation classes, use any ThreadLocal object, or influence the transaction manager in any way. Java Delegates should further always be stateless and not store any data in their fields.&lt;/p&gt;

&lt;p&gt;The resulting delegate can be easily migrated to a Camunda 8 API, or simply be reused by the &lt;a href="https://github.com/camunda-community-hub/camunda-7-to-8-migration/" rel="noopener noreferrer"&gt;adapter provided in this migration community extension&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  No transaction managers
&lt;/h3&gt;

&lt;p&gt;You &lt;a href="https://dev.to/camunda/achieving-consistency-without-transaction-managers-55kc"&gt;should not trust ACID transaction managers to glue together the workflow engine with your business code&lt;/a&gt;. Instead, you need to embrace eventual consistency and make every service task its own transactional step. If you are familiar with Camunda 7 lingo, this means that all BPMN elements will be async=true. A process solution that relies on five service tasks to be executed within one ACID transaction, probably rolling back in case of an error, will make migration challenging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don’t expose Camunda API
&lt;/h3&gt;

&lt;p&gt;You should try to apply the &lt;a href="https://en.wikipedia.org/wiki/Information_hiding" rel="noopener noreferrer"&gt;information hiding principle&lt;/a&gt; and not expose too much of the Camunda API to other parts of your application.&lt;/p&gt;

&lt;p&gt;In the above example, you should not hand over an execution context to your CrmFacade, which is hopefully intuitive anyway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_// DO NOT DO THIS!_

crmFacade.createCustomer(execution);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same holds true when a new order is placed and your order fulfillment process should be started. Instead of the front end calling the Camunda API to start a process instance, you are better off providing your own endpoint that translates between the inbound REST call and Camunda.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use primitive variable types or JSON
&lt;/h3&gt;

&lt;p&gt;Camunda 7 provides quite flexible ways to add data to your process. For example, you could add Java objects that would be stored in serialized form. Serialized Java objects are brittle and also tied to the Java runtime environment. Another possibility is magically transforming those objects on the fly to XML using Camunda Spin. This turned out to be black magic and led to regular problems, which is why Camunda 8 does not offer it anymore. Instead, you should do any transformation within your code before talking to Camunda. Camunda 8 only takes JSON as a payload, which automatically includes primitive values.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://gist.github.com/berndruecker/dbc22c3bb92719be40d41bc9cbbb88d6" rel="noopener noreferrer"&gt;above example&lt;/a&gt;, you can see that Jackson was used in the delegate for JSON-to-Java mapping.&lt;/p&gt;

&lt;p&gt;This way, you have full control over what is happening, and such code is also easily migratable. The overall complexity is even lower, as Jackson is well known to Java developers, a de-facto standard with a lot of best practices and recipes available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple expressions and FEEL
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.camunda.io/docs/components/modeler/feel/what-is-feel/" rel="noopener noreferrer"&gt;Camunda 8 uses FEEL as its expression language&lt;/a&gt;. There are big advantages to this decision. Not only are the expression languages between BPMN and DMN harmonized, but also the language is really powerful for typical expressions. One of my favorite examples is the following onboarding demo we regularly show. A decision table will hand back a list of possible risks, whereas every risk has a severity indicator (yellow, red) and a description.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A1roHQ2SpVDuhjdnV" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A1roHQ2SpVDuhjdnV"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result of this decision is then used in the process to make a routing decision:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ACHnRfoCIVQDPEm6Y" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ACHnRfoCIVQDPEm6Y"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To unwrap the DMN result in Camunda 7, you could write some Java code and attach it as a listener when leaving the DMN task (as you will read next, this is already an anti-pattern for migration). This code is not super readable.&lt;/p&gt;

&lt;p&gt;With FEEL, you can evaluate that data structure directly and have an expression on the “red” path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;= some risk in riskLevels satisfies risk = "red"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Isn’t this a great expression? If you think so, and you have such use cases, you can even hook in FEEL as the scripting language in Camunda 7 today (as explained in &lt;a href="https://camunda.com/blog/2018/07/dmn-scripting/" rel="noopener noreferrer"&gt;Scripting with DMN inside BPMN&lt;/a&gt; or &lt;a href="https://camunda.com/blog/2020/05/camunda-bpm-user-task-assignment-based-on-a-dmn-decision-table/" rel="noopener noreferrer"&gt;User Task Assignment based on a DMN Decision Table&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;But the more common situation is that you will keep using JUEL in Camunda 7. If you write simple expressions, they can be easily migrated automatically, as you can see in &lt;a href="https://github.com/camunda-community-hub/camunda-7-to-8-migration/blob/main/modeler-plugin-7-to-8-converter/client/JuelToFeelConverter.test.js" rel="noopener noreferrer"&gt;the test case&lt;/a&gt; of the &lt;a href="https://github.com/camunda-community-hub/camunda-7-to-8-migration" rel="noopener noreferrer"&gt;migration community extension&lt;/a&gt;. You should avoid more complex expressions if possible. Very often, a good workaround to achieve this is to adjust the output mapping of your Java Delegate to prepare data in a form that allows for easy expressions.&lt;/p&gt;

&lt;p&gt;You should definitely avoid hooking Java code into an expression evaluation. The above listener processing the DMN result was one example of this. But a more diabolical example could be the following expression in Camunda 7:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#{ dmnResultChecker.check( riskDMNresult ) }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, the dmnResultChecker is a Spring bean that can contain arbitrary Java logic, possibly even calling some remote service to check whether we currently accept yellow risks (admittedly not a great example). Such code cannot be executed within Camunda 8 FEEL expressions, so the logic needs to be moved elsewhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  Camunda Forms
&lt;/h3&gt;

&lt;p&gt;Finally, while Camunda 7 supports &lt;a href="https://docs.camunda.org/manual/latest/user-guide/task-forms/" rel="noopener noreferrer"&gt;different types of task forms&lt;/a&gt;, Camunda 8 only supports &lt;a href="https://docs.camunda.io/docs/guides/utilizing-forms/#configuration" rel="noopener noreferrer"&gt;Camunda Forms&lt;/a&gt; (which will be extended over time). If you rely on other form types, you either need to turn them into Camunda Forms or use a bespoke tasklist that still supports those forms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;In today’s blog post, I wanted to show you which path to take if Camunda 8 is not yet an option for you. In summary, it’s best you keep doing what you’re already doing. This normally means leveraging the external task approach or the Java Delegate approach. Both options are OK.&lt;/p&gt;

&lt;p&gt;With Java Delegates, you have to be very mindful to avoid hacks that will hinder a migration to Camunda 8. This article sketched the practices you should stick to in order to make migration easier whenever you want to do it, which is mostly about writing clean delegates, sticking to common architecture best practices, using primitive values or JSON, and writing simple expressions.&lt;/p&gt;

&lt;p&gt;As always, I am happy to hear your feedback or &lt;a href="https://forum.camunda.io/" rel="noopener noreferrer"&gt;discuss any questions you might have&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/" rel="noopener noreferrer"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/" rel="noopener noreferrer"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/" rel="noopener noreferrer"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com" rel="noopener noreferrer"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>camunda8</category>
      <category>migration</category>
      <category>camunda</category>
    </item>
    <item>
      <title>How Open is Camunda Platform 8?</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Wed, 25 May 2022 14:59:19 +0000</pubDate>
      <link>https://forem.com/camunda/how-open-is-camunda-platform-8-196e</link>
      <guid>https://forem.com/camunda/how-open-is-camunda-platform-8-196e</guid>
      <description>&lt;p&gt;With &lt;a href="https://dev.to/mary_grace/camunda-platform-80-released-whats-new-l51-temp-slug-6751642"&gt;Camunda Platform 8 being available to the public&lt;/a&gt;, we regularly answer questions about our open source strategy and the licenses for its various components. Let’s sort this out in today’s blog post by looking at the specifics of the components, sketching a path to put Camunda 8 into production without the need to pay us any money, and the difference between open source and source-available licenses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Component overview
&lt;/h3&gt;

&lt;p&gt;Let’s look at the various &lt;a href="https://docs.camunda.io/docs/components/"&gt;components that make up Camunda Platform 8&lt;/a&gt;. The following illustration colors the components according to their license:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Green&lt;/strong&gt; : Open source license.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Green stripes&lt;/strong&gt; : Source-available license (for the curious, the difference between open source and source-available is explained below; for most people, there is no real difference).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blue&lt;/strong&gt; : This software is available but only free for non-production use. If you want to put these components into production, you will need to buy a license (via enterprise subscription) from Camunda.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red&lt;/strong&gt; : This software is only available within Camunda Platform 8 — SaaS and can’t be run self-managed. &lt;strong&gt;Note:&lt;/strong&gt; This is subject to change, and some of the red components should turn blue over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VIowm1Dx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/831/0%2AOU2RxW1hNr3mReuF.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VIowm1Dx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/831/0%2AOU2RxW1hNr3mReuF.png" alt="" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The short summary is that you can run everything &lt;strong&gt;green&lt;/strong&gt; (including green stripes) as &lt;a href="https://docs.camunda.io/docs/self-managed/overview/"&gt;self-managed&lt;/a&gt; in production without needing a license. The green components are open source, as defined by the &lt;a href="https://opensource.org/licenses"&gt;Open Source Initiative&lt;/a&gt;. The striped components use a source-available license. For Zeebe, this is the &lt;a href="https://camunda.com/blog/2019/07/introducing-zeebe-community-license-1-0/"&gt;Zeebe Community License v1.0&lt;/a&gt;. It is based on the very liberal open source &lt;a href="https://opensource.org/licenses/MIT"&gt;MIT license&lt;/a&gt; but with one restriction — users are &lt;strong&gt;not&lt;/strong&gt; allowed to use the components for providing a commercial workflow service in the cloud. This is typically not a limitation for any of our existing customers, users, or prospects. If you want to know more about open source licensing, visit &lt;a href="https://camunda.com/blog/2019/07/zeebe-community-license/"&gt;Why We Created The Zeebe Community License&lt;/a&gt; and &lt;a href="https://camunda.com/legal/terms/cloud-terms-and-conditions/zeebe-license-overview-and-faq/"&gt;Zeebe License Overview and FAQ&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Furthermore, you can run all the &lt;strong&gt;blue&lt;/strong&gt; components during development and testing. This not only allows you to try them out but will help you with your development efforts. If you want to keep using them while going into production, you will need to buy a license from Camunda. Later in this blog post, I will explain how you can go live without those components, as there is a possible path.&lt;/p&gt;

&lt;p&gt;Now, let’s quickly look at a typical question in this context: why are the blue boxes not available for production, even in a limited version?&lt;/p&gt;

&lt;h3&gt;
  
  
  Why free for non-production and not open core?
&lt;/h3&gt;

&lt;p&gt;With Camunda Platform 7, we have an open core model where parts of the components are available open source, and the full-feature set is only available to you if you buy an enterprise subscription. So for example, the basic tier of Camunda Cockpit allows you to see running instances in open source, but only the Enterprise Edition of Camunda Cockpit shows the historical data and provides the full-feature set.&lt;/p&gt;

&lt;p&gt;While this looks good at first glance, it actually adds a lot of friction and confusion for our users. First, they have to understand the feature differences in detail. Second, most people even miss that there is a more powerful version of Cockpit available, leading them to redevelop features that are already there. And finally, even if the customer’s team requires the power of the Enterprise Edition of Cockpit, selling the license is hard in situations where decision-makers might not care enough about the daily friction of operations to spend the money. In other words, our power users often want an Enterprise license and have a good business case for it but are still let down by their decision-makers.&lt;/p&gt;

&lt;p&gt;This is why we made the whole model radically simpler. You can have all the tools with all the features during development without any fluff. Everything is easily accessible (&lt;a href="https://hub.docker.com/u/camunda"&gt;available on DockerHub&lt;/a&gt;, for example), can help you learn Camunda, and speed up development. For example, Camunda Operate (the Cockpit equivalent in Camunda 8) helps you to understand what’s going on in your workflow engine, especially when you are new and start developing.&lt;/p&gt;

&lt;p&gt;You will only need to buy a license when you put it into production. But the argument for the Enterprise Edition is now very simple to understand — without it, you can’t use those productivity tools. So far, our users are actually pretty happy about that change, as it makes it easier for them to ask for the necessary budget.&lt;/p&gt;

&lt;p&gt;If for whatever reason your company doesn’t want to pay for the Enterprise Edition, there is still a way to production, as described below. However, it is less convenient and involves more work for you. Whether this is worth saving the subscription money is your company’s decision.&lt;/p&gt;

&lt;p&gt;We believe this model has a very good balance of interests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, you can easily start developing process solutions with Camunda Platform 8, but also run serious workloads in production with a completely source-available stack.&lt;/li&gt;
&lt;li&gt;Second, there is sufficient motivation to pay for the additional software, which guarantees that Camunda will stick around.&lt;/li&gt;
&lt;li&gt;Third, this allows Camunda to stay focused and continue to invest in great software and the community.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How SaaS changes the game
&lt;/h3&gt;

&lt;p&gt;So far, we’ve talked about self-managed installations. This still seems to be the default assumption for most people: they want to download and run the software. But this is changing. When you really think about it, you don’t want software — you want some service or feature the software is delivering. This is what cloud and SaaS (software as a service) provide. With Camunda 8, we introduced our &lt;a href="https://camunda.com/pricing/"&gt;own SaaS offering&lt;/a&gt;, where you can completely consume it in the cloud.&lt;/p&gt;

&lt;p&gt;Now, this changes one important aspect — you have to be clear whether you’re searching for open source or something that is &lt;strong&gt;free to use&lt;/strong&gt;. Most people actually search for the latter, which can also be delivered without open source.&lt;/p&gt;

&lt;p&gt;So with Camunda Platform 8 — SaaS, the equivalent of a Community Edition is a free tier, where users can use the service (within certain boundaries) without generating any bills. As I’m writing this blog post, we are working to extend our free tier with Camunda 8. The current situation is that you can already have a &lt;a href="https://camunda.com/pricing/"&gt;&lt;strong&gt;free plan for modeling&lt;/strong&gt;&lt;/a&gt; use cases. And we are &lt;strong&gt;working on a free tier to support execution&lt;/strong&gt; use cases, but still have to work out some details. In contrast to providing a Community Edition for download, every running cluster in the cloud adds up on our own GCP bill, so we have to be diligent about it.&lt;/p&gt;

&lt;p&gt;In general, I expect a big mindset shift over the next few years in this regard. Users will mostly consume SaaS services, and having a free tier will be more important to them than software being open source.&lt;/p&gt;

&lt;p&gt;At this point, I want to add one important side note — our SaaS focus does not mean that our open source commitment will be weakened; on the contrary. We have a big group of passionate people in our community who work miracles for us, and we continuously increase our investment in the community.&lt;/p&gt;

&lt;p&gt;Camunda 8 has all the key ingredients to make a vital &lt;a href="https://camunda.com/developers/community/"&gt;open source community&lt;/a&gt; work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The source code for core components is available.&lt;/li&gt;
&lt;li&gt;Code, issues, and discussions live in the open on GitHub. The frequent pull requests to our documentation are great examples of this.&lt;/li&gt;
&lt;li&gt;Extension points allow community contributions.&lt;/li&gt;
&lt;li&gt;Frequent meetups, talks, and blog posts.&lt;/li&gt;
&lt;li&gt;A great developer relations team that deeply cares about the community.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A path to production with source-available software
&lt;/h3&gt;

&lt;p&gt;Let’s come back to self-managed software and sketch a path to production that neither requires a commercial license nor breaks any license agreements. For production, this basically comes down to using only the source-available parts of Camunda 8:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--43Kc_BDo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/797/0%2AbtxqJ5cln3p0KHbC.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--43Kc_BDo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/797/0%2AbtxqJ5cln3p0KHbC.png" alt="" width="797" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, you will need to find solutions to replace the tools you cannot use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tasklist&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You will need to implement your own task management solution based on job workers subscribing to Zeebe user tasks, &lt;a href="https://docs.camunda.io/docs/components/modeler/bpmn/user-tasks/"&gt;as described in the docs&lt;/a&gt;. That also means you have to build your own persistence to allow task queries, as the &lt;a href="https://docs.camunda.io/docs/apis-clients/tasklist-api/overview/"&gt;Tasklist API&lt;/a&gt; is part of the Tasklist component and is not free for production use.&lt;/p&gt;
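&lt;p&gt;A minimal worker could look roughly like this sketch with the Zeebe Java client (the gateway address is illustrative, and the persistence call is just a placeholder comment):&lt;/p&gt;

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public class CustomTasklistWorker {
  public static void main(String[] args) {
    // Camunda 8 publishes user tasks as jobs of type "io.camunda.zeebe:userTask"
    ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500")  // illustrative self-managed gateway
        .usePlaintext()
        .build();

    client.newWorker()
        .jobType("io.camunda.zeebe:userTask")
        .handler((jobClient, job) -> {
          // Persist the task in your own store so users can query and claim it,
          // e.g. taskStore.save(job.getKey(), job.getVariablesAsMap());
          System.out.println("new user task: " + job.getKey());
          // Complete the job only once a user has actually finished the task:
          // jobClient.newCompleteCommand(job.getKey()).send().join();
        })
        .timeout(Duration.ofDays(14)) // user tasks may stay open for a long time
        .open();
  }
}
```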

&lt;p&gt;&lt;strong&gt;Operate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Operate is the component you will miss most, as you typically want to gain a clear understanding of what is going on in your workflow engine and take corrective actions.&lt;/p&gt;

&lt;p&gt;For looking at data, you can access it in Elastic (check the &lt;a href="https://github.com/camunda/zeebe/tree/main/exporters/elasticsearch-exporter"&gt;Elastic Exporter&lt;/a&gt; for details), leverage the &lt;a href="https://docs.camunda.io/docs/self-managed/zeebe-deployment/operations/metrics/"&gt;metrics&lt;/a&gt;, or build your &lt;a href="https://docs.camunda.io/docs/components/zeebe/technical-concepts/architecture/#exporters"&gt;own exporters&lt;/a&gt; to push it to some data storage component that is convenient for you. Exporters can also filter or pre-process data on the fly. It is worth noting that the &lt;a href="https://docs.camunda.io/docs/apis-clients/operate-api/"&gt;Operate data pre-processing logic backing the History API&lt;/a&gt; is part of Operate and not free for production use.&lt;/p&gt;

&lt;p&gt;For influencing process instances (like canceling them), you can use the existing &lt;a href="https://docs.camunda.io/docs/apis-clients/grpc/"&gt;Zeebe API&lt;/a&gt;, which is also exposed as the &lt;a href="https://docs.camunda.io/docs/apis-clients/cli-client/"&gt;command-line tool zbctl&lt;/a&gt;.&lt;/p&gt;
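&lt;p&gt;For example, canceling a process instance could look like this with zbctl (the instance key is illustrative; --insecure skips TLS for a local gateway):&lt;/p&gt;

```shell
# Cancel a running process instance by its key
zbctl --insecure cancel instance 2251799813686748
```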

&lt;p&gt;This flexibility allows you to hook functionality into your own front-ends. Of course, this takes effort, but it is definitely possible, and we know of users that have done it. As already mentioned, you should contrast that effort with the costs of the license.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimize&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Optimize is hard to replace because it goes quite deep into process-based analytics, which is hard to build on your own. If you can’t use Optimize, the closest you might get to it is by adding your &lt;a href="https://docs.camunda.io/docs/components/zeebe/technical-concepts/architecture/#exporters"&gt;own exporters&lt;/a&gt; to push the data to an existing general-purpose BI (Business Intelligence), DWH (Data Warehouse), or data lake solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In this blog post, I wanted to make it very clear which components of the Camunda 8 stack are open source (or source-available) and which are not free for production use. I gave some pointers for going into production with a pure source-available stack, but also tried to explain the effort that might require, which is, of course, the upselling potential the company needs. I hope this was understandable, and I’m happy to discuss it in &lt;a href="https://forum.camunda.io/"&gt;the Camunda forum&lt;/a&gt; in case there are open questions.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>opensource</category>
      <category>camunda</category>
    </item>
    <item>
      <title>How to Benchmark Your Camunda 8 Cluster</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Wed, 20 Apr 2022 08:45:36 +0000</pubDate>
      <link>https://forem.com/camunda/how-to-benchmark-your-camunda-8-cluster-2mlh</link>
      <guid>https://forem.com/camunda/how-to-benchmark-your-camunda-8-cluster-2mlh</guid>
      <description>&lt;h4&gt;
  
  
  “Can I execute 10, 100, or 1,000 process instances per second on this Camunda 8 cluster?”
&lt;/h4&gt;

&lt;p&gt;This is a typical question we get these days, and it can be answered using benchmarking. Let’s explore this fascinating topic in today’s post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Our benchmarking journey and some technical background
&lt;/h3&gt;

&lt;p&gt;Internally, benchmarking was an interesting journey, which I quickly want to recap first. From day one, one engineering goal behind Zeebe, the workflow engine within Camunda 8, was to develop a highly scalable workflow engine that can go far beyond what existing technology could do. Hence, measuring progress in the areas of throughput or latency was always top of our minds.&lt;/p&gt;

&lt;p&gt;Feel free to skip this section if you are just interested in how to run your own benchmark in the latest and greatest way.&lt;/p&gt;

&lt;p&gt;When we first released Zeebe in 2018, we &lt;a href="https://camunda.com/blog/2018/06/benchmarking-zeebe-horizontal-scaling/"&gt;hand-crafted a benchmark setup using AWS, Terraform, and Ansible&lt;/a&gt;. This setup created a Zeebe cluster alongside a simple load generator using the Zeebe Java client and then &lt;a href="https://camunda.com/blog/2018/06/benchmarking-zeebe-horizontal-scaling/"&gt;measured process instances started per second&lt;/a&gt;. While this was a good starting point, it was not yet great.&lt;/p&gt;

&lt;p&gt;This first approach to running a benchmark resulted in two major lessons learned:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We did not look at the right metrics. “Process instances started per second” is easy to measure, but this number does not tell you if those processes can be completed, or if you simply start a big wave of instances that pile up. Hence, a more interesting metric to look at is “process instances completed per second”.&lt;/li&gt;
&lt;li&gt;Service tasks need to be taken into consideration. For every service task, the workflow engine internally needs to create a job, pass it to a worker, and wait for its completion. Those computations compete for resources with other computations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s use the following example process:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VfIAQaF2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AKmNcJfDBvGYYOxrC" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VfIAQaF2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AKmNcJfDBvGYYOxrC" alt="" width="800" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To complete this process, the workflow engine needs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create the new process instance&lt;/li&gt;
&lt;li&gt;Manage “Do A”&lt;/li&gt;
&lt;li&gt;Manage “Do B”&lt;/li&gt;
&lt;li&gt;Manage “Do C”&lt;/li&gt;
&lt;li&gt;Complete the process instance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now the computations around completing those three service tasks compete with the computations around process instance creation. This means that if you start process instances too fast, service task completions can no longer keep up, and process instances will pile up instead of completing. This is also why the Zeebe engine applies &lt;a href="https://docs.camunda.io/docs/self-managed/zeebe-deployment/operations/backpressure/"&gt;backpressure&lt;/a&gt; on process instance starts. Note that the current design favors service tasks over starting process instances, as service task completions will not receive any backpressure.&lt;/p&gt;

&lt;p&gt;This is all interesting, but what does that mean for benchmarking? Basically, we need to balance process instance starts and service task completions. Our first attempt to do so was around 2019 when we created more realistic scenarios. In a nutshell, we said:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We need to create a &lt;strong&gt;starter&lt;/strong&gt; application that will create process instances at a defined rate, passing on a defined payload.&lt;/li&gt;
&lt;li&gt;We also need a &lt;strong&gt;worker&lt;/strong&gt; application for completing service tasks. We wanted it to behave realistically, so we simulated a delay in job completion. We used 100 ms as the typical time an external REST service call takes to complete.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We packaged both as Java applications that can be configured, deployed, and scaled via Kubernetes. Since then, we have used &lt;a href="https://github.com/camunda/zeebe/tree/main/benchmarks"&gt;those benchmark starters&lt;/a&gt; regularly, and you can for example &lt;a href="https://camunda.com/blog/2020/11/zeebe-performance-tool/"&gt;see them in action here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0Q84o3dK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/467/1%2AdfkD28ykkNR8KvqQ-HANnQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0Q84o3dK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/467/1%2AdfkD28ykkNR8KvqQ-HANnQ.png" alt="" width="467" height="447"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;First attempt to automate benchmarking using multiple starter and worker pods&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But there were still two problems that regularly hit us:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Having two independent applications and containers for starter and worker means that they are, well, independent of each other. But in reality, they need to be balanced (because of the competition between process starts and service tasks). This balancing had to be done manually, which was difficult even for experienced people and often led to wrong conclusions.&lt;/li&gt;
&lt;li&gt;We needed to scale those starters and workers to fully utilize our Zeebe clusters. The way they were built required a massive amount of hardware. This led to expensive benchmarks and hindered adoption, as not every customer can easily spin up huge Kubernetes clusters.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So we did a further improvement round:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We &lt;strong&gt;combined the starter and worker&lt;/strong&gt; into one Java application.&lt;/li&gt;
&lt;li&gt;We &lt;strong&gt;added a balancing algorithm&lt;/strong&gt; that tries to find the optimal rate of process instance starts. Optimal means that we make sure all started process instances can be processed and completed while utilizing Zeebe at maximum capacity. The algorithm uses backpressure signals to adjust. While this algorithm is probably far from optimal yet, it is a good starting point and allows incremental improvements, which we are currently working on.&lt;/li&gt;
&lt;li&gt;We &lt;strong&gt;applied asynchronous/reactive programming&lt;/strong&gt; consistently and leveraged the scheduling capabilities of the Java platform. This reduced the required hardware for the starter application massively and allows us to use a small machine to saturate even large Zeebe clusters. Yay for reactive programming!&lt;/li&gt;
&lt;/ol&gt;
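&lt;p&gt;To make the balancing idea concrete, here is a minimal plain-Java sketch (my own simplification with made-up names, not the actual algorithm from the benchmark project): it probes the start rate upwards while backpressure stays below a target ratio and backs off multiplicatively once backpressure rises.&lt;/p&gt;

```java
// Hypothetical sketch of a backpressure-driven balancing loop, not the
// real camunda-8-benchmark implementation: additive increase while the
// cluster has headroom, multiplicative decrease under backpressure.
public class StartRateBalancer {
    private double startRatePerSecond;
    private final double targetBackpressureRatio; // e.g. 0.05 means 5%

    public StartRateBalancer(double initialRate, double targetBackpressureRatio) {
        this.startRatePerSecond = initialRate;
        this.targetBackpressureRatio = targetBackpressureRatio;
    }

    // Called once per measurement window with the observed backpressure ratio.
    public double adjust(double observedBackpressureRatio) {
        if (observedBackpressureRatio > targetBackpressureRatio) {
            // too much backpressure: back off multiplicatively
            startRatePerSecond = startRatePerSecond * 0.9;
        } else {
            // cluster still has headroom: probe additively
            startRatePerSecond = startRatePerSecond + 1.0;
        }
        return startRatePerSecond;
    }

    public static void main(String[] args) {
        StartRateBalancer balancer = new StartRateBalancer(25.0, 0.05);
        System.out.println(balancer.adjust(0.00)); // no backpressure: 26.0
        System.out.println(balancer.adjust(0.12)); // backpressure: back off
    }
}
```

&lt;p&gt;Repeated over many measurement windows, such a loop converges on a start rate the cluster can just sustain, which is exactly the "optimum" the benchmark looks for.&lt;/p&gt;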

&lt;p&gt;This seems to work great. We now have a tool where a simple Java application can benchmark Zeebe clusters without requiring much knowledge about how the benchmark works internally. The balancing algorithm optimizes itself. This tool is available as a Camunda community extension (&lt;a href="https://github.com/camunda-community-hub/camunda-8-benchmark"&gt;https://github.com/camunda-community-hub/camunda-8-benchmark&lt;/a&gt;) and can serve as a starting point for your own benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--423PhbGl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/313/1%2AwjwKavOxGYfntrK38LAeqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--423PhbGl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/313/1%2AwjwKavOxGYfntrK38LAeqg.png" alt="" width="313" height="447"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Now we run benchmarks controlled by one application that balances itself&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmarking vs. sizing and tuning
&lt;/h3&gt;

&lt;p&gt;At this point, I want to throw in another important thing I have learned over the last few years of doing benchmarks: you have to be clear about your objective!&lt;/p&gt;

&lt;p&gt;There are different things you can do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sizing&lt;/strong&gt; to find the cluster configuration that can fulfill a given goal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance tuning&lt;/strong&gt; on a Zeebe cluster, e.g. figuring out if giving it 2 more vCPUs yields an improvement that is worth the investment and makes this cluster configuration better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency tuning&lt;/strong&gt;, e.g. finding resources that are underutilized in a given cluster, meaning other parameters are the bottleneck. By reducing those resources, you can save money.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmarking&lt;/strong&gt; to understand the maximum throughput on a given cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most often in the past, we did all of the above at the same time. Whenever we defined a cluster configuration (e.g. for our cloud SaaS offering), we obviously benchmarked it. Running a benchmark might also yield insights into performance bottlenecks, which can lead down a route of either performance or efficiency tuning. This leads to an improved cluster configuration, which requires another benchmark run. This is an optimization loop you can basically run forever :-) The following diagram expresses this process:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D9pDYwoq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A1vT0OXYn1hubbl-i" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D9pDYwoq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A1vT0OXYn1hubbl-i" alt="" width="800" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might already see the important trick: you need to be happy with the cluster configuration at some point in time. This can be hard, as engineers have a good instinct that the current configuration is not yet optimal (spoiler: it might never be). Still, you have to stop optimizing and accept a good-enough configuration.&lt;/p&gt;

&lt;p&gt;Also remind yourself that no benchmark can ever be 100% realistic, and therefore numbers should be taken with a grain of salt. My general recommendation is to stop tuning your cluster earlier rather than later, live with some fuzziness around throughput numbers, but then plan enough buffer that you can cope with a reduced throughput for whatever reason. A typical rule of thumb is to size your cluster to accommodate at least 200% of your expected load. If you expect big peaks, this number might be even higher, to be able to guarantee throughput in peak times. You can also find some more thoughts about defining goals in our &lt;a href="https://docs.camunda.io/docs/components/best-practices/architecture/sizing-your-environment/"&gt;best practice “Sizing your environment”&lt;/a&gt;.&lt;/p&gt;
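&lt;p&gt;The buffer arithmetic is simple; here is the rule of thumb spelled out in code, with made-up example numbers for illustration:&lt;/p&gt;

```java
// Back-of-the-envelope sizing with a safety buffer. The 200% rule of thumb:
// the cluster should sustain at least twice the expected load, more if you
// expect pronounced peaks. (The figures below are illustrative only.)
public class SizingBuffer {
    public static double requiredThroughput(double expectedPiPerSecond, double bufferFactor) {
        return expectedPiPerSecond * bufferFactor;
    }

    public static void main(String[] args) {
        double expected = 8.0; // expected sustained load in PI/s
        // 200% buffer: look for a cluster that benchmarks at 16+ PI/s
        System.out.println(requiredThroughput(expected, 2.0));
    }
}
```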

&lt;h3&gt;
  
  
  The metrics we are looking at
&lt;/h3&gt;

&lt;p&gt;Let’s summarize all the important metrics we are looking at in our benchmarks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PI/s&lt;/strong&gt;: Process instances per second. In general, while we need to control the rate of started process instances, we should always measure the &lt;strong&gt;completed&lt;/strong&gt; process instances per second.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpressure&lt;/strong&gt;: The benchmark also records the number of requests per second that were not executed because of backpressure. This rate gives you a good indication of the utilization of your cluster. Our experience is that a cluster that fully utilizes its hardware resources reports around 3–10% backpressure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks/s&lt;/strong&gt;: As discussed, every service task comes with an overhead. And typically, one process instance comes with many tasks; the default benchmark we run contains 10 tasks as a rough average. That means if we want to complete 333 PI/s, we need to be able to complete 3330 Tasks/s. In the Camunda context, you might also see FNI/s instead, which means flow node instances per second. A flow node is every element in a BPMN process, so not only the service tasks, but also the start event, gateways, and the like. While this is the most precise metric, it is also the most abstract one. This is why we look at Tasks/s or PI/s instead, as they are much easier to understand for a variety of people.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cycle time&lt;/strong&gt;: For some processes, cycle time is important. For example, in a recent customer scenario around trading, we needed to guarantee that process instances complete within 1 second.&lt;/li&gt;
&lt;/ul&gt;
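&lt;p&gt;The throughput metrics above are directly related. A small sketch (the 10-tasks-per-instance figure is the default from our typical benchmark process) makes the conversion explicit:&lt;/p&gt;

```java
// Relating PI/s and Tasks/s for the "typical" benchmark process, which
// contains 10 service tasks per process instance.
public class ThroughputMetrics {
    static final int TASKS_PER_PROCESS_INSTANCE = 10;

    static double tasksPerSecond(double piPerSecond) {
        return piPerSecond * TASKS_PER_PROCESS_INSTANCE;
    }

    public static void main(String[] args) {
        // The example from the text: completing 333 PI/s requires
        // completing 3330 Tasks/s.
        System.out.println(tasksPerSecond(333)); // 3330.0
    }
}
```

&lt;p&gt;FNI/s would be higher still, since every start event, gateway, and end event also counts as a flow node instance.&lt;/p&gt;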

&lt;h3&gt;
  
  
  A sample benchmark
&lt;/h3&gt;

&lt;p&gt;Let me walk you through an example.&lt;/p&gt;

&lt;p&gt;For this, I will benchmark a Camunda 8 SaaS cluster of size “S” with our &lt;a href="https://github.com/camunda-community-hub/camunda-cloud-benchmark/blob/main/src/main/resources/bpmn/typical_process.bpmn"&gt;“typical” process&lt;/a&gt; and a &lt;a href="https://github.com/camunda-community-hub/camunda-cloud-benchmark/blob/main/src/main/resources/bpmn/typical_payload.json"&gt;“typical” payload&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DIEwMmJN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Azk5OG5tWtEEZ2j-b" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DIEwMmJN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Azk5OG5tWtEEZ2j-b" alt="" width="800" height="79"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process contains 10 service tasks, one gateway, and two timers waiting for one minute each. I provide the following &lt;a href="https://github.com/camunda-community-hub/camunda-cloud-benchmark/blob/main/src/main/resources/application.properties"&gt;configuration parameters&lt;/a&gt; to the &lt;a href="https://github.com/camunda-community-hub/camunda-8-benchmark"&gt;camunda-8-benchmark&lt;/a&gt; project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;benchmark.startPiPerSecond=25&lt;/em&gt;: We start with a rate of 25 PI/s, and the benchmark will automatically adjust to an optimal rate from there. While this number does not matter too much, the closer you are to the target, the faster the benchmark will reach an optimum.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;benchmark.taskCompletionDelay=200&lt;/em&gt;: Simulate a completion delay of 200 ms for every service task.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With our two timers having a 1-minute delay each, and 10 service tasks having a 200 ms delay each, the cycle time must be at least 2 minutes and 2 seconds for “business” reasons.&lt;/p&gt;
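&lt;p&gt;The lower bound follows directly from the modeled waits; spelled out as a small calculation:&lt;/p&gt;

```java
// The minimal possible cycle time of the sample process follows from the
// modeled waits alone: two 1-minute timers plus ten service tasks with a
// simulated 200 ms completion delay each.
public class CycleTimeFloor {
    public static double minimalCycleTimeSeconds(int timers, double timerSeconds,
                                                 int tasks, double taskDelaySeconds) {
        return timers * timerSeconds + tasks * taskDelaySeconds;
    }

    public static void main(String[] args) {
        // 2 * 60s + 10 * 0.2s = 122s, i.e. 2 minutes and 2 seconds
        System.out.println(minimalCycleTimeSeconds(2, 60.0, 10, 0.2));
    }
}
```

&lt;p&gt;Any measured cycle time above this floor is overhead from the engine, queuing, and polling.&lt;/p&gt;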

&lt;p&gt;Starting the benchmark, you can look at what’s going on using the &lt;a href="https://github.com/camunda-community-hub/camunda-cloud-benchmark/tree/main/grafana"&gt;provided Grafana Dashboard&lt;/a&gt;. In the beginning, it will increase the start rate, as the cluster is underutilized and no backpressure is reported:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FyyLVEDH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/770/0%2A8Fan4AFmvUxc6D2n" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FyyLVEDH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/770/0%2A8Fan4AFmvUxc6D2n" alt="" width="770" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a while though, especially when service tasks kick in more and more, we will see backpressure, and the start rate is slowly reduced. Now, you basically have to wait a few minutes for the system to find a good optimum. In the picture below you can see the first 30 minutes of the benchmark. It took roughly 10 minutes to get to a relatively stable state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--whjxxPbp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Ac__p6L3xn6YBpQYz" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--whjxxPbp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Ac__p6L3xn6YBpQYz" alt="" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this benchmark run, we can then derive throughput and cycle time numbers. Ideally, you should look at those numbers only after the initial warm-up phase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9hUj7Q8f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AZMkBxNLxhWuQzxey" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9hUj7Q8f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AZMkBxNLxhWuQzxey" alt="" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can complete roughly 17 PI/s on this cluster (as you can see, the numbers of started and completed process instances are roughly the same, which is good).&lt;/li&gt;
&lt;li&gt;Service Tasks/s (= Jobs/s) is roughly 10 times PI/s, which makes a lot of sense given our process has 10 service tasks.&lt;/li&gt;
&lt;li&gt;The cycle time is about 133s, which is not far off the calculated optimum of 122s. For most cases, this cycle time is totally fine. If latency matters for your use case, it might make sense to investigate a bit and optimize for it, which is a topic for another day.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://youtu.be/fQIZ5ics9Oc"&gt;This video&lt;/a&gt; also walks you through running this benchmark.&lt;/p&gt;

&lt;p&gt;Just as a side note: Of course, you can also use Camunda Optimize to see the process-related data (like count and cycle times) related to this benchmark. It’s less nerdy but even easier to use :-)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dsyu_xMf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A46WtPikS4AElaLNK" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dsyu_xMf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A46WtPikS4AElaLNK" alt="" width="800" height="773"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Running your own benchmark
&lt;/h3&gt;

&lt;p&gt;Now go and run your own benchmark using &lt;a href="https://github.com/camunda-community-hub/camunda-8-benchmark"&gt;https://github.com/camunda-community-hub/camunda-8-benchmark&lt;/a&gt;. As described, one benchmark application might be sufficient to put smaller clusters under load. You can also run this starter via Docker or on Kubernetes and scale it. You don’t necessarily need to adjust or configure anything but the Zeebe endpoint, but most often you will want to adjust the BPMN process model, the payload, and the simulated time a service task takes. Please have a look at &lt;a href="https://github.com/camunda-community-hub/camunda-8-benchmark"&gt;the readme of the benchmark project&lt;/a&gt; for details.&lt;/p&gt;

&lt;h3&gt;
  
  
  “Our production system doesn’t process anywhere near what the benchmark results showed!”
&lt;/h3&gt;

&lt;p&gt;As mentioned, a benchmark can never be 100% realistic. Still, you should try to mimic realistic behavior to get valuable results and insights. The most important things to consider to make it realistic are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a process model close to what you plan to do. Do you have many service tasks, gateways, or timers? Create a process model that also has them (quick disclaimer: we have not yet built message correlation logic into the benchmarking tool; this is something for another day).&lt;/li&gt;
&lt;li&gt;Use a payload (aka process variables) that is close to reality. If you follow our &lt;a href="https://docs.camunda.io/docs/components/best-practices/development/handling-data-in-processes/"&gt;best practices around handling data in processes&lt;/a&gt;, you should not have much data in process variables. But sometimes, users put a big JSON payload in a process variable, which can impact performance heavily.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having said this, it is a good idea anyway to &lt;a href="https://docs.camunda.io/docs/components/best-practices/overview/"&gt;follow our best practices&lt;/a&gt; to build process solutions that will run smoothly in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running bigger workloads
&lt;/h3&gt;

&lt;p&gt;In today’s post I looked at the smallest cluster we offer in Camunda SaaS. While 17 PI/s is actually sufficient for most use cases out there, it is of course not the big number we love to brag about. I am currently preparing another blog post to describe a scenario that &lt;a href="https://github.com/falko"&gt;my colleague Falko&lt;/a&gt; benchmarked for a customer in the financial industry. They run &lt;strong&gt;6000 PI/s&lt;/strong&gt; successfully while keeping the cycle time below one second. Wow!&lt;/p&gt;

&lt;h3&gt;
  
  
  Next steps
&lt;/h3&gt;

&lt;p&gt;If you look at the benchmarking procedure closely, you will notice that it concentrates on benchmarking the Zeebe workflow engine itself. If you look at, for example, &lt;a href="https://docs.camunda.io/docs/components/best-practices/operations/reporting-about-processes/#history-architecture"&gt;the history architecture&lt;/a&gt; of Camunda 8, you can see that there are other components that also need to keep up with this load, most prominently Camunda Operate and Camunda Optimize. Currently, we still have a gap there (which is why the official numbers provided with Camunda 8 are lower than the ones above). To address this, we are currently adding the other components to the benchmark chain and adjusting the metrics again. So a completed PI might, for example, also need to be visible in Operate and Optimize to count.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Running a benchmark to figure out what your Zeebe cluster can do is easy using the &lt;a href="https://github.com/camunda-community-hub/camunda-8-benchmark/"&gt;camunda-8-benchmark community extension&lt;/a&gt;. To get started, you might only need a developer machine that is connected to the internet.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>camunda</category>
      <category>orchestration</category>
      <category>benchmarking</category>
      <category>processautomation</category>
    </item>
    <item>
      <title>Moving from embedded to remote workflow engines</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Tue, 08 Feb 2022 19:24:33 +0000</pubDate>
      <link>https://forem.com/camunda/moving-from-embedded-to-remote-workflow-engines-5f9h</link>
      <guid>https://forem.com/camunda/moving-from-embedded-to-remote-workflow-engines-5f9h</guid>
      <description>&lt;p&gt;For a long time, &lt;a href="https://camunda.com/best-practices/deciding-about-your-stack/"&gt;we have advocated&lt;/a&gt; for an architecture that runs the Camunda workflow engine&lt;a href="https://docs.camunda.org/manual/latest/introduction/architecture/#embedded-process-engine"&gt;embedded into your own Java application&lt;/a&gt;, preferably via &lt;a href="https://github.com/camunda/camunda-bpm-platform/tree/master/spring-boot-starter"&gt;the Camunda Spring Boot Starter&lt;/a&gt;. But over time, we gradually moved away from this default recommendation in favor of a remote engine. In &lt;a href="https://docs.camunda.io/docs/components/zeebe/zeebe-overview/"&gt;Zeebe&lt;/a&gt;, we don’t support embedding the engine at all.&lt;/p&gt;

&lt;p&gt;In today’s blog post I want to explain the reasoning behind this move and why we recommend a remote engine. However, let’s first understand why the embedded engine was originally an appealing choice and observe what has changed over time. If you don’t care about the development over time, feel free to skip the history lesson and fast forward to the assessment of engine architectures.&lt;/p&gt;

&lt;h3&gt;
  
  
  A little bit of history on engine architecture recommendations
&lt;/h3&gt;

&lt;p&gt;Looking back ten years to 2012, Jakarta EE application servers were popular, and Jakarta EE was still commonly called J2EE. Most Java applications were deployed on such application servers. With Camunda, we consciously provided integration into those application servers by &lt;a href="https://docs.camunda.org/manual/latest/introduction/architecture/#shared-container-managed-process-engine"&gt;providing a container-managed engine&lt;/a&gt;, even for IBM WebSphere, which was used in big corporate accounts. This integration was huge because it allowed developers to focus on developing their process solutions without fiddling with how to configure the workflow engine, how to get transactions to work, etc. And we had great integrations: the workflow engine could easily leverage thread pools or transactions managed by the application server. At this time, the container-managed engine was the default.&lt;/p&gt;

&lt;p&gt;But there were the Spring rebels fighting J2EE (recall the 2004 book that laid the groundwork for the success of Spring, &lt;a href="https://www.wiley.com/en-us/Expert+One+on+One+J2EE+Development+without+EJB-p-9780764573903"&gt;&lt;em&gt;J2EE Development without EJB&lt;/em&gt;&lt;/a&gt;). Those folks used Spring and deployed on Tomcat. While there was a container-managed engine for Tomcat, it turned out that users did not like to fiddle with the Tomcat installation itself, but rather wanted to create one self-contained deployment (including the embedded workflow engine) they could put on any standard Tomcat. This worked much better in corporate environments, where a default Tomcat could be provisioned for you, but this Tomcat could not be customized.&lt;/p&gt;

&lt;p&gt;While this was a popular model, it bore some problems, for example classloading issues when deploying your application next to Camunda’s web applications. It further left you with a weird mixture of configuring some things in Tomcat and others in the application. All of this contributed to the rise of Spring Boot, where an application is completely self-contained. This also coincided with ideas around microservices and the rise of containers.&lt;/p&gt;

&lt;p&gt;Around 2015, &lt;a href="https://github.com/camunda/camunda-bpm-spring-boot-starter/graphs/contributors"&gt;a group of enthusiastic people&lt;/a&gt; created the Spring Boot starter for Camunda, which &lt;a href="https://camunda.com/blog/2017/11/camunda-spring-boot-starter-230-released/"&gt;made it into the official product already in 2017&lt;/a&gt; given the big uptake by the community. Soon it became the default recommendation for new projects. A typical architecture looked like &lt;a href="https://camunda.com/best-practices/using-a-greenfield-stack/#_understanding_the_stack_s_architecture"&gt;our greenfield stack recommendation&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hRyrLUGB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AhuHYt31utX7ulsp8" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hRyrLUGB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AhuHYt31utX7ulsp8" alt="" width="800" height="684"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But at the same time, &lt;a href="https://camunda.com/blog/2015/11/external-tasks/"&gt;the external task pattern emerged&lt;/a&gt; and quickly gained traction. One important factor in its success was that most process solutions at that time grew more complex and were no longer one self-contained application. Rather, they orchestrated remote endpoints and became part of a distributed system (for example in microservices architectures). In recent months, we &lt;a href="https://dev.to/mary_grace/spring-boot-starter-for-the-external-task-client-coc-temp-slug-9701572"&gt;made some effort to make external tasks the default programming model&lt;/a&gt;, even if we are still &lt;a href="https://github.com/camunda-community-hub/camunda-engine-rest-client-java/"&gt;lacking some convenience&lt;/a&gt;. However, this can be built. Looking at Zeebe as the workflow engine within Camunda Cloud, it only knows the external task pattern and comes with great convenience for the developer.&lt;/p&gt;

&lt;p&gt;Using external tasks, you can easily provision the engine remotely instead of embedding it. But why is this a good idea now?&lt;/p&gt;
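&lt;p&gt;The essence of the external task pattern can be sketched in a few lines of plain Java (hypothetical names and a stubbed engine, not the actual Camunda client API): the engine only offers work, and the worker polls, locks, executes, and completes it at its own pace.&lt;/p&gt;

```java
// A stub standing in for a remote engine: it holds open tasks and hands
// them out on request. Tasks simply wait if no worker polls.
class EngineStub {
    private final String[] tasks = { "task-1", "task-2", "task-3" };
    private int next = 0;
    private int completed = 0;

    // fetchAndLock: hand out the next open task, or null if none is open
    String fetchAndLock() {
        if (next >= tasks.length) return null;
        String task = tasks[next];
        next = next + 1;
        return task;
    }

    void complete(String task) { completed = completed + 1; }
    int completedCount() { return completed; }
}

public class ExternalTaskWorkerSketch {
    public static void main(String[] args) {
        EngineStub engine = new EngineStub();
        // The worker drives the interaction: poll until no work is left.
        // If the worker were offline for a while, the tasks would simply
        // wait in the engine (temporal decoupling).
        while (true) {
            String task = engine.fetchAndLock();
            if (task == null) break;
            // ... execute the actual business logic here ...
            engine.complete(task);
        }
        System.out.println(engine.completedCount()); // 3
    }
}
```

&lt;p&gt;Because the worker initiates every call, it can be written in any language that speaks the engine’s API, run behind firewalls, and scale independently.&lt;/p&gt;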

&lt;p&gt;The move to more distributed systems is flanked by trends like Docker, Kubernetes, and the overall move to the cloud. This is all interesting because it makes it easy to consume the resources your application needs as a service. Do you need the capability “relational database”? You simply provision one, either in the cloud or via one Docker command. Gone are the times when you had to install something manually. It is all automated, reproducible, easy, and reliable.&lt;/p&gt;

&lt;p&gt;So when you need the capability “workflow engine,” you can also simply provision one. This is even easier than embedding it for most scenarios.&lt;/p&gt;

&lt;p&gt;So in 2020, we defined a new distribution that focused solely on this use case: &lt;a href="https://camunda.com/blog/2020/03/introducing-camunda-bpm-run/"&gt;Camunda Run&lt;/a&gt;. The idea was a self-contained workflow engine that is highly configurable also in a Docker or Kubernetes environment, without the need to understand Java. This allows running the workflow engine as a service. Many bigger customers already do that for their internal development projects, as a kind of internal cloud. We are also providing our own cloud service by now (based on Zeebe).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RL8oLmkl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/590/0%2Abt6PyYjh-UlMz6xm" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RL8oLmkl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/590/0%2Abt6PyYjh-UlMz6xm" alt="" width="590" height="177"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The current recommendation
&lt;/h3&gt;

&lt;p&gt;Now we are switching our default recommendation for Camunda Platform 7 towards using a &lt;a href="https://docs.camunda.org/manual/latest/introduction/architecture/#standalone-remote-process-engine-server"&gt;remote engine&lt;/a&gt;, more concretely Camunda Run, as the workflow engine, combined with external tasks and the REST API (typically wrapped into a client for your programming language). You might find &lt;a href="https://github.com/camunda-community-hub/camunda-engine-rest-client-java/"&gt;this community extension&lt;/a&gt; helpful if you develop in Java. The upcoming greenfield stack recommendation will look like the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mdhPfkWp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2ALqaiwU919mB0XCQA" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mdhPfkWp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2ALqaiwU919mB0XCQA" alt="" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This stack is also close to what you would use in Camunda Cloud:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ljh-ZX99--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AH914iToStuXsD8m5" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ljh-ZX99--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AH914iToStuXsD8m5" alt="" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The remote engine approach makes it much easier to switch between the two stacks. It allows your organization to focus on one architectural style. And if you ever think about &lt;a href="https://docs.camunda.io/docs/guides/migrating-from-Camunda-Platform/#prepare-for-smooth-migrations"&gt;migrating from Camunda Platform 7 to Camunda Cloud&lt;/a&gt; someday in the future, this stack will make it much easier for you.&lt;/p&gt;

&lt;p&gt;Next, let’s examine the pros and cons of the embedded engine in the light of today.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses of the embedded engine
&lt;/h3&gt;

&lt;p&gt;Let’s look at concrete issues we previously experienced with the embedded engine, exposing weaknesses of this model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No isolation&lt;/strong&gt; between the engine and the application, meaning:&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Troubleshooting gets harder&lt;/strong&gt;: In many support cases, all people involved needed significant time to investigate and understand the current architecture and configuration, which not only bound resources on both ends but also delayed the problem resolution. Problems cannot easily be pinpointed to the engine or the application code, but can be anywhere in between.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Libraries are mixed&lt;/strong&gt;: The application automatically pulls in all dependencies of the workflow engine, possibly even leading to version conflicts that are not always easy to resolve.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensibility weakens stability&lt;/strong&gt;: Applications embedding Camunda had manifold possibilities to influence the core engine behavior. This could affect the stability of the core engine or introduce vulnerabilities that are hard to diagnose.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rebuild and redeployment necessary&lt;/strong&gt;: Workflow engine configuration changes or version updates (even patch versions) might require a complete rebuild and redeployment of the application.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex configurations&lt;/strong&gt;: While it is great that you can influence the thread pool of the engine, for example, it also makes things quite complicated and offers way too many options. And when running multiple engines, for example in bigger organizations, they might all be configured slightly differently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Harder to get started and more Java know-how required&lt;/strong&gt;: If you are a Spring Boot pro, the Spring Boot Starter comes naturally to you. But in any other case, we found it can be confusing. It is much simpler to ask people to run a Docker container or start an engine in the cloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No polyglot environments&lt;/strong&gt;: Embedded engines can support just one programming language. In the case of Camunda, this language is Java. Modern architectures are much more polyglot and should support multiple languages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More load on developers&lt;/strong&gt;: The embedded engine puts the burden of integrating and configuring the workflow engine onto the developer. As developers are a rare species, you should free up their time as much as possible. Additionally, an embedded engine often cannot be configured from the outside to the extent your infrastructure folks would love, e.g. to tweak it for production.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Benefits of a remote engine
&lt;/h3&gt;

&lt;p&gt;The remote engine comes with a few benefits, mostly addressing the weaknesses above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decoupling:&lt;/strong&gt; The workflow engine is provisioned and configured independently of the application and process solution. Problems can be easily pinpointed to one of the components, and vulnerabilities do not spread into other components.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved scaling patterns&lt;/strong&gt;: The workflow engine can be scaled independently of the application code. Camunda can optimize the performance of the core engine, as it has full control of what is running in this scope.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Allow software as a service (SaaS)&lt;/strong&gt;: The workflow engine can be operated as a service for you, either in a public cloud (like Camunda Cloud) or possibly as an in-house service (as some of our customers do). Still, you can develop your application locally or on-premise, as applications can remotely connect to the workflow engine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easier getting started experience&lt;/strong&gt; : You can provision an engine with a simple Docker command and don’t need to mess with configuration in your own application.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To share an anecdote, let’s look at a proof of concept that happened at a big car manufacturer two weeks ago. They used Apache Kafka and MongoDB, and already had a Docker Compose file to start the necessary resources, so it felt absolutely right to add two lines to start Camunda too. Then, they could distribute the work: an infrastructure person looked into wiring Camunda to their PostgreSQL of choice and configuring security, while developers started modeling and executing a BPMN process right away.&lt;/p&gt;
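&lt;p&gt;For illustration only, such a Compose addition might look like the following sketch. The service name, image, and port are assumptions for this example, not details from the actual project:&lt;/p&gt;

```yaml
# Hypothetical docker-compose.yml fragment -- names and versions are
# illustrative, not taken from the project described above.
services:
  zeebe:
    image: camunda/zeebe:latest
    ports:
      - "26500:26500"   # gateway port the client libraries connect to
```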

&lt;p&gt;A remote engine requires you to use the external task pattern, which brings further advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Temporal decoupling:&lt;/strong&gt; Your glue code (what we call workers) can be offline for a while; the work simply waits in the workflow engine until the worker comes back. You get this temporal decoupling without needing a message broker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polyglot architectures (non-Java environments)&lt;/strong&gt;: &lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/delegation-code/#java-delegate"&gt;Java Delegates&lt;/a&gt; work only in Java, but external tasks allow all languages by leveraging REST.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decouple worker implementation&lt;/strong&gt; : The worker is its own software component and not tied to the workflow engine at all. This means it can control many aspects itself. For example, if you need to execute a service task that takes hours (e.g. video transcoding), this is no problem. If you want to limit service invocations to one simultaneous call (e.g. for licensing reasons), you can easily control this.&lt;/li&gt;
&lt;/ul&gt;
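&lt;p&gt;To make the pull idea above concrete, here is a minimal, self-contained Java sketch. The in-memory queue merely stands in for the engine’s job store; a real setup would use a Camunda client library, and real workers long-poll instead of checking once:&lt;/p&gt;

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Simulation of the external task idea: the engine publishes jobs to a
// store, and workers pull them whenever they are ready. A worker being
// offline does not lose work -- the job simply waits in the store.
public class ExternalTaskSketch {

    // The "engine" publishes a job; it does not care whether a worker
    // is online right now (temporal decoupling).
    static boolean publish(BlockingQueue jobStore, String jobType) {
        return jobStore.offer(jobType);
    }

    // A worker that comes online later still finds the job waiting.
    // Returns null if there is currently nothing to do.
    static String pollOnce(BlockingQueue jobStore) {
        return (String) jobStore.poll();
    }

    public static void main(String[] args) {
        BlockingQueue jobStore = new ArrayBlockingQueue(10);
        publish(jobStore, "charge-credit-card");
        // ... time passes, the worker was offline meanwhile ...
        System.out.println("worker completed: " + pollOnce(jobStore));
    }
}
```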

&lt;p&gt;Of course, you can also use external tasks with an embedded engine, basically just talking to your own application via REST. While this might be an interesting stepping stone to prepare your architecture for a remote engine without actually provisioning one, it is relatively rare in reality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Some myths around remote engines
&lt;/h3&gt;

&lt;p&gt;You might want to have a look at “&lt;a href="https://dev.to/berndruecker/how-to-write-glue-code-without-java-delegates-in-camunda-cloud-4hb3-temp-slug-9314703"&gt;How to write glue code without Java Delegates in Camunda Cloud&lt;/a&gt;,” debunking some common myths related to working with external tasks in the remote engine setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can still call your service endpoints via any protocol (e.g. REST, AMQP, Kafka).&lt;/li&gt;
&lt;li&gt;You can have all worker code in one Java application.&lt;/li&gt;
&lt;li&gt;The programming model looks surprisingly similar to Java Delegates when using Spring.&lt;/li&gt;
&lt;li&gt;Exception handling can still be delegated to the workflow engine.&lt;/li&gt;
&lt;li&gt;The performance overhead in terms of latency is small.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Challenges of a remote engine
&lt;/h3&gt;

&lt;p&gt;Of course, there are also typical challenges developers face with the remote engine setup. Let’s look at those and how they are typically addressed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Convenience in programming model&lt;/strong&gt; : Remote communication via REST or gRPC is less convenient for developers than simply using a client library in their programming language. This can be mitigated by providing proper client libraries that hide details of the remote communication, like Zeebe does for example &lt;a href="https://github.com/camunda-community-hub/spring-zeebe/"&gt;for Spring&lt;/a&gt;, &lt;a href="https://docs.camunda.io/docs/apis-clients/java-client/index/"&gt;Java, or others&lt;/a&gt;. “&lt;a href="https://dev.to/berndruecker/how-to-write-glue-code-without-java-delegates-in-camunda-cloud-4hb3-temp-slug-9314703"&gt;How to write glue code without Java Delegates in Camunda Cloud&lt;/a&gt;” gives you a good overview of what I mean.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transactions in remote communication&lt;/strong&gt; : With a remote engine, you cannot share transactions between your code and the workflow engine. I dedicated a blog post to this topic: “&lt;a href="https://dev.to/berndruecker/achieving-consistency-without-transaction-managers-19ma-temp-slug-5803306"&gt;Achieving consistency without transaction managers&lt;/a&gt;.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Running a separate resource&lt;/strong&gt; : The remote engine is its own resource, program, or container that you need to operate for your own application to run. Thanks to cloud services and Docker, this is solved very easily nowadays and seems to be less of a problem than it was years ago. You could also simply download and unpack the workflow engine locally and start it on your own computer; you just need Java installed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writing unit tests&lt;/strong&gt; : There is a specific challenge here: you want unit tests to run self-contained, without any dependency on the environment. This can generally be solved with the &lt;a href="https://www.testcontainers.org/"&gt;Testcontainers project&lt;/a&gt;, but Zeebe, for example, also provides a “mini engine” that can run in-process in a JUnit test, eliminating this problem completely in the Java world. Refer to &lt;a href="https://github.com/camunda-cloud/zeebe-process-test/"&gt;zeebe-process-test&lt;/a&gt; for details. Other programming languages might follow.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;The remote engine mode will be the default for future process solutions. It gives you a lot of advantages and fits modern ways of building software architecture. We want to provide the same convenience and a comparable programming model for developers, especially in Camunda Cloud, but we will also catch up in Camunda Platform 7. I hope this blog post has assured you that this is not only a conscious but also a very sound decision. If not, reach out to me or &lt;a href="https://forum.camunda.io/"&gt;ask in the forum&lt;/a&gt; any time.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>camunda</category>
      <category>workflowengine</category>
    </item>
    <item>
      <title>How to write glue code without Java Delegates in Camunda Cloud</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Tue, 08 Feb 2022 19:24:00 +0000</pubDate>
      <link>https://forem.com/camunda/how-to-write-glue-code-without-java-delegates-in-camunda-cloud-39hg</link>
      <guid>https://forem.com/camunda/how-to-write-glue-code-without-java-delegates-in-camunda-cloud-39hg</guid>
      <description>&lt;p&gt;Introduced in 2015, the &lt;a href="https://camunda.com/blog/2015/11/external-tasks/"&gt;external task pattern&lt;/a&gt; is on the rise. Instead of the workflow engine actively calling some code (push), the external task pattern adds the work in a sort of queue and lets workers pull for it. This method is also known as publish/subscribe. The workflow engine publishes work, and workers subscribe to be able to do it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k7Y10WNW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/681/0%2AWtK_PfIw0fz737Sy" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k7Y10WNW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/681/0%2AWtK_PfIw0fz737Sy" alt="" width="681" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Within Camunda Platform 7, we are working on making the external task pattern the default recommendation, and for Camunda Cloud, it is the only way of writing glue code. Specifically, Java Delegates are no longer possible in Camunda Cloud. This sometimes leaves people puzzled, so this blog post will answer why this is not a problem and dive into the benefits you can gain from external tasks. It will also debunk some myths about this pattern and clarify that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can still call your service endpoints via any protocol (e.g. REST, AMQP, Kafka).&lt;/li&gt;
&lt;li&gt;You can have all worker code in one Java application.&lt;/li&gt;
&lt;li&gt;The programming model looks surprisingly similar to JavaDelegates when using Spring.&lt;/li&gt;
&lt;li&gt;Exception handling can still be delegated to the workflow engine.&lt;/li&gt;
&lt;li&gt;The performance overhead in terms of latency is small.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Architecture considerations
&lt;/h3&gt;

&lt;p&gt;Let’s debunk two architectural myths of external tasks.&lt;/p&gt;

&lt;p&gt;First, applying external tasks does not necessarily mean services you formerly called via REST now need to fetch their work themselves. While this is an architectural option, it is not the typical case. Let’s look at an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lj4e5HX4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/781/0%2ARB4DrW9m3pvG2t1o" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lj4e5HX4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/781/0%2ARB4DrW9m3pvG2t1o" alt="" width="781" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will likely implement a worker that still does the REST call towards the payment microservice (left side of the illustration above). This is the API the microservice exposes, and it should be used. The worker is in the scope of the order fulfillment process solution or microservice. Nobody outside of the order fulfillment team even needs to know that Camunda or an external task worker is used at all.&lt;/p&gt;

&lt;p&gt;Compare that to the solution on the right side of the example, where the payment microservice directly fetches its work from Camunda. In this case, Camunda is the middleware used for various microservices to communicate amongst each other. While this is feasible and has its upsides, I have rarely seen it in the wild. You can read a &lt;a href="https://github.com/berndruecker/flowing-retail/tree/master/zeebe#does-zeebe-complement-or-replace-middleware"&gt;further discussion of the differences&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The second myth is that you have to write multiple applications if you have multiple service tasks, one for every external task worker. While you can separate workers into multiple applications, this is rare. It is much more common to run all (or at least most) of your workers in one application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Waod8rv9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/728/0%2Azloqy51QyBtBz5uS" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Waod8rv9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/728/0%2Azloqy51QyBtBz5uS" alt="" width="728" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This application belongs logically to the process solution and registers a worker for every external task. This process solution can also auto-deploy the process model.&lt;/p&gt;
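&lt;p&gt;A minimal sketch of this idea, assuming nothing about the real client API: one application keeps a registry that maps each external task type to its handler. In an actual Camunda client the registration happens via topic subscriptions; the map below is just a stand-in:&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

// One application hosting all workers of a process solution: each
// external task type is mapped to a handler. The registry is a
// simplification of what client libraries do via topic subscriptions.
public class WorkerRegistry {

    interface JobHandler {
        String handle(String payload);
    }

    static final Map handlers = new HashMap();

    static void register(String taskType, JobHandler handler) {
        handlers.put(taskType, handler);
    }

    static String dispatch(String taskType, String payload) {
        JobHandler handler = (JobHandler) handlers.get(taskType);
        return handler.handle(payload);
    }

    public static void main(String[] args) {
        // All workers live in this one application.
        register("retrieve-payment", payload -> "paid:" + payload);
        register("ship-goods", payload -> "shipped:" + payload);
        System.out.println(dispatch("retrieve-payment", "order-42"));
    }
}
```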

&lt;h3&gt;
  
  
  Writing glue code
&lt;/h3&gt;

&lt;p&gt;This leads us to the question of how to write glue code. Within that realm, there is another myth: it must be complicated because there is remote communication involved. The good news is that this is not necessarily true for Camunda Cloud, as there are programming language clients that provide a great developer experience. For example, using the &lt;a href="https://github.com/zeebe-io/spring-zeebe/"&gt;Spring integration&lt;/a&gt;, you can write worker code like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/60c54cb9c45dec8d12b937ffa07a7b65/href"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/60c54cb9c45dec8d12b937ffa07a7b65/href"&gt;https://medium.com/media/60c54cb9c45dec8d12b937ffa07a7b65/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you compare this code to a JavaDelegate, it looks surprisingly similar. We even created a &lt;a href="https://github.com/camunda-community-hub/camunda-platform-to-cloud-migration/blob/main/camunda-platform-to-cloud-adapter/readme.md"&gt;community extension&lt;/a&gt; containing an adapter to reuse existing JavaDelegates for Camunda Cloud. While I would not necessarily recommend doing this as it’s better to migrate your classes manually, it nicely shows that this is conceptually not too hard.&lt;/p&gt;

&lt;p&gt;That said, there are some things you can do in JavaDelegates that are no longer possible in external task workers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access workflow engine internals&lt;/li&gt;
&lt;li&gt;Influence the workflow engine behavior&lt;/li&gt;
&lt;li&gt;Integrate with thread pools or transactions of the workflow engine&lt;/li&gt;
&lt;li&gt;Dirty hacks using Threadlocals or the like&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In general, I feel it’s a good thing you cannot do these things anymore, as they regularly lead teams into trouble.&lt;/p&gt;

&lt;p&gt;Note that we are also working on increasing the convenience of external tasks with Camunda Platform 7, and just started &lt;a href="https://github.com/camunda-community-hub/camunda-engine-rest-client-java/"&gt;this community extension&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling exceptions
&lt;/h3&gt;

&lt;p&gt;When writing glue code, you can also pass problems within your worker to the workflow engine to handle them. For example, the workflow engine can trigger retrying or raising an incident in the operations tooling. The code is pretty straightforward, and yet again quite comparable to JavaDelegates:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/1d0245b201a8190aba33e6a12fbb1c02/href"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/1d0245b201a8190aba33e6a12fbb1c02/href"&gt;https://medium.com/media/1d0245b201a8190aba33e6a12fbb1c02/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/camunda-community-hub/spring-zeebe#completing-the-job"&gt;Read more on this in the spring-zeebe docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, there is one important failure case not yet sufficiently handled: what if a worker crashes and does not fetch any work anymore? Currently, Camunda Cloud recognizes this only indirectly, by service tasks not being processed for too long. Ideally, the workflow engine itself would recognize that work is no longer being fetched and indicate this in the operations tooling. We are currently looking into this feature. Until then, you can rely on typical systems monitoring to detect a crashed Java worker application.&lt;/p&gt;

&lt;h3&gt;
  
  
  And transactions?
&lt;/h3&gt;

&lt;p&gt;Similarly, how do you achieve consistency between your business code and the workflow engine? What if any of the components fail? With JavaDelegates, many users delegated these problems to the transaction manager, often without knowing it. Please refer to the blog post about &lt;a href="https://docs.google.com/document/d/1rJ-M3i1NRTCx_l4f7e7Loav2_S4-i5jcndddld-ft1Y/edit#heading=h.rosrjvk224ip"&gt;achieving consistency without transaction managers&lt;/a&gt; for how to handle this with external tasks, but also to understand why this is a preferable mental model today.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency of remote communication
&lt;/h3&gt;

&lt;p&gt;One last myth I want to address in this post is that remote workers need to be “slow.” Often in such discussions, slow is not further defined, but looking at Camunda Platform 7, depending on the configuration of the &lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/the-job-executor/"&gt;job executor&lt;/a&gt;, it really can take seconds for an external task to be picked up (which can be optimized, by the way). In Camunda Cloud, the whole interaction is optimized from the ground up, so only a bit of latency for the remote communication is added. In a recent experiment, I &lt;a href="https://github.com/berndruecker/camunda-cloud-documentation/blob/best-practices/docs/components/best-practices/architecture/sizing-your-environment.md#latency-and-cycle-time"&gt;measured the overhead of a remote worker to be roughly 50ms&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j9bmK3eA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AC9Ki2bvK5vPZ0DnM" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j9bmK3eA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AC9Ki2bvK5vPZ0DnM" alt="" width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In most projects, this is not a problem at all, especially as it does not affect the throughput of the workflow engine. In other words, you can still process the same number of process instances; they simply require 50ms longer per service task. Note that we are further optimizing this number for the low-latency scenarios we are seeing among customers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;As you can see, you have a programming model that is as convenient as JavaDelegates. At the same time, you have code that is properly isolated from the workflow engine (&lt;a href="https://blog.bernd-ruecker.com/moving-from-embedded-to-remote-workflow-engines-8472992cc371"&gt;moving from embedded to remote workflow engines&lt;/a&gt; dives into all advantages of the remote engine setup).&lt;/p&gt;

&lt;p&gt;This is why I am personally so excited about switching to external tasks as our default glue code pattern. If you are not convinced or still have questions, please reach out to me or &lt;a href="https://forum.camunda.io/"&gt;ask in the forum&lt;/a&gt; at any time.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>camunda</category>
      <category>processautomation</category>
      <category>workflowengine</category>
    </item>
    <item>
      <title>Achieving consistency without transaction managers</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Tue, 08 Feb 2022 19:23:24 +0000</pubDate>
      <link>https://forem.com/camunda/achieving-consistency-without-transaction-managers-55kc</link>
      <guid>https://forem.com/camunda/achieving-consistency-without-transaction-managers-55kc</guid>
      <description>&lt;p&gt;Do you need to integrate multiple components without the help of &lt;a href="https://en.wikipedia.org/wiki/ACID"&gt;ACID&lt;/a&gt; (atomicity, consistency, isolation, and durability) transaction managers? Then, this blog post is for you.&lt;/p&gt;

&lt;p&gt;I will first briefly explain what transaction managers are and why you might not have them at your disposal in modern architecture. I will also sketch a solution for how to work without transaction managers in general, but will also look at the project I know best as a concrete example: the &lt;a href="https://camunda.com/"&gt;Camunda workflow engine&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s a transaction manager?
&lt;/h3&gt;

&lt;p&gt;You may know transactions from accessing relational databases. For example, in Java you could write (using Spring) the following code, using a relational database for payments and orders underneath:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@Transactional
public void paymentReceived(...) {
  // ...
  paymentRepository.save( payment );
  orderService.markOrderPaid( payment.getOrderId(), payment.getId() );
  // ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spring will make sure one transaction manager is used, which ensures one atomic operation. If there is any problem or exception, neither the payment is saved nor the order is marked as paid. This is also known as strong consistency, as on the database level you ensure the state is always consistent. There cannot be any order marked as paid that is not saved in the payment table.&lt;/p&gt;

&lt;p&gt;In Java, the &lt;a href="https://docs.spring.io/spring-framework/docs/4.2.x/spring-framework-reference/html/transaction.html"&gt;Spring abstractions for transaction management&lt;/a&gt; are pretty common, as is &lt;a href="https://www.baeldung.com/jee-jta"&gt;Jakarta Transactions (JTA)&lt;/a&gt;. These allow you to use annotations such as @Transactional, @Required, or @RequiresNew to simply mark transaction boundaries. Some more explanations can be found in the &lt;a href="https://docs.oracle.com/javaee/6/tutorial/doc/bncij.html"&gt;Java EE 6 JTA tutorial&lt;/a&gt;. This is a very convenient programming model, and as such, it was considered a best practice for a decade of writing software.&lt;/p&gt;

&lt;p&gt;In the Camunda workflow engine (referring to version 7.x in this post), we also leverage transaction managers. Since you can run the workflow engine embedded as a library in your own application, it allows the following design (which is actually quite common amongst Camunda users): the workflow engine shares the database connection and transaction manager with the business logic. Java code (implemented as so-called JavaDelegates) is executed directly in the context of the workflow engine, invoking your business code, all via local method calls within one Java virtual machine (JVM).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uHE7pJOq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/728/0%2AWP13JR_MJmCZ_biM" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uHE7pJOq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/728/0%2AWP13JR_MJmCZ_biM" alt="" width="728" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This setup allows the workflow engine to join a common atomic transaction with the business logic. If anything in any component fails, everything is rolled back.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do not rely on transaction managers too much!
&lt;/h3&gt;

&lt;p&gt;This sounds like a great design, so why shouldn’t you rely on transaction managers too much in your system? The problem with transaction managers is that they only work well in one special case: when everything is pinned to a single physical database.&lt;/p&gt;

&lt;p&gt;As soon as you store workflow engine data in a separate database or you have separate databases for the payment microservice and the order fulfillment microservice, the transaction manager cannot leverage database transactions anymore. Then, the trouble starts. You might have heard otherwise by using distributed transactions and two-phase commits. However, those protocols should be considered broken. I want to spare you the details, but you can look into my talk, “&lt;a href="https://berndruecker.io/lost-in-transaction/"&gt;Lost in transaction?&lt;/a&gt;” and specifically the paper, &lt;a href="https://www.ics.uci.edu/~cs223/papers/cidr07p15.pdf"&gt;&lt;em&gt;Life beyond Distributed Transactions: an Apostate’s Opinion&lt;/em&gt;&lt;/a&gt; if you are curious.&lt;/p&gt;

&lt;p&gt;To summarize, you should assume that technical transactions cannot combine multiple distributed resources like two physical databases, a database and a messaging system, or simply two microservices.&lt;/p&gt;

&lt;p&gt;Almost every system needs to leave that cozy place of just interacting with one physical database. Do you talk to some remote service via REST? No transaction manager will help you. Do you want to use Apache Kafka? No transaction manager. Do you send messages via AMQP? Well, you might have guessed, no transaction manager (if you stick to my statement above, that distributed transactions don’t work).&lt;/p&gt;

&lt;p&gt;The point is, in modern architectures, I would consider having a transaction manager at your disposal a lucky exception. This means that you have to know the strategies to live without one anyway. The question is not if you will need those strategies, but where you apply them.&lt;/p&gt;

&lt;p&gt;With that background, let’s go back to the Camunda example. Assume your payment logic is a separate microservice that is called via REST. Now, the picture looks different as you will run two separate technical transactions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K38IU2rL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/739/0%2AhACTLEZO_bk6l8Db" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K38IU2rL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/739/0%2AhACTLEZO_bk6l8Db" alt="" width="739" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I will look into failure scenarios below, but due to recent &lt;a href="https://blog.bernd-ruecker.com/how-to-write-glue-code-without-java-delegates-in-camunda-cloud-9ec0495d2ba5"&gt;discussions around the so-called external task pattern in Camunda&lt;/a&gt;, I want to make one further point. With external tasks, the workflow engine no longer directly invokes Java code. Instead, its own worker thread subscribes to work and executes it separately from the engine context. As we also &lt;a href="https://blog.bernd-ruecker.com/moving-from-embedded-to-remote-workflow-engines-8472992cc371"&gt;encourage Camunda users to run a remote engine&lt;/a&gt;, communication is implemented via REST. The worker does not share the transaction with the workflow engine anymore, so the picture will look slightly different:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c9S2-cnJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/739/0%2AIRffv3fxiayMhbBC" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c9S2-cnJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/739/0%2AIRffv3fxiayMhbBC" alt="" width="739" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While this may appear more complicated, it really isn’t. Let’s examine this as we discuss the strategies to handle situations without transaction managers in the section below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Living without a transaction manager
&lt;/h3&gt;

&lt;p&gt;Every time you cross a transaction boundary, the transaction manager does not solve potential failure scenarios for you. This means you must handle those yourself as described below. Visit my talk on “&lt;a href="https://berndruecker.io/3-pitfalls-in-microservice-integration/"&gt;3 common pitfalls in microservice integration and how to avoid them&lt;/a&gt;” for more details on these problems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dOCsVwns--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/385/0%2AwsOj2c8V_z-SEvx5" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dOCsVwns--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/385/0%2AwsOj2c8V_z-SEvx5" alt="" width="385" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are basically five possible failure scenarios when two components interact:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Component A fails before it invokes the other component. Local rollback in component A. &lt;strong&gt;No problem&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The (network) connection to component B fails and A gets an exception: local rollback in component A. It’s possible the connection might have succeeded and B might have committed some changes. &lt;strong&gt;Potential inconsistency&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Component B fails: local rollback in B, exception is sent to A and leads to a rollback in A too. &lt;strong&gt;No problem&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Connection problem while delivering the result to A: component A does not know what happened in B and needs to assume failure of B. &lt;strong&gt;Potential inconsistency&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Component A received the result that B already committed, but cannot commit its local transaction because of problems. &lt;strong&gt;Inconsistency&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Connection problems also translate to applications crashing at the wrong moment; you will end up with the same scenarios. This is the reality you need to face. The great news is that solving those scenarios is actually not much of a problem in most cases. The most common strategy is &lt;strong&gt;retrying&lt;/strong&gt;, and it leverages so-called &lt;strong&gt;at-least-once semantics&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrying and at-least once semantics
&lt;/h3&gt;

&lt;p&gt;What this means is that whenever component A is in doubt about what just happened, it retries whatever it just did until it gets a clear result that it can also commit locally. This is the only scenario in which component A can be sure that B also did its work. This strategy ensures component B is called at least once; it cannot happen that B is never called without anybody noticing. It might, however, be called multiple times. Because of the latter, component B must provide &lt;a href="https://en.wikipedia.org/wiki/Idempotence"&gt;idempotency&lt;/a&gt;.&lt;/p&gt;
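&lt;p&gt;The interplay of retrying and idempotency can be sketched in a few lines of self-contained Java. Component A retries with the same request key whenever it is in doubt; component B deduplicates by that key, so being called twice causes no harm:&lt;/p&gt;

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of at-least-once delivery with an idempotent receiver: the
// caller retries until it gets a definite answer, so the receiver may
// see the same request twice and must deduplicate by key.
public class AtLeastOnceSketch {

    static final Set processedKeys = new HashSet();

    // Component B: a repeated key is acknowledged but not processed again.
    static String receive(String requestKey) {
        if (processedKeys.contains(requestKey)) {
            return "duplicate-ignored";
        }
        processedKeys.add(requestKey);
        return "processed";
    }

    public static void main(String[] args) {
        // Component A is in doubt after a network error and simply
        // retries with the same key.
        System.out.println(receive("payment-4711"));
        System.out.println(receive("payment-4711"));
    }
}
```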

&lt;p&gt;In the Camunda JavaDelegate example, this strategy can be easily applied. If a JavaDelegate calls a REST endpoint, it will retry this call until it successfully returns a result (which might also be an error message, but it must be a valid HTTP response that clearly indicates the state of the payment service). There is &lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/the-job-executor/#retry-time-cycle-configuration"&gt;built-in functionality in the workflow engine for this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Looking at the Camunda external task example, it can be applied in the same way, just on two levels. The external task worker will retry the REST call until it receives a proper response from the payment service and will forward that result to the workflow engine successfully. The only difference here is that network problems could occur on two connections, and crashes could happen in three components instead of two, but all of this does not actually change much. The design still makes sure the payment service will be called at least once. There is also &lt;a href="https://docs.camunda.org/manual/latest/user-guide/process-engine/external-tasks/#reporting-task-failure"&gt;built-in functionality in the workflow engine for retrying external tasks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a general rule of thumb, you will typically apply at-least-once semantics to most calls that leave your transaction boundary. This also means there are moments of inconsistency in every distributed system (e.g. because component B successfully committed something, but A does not know this yet). Moments of inconsistency are simply unavoidable in complex systems, and they are actually not much of a problem when you face them head-on. The term coined for this behavior is &lt;strong&gt;eventual consistency&lt;/strong&gt;, and it should actually be embraced in every architecture with a certain degree of complexity (but that is a topic for its own blog post).&lt;/p&gt;

&lt;p&gt;One further remark here: providing reliable retries typically involves some form of persistence. This is something you get with Camunda automatically, but you could also leverage messaging (e.g. RabbitMQ or Apache Kafka) for this.&lt;/p&gt;

&lt;p&gt;To summarize, eventual consistency, retrying, and at-least once semantics are concepts that are important to understand anyway.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anecdotes from Jakarta EE
&lt;/h3&gt;

&lt;p&gt;Ten years ago, I worked on a lot of projects leveraging JTA. One common problem we faced (and &lt;a href="https://www.google.com/search?q=camunda+rollbackonly"&gt;still regularly discuss amongst Camunda users&lt;/a&gt;) was the following: the workflow engine and some business logic (back then implemented as Enterprise Java Beans) share one transaction. Any component could mark the transaction as failed (“rollback only”). This cannot be undone. No component in the whole chain can commit anything afterwards, which is what you want with atomic operations.&lt;/p&gt;

&lt;p&gt;However, there are situations where you still want to commit something. For example, you might want to handle certain errors by writing some business-level logging, decrementing retries in the workflow engine, or triggering &lt;a href="https://docs.camunda.io/docs/components/modeler/bpmn/error-events/error-events/"&gt;error events in a BPMN process&lt;/a&gt;. You cannot do any of those things within the current transaction, as it is already marked for rollback.&lt;/p&gt;

&lt;p&gt;The only way to get around this is to run a separate Enterprise Java Bean configured to open a separate transaction (“requires new”). While this might be easy in your own code, you cannot easily get such behavior in a product like the Camunda workflow engine, which is built to operate in many different transactional scenarios.&lt;/p&gt;
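&lt;p&gt;As an illustration only (the bean and method names are made up, and it assumes a Jakarta EE container; older versions used the &lt;code&gt;javax.ejb&lt;/code&gt; package), such a workaround could look like this:&lt;/p&gt;

```java
import jakarta.ejb.Stateless;
import jakarta.ejb.TransactionAttribute;
import jakarta.ejb.TransactionAttributeType;

// Hypothetical bean for business-level error logging. REQUIRES_NEW suspends
// the caller's transaction (even if it is marked rollback-only) and opens a
// fresh one, so the work done here can still be committed.
@Stateless
public class ErrorLogBean {

    @TransactionAttribute(TransactionAttributeType.REQUIRES_NEW)
    public void logBusinessError(String processInstanceId, String message) {
        // persist the log entry here; it survives the surrounding rollback
    }
}
```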

&lt;p&gt;Even if there are ways to solve this, the scenario still shows a couple of things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to understand transactional behavior.&lt;/li&gt;
&lt;li&gt;Influencing transactional behavior gets hard if it is abstracted away from the developer.&lt;/li&gt;
&lt;li&gt;Failure situations can get hard to diagnose.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The lesson to learn here is: do not share transaction managers between components, but tackle the potential pitfalls head-on. Embracing eventual consistency might be better than relying on transaction managers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further reading
&lt;/h3&gt;

&lt;p&gt;To dive deeper into transactions and eventual consistency, I also recommend looking at advanced strategies to handle inconsistencies on the business level. For example, read chapter nine of &lt;a href="https://processautomationbook.com/"&gt;&lt;em&gt;Practical Process Automation&lt;/em&gt;&lt;/a&gt;, and review the &lt;a href="https://camunda.com/blog/2018/08/bpmn-microservices-orchestration-part-2-graphical-models/"&gt;saga pattern&lt;/a&gt;, which is especially interesting. It demonstrates that a lot of consistency decisions should actually be elevated to the business level. As a developer, you can hardly decide how bad it is to mark an order as paid even though the payment failed. Just assume you sell cheap online subscriptions: maybe your business is much better off accepting a missing payment once in a while than investing in an engineering effort to avoid such situations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In this post, I showed that eventual consistency (especially retries) and at-least once semantics are important concepts to know. They allow you to integrate multiple components, even beyond the scope of what a transaction manager can handle, which is limited to one physical database.&lt;/p&gt;

&lt;p&gt;Once applied, those strategies enable you to make more flexible decisions in your architecture, like using multiple resources or technologies. For example, you saw that once you call a service via REST, it does not even matter whether a workflow engine like Camunda is used remotely behind that endpoint.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>distributedsystems</category>
      <category>transactions</category>
      <category>consistency</category>
    </item>
    <item>
      <title>The Process Automation Map</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Tue, 21 Dec 2021 14:18:44 +0000</pubDate>
      <link>https://forem.com/camunda/the-process-automation-map-2471</link>
      <guid>https://forem.com/camunda/the-process-automation-map-2471</guid>
      <description>&lt;p&gt;&lt;a href="https://techspective.net/2021/11/22/the-process-automation-map/"&gt;&lt;em&gt;This article was originally posted on techspective&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Imagine your CEO wants you to increase process automation as part of the organization’s push towards becoming a digital enterprise. As a first project, you need to automate the payroll run, which is a manual and tedious process at your company currently. How could you go about this? Should you look into process automation platforms to help out?&lt;/p&gt;

&lt;p&gt;In this case, the decision is not too hard: as thousands of companies have the exact same requirements you have, you can simply buy standard HR software or leverage an off-the-shelf cloud service around payroll. This will quickly and cheaply automate this process for you.&lt;/p&gt;

&lt;p&gt;Empowered by this success, your next process to automate is your company’s core order fulfillment process, also known as order-to-cash. Order fulfillment needs to integrate some really beastly legacy systems, so what do you do now? Buy standard software again, probably customizing it to your specific needs? Leverage one of the low-code tools the industry is raving about? Or apply software engineering methods accelerated by developer-friendly process automation technology?&lt;/p&gt;

&lt;p&gt;This is actually much harder to answer. And guess what: it depends. It depends on various aspects of your situation and the process at hand. To help you with this kind of decision, I created what I call the process automation map (inspired by &lt;a href="https://www.amazon.com/Culture-Map-Breaking-Invisible-Boundaries/dp/1610392507/"&gt;“the culture map” by Erin Meyer&lt;/a&gt;, which is not a prerequisite to understand this article, even if it is definitely worth a read).&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding the Process Automation Map
&lt;/h3&gt;

&lt;p&gt;The process automation map defines the following set of dimensions that can guide you towards the right software solution for process automation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Uniqueness of process&lt;/strong&gt;: Your payroll service is not unique, hence you use off-the-shelf standard software. The order fulfillment process has unique requirements, so a more tailor-made solution is required. Customizing standard software is a middle ground, but often ends in maintenance nightmares, especially with new software releases. Instead, I favor tailor-made processes for the specialties, integrating with standard software.
For tailor-made solutions you need to look at the other dimensions:&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process complexity&lt;/strong&gt;: Your order fulfillment process needs to call out to various systems (e.g. some cloud systems like Salesforce, your mainframe system, and bespoke legacy systems). Additionally, you need to pull in human decision-makers for risky orders and present them with the right information and context to make their decision as quickly as possible in an optimized user interface. These are complexity drivers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process scale&lt;/strong&gt;: You expect hundreds of orders per day, which is a medium scale. But you also know that you plan to run a huge ad campaign in autumn, where traffic hopefully peaks at thousands of orders in a single day, which poses elasticity and stability requirements on the software solution. An extreme example of big scale is one of our customers, &lt;a href="https://page.camunda.com/cclive-2021-goldman-sachs"&gt;automating a process that shall be able to process up to two million payments per hour&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope&lt;/strong&gt;: Your order fulfillment process is a process with multiple steps where you care about their sequence. For example, you have to make sure an order is approved before the money is collected. And it shall only be delivered once it is successfully paid for. This is process automation, and it contrasts with the automation of single tasks, such as automating the human decision above with machine learning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project setup&lt;/strong&gt;: Your order-to-cash process is the critical backbone of your company and needs to run reliably. You will maintain the solution for years to come. So you want a strategic setup of your process automation project.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, you can rate the process automation candidate on all of these dimensions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0wjdd3dQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/624/0%2A5GCE7j9HvI1Zv6OA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0wjdd3dQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/624/0%2A5GCE7j9HvI1Zv6OA.png" alt="" width="624" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Categories
&lt;/h3&gt;

&lt;p&gt;This rating now helps you to determine which solution to pick. For this article, I focus on four solution categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Commercial off-the-shelf software&lt;/strong&gt; providing a ready-to-use implementation for certain common problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Tailor-made solutions&lt;/strong&gt; requiring your own development effort to build out the final solution for the business problem. The development effort can use either low code or pro code tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2a. Low code&lt;/strong&gt;, meaning that non-developers are enabled to build the solution, which is typically reached by a mixture of abstractions of technical details, pre-built components, and graphical wizards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2b. Pro code&lt;/strong&gt;, meaning that software development is happening, but accelerated by tools that solve all problems related to process automation.&lt;/p&gt;

&lt;p&gt;You can read more about these categories in &lt;a href="https://dev.to/berndruecker/understanding-the-process-automation-landscape-did"&gt;Understanding the process automation landscape&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, the payroll process is a standard process. This rating leads you to a quick conclusion: you should go for commercial off-the-shelf software and can mostly ignore the other dimensions.&lt;/p&gt;

&lt;p&gt;In contrast, unique processes require tailor-made solutions. These solutions can be built by low code or pro code tooling. And depending on where your rating tends to be on the map, you should select one or the other. The following illustration gives you a good indication of where the sweet spots for solution categories are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tbwQOKjH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/624/0%2A2NYUQUZDpW6doPTo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tbwQOKjH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/624/0%2A2NYUQUZDpW6doPTo.png" alt="" width="624" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the order fulfillment process, the ratings are placed more on the right-hand side of the map so you are in the realm of pro code (developer-friendly) process automation tooling. These tools allow you to model the order fulfillment process graphically and then add glue code, most often in well-known programming languages like Java, C#, or NodeJS, to integrate it with its surroundings, in your case Salesforce and the bespoke systems via an API (e.g. REST) and your mainframe via custom code.&lt;/p&gt;

&lt;p&gt;Using pro-code techniques allows you to leverage all best practices from software engineering: reusing existing frameworks and libraries, leveraging development environments and version control, applying continuous delivery practices, and so on. These practices have proven that they can deal with high complexity, scalability, and stability requirements very well. Solutions have high quality and maintainability. The process automation platform will simply add capabilities to deal with long-running process flows.&lt;/p&gt;

&lt;p&gt;Let’s also look at the left-hand side of the map with another example. Assume you want to automate your marketing campaign process. This involves defining a campaign, approving it, making the necessary bookings, and assessing the impact it had afterward. Let’s rate this process:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Uniqueness of process:&lt;/strong&gt; While this process is not super unique, the way you decide on campaigns and assess their results means that you need something in addition to the standard marketing tools out there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Process complexity:&lt;/strong&gt; This process is not very complex. It is also fully owned by a handful of people all within the marketing team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Process scale:&lt;/strong&gt; You only run a handful of campaigns a month, which is a low scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Scope:&lt;/strong&gt; Campaigns involve a couple of steps, so it is a process, but only with very simple process logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Project setup:&lt;/strong&gt; As you are just evolving your marketing practice, you expect that the process will also evolve over time. So you might be happy to automate only certain pain points in an ad-hoc fashion, knowing that you will sooner or later replace these pieces of automation again. If they fall apart, it does not do much harm.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FFKIoIYf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/624/0%2A-dN6GrOaMPhXQ8iZ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FFKIoIYf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/624/0%2A-dN6GrOaMPhXQ8iZ.png" alt="" width="624" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time, the ratings tend towards the left-hand side. This is an indicator that low code tooling could work well in your case. Maybe a simple &lt;a href="https://airtable.com/"&gt;Airtable&lt;/a&gt; to list campaigns alongside status flags is a sufficient basis for you. You can then build some low-code integrations that simply trigger emails when something needs to be done.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use The Map in Your Own Projects
&lt;/h3&gt;

&lt;p&gt;There is no one-size-fits-all solution for process automation. This is why you must understand the forces that might drag you towards one or another solution category. I have only scratched the surface, but you might want to have a look at &lt;a href="https://dev.to/berndruecker/exploring-the-process-automation-map-1pld"&gt;“Exploring the Process Automation Map”&lt;/a&gt; for more details on the various dimensions and guidance on how to rate your own process.&lt;/p&gt;

&lt;p&gt;Depending on this rating, the sweet spots help you determine the solution category to look at. In this article software categories were simplified to standard software, low code, and pro code tools, but if you want to dive deeper into solutions categories, it might be worth looking into “&lt;a href="https://www.infoworld.com/article/3617928/understanding-the-process-automation-landscape.html"&gt;understanding the process automation landscape&lt;/a&gt;”.&lt;/p&gt;

&lt;p&gt;Of course, the map will always be an oversimplification. But as long as it guides you through the tooling jungle or helps you find arguments to sell this approach internally, it has earned its place. To help you create your own map, I &lt;a href="https://docs.google.com/presentation/d/1sh6tsSp-q2uz4pmUmAmPDiGo-UUFGBlsANOahCGGqmA/edit?usp=sharing"&gt;uploaded a template slide here&lt;/a&gt;. Feel free to use and distribute it at your own discretion. And I am always happy to receive copies of your own processes rated on the map.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>workflowautomation</category>
      <category>processautomation</category>
      <category>lowcode</category>
      <category>automation</category>
    </item>
    <item>
      <title>Exploring the Process Automation Map</title>
      <dc:creator>Bernd Ruecker</dc:creator>
      <pubDate>Tue, 23 Nov 2021 14:53:07 +0000</pubDate>
      <link>https://forem.com/camunda/exploring-the-process-automation-map-1pld</link>
      <guid>https://forem.com/camunda/exploring-the-process-automation-map-1pld</guid>
      <description>&lt;p&gt;Earlier this year, I introduced the idea of the &lt;a href="https://techspective.net/2021/11/22/the-process-automation-map/"&gt;process automation map&lt;/a&gt;. Over time, it has proven useful in several customer scenarios. In today’s post, I’ll dive deeper into the dimensions of the map to help you rate your processes.&lt;/p&gt;

&lt;p&gt;I recommend reviewing the &lt;a href="https://docs.google.com/document/d/1IKudqsx4PpYqoEFa6_vNL93JzvHmZfAvTw4kouol3BI/edit#"&gt;introduction to the process automation map&lt;/a&gt; first. As a quick recap, the map defines five dimensions on which you can rate processes you plan to automate. This rating will help you select the right solution approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l_2FQd9_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/644/0%2AAzmswYgcY3mJ6rUZ" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l_2FQd9_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/644/0%2AAzmswYgcY3mJ6rUZ" alt="" width="644" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s explore these dimensions one by one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standard processes vs. unique processes
&lt;/h3&gt;

&lt;p&gt;Every organization has standard processes, for example around payroll, tax statements, and absence management. These processes are the same in every company, which is why you can simply buy standard software automating them. For instance, in my own company &lt;a href="https://camunda.com/"&gt;Camunda&lt;/a&gt;, we use Spendesk to manage expenses, automating much of the process around expense management (e.g. payments, receipt collection, approval, reimbursement).&lt;/p&gt;

&lt;p&gt;In contrast, there are likely processes very unique to your company; they require tailor-made solutions. A good example is &lt;a href="https://camunda.com/customer/nasa/"&gt;NASA and their Mars robot&lt;/a&gt;. The processes for handling data from the robot and calculating its movements are pretty unique; very few organizations across the planet do this. In this case, uniqueness is rooted in the fact that NASA has a very unique &lt;strong&gt;business model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But more often, the uniqueness simply comes from a &lt;strong&gt;unique set of IT systems&lt;/strong&gt;, typically because of existing &lt;strong&gt;legacy systems&lt;/strong&gt;. Take, for example, the customer onboarding process in a bank. Even if much of the required functionality is available in the core banking system, a unique set of integration requirements (for example, with your legacy mainframe system) makes the process very unique.&lt;/p&gt;

&lt;p&gt;Now, these three use cases are rated differently on the map. Please note that the exact point on the map is not so important; it is simply a visual aid to discuss direction:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--i06-Jx9N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/776/0%2AnUAUWjke2Yx3vCQy" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i06-Jx9N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/776/0%2AnUAUWjke2Yx3vCQy" alt="" width="776" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tool categories to use are also indicated. For standard processes, you buy standard software. For unique processes, you need tailor-made solutions.&lt;/p&gt;

&lt;p&gt;As a rule of thumb, deviations from the standard are more often the case with core processes, like in the customer onboarding or NASA case, than with support processes, like absence management. The latter are seldom unique enough to justify tailor-made solutions, as deviations rarely make the business more successful (the exception proves the rule, of course).&lt;/p&gt;

&lt;p&gt;But core processes also don’t have to be unique by default. Imagine a small webshop selling sustainable bike helmets made out of coconut fibers (no need to Google, I just made this up.) The product is super innovative, but the core order fulfillment process can be standard; an off-the-shelf Shopify account might be all the company needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tailor-made solutions need a more precise rating
&lt;/h3&gt;

&lt;p&gt;For standard software, you can probably ignore the other dimensions of the map, and you are done. In our example, there is no need to think further about absence management.&lt;/p&gt;

&lt;p&gt;But for tailor-made solutions, you must understand the other four dimensions to select a solution approach for the process at hand. As introduced in the &lt;a href="https://docs.google.com/document/d/1IKudqsx4PpYqoEFa6_vNL93JzvHmZfAvTw4kouol3BI/edit"&gt;Process Automation Map&lt;/a&gt;, the two main solution categories for tailor-made processes are low code or pro code (developer-friendly) tools.&lt;/p&gt;

&lt;p&gt;Low code means non-developers are enabled to build the solution, which is typically reached by a mixture of abstractions of technical details, pre-built components, and graphical wizards. Pro code means software developers are accelerated by tools that solve all problems related to process automation, in addition to proven software engineering best practices. You can read more about tool categories in &lt;a href="https://dev.to/berndruecker/understanding-the-process-automation-landscape-did"&gt;understanding the process automation landscape&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The following image gives you a sneak preview of which solutions have which sweet spot. The following discussion of the remaining dimensions will explore this in more detail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gGjDH9qd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/914/0%2AYrurY7UEeemjbXfv" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gGjDH9qd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/914/0%2AYrurY7UEeemjbXfv" alt="" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Process complexity: simple vs. complex
&lt;/h3&gt;

&lt;p&gt;Processes vary in complexity. For example, I run a personal process around speaking at conferences. Conferences are maintained in Airtable, and some additional Zaps (integration flows in Zapier) automate important parts of my call-for-paper processes. For example, to remind me on Slack when a call-for-papers is about to expire. These processes are relatively simple and deal only with a very limited set of applications, all of them with well-known cloud connectors.&lt;/p&gt;

&lt;p&gt;Compare this to an end-to-end business process, like a tariff change for a telecommunications customer, which not only needs to take complex pricing rules into account, but also has to talk to many different bespoke IT systems: for example, to enter the changes into CRM or billing systems, or to provision changes to the telecommunications infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_m7Z874d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/649/0%2AmdHgACFQE-oVtoPe" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_m7Z874d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/649/0%2AmdHgACFQE-oVtoPe" alt="" width="649" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Generally speaking, there are different drivers of complexity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;number and nature of involved applications or people&lt;/strong&gt;. For applications, their own complexity and ability to integrate is especially important. It makes a big difference whether you connect to a well-known cloud tool like Salesforce or to a legacy mainframe application that is a black box.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;number of developers&lt;/strong&gt; required to work on a project.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;number of departments&lt;/strong&gt; or people involved in discussing how a process is implemented.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;number of users&lt;/strong&gt; that do operational work as part of the process instances, e.g. via human tasks.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;complexity of the user interface&lt;/strong&gt;, as some processes might not need any UI, some only simple forms, and in other cases, you might even need a fully-fledged, single-page application to support the users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance&lt;/strong&gt; requirements. For example, financial processes often need to comply with many legal requirements. Auditors might ask not only about how processes are implemented in general, but also want to look into specific instances from history to understand what happened in certain situations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The more complex requirements are, the more best practices from software engineering you will need to handle them. In contrast, simple processes can also be handled by low code tooling where you simply drag together a process from standard bricks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scale: small vs. big
&lt;/h3&gt;

&lt;p&gt;The scale dimension can relate to various things. To avoid any confusion, I limit “scale” to &lt;strong&gt;“load”&lt;/strong&gt; in the context of the map, so essentially the &lt;strong&gt;number of process instances&lt;/strong&gt; in a certain timeframe. Some consider the number of systems or teams involved as part of “scale”; I explicitly put these factors under complexity.&lt;/p&gt;

&lt;p&gt;For example, one of our customers implements &lt;a href="https://page.camunda.com/cclive-2021-goldman-sachs"&gt;a process that must be able to process two million payments per hour&lt;/a&gt;. This is definitely a big scale and poses different requirements compared to the management of my handful of talks a month. Foremost, the chosen technology must be able to handle the targeted scale and also help you navigate failure scenarios at scale; for example, if a core system faces an outage and thousands of process instances need to be retriggered once it comes up again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YAfI8yaZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/654/0%2AzrFVe_djnSuJD3fl" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YAfI8yaZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/654/0%2AzrFVe_djnSuJD3fl" alt="" width="654" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volatile loads&lt;/strong&gt; might further lead to requirements around elasticity, so you need to keep changes to the scale in mind. For example, if you provide some service via the internet and run a super successful ad, you want your delivery process to be ready to scale to the increased demand without interruptions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scope: task vs. process
&lt;/h3&gt;

&lt;p&gt;Here comes my favorite dimension. Only last year did I understand what a big source of confusion this is for a lot of people, maybe even the largest source of confusion in the process automation space: the difference between task and process automation.&lt;/p&gt;

&lt;p&gt;A process consists of multiple steps in a logical order: typically a sequence, probably branched by decisions, maybe looping back under some circumstances, and so forth. This is well expressed by &lt;a href="https://camunda.com/bpmn/"&gt;process models in BPMN&lt;/a&gt;.&lt;/p&gt;
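&lt;p&gt;To make this concrete, the structure such a model expresses (a sequence, branched by a decision, with a possible loop back) can be sketched in a few lines of code. The step and variable names are made up for illustration; this is not how a workflow engine is implemented:&lt;/p&gt;

```python
# Illustrative sketch of process structure, not a workflow engine:
# each step returns the next step, "check_payment" is a decision,
# and "send_reminder" loops back to the payment check.
process = {
    "receive_order": lambda v: "check_payment",
    "check_payment": lambda v: "ship_goods" if v["paid"] else "send_reminder",
    "send_reminder": lambda v: "check_payment",  # loop back
    "ship_goods":    lambda v: None,             # end of the process
}

def run(process, start, variables):
    """Walk the model from the start step and record the path taken."""
    path, step = [], start
    while step is not None:
        path.append(step)
        step = process[step](variables)
    return path

print(run(process, "receive_order", {"paid": True}))
# ['receive_order', 'check_payment', 'ship_goods']
```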

&lt;p&gt;Processes are typically long-running, as they might need to wait at any point in the process (e.g. for humans to do some work, for a service to become available, for customer responses to arrive, or simply for time to pass). This is why solutions need to be able to persist their state durably when waiting.&lt;/p&gt;

&lt;p&gt;In contrast, a task is one step in the overall process. Tasks are typically atomic and can be executed in one go. There is usually no need to wait for anything or to persist state, so I consider tasks stateless.&lt;/p&gt;
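&lt;p&gt;The contrast can be illustrated with a minimal sketch, using a made-up order example: the task is a plain function that runs in one go, while the process instance has to persist its current step and variables before waiting, so that a restart does not lose the wait point:&lt;/p&gt;

```python
# Illustrative sketch only; the step and variable names are invented.
import json

def check_credit(order):
    # A task: atomic, executed in one go, nothing to persist.
    return order["amount"] <= 1000

class ProcessInstance:
    """A long-running process instance: it may wait between steps,
    so its current step and variables must survive restarts."""

    def __init__(self, variables):
        self.step = "check_credit"
        self.variables = variables

    def persist(self):
        # Serialize durable state before waiting, e.g. for a human
        # decision or for an external service to come back.
        return json.dumps({"step": self.step, "variables": self.variables})

    @classmethod
    def restore(cls, snapshot):
        data = json.loads(snapshot)
        instance = cls(data["variables"])
        instance.step = data["step"]
        return instance

instance = ProcessInstance({"amount": 400})
instance.variables["approved"] = check_credit(instance.variables)
instance.step = "wait_for_shipment"   # the process now has to wait...
snapshot = instance.persist()         # ...so its state is stored durably

restored = ProcessInstance.restore(snapshot)
print(restored.step)  # the wait point survived the "restart"
```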

&lt;p&gt;You might automate a process even though its tasks are still completed by humans. This can be achieved using task list user interfaces. The workflow engine keeps track of the state of the overall process, measures cycle times, and escalates if process instances take too long. However, the real work is done by the humans.&lt;/p&gt;
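&lt;p&gt;A hypothetical sketch of this bookkeeping role (the timestamps and the two-hour SLA are invented for illustration): the engine computes cycle times from start and completion timestamps and flags open instances that exceed the SLA as escalation candidates:&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Invented data: one completed instance, one still waiting on a human.
instances = [
    {"id": "order-1", "started": datetime(2021, 11, 1, 9, 0),
     "completed": datetime(2021, 11, 1, 11, 30)},
    {"id": "order-2", "started": datetime(2021, 11, 1, 9, 0),
     "completed": None},
]

def cycle_time(instance, now):
    """Elapsed time of an instance; open instances count up to 'now'."""
    return (instance["completed"] or now) - instance["started"]

def overdue(instances, now, sla=timedelta(hours=2)):
    """Escalation candidates: open instances exceeding the SLA."""
    return [i["id"] for i in instances
            if i["completed"] is None and cycle_time(i, now) > sla]

now = datetime(2021, 11, 1, 12, 0)
print(cycle_time(instances[0], now))  # 2:30:00
print(overdue(instances, now))        # ['order-2']
```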

&lt;p&gt;You can also automate tasks, usually using robotic process automation (RPA), decision management (DMN), machine learning (ML/AI), or simply software in the widest sense. You might do this without looking at the overall process. In &lt;a href="https://dev.to/berndruecker/process-automation-in-harmony-with-rpa-1epe"&gt;Process Automation in Harmony with RPA&lt;/a&gt;, I describe the journey of Deutsche Telekom and how they started with task automation using RPA. They did not look at the process layer and eventually ended up with what they call “spaghetti bot automation”. Telekom then introduced a separate process layer that resolved these problems.&lt;/p&gt;

&lt;p&gt;To generalize: process and task automation are both useful, and best used in combination, but they are addressed by different tools. I always like to emphasize that RPA is not meant to automate processes at all. &lt;a href="https://dev.to/berndruecker/how-to-benefit-from-robotic-process-automation-rpa-55ea"&gt;I found a lot of confusion&lt;/a&gt; around robotic &lt;em&gt;process automation&lt;/em&gt;, which in reality is &lt;em&gt;task automation&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Hence, it is important to be clear about what you want to automate. A multi-step process or a single task? This has a big influence on your tooling requirements and the solution you choose.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vic88SWP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/654/0%2AC-HjlagBcKnV12E8" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vic88SWP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/654/0%2AC-HjlagBcKnV12E8" alt="" width="654" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Project setup: ad-hoc or temporary vs. strategic
&lt;/h3&gt;

&lt;p&gt;As part of the overall Camunda journey (Camunda is the company I co-founded), our marketing teams grew quite a bit over the last few years. We hired more people, introduced new functions, and explored a heck of a lot of new ideas about what to do. Many of those ideas required some IT support. During exploration, you have no idea yet which idea makes the most sense and what the process will look like later on. Thus, we did a lot of manual work but applied low-code tooling for areas that could simply be automated. We did not aim for a very stable solution that could run for years; we just needed something to explore or validate an idea. We were fine with the fact that sometimes only the original creator really understood the temporary solution.&lt;/p&gt;

&lt;p&gt;Only when we took some of these ideas to the next level and scaled their usage did their importance grow to the point where they became strategic and required a sustainable, maintainable solution.&lt;/p&gt;

&lt;p&gt;Most of the core processes of our customers are a different story. These processes are so strategic that organizations even have dedicated departments responsible for operating and maintaining a single process. Think of processing two million payments per hour, as mentioned earlier.&lt;/p&gt;

&lt;p&gt;Part of the project setup dimension is therefore also the criticality of the process. If a process is critical for your company to survive, you need to make sure it runs smoothly and is stable. If you can lose real money on process failures, you need operation capabilities that prevent failures from happening or going undetected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dxuCf8qV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/proxy/0%2A6hG6t5QOt8sQP-gY" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dxuCf8qV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/proxy/0%2A6hG6t5QOt8sQP-gY" alt="" width="654" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Understanding these dimensions is a good exercise to help you decide what kind of process automation solution fits your use case. Generally, you want to rate your process on this map as described in &lt;a href="https://techspective.net/2021/11/22/the-process-automation-map/"&gt;the process automation map&lt;/a&gt;. Then, look at the typical sweet spots of the solution categories.&lt;/p&gt;
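&lt;p&gt;As a minimal sketch of what such a rating could look like in data (the 1-5 scale and the rule of thumb are a simplification for illustration, not part of the map itself):&lt;/p&gt;

```python
# A rating is just a position on each dimension of the map;
# the numeric scale and the heuristic below are invented examples.
process_rating = {
    "complexity": 4,              # many systems and teams involved
    "scale": 5,                   # e.g. two million payments per hour
    "scope": "process",           # multi-step process, not a single task
    "project_setup": "strategic",
}

def suggests_workflow_engine(rating):
    # Sweet-spot heuristic: strategic, multi-step processes point
    # toward a dedicated workflow engine rather than ad-hoc tooling.
    return rating["scope"] == "process" and rating["project_setup"] == "strategic"

print(suggests_workflow_engine(process_rating))  # True
```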

&lt;p&gt;I hope this is helpful to you and I recommend trying it out in your own projects. To help you create your own map, I uploaded a template slide &lt;a href="https://docs.google.com/presentation/d/1sh6tsSp-q2uz4pmUmAmPDiGo-UUFGBlsANOahCGGqmA/edit#slide=id.p"&gt;here&lt;/a&gt;. Feel free to use and distribute it at your own discretion. As always, I’m happy to receive any feedback or especially copies of your own processes rated on the map.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://berndruecker.io/"&gt;Bernd Ruecker&lt;/a&gt; is co-founder and chief technologist of C&lt;a href="http://camunda.com/"&gt;amunda&lt;/a&gt; as well as the author of&lt;a href="https://processautomationbook.com/"&gt;Practical Process Automation with O’Reilly&lt;/a&gt;. He likes speaking about himself in the third person. He is passionate about developer-friendly process automation technology. Connect via&lt;a href="https://www.linkedin.com/in/bernd-ruecker-21661122/"&gt;LinkedIn&lt;/a&gt; or follow him on&lt;a href="http://twitter.com/berndruecker/"&gt;Twitter&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; As always, he loves getting your feedback. Comment below or&lt;a href="http://bernd.ruecker@camunda.com"&gt;send him an email&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>processautomation</category>
      <category>lowcode</category>
      <category>developer</category>
      <category>rpatools</category>
    </item>
  </channel>
</rss>
