<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sergey Bykov</title>
    <description>The latest articles on Forem by Sergey Bykov (@sergeybykov).</description>
    <link>https://forem.com/sergeybykov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F489830%2F68a78eeb-7abc-4fef-83f6-e26eec2c7ffe.jpeg</url>
      <title>Forem: Sergey Bykov</title>
      <link>https://forem.com/sergeybykov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sergeybykov"/>
    <language>en</language>
    <item>
      <title>2025 — Part 2</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Tue, 18 Nov 2025 16:55:03 +0000</pubDate>
      <link>https://forem.com/temporalio/2025-part-2-2eoi</link>
      <guid>https://forem.com/temporalio/2025-part-2-2eoi</guid>
      <description>&lt;p&gt;(&lt;a href="https://dev.to/temporalio/2025-3dmg"&gt;Part 1&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;Company&lt;/h2&gt;

&lt;p&gt;At the time of my last update, the company had 116 people. Now we are over 300. The Go-to-Market organization is now larger than Engineering. &lt;a href="https://en.wikipedia.org/wiki/Dunbar%27s_number" rel="noopener noreferrer"&gt;Some studies&lt;/a&gt; claim that our ancestors couldn’t handle tribes of over about 150 people. We are definitely past the point when one could know every employee. The loss of intimacy is offset by the feeling that we now have resources — a growing number of teams focusing on different areas while collaborating on cross-group efforts.&lt;/p&gt;

&lt;p&gt;With such growth, we are doubling down on our efforts to foster and reemphasize consistency in our hiring practices, decision-making, behavioral patterns, and rules of engagement, otherwise referred to as values and culture. In my previous life within a huge corporation, those things generally made sense to me, but they also felt somewhat artificial and performative. Within the context of a small company with a relatively flat structure, it feels very different — much closer to home. This makes me genuinely attentive to such aspects and eager to contribute where I can. Just recently, we rolled out our &lt;a href="https://temporal.io/careers" rel="noopener noreferrer"&gt;updated values&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0z753za1mzrzrau64rbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0z753za1mzrzrau64rbd.png" alt="Company" width="800" height="527"&gt;&lt;/a&gt;&lt;br&gt;
My impression is that at least half of the VC money these days goes to companies with corporate domains ending in “.ai,” and aside from that, funding isn’t easy. We raised our &lt;a href="https://temporal.io/blog/temporal-series-c-announcement" rel="noopener noreferrer"&gt;C round&lt;/a&gt; early this year with a very good, some say almost exceptional, multiple. This tells us that the investors have a strong conviction about our product, business model, and growth. I’m no VC, but I see how they are impressed with the quality of the use cases and the caliber of customers that come to our cloud. I hope they know better than I do how to assess and evaluate such factors. Since the C round, we’ve also had a &lt;a href="https://temporal.io/blog/temporal-raises-secondary-funding" rel="noopener noreferrer"&gt;secondary round&lt;/a&gt; that pushed the company’s valuation significantly higher.&lt;/p&gt;

&lt;p&gt;Keeping the hiring bar high continues to be a top priority. With the turmoil in the job market and Temporal becoming a better-known brand, we now have access to a larger pool of high-quality engineering talent. The interview process is still more art than science, and scaling and improving this art as the company grows is a challenge in itself. Hiring at the junior levels has its own difficulties. Recently, we had to close an open SDE 1 position after only a few hours because, during that time, we received more than 3,000 applications. We found that the old recipe still works well — filling junior positions via internships.&lt;/p&gt;

&lt;p&gt;We are still fully remote, with WeWork as an option for folks who want to come into the office. We are geo-distributed but not very balanced. Most of Engineering is on the U.S. West Coast, with roughly a tie between the Seattle and Bay areas. Smaller pockets are in Colorado, North Carolina, and the cities of New York, Chicago, Toronto, and Vancouver. The GTM team has its own distribution. My impression is they are more heavily tilted toward the East Coast.&lt;/p&gt;

&lt;p&gt;We settled on an annual all-company offsite (we started with twice a year). We complemented it with smaller team offsites and are now aggregating them into an annual R&amp;amp;D offsite, side by side with GTM’s sales kickoff event. We’ll see how this goes. There doesn’t appear to be a simple solution for doing it right, and each company needs to find its own rhythm. From time to time, we leverage the West Coast’s locality for in-person meetings to discuss some critical decisions or designs. In such cases, we consciously violate the remote-first setup for the sake of high-throughput discussions and faster decision-making — at the unfair expense of colleagues who can’t attend in person and have to connect via Zoom.&lt;/p&gt;

&lt;h2&gt;Replay&lt;/h2&gt;

&lt;p&gt;It was a bold move in August 2022 to start our own annual conference. The &lt;a href="https://www.youtube.com/playlist?list=PLl9kRkvFJrlRWBrfqOOX1rN_d1mahYI70" rel="noopener noreferrer"&gt;inaugural edition&lt;/a&gt; was in Seattle. The &lt;a href="https://www.youtube.com/playlist?list=PLl9kRkvFJrlREHL7fiEKBWTp5QuFeYS2r" rel="noopener noreferrer"&gt;2023&lt;/a&gt; and &lt;a href="https://www.youtube.com/playlist?list=PLl9kRkvFJrlR0xieUwBN_nNHW0oijCZa6" rel="noopener noreferrer"&gt;2024&lt;/a&gt; editions were in Bellevue, WA, growing bigger each year. In &lt;a href="https://www.youtube.com/playlist?list=PLl9kRkvFJrlQ4Hw1U1aGxc2wH7oQ3tisp" rel="noopener noreferrer"&gt;2025&lt;/a&gt;, we held the event in London to reach audiences unlikely to travel to the U.S. Attending Replay is a very special experience. Seeing so many engineers and engineering leaders talking non-stop about your product and presenting on stage what they’ve built with it is a special kind of pleasure. I have presented at every Replay but the very first one. In 2023, my talk was on the second day, and I had talked with folks so much by then that my voice let me down near the end of my presentation. I guess that’s why no recording of it was published. But I gave slightly different versions of the same talk at &lt;a href="https://www.youtube.com/watch?v=LHkeXk_8Cq4" rel="noopener noreferrer"&gt;J on the Beach&lt;/a&gt; and &lt;a href="https://www.infoq.com/presentations/durable-execution-control-plane/" rel="noopener noreferrer"&gt;QCon SF&lt;/a&gt; that year.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bheq35jmfjvgns7tqyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bheq35jmfjvgns7tqyy.png" alt="Replays" width="800" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://replay.temporal.io/" rel="noopener noreferrer"&gt;Replay 2026&lt;/a&gt; will be in San Francisco — at Moscone, no less. It should be epic. I’ll need to rewatch the &lt;a href="https://www.imdb.com/title/tt2575988/" rel="noopener noreferrer"&gt;Silicon Valley documentary&lt;/a&gt; before going there.&lt;/p&gt;

&lt;h2&gt;Operations&lt;/h2&gt;

&lt;p&gt;We operate a multi-million-dollar business based on a single product — Temporal Cloud. Our customers trust us with their hot-path business processes — often their most critical ones. This is an interesting phenomenon. They choose Temporal’s Durable Execution to make their applications resilient to various failures. Naturally, they first and foremost care about the reliability of their most critical services. Some choose to self-host Temporal Server with all its dependencies. Many others don’t view operating such complex production machinery as their core competency, and they come to our cloud service with their most precious workloads. It is amazing and sobering at the same time when big Internet household names bring us their “crown jewels” to run — even those who have a policy of not taking a dependency on SaaS vendors in the critical path. It was eye-opening to hear, on a couple of occasions, a customer say, “We only have two external dependencies — AWS and Temporal Cloud.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5guzazuxia3xm3t2hhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5guzazuxia3xm3t2hhn.png" alt="APS" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Customer expectations are very high. Sometimes it feels like they set them higher for us than for the hyperscalers. We now have about eight engineering on-call rotations (teams), covering different areas of the system, plus one for on-call managers who coordinate across teams, and another for the Developer Success team that communicates with customers. This may seem large for our company size, but that’s the nature of the service we run.&lt;/p&gt;

&lt;p&gt;We use &lt;a href="http://incident.io" rel="noopener noreferrer"&gt;incident.io&lt;/a&gt; for managing incidents. It integrates nicely with Slack, creates a per-incident channel, and automatically adds the current on-call engineers to it, among other things. We saw great promise in the early days of their product. They haven’t disappointed and are growing fast. Like most folks, we use &lt;a href="http://statuspage.io" rel="noopener noreferrer"&gt;statuspage.io&lt;/a&gt; for public incidents and &lt;a href="http://pagerduty.com" rel="noopener noreferrer"&gt;pagerduty.com&lt;/a&gt; for on-call paging. Incident.io also integrates with Jira to automatically turn incident follow-ups into tickets, helping us continuously improve the system.&lt;/p&gt;

&lt;h2&gt;Replication&lt;/h2&gt;

&lt;p&gt;Temporal inherited the application-level replication stack from Cadence. Over the years, we dramatically improved it and added Control Plane functionality to manage it. Initially, we used replication to transparently migrate customer namespaces from Cell to Cell. After we got it working at the level we were happy with, we exposed it to customers as high-availability options — multi-region, cross-cloud, and single-region replication.&lt;/p&gt;

&lt;p&gt;At first, few customers understood why they would want to pay double (due to the duplicate hardware needed) for such a feature. Some just used it, at our suggestion, as a tool for migrating their workloads from one region or cloud provider to another. The recent &lt;a href="https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW" rel="noopener noreferrer"&gt;GCP&lt;/a&gt; and &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;AWS us-east-1&lt;/a&gt; outages vindicated the paranoid among our customers who refused to accept that “cloud regions pretty much never go down.”&lt;/p&gt;

&lt;p&gt;Customers who had replication enabled for their namespaces were able to fail over to the other region or cloud, and their applications continued executing as if nothing had happened. We discovered a few misses on our side and had to fail over some namespaces manually, with a longer delay than we expected. The important part is that replicated namespaces continued running after failover. We saw a major spike in customers setting up replication in the days after the AWS us-east-1 outage. One customer was in the process of migrating their namespace from AWS to GCP during GCP’s global outage. They weren’t impacted and didn’t even need to fail over because their active replica was still in AWS. They were considering keeping the cross-cloud replication running indefinitely after that.&lt;/p&gt;
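The active/standby model behind replicated namespaces can be pictured with a toy sketch. All names here (the Namespace class, the failover method, the region strings) are illustrative inventions, not Temporal Cloud's actual API:

```python
# Conceptual toy model of a replicated namespace; not Temporal Cloud code.
from dataclasses import dataclass

@dataclass
class Namespace:
    name: str
    replicas: list   # e.g. regions or clouds holding a copy of the history
    active: str = ""

    def __post_init__(self):
        # The first listed replica starts out as the active one.
        self.active = self.active or self.replicas[0]

    def failover(self, target: str) -> str:
        # Promote a standby replica. Because history is already replicated
        # there, workflows resume where they left off after the switch.
        if target not in self.replicas:
            raise ValueError(f"{target} is not a replica of {self.name}")
        self.active = target
        return self.active

ns = Namespace("payments", ["aws-us-east-1", "gcp-us-central1"])
ns.failover("gcp-us-central1")
print(ns.active)  # gcp-us-central1
```

The key property the sketch captures is that failover is a metadata change, not a data copy: the standby already holds the replicated history, which is why applications continue "as if nothing had happened."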

&lt;p&gt;I gave a &lt;a href="https://www.youtube.com/watch?v=DM68pz5ysWE" rel="noopener noreferrer"&gt;conceptual talk&lt;/a&gt; about replicated namespaces, but the topic probably deserves its own post.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=DM68pz5ysWE" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyit8wprc7xqny4e6zn6s.png" alt="Fourth Little Pig" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Road ahead&lt;/h2&gt;

&lt;p&gt;With great opportunities come great responsibility and pressure to execute and realize those opportunities. We still have to strike the right balance between running a highly reliable service and investing in new functionality. It’s a deeply humbling experience to see that some of the world’s top companies — household names with tens or even hundreds of millions of users — take an all-in dependency on Temporal Cloud. This leaves no room for hubris, complacency, or sloppiness. We have to keep pushing the reliability and quality bar higher without hampering further development of the product.&lt;/p&gt;

&lt;p&gt;I don’t believe there’s a general recipe for how to grow an organization, be it engineering, R&amp;amp;D, or the whole company. We’ll have to navigate our own path — growing sustainably while preserving what has made us successful so far and learning new ways in parallel. It’s exciting and somewhat dizzying at the same time. Yet I feel we are still only getting started.&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>ai</category>
      <category>microservices</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>2025</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Wed, 12 Nov 2025 21:56:41 +0000</pubDate>
      <link>https://forem.com/temporalio/2025-3dmg</link>
      <guid>https://forem.com/temporalio/2025-3dmg</guid>
<description>&lt;h2&gt;Part 1&lt;/h2&gt;

&lt;p&gt;Belated update. Yes, it’s been five years, can’t believe it myself. What’s the “delta” for the last three years that flew by too fast?&lt;br&gt;
Certain things haven’t changed much. We are still under the “dual mandate” — OSS server, SDKs (clients), CLI, and a whole bunch of other peripheral software, plus a cloud service where we charge customers for running and managing the invisible infrastructure so that they don’t have to. My focus continues to be primarily on the cloud side.&lt;/p&gt;

&lt;p&gt;At the same time, obviously, everything has changed — some things even multiple times. We went through COVID with its obligatory work-from-home setup, only for many companies to start imposing, some more gradually than others, return-to-office policies. I interviewed a number of candidates recently who moved away from major tech hubs during COVID and had to leave their jobs because of the RTO push. We are still fully remote.&lt;/p&gt;

&lt;p&gt;Most of the Big Tech companies went through rounds of mass layoffs — a tectonic shift from the previous 20 or so years of competing for talent and outbidding each other in offers. Startups suddenly became much more attractive for Big Tech employees who were previously reluctant to take the risk of leaving their well-paid jobs. At the same time, startup founders faced the funding drought that began in late 2021 and early 2022, caused by interest rate changes. Many had to close or fire-sell their ventures; just a couple of years earlier, cheap money had seemed unlimited.&lt;/p&gt;

&lt;h2&gt;Product&lt;/h2&gt;

&lt;p&gt;Developers choose Temporal for its programming model. They experience it in a language of their choice via Temporal SDKs. We started with two languages. Now we support seven: Go, Java, TypeScript, Python, .NET, PHP, and Ruby. Four of them are built on the same Core SDK written in Rust. No, we still don’t have an official Rust SDK.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqek4rva5d7fzlv1xwim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqek4rva5d7fzlv1xwim.png" alt="SDKs" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We started Temporal Cloud by hosting the OSS Temporal Server with an added layer of security and multi-tenancy. The original value proposition included that, plus general operational concerns such as monitoring, alerting, configuration, upgrades, and scale. We’ve been investing along several dimensions since then and are now running the fifth-generation Cells (&lt;a href="https://www.youtube.com/watch?v=KvxAz5HwBpc" rel="noopener noreferrer"&gt;Temporal Cloud “clusters”&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Building a &lt;a href="https://www.youtube.com/watch?v=SQv9ot-jB6o" rel="noopener noreferrer"&gt;custom storage layer&lt;/a&gt; between the server and database to absorb reads and coalesce writes was one of the first bold undertakings. Rolling it out to production over the course of 2023 gave us a significant increase in reliability, performance, and scalability compared to the vanilla OSS server. Another major investment was making it possible to incrementally add multiple databases to a running server. With these improvements, the scenario I mentioned in one of my previous posts, a major customer needing to increase their already outsized traffic level up to 10x for a day, became routine. At the end of 2021, a day like that was a big deal for both companies, with teams of engineers monitoring the system, communicating live, and taking action. The subsequent occurrences became increasingly “boring” and turned into non-events.&lt;/p&gt;
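The general idea of absorbing reads and coalescing writes can be sketched with a minimal write-behind cache. This is a stdlib toy under my own assumptions, not Temporal's actual storage layer:

```python
# Illustrative sketch: serve reads from memory, batch pending writes.
class CoalescingStore:
    def __init__(self, backend):
        self.backend = backend  # the underlying database (a dict here)
        self.cache = {}         # absorbs repeated reads
        self.dirty = {}         # pending writes, coalesced per key

    def read(self, key):
        # A cache hit never touches the database.
        if key not in self.cache:
            self.cache[key] = self.backend.get(key)
        return self.cache[key]

    def write(self, key, value):
        # Later writes to the same key overwrite earlier pending ones,
        # so only the latest value reaches the database on flush.
        self.cache[key] = value
        self.dirty[key] = value

    def flush(self):
        # One batched round-trip instead of one write per update.
        self.backend.update(self.dirty)
        flushed = len(self.dirty)
        self.dirty = {}
        return flushed

db = {"wf-1": "running"}
store = CoalescingStore(db)
store.write("wf-1", "completed")
store.write("wf-1", "archived")
print(store.flush())  # 1  (two updates coalesced into a single write)
```

Even this toy shows why the technique pays off for hot workflow records: the database sees one write per flush interval per key, regardless of how many updates landed in between.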

&lt;p&gt;On the authentication/authorization dimension, we went from initially supporting only &lt;a href="https://docs.temporal.io/cloud/certificates" rel="noopener noreferrer"&gt;mTLS&lt;/a&gt; and Google SSO to adding &lt;a href="https://docs.temporal.io/cloud/api-keys" rel="noopener noreferrer"&gt;API keys&lt;/a&gt;, &lt;a href="https://docs.temporal.io/cloud/service-accounts" rel="noopener noreferrer"&gt;service accounts&lt;/a&gt;, &lt;a href="https://docs.temporal.io/cloud/saml" rel="noopener noreferrer"&gt;SAML&lt;/a&gt;, &lt;a href="https://docs.temporal.io/cloud/user-groups" rel="noopener noreferrer"&gt;SCIM&lt;/a&gt;, and a bunch of other features critical for enterprise — and not only enterprise — customers.&lt;/p&gt;

&lt;p&gt;We started with prospective cloud customers filling out a form and being contacted by our sales team to complete the paperwork, after which we created an account on their behalf. Embarrassing. Now, we have a complete &lt;a href="https://temporal.io/get-cloud" rel="noopener noreferrer"&gt;self-signup process&lt;/a&gt; that guides prospective customers along the path, with a full PLG motion behind it. When we opened up Temporal Cloud to the world, we were missing a number of table-stakes features. At the time, I called the bar we had to meet a “reasonable cloud service.” I believe we passed this milestone 12–18 months ago.&lt;/p&gt;

&lt;p&gt;I like that we don’t play licensing games (our OSS is under the MIT license) and instead extend and enhance it with proprietary features to differentiate our cloud offering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5b73uredlrwamqsz8nut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5b73uredlrwamqsz8nut.png" alt="Self-signup" width="800" height="544"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We launched Temporal Cloud in 2022 with support for AWS only. We added GCP in 2025 and are working on bringing in Azure, the last of the big three providers. Even though support for Kubernetes clusters across them is similar, most of the integration effort goes into their disparate security and resource hierarchy models, differences in networking, and subtle behavioral differences in their seemingly compatible APIs — for example, &lt;a href="https://airbyte.com/data-engineering-resources/s3-gcs-and-azure-blob-storage-compared" rel="noopener noreferrer"&gt;GCS vs. S3&lt;/a&gt;. Recently, we’ve been chasing GCP load balancers mysteriously ghosting a fraction of the connections. Support for hosted Elasticsearch is another headache — only AWS has it, but in the form of OpenSearch, their fork of ES from before Elastic changed its license.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmy27c2ww3nn32lyir9fm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmy27c2ww3nn32lyir9fm.png" alt="Multi-region" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;AI&lt;/h2&gt;

&lt;p&gt;The agentic AI “storm” turned into a sudden tailwind for Temporal. The very nature of such applications — being stateful, depending on a significant number of semi-reliable API calls to external services, and taking seconds to minutes to execute — made code-first Durable Execution a compelling programming model for this fast-moving, massive herd. While there are still some rough edges for AI use cases in the near term (such as payload and history size limits and required determinism of workflow code), the immediate benefits — high-velocity development of much more reliable code in the language of your choice, guaranteed scalability, and unparalleled visibility into execution for debugging — &lt;a href="https://docs.temporal.io/ai-cookbook" rel="noopener noreferrer"&gt;keep bringing AI-focused companies&lt;/a&gt; to Temporal. More traditional businesses that are scrambling to integrate AI into their systems do the same. I was told that as of late 2024, out of the top 20 AI companies, only two were aware of Temporal — and now, 16 of them already run Temporal-based apps.&lt;/p&gt;

&lt;h2&gt;Nexus&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstb7215vb88u6fxdijok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstb7215vb88u6fxdijok.png" alt="Nexus" width="84" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This year we launched the initial version of &lt;a href="https://docs.temporal.io/evaluate/nexus" rel="noopener noreferrer"&gt;Nexus&lt;/a&gt;, an &lt;a href="https://github.com/nexus-rpc/api" rel="noopener noreferrer"&gt;open-standard-based protocol&lt;/a&gt; for APIs that may take arbitrarily long to complete. I think of it as a great frontend layer for Durable Execution. But the protocol itself is implementation-agnostic. One could implement it using more traditional tools and approaches, for example, within the paradigm of event-driven architecture. The idea was conceived in the early days of Temporal. We started talking about it publicly in 2022, only to do nothing for another year due to other priorities.&lt;/p&gt;

&lt;p&gt;We believe that Nexus is an immense opportunity to integrate systems and services in a new, powerful way. Nexus deserves a dedicated post, and I’m contemplating a conference talk about how the combination of Durable Execution and Nexus could define a major evolution of the &lt;a href="https://martinfowler.com/microservices/" rel="noopener noreferrer"&gt;Microservice Architecture&lt;/a&gt;. I understand this is a very bold statement, but sometimes you have to shoot for the Moon.&lt;/p&gt;

&lt;p&gt;(Continued in &lt;a href="https://dev.to/temporalio/2025-part-2-2eoi"&gt;Part 2&lt;/a&gt;)&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>ai</category>
      <category>microservices</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Building Durable Cloud Control Systems with Temporal</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Sat, 09 Aug 2025 00:47:09 +0000</pubDate>
      <link>https://forem.com/temporalio/building-durable-cloud-control-systems-with-temporal-5l7</link>
      <guid>https://forem.com/temporalio/building-durable-cloud-control-systems-with-temporal-5l7</guid>
      <description>&lt;p&gt;In today’s world of managed cloud services, delivering exceptional user experiences often requires rethinking traditional architecture and operational strategies. At Temporal, we faced this challenge head-on, navigating complex decisions about tenancy models, resource management, and durable execution to build a reliable, scalable cloud service. This post explores our approach and the lessons we learned while creating &lt;a href="https://temporal.io/cloud" rel="noopener noreferrer"&gt;Temporal Cloud&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;The Case for Managed Cloud Services&lt;/h2&gt;

&lt;p&gt;Managed services have become the default for delivering hosted solutions to customers. Whether it’s a database, queueing system, or another server-side technology, hosting a service not only provides a better user experience but also opens doors for monetization, especially for open-source projects. The challenge is how to do it effectively while maintaining reliability and scalability.&lt;/p&gt;

&lt;p&gt;One of the first decisions we made was about tenancy models. Should we pursue single-tenancy — provisioning dedicated clusters for each customer — or opt for &lt;a href="https://docs.temporal.io/evaluate/development-production-features/multi-tenancy" rel="noopener noreferrer"&gt;multi-tenancy&lt;/a&gt;, which allows multiple customers to share the same resources? While single-tenancy offers simplicity and isolation, its inefficiencies quickly become apparent. Customers end up paying for unused capacity, and providers shoulder higher operational costs. Multi-tenancy, though harder to implement, emerged as the clear winner. It optimizes resource usage, allows customers to pay for actual usage, and creates shared headroom for handling traffic spikes.&lt;/p&gt;

&lt;h2&gt;Data Plane vs. Control Plane: Defining Responsibilities&lt;/h2&gt;

&lt;p&gt;Architecting a managed service in terms of the data plane and control plane is an industry best practice that we followed, clearly defining and implementing their distinct roles within our cloud architecture.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Plane&lt;/strong&gt;: This is where the actual work happens — processing transactions, executing workflows, and handling customer data. It must maintain high availability, low latency, and resilience to failures. For Temporal Cloud, we adopted a cell-based architecture to isolate resources and minimize the blast radius of potential failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control Plane&lt;/strong&gt;: This acts as the brain of the system, managing resources, provisioning namespaces, and handling configurations. While its performance is less critical than the data plane, reliability here still matters for customer experience. For instance, provisioning a namespace may not be urgent, but delays or errors in this process can frustrate users.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Implementing the Data Plane: A Cell-Based Architecture&lt;/h2&gt;

&lt;p&gt;For the data plane, we applied a cell-based architecture to achieve strong isolation and scalability. Each cell operates as a self-contained unit with its own AWS account, VPC, EKS cluster, and supporting infrastructure. While this approach is framed within the context of AWS, we have applied the same principles to Google Cloud Platform (GCP), leveraging its equivalent primitives to ensure consistency and reliability across cloud providers. This approach ensures that failures or updates in one cell do not impact others, reducing the risk of cascading outages.&lt;/p&gt;

&lt;p&gt;Each cell in Temporal Cloud includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compute Pods&lt;/strong&gt;: Running Temporal services and infrastructure tools for observability, ingress management, and certificate handling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databases&lt;/strong&gt;: Both primary databases and Elasticsearch for enhanced visibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Additional Components&lt;/strong&gt;: Load balancers, private connectivity endpoints, and other supporting infrastructure that ensures smooth operation and integration across environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Currently, Temporal Cloud operates across 14 AWS regions, and we’ve also added support for GCP. This architecture allows us to meet the diverse needs of our customers while maintaining reliability at scale.&lt;/p&gt;

&lt;h2&gt;Durable Execution: The Foundation of the Control Plane&lt;/h2&gt;

&lt;p&gt;Building the control plane presented its own set of challenges, particularly around reliability and maintainability. Control plane tasks, such as provisioning namespaces or rolling out updates, involve complex long-running processes with many interdependent steps. Writing this logic as traditional, ad-hoc code often leads to brittle systems that are hard to debug and evolve.&lt;/p&gt;

&lt;p&gt;This is where Temporal’s &lt;a href="https://temporal.io/blog/building-reliable-distributed-systems-in-node-js-part-2" rel="noopener noreferrer"&gt;durable execution&lt;/a&gt; model shines. Designed based on experience with earlier systems like AWS Simple Workflow Service and Azure Durable Functions, Temporal’s approach separates business logic from state management and failure handling. Developers can write workflows as straightforward, happy-path code without worrying about retries, error handling, or state persistence. The system automatically manages these concerns, allowing workflows to seamlessly recover from failures.&lt;/p&gt;
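To make the contrast concrete, here is a stdlib sketch of the retry bookkeeping that durable execution takes off the developer's plate. The helper and its parameters are invented for illustration; in Temporal, the platform itself runs an equivalent loop around each activity and also persists state across process restarts, which this in-memory sketch cannot do:

```python
# Hand-rolled retry logic of the kind durable execution makes unnecessary.
import time

def call_with_retries(step, attempts=3, backoff=0.01):
    # Retry a flaky step with exponential backoff, re-raising on exhaustion.
    for attempt in range(attempts):
        try:
            return step()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * 2 ** attempt)

calls = {"n": 0}
def flaky_step():
    # Fails twice, then succeeds, imitating a transient external API error.
    calls["n"] += 1
    if calls["n"] % 3:
        raise RuntimeError("transient failure")
    return "provisioned"

print(call_with_retries(flaky_step))  # provisioned
```

Multiply this boilerplate by every step of a long-running process, add persistence so a crash mid-process does not lose progress, and the appeal of having the platform own it becomes clear.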

&lt;h2&gt;
  
  
  Namespace Provisioning: A Real-World Example
&lt;/h2&gt;

&lt;p&gt;Consider the process of creating a new namespace in Temporal Cloud. When a user clicks “Create Namespace” on the web interface, the control plane orchestrates a series of tasks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Selecting a suitable cell within the chosen region.&lt;/li&gt;
&lt;li&gt;Creating database records and roles.&lt;/li&gt;
&lt;li&gt;Generating and provisioning mTLS certificates.&lt;/li&gt;
&lt;li&gt;Configuring ingress routes and verifying connectivity.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each step involves external API calls, DNS propagation, and other potential points of failure.&lt;/p&gt;
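&lt;p&gt;The steps above can be sketched as one sequential function. This is an illustrative Python sketch, not Temporal Cloud’s actual provisioning code — every name and the retry policy are invented — and it makes explicit the retry-and-backoff boilerplate that durable execution would otherwise absorb.&lt;/p&gt;

```python
import time

def with_retries(step, attempts=3, base_delay=0.0):
    # Hand-rolled retry with exponential backoff; a workflow engine
    # would handle this transparently per activity.
    for attempt in range(attempts):
        try:
            return step()
        except RuntimeError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

def provision_namespace(region, cells):
    # 1. Select a suitable (here: least-loaded) cell in the region.
    cell = min(cells, key=lambda c: c["load"])
    # 2. Create database records and roles.
    record = with_retries(lambda: {"namespace_id": "ns-1", "cell": cell["name"]})
    # 3. Generate and provision mTLS certificates.
    cert = with_retries(lambda: {"mtls_cert": "generated"})
    # 4. Configure ingress routes.
    route = with_retries(lambda: {"ingress": region + "/" + record["namespace_id"]})
    return {**record, **cert, **route}

print(provision_namespace("us-east-1", [{"name": "cell-a", "load": 3},
                                        {"name": "cell-b", "load": 1}]))
```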

&lt;p&gt;Without durable execution, managing retries, backoffs, and state persistence would result in a tangle of brittle code. With Temporal, these tasks are encapsulated in workflows, which transparently handle retries and maintain state across failures. Developers can focus on the high-level logic, confident that the system will handle the edge cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rolling Upgrades: Ensuring Safe Deployments
&lt;/h2&gt;

&lt;p&gt;Another common control plane scenario is rolling out updates to the Temporal Cloud fleet. Our deployment strategy involves organizing cells into deployment rings, progressing from pre-production environments to customer-facing cells that carry increasingly critical traffic.&lt;/p&gt;

&lt;p&gt;The rollout process is carefully staged:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ring 0&lt;/strong&gt;: Synthetic traffic only, no customer impact. Changes are monitored here for at least a week.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ring 1&lt;/strong&gt;: Low-priority traffic namespaces, allowing for additional testing with minimal risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher Rings&lt;/strong&gt;: Gradually expanding to critical, high-priority traffic customers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Within each ring, updates are applied in batches, with pauses between batches to observe for potential issues like memory leaks or race conditions. Temporal workflows handle this process, ensuring that even long-running deployments (which can span weeks) are resilient to failures or restarts.&lt;/p&gt;
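&lt;p&gt;As a rough sketch (ring sizes, cell names, and batch sizes below are invented), the staged rollout can be modeled as a plan of deploy-then-bake steps per ring:&lt;/p&gt;

```python
def batches(items, size):
    # Split a ring's cells into fixed-size update batches.
    return [items[i:i + size] for i in range(0, len(items), size)]

def plan_rollout(rings, batch_size):
    plan = []
    for ring_number, cells in enumerate(rings):
        for batch in batches(cells, batch_size):
            plan.append({"ring": ring_number, "cells": batch, "action": "deploy"})
            plan.append({"ring": ring_number, "action": "bake"})  # observe for regressions
    return plan

rings = [["synthetic-0"],                  # Ring 0: synthetic traffic only
         ["cell-a", "cell-b"],             # Ring 1: low-priority namespaces
         ["cell-c", "cell-d", "cell-e"]]   # higher rings: critical traffic
for step in plan_rollout(rings, batch_size=2):
    print(step)
```

&lt;p&gt;Driving a plan like this from a durable workflow means a deployment that spans weeks survives worker restarts: the workflow resumes at the batch where it left off.&lt;/p&gt;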

&lt;h2&gt;
  
  
  Entity Workflows: A Powerful Pattern
&lt;/h2&gt;

&lt;p&gt;Temporal’s durable execution also enables powerful patterns like entity workflows. These are workflows tied to specific resources, such as cells or namespaces, providing a natural way to model state and operations. For example, each cell in Temporal Cloud has an entity workflow that manages its lifecycle, from provisioning to upgrades. This approach ensures consistency and simplifies concurrency control.&lt;/p&gt;
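&lt;p&gt;A minimal sketch of the entity-workflow idea (the states and commands below are invented, not Temporal Cloud’s actual lifecycle): one long-lived workflow per resource owns that resource’s state and applies operations to it one at a time.&lt;/p&gt;

```python
class CellEntity:
    # Valid lifecycle transitions: (current_state, command) to next_state.
    TRANSITIONS = {
        ("provisioning", "provisioned"): "active",
        ("active", "start_upgrade"): "upgrading",
        ("upgrading", "upgrade_done"): "active",
        ("active", "decommission"): "retired",
    }

    def __init__(self, name):
        self.name = name
        self.state = "provisioning"

    def handle(self, command):
        # Commands arrive one at a time (like signals to an entity
        # workflow), so there is no concurrent mutation of the state.
        next_state = self.TRANSITIONS.get((self.state, command))
        if next_state is None:
            raise ValueError(f"{command} not valid in state {self.state}")
        self.state = next_state
        return self.state

cell = CellEntity("cell-a")
for cmd in ["provisioned", "start_upgrade", "upgrade_done"]:
    print(cell.handle(cmd))
```

&lt;p&gt;Serializing all operations through one entity per resource is what gives the pattern its simple concurrency story: two simultaneous upgrade requests cannot interleave.&lt;/p&gt;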

&lt;h2&gt;
  
  
  Developer Happiness and Productivity
&lt;/h2&gt;

&lt;p&gt;One of the biggest benefits of Temporal’s approach is the impact on developer experience. By eliminating the need to write boilerplate code for retries, backoffs, and state management, developers can focus on delivering business value. Temporal’s built-in tools for observing and debugging workflows further enhance productivity, making it easier to understand and troubleshoot complex systems.&lt;/p&gt;

&lt;p&gt;Happy developers are productive developers, and Temporal’s approach fosters this by reducing the cognitive load and frustration associated with traditional workflow coding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Durable Execution Matters
&lt;/h2&gt;

&lt;p&gt;Durable execution is more than a technical innovation; it’s a paradigm shift for building cloud-native systems. By decoupling business logic from state management and failure handling, Temporal empowers developers to build reliable, scalable systems with less effort. Whether you’re managing control planes, provisioning resources, orchestrating complex workflows, performing money transfers, training AI models, or processing social media posts, this approach delivers clear benefits.&lt;/p&gt;

&lt;p&gt;At Temporal, we’ve seen firsthand how durable execution transforms the development process, enabling us to deliver a robust managed service that scales with our customers’ needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ready to Transform Your Control Plane?
&lt;/h2&gt;

&lt;p&gt;Temporal isn’t just a tool for building cloud systems; it’s a better way to think about workflows and application architecture. If you’re building or planning a managed cloud service, consider how durable execution can simplify your journey and unlock new possibilities. For more insights into our approach, check out &lt;a href="https://www.infoq.com/presentations/durable-execution-control-plane/" rel="noopener noreferrer"&gt;my full talk at QCon&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>control</category>
      <category>service</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Why Top Developers Prioritize Failure Management</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Sat, 09 Aug 2025 00:35:05 +0000</pubDate>
      <link>https://forem.com/temporalio/why-top-developers-prioritize-failure-management-lj6</link>
      <guid>https://forem.com/temporalio/why-top-developers-prioritize-failure-management-lj6</guid>
      <description>&lt;p&gt;There’s a saying: “Amateurs study tactics, while professionals study logistics.” In software, this translates to: “Amateurs focus on algorithms, while professionals focus on failures.”&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://jonthebeach.com/" rel="noopener noreferrer"&gt;J on the Beach&lt;/a&gt;, I took time in my &lt;a href="https://www.youtube.com/watch?v=pMfMm2eD3GM" rel="noopener noreferrer"&gt;talk&lt;/a&gt; to expand on this saying and explain that real-world systems don’t just need code that works on the “happy path” — they need a safety net for when things go wrong.&lt;/p&gt;

&lt;p&gt;Modern software development has layers of complexity. You’re not just writing code; you’re connecting systems across time and space, handling data that doesn’t sleep, and ensuring flawless performance at scale. What sets top developers apart is how they manage failures. Building resilience focuses on ensuring reliability when things inevitably go wrong, not just maintaining uptime.&lt;/p&gt;

&lt;p&gt;In this post, we’ll walk through three common approaches to handling failures in software, each with its own strengths and weaknesses. Then we’ll introduce Temporal’s approach, workflow-as-code, which makes it easier to build reliability into your systems from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Ways to Handle Failure in Your Software
&lt;/h2&gt;

&lt;p&gt;Failures are inevitable in your distributed systems. When a network link fails, a server times out, or a service crashes, systems need strategies to respond properly and ensure that your operations remain reliable.&lt;/p&gt;

&lt;p&gt;Below, we’ll explore three common approaches to coordination between systems — Remote Procedure Calls (RPCs), persistent queues, and workflows — and their relationship to failure management.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Request-Response (RPC)
&lt;/h3&gt;

&lt;p&gt;The request-response (RPC) model is a classic approach. A client makes a request; the server processes it and sends back a response. In the best-case scenario — the “happy path” — everything works smoothly. Imagine a money transfer request: one service debits the sender while another credits the receiver. If all goes as planned, the transfer completes with no issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros of the RPC Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simplicity&lt;/strong&gt;: The direct client-server connection makes this model easy to implement for straightforward workflows.&lt;br&gt;
&lt;strong&gt;Efficiency on the “happy path”&lt;/strong&gt;: When things go smoothly, RPC provides fast, efficient responses and low latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons of the RPC Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limited resilience for partial failures&lt;/strong&gt;: If the client’s request is successful, but a response isn’t received, or a step in the process fails, RPC often requires extensive error-handling code on the client side.&lt;br&gt;
&lt;strong&gt;Heavy client burden&lt;/strong&gt;: Clients must handle errors, recovery, and retries, complicating systems as they scale.&lt;/p&gt;

&lt;p&gt;The RPC model works well for simple, synchronous tasks. However, it falls short on resilience by placing the onus on both the developers of the RPCs and those consuming them to manage every failure scenario — and this is no trivial matter.&lt;/p&gt;
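&lt;p&gt;The partial-failure problem is easiest to see in the money-transfer example: if the debit succeeds but the response is lost, a naive client retry debits twice. One common mitigation — sketched below with invented names, not tied to any particular RPC framework — is an idempotency key the server uses to detect duplicates.&lt;/p&gt;

```python
class TransferServer:
    def __init__(self):
        self.balance = 100
        self.processed = {}

    def debit(self, request_id, amount):
        # Duplicate retry with the same key: return the stored result
        # instead of debiting twice.
        if request_id in self.processed:
            return self.processed[request_id]
        self.balance -= amount
        self.processed[request_id] = {"ok": True, "balance": self.balance}
        return self.processed[request_id]

server = TransferServer()
server.debit("req-1", 30)         # response lost on the wire...
print(server.debit("req-1", 30))  # ...client retries with the same key
print(server.balance)             # balance stays at 70, not 40
```

&lt;p&gt;Note that this only patches one failure mode; the client still has to implement the retry loop, timeouts, and backoff itself.&lt;/p&gt;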

&lt;h3&gt;
  
  
  2. Persistent Queues
&lt;/h3&gt;

&lt;p&gt;Persistent queues add a degree of flexibility by decoupling the client from the server. Messages are placed in a queue, and the system processes them asynchronously. Queues help distribute workloads: they support automatic retries and asynchronous processing, which can smooth out demand spikes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros of Persistent Queues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic retries&lt;/strong&gt;: Persistent queues often support automatic retries, attempting tasks multiple times if they initially fail.&lt;br&gt;
&lt;strong&gt;Load distribution&lt;/strong&gt;: Queues smooth processing under heavy loads, distributing requests over time to improve system reliability.&lt;br&gt;
&lt;strong&gt;Producer-consumer separation&lt;/strong&gt;: Decoupling producers and consumers allows the queue to function independently, improving fault tolerance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons of Persistent Queues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loss of ordering&lt;/strong&gt;: Since queues process messages independently, tasks may execute out of order, causing unexpected issues for dependent operations.&lt;br&gt;
&lt;strong&gt;Dead-letter queues&lt;/strong&gt;: Tasks that continuously fail may require a separate “dead-letter” queue, adding complexity and, typically, manual intervention.&lt;br&gt;
&lt;strong&gt;Limited visibility into status&lt;/strong&gt;: Visibility becomes even more challenging in systems that use multiple queues, requiring additional tooling and infrastructure.&lt;/p&gt;

&lt;p&gt;Queues work well when you need flexibility and decoupling, but they lack the control and visibility needed for comprehensive failure management.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Workflows
&lt;/h3&gt;

&lt;p&gt;Workflows provide a robust solution for orchestrating complex processes across distributed systems. Unlike RPC or queue-based models, workflows manage retries, state, and error handling automatically, making them ideal for long-running or multi-step processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros of Workflows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built-in resilience&lt;/strong&gt;: Workflows handle retries, recovery, and compensation steps automatically, reducing the need for custom error-handling code.&lt;br&gt;
&lt;strong&gt;Support for long-running processes&lt;/strong&gt;: Workflows accommodate processes that span minutes, hours, or even days, making them well-suited for complex tasks.&lt;br&gt;
&lt;strong&gt;Enhanced visibility&lt;/strong&gt;: Workflow systems enable real-time tracking and querying, so both clients and developers can see exactly where each process stands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons of Workflows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure requirements&lt;/strong&gt;: Workflows require solid infrastructure to manage state, retries, and tracking, which some teams may lack.&lt;br&gt;
&lt;strong&gt;Setup complexity&lt;/strong&gt;: Workflow systems can be complex to set up, especially when building custom solutions.&lt;/p&gt;

&lt;p&gt;For complex processes that demand reliability and transparency, workflows provide the most comprehensive solution, though they require dedicated infrastructure to deploy effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resilience Without Extra Overhead
&lt;/h2&gt;

&lt;p&gt;At Temporal, we addressed these challenges by designing a platform that handles resilience, error handling, and state management so you don’t have to.&lt;/p&gt;

&lt;p&gt;With Temporal, you write workflows as code: no extra XML, JSON, or YAML definitions of workflow logic that are difficult to understand and debug down the line. Define your steps in regular code, and Temporal does the rest, managing retries, maintaining state, and ensuring that your workflows are reliable and simple to create.&lt;/p&gt;

&lt;p&gt;Companies like &lt;a href="https://temporal.io/resources/case-studies/anz-story" rel="noopener noreferrer"&gt;ANZ Bank&lt;/a&gt;, one of the largest banks in the Asia-Pacific region, rely on Temporal to strengthen the resilience and reliability of critical financial processes. With Temporal, ANZ orchestrates and manages complex operations across distributed systems, ensuring tasks are retried automatically, failures are handled, and long-running processes are tracked seamlessly. This has enabled ANZ to boost system reliability, reduce operational complexity, and uphold strict compliance standards in their high-stakes FinServ environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure Management Is a Strategy, Not a Setback
&lt;/h2&gt;

&lt;p&gt;Any complex system will encounter failures. But how you handle those failures makes all the difference. For developers, focusing on failure management from the start distinguishes exceptional teams from average ones. Building resilience into your system sets your project up for long-term success.&lt;/p&gt;

</description>
      <category>reliability</category>
      <category>distributed</category>
      <category>failure</category>
    </item>
    <item>
      <title>Two Years In</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Tue, 31 Jan 2023 15:22:43 +0000</pubDate>
      <link>https://forem.com/temporalio/two-years-in-6c8</link>
      <guid>https://forem.com/temporalio/two-years-in-6c8</guid>
      <description>&lt;p&gt;I’ve wanted to publish an update for quite some time on how things are going at Temporal. I’ve had all the right excuses at my disposal for not doing it—“I’m just too busy while we are finishing X”; “we are just about to announce Y”; “I need this the time right now to help my son with physics, after that I’ll have more time”; “after I return from this trip”; etc. Excuses are only excuses after all. If only I could write like &lt;a href="https://blog.colinbreck.com/twitter-a-love-hate-relationship/" rel="noopener noreferrer"&gt;Colin Breck&lt;/a&gt;, I would do it more often. Another excuse.&lt;/p&gt;

&lt;p&gt;When I joined Temporal in September 2020, our team consisted of 13 people, almost all engineers. Now the number is 116 (at least last I heard). We have several distinct teams within engineering: Server, SDK, Cloud, Infra, Developer Tools, Security, and a “10x” team. “10x” is not the actual name of the team; I just used this moniker to reflect what the team’s been focusing on without revealing its internal name.&lt;/p&gt;

&lt;p&gt;Engineering is not the only game in town anymore. We built several other critical functions and staffed them with excellent specialists: a Go To Market team with Sales and Solution Architects, plus Technical Writing and Education, Customer Success, Recruiting, Finance, Product, and Design teams.&lt;/p&gt;

&lt;p&gt;We also happened to &lt;a href="https://siliconangle.com/2022/02/16/temporal-raises-103-million-accelerate-development-stateful-cloud-applications/" rel="noopener noreferrer"&gt;raise our B round&lt;/a&gt; in December 2021 on great terms, just three weeks before the VC market changed dramatically. I have a very limited understanding of this part of business, but it seems we had timed the market very well. We are actively hiring to accelerate investments across multiple areas. In the current market environment, sometimes this feels a little bit surreal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Remote
&lt;/h2&gt;

&lt;p&gt;When I joined, Temporal was a primarily local company, with a plan to get back to the office after the pandemic came to an end. A couple of months later, we realized that the desire to stay local was significantly limiting our ability to hire. We decided to become a fully remote company. That immediately unlocked access to great hires outside of the Seattle area. We hired people in the Bay Area, on the East Coast, in the Midwest, Colorado, Texas, Florida, and Canada. However, the company is still very skewed toward the West Coast in general and the Seattle area in particular, especially on the engineering side.&lt;/p&gt;

&lt;p&gt;Such a geographical makeup of the company presents an interesting challenge in terms of how to collaborate effectively. The original desire for being local was not based on convenience, but on allowing for high-throughput face-to-face discussions. Sometimes a two-hour whiteboarding session is more productive than a month-long series of Zoom calls and Slack threads. Unfortunately, flying people across the continent for a day or two of discussion takes a significant toll on them, with family inconvenience, travel time, and jet lag. The trickiest situation is when there’s a local “majority” that can easily meet in person and a few “remote” team members who can’t. This creates an inherent counterproductive split.&lt;/p&gt;

&lt;p&gt;We are still trying to figure this out. We don’t buy the simplistic recipes like “all meetings have to be remote” and “just need to write everything down,” at least for the current state of the company. Especially for some of the deep technical problems we are trying to solve. I would love to learn how other companies dealt with this challenge. Should we choose a hub-and-spoke model with an HQ and “remote” employees traveling to it? Should we build a small number of hubs in key areas? How many? Seattle, Bay Area, Boston, Denver?&lt;/p&gt;

&lt;h2&gt;
  
  
  Collaboration Tools
&lt;/h2&gt;

&lt;p&gt;We’ve been fairly open-minded about using a number of collaboration tools with overlapping functionality. As the company grew, some tools were phased out, some stayed but changed their role, and some new ones entered the fold.&lt;/p&gt;

&lt;p&gt;Slack continues to be the #1 collaboration tool. The introduction of huddles helped it to take over a significant fraction of Zoom calls for quick chats originating from Slack threads. On-call incident conversations happen primarily in huddles these days. Discord died out as a pseudo-office chat tool around the same time huddles were added to Slack.&lt;/p&gt;

&lt;p&gt;Zoom is the main tool for scheduled meetings and most external calls. Go To Market teams use Gong for recording and transcribing Zoom calls, which is very handy for skimming through content of calls you weren’t part of, or for finding a particular important moment in an hour-long conversation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjt3qqrq768gavfrgours.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjt3qqrq768gavfrgours.png" alt="Notion" width="800" height="668"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.notion.so/" rel="noopener noreferrer"&gt;Notion&lt;/a&gt; is our key information holding system. I keep saying it deserves its own blog post. One of these days. The combination of easy-to-use wiki-like document creation and editing functionality with the ease of adding ad hoc databases/tables has proven to be great for putting structured (data) and unstructured (text) information together. However, a consistent complain about Notion is its search capabilities. Personally, I find its search good enough for me. But most people don’t.&lt;/p&gt;

&lt;p&gt;We previously used Notion for task tracking and sprint planning (just another database), but we recently switched to Jira for that. The highest level of planning, the roadmaps, stayed in Notion though. Notion has connectors for Jira and GitHub, which makes such integrations possible. But I can see how many people would prefer to stay within one tool and use either Jira or GitHub projects for all of their planning and tracking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Tools
&lt;/h2&gt;

&lt;p&gt;GitHub Actions is our preferred CI/CD mechanism today. Before I joined, CI/CD pipelines were run using Concourse for Server and SDKs. We are gradually moving away from Concourse because GitHub Actions proved to be easier to manage. At the same time, we are working on leveraging Temporal Workflows for pipeline orchestration. This is not for ideological reasons, but simply because our engineers find it a lot more convenient to look at Workflow histories than to dig through flat logs. Support for failure handling and automatic retries is an obvious strength of Temporal. We are looking to open-source this framework when it takes its mostly final shape. Ironically, GitHub Actions orchestrations are called “workflows” as well, confusingly enough.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe44id9p75mtitc6n1mho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe44id9p75mtitc6n1mho.png" alt="k9s" width="800" height="420"&gt;&lt;/a&gt;&lt;br&gt;
k9s is a clear favorite when it comes to interactive operations on Kubernetes clusters. It’s primarily used while on call, to monitor things and sometimes to make an immediate ephemeral change. Some people still prefer to use kubectl or VS Code instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn98xjiim3ktuh3zzby85.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn98xjiim3ktuh3zzby85.png" alt="Temporal Cloud" width="800" height="703"&gt;&lt;/a&gt;&lt;br&gt;
By early 2021 we had a handful of paying customers, whom we called design partners. The rationale was that even back then Temporal Server was a mature product, having already been used at scale in production for several years by Uber and others for mission-critical applications, in the form of Cadence. Hence, it was a very solid core for the Data Plane.&lt;/p&gt;

&lt;p&gt;We started by deploying several Capacity Units of the Data Plane, which we unoriginally call Cells, in a barely automated way, with a combination of scripts and manual operations. This unlocked monetization for the product before the Control Plane automation was put in place. The downside was that we had to support production Cells, including being on call, without sufficient tooling to perform routine operations. Therefore, the operational burden was taking engineering cycles away from development. The benefit of this setup was that it constantly kept us honest—any new Data Plane or Control Plane feature had to be production-ready from the start. We would also receive immediate feedback from the design partners on any and all changes. I think this was the right tradeoff that helped us to keep the quality bar high and to avoid building esoteric features.&lt;/p&gt;

&lt;p&gt;For a long while, our customers had to file support tickets for any changes to configuration of their Cloud Namespaces: creating a Namespace, updating certificates, inviting users, and so on. That meant the on-call engineer was required to make such changes on the customer’s behalf. This very manual process was a bottleneck for sales because we weren’t ready to take a large number of customers that we’d have to support via such an unscalable process. This meant we had to be very strategic about which customers to take.&lt;/p&gt;

&lt;p&gt;Last May we set for ourselves the goal of providing a full self-serve experience. In general, as a company we avoid date-driven releases. In this case not only did we set a target date of early October, we also put a forcing function upon ourselves—we decided that we would announce availability of Temporal Cloud at the inaugural Replay conference in August. A lot of hard work happened between May and October to deliver on the promise. We ended up slipping on a couple of minor features, but otherwise delivered &lt;a href="https://cloud.temporal.io/" rel="noopener noreferrer"&gt;https://cloud.temporal.io/&lt;/a&gt;. With the self-serve functionality in place, we were able to quickly process the backlog of companies that had been waiting “in line” to get to Temporal Cloud and started taking new customers in a real-time fashion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Control Plane
&lt;/h2&gt;

&lt;p&gt;Each Cell in Temporal Cloud is a composition of several compute clusters, one or more databases, Elasticsearch, ingress, an observability stack, and other dependency components. As one would expect, we needed a Control Plane that would manage provisioning of such resources, deployment of software and configuration changes to Cells, monitoring, alerting, and handling certain classes of failures.&lt;/p&gt;

&lt;p&gt;We chose to leverage Temporal for building the Control Plane. This decision was not made for ideological reasons either, but because Temporal is indeed an excellent fit for automating infrastructure management. This is actually one of the popular Temporal use cases. Operations on cloud resources can take time, sometimes a fairly long time. A dependency service might return various retriable and non-retriable errors. Retries usually require backoff. Failures require compensating actions. In general, operations in a Control Plane like that walk and quack like Workflows. We could take the “duct tape” route, writing code with a bunch of timers and queues. We could use a DSL for defining Workflows. But Temporal exists as a product in large part because these two approaches are much less developer friendly than Workflows As Code.&lt;/p&gt;
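&lt;p&gt;The compensating-actions part can be sketched as a simple saga. This is an illustrative Python sketch with invented resource names, not our actual Control Plane code: each completed step registers an undo action, and a failure part-way through unwinds the completed steps in reverse order.&lt;/p&gt;

```python
def provision_cell(fail_at=None):
    # Each forward step is paired with its compensating action.
    compensations = []
    log = []
    steps = [("create_vpc", "delete_vpc"),
             ("create_cluster", "delete_cluster"),
             ("create_database", "delete_database")]
    try:
        for do, undo in steps:
            if do == fail_at:
                raise RuntimeError(do + " failed")
            log.append(do)
            compensations.append(undo)
    except RuntimeError:
        for undo in reversed(compensations):  # compensate in reverse order
            log.append(undo)
    return log

print(provision_cell(fail_at="create_database"))
```

&lt;p&gt;With Workflows As Code, the try/except above stays as ordinary code, while the engine guarantees the compensation steps actually run even if the process executing them crashes midway.&lt;/p&gt;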

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcktwtvh59cp7i9jug5w4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcktwtvh59cp7i9jug5w4.png" alt="CP workflow" width="800" height="578"&gt;&lt;/a&gt;&lt;br&gt;
Reconstructing from logs what happened during the execution of a long process with a bunch of sequential and nested steps is not a pleasant experience. This is where Temporal also shines. Calls to various services are expressed as Activities with default or custom Retry Policies. Execution of a complex sequence of operations can be structured into a hierarchy of Child Workflows. Each Workflow has a complete history of its execution retained, with all inputs and outputs, timestamps, and errors automatically captured.&lt;/p&gt;

&lt;p&gt;I feel the architecture of the Control Plane deserves its own blog post and could be made into a conference talk. Bottom line is we have an Inception-like setup—a Control Plane Cell executes a bunch of Workflows that manage a bunch of production Cells with customer namespaces and Workflows running in a multi-tenant configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scale
&lt;/h2&gt;

&lt;p&gt;One of the core characteristics of Temporal is that a Temporal Cluster can scale linearly with the database and compute resources given to it, although as with any scaling, there are practical limits. The key reason for the Cell architecture is to minimize and contain the blast radius of any issue within an individual Cell. In other words, if we run ten Cells in a region, an issue with one of them, be it a software edge case or a transient infrastructure incident or a human operator mistake, would only impact about 10 percent of customers in the region. If we were instead to run a single Cell ten times the size, any issue would potentially impact all customers in the region.&lt;/p&gt;

&lt;p&gt;At the same time, being able to run large Cells has value. We have customers that need high throughput for a single Namespace. (A Namespace cannot span Cells today.) Large Cells are also good for absorbing spikes in traffic of individual Namespaces.&lt;/p&gt;

&lt;p&gt;We are still learning the sweet spot for balancing between scaling out and scaling up. We’ve been doing both. We’ve invested in supporting high scale for individual Cells and Namespaces. For example, one of our customers needed to be able to increase their already pretty high traffic by ten times for a day. That’s why I referred to the team that’s been focusing on the scale-up work as the “10x” team. Last summer they hit a milestone of processing one million state transitions (the lowest-level unit of work in Temporal) per second. This required some deep investments at the intersection of distributed systems and database techniques. And we’ve only scratched the surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;We had a leadership offsite in December where we discussed our plans for 2023. We settled on a number of investment areas and priorities for the year. I cannot share them because we haven’t communicated even the public part of it yet.&lt;/p&gt;

&lt;p&gt;The biggest investment without a doubt will be into people. In my opinion, we managed to attract 116 top-notch people as our first employees. One candidate asked during the interview what I was most proud of at Temporal. I was caught off guard and had to think for a few seconds. I said that number one on my list was that I had helped to hire the great people we have. Fun fact—the candidate accepted our offer and quickly established themselves as an amazing engineer.&lt;/p&gt;

&lt;p&gt;We are still learning how to work more efficiently with an organically growing geographic graph of employees. With the growth of the company, we need to not only continue to hire people who are better than us, we need to evolve how we work, plan, and collaborate. We need to always be adjusting how we do things.&lt;/p&gt;

&lt;p&gt;This is the first time in my professional career that I can truly contribute to shaping the company, its products, business, and culture. It’s been an amazing and pure experience. Quite a drug, to be honest. No surprise, I continue to be all in on Temporal, just as I was when I joined. Even more so now, as things got so much more real. Ideas turn into products and features, hires form high-functioning teams, opinions and thoughts become how we roll. And we are still only getting started.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>softwaredevelopment</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Inversion of Execution</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Wed, 01 Dec 2021 20:35:56 +0000</pubDate>
      <link>https://forem.com/temporalio/inversion-of-execution-lc9</link>
      <guid>https://forem.com/temporalio/inversion-of-execution-lc9</guid>
      <description>&lt;p&gt;At first glance, I made the same assumption that I see many other people making - that Temporal executes their workflow code. As I looked deeper, I learned that the execution model of Temporal is actually more interesting, and more powerful. I figured I should write a post describing the execution model at a high level. Hopefully, this will help give people a starting mental model for Temporal’s core architecture.&lt;/p&gt;

&lt;p&gt;I call it: &lt;strong&gt;Inversion of Execution&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimer
&lt;/h2&gt;

&lt;p&gt;I am not saying that Temporal was first to implement this kind of execution model. It wasn't. There were earlier products that used similar ideas. I'll leave the genealogy research for an aspiring PhD student to perform. Somebody probably wrote a paper laying all this out back in the 1970s.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Temporal is a workflow engine. That made me immediately think about how I would deploy my workflow (application) code to it, and how it ensures security, performance isolation, versioning, and all the other concerns that accompany running somebody else's code. I see many people making the same set of assumptions and having similar concerns at the beginning of their Temporal journey.&lt;/p&gt;

&lt;p&gt;In reality, &lt;strong&gt;Temporal doesn't execute application code at all&lt;/strong&gt;. It "only" orchestrates execution of your code in order to drive workflows to completion. Application code leverages Temporal via the client SDKs in such a way that it's not even immediately apparent where the Server gets into the picture.&lt;/p&gt;

&lt;p&gt;In other words, there is an &lt;strong&gt;Inversion&lt;/strong&gt; of the standard &lt;strong&gt;Execution&lt;/strong&gt; model we are used to.&lt;/p&gt;

&lt;p&gt;That's how the analogy with Inversion of Control came to my mind. Instead of injecting dependencies, Temporal "injects" steps of execution (known as tasks) into application code. Temporal can run (and, if necessary, rerun) those steps to overcome intermittent failures of outgoing calls and to recover from infrastructure failures. There are two kinds of tasks in Temporal: Workflow Tasks and Activity Tasks. They have different purposes and use separate persistent queues for reliability and scalability.&lt;/p&gt;

&lt;p&gt;This is the topology of a typical Temporal-based system:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgqtxonfv2plprs3ap04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgqtxonfv2plprs3ap04.png" alt="Architecture" width="800" height="540"&gt;&lt;/a&gt;&lt;br&gt;
Temporal Server is usually run as a cluster of servers for reliability and scalability. The Server is backed by a database for persisting state.&lt;/p&gt;

&lt;p&gt;None of your application code ever runs on the server. Instead, application logic runs on the client side of the picture. It is also usually organized as a cluster of Worker Processes, for scalable execution.&lt;/p&gt;

&lt;p&gt;The umbilical cord connecting the client to the server is the Client Runtime piece (a.k.a. SDK). Don’t underestimate it based on the name. Unlike typical SDKs that are mostly language-specific wrappers around a set of server APIs, Temporal SDKs do much, much more than that. They contain the complex machinery that hides all interactions with the server and handles the non-trivial process of reconstructing client state after a failure. The goal is to free the business logic from all of that complexity, and enable the application code to be concise and easy to write.&lt;/p&gt;
&lt;h2&gt;
  
  
  Bank Transfer Example
&lt;/h2&gt;

&lt;p&gt;Let's look in more detail at how Temporal actually executes workflows, using the canonical example of transferring a sum of money from one account to another. While a real financial transaction may involve many steps, we will keep it simple and just talk about performing one debit and one credit operation. That should be enough for showing the mechanics of Temporal.&lt;/p&gt;

&lt;p&gt;A workflow like this is typically started by a client process, for example, a web frontend, with a single gRPC call to the server through the SDK.&lt;/p&gt;

&lt;p&gt;Go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;we&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExecuteWorkflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;workflowOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;money&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TransferFunds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Java:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;WorkflowClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;transferWorkflow:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;transfer&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amountCents&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this call succeeds, the money transfer workflow has been accepted by Temporal Server. Even if the client process crashes immediately after that, Temporal will drive execution of the workflow to eventual completion or failure.&lt;/p&gt;

&lt;p&gt;Under the covers, Temporal Server has already persisted a record of the workflow with its arguments, options, and unique execution ID. As part of the same transaction, it put a Workflow Task into the corresponding Workflow Task Queue. This task contains a &lt;code&gt;WorkflowExecutionStarted&lt;/code&gt; event as the first event in the workflow’s history.&lt;/p&gt;

&lt;p&gt;Workflow implementation code (in our example &lt;code&gt;transferWorkflow::transfer&lt;/code&gt;) is compiled into a Workflow Worker. When a Workflow Worker process starts, it connects to Temporal Server via the Temporal Client Runtime (a.k.a. Temporal SDK) and starts long-polling the Workflow Task Queue for tasks to execute. That's how it receives from the server the Workflow Task containing the &lt;code&gt;WorkflowExecutionStarted&lt;/code&gt; event.&lt;/p&gt;

&lt;p&gt;Note that the Workflow Worker didn't even have to be up and connected to the Server when the workflow was submitted by the client. It could connect later or experience a temporary outage. It doesn't matter; once the application worker connects, it will pull the task from the queue and start executing it.&lt;/p&gt;
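&lt;p&gt;To make this decoupling concrete, here is a toy model (not the real server or SDK, just a sketch): a durable queue holds the initial Workflow Task until some worker polls for it, so the worker can connect at any time after submission.&lt;/p&gt;

```go
package main

import "fmt"

// Toy model of a durable task queue: the server enqueues the initial
// Workflow Task at submission time; a worker may connect much later
// and still find the task waiting.
type TaskQueue struct {
	tasks []string
}

func (q *TaskQueue) Enqueue(task string) {
	q.tasks = append(q.tasks, task)
}

// Poll returns the next task, or "" if the queue is empty.
func (q *TaskQueue) Poll() string {
	if len(q.tasks) == 0 {
		return ""
	}
	task := q.tasks[0]
	q.tasks = q.tasks[1:]
	return task
}

func main() {
	q := new(TaskQueue)

	// The client submits a workflow; the server records it and enqueues
	// a Workflow Task even though no worker is connected yet.
	q.Enqueue("WorkflowTask{WorkflowExecutionStarted}")

	// A Workflow Worker starts later and long-polls the queue.
	fmt.Println("worker picked up:", q.Poll())
}
```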

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqxoznaqexfrsn2bdt59x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqxoznaqexfrsn2bdt59x.png" alt="Inversion of Execution" width="800" height="649"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Workflow Worker (there's usually more than one worker process running, for redundancy and scalability) receives the initial Workflow Task that contains a &lt;code&gt;WorkflowExecutionStarted&lt;/code&gt; event as the first event of the workflow history. This event includes a &lt;code&gt;workflowType.name&lt;/code&gt; field that indicates which workflow type the Client Runtime needs to invoke. The runtime performs a lookup in the map of registered workflow types and calls the corresponding application function, which starts the actual execution of the workflow logic.&lt;/p&gt;

&lt;p&gt;Workflow code defines a sequence of steps, not necessarily linear, that need to be executed for a workflow to complete. Some of these steps are actions that involve communication with external systems. In our example, they are the Withdraw and Deposit functions on the two given accounts. Such functions are referred to in Temporal as Activities. They are the second most important concept in Temporal after Workflows.&lt;br&gt;
Workflow code invokes an Activity by calling the &lt;code&gt;workflow.ExecuteActivity&lt;/code&gt; function in Go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExecuteActivity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;money&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DebitFunds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or by invoking Activity stubs in Java:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Override&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;transfer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;fromAccountId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;toAccountId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;referenceId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;amountCents&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withdraw&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fromAccountId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;referenceId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amountCents&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;deposit&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;toAccountId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;referenceId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amountCents&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Execution of Activities is tracked and orchestrated by Temporal Server, much like Workflow Tasks are handled. When &lt;code&gt;account.withdraw()&lt;/code&gt; is called in our example, the Client Runtime sends a command to the server behind the scenes - a request to execute an Activity of the given type with the provided arguments. This request gets recorded in the Workflow History as an &lt;code&gt;ActivityTaskScheduled&lt;/code&gt; event, and the corresponding Activity Task gets added to the Activity Task Queue. These updates are performed transactionally, so that the Workflow History and Task Queues are always in sync, and no task can be lost.&lt;/p&gt;

&lt;p&gt;The Activity Task then gets picked up and executed by one of the Activity Workers long-polling the Activity Task Queue. Activity Workers are conceptually similar to Workflow Workers, and are often combined into a single Worker process. Different Workflow and Activity types can even be isolated into their own Worker processes, for security, performance, or other concerns.&lt;/p&gt;

&lt;p&gt;The important part to stress here is that &lt;strong&gt;Temporal Server records the intent to execute application steps before their actual execution, and then records the results they produced after they complete&lt;/strong&gt;. This makes it possible to resume execution of application logic after any failure, from the step that previously failed. These recorded application steps, Workflow and Activity Tasks, are then passed to Application Workers for execution. Hence the notion of &lt;strong&gt;Inversion of Execution&lt;/strong&gt; - an application process is told by Temporal Server (via the Client Runtime) what tasks to execute and when.&lt;/p&gt;
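&lt;p&gt;A minimal sketch of this write-ahead idea, with hypothetical names (this is not Temporal's actual implementation): every step is recorded as scheduled before it runs and as completed with its result afterwards, so a restarted process can skip the steps that already succeeded.&lt;/p&gt;

```go
package main

import "fmt"

// Toy sketch of recording intent before execution. Each step is written
// to a durable history as scheduled before it runs and as completed with
// its result afterwards; a restarted process skips completed steps.
type History struct {
	events    []string
	completed map[string]string // step name to recorded result
}

func NewHistory() *History {
	h := new(History)
	h.completed = make(map[string]string)
	return h
}

// RunStep executes fn only if no result was recorded for this step;
// otherwise it returns the recorded result, simulating recovery.
func (h *History) RunStep(name string, fn func() string) string {
	if result, ok := h.completed[name]; ok {
		return result // already done before the crash; do not re-execute
	}
	h.events = append(h.events, "ActivityTaskScheduled:"+name) // intent first
	result := fn()
	h.events = append(h.events, "ActivityTaskCompleted:"+name)
	h.completed[name] = result
	return result
}

func main() {
	h := NewHistory()
	h.RunStep("Withdraw", func() string { return "ok" })
	// Imagine the process crashes here; the history survives.

	// On restart, the workflow re-runs: Withdraw is served from history,
	// and only Deposit actually executes.
	calls := 0
	h.RunStep("Withdraw", func() string { calls++; return "ok" })
	h.RunStep("Deposit", func() string { calls++; return "ok" })
	fmt.Println("side effects after restart:", calls)
}
```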

&lt;h2&gt;
  
  
  Benefits
&lt;/h2&gt;

&lt;p&gt;This model of execution brings a number of benefits.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Reliability
&lt;/h3&gt;

&lt;p&gt;Each step of a workflow execution is recorded first with all its inputs, so that the workflow can resume from where it left off after any failure, or even a complete shutdown of the system. When an Application Worker starts after a failure or a planned update and receives a task to resume execution of a workflow from a particular step (the one after the last successfully executed step), the Client Runtime deterministically replays execution of the Workflow Tasks that already succeeded, so that the workflow can continue as if there had been no failure or restart. Interestingly enough, the application code doesn’t even know whether it’s executing a task that previously failed, running it for the first time, or replaying a task that already succeeded. That's why the Temporal programming model is often referred to as Fault-Oblivious: application code can indeed stay mostly unaware of recoverable failures.&lt;/p&gt;
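&lt;p&gt;Here is a simplified model of replay (hypothetical types, not the actual SDK machinery): the workflow function is re-executed from the top, but recorded activity results are served from history, so side effects are not repeated and the code cannot tell a replay from a first run.&lt;/p&gt;

```go
package main

import "fmt"

// Simplified model of deterministic replay: the workflow function runs
// from the top, but activity results are served from recorded history,
// so activities (the side effects) never run twice.
type Replayer struct {
	history     []string // recorded activity results, in order
	cursor      int
	sideEffects int
}

func (r *Replayer) ExecuteActivity(fn func() string) string {
	if r.cursor >= len(r.history) {
		// First execution: actually run the activity and record the result.
		result := fn()
		r.history = append(r.history, result)
		r.sideEffects++
		r.cursor++
		return result
	}
	// Replay: return the recorded result without re-running the activity.
	result := r.history[r.cursor]
	r.cursor++
	return result
}

// The "workflow function": plain code that calls activities in order.
func transfer(r *Replayer) string {
	a := r.ExecuteActivity(func() string { return "withdrawn" })
	b := r.ExecuteActivity(func() string { return "deposited" })
	return a + "," + b
}

func main() {
	r := new(Replayer)
	first := transfer(r)

	r.cursor = 0 // simulate a worker restart: replay from the top
	second := transfer(r)

	// Same result, and each activity ran exactly once.
	fmt.Println(first == second, r.sideEffects)
}
```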

&lt;h3&gt;
  
  
  2. Security
&lt;/h3&gt;

&lt;p&gt;The separation of concerns provided by the Inversion of Execution approach, where Temporal only orchestrates execution without actually running application code, makes it possible for applications to run with zero trust in Temporal. Every payload (arguments and results) of Workflows and Activities can be encrypted on the Application Worker side (via the &lt;a href="https://docs.temporal.io/docs/content/what-is-a-data-converter" rel="noopener noreferrer"&gt;Data Converter&lt;/a&gt; feature), so that Temporal Server has no way of knowing what the application is doing or what data it is operating on. This clean separation makes security and compliance reviews much easier to conduct because the zero-trust story is very compelling.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Scalability
&lt;/h3&gt;

&lt;p&gt;Execution of each workflow is independent and logically isolated from all other workflows in the system. Temporal is only responsible for recording execution steps and communicating with application workers. Because of that, Temporal Servers are relatively easy to scale out by adding more servers and redistributing responsibilities for different workflow ID ranges (shards) between them. Of course, storage needs to be able to scale up or, better, out accordingly.&lt;/p&gt;
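&lt;p&gt;The shard idea can be sketched in a few lines. This is the general technique of consistent placement by hashed ID, shown for illustration rather than as Temporal's exact scheme:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// A workflow ID is hashed to one of a fixed number of shards; each server
// in the cluster owns a subset of the shards. Moving shard ownership
// between servers redistributes load without touching workflow state.
func shardFor(workflowID string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(workflowID))
	return h.Sum32() % numShards
}

func main() {
	for _, id := range []string{"transfer-1", "transfer-2", "transfer-3"} {
		fmt.Printf("%s is owned by shard %d\n", id, shardFor(id, 16))
	}
}
```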

&lt;p&gt;Scalability is made easier because Temporal is not responsible for hosting and executing application code, and only needs to scale with the rate of execution tasks being generated and recorded. Scalability of application code (workers) is handled separately, and such clear separation makes it easier to find and eliminate bottlenecks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is this a queueing system?
&lt;/h2&gt;

&lt;p&gt;If you think that this looks like a task queueing system, I wouldn’t say you are wrong. Temporal can be viewed as a specialized queue of execution. In a way. But if you have ever implemented a high-throughput reliable message processing system that used queues, I'm sure you’ve had to solve a number of non-trivial problems in order to achieve good performance with strong consistency: retries and backoff in case of failures, deduplication of reprocessed messages, handling a failing or slow message that is blocking a queue partition, dead-letter queues and violations of ordering guarantees, etc. These are hard problems to solve.&lt;/p&gt;

&lt;p&gt;Temporal provides an opinionated set of solutions to these problems, and raises the level of abstraction, so that application developers don't have to think about them. Inversion of Execution is one of the core opinionated decisions, and I hope this post helps you see why.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Curse of the A-word</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Wed, 05 May 2021 17:04:53 +0000</pubDate>
      <link>https://forem.com/temporalio/the-curse-of-the-a-word-1o7i</link>
      <guid>https://forem.com/temporalio/the-curse-of-the-a-word-1o7i</guid>
      <description>&lt;p&gt;I wanted to write this post ever since I saw David Fowler's tweet and the discussion it triggered.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqt8irq2n337p5tsf7u4i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqt8irq2n337p5tsf7u4i.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
While it may have taken me quite some time to get around to answering it, 2020 wasn't an ordinary year by any measure.&lt;/p&gt;

&lt;p&gt;I took my first steps into the Actor Model space more than a decade ago, when I started working on the Orleans project. I have been and will continue to be an enthusiast of actors. We published a few papers and I gave a number of talks about them. However, over time I gradually stopped using the term 'actors' even when explaining the properties and benefits of the Actor Model. This post is an attempt to explain why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minefield of Conflations
&lt;/h2&gt;

&lt;p&gt;When we open-sourced Orleans in January of 2015, I was surprised by the amount of debate it generated on a seemingly trivial topic: whether or not Orleans was in fact a faithful implementation of the Actor Model, and whether grains were actors at all. Even though we had published the tech report "&lt;a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Orleans-MSR-TR-2014-41.pdf" rel="noopener noreferrer"&gt;Orleans: Distributed Virtual Actors for Programmability and Scalability&lt;/a&gt;" nine months beforehand, it didn't seem to help much in those discussions.&lt;/p&gt;

&lt;p&gt;It honestly took us more than a year to reach the point where Virtual Actors of Orleans were generally recognized as a legitimate interpretation of the Actor Model: not just another interpretation, but one with unique benefits, especially for high-scale applications like cloud services. I was shocked by the uphill battle it took to get there. Over time, I came to realize that a big reason for these debates is that actors are inherently a minefield of conflations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conflation #1: Distributed &amp;amp; Local
&lt;/h3&gt;

&lt;p&gt;Whenever somebody would say "we are using actors", I learned to first ask, "is it for a distributed system or a single process?" This question was necessary because many developers use actors as a concurrency mechanism, leveraging their "processes one message at a time" property. The Orleans team was coming from the C#/.NET background, where there was already strong support for concurrency and asynchrony, with features like Promises (Tasks) and await. So from our vantage point, there was little reason to use actors just for basic concurrency. However, in languages with less native support for concurrency (e.g., Java), local actors (used within a single process) continue to be a useful mechanism for concurrency and asynchrony.&lt;/p&gt;

&lt;p&gt;Both local and distributed actors adhere to the same three rules of the Actor Model definition — in response to a message, an actor can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Send messages to other actors&lt;/li&gt;
&lt;li&gt;Create actors&lt;/li&gt;
&lt;li&gt;Change its behavior for the next message&lt;/li&gt;
&lt;/ul&gt;
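&lt;p&gt;A minimal single-process sketch of these rules (a toy illustration, not any particular framework) might look like this, with the third rule implemented as swapping the behavior function:&lt;/p&gt;

```go
package main

import "fmt"

// A minimal actor: its behavior is just a function, and handling a message
// may replace that function, which is rule 3 ("become").
type Actor struct {
	behavior func(a *Actor, msg string)
}

func (a *Actor) Receive(msg string) {
	a.behavior(a, msg)
}

func main() {
	shouting := func(a *Actor, msg string) {
		fmt.Println("SHOUTING:", msg)
	}
	polite := func(a *Actor, msg string) {
		fmt.Println("politely:", msg)
		if msg == "get angry" {
			a.behavior = shouting // change behavior for the next message
		}
	}

	actor := new(Actor)
	actor.behavior = polite
	actor.Receive("hello")
	actor.Receive("get angry")
	actor.Receive("hello") // now handled by the new behavior
}
```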

&lt;p&gt;However, distributed actors that live in a cluster of servers exist in a very different and more hostile environment; an environment of network messages, latencies, failure modes, and uncertainty about their state.&lt;/p&gt;

&lt;p&gt;Despite the commonality of the three core rules that apply to both, I would argue there's very little else in common between local and distributed actors, especially when it comes to application design, tradeoffs, failure modes, and major aspects of how actor-supporting runtimes are implemented.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conflation #2: Supervision Trees &amp;amp; Actors
&lt;/h3&gt;

&lt;p&gt;Erlang was the first popular implementation of the Actor Model. Arguably, it is Erlang that is responsible for bringing actors into the mainstream and pioneering a number of important design choices. One of them was the idea of supervisors: actors that are responsible for handling failures of other actors by recreating or restarting them, etc. Supervisors are usually arranged in hierarchies, known as supervision trees. These trees make it easy to reset a system of interconnected actors into a known state after a failure.&lt;/p&gt;

&lt;p&gt;Akka, being a faithful adaptation of Erlang ideas to the JVM world, also implemented supervision trees as the key failure handling mechanism. When your goal is to build a resilient system that cleanly resets chunks of its state in response to a failure, this makes a ton of sense. The subtlety that supervision trees in Erlang and Akka are just one way to implement failure handling for actors was lost on many people. In their minds, supervisors and supervision trees became part of the Actor Model itself.&lt;/p&gt;

&lt;p&gt;It took us a lot of effort to explain why we chose a different approach (Virtual Actors) to handling failures in Orleans. The Virtual Actor method of automatic lifecycle management by the runtime doesn't use supervisors and has its benefits, especially in many cloud scenarios. Keep in mind that the supervision tree approach may be superior in other cases, such as where you have a hierarchy of actors and need the ability to reset it. The point is that "actors" ≠ "supervision trees", and it's a tax having to explain it to new people coming from the Erlang/Akka background.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conflation #3: Message Passing &amp;amp; One-Way Messages
&lt;/h3&gt;

&lt;p&gt;In the world of traditional actors, it is more common to send one-way messages without expecting an immediate response. More than that, the request-response (RPC) pattern is considered dangerous. Actor developers are told to use it with extra care because the calling actor will be blocked until a response is received.&lt;/p&gt;

&lt;p&gt;In Orleans, we chose the opposite default, with asynchronous RPC being the primary way of invoking actors. Each such RPC call has a built-in timeout, which removes the need for developers to worry about their actor getting blocked forever. Actors can also be marked as reentrant, so that they aren’t blocked from processing other calls at all while awaiting a response.&lt;/p&gt;

&lt;p&gt;Multiple asynchronous RPC calls can be made by an actor concurrently, e.g. to fan out to a number of other actors. The elegance of async/await makes it trivial to merge the resulting promises in the desired way and await a joint Promise for the whole fan-out operation.&lt;/p&gt;

&lt;p&gt;One-way messages are also supported in Orleans, but they are not the primary pattern, because in most cases developers want to know at least whether a call successfully reached its target, failed, or timed out.&lt;/p&gt;

&lt;p&gt;This is yet another fundamental area with a significant "explanation tax", incurred by the different choices other implementations of the Actor Model have made. I suspect that if we had not used the term "actor" in defining Orleans from the beginning, we wouldn't have spent so much effort explaining ourselves. The async/await pattern for efficiently managing asynchrony had been established in the .NET ecosystem a long time ago, and there's no expectation of supervision trees in that developer community.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conflation #4: State transition &amp;amp; Become
&lt;/h3&gt;

&lt;p&gt;This is a smaller issue. However, I’ve had several conversations with people who insisted on a specific interpretation of the third rule of actors (that they can change their behavior for processing subsequent messages). They interpreted it to mean that there must be an explicit way to tell an actor to become something different. The claim was that if your actors don't support an explicit feature like that, they are not real actors.&lt;/p&gt;

&lt;p&gt;In my opinion, this rule simply means that an actor can change its internal state, whether it’s a formal state machine or a boolean/enum flag that will define how the actor should process another call. For example, Digital Twins are a mainstream pattern to model program entities that shadow physical IoT devices in order to reflect their state and to communicate with them. Actors are an obvious fit to implement Digital Twins.&lt;/p&gt;

&lt;p&gt;When a Digital Twin actor receives a "turn device off" command, it is very natural for the actor to flip an internal state variable that reflects the “off” state. In that state, the actor ignores or rejects all commands except for a "turn device on", which flips that variable back to “on”.&lt;/p&gt;

&lt;h2&gt;
  
  
  Elephant in the Room: Actors &amp;amp; Models
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgwatx1rr8zxfamkbnx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgwatx1rr8zxfamkbnx0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I've seen a number of presentations about actors that start with a meme slide showing some famous actor’s photo. This is because the vast majority of developers have never heard of the Actor Model. This would be OK if the term "actor" carried some intuitive connotation for them. In my experience, it does not. Even worse, when presenting about Orleans, the minority of the audience that knew about actors often had the conflations listed above in mind. It was a no-win situation for both parts of the audience. Every time I presented, I had to spend energy and time pushing that boulder up the hill. At some point I stopped doing that and avoided talking about actors altogether.&lt;/p&gt;

&lt;p&gt;Instead, I started describing grains in Orleans as objects that live somewhere within a cluster of servers. These objects have stable identities and are always available for an invocation. Objects are a widely understood concept. It is easy to build on that concept by adding that each such object has a unique identifier of your choosing, hides (encapsulates) its state, and is only accessible via asynchronous method calls defined as part of an interface. Object, interface, method call — none of these are new concepts to grasp. You just have to imagine objects working transparently across machine boundaries in the combined memory and compute space of a cluster. This approach was more effective, catering to a wide range of audiences: from academics, to experienced cloud developers, to “I want to learn about building scalable applications” developers.&lt;/p&gt;

&lt;p&gt;I'm happy that we chose to call Orleans actors "grains", not "actors". "Grain" is not a perfect term by any means, but at least it conveys the general idea of a rather small piece of an application. I would argue it is much better than "actor".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dotnet.github.io/orleans/docs/index.html" rel="noopener noreferrer"&gt;The landing page of Orleans documentation&lt;/a&gt; nowadays only mentions actors once — in reference to our &lt;a href="https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/" rel="noopener noreferrer"&gt;Orleans: Distributed Virtual Actors for Programmability and Scalability&lt;/a&gt; 2014 paper. This is a result of our conscious effort of reducing the cognitive load on people that come to the page to learn about Orleans.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/reubenbond" rel="noopener noreferrer"&gt;Reuben Bond&lt;/a&gt; recently started describing grains as Cloud Native Objects. Again, not a perfect term. But I like it because it tries to convey the benefits of the model and where it is most applicable. &lt;a href="https://twitter.com/RogerAlsing" rel="noopener noreferrer"&gt;Roger Johansson&lt;/a&gt; even suggested a CNOB acronym for it. 🙂&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjtxg7phtl0vu8yfk6jh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjtxg7phtl0vu8yfk6jh.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I am a big fan of the Actor Model as a simple and clean model of computations. In particular, it is an excellent fit for building distributed systems, whether it be on premises or in the cloud. I am forever grateful to Carl Hewitt, Peter Bishop, and Richard Steiger for their original insight, and to many subsequent followers that pushed those ideas forward. Implementations of the Actor Model and the ideas it pioneered power many high-scale and mission-critical systems today.&lt;/p&gt;

&lt;p&gt;At the same time, I’m convinced now that the term "actor" was a rather unfortunate choice of name. It took me years to gradually arrive at this realization. In my view, the very word “actor” continues to be a major barrier to adoption of the Actor Model ideas by the broader population of developers. I cannot formally prove it; this is just my speculation, of course. I listed above several other contributing factors that, in my opinion, add to this confusion.&lt;/p&gt;

&lt;p&gt;Remember that old saying about the two hard things in computer science: naming things and cache invalidation? I believe it is part of the answer to the question, “Why aren’t actor frameworks more popular?”&lt;/p&gt;

&lt;p&gt;In my opinion, the Cloudflare folks made a pragmatic choice to call their (for all practical purposes) virtual actors "durable objects." Once again not perfect, but much more developer friendly than "actors." I like Reuben's idea of calling grains Cloud Native Objects. It helps people quickly get a high-level intuitive understanding of what it is and decide if it's relevant to them.&lt;/p&gt;

&lt;p&gt;If there’s a better term, I'm open to your ideas. Just not "actors", please.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>discuss</category>
      <category>career</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Temporal: How It's Going</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Wed, 31 Mar 2021 17:10:52 +0000</pubDate>
      <link>https://forem.com/temporalio/temporal-how-it-s-going-1c4f</link>
      <guid>https://forem.com/temporalio/temporal-how-it-s-going-1c4f</guid>
      <description>&lt;p&gt;I wrote about how it started in &lt;a href="https://docs.temporal.io/blog/sergey-why-i-joined-temporal" rel="noopener noreferrer"&gt;Why I Joined Temporal&lt;/a&gt;. Somebody suggested I should post an update about how things are six months later. How am I feeling after jumping from a corporate cliff into the whitewater of startup life?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0375mjxwcie6auoialwd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0375mjxwcie6auoialwd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;code&gt;https://xkcd.com/1782/&lt;/code&gt;
&lt;/h6&gt;

&lt;h2&gt;
  
  
  Technology
&lt;/h2&gt;

&lt;p&gt;I had used Windows as my primary platform and hadn’t touched a Mac since the 90s. I was fairly confident that switching to a whole new platform would be a speedbump. Surprisingly, it was much smoother than I anticipated. Obviously, MacOS has its own idiosyncrasies, and I had to rebuild muscle memory for some keyboard shortcuts, but overall it’s close enough to Windows with WSL2. I prefer the Surface Pro for personal use because of its form factor, touch screen, and how comfortable I am using it.&lt;/p&gt;

&lt;p&gt;I cheated by continuing to use the Microsoft Ergonomic Keyboard and a two-button mouse, so I didn't have to use long clicks for context menus. The one thing that continues to drive me crazy is the inconsistency in how the Home and End keys are interpreted in different apps.&lt;/p&gt;

&lt;p&gt;About half of Temporal engineers develop on Linux machines. Over time, I realized why: just like on Windows, Docker does not run natively on MacOS. I recently wasted several hours trying to figure out why &lt;code&gt;sed&lt;/code&gt; and even &lt;code&gt;date&lt;/code&gt; weren't working as expected, only to find that it was because of the difference between the GNU and BSD versions of these tools!&lt;/p&gt;

&lt;p&gt;From the coding point of view, the JetBrains IDEs are simply awesome: lightweight, cross-platform, and cross-language. In hindsight, I should have started using Rider back at Microsoft for its light weight and cross-platform support. At Temporal, I mostly use GoLand, occasionally IntelliJ, and Rider (to stay up to speed with the fast development in the Orleans world).&lt;/p&gt;

&lt;h3&gt;
  
  
  Golang
&lt;/h3&gt;

&lt;p&gt;Go is a bit strange, with some tradeoffs I understand but others, not so much.&lt;/p&gt;

&lt;p&gt;On the plus side:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple, limited syntax&lt;/li&gt;
&lt;li&gt;Easy to read and understand unfamiliar code&lt;/li&gt;
&lt;li&gt;Inline access to code from dependency packages&lt;/li&gt;
&lt;li&gt;Goroutines and channels are good tools for concurrency&lt;/li&gt;
&lt;li&gt;Very fast compilation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What's lacking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No generics, and not even overloads&lt;/li&gt;
&lt;li&gt;Simplistic visibility (public/private) model&lt;/li&gt;
&lt;li&gt;Unusable plugin model, no IoC/DI&lt;/li&gt;
&lt;li&gt;Debugging tooling is weak&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, Go is a reasonable tool for the job. I think C# would be more powerful as a language. What’s less clear to me is how much we would lose not having access to Go modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Development
&lt;/h2&gt;

&lt;p&gt;I joined a couple of weeks before Temporal v1.0 was released. It was a huge milestone after almost a year of development since the co-founders left Uber: we switched to gRPC, added support for mTLS, and made a myriad of other changes.&lt;/p&gt;

&lt;p&gt;As I ramped up, I primarily contributed to the Temporal server. My major focus was developing Temporal's security features: authentication and authorization. The authorization model was my first design effort. Through the process of formulating the initial proposal, having internal discussions, implementing, and releasing the code, I got to know the team better. I also learned a number of "how can this be done in Go" things and established relationships with some key customers.&lt;/p&gt;

&lt;p&gt;The security work naturally morphed into contributing to the nascent Temporal Cloud effort. As Ryland mentioned in his latest &lt;a href="https://docs.temporal.io/blog/temporal-transparency-10" rel="noopener noreferrer"&gt;Transparency Report&lt;/a&gt;, by late December we were able to start onboarding a small number of select paying design partners to our cloud service. It's crazy that we have customers paying to use our Cloud long before we GA it. This speaks to their level of confidence in our product. The best part is that we even have a waitlist to join the program.&lt;/p&gt;

&lt;p&gt;Currently, development of Temporal focuses on three major areas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open Source Temporal Server&lt;/li&gt;
&lt;li&gt;Open Source Client Runtimes that we, unjustly IMO, call SDKs&lt;/li&gt;
&lt;li&gt;Temporal Cloud&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nowadays, I spend most of my time on the Cloud and occasionally the Server. The Client piece is what I’m the least familiar with for now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Team
&lt;/h2&gt;

&lt;p&gt;The company is still fairly small. The setup is very simple and requires little overhead. Everyone is focused on developing the product, whether that means adding new functionality or improving what we already have. I can sense strong motivation beaming through the Zoom screen.&lt;/p&gt;

&lt;p&gt;The current company size allows for a very flat structure. On the engineering side, we don't have a single manager yet. Everybody except our co-founders (the CEO and CTO) is simply an engineer. Decisions are made within the three dev teams when possible, and engineers happily work with other teams when necessary. Company-wide decisions are primarily product-level and about improving our engineering processes. Task tracking and planning are lightweight and done using &lt;a href="https://www.notion.so/" rel="noopener noreferrer"&gt;Notion&lt;/a&gt;, an interesting product that deserves its own deep dive.&lt;/p&gt;

&lt;p&gt;I found the switch to an IC role refreshing. I’ve always liked the balance of Microsoft's dev lead role, which combines technical work with managing a small team. But it forces you, for good reason, to prioritize your IC contributions behind the team's management priorities. I'm enjoying the plunge into building stuff, learning new languages and tools, and figuring things out by collaborating with colleagues and partners.&lt;/p&gt;

&lt;p&gt;Remoteness is a nuisance, of course. It’s definitely a challenge to connect and stay connected with colleagues that I don't work closely with. Fortunately, we had an in-person celebration for the v1 release back in October. That was the only chance I’ve had to meet a bunch of co-workers and their families. Aside from that, all communication is done through Zoom and Discord.&lt;/p&gt;

&lt;p&gt;It's interesting that the team uses a suite of communication tools that broadly overlap in functionality. Zoom is for scheduled and sometimes ad-hoc meetings; it has the richest feature set and the best screen sharing performance out of the box. Discord is used for spontaneous voice conversations, with the adjacent goal that such interactions can be overheard by other people who hang out in the same channel. Discord is also used for quick screen sharing, because it is more natural to supplement an ongoing voice conversation with a video stream.&lt;/p&gt;

&lt;p&gt;Slack (not surprisingly) is the main method of written communication, within the company and with partners. Discord has chat features as well, but Slack is much stronger there and has more business features and integrations. I was surprised to learn that Slack also has screen sharing, but I only saw it used once, as an experiment. What's important is that the team is open to trying new ideas. That's how Discord and Notion got added to the toolset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Culture
&lt;/h2&gt;

&lt;p&gt;Temporal defines its culture along three no-nonsense pillars:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers, developers, developers&lt;/li&gt;
&lt;li&gt;Reliable like running water&lt;/li&gt;
&lt;li&gt;Seek the truth&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Developers, developers, developers
&lt;/h3&gt;

&lt;p&gt;Our customers are engineers like us. If they are happy with our product, they will find ways to apply it and build great systems with it. Some of them will extend it, give us valuable feedback, and even contribute back. In the end, they will be the ones who have to convince management that they need Temporal. On the flip side, if developers don't like our tech, no manager will successfully force them to use it.&lt;/p&gt;

&lt;p&gt;I'm very used to a world of developers as customers, considering it’s what I've been doing for the last ~15 years. I recently heard a story about a team of engineers at some company who tried to convince their management to use Temporal. When a higher-up opted for the low-code solution instead, the engineer who led the convincing effort simply quit and started looking for a new job where he could apply Temporal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reliable like running water
&lt;/h3&gt;

&lt;p&gt;This one was borrowed from Uber. The key value proposition of Temporal is that it makes it possible for application code to execute reliably without the usual complexity. Hence, reliability is our product. If Temporal is not reliable, if it behaves non-deterministically or loses workflows, there is no point in using it. Needless to say, the bar is set pretty high.&lt;/p&gt;

&lt;p&gt;My background in building services for game developers is helpful here because gamers are probably the most demanding end users. However, no matter how little patience gamers have, those are still games in make-believe worlds. On the other hand, Temporal powers many workloads with very real monetary value and the highest consistency and availability expectations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Seek the truth
&lt;/h3&gt;

&lt;p&gt;This is common sense for people with engineering experience: it's not necessarily the most vocal, articulate, or persuasive person in a technical discussion who is right. It can be somebody quiet or early in their career. Of course, experience does help us recognize typical mistakes, design flaws, and corner cases. At the same time, mistakes that an experienced engineer overlooks tend to be of higher impact.&lt;/p&gt;

&lt;p&gt;Intuition, gut feelings, and instincts are all useful. That's our internal "AI" at work. But in the end, we have to operate on facts. Intuition should only help us find facts faster. It is also important to actively manage your ego and to depersonalize opinions, ideas, facts, designs, code, etc. That way we can suppress our human desire to win an argument or to feel proud with that useless "I told you so!" remark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Product
&lt;/h2&gt;

&lt;p&gt;I gained a much better understanding of Temporal, its capabilities, and its architecture. Obviously, being neck-deep in the code while adding and improving features helped. So did design discussions, reviewing PRs, watching what other people work on, and chatting with them.&lt;/p&gt;

&lt;p&gt;In addition, we hold bi-weekly brain dump sessions where our co-founders interactively whiteboard to explain how certain core features work and why. Being able to ask questions like "why this way and not that" and "is this essentially X" helped me build a conceptual mental model of those features. It also helped me separate core functionality from optimizations in the "whys". These sessions are recorded, and some are even transcribed for our future employees.&lt;/p&gt;

&lt;p&gt;The product does appear to define a new software category. I'm going out on a limb here, because somebody will always come up with a "but how is it different from X?" I think Temporal strikes a very delicate balance between, on the one hand, working "like magic" (that's literally what some customers say) and, on the other, being very down-to-earth and practical, with pragmatic tradeoffs that experienced systems engineers understand and are generally comfortable with.&lt;/p&gt;

&lt;p&gt;The Inversion of Execution model (which deserves its own post) seems to give developers confidence that they are in control of how their code runs. They can easily debug and fix their application code while Temporal ensures reliable execution of workflows and tasks, with a clear boundary between them.&lt;/p&gt;

&lt;p&gt;The other day, a candidate I was interviewing shared an interesting analogy: Temporal has a chance of becoming TCP for developers. How so? By providing an illusion of smooth execution, automatically handling a range of failures and retrying behind the scenes, just as TCP retransmits lost packets without exposing that to the higher levels of the stack.&lt;/p&gt;

&lt;p&gt;There’s one typical pattern we see. One team of developers in a company will start building something with Temporal, and then in a few months we hear that a dozen other teams have already adopted it. It spreads like a virus despite social distancing. I hear that about 20 of the Fortune 100 companies are already using Temporal. From behemoths to tiny startups, we see people building software systems across many different domains: infrastructure provisioning, financial transactions, traditional business workflows, and so on.&lt;/p&gt;

&lt;p&gt;We feel a special kind of pride when some of the most sophisticated technology companies choose to use Temporal. I still can't name all of them! But several went public about their dependency on Temporal: Coinbase, Hashicorp, Box, Checkr, Netflix, Snap, Datadog and Stripe. Stripe and Datadog even have job postings specifically calling out working with Temporal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reflections
&lt;/h2&gt;

&lt;p&gt;Looking back, I see that the decision to leave Microsoft was much harder for me to make than the actual transition into a new life. Big tech companies like Microsoft definitely have an advantage over smaller ones in that they provide opportunities for employees to change jobs, sometimes dramatically, across the wide variety of businesses they run. In my tenure at Microsoft I moved from Servers to Embedded to Bing to Research to Gaming, in several cases continuing to work with many of the same colleagues. This is definitely a benefit, but it’s also a "trap": a barrier that makes leaving the walled garden harder. One day I should tell the story of my public ... let's say disagreement with Microsoft's head of HR at the time about the policy of internal transfers. 🙂&lt;/p&gt;

&lt;p&gt;From my previous job changes I learned that what I'd be doing at a new place is almost always different from what I had in mind going into it. Not better or worse, just different. This transition was another example of that rule. Before I joined, we discussed that I would likely focus on the programming model of Temporal, leveraging my experience of building frameworks and tools for developers. In reality, I am working on the server and cloud side, leveraging my other experiences in building systems and services. No surprise for me there any more.&lt;/p&gt;

&lt;p&gt;It was interesting to move from the Redmond bubble to the Startup bubble. In the Redmond bubble, there is still a lot less trust in using open source software and a bias toward in-house stuff. There are obviously good and bad reasons for that.&lt;/p&gt;

&lt;p&gt;I knew beforehand it was the case, but I was still shocked to discover that Microsoft products are pretty much non-existent within the Startup bubble. Strangely, I rarely hear Azure mentioned, much less often than GCP, and when it is, it's more as a necessary future integration tax to pay. Recently, I found myself arguing with coworkers who claimed that MacOS is more popular among developers than Windows!&lt;/p&gt;

&lt;p&gt;I have been reminiscing about what I miss from my previous life. I definitely miss some of the colleagues I enjoyed working with. I miss the great Orleans community that has formed and grown over the last 6-7 years. I love how that community started a number of great initiatives, such as the Gitter chat room, the community-driven OrleansContrib GitHub org, and the virtual meetups. As the core team, we only supported these initiatives and contributed to them.&lt;/p&gt;

&lt;p&gt;The irony here is that I am not cut off from the community. That's the beauty of open source. I just have so little time at the moment with all the things I'm busy with. I try to stay in touch with what's going on in the Orleans world and hope I'll continue to help one way or another.&lt;/p&gt;

&lt;p&gt;I miss the availability of Microsoft Research. I wish I could still walk the hallways of Building 99 to ask Phil Bernstein or Sebastian Burckhardt for advice on how best to ensure the atomicity and consistency guarantees that we need in Temporal. MSR is a unique institution that can provide state-of-the-art help on nearly any Computer Science topic. You just need to know how to work with it.&lt;/p&gt;

&lt;p&gt;I do not miss the politics and reorgs (we had three just during COVID before I left). Neither do I miss technical and product decisions being influenced by or optimized for somebody’s career goals. These things are facts of life in any sufficiently large corporation. When Temporal gets big, it’ll inevitably face the same challenges. But at this point, my “AI” trained by years of corporate life is not detecting such behaviors. Everyone is focused on making the product better. Nobody is concerned about “how this is going to help my annual review.”&lt;/p&gt;

&lt;p&gt;I realized I mostly miss people, individuals. At the same time, I’m getting exposed to a great number of people from the new bubble. The Sequoia/Amplify/Madrona “mafia” can connect you to seemingly any top expert in the field because they are so well connected and broadly invested in the industry. Like with MSR, one just needs to know how and when to tap this resource.&lt;/p&gt;

&lt;p&gt;I’m trying to make this as honest an assessment as I can, but it seems to sound very upbeat. That must be a reflection of how I feel now. I don’t know if it’s some kind of an extended honeymoon period. Time will tell. I think that regardless, this was a good and necessary change for me, no matter what happens next. I believe that in our profession it is important to periodically shake yourself up, circumstances permitting. In the big scheme of things, we have the luxury of being able to do that when most other people can’t. Why not use it?&lt;/p&gt;

&lt;p&gt;“Steh auf!”, as the famous German modern artist sings.&lt;/p&gt;

</description>
      <category>career</category>
      <category>programming</category>
      <category>learning</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Dealing with Failure</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Thu, 21 Jan 2021 17:29:16 +0000</pubDate>
      <link>https://forem.com/temporalio/dealing-with-failure-5adf</link>
      <guid>https://forem.com/temporalio/dealing-with-failure-5adf</guid>
      <description>&lt;p&gt;I recently gave a talk at the CodeMesh conference, and I spent half of it reflecting on the seemingly boring topic of dealing with failures. The talk was primarily based on my experience building and helping others build cloud services with the Orleans framework. I chose this topic, because I believe dealing with failures is the most important aspect of any system. Oftentimes, it is what stands between a product that runs as expected and one that keeps producing surprises and causing investigations. When done right, handling of failures is what differentiates a professional from an amateur.&lt;/p&gt;

&lt;p&gt;The talk covered three approaches that I've seen and applied the most myself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request-Reply (a.k.a RPC)&lt;/li&gt;
&lt;li&gt;Using persistent queues&lt;/li&gt;
&lt;li&gt;Workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. Request-Reply
&lt;/h2&gt;

&lt;p&gt;In my opinion, Request-Reply (&lt;a href="https://en.wikipedia.org/wiki/Remote_procedure_call" rel="noopener noreferrer"&gt;a.k.a. RPC -- Remote Procedure Call&lt;/a&gt;), is the most natural way of handling failures. The client makes a request to the server and waits for a response (up to a timeout) and in most cases learns about a request processing failure immediately. This is how HTTP works, for example.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note that by client and server I mean simply two sides of the call. They can be real client and server processes or merely two tiny objects communicating with each other within a distributed system.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The simplicity of RPC is good for the server.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"I try to do what the request asked me to. If there's any failure downstream, I return it to the client. The client knows best what to do, to retry or not, how many times, with a backoff or not. My logic can stay simple."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In our world of overly complicated systems, the value of simplicity is difficult to overstate. However, in this case the complexity burden gets pushed to the client. This puts the remote client at a disadvantage: it has to operate based on the limited error information it receives back. Sometimes it’s just a communication error or a timeout. These are a few of the many possible real-life cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An error may not be clear about whether the requested operation actually failed. It might have succeeded, and the error happened while trying to communicate success. This forces the client to either check the status of the operation or retry anyway, assuming the operation can be retried safely, i.e. it is idempotent.&lt;/li&gt;
&lt;li&gt;The system may be temporarily unavailable, actually being down or network partitioned from the client. For mobile applications that's rather expected.&lt;/li&gt;
&lt;li&gt;Partial failures are hard to deal with. When we need to update multiple external systems at once, there is almost never a way to do that in an all-or-nothing manner, i.e. atomically. So, we have to handle retries and rollbacks, side effects, and all the inevitable complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last point I illustrated with the following picture:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmxk7mkf9oxpulm275s53.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmxk7mkf9oxpulm275s53.png" alt="Request-Reply"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, the client (square blue thing) makes a request to the server (round green thing). The server does not have the information locally to satisfy the client request and therefore needs to call two external services, blue and purple.&lt;/p&gt;

&lt;p&gt;If either of those two sub-calls fails, the server returns an error to the client. If the client were to retry the request, there would need to be a mechanism in the server that prevents duplicate calls to the external services. Idempotency is one method of addressing this issue. If the client decides to give up, there needs to be a way to revert any changes made as part of processing the request before the failure (in our example, the call to service A).&lt;/p&gt;
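&lt;p&gt;A minimal sketch of the idempotency idea in plain Python (the &lt;code&gt;transfer&lt;/code&gt; handler and the in-memory store are hypothetical; a real service would use a durable, shared store): the client generates one request ID and reuses it across all retries, and the server uses that ID to deduplicate.&lt;/p&gt;

```python
import time
import uuid

# Hypothetical in-memory store of already-processed request IDs; a real
# service would keep this in durable storage shared across instances.
processed = {}

def transfer(request_id, amount):
    """Server handler, made idempotent by checking the request ID first."""
    if request_id in processed:
        return processed[request_id]  # duplicate retry: no second side effect
    result = {"status": "ok", "amount": amount}  # the real work goes here
    processed[request_id] = result
    return result

def call_with_retries(amount, attempts=3, backoff=0.1):
    """Client side: a single idempotency key is reused across all retries."""
    request_id = str(uuid.uuid4())
    for attempt in range(attempts):
        try:
            return transfer(request_id, amount)
        except ConnectionError:
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError("request failed after all retries")
```

&lt;p&gt;Because the key is chosen by the client before the first attempt, a retry after an ambiguous failure is safe: the server either performs the work once or returns the previously recorded result.&lt;/p&gt;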

&lt;p&gt;A canonical example is a money transfer from an account in one bank to an account in a different bank. However, there are many other scenarios with conceptually identical requirements. In many cloud orchestration cases we need to allocate a resource (such as a virtual machine) and then perform a number of operations with it before returning it in a ready state to the client. If any operation fails, we don’t want to leave the VM running. Nor do we want to keep allocating new VMs for the same request.&lt;/p&gt;

&lt;p&gt;To summarize the pros and cons of the RPC approach:&lt;/p&gt;

&lt;p&gt;Pros:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simplicity&lt;/li&gt;
&lt;li&gt;Obvious correlation between a request and a failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retries are the client's responsibility and are difficult for a remote client to do&lt;/li&gt;
&lt;li&gt;Partial failures are difficult to handle&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Persistent Queues
&lt;/h2&gt;

&lt;p&gt;Putting a persistent queue between the client and server solves a number of problems. The client just needs to successfully send a request to the queue to ensure that it will eventually be processed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fae7cul2mirgweanbwimr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fae7cul2mirgweanbwimr.png" alt="Queue"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Assuming the server only deletes a request from the queue after it is successfully processed, we get a simple retry mechanism. Thanks to the queue, even if the server crashes and restarts between attempts, it will keep trying to process the request again and again. The fact that the client (producer) is completely decoupled from the server (consumer) means the client can enqueue requests even if the server is down. This is the main reason why the publisher-subscriber architecture is so popular. Separation of subsystems in space and time is a nice property.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flchmzlozke988w7zm0oi.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flchmzlozke988w7zm0oi.jpeg" alt="Chang'e-5"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;em&gt;A simulated illustration of Chang'e-5 probe's orbiter-returner's separation from the ascender on the moon orbit, December 6, 2020. /CNSA&lt;/em&gt;
&lt;/h6&gt;

&lt;p&gt;For streaming one-way events, queues are great. But how can the client get a response in a queue-based architecture? There's no good answer to this question that I'm aware of. Responses need to be delivered (somehow) back to the client, usually over another queue. Then the client needs a way to correlate requests and responses, typically done via correlation IDs. There also need to be timeout mechanisms for dealing with requests that never receive a response.&lt;/p&gt;

&lt;p&gt;Retries are simpler with queues compared to the RPC case. They are pretty much automatic, as long as the request stays in the queue. Calls to external services still need to be idempotent. However, we can't retry forever and have to deal with requests that keep failing to process, either because they clog the queue (if the queue is ordered), consume too many resources, or cause excessive load on the external services. The popular approach is to treat such requests as "poison messages" by moving them out of the queue to a different location (a "dead letter" queue) for special handling.&lt;/p&gt;
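&lt;p&gt;The retry-then-dead-letter loop can be sketched in a few lines of plain Python (an in-memory &lt;code&gt;deque&lt;/code&gt; stands in for a real durable broker; the names and the &lt;code&gt;MAX_ATTEMPTS&lt;/code&gt; policy are illustrative, not any particular product's API):&lt;/p&gt;

```python
from collections import deque

MAX_ATTEMPTS = 3
queue = deque()     # each entry is a (message, attempts) pair
dead_letters = []   # destination for "poison messages" needing special handling

def consume(process):
    """Drain the queue, retrying each failed message up to MAX_ATTEMPTS times."""
    while queue:
        message, attempts = queue.popleft()
        try:
            process(message)  # the message is deleted only after success
        except Exception:
            if attempts + 1 == MAX_ATTEMPTS:
                dead_letters.append(message)  # give up: move it aside
            else:
                queue.append((message, attempts + 1))  # re-enqueue to retry
```

&lt;p&gt;Tracking an attempt count per message is what keeps a persistently failing request from clogging the queue forever.&lt;/p&gt;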

&lt;p&gt;Pros:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separation of systems in space and time&lt;/li&gt;
&lt;li&gt;Automatic retries&lt;/li&gt;
&lt;li&gt;Simple when no responses are expected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Additional dependency on the queueing technology&lt;/li&gt;
&lt;li&gt;Extra work to correlate responses&lt;/li&gt;
&lt;li&gt;Queues may clog&lt;/li&gt;
&lt;li&gt;Special handling of "poison messages"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Workflows
&lt;/h2&gt;

&lt;p&gt;Similar to queues, workflows take the burden of ensuring successful execution of requests off the client's shoulders. But instead of being written into a shared queue, requests are persisted as part of an independent workflow document. That document makes request processing stateful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracking which steps of processing succeeded&lt;/li&gt;
&lt;li&gt;Tracking which steps of processing failed&lt;/li&gt;
&lt;li&gt;Remembering how many retries have been made, etc&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F73bcyv7lmjumsge0zwxb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F73bcyv7lmjumsge0zwxb.png" alt="Workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Workflows have other important properties and use cases. They are a great way to implement long-running business processes, incorporate human operations and react to events. From the failure handling perspective, the most important aspect of workflows is the ability to be more intelligent when handling partial failures. Instead of being oblivious about what happened in the past, a workflow can keep a log of all relevant information and make informed decisions about what to retry and when.&lt;/p&gt;
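&lt;p&gt;As a toy illustration (plain Python, not any real workflow engine's API), "keeping a log of all relevant information" can be as simple as recording completed step names, so a re-run after a failure skips the work that already succeeded instead of redoing it blindly:&lt;/p&gt;

```python
# Hypothetical sketch: a workflow replays an ordered step list against a log
# of completed step names. In a real system the log would be persisted, so
# state survives a crash and the retry resumes where it left off.
class Workflow:
    def __init__(self, steps):
        self.steps = steps      # ordered list of (name, action) pairs
        self.completed = []     # durable log in a real implementation

    def run(self):
        for name, action in self.steps:
            if name in self.completed:
                continue        # this step finished before the last failure
            action()            # may raise; completed steps stay recorded
            self.completed.append(name)
```

&lt;p&gt;This is what enables the informed decisions mentioned above: on a retry, the workflow knows exactly which steps to skip, which to redo, and which to compensate.&lt;/p&gt;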

&lt;p&gt;Workflows can be individually addressable, which makes them easier to scale compared with shared queues. It also allows for targeted inspection and even on-the-fly modification of their state, if needed.&lt;/p&gt;

&lt;p&gt;At the same time, workflows "inherit" most of the challenges of queues. Responses still need to be correlated with requests, although the individual addressability of workflows makes it easier for the client to query results. "Poison messages" are also still possible. They no longer clog a queue, but they still require special handling.&lt;/p&gt;

&lt;p&gt;Pros:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Partial separation of systems in time and space&lt;/li&gt;
&lt;li&gt;Robust handling of partial failures&lt;/li&gt;
&lt;li&gt;Support for long-running operations&lt;/li&gt;
&lt;li&gt;Retries are "automatic"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Additional dependency on a workflow system or complexity of in-house implementation&lt;/li&gt;
&lt;li&gt;Extra work to correlate responses&lt;/li&gt;
&lt;li&gt;Special handling of "poison messages"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It's a cliché that in our business there's no free lunch, only tradeoffs. As unoriginal as they might sound, many clichés are true. Dealing with failures is an area of important tradeoffs. There's obviously no single pattern that fits all scenarios. In fact, many systems leverage all three patterns I described.&lt;/p&gt;

&lt;p&gt;For simpler requests that need a prompt response and don’t involve complex multi-step processing, Request-Reply is often the right approach. One-way messages, events, and data streams are clear candidates for Queues. Workflows are a good fit for reliable execution of relatively complex requests that either require multi-step processing or can leak resources if failures aren't properly handled.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>microservices</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Why I joined Temporal</title>
      <dc:creator>Sergey Bykov</dc:creator>
      <pubDate>Thu, 15 Oct 2020 17:11:41 +0000</pubDate>
      <link>https://forem.com/temporalio/why-i-joined-temporal-19dg</link>
      <guid>https://forem.com/temporalio/why-i-joined-temporal-19dg</guid>
      <description>&lt;p&gt;I left Microsoft last month after a long and successful career. It's an understatement to say that most of my colleagues were surprised. What could possibly motivate me to leave the environment where I learned how the system works, built many relationships and knew how to make things happen? Not to mention the sizable amount of unvested stock.&lt;/p&gt;

&lt;p&gt;To answer this question, we have to travel back more than a decade. The year was 2009, and I had recently moved to Microsoft Research from the Online Services Division. I joined a newly established lab with a charter to explore the future of Cloud Computing (still relatively novel at the time). More than joining the lab, I joined the project Orleans - a project that had a name, a 60,000-ft vision, and not a single line of written code. The vision was bold - to reimagine how cloud-scale applications should be coded. At the time, highly available, high-performance, scalable systems were only achievable by experts, and Orleans wanted to make them possible for every developer. The question was how to reduce the complexity inherent in this space. We believed the answer was a new programming model that would change how developers conceptualize applications, how they structure code, and ultimately how they think about the problems they face.&lt;/p&gt;

&lt;p&gt;As we were churning through early prototypes of Orleans, one crazy idea after another, I met a guy who was sitting around the corner on the same floor - Maxim Fateev. Maxim was on a different team working on some unrelated projects, but he also had some ideas of his own that he was prototyping. As developers do, we immediately started discussing our views on “what should be done and how”, the pros and cons of the different approaches, the usual stuff. It became clear from those debates that I was interested in the domain of interactive workloads - with RPC-style request-response, low latency, high throughput, client-side retries when deemed necessary, and eventual consistency in case of node failures. That was all the rage at the time, kicked off by the Dynamo paper from Amazon. Maxim was interested in building a reliable workflow system that persisted variables and virtual call stacks, so that application code could "resume" its execution after a delay or failure, almost as if it had stayed on that line of code the whole time. At the time, it sounded too slow to me, considering the emerging workloads of interactive entertainment, the Internet of Things, etc. I was younger, less experienced, and more opinionated then.&lt;/p&gt;

&lt;p&gt;Maxim left Microsoft soon after, continuing to pursue his passion to build a workflow-based programming model. In the meantime, we shaped Orleans into a promising prototype, and eventually put it into production with the Halo team. This instantly boosted the credibility of our project and led to a number of collaborations/co-designs with internal and external partners.&lt;/p&gt;

&lt;p&gt;Sometime in 2013 or 2014, we organized an Orleans hackathon for a select group of big partner companies. They coded while isolated from each other in separate rooms. To my surprise, out of seven companies, three ended up building workflows as a major part of their application. A pattern started to emerge that I only recognized much later. Orleans users loved the simplicity and power of the “get an eternal object and invoke its method” model. At the same time, in some scenarios they needed a workflow solution for longer-running operations.&lt;/p&gt;

&lt;p&gt;In Orleans, we had a suggested solution for building workflows - use reminders (persistent timers) to periodically activate and invoke workflow objects. That way, they could check whether it was time to perform an action or to transition to a new state. This minimalistic approach generally worked, although it left most of the complexity to the application developer. The community also built a number of workflow-like solutions, such as Orleans.Activities and Orleans.Sagas, to fill the gap.&lt;/p&gt;

&lt;p&gt;As we moved out of Research and into the product group, I got even more intimately involved in building and operating backend services for games, but the shadow of workflows continued following me. I kept encountering more and more workflow use cases. Either an operation inherently takes longer than an RPC call can afford to wait (for example, starting a virtual machine), or an operation succeeds promptly most of the time but fails regularly enough that it could time out and require a follow-up. In such cases, an application typically needs a delayed action that checks whether the operation succeeded, retries it if necessary, or explicitly gives up on the operation and cleans up any associated resources.&lt;/p&gt;

&lt;p&gt;It is possible to implement such actions with timers and reminders; doing so would involve roughly the following steps.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Register a reminder for the requested operation, so that it can recover from failures.&lt;/li&gt;
&lt;li&gt;Execute an external API call to initiate a long running operation, such as allocation of a cloud resource.&lt;/li&gt;
&lt;li&gt;Schedule a timer to periodically check if the operation succeeded until we reach a timeout we allotted to it.&lt;/li&gt;
&lt;li&gt;If the operation completed successfully or failed, record or report the result, unregister the timer and the reminder, and clean up any intermediate state we recorded.&lt;/li&gt;
&lt;li&gt;If the operation hasn't completed or failed within the timeout window, try to cancel it, record the fact that we've given up, and perform the same cleanup as in step 4.&lt;/li&gt;
&lt;li&gt;Follow up on all cancellations to make sure we didn't leak any resources. This requires another piece of logic and a timer/reminder combination to guard its execution from potential failures.&lt;/li&gt;
&lt;/ol&gt;
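
&lt;p&gt;The steps above might be sketched like this. This is a hypothetical sketch, not real Orleans code: &lt;code&gt;start_op&lt;/code&gt;, &lt;code&gt;check_op&lt;/code&gt;, &lt;code&gt;cancel_op&lt;/code&gt;, and &lt;code&gt;on_result&lt;/code&gt; are invented stand-ins for the external service and the recording logic, and reminder registration is reduced to a plain list:&lt;/p&gt;

```python
import time

# Hypothetical sketch of the six-step timer/reminder scaffolding.
# The callables passed in stand in for the external service and the
# result-recording logic; the real Orleans reminder/timer APIs are
# out of scope, so "registration" is reduced to a list.

def manage_operation(start_op, check_op, cancel_op, on_result,
                     poll_interval=0.01, timeout=1.0):
    reminders = ["recovery"]                 # 1. register a reminder
    handle = start_op()                      # 2. initiate the operation
    for _ in range(int(timeout / poll_interval)):
        status = check_op(handle)            # 3. poll on a timer
        if status in ("succeeded", "failed"):
            on_result(status)                # 4. record the outcome,
            reminders.clear()                #    unregister, clean up
            return status
        time.sleep(poll_interval)
    cancel_op(handle)                        # 5. timed out: cancel,
    on_result("cancelled")                   #    record, clean up
    reminders.clear()
    # 6. a real system would register one more reminder here to
    #    verify the cancellation didn't leak the resource
    return "cancelled"
```

&lt;p&gt;Even in this toy form, the single call that matters - &lt;code&gt;start_op()&lt;/code&gt; - is buried under the machinery around it, which is exactly the problem described below.&lt;/p&gt;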

&lt;p&gt;Steps may vary depending on the types of resources being managed and how likely they are to fail and leak. But the main problem still looms - to reliably execute the single line that expresses what we actually needed to do, we were forced to surround it with a lot of scaffolding code. Such complexity slows down developers at all stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initial development&lt;/li&gt;
&lt;li&gt;Failure testing and stabilization of the codebase&lt;/li&gt;
&lt;li&gt;Investigation of production incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We added a long list of features to Orleans over the years that helped developers build robust scalable systems. We even dared to integrate distributed ACID transactions, with help from our great partners in Research. But when somebody would inevitably ask, “how do I implement a long-running operation”, we would describe the same 6-step process with timers and reminders.&lt;/p&gt;

&lt;p&gt;It eventually dawned on me that RPC and workflows are sort of the yin and yang of interactive applications, and that the debates from 10 years ago about RPC vs. workflows should have been about merging the two, not choosing one over the other. Our team met with people at Microsoft Research about a year ago to discuss where they could help. Workflows were our number one ask. Our rationale was that an abstraction for workflows would greatly complement the RPC style that Orleans already excelled at. I'm pretty sure a number of people will think, "Doh, that's obvious. Why did it take you so long to realize it?" I'm okay with this embarrassment, as long as it serves as a useful lesson for others.&lt;/p&gt;

&lt;p&gt;In the meantime, I heard through some mutual contacts that Maxim was now at Uber and was leading some ambitious project. Later I learned that it was called Cadence, yet another reliable workflow system that &lt;a href="https://twitter.com/mfateev" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt; and &lt;a href="https://twitter.com/samarabbas77" rel="noopener noreferrer"&gt;Samar&lt;/a&gt; were working on, and that it was open source. This put Cadence on my radar, but in that "when I have time to take a closer look" bucket. Somebody told me a few months ago that Maxim and Samar had started &lt;a href="https://temporal.io/" rel="noopener noreferrer"&gt;Temporal Technologies&lt;/a&gt; last fall. I never considered startups to be my thing, but as I began talking to them, my attitude rapidly started to change. As I learned more about the programming model of Temporal and descended down the &lt;a href="https://github.com/temporalio/temporal" rel="noopener noreferrer"&gt;rabbit hole of its implementation&lt;/a&gt;, I started seeing “workflows, workflows everywhere”, big and small, quick and long running, just like in that popular meme.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdeploy-preview-129--mystifying-fermi-1bc096.netlify.app%2Fassets%2Fimages%2Fworkflow-meme-41792e7bf9383175be0a60b3a9cba767.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdeploy-preview-129--mystifying-fermi-1bc096.netlify.app%2Fassets%2Fimages%2Fworkflow-meme-41792e7bf9383175be0a60b3a9cba767.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we could turn the six steps in the above list back into a few lines of trivial code, we would free developers from the burden of scaffolding and enable them to focus on interesting problems. Workflows are somewhat like grains in Orleans - units of distribution, execution, and fault isolation. They are "just" executed differently - by application-supplied workers. I call this model Inversion of Execution; I'm not sure if the name will stick. There are, of course, other major architectural differences, but the high-level goals are the same - qualitatively reduce the complexity of application code; make applications resilient (invincible) to inevitable failures; empower developers to solve more interesting problems than those we can take care of.&lt;/p&gt;
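
&lt;p&gt;To make the contrast concrete, here is a hypothetical sketch of what those six steps shrink to once a workflow runtime owns retries, timers, and recovery. The &lt;code&gt;execute&lt;/code&gt; helper is an invented stand-in for a durable-execution primitive - it is not Temporal's actual API, and this stub only retries in-process rather than surviving crashes:&lt;/p&gt;

```python
# Hypothetical sketch: what the six-step scaffolding collapses to once
# a workflow runtime owns retries, timers, and crash recovery.
# `execute` is an invented stand-in for a framework-provided durable
# call; a real runtime would persist progress between attempts.

def execute(activity, *args, retries=3):
    last_error = None
    for _ in range(retries):
        try:
            return activity(*args)
        except Exception as err:
            last_error = err
    raise last_error

def provision_vm_workflow(allocate, release):
    # Just the lines we actually cared about: allocate the resource,
    # and compensate (also under retries) if allocation ultimately fails.
    try:
        return execute(allocate)
    except Exception:
        execute(release)
        raise
```

&lt;p&gt;The business logic is back to a handful of lines, while the reminders, timers, and cleanup bookkeeping move into the runtime.&lt;/p&gt;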

&lt;p&gt;Besides the workflows everywhere, I also saw the two co-founders, roughly my age and level of experience, who shared the same passion I've had for many years - both with track records of delivering innovative solutions by following that passion. The startup environment provides nearly complete freedom to pursue what we agree is right. There's no management hierarchy above to approve a decision or allocate a budget. You just need to deliver the product and delight your customers. I kept thinking that if I were ever to join a startup, Temporal had to be it.&lt;/p&gt;

&lt;p&gt;It is very difficult to get up and leave a comfortable, well-paid place where I've learned how to be successful, built many professional relationships and connections, and can navigate with my eyes closed. Yet I felt that if I played it safe and didn't jump into the unknown now, I might regret it for the rest of my life. So I decided to take the plunge, break the proverbial golden handcuffs, and go all in on Temporal.&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
