<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Criteo Tech Community</title>
    <description>The latest articles on Forem by Criteo Tech Community (@criteo_tech_community).</description>
    <link>https://forem.com/criteo_tech_community</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3522319%2F0f038b9e-c47d-4766-8407-a9701132243c.png</url>
      <title>Forem: Criteo Tech Community</title>
      <link>https://forem.com/criteo_tech_community</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/criteo_tech_community"/>
    <language>en</language>
    <item>
      <title>Why And How We Organize Coding Dojos</title>
      <dc:creator>Criteo Tech Community</dc:creator>
      <pubDate>Wed, 17 Sep 2025 07:01:56 +0000</pubDate>
      <link>https://forem.com/criteo_tech_community/why-and-how-we-organize-coding-dojos-3ile</link>
      <guid>https://forem.com/criteo_tech_community/why-and-how-we-organize-coding-dojos-3ile</guid>
      <description>&lt;h4&gt;
  
  
  Author: Guillaume Turri-Werth
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AB2V6eSTbusozQ8Xl" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AB2V6eSTbusozQ8Xl" alt="Two men looking at a monitor with code lines." width="1024" height="683"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Flipsnack on Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you are looking for a way to both have fun and improve your coding skills (or those of your team), I have some good news: &lt;strong&gt;we have open-sourced a repo that contains material to organize coding dojos! 🎉 👇&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/criteo/coding-dojo" rel="noopener noreferrer"&gt;GitHub - criteo/coding-dojo: Some content to organize some coding dojos&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sure, there is already a lot of such material available online, but ours is still original. I’ll explain why in a minute. Before that, I need to explain how we run coding dojos at Criteo, and share the good practices that help us, in our setup, organize them regularly and keep the momentum going for several years. But even before that, I probably should explain what coding dojos are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codingdojo.org/practices/WhatIsCodingDojo/" rel="noopener noreferrer"&gt;A lot has been said&lt;/a&gt; on &lt;a href="https://se-radio.net/2024/08/se-radio-629-emily-bache-on-katas-and-the-importance-of-practice/" rel="noopener noreferrer"&gt;this topic already&lt;/a&gt;, but I should at least ensure that people who are not yet familiar with this topic get up to speed. If you are already knowledgeable about it, you can skip the first part.&lt;/p&gt;

&lt;h3&gt;
  
  
  What coding dojos are
&lt;/h3&gt;

&lt;p&gt;Coding dojos are basically a practice to improve developer skills.&lt;/p&gt;

&lt;p&gt;In its most basic form it is pretty straightforward: you decide on a simple topic (for instance: implementing a simplified &lt;a href="https://en.wikipedia.org/wiki/Brainfuck" rel="noopener noreferrer"&gt;brainfuck&lt;/a&gt; interpreter, a &lt;a href="https://en.wikipedia.org/wiki/Mastermind_(board_game)" rel="noopener noreferrer"&gt;Mastermind&lt;/a&gt; solver, a &lt;a href="https://codingdojo.org/kata/RomanCalculator/" rel="noopener noreferrer"&gt;Roman calculator&lt;/a&gt;, …), you work on it for up to a couple of hours, and afterwards you throw away (or archive) what you did - because what matters is the journey, not the destination.&lt;/p&gt;

&lt;p&gt;That’s basically it. Then you can adapt the practical details to what works for you. You can, for instance, practice alone, in pairs, or in a group. You can do it at home, at work with teammates, or outside of work with peers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A_wVv2gULXR0e4C6C" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A_wVv2gULXR0e4C6C" alt="The sentence: “Practice makes perfect” on Scrabble letters" width="1024" height="768"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Brett Jordan on Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Coding dojos can serve different purposes (and are hence suitable regardless of the background of the participants):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Getting familiar with software engineering practices - &lt;a href="https://en.wikipedia.org/wiki/Test-driven_development" rel="noopener noreferrer"&gt;TDD&lt;/a&gt;, pair programming, refactoring, …&lt;/li&gt;
&lt;li&gt;Experimenting with unfamiliar tech stacks, such as doing a dojo in Scala as a C# engineer.&lt;/li&gt;
&lt;li&gt;Discovering good practices from your teammates, such as features in your IDE that you may not be aware of.&lt;/li&gt;
&lt;li&gt;Coaching juniors -who can be inspired by your code wizard skills.&lt;/li&gt;
&lt;li&gt;Developing relationships with people you don't currently work with directly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And above all: &lt;strong&gt;coding dojos are fun! 💻&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Our ritual
&lt;/h3&gt;

&lt;p&gt;To make it more concrete, and to explain how we managed to keep the momentum going for several years, I’ll give some practical details about how we proceed at Criteo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A2BxSfM590GbSOvIK" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A2BxSfM590GbSOvIK" alt="A notebook with a list of checkboxes" width="1024" height="681"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Glenn Carstens-Peters on Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Before the dojo
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;We have a public Slack channel that anyone can join, dedicated to the organization of coding dojos.&lt;/li&gt;
&lt;li&gt;We organize the dojos between 12:45 and 14:00, ensuring that many of us can attend - since we usually don’t have meetings at that time - and also because evening sessions don't work for parents. In practice:
&lt;ul&gt;
&lt;li&gt;We have lunch between 12:00 and 12:30~12:45.&lt;/li&gt;
&lt;li&gt;And then we code together until 14:00. (&lt;em&gt;Keep this timing in mind, it may have some importance later 😄&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The responsibility of the organization of the next dojo is split between two roles: one that we call the &lt;em&gt;Master of Ceremony&lt;/em&gt; and one that we call the &lt;em&gt;Master of Dojo&lt;/em&gt;. Those roles are assigned to volunteers. Ideally, they change every time.&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;Master of Ceremony&lt;/em&gt; oversees the logistics. In particular, they are responsible for booking a meeting room with a big screen, on which we can connect a laptop. At some point, that person was also in charge of ordering the lunch -it may sound like a detail, but it’s an important one, as we observed that having a budget for food really boosted participation!&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;Master of Dojo&lt;/em&gt; decides on which topic we will work and the stack we will use. They are also in charge of setting up a small project with no implementation and an empty test suite, but which at least builds (so we don’t waste several minutes with that boilerplate during the session).&lt;/li&gt;
&lt;li&gt;We set up checklists for both roles, making it less intimidating to take on those responsibilities. It is also a way to document details that ensure those roles are perfectly fulfilled -for instance, that the &lt;em&gt;Master of Ceremony&lt;/em&gt; orders some vegetarian dishes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  During the dojo
&lt;/h4&gt;

&lt;p&gt;In general, we use the format called “&lt;a href="https://johnrooksby.org/papers/XP2014_rooksby_dojo.pdf" rel="noopener noreferrer"&gt;Randori&lt;/a&gt;” — a type of &lt;a href="https://en.wikipedia.org/wiki/Team_programming#Mob_programming" rel="noopener noreferrer"&gt;mob programming&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In practice, we do iterations of 5 minutes. During an iteration, a single person has the keyboard and contributes to implementing the solution. The other participants follow on the screen and give advice. At the end of the iteration, the next person takes the keyboard and continues the code where the previous participant ended the iteration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ACKmcu0W29nfsK_c4" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ACKmcu0W29nfsK_c4" alt="A person looking at a screen with code" width="1024" height="683"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by charlesdeluvio on Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We continue like this for roughly 1h15~1h30. It is a bit short, but it fits our constraints. We try to keep between 5 and 10 minutes at the end to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debrief (&lt;em&gt;what did we learn? What went well? What ended up being poor decisions and how we could have proceeded differently, …&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Get volunteers for the roles of &lt;em&gt;Master of Ceremony&lt;/em&gt; and &lt;em&gt;Master of Dojo&lt;/em&gt; for the next session&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The good practices we identified
&lt;/h3&gt;

&lt;p&gt;Launching a coding dojo initiative at work is easy, and the initial enthusiasm carries the first sessions. However, momentum often fades after a couple of months 😫. Unfortunately, we've been there a couple of times. We then managed to relaunch coding dojos and to keep the momentum alive for several years 🎉&lt;/p&gt;

&lt;p&gt;Here are the practices that work well in our case.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We organize dojos on a &lt;strong&gt;&lt;em&gt;regular basis&lt;/em&gt;&lt;/strong&gt;. For instance, in the past, it was every other Tuesday. Now it is more flexible as our guideline is just to ensure we organize one per month. Anything works as long as it’s not “&lt;em&gt;the next dojo occurs whenever someone is motivated enough to organize it&lt;/em&gt;” 😄&lt;/li&gt;
&lt;li&gt;If the &lt;strong&gt;&lt;em&gt;duration&lt;/em&gt;&lt;/strong&gt; is too short (say, less than 1h), we don’t have time to do anything interesting. If it is too long (say, more than 2h), it is hard to keep participants focused and interested. In practice, we do roughly 1h15 so people can go back to work at 2 PM. It’s a bit short, and that’s why we came up with the trick that… we’ll reveal just after 😊 (but if in your setup you can do up to 2h long dojos, it’s probably better)&lt;/li&gt;
&lt;li&gt;As mentioned previously, we &lt;strong&gt;&lt;em&gt;split the responsibility&lt;/em&gt;&lt;/strong&gt; of the organization. We achieve this with two tricks: we define two roles for a given session, so that no single person is overwhelmed; and we try to have different volunteers take those roles in successive sessions. That way, more people know how the organization works, which reduces the bus factor.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AbVihflw94Hm2Np-f" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AbVihflw94Hm2Np-f" alt="Green darts on the target" width="1024" height="576"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Afif Ramdhasuma on Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Another advantage of changing the people in charge of preparing a topic is that, over the years, we have had a wide variety of topics, including some exotic ones (such as implementing an in-memory filesystem with &lt;a href="https://en.wikipedia.org/wiki/Filesystem_in_Userspace" rel="noopener noreferrer"&gt;FUSE&lt;/a&gt;, experimenting with &lt;a href="https://en.wikipedia.org/wiki/Behavior-driven_development" rel="noopener noreferrer"&gt;BDD&lt;/a&gt; with &lt;a href="https://cucumber.io/" rel="noopener noreferrer"&gt;Cucumber&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Mutation_testing" rel="noopener noreferrer"&gt;mutation testing&lt;/a&gt; with &lt;a href="https://pitest.org/" rel="noopener noreferrer"&gt;Pitest&lt;/a&gt;, or even creating music with code with &lt;a href="https://sonic-pi.net/" rel="noopener noreferrer"&gt;Sonic Pi&lt;/a&gt;, …).&lt;/p&gt;

&lt;h3&gt;
  
  
  The resources we appreciate… and why we rolled out our own
&lt;/h3&gt;

&lt;p&gt;If the &lt;em&gt;Master of Dojo&lt;/em&gt; is inspired, they can invent topics on their own: as long as a topic can be explained to participants in at most a couple of minutes, it’s probably ok. And when they lack inspiration, we can fall back on some of these resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://codingdojo.org/kata/" rel="noopener noreferrer"&gt;Coding Dojo&lt;/a&gt;: this website contains a catalog of such topics&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cyber-dojo.org/" rel="noopener noreferrer"&gt;cyber-dojo&lt;/a&gt;: This is a platform that provides tooling to do dojos remotely and collaboratively. We do not use this platform (because we already have a setup that works for us), but the neat part is that it also contains a catalog of topics&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/emilybache/GildedRose-Refactoring-Kata" rel="noopener noreferrer"&gt;Gilded Rose&lt;/a&gt;: This is a pretty nice exercise to practice refactoring. It contains a code base with several &lt;a href="https://martinfowler.com/bliki/CodeSmell.html" rel="noopener noreferrer"&gt;code smells&lt;/a&gt;. The goal is to add a feature to that code base, but to do so, the user is encouraged to refactor (and hence add tests) beforehand&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Orange-OpenSource/dojo-elephant-carpaccio" rel="noopener noreferrer"&gt;Elephant Carpaccio&lt;/a&gt;: a dojo to practice breaking a task into tiny slices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All this content could keep us busy for years… and it did!&lt;/p&gt;

&lt;p&gt;But lately, we felt that for our 1h15 dojos (which is &lt;em&gt;short&lt;/em&gt;), it was sometimes frustrating to spend “too much” time writing tests. I mean, sure, practicing TDD makes a lot of sense. But having a chance to practice other skills than just writing tests is nice too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl510zdpkbck8m7b1qcbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl510zdpkbck8m7b1qcbd.png" alt="The classical TDD workflow as: Red tests, then Green tests, then Refactor and start all over again." width="701" height="285"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;TDD Workflow&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So, we experimented with another approach: what if the &lt;em&gt;Master of Dojo&lt;/em&gt; was also in charge of writing a test suite &lt;strong&gt;&lt;em&gt;before&lt;/em&gt;&lt;/strong&gt; the dojo? This way, all the tests can be commented out beforehand. During the session, we uncomment them one by one and write the code to turn each one green before uncommenting the next.&lt;/p&gt;
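&lt;p&gt;To make the format concrete, here is a minimal sketch of what such a prepared exercise could look like. It is purely illustrative (a hypothetical Roman-numeral kata, written in Go for brevity; the topics in our repo are mostly in C# and Java, and this one is not taken from it):&lt;/p&gt;

```go
package main

import "fmt"

// toRoman is the function participants implement during the session.
// At the start, it only needs to pass the first (uncommented) test case.
func toRoman(n int) string {
	if n == 1 {
		return "I"
	}
	return "" // to be implemented as tests are uncommented
}

func main() {
	// The Master of Dojo prepares the full test suite in advance,
	// with every case but the first commented out. Participants
	// uncomment them one by one and make each pass before moving on.
	cases := []struct {
		in   int
		want string
	}{
		{1, "I"},
		// {4, "IV"},
		// {9, "IX"},
		// {14, "XIV"},
		// {1990, "MCMXC"},
	}
	for _, c := range cases {
		if got := toRoman(c.in); got != c.want {
			fmt.Printf("FAIL toRoman(%d) = %q, want %q\n", c.in, got, c.want)
		} else {
			fmt.Printf("PASS toRoman(%d) = %q\n", c.in, got)
		}
	}
}
```

&lt;p&gt;The session then alternates between uncommenting a case, making it green, and refactoring - the same rhythm as TDD, minus the test-writing.&lt;/p&gt;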

&lt;p&gt;We gave that approach a try a couple of times… and it was great! 😃 Not only does it save time, but it also leads to more fluid sessions - because it ensures there is no ambiguity about what we are trying to implement.&lt;/p&gt;

&lt;p&gt;I mean, of course, it is interesting to have more classic topics once in a while (to keep practicing writing tests). But it turns out that with this approach, we are still practicing a kind of TDD -“red test, green test, refactor”. The only difference is that the participants do not write the tests. 🙂&lt;/p&gt;

&lt;h3&gt;
  
  
  What we open-sourced
&lt;/h3&gt;

&lt;p&gt;Since it takes a bit of time to set up a topic with an existing test suite, and since we appreciate this approach, we figured we might as well share the content prepared by our &lt;em&gt;Masters of Dojo&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;And that’s what we released on GitHub: a set of topics on which we worked, along with their test suite. All you need to do is clone the repository, select a topic, open the corresponding project in your IDE, and follow the instructions in the associated &lt;em&gt;README&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/criteo/coding-dojo/?tab=readme-ov-file" rel="noopener noreferrer"&gt;GitHub - criteo/coding-dojo: Some content to organize some coding dojos&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For now we have only a handful of topics (mostly in C#, and a bit of Java), but we will keep adding new content as we produce it.&lt;/p&gt;

&lt;p&gt;Of course, we would also gladly accept pull requests that either add new topics or port an existing one to another language.&lt;/p&gt;

&lt;p&gt;We hope you enjoy the content, and we would be very interested in any feedback or suggestions you might have! 👋&lt;/p&gt;




</description>
      <category>programming</category>
      <category>tdd</category>
      <category>codingdojo</category>
      <category>kata</category>
    </item>
    <item>
      <title>Importance of Graceful Shutdown in Kubernetes</title>
      <dc:creator>Criteo Tech Community</dc:creator>
      <pubDate>Wed, 10 Sep 2025 07:01:46 +0000</pubDate>
      <link>https://forem.com/criteo_tech_community/importance-of-graceful-shutdown-in-kubernetes-2ikb</link>
      <guid>https://forem.com/criteo_tech_community/importance-of-graceful-shutdown-in-kubernetes-2ikb</guid>
      <description>&lt;h4&gt;
  
  
  Author: &lt;a href="https://www.linkedin.com/in/alikhil/" rel="noopener noreferrer"&gt;Alik Khilazhev&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Have you ever deployed a new version of your app in Kubernetes and noticed errors briefly spiking during rollout? Many teams do not even realize this is happening, especially if they are not closely monitoring their error rates during deployments.&lt;/p&gt;

&lt;p&gt;There is a common misconception in the Kubernetes world that bothers me. The official Kubernetes &lt;a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and most guides claim that “if you want zero downtime upgrades, just use rolling update mode on deployments”. I have learned the hard way that this simply is not true — rolling updates alone are &lt;strong&gt;NOT enough&lt;/strong&gt; for true zero-downtime deployments.&lt;/p&gt;

&lt;p&gt;And it is not just about deployments. Your pods can be terminated for many other reasons: scaling events, node maintenance, preemption, resource constraints, and more. Without proper graceful shutdown handling, any of these events can lead to dropped requests and frustrated users.&lt;/p&gt;

&lt;p&gt;In this post, I will share what I have learned about implementing proper graceful shutdown in Kubernetes. I will show you exactly what happens behind the scenes, provide working code examples, and back everything with real test results that clearly demonstrate the difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem: Hidden Errors During Pod Termination
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A7nzx91zhZcYDMuZtsfEn_Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A7nzx91zhZcYDMuZtsfEn_Q.png" alt="A funny picture of a Kubernetes pod gracefully shutting down created by ChatGPT" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;ChatGPT: draw funny picture of Kubernetes pod gracefully shutting down&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you are running services on Kubernetes, you have probably noticed that even with rolling updates (where Kubernetes gradually replaces pods), you might still see errors during deployment. This is especially annoying when you are trying to maintain “zero-downtime” systems.&lt;/p&gt;

&lt;p&gt;When Kubernetes needs to terminate a pod (for any reason), it follows this sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sends a SIGTERM signal to your container&lt;/li&gt;
&lt;li&gt;Waits for a grace period (30 seconds by default)&lt;/li&gt;
&lt;li&gt;If the container does not exit after the grace period, it gets brutal and sends a SIGKILL signal&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The problem? Most applications do not properly handle that SIGTERM signal. They just die immediately, dropping any in-flight requests. In the real world, while most API requests complete in 100–300ms, there are often those long-running operations that take 5–15 seconds or more. Think about processing uploads, generating reports, or running complex database queries. When these longer operations get cut off, that’s when users really feel the pain.&lt;/p&gt;
&lt;h4&gt;
  
  
  When Does Kubernetes Terminate Pods?
&lt;/h4&gt;

&lt;p&gt;Rolling updates are just one scenario where your pods might be terminated. Here are other common situations that can lead to pod terminations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Pod Autoscaler Events&lt;/strong&gt; : When HPA scales down during low-traffic periods, some pods get terminated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Pressure&lt;/strong&gt; : If your nodes are under resource pressure, the Kubernetes scheduler might decide to evict certain pods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node Maintenance&lt;/strong&gt; : During cluster upgrades, node draining causes many pods to be evicted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spot/Preemptible Instances&lt;/strong&gt; : If you are using cost-saving node types like spot instances, these can be reclaimed with minimal notice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All these scenarios follow the same termination process, so implementing proper graceful shutdown handling protects you from errors in all of these cases — not just during upgrades.&lt;/p&gt;
&lt;h4&gt;
  
  
  Let’s Test It: Basic vs. Graceful Service
&lt;/h4&gt;

&lt;p&gt;Instead of just talking about theory, I built a small lab to demonstrate the difference between proper and improper shutdown handling. I created two nearly identical Go services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Basic Service&lt;/strong&gt; : A standard HTTP server with no special shutdown handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful Service&lt;/strong&gt; : The same service but with proper SIGTERM handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process requests that take about 4 seconds to complete (intentionally configured for easier demonstration)&lt;/li&gt;
&lt;li&gt;Run in the same Kubernetes cluster with identical configurations&lt;/li&gt;
&lt;li&gt;Serve the same endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I specifically chose a 4-second processing time to make the problem obvious. While this might seem long compared to typical 100–300ms API calls, it perfectly simulates those problematic long-running operations that occur in real-world applications. The only difference between the services is how they respond to termination signals.&lt;/p&gt;

&lt;p&gt;To test them, I wrote a simple k6 script that hammers both services with requests while triggering a rolling restart of each service’s deployment. Here is what happened:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic Service: The Error-Prone One&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;checks_total.......................: 695 11.450339/s
checks_succeeded...................: 97.98% 681 out of 695
checks_failed......................: 2.01% 14 out of 695

✗ status is 200
  ↳ 97% — ✓ 681 / ✗ 14

http_req_failed....................: 2.01% 14 out of 696
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Graceful Service: The Reliable One&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;checks_total.......................: 750 11.724824/s
checks_succeeded...................: 100.00% 750 out of 750
checks_failed......................: 0.00% 0 out of 750

✓ status is 200

http_req_failed....................: 0.00% 0 out of 751
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results speak for themselves. The basic service dropped 14 requests during the update (that is 2% of all traffic), while the graceful service handled everything perfectly without a single error.&lt;/p&gt;

&lt;p&gt;You might think “2% is not that bad” — but if you are doing several deployments per day and have thousands of users, that adds up to a lot of errors. Plus, in my experience, these errors tend to happen at the worst possible times.&lt;/p&gt;

&lt;h3&gt;
  
  
  So, How Do We Fix It? The Graceful Shutdown Recipe
&lt;/h3&gt;

&lt;p&gt;After digging into this problem and testing different solutions, I have put together a simple recipe for proper graceful shutdown. While my examples are in Go, the fundamental principles apply to any language or framework you are using.&lt;/p&gt;

&lt;p&gt;Here are the key ingredients:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Listen for SIGTERM Signals
&lt;/h4&gt;

&lt;p&gt;First, your app needs to catch that SIGTERM signal instead of ignoring it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Set up channel for shutdown signals
stop := make(chan os.Signal, 1)
signal.Notify(stop, os.Interrupt, syscall.SIGTERM)

// Block until we receive a shutdown signal
&amp;lt;-stop
log.Println("Shutdown signal received")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This part is easy — you are just telling your app to wake up when Kubernetes asks it to shut down.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Track Your In-Flight Requests
&lt;/h4&gt;

&lt;p&gt;You need to know when it is safe to shut down, so keep track of ongoing requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Create a request counter
var inFlightRequests atomic.Int64

http.HandleFunc("/process", func(w http.ResponseWriter, r *http.Request) {
    // Increment counter when request starts
    inFlightRequests.Add(1)
    // do not forget to decrement when done!
    defer inFlightRequests.Add(-1)

    // Your normal request handling...
    time.Sleep(4 * time.Second) // Simulating long-running work
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This counter lets you check if there are still requests being processed before shutting down. It is especially important for those long-running operations that users have already waited several seconds for — the last thing they want is to see an error right before completion!&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Separate Your Health Checks
&lt;/h4&gt;

&lt;p&gt;Here is a commonly overlooked trick — you need different health check endpoints for liveness and readiness:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Track shutdown state
var isShuttingDown atomic.Bool

// Readiness probe - returns 503 when shutting down
http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
    if isShuttingDown.Load() {
        w.WriteHeader(http.StatusServiceUnavailable)
        fmt.Fprintf(w, "Shutting down, not ready")
        return
    }

    w.WriteHeader(http.StatusOK)
    fmt.Fprintf(w, "Ready for traffic")
})

// Liveness probe - always returns 200 (we are still alive!)
http.HandleFunc("/alive", func(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    fmt.Fprintf(w, "I'm alive")
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation is crucial. The readiness probe tells Kubernetes to stop sending new traffic, while the liveness probe says “do not kill me yet, I’m still working!”&lt;/p&gt;

&lt;h4&gt;
  
  
  4. The Shutdown Dance
&lt;/h4&gt;

&lt;p&gt;Now for the most important part — the shutdown sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Step 1: Mark service as shutting down
isShuttingDown.Store(true)

// Step 2: Let Kubernetes notice the readiness probe failing
time.Sleep(5 * time.Second)

// Step 3: Wait for in-flight requests to finish
for inFlightRequests.Load() &amp;gt; 0 {
    time.Sleep(1 * time.Second)
}

// Step 4: Finally, shut down the server gracefully
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

if err := server.Shutdown(ctx); err != nil {
    log.Fatalf("Forced shutdown: %v", err)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I’ve found this sequence to be optimal. First, we mark ourselves as “not ready” but keep running. We pause to give Kubernetes time to notice and update its routing. Then we patiently wait until all in-flight requests finish before actually shutting down the server.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Configure Kubernetes Correctly
&lt;/h4&gt;

&lt;p&gt;Do not forget to adjust your Kubernetes configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Use different probes for liveness and readiness
livenessProbe:
  httpGet:
    path: /alive # Always returns OK
    port: 8080
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready # Returns 503 during shutdown
    port: 8080
  periodSeconds: 3
  failureThreshold: 2 # Stops routing traffic after 2 failed checks (6 seconds)

# Give pods enough time to shut down gracefully
terminationGracePeriodSeconds: 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells Kubernetes to wait up to 30 seconds for your app to finish processing requests before forcefully terminating it.&lt;/p&gt;

&lt;h3&gt;
  
  
  TL;DR: Quick Tips
&lt;/h3&gt;

&lt;p&gt;If you are in a hurry, here are the key takeaways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Catch SIGTERM Signals&lt;/strong&gt; : Do not let your app be surprised when Kubernetes wants it to shut down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track In-Flight Requests&lt;/strong&gt; : Know when it is safe to exit by counting active requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Split Your Health Checks&lt;/strong&gt; : Use separate endpoints for liveness (am I running?) and readiness (can I take traffic?).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail Readiness First&lt;/strong&gt; : As soon as shutdown begins, start returning “not ready” on your readiness endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wait for Requests&lt;/strong&gt; : Do not just shut down — wait for all active requests to complete first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Built-In Shutdown&lt;/strong&gt; : Most modern web frameworks have graceful shutdown options; use them!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure Termination Grace Period&lt;/strong&gt; : Give your pods enough time to complete the shutdown sequence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test Under Load&lt;/strong&gt; : You will not catch these issues in simple tests — you need realistic traffic patterns.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Wrap Up: Is It Worth the Extra Code?
&lt;/h3&gt;

&lt;p&gt;You might be wondering if adding all this extra code is really worth it. After all, we’re only talking about a 2% error rate during pod termination events.&lt;/p&gt;

&lt;p&gt;From my experience working with high-traffic services, I would say absolutely yes — for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;User Experience&lt;/strong&gt; : Even small error rates look bad to users. Nobody wants to see “Something went wrong” messages, especially after waiting 10+ seconds for a long-running operation to complete.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cascading Failures&lt;/strong&gt; : Those errors can cascade through your system, especially if services depend on each other. Long-running requests often touch multiple critical systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment Confidence&lt;/strong&gt; : With proper graceful shutdown, you can deploy more frequently without worrying about causing problems.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The good news is that once you have implemented this pattern, it is easy to reuse it across your services. You can even create a small library or template for your organization.&lt;/p&gt;

&lt;p&gt;In production environments where I have implemented these patterns, we have gone from seeing a spike of errors with every deployment to deploying multiple times per day with zero impact on users. That is a win in my book!&lt;/p&gt;

&lt;p&gt;If you want to dive deeper into this topic, I recommend checking out the article &lt;a href="https://learnk8s.io/graceful-shutdown" rel="noopener noreferrer"&gt;Graceful shutdown and zero downtime deployments in Kubernetes&lt;/a&gt; from &lt;a href="http://learnk8s.io/" rel="noopener noreferrer"&gt;LearnKube — the Kubernetes training company&lt;/a&gt; . It provides additional technical details about graceful shutdown in Kubernetes, though it does not emphasize the critical role of readiness probes in properly implementing the pattern as we have discussed here.&lt;/p&gt;

&lt;p&gt;For those interested in seeing the actual code I used in my testing lab, I’ve published it on &lt;a href="https://github.com/alikhil/gracefull-shutdown-lab" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; with instructions for running the demo yourself.&lt;/p&gt;

&lt;p&gt;Have you implemented graceful shutdown in your services? Did you encounter any other edge cases I didn’t cover? 👀&lt;/p&gt;

&lt;p&gt;Let me know in the comments how this pattern has worked for you! 👇&lt;/p&gt;




</description>
      <category>devops</category>
      <category>softwaredevelopment</category>
      <category>kubernetes</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
