<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sara A.</title>
    <description>The latest articles on Forem by Sara A. (@srsandrade).</description>
    <link>https://forem.com/srsandrade</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3726262%2F9ed467e4-a85c-4886-891f-9d0a839610bf.png</url>
      <title>Forem: Sara A.</title>
      <link>https://forem.com/srsandrade</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/srsandrade"/>
    <language>en</language>
    <item>
      <title>The Test That Lied to Me: A practical guide to writing unit tests that actually mean something</title>
      <dc:creator>Sara A.</dc:creator>
      <pubDate>Sat, 07 Mar 2026 19:22:15 +0000</pubDate>
      <link>https://forem.com/srsandrade/the-test-that-lied-to-me-a-practical-guide-to-writing-unit-tests-that-actually-mean-something-5f26</link>
      <guid>https://forem.com/srsandrade/the-test-that-lied-to-me-a-practical-guide-to-writing-unit-tests-that-actually-mean-something-5f26</guid>
      <description>&lt;h1&gt;
  
  
  The Test That Lied to Me
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;A practical guide to writing unit tests that actually mean something&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;A quick note before we start: there is a lot of enthusiasm lately about AI generating unit tests automatically. I have tried it. The results were technically valid, consistently green, and almost completely useless — which, if you think about it, is a perfect description of most of the unit tests in the industry.&lt;/p&gt;

&lt;p&gt;So here we are. Maybe this helps the AI write better tests. Maybe it just helps the engineers. Either way, I felt the need to write it down.&lt;/p&gt;

&lt;p&gt;This guide is not about coverage percentages or which framework to use. It is about the decisions that determine whether your test suite is an asset or an alibi.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The examples in this guide use Java with JUnit 5 and Mockito. The principles apply everywhere.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Part 1: The Name Is the Specification
&lt;/h2&gt;

&lt;p&gt;Before a test does anything, it declares what it expects. That declaration lives in the name.&lt;/p&gt;

&lt;p&gt;A bad test name is a lost opportunity. Not just for documentation — for thinking. If you cannot write a clear name for a test, it is usually because you have not yet decided what you are actually testing.&lt;/p&gt;

&lt;p&gt;The convention that works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Should_ExpectedBehaviour_When_StateUnderTest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not because the format is sacred, but because it forces two decisions: what should happen, and under what condition. Both need to be explicit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Tells you nothing&lt;/span&gt;
&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;testValidation&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✓ Tells you everything&lt;/span&gt;
&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_ReturnValidationError_When_AmountIsNegative&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second name is a spec. If this test fails, you know exactly what broke before reading a single line of code. That matters at 11pm when something is on fire.&lt;/p&gt;

&lt;p&gt;One more thing: consistency. A codebase where some tests say &lt;code&gt;should_&lt;/code&gt;, some say &lt;code&gt;test_&lt;/code&gt;, and the rest say &lt;code&gt;verify_&lt;/code&gt; is a codebase where nobody agreed on anything. Pick a convention and apply it. It costs nothing and saves more than you expect.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: One Test, One Reason to Fail
&lt;/h2&gt;

&lt;p&gt;The most common mistake in test suites is not bad assertions. It is too many of them, testing too many things at once.&lt;/p&gt;

&lt;p&gt;A test that can fail for three different reasons gives you almost no information when it does fail. You know something is wrong. You do not know what.&lt;/p&gt;

&lt;h3&gt;
  
  
  One condition, one test
&lt;/h3&gt;

&lt;p&gt;Even if two different invalid inputs produce the same error, they belong in separate tests. The failure condition is not the same, even if the output is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Two causes, one test&lt;/span&gt;
&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_ThrowException_When_Invalid&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;assertThrows&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;NotFoundException&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findOrder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unknownId&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="n"&gt;assertThrows&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;NotFoundException&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findOrder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deletedId&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✓ One cause, one test&lt;/span&gt;
&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_ThrowNotFoundException_When_OrderDoesNotExist&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orderRepository&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unknownId&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;thenReturn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Optional&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;empty&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

    &lt;span class="n"&gt;assertThrows&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;NotFoundException&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findOrder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unknownId&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_ThrowNotFoundException_When_OrderIsDeleted&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orderRepository&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deletedId&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;thenReturn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Optional&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deletedOrder&lt;/span&gt;&lt;span class="o"&gt;()));&lt;/span&gt;

    &lt;span class="n"&gt;assertThrows&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;NotFoundException&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findOrder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deletedId&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tests are longer this way. That is the point. Diagnostic value is worth the extra lines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parameterization: the right use
&lt;/h3&gt;

&lt;p&gt;Parameterized tests are useful when you are testing the same behaviour with different values of the same input. They are not useful when you are testing different behaviours and bundling them together to make the test file look shorter.&lt;/p&gt;

&lt;p&gt;The rule: one axis of variation per parameterized test. If your test method takes a flag name as a parameter alongside the flag value, you are probably mixing two different concerns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Mixed concerns — iterates over unrelated flags&lt;/span&gt;
&lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Arguments&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"include_tax"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"include_tax"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"include_discount"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"include_discount"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@ParameterizedTest&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_HandleFlag_When_Set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// tests multiple unrelated flags in one method&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✓ One concern, two values&lt;/span&gt;
&lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Arguments&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;taxFlagVariants&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"with-tax.json"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"without-tax.json"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@ParameterizedTest&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_GenerateCorrectInvoice_When_TaxFlagSet&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;includeTax&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;expectedFile&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// tests one flag, both states&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the parameterized test fails, you want to know exactly which value caused it. A test that iterates over unrelated flags gives you a row number. A focused test gives you a reason.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: Structure Is Not Optional
&lt;/h2&gt;

&lt;p&gt;A test has three jobs: set up the world, do the thing, check what happened. If you cannot identify which line belongs to which job, the test is already a problem.&lt;/p&gt;

&lt;p&gt;Given, When, Then is not ceremony. It is the minimum structure required for a test to be readable by someone who did not write it — including future you.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_CalculateAggregatedData_When_ValidDataProvided&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="c1"&gt;// Given&lt;/span&gt;
    &lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repository&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;returnDataToBeAggregated&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;thenReturn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;DataPoint&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sensor1"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;DataPoint&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sensor2"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="nc"&gt;AggregatorService&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AggregatorService&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repository&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// When&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;calculateAggregatedData&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Then&lt;/span&gt;
    &lt;span class="n"&gt;assertEquals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The comments (&lt;code&gt;// Given&lt;/code&gt;, &lt;code&gt;// When&lt;/code&gt;, &lt;code&gt;// Then&lt;/code&gt;) are for illustration only. In real tests, the structure should be clear enough that the comments are unnecessary. If they are not, the test probably needs restructuring.&lt;/p&gt;
&lt;/blockquote&gt;
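
&lt;p&gt;For the skeptical: here is the same aggregation scenario with the comments gone and blank lines carrying the structure. This is a self-contained sketch: the repository and service are minimal stand-ins (plain ints instead of &lt;code&gt;DataPoint&lt;/code&gt;, no JUnit or Mockito), and the names are illustrative.&lt;/p&gt;

```java
import java.util.List;

public class GivenWhenThenSketch {

    // Minimal stand-ins for the repository and AggregatorService from the
    // example above, simplified to plain ints (no framework dependencies).
    interface Repository {
        List<Integer> returnDataToBeAggregated();
    }

    static class AggregatorService {
        private final Repository repository;

        AggregatorService(Repository repository) {
            this.repository = repository;
        }

        int calculateAggregatedData() {
            return repository.returnDataToBeAggregated()
                    .stream().mapToInt(Integer::intValue).sum();
        }
    }

    public static void main(String[] args) {
        Repository repository = () -> List.of(100, 200);
        AggregatorService victim = new AggregatorService(repository);

        int result = victim.calculateAggregatedData();

        if (result != 300) throw new AssertionError("expected 300, got " + result);
        System.out.println(result); // prints 300
    }
}
```

&lt;p&gt;No &lt;code&gt;// Given&lt;/code&gt; comment survives, and nothing is lost: the blank lines mark the three phases on their own.&lt;/p&gt;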

&lt;h3&gt;
  
  
  Naming conventions within the test
&lt;/h3&gt;

&lt;p&gt;Two naming decisions that pay consistent dividends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick a consistent name for the class under test and use it everywhere. Common choices are &lt;code&gt;victim&lt;/code&gt;, &lt;code&gt;underTest&lt;/code&gt;, or &lt;code&gt;sut&lt;/code&gt; (system under test). Which one you pick matters less than the fact that everyone on the team picks the same one — it makes the subject of the test immediately identifiable at a glance.&lt;/li&gt;
&lt;li&gt;Do the same for the output. &lt;code&gt;result&lt;/code&gt; is a common choice. Whatever you pick, the consistency is what makes tests scannable.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;AggregatorService&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AggregatorService&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repository&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;calculateAggregatedData&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Keep variables where they matter
&lt;/h3&gt;

&lt;p&gt;There is a tendency to extract every string into a named variable. In production code, this is usually right. In tests, it can be counterproductive.&lt;/p&gt;

&lt;p&gt;If a value is self-explanatory, inline it. Extracting &lt;code&gt;"ADMIN"&lt;/code&gt; into a variable called &lt;code&gt;adminRoleName&lt;/code&gt; adds a line and removes information. The reader now has to look up what &lt;code&gt;adminRoleName&lt;/code&gt; is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Over-extracted&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;adminRole&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ADMIN"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userRole&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"USER"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userId1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"usr-001"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userId2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"usr-002"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userId3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"usr-003"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;insertUsers&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;createUser&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adminRole&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;createUser&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adminRole&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;createUser&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId3&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userRole&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// ✓ Inlined — the values speak for themselves&lt;/span&gt;
&lt;span class="n"&gt;insertUsers&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;createUser&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"usr-001"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ADMIN"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;createUser&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"usr-002"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ADMIN"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;createUser&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"usr-003"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"USER"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requestUsersByRole&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ADMIN"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;assertThat&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;extracting&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;User:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;getId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;containsOnly&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"usr-001"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"usr-002"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
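
&lt;p&gt;The flip side deserves a mention: when a value is &lt;em&gt;not&lt;/em&gt; self-explanatory, a name adds information instead of removing it. A small illustrative sketch (the cutoff instant and its meaning are hypothetical, not from any example above):&lt;/p&gt;

```java
import java.time.Instant;

public class NamedValueSketch {
    public static void main(String[] args) {
        // ❌ Inlined but opaque: nobody can tell what this instant means.
        // requestArchivedOrders(Instant.ofEpochMilli(1767225600000L));

        // ✓ Named: the variable explains why this particular value matters.
        Instant retentionCutoff = Instant.ofEpochMilli(1767225600000L);
        System.out.println(retentionCutoff); // prints 2026-01-01T00:00:00Z
    }
}
```

&lt;p&gt;The rule works in both directions: extract when the name carries meaning the raw value does not, inline when the value already speaks for itself.&lt;/p&gt;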






&lt;h2&gt;
  
  
  Part 4: The Setup Trap
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;@BeforeEach&lt;/code&gt; is one of the most misused tools in unit testing. It exists to avoid repeating identical setup across tests. It is routinely used to hide setup that is not actually identical — just similar enough to seem like it should be shared.&lt;/p&gt;

&lt;p&gt;The result is tests that look short but are not self-contained. To understand what a test actually does, you have to read the test and the setup method and remember how they interact. If a later test overrides one of the behaviours defined in setup, that is three places to look.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Hidden dependency — where does the mock behaviour come from?&lt;/span&gt;
&lt;span class="nd"&gt;@BeforeEach&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;anyLong&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;thenReturn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Optional&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Entity&lt;/span&gt;&lt;span class="o"&gt;()));&lt;/span&gt;
    &lt;span class="n"&gt;victim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MyService&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_ThrowException_When_NotFound&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Silently overrides BeforeEach&lt;/span&gt;
    &lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;anyLong&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;thenReturn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Optional&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;empty&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;assertThrows&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;NotFoundException&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;99L&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✓ Self-contained — everything you need is right here&lt;/span&gt;
&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_ReturnEntity_When_IdIsValid&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1L&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;thenReturn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Optional&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Entity&lt;/span&gt;&lt;span class="o"&gt;()));&lt;/span&gt;
    &lt;span class="nc"&gt;MyService&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MyService&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="nc"&gt;Entity&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1L&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;assertNotNull&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_ThrowException_When_NotFound&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;99L&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;thenReturn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Optional&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;empty&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="nc"&gt;MyService&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MyService&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;assertThrows&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;NotFoundException&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;99L&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second version is longer. It is also the one that tells you, immediately, what must be true for each test to pass. There is no invisible context. There is no setup to remember.&lt;/p&gt;

&lt;p&gt;The alternative to &lt;code&gt;@BeforeEach&lt;/code&gt; is not duplication — it is helper functions used correctly. A helper that builds an object with sensible defaults and accepts explicit parameters for the values that matter keeps the test readable without hiding what is being tested.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Helper hides what matters&lt;/span&gt;
&lt;span class="n"&gt;prepareData&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;createInvalidData&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;createPartialData&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// ✓ Helper is explicit about what varies&lt;/span&gt;
&lt;span class="n"&gt;prepareData&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;createOrderLine&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"item-a"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;100.0&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;createOrderLine&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"item-b"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference is what the reader learns from looking at the test. The first version tells you the data is invalid and partial. The second tells you exactly why — one line has a zero price. That is the information that matters for understanding what the test is actually verifying.&lt;/p&gt;
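&lt;p&gt;A sketch of the helper style this implies (&lt;code&gt;OrderLine&lt;/code&gt; and the default values are hypothetical; only the shape of the API is the point):&lt;/p&gt;

```java
// Hypothetical test helper: sensible defaults, explicit parameters
// for the values a given test actually cares about.
public class OrderLines {

    // Illustrative data carrier; field names are invented for this sketch.
    public record OrderLine(String itemId, double price, int quantity) {}

    // Fully explicit variant for tests where price and quantity matter.
    public static OrderLine createOrderLine(String itemId, double price, int quantity) {
        return new OrderLine(itemId, price, quantity);
    }

    // Defaults for tests where only the item identity matters.
    public static OrderLine createOrderLine(String itemId) {
        return createOrderLine(itemId, 10.0, 1);
    }
}
```

&lt;p&gt;A test that cares about a zero price passes it explicitly; a test that does not takes the defaults, and the reader immediately knows which is which.&lt;/p&gt;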

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;@BeforeEach&lt;/code&gt; has legitimate uses: initialising mocks, setting up state that is genuinely shared across every test, preparing infrastructure. The problem is using it to define mock behaviour — which is almost never truly shared.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;@BeforeAll&lt;/code&gt; has even fewer legitimate uses at the unit test level. Loading a config file once, compiling a regex, spinning up something expensive. Not setting up test data. If your unit test needs data set up once for all tests, that is usually a sign the tests are not as independent as they should be.&lt;/p&gt;
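&lt;p&gt;For instance, the one genuinely shared thing in a validator test is often a compiled pattern. A sketch (&lt;code&gt;SkuValidator&lt;/code&gt; and its regex are made up; shown as plain static initialisation to keep it dependency-free):&lt;/p&gt;

```java
import java.util.regex.Pattern;

public class SkuValidator {
    // The kind of thing @BeforeAll exists for: compiled once,
    // read-only, and genuinely shared by every test.
    static final Pattern SKU = Pattern.compile("[A-Z]{3}-\\d{4}");

    public static boolean isValidSku(String candidate) {
        if (candidate == null) {
            return false;
        }
        return SKU.matcher(candidate).matches();
    }
}
```

&lt;p&gt;In JUnit the same pattern would live in a &lt;code&gt;@BeforeAll&lt;/code&gt; method or a static field on the test class; the point is that it is immutable and safe to share, unlike test data.&lt;/p&gt;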

&lt;h3&gt;
  
  
  Test utilities — use sparingly, test thoroughly
&lt;/h3&gt;

&lt;p&gt;There is a tension worth acknowledging here. On one hand, test utility classes tend to become bloated, overgeneralised, and quietly wrong. On the other hand, utility classes extracted from production code need to be tested — properly, not incidentally.&lt;/p&gt;

&lt;p&gt;These are different problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On production utility classes:&lt;/strong&gt; a utility method extracted for reuse is a first-class unit of logic. It deserves its own tests covering valid inputs, invalid inputs, and boundary conditions. The fact that it is called from another class does not mean it is implicitly tested — it means it is tested indirectly, which is not the same thing. Indirect coverage tells you something broke. Direct tests tell you what.&lt;/p&gt;

&lt;p&gt;That said, direct unit tests alone are not enough. A utility method that works correctly in isolation can still behave unexpectedly in context — with real data, in combination with other logic, under conditions the unit tests did not anticipate. Test it directly and verify it in context.&lt;/p&gt;
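&lt;p&gt;As a sketch of what direct testing looks like here, a hypothetical extracted utility whose valid, invalid, and boundary behaviour is pinned by its own assertions (&lt;code&gt;SlugUtil&lt;/code&gt; is invented for illustration):&lt;/p&gt;

```java
// Hypothetical production utility extracted for reuse: a first-class
// unit of logic, so it gets first-class tests of its own.
public class SlugUtil {

    // Lowercases the title and collapses every run of
    // non-alphanumeric characters into a single hyphen.
    public static String slugify(String title) {
        if (title == null) {
            throw new IllegalArgumentException("title must not be null");
        }
        return title.trim().toLowerCase().replaceAll("[^a-z0-9]+", "-");
    }
}
```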

&lt;p&gt;&lt;strong&gt;On test utility classes:&lt;/strong&gt; the bar should be high. A helper that creates a test object with sensible defaults is useful. A helper that encodes business logic, makes assertions, or accumulates enough behaviour to need its own documentation is a liability. When a test utility class becomes complex enough that you have to understand it to understand a test, it has defeated its own purpose.&lt;/p&gt;

&lt;p&gt;The rule of thumb: if the logic is genuinely shared across many tests and has no natural home elsewhere, a utility makes sense. If it exists mainly to save a few lines, inline it. Explicit setup in the test is almost always easier to read than a call to a helper that hides what is actually being prepared.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 5: Keep Tests Simple
&lt;/h2&gt;

&lt;p&gt;A test that contains an &lt;code&gt;if&lt;/code&gt; statement can be wrong in two different ways: the production code can be wrong, or the test logic can be wrong. At that point, who is testing the test?&lt;/p&gt;

&lt;p&gt;Tests should verify behaviour, not implement it. The moment a test starts making decisions — branching on input, looping over results, switching on type — it has become code that needs its own tests. That is not a metaphor. That is the actual problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ The test is doing too much thinking&lt;/span&gt;
&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_ApplyDiscount_When_Eligible&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Customer&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;testCustomers&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isEligible&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;assertEquals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;calculateDiscount&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;assertEquals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;calculateDiscount&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✓ Two cases, two tests, no ambiguity&lt;/span&gt;
&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_ApplyDiscount_When_CustomerIsEligible&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Customer&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Customer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"alice"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eligible&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

    &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;calculateDiscount&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;assertEquals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_NotApplyDiscount_When_CustomerIsNotEligible&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Customer&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Customer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"bob"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eligible&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

    &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;calculateDiscount&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;assertEquals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When test logic grows complex, the instinct is to make it smarter. The right move is almost always to make it simpler — and split it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 6: Mock the Behaviour, Not the Data
&lt;/h2&gt;

&lt;p&gt;Mocking is one of the easier things to do wrong in unit testing. The tell is when you find yourself mocking a data object — a DTO, a plain record, a value container.&lt;/p&gt;

&lt;p&gt;Mocking data objects produces fragile tests. They break when the object's structure changes, even when the logic being tested has not changed at all. Worse, they test almost nothing meaningful. A mock DTO that returns a hardcoded string is not telling you anything about your system.&lt;/p&gt;

&lt;p&gt;What you actually want to mock is the behaviour of the things your code depends on — repositories, external services, anything that makes a decision or crosses a boundary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Mocking data — fragile and meaningless&lt;/span&gt;
&lt;span class="nc"&gt;OrderDTO&lt;/span&gt; &lt;span class="n"&gt;dto&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Mockito&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;mock&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OrderDTO&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dto&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getProductId&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;thenReturn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"123"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dto&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getQuantity&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;thenReturn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// If OrderDTO gains a new field, this test might break for no reason&lt;/span&gt;

&lt;span class="c1"&gt;// ✓ Mocking behaviour — tests how the service handles real scenarios&lt;/span&gt;
&lt;span class="nc"&gt;Order&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Order&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"123"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Product A"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orderRepository&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"123"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;thenReturn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Optional&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The distinction matters most when thinking about what you are actually verifying. A test that mocks a DTO is verifying that Mockito works. A test that mocks a repository is verifying that your service behaves correctly when data is and is not found.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verifying interactions
&lt;/h3&gt;

&lt;p&gt;Not everything worth testing produces a return value. Sometimes the important thing is that a method was called, was called with the right arguments, or was never called at all.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_NotPersist_When_InputIsInvalid&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;processData&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"invalid"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;verify&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repository&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;never&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;anyString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_PersistParsedValue_When_InputIsValid&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;ArgumentCaptor&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;captor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ArgumentCaptor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;forClass&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;processData&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"raw-value"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;verify&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repository&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;captor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;capture&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;assertEquals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"parsed-value"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;captor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getValue&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first test tells you nothing was saved. The second tells you exactly what was saved. Neither requires a return value. Both tell you something real about how the system behaves.&lt;/p&gt;

&lt;p&gt;That said, if verifying an interaction is the only way to test something meaningful — a calculation, a transformation, a decision — it is worth pausing before reaching for &lt;code&gt;verify&lt;/code&gt;. Code that can only be validated through its side effects is often code that is doing too much in one place. Extracting the logic into a dedicated class or utility makes it directly testable by its output, which is almost always cleaner.&lt;/p&gt;

&lt;p&gt;Needing to spy on a method to confirm a calculation happened is not just a testing problem. It is usually a separation of concerns problem wearing a testing costume.&lt;/p&gt;
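&lt;p&gt;The extraction might look like this. &lt;code&gt;ValueParser&lt;/code&gt; and its prefix rule are hypothetical stand-ins for whatever &lt;code&gt;processData&lt;/code&gt; actually computes, but the test shape changes from &lt;code&gt;verify&lt;/code&gt; to a plain assertion on a return value:&lt;/p&gt;

```java
// Hypothetical: the parsing step pulled out of processData(), now
// testable by its output instead of by spying on repository.save().
public class ValueParser {

    // Illustrative rule: a "raw-" prefix marks unprocessed input.
    public String parse(String raw) {
        return raw.startsWith("raw-") ? "parsed-" + raw.substring(4) : raw;
    }
}
```

&lt;p&gt;&lt;code&gt;processData&lt;/code&gt; then becomes a thin coordinator: parse, then save. The interesting logic is tested by output; only the coordination still needs to be verified by interaction.&lt;/p&gt;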




&lt;h2&gt;
  
  
  Part 7: The Edge Cases Are the Point
&lt;/h2&gt;

&lt;p&gt;Unit tests are the cheapest place to cover edge cases. Cheaper than component tests, cheaper than integration tests, infinitely cheaper than a production incident.&lt;/p&gt;

&lt;p&gt;Yet most test suites are weighted toward the happy path. The valid input goes in, the correct output comes out. Green. Done. The test suite grows, coverage climbs, and the system still breaks on the first null it encounters.&lt;/p&gt;

&lt;p&gt;Edge cases worth explicitly testing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Null inputs, empty strings, empty collections&lt;/li&gt;
&lt;li&gt;Boundary values: zero, negative numbers, maximum allowed values&lt;/li&gt;
&lt;li&gt;The case where a dependency returns nothing (&lt;code&gt;Optional.empty()&lt;/code&gt;, empty list)&lt;/li&gt;
&lt;li&gt;The case where a dependency throws&lt;/li&gt;
&lt;li&gt;Inputs that are technically valid but semantically unusual
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_ThrowException_When_OrderLineIsNull&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;assertThrows&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;IllegalArgumentException&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;calculateUnitPrice&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_ThrowException_When_PriceIsNull&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;OrderLine&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OrderLine&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"item"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

    &lt;span class="n"&gt;assertThrows&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;IllegalArgumentException&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;calculateUnitPrice&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_ReturnZeroUnitPrice_When_QuantityIsZero&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;OrderLine&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OrderLine&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"item"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;totalPrice&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;50.0&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

    &lt;span class="nc"&gt;Money&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;calculateUnitPrice&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;assertEquals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Money&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ZERO&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Edge cases are not extra credit. They are the specification for how your code behaves when the world does not cooperate. Which is most of the time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 8: Tests That Lie
&lt;/h2&gt;

&lt;p&gt;A passing test should mean something. If it passes when the feature is broken, it is not a test — it is a green light that gives false confidence, which is worse than no test at all.&lt;/p&gt;

&lt;p&gt;Tests lie in predictable ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  The assertion that does not assert
&lt;/h3&gt;

&lt;p&gt;The most common form. The test runs, it calls the method, but the assertion is checking something that is always true regardless of what the method does.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ This always passes — it has nothing to do with process()&lt;/span&gt;
&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_ProcessOrder_When_Valid&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Order&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Order&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"123"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Product"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;process&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// We're asserting the id we assigned, not anything the method did&lt;/span&gt;
    &lt;span class="n"&gt;assertEquals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"123"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getId&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✓ This actually tests something&lt;/span&gt;
&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Should_PersistOrder_When_Valid&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Order&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Order&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"123"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Product"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;process&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;verify&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repository&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The test that survives deletion
&lt;/h3&gt;

&lt;p&gt;Delete the implementation. Run the tests. If the same tests still pass, they were not testing the implementation.&lt;/p&gt;

&lt;p&gt;This sounds extreme, but it is one of the most useful checks you can do on a test suite. Tests that survive the deletion of the thing they claim to test are guaranteed liars.&lt;/p&gt;

&lt;h3&gt;
  
  
  The complex test
&lt;/h3&gt;

&lt;p&gt;Tests that contain &lt;code&gt;if&lt;/code&gt; statements, &lt;code&gt;switch&lt;/code&gt; cases, or loops are tests that can have bugs. A test with a bug is not a test — it is a liability that produces a false sense of coverage.&lt;/p&gt;

&lt;p&gt;If the test logic is becoming complex, the answer is almost always to split it into simpler tests, not to make the existing test smarter.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 9: Write Code That Wants to Be Tested
&lt;/h2&gt;

&lt;p&gt;Some code is hard to test because the test is poorly written. But some code is hard to test because the code itself is poorly designed — and the difficulty of testing is the most honest feedback you will get about that.&lt;/p&gt;

&lt;p&gt;A class that requires a running database to do anything is a class that has not separated its concerns. A method that produces no output and mutates hidden state is a method that has made itself invisible to assertions. A function that does five things is a function that needs five different test setups to cover each path.&lt;/p&gt;

&lt;p&gt;Testability is not a property you add after the fact. It emerges from design decisions made while writing the code.&lt;/p&gt;

&lt;p&gt;A few patterns that consistently produce untestable code — and what to do instead:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hidden dependencies.&lt;/strong&gt; If a class creates its own collaborators internally, there is no way to replace them in tests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ No way to control what the repository does&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrderService&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;OrderRepository&lt;/span&gt; &lt;span class="n"&gt;repository&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OrderRepository&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✓ Inject it — now tests can provide their own&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrderService&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;OrderRepository&lt;/span&gt; &lt;span class="n"&gt;repository&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;OrderService&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OrderRepository&lt;/span&gt; &lt;span class="n"&gt;repository&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;repository&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;repository&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Logic buried in private methods.&lt;/strong&gt; Private methods are not directly testable. If a private method contains meaningful logic, that logic either gets tested indirectly through the public interface — which is fine — or it is complex enough that it should be extracted into its own class and tested directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ Complex logic hidden where tests cannot reach it&lt;/span&gt;
&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="nf"&gt;applyTieredPricing&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// 30 lines of pricing logic&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✓ Extracted — now testable on its own terms&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TieredPricingCalculator&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="nf"&gt;calculate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BigDecimal&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// same logic, now directly testable&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Static method calls and global state.&lt;/strong&gt; Static calls are hard to mock (modern tools can do it, but reaching for them is highly inadvisable), and global state makes tests order-dependent. Both are usually avoidable.&lt;/p&gt;
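&lt;p&gt;A common instance is the hidden clock. As a minimal sketch (the service and method names here are illustrative), compare a service that calls the static &lt;code&gt;LocalDate.now()&lt;/code&gt; with one that receives a &lt;code&gt;Clock&lt;/code&gt;:&lt;/p&gt;

```java
import java.time.Clock;
import java.time.LocalDate;

// ❌ Untestable variant (sketch): calling LocalDate.now() directly
// makes the result depend on the day the test happens to run.

// ✓ The clock is an injected dependency, so a test can fix it in time.
class InvoiceService {
    private final Clock clock;

    InvoiceService(Clock clock) {
        this.clock = clock;
    }

    boolean isOverdue(LocalDate dueDate) {
        return LocalDate.now(clock).isAfter(dueDate);
    }
}
```

&lt;p&gt;In production the service gets &lt;code&gt;Clock.systemUTC()&lt;/code&gt;; in a test, &lt;code&gt;Clock.fixed(...)&lt;/code&gt; makes the behaviour deterministic.&lt;/p&gt;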

&lt;p&gt;&lt;strong&gt;Methods that do too much.&lt;/strong&gt; A method that validates input, fetches data, applies business rules, formats output, and persists the result cannot be tested cleanly. Each responsibility it sheds becomes a unit that can be tested independently.&lt;/p&gt;

&lt;p&gt;The uncomfortable version of this principle: if writing the test feels like a struggle, read the production code before blaming the test. The test is probably trying to tell you something.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 10: The Test Suite As a Document
&lt;/h2&gt;

&lt;p&gt;The best test suites are the best onboarding material. Not because someone planned them that way, but because tests that are named correctly, structured clearly, and focused on one thing each naturally become a readable description of what the system does.&lt;/p&gt;

&lt;p&gt;When a new developer joins the team and wants to understand how order processing works, they have two options. They can read the production code, which tells them how. Or they can read the tests, which tell them what — what inputs are valid, what happens on the edge cases, what the system refuses to do and why.&lt;/p&gt;

&lt;p&gt;That is the difference between a test suite written for coverage and a test suite written with intention.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A useful exercise: read through your test names without reading the test bodies. Can you reconstruct what the system does from the names alone? If not, the names are not doing their job.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Inheritance in tests works against this. When a test class extends a base class to inherit setup, the test is no longer self-contained. Understanding it requires reading two files, understanding how they relate, and tracking what the parent does and does not override. That is more archaeology than onboarding.&lt;/p&gt;

&lt;p&gt;As discussed in Part 4, the same applies to test utility classes that grow large enough to have their own logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 11: What Not to Test
&lt;/h2&gt;

&lt;p&gt;As important as knowing what to test is knowing what not to.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do not test the web layer
&lt;/h3&gt;

&lt;p&gt;A service does not return a 404. It throws a &lt;code&gt;NotFoundException&lt;/code&gt;. What happens to that exception — how it gets mapped to an HTTP status code, how the response body is shaped, what headers are attached — is the web layer's responsibility, not the service's.&lt;/p&gt;
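&lt;p&gt;At the unit level that looks something like this sketch (the service, exception, and test names are made up for illustration; the point is what the test does &lt;em&gt;not&lt;/em&gt; assert, namely a 404):&lt;/p&gt;

```java
// Illustrative names throughout.
class OrderNotFoundException extends RuntimeException {
    OrderNotFoundException(String id) {
        super("order not found: " + id);
    }
}

class OrderService {
    Object findOrder(String id) {
        // a real service would consult a repository before giving up
        throw new OrderNotFoundException(id);
    }
}

class OrderServiceTest {
    static void Should_ThrowNotFound_When_OrderDoesNotExist() {
        OrderService service = new OrderService();
        try {
            service.findOrder("42");
        } catch (OrderNotFoundException expected) {
            return; // the service did its job; the 404 is the web layer's problem
        }
        throw new AssertionError("expected OrderNotFoundException");
    }
}
```

&lt;p&gt;Whether that exception becomes a 404, a problem-detail body, or something else entirely is a separate test at a separate level.&lt;/p&gt;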

&lt;p&gt;Unit tests live below that boundary. At the unit level, test the logic: services, domain classes, utility methods, calculations, decisions. Anything that can be verified in isolation, without starting a server or talking to a database, belongs here. The moment a test depends on another party — an HTTP layer, a real database, a message broker, an external service — it has crossed into component or integration territory. Keep the two separate and both become easier to reason about.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do not bootstrap the application context
&lt;/h3&gt;

&lt;p&gt;If your unit test has &lt;code&gt;@SpringBootTest&lt;/code&gt; on it, it is not a unit test anymore. It is a component test that happens to live in the wrong folder.&lt;/p&gt;

&lt;p&gt;Bootstrapping a full application context to test a single service method is overkill by definition. It is slow, it introduces dependencies that have nothing to do with what you are testing, and it blurs the line between levels of testing in ways that tend to get worse over time.&lt;/p&gt;

&lt;p&gt;Unit tests should start fast and run in isolation. Mocks and stubs exist precisely so you do not need a running application to verify that a service behaves correctly. If you find yourself reaching for &lt;code&gt;@SpringBootTest&lt;/code&gt; at the unit level, the question worth asking is not "how do I make this work" but "why does this feel necessary" — because the answer usually points to a design problem.&lt;/p&gt;
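&lt;p&gt;A minimal sketch of that idea, with a hand-rolled stub standing in for a mocking library (all names illustrative):&lt;/p&gt;

```java
// No container, no context: the collaborator is replaced by a stub.
interface OrderRepository {
    Order findById(String id);
}

class Order {
    final int quantity;

    Order(int quantity) {
        this.quantity = quantity;
    }
}

class DiscountService {
    private final OrderRepository repository;

    DiscountService(OrderRepository repository) {
        this.repository = repository;
    }

    boolean qualifiesForBulkDiscount(String orderId) {
        return repository.findById(orderId).quantity >= 10;
    }
}

class DiscountServiceTest {
    static void Should_ApplyBulkDiscount_When_QuantityIsAtLeastTen() {
        OrderRepository stub = id -> new Order(25); // canned data, no Spring
        DiscountService service = new DiscountService(stub);
        if (!service.qualifiesForBulkDiscount("any-id")) {
            throw new AssertionError("expected bulk discount to apply");
        }
    }
}
```

&lt;p&gt;The test starts in milliseconds, and nothing outside &lt;code&gt;DiscountService&lt;/code&gt; can make it fail.&lt;/p&gt;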

&lt;blockquote&gt;
&lt;p&gt;This applies beyond Spring. Any equivalent mechanism that boots a full application context — dependency injection containers, embedded servers, framework runners — has no place in a unit test.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Do not test the framework
&lt;/h3&gt;

&lt;p&gt;If you are using a framework such as Spring Boot, do not test that &lt;code&gt;@Autowired&lt;/code&gt; works or that the application can read the &lt;code&gt;application.yml&lt;/code&gt;. Spring tests that. If you are using Jackson, do not test that it serializes an object. Jackson tests that. The job of your tests is to verify that your code, given correct inputs from the framework, produces the right outputs.&lt;/p&gt;

&lt;p&gt;Testing framework behaviour wastes time, adds maintenance burden, and produces tests that break when you upgrade a dependency — not because your code changed, but because the framework's internals did.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do not write production code for tests
&lt;/h3&gt;

&lt;p&gt;If the only reason a getter exists is to make a test easier to write, that getter should not exist. The test should find another way.&lt;/p&gt;

&lt;p&gt;Production code written for tests is production code that does not serve production. It inflates the API, exposes internals that should stay internal, and misleads anyone who reads the class and wonders what that method is for.&lt;/p&gt;

&lt;p&gt;The constraint is useful. If a class is difficult to test without special access, that is usually feedback about the design. A class that is hard to test without poking at its internals is often a class that is doing too much, or a class whose dependencies are not properly injected.&lt;/p&gt;

&lt;p&gt;Test the observable behaviour. If that is not enough to verify the class is working, the class probably needs to be redesigned.&lt;/p&gt;
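&lt;p&gt;As a sketch of what that means in practice, consider a hypothetical rate limiter whose counter stays private:&lt;/p&gt;

```java
// The internal counter is not exposed; tests observe behaviour instead.
class RateLimiter {
    private final int limit;
    private int used; // no getter: no test needs one

    RateLimiter(int limit) {
        this.limit = limit;
    }

    boolean tryAcquire() {
        if (used == limit) {
            return false;
        }
        used++;
        return true;
    }
}
```

&lt;p&gt;A test calls &lt;code&gt;tryAcquire()&lt;/code&gt; three times against a limit of two and asserts that the third call returns &lt;code&gt;false&lt;/code&gt;. A getter for &lt;code&gt;used&lt;/code&gt; is never needed.&lt;/p&gt;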




&lt;h2&gt;
  
  
  The Uncomfortable Part
&lt;/h2&gt;

&lt;p&gt;Most of these guidelines are not hard to understand. They are hard to apply consistently when you are under pressure, when the ticket is already late, when the test suite has a hundred tests written the wrong way and adding one more wrong one is faster than fixing the pattern.&lt;/p&gt;

&lt;p&gt;The longer version of this guide would be about that. About how a test suite degrades gradually, one convenience at a time, until the green board means almost nothing and everyone has quietly stopped trusting it.&lt;/p&gt;

&lt;p&gt;The short version is this: a test that you do not trust is not a safety net. It is a ritual.&lt;/p&gt;

&lt;p&gt;Writing tests that you actually believe is harder than writing tests that pass. It requires deciding what you are testing before you write the test. It requires resisting the urge to share setup that is not really shared. It requires accepting that ten focused tests are better than one test that checks everything.&lt;/p&gt;

&lt;p&gt;None of that is complicated.&lt;/p&gt;

&lt;p&gt;It just requires not taking shortcuts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Guideline&lt;/th&gt;
&lt;th&gt;The point&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Name clearly&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Should_ExpectedBehaviour_When_StateUnderTest&lt;/code&gt; — readable without reading the code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One reason to fail&lt;/td&gt;
&lt;td&gt;Separate tests for separate failure causes, even when the outcome is the same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Given / When / Then&lt;/td&gt;
&lt;td&gt;Three sections, always. If you cannot identify them, restructure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consistent naming&lt;/td&gt;
&lt;td&gt;Pick a name for the class under test and the output, and use them everywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-contained setup&lt;/td&gt;
&lt;td&gt;Avoid &lt;code&gt;@BeforeEach&lt;/code&gt; for mock behaviour. Put setup where it belongs: in the test.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mock behaviour, not data&lt;/td&gt;
&lt;td&gt;Mock repositories, services, decisions — not DTOs or plain objects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inline simple values&lt;/td&gt;
&lt;td&gt;Do not extract constants that are already readable as literals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No logic in tests&lt;/td&gt;
&lt;td&gt;No &lt;code&gt;if&lt;/code&gt;, no loops, no &lt;code&gt;switch&lt;/code&gt;. If the test needs to think, split it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cover edge cases&lt;/td&gt;
&lt;td&gt;Null, empty, zero, boundary, missing. These are the interesting cases.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Do not test the web layer&lt;/td&gt;
&lt;td&gt;Services throw exceptions. HTTP status codes are someone else's job.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Do not bootstrap the context&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@SpringBootTest&lt;/code&gt; in a unit test is a component test in the wrong folder.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Do not test the framework&lt;/td&gt;
&lt;td&gt;Test your logic. Assume Spring, Jackson, and JPA work.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No test-only production code&lt;/td&gt;
&lt;td&gt;If a getter exists only for tests, the test is wrong, not the class.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>java</category>
      <category>testing</category>
      <category>beginners</category>
      <category>productivity</category>
    </item>
    <item>
      <title>You Can’t Learn Spring Boot in a Weekend (And That’s Not the Problem)</title>
      <dc:creator>Sara A.</dc:creator>
      <pubDate>Fri, 06 Mar 2026 22:56:46 +0000</pubDate>
      <link>https://forem.com/srsandrade/you-cant-learn-spring-boot-in-a-weekend-and-thats-not-the-problem-4726</link>
      <guid>https://forem.com/srsandrade/you-cant-learn-spring-boot-in-a-weekend-and-thats-not-the-problem-4726</guid>
      <description>&lt;h2&gt;
  
  
  The “Give Me Two Days” Learning Strategy
&lt;/h2&gt;

&lt;p&gt;Today an engineer told me about their plans for the weekend. They were going to take a crash course in Java and Spring Boot.&lt;/p&gt;

&lt;p&gt;The kind that promises you will “cover the essentials” of an entire framework ecosystem in a few hours, usually with a very confident YouTube instructor and a progress bar that moves at an inspiring pace.&lt;/p&gt;

&lt;p&gt;I told them that might not be the best plan. Not because learning is bad. Learning is always good.&lt;br&gt;&lt;br&gt;
But because the plan itself revealed a small misunderstanding about where the actual difficulty lies.&lt;/p&gt;

&lt;p&gt;To be fair, the frustration that triggered this idea was real.&lt;br&gt;&lt;br&gt;
They had been working on a piece of an application and later realised there were some issues they could have spotted earlier. You know the situation.&lt;/p&gt;

&lt;p&gt;You look at a block of code and something feels… off. But you can't immediately explain why. &lt;/p&gt;

&lt;p&gt;Later, after someone points it out - or the behaviour becomes obvious - you look back and think:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How did I not notice this earlier?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Their conclusion was simple: they needed to learn the framework better. Which is a very common reaction in software development.&lt;br&gt;
Something goes wrong → we assume the tool is the problem → we try to learn the tool faster.&lt;/p&gt;

&lt;p&gt;And to be fair, this wasn’t the first time I had heard a version of this plan. Some time ago another engineer, still early in their journey through larger systems, came with a carefully designed &lt;strong&gt;two-week training route&lt;/strong&gt;. &lt;br&gt;
It had a schedule, a list of libraries, and a surprisingly ambitious timeline. By the end of those two weeks, according to the document, they would have covered half the modern Java ecosystem.&lt;/p&gt;

&lt;p&gt;The problem was that most of that route focused on &lt;em&gt;tools and libraries&lt;/em&gt;, not on the foundations underneath them.&lt;br&gt;
Which is understandable. Libraries feel concrete. Frameworks feel productive. They promise visible progress.&lt;/p&gt;

&lt;p&gt;But the thing is: the code that didn’t make sense would not suddenly make sense if it had been written in Spring Boot, Quarkus, Micronaut, plain Java, or C# with .NET.&lt;/p&gt;

&lt;p&gt;Confusing code has a remarkable ability to remain confusing across frameworks.&lt;br&gt;
And the skill required to recognise it earlier usually has very little to do with crash courses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frameworks Are Just Fancy Packaging
&lt;/h2&gt;

&lt;p&gt;As mentioned, the common assumption in these situations is that the missing piece must be the framework.&lt;br&gt;
If the system uses Spring Boot, then understanding Spring Boot better should make the system easier to understand.&lt;/p&gt;

&lt;p&gt;But frameworks like Spring Boot are not the system. They are an &lt;strong&gt;abstraction layer over problems engineers have been solving for decades&lt;/strong&gt;:&lt;br&gt;
Dependency injection. Configuration management. Application lifecycle. HTTP routing. Persistence. Messaging.&lt;/p&gt;

&lt;p&gt;Which is great. But it also creates a small illusion: it makes it look like the complexity lives in the framework.&lt;/p&gt;

&lt;p&gt;In reality, the complexity usually lives &lt;strong&gt;outside it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There is another trap hidden here as well: sometimes concepts appear to change meaning depending on the ecosystem you are working in. Take something like a “queue.”&lt;/p&gt;

&lt;p&gt;In embedded systems, a queue might be a small in-memory structure used to pass messages between tasks or threads.&lt;br&gt;
In web applications, someone says “queue” and suddenly they mean RabbitMQ, Kafka, or some distributed messaging system moving events between services.&lt;br&gt;
Different scale. Different guarantees. Same idea: a buffer that decouples producers from consumers.&lt;/p&gt;

&lt;p&gt;Frameworks and platforms wrap these ideas with different tools, terminology, and levels of complexity. It can feel like you are learning a completely new concept, when in reality you are seeing the same one operating at a different scale.&lt;/p&gt;

&lt;p&gt;When you are reading code that doesn’t make sense, the issue is rarely that you forgot how a Spring annotation works. It is much more likely that you are dealing with unclear responsibilities, tangled logic, hidden assumptions, or design decisions that were never properly explained.&lt;/p&gt;

&lt;p&gt;A crash course can show you that &lt;code&gt;@Service&lt;/code&gt;, &lt;code&gt;@Repository&lt;/code&gt;, and &lt;code&gt;@Controller&lt;/code&gt; exist. It can show you roughly where they go.&lt;br&gt;
What it rarely explains is what those annotations actually trigger, why the separation exists, or when the design itself is the real problem.&lt;/p&gt;

&lt;p&gt;It also cannot teach you why a piece of business logic ended up in the wrong layer, why two modules depend on each other in surprising ways, or why a seemingly simple change suddenly affects five different parts of the system.&lt;/p&gt;

&lt;p&gt;Frameworks do not eliminate design problems — they simply give you nicer tools to build them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tutorials Teach the Hammer. Not When to Use It.
&lt;/h2&gt;

&lt;p&gt;Tutorials are exactly that: instructions on how to use a tool. They show you the toolbox.&lt;/p&gt;

&lt;p&gt;“Look,” they say, “here is the hammer. Here is the screwdriver. Here is the wrench. Here is the power drill.”&lt;br&gt;
Then they demonstrate.&lt;/p&gt;

&lt;p&gt;You use the hammer to hammer. Usually that happens when you have a nail. They might even show you a better way to hold the hammer, or how to hit the nail without smashing your thumb. All useful things.&lt;/p&gt;

&lt;p&gt;What they rarely discuss is whether you should be using the hammer at all.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the wall made of wood?
&lt;/li&gt;
&lt;li&gt;Is it concrete?
&lt;/li&gt;
&lt;li&gt;Is it safe to put nails there?
&lt;/li&gt;
&lt;li&gt;Is there already a pipe behind that wall?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yes, hammers can also be used to destroy things. But let’s not knock down a load-bearing wall.&lt;/p&gt;

&lt;p&gt;Tutorials are very good at showing how tools work. They are much less interested in explaining &lt;strong&gt;when those tools should exist in the first place&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you need to solve a very small problem, a tutorial is perfect. You follow the steps, you get a result, everyone is happy. But if you want to build a house, knowing how to swing a hammer is not enough.&lt;/p&gt;

&lt;p&gt;You need to understand foundations.&lt;/p&gt;




&lt;p&gt;Ideally, university or any other formal training would already have introduced many of these concepts.&lt;br&gt;
Dependency injection. Design patterns. Architecture principles. Separation of concerns.&lt;/p&gt;

&lt;p&gt;But let’s be honest: without some experience, these words often exist as concepts in a void. You read about them. You memorise definitions. You might even pass an exam about them.&lt;/p&gt;

&lt;p&gt;And then you start working on real systems and realise you have no idea what any of it actually looks like in practice. &lt;br&gt;
Because understanding usually comes in a different order.&lt;/p&gt;

&lt;p&gt;First you experiment. You write code. You break things. You copy examples. You follow tutorials. You try frameworks. You build small things that work — and sometimes things that really shouldn’t.&lt;/p&gt;

&lt;p&gt;At that stage learning is messy and unstructured. And that’s fine. In fact, it is often necessary.&lt;/p&gt;

&lt;p&gt;Then, after enough of it, patterns start to repeat. You see the same problems. The same shapes in the code. The same attempts to organise behaviour. And then the foundations become important again.&lt;/p&gt;

&lt;h2&gt;
  
  
  It’s the Same Idea Wearing a Spring Jacket
&lt;/h2&gt;

&lt;p&gt;If you understand what &lt;strong&gt;Dependency Injection&lt;/strong&gt; and &lt;strong&gt;Inversion of Control&lt;/strong&gt; are, then Spring Boot's &lt;code&gt;@Autowired&lt;/code&gt; is not mysterious at all.&lt;/p&gt;

&lt;p&gt;The same idea appears everywhere.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;@Inject&lt;/code&gt; in Java EE or Quarkus.
&lt;/li&gt;
&lt;li&gt;The DI container in C# with .NET.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different frameworks, same principle.&lt;br&gt;
The framework did not invent the idea. It simply packaged it.&lt;br&gt;
The same happens with other things you will encounter in the Spring ecosystem.&lt;/p&gt;
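&lt;p&gt;Stripped of any framework, the principle fits in a few lines. A sketch, with invented names:&lt;/p&gt;

```java
// Dependency injection without a container: the dependency arrives
// through the constructor instead of being constructed inside.
interface GreetingRepository {
    String findGreeting();
}

class GreetingService {
    private final GreetingRepository repository;

    GreetingService(GreetingRepository repository) { // the "injection point"
        this.repository = repository;
    }

    String greet(String name) {
        return repository.findGreeting() + ", " + name;
    }
}
```

&lt;p&gt;A container automates exactly this hand-off. &lt;code&gt;@Autowired&lt;/code&gt; decides &lt;em&gt;which&lt;/em&gt; &lt;code&gt;GreetingRepository&lt;/code&gt; gets passed in; the shape of the code does not change.&lt;/p&gt;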

&lt;p&gt;Once you understand &lt;strong&gt;Decorator&lt;/strong&gt; and &lt;strong&gt;Proxy&lt;/strong&gt; patterns, annotations like &lt;code&gt;@Transactional&lt;/code&gt; or &lt;code&gt;@Async&lt;/code&gt; become much easier to reason about. They behave like decorators: they add behaviour around a method call. Spring simply implements that layering with proxies and interception instead of manual wrapper objects.&lt;/p&gt;
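&lt;p&gt;A hand-written version of that layering might look like the sketch below. The transaction is simulated with log entries, and all names are illustrative:&lt;/p&gt;

```java
// Decorator: same interface, extra behaviour wrapped around the call.
interface PaymentService {
    void pay(String orderId);
}

class CorePaymentService implements PaymentService {
    public void pay(String orderId) {
        // the actual payment logic would live here
    }
}

class TransactionalPaymentService implements PaymentService {
    private final PaymentService delegate;
    final StringBuilder log = new StringBuilder(); // visible only for the sketch

    TransactionalPaymentService(PaymentService delegate) {
        this.delegate = delegate;
    }

    public void pay(String orderId) {
        log.append("begin;");  // a real decorator would open a transaction
        delegate.pay(orderId); // the wrapped call
        log.append("commit;"); // ...and commit (or roll back) afterwards
    }
}
```

&lt;p&gt;Spring builds this wrapper for you as a proxy when it sees &lt;code&gt;@Transactional&lt;/code&gt;; the layering is the same.&lt;/p&gt;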

&lt;p&gt;"&lt;code&gt;HandlerInterceptor&lt;/code&gt; and filters?": That is just &lt;strong&gt;Chain of Responsibility&lt;/strong&gt; wearing a Spring jacket.&lt;/p&gt;

&lt;p&gt;So yes — if you are going to work with Spring Boot, you should absolutely learn Spring Boot. But it becomes &lt;strong&gt;much easier to learn&lt;/strong&gt; when the foundations are already in place. And those foundations also make it much easier to move between frameworks later.&lt;/p&gt;

&lt;p&gt;Because once you know what problem you are trying to solve, the question becomes simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"How do I do this here?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You need native queries? You understand the trade-offs, so you search: &lt;em&gt;how do I do this in Spring Boot?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You want retries because a service call might fail? You understand the use case, so you look at the multiple ways Spring provides retries — and there are many. &lt;br&gt;
Which is another reason why foundations matter: ecosystems tend to offer &lt;strong&gt;a lot of ways&lt;/strong&gt; to do the same thing.&lt;/p&gt;
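&lt;p&gt;The concept itself is small. A bare sketch of a retry loop, with illustrative names; real libraries add backoff, jitter, and policies on top:&lt;/p&gt;

```java
// The idea every retry library packages: attempt, catch, try again.
class Retry {
    interface RemoteCall {
        String execute() throws Exception;
    }

    static String withRetry(RemoteCall call, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt != maxAttempts; attempt++) {
            try {
                return call.execute();
            } catch (Exception e) {
                last = e; // a real implementation would back off before retrying
            }
        }
        throw last; // all attempts failed
    }
}
```

&lt;p&gt;Once you can write this by hand, evaluating Spring Retry or Resilience4j becomes a matter of comparing policies, not of learning magic.&lt;/p&gt;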

&lt;p&gt;And then there is unit testing. You will ask how to solve that with Spring Boot and someone will inevitably say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Let's use &lt;code&gt;@SpringBootTest&lt;/code&gt;."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ah! Trick question. You &lt;strong&gt;usually&lt;/strong&gt; don't use Spring Boot for &lt;strong&gt;unit tests&lt;/strong&gt;.  &lt;code&gt;@SpringBootTest&lt;/code&gt; is for component / integration tests. &lt;/p&gt;

&lt;p&gt;But there are also Test Slices, which are indeed mainly a Spring Boot concept. You don’t need to worry about them until you understand what kind of tests you actually want to write.&lt;/p&gt;

&lt;p&gt;See? Foundations.&lt;/p&gt;

&lt;p&gt;Worst case scenario: you discover the framework does not actually solve that particular problem for you. Then you reach for a library — or you implement it yourself. But because you understand the concepts, you can still do it.&lt;/p&gt;




&lt;p&gt;If you want a place to start building those foundations, here are a few suggestions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Refactoring Guru&lt;/strong&gt; is excellent for learning design patterns.&lt;/li&gt;
&lt;li&gt;Venkat Subramaniam is a fantastic speaker for explaining concepts clearly.
&lt;/li&gt;
&lt;li&gt;The Java YouTube channel is great for keeping up with the evolution of the platform.
&lt;/li&gt;
&lt;li&gt;Some of the classics are still worth reading — Martin Fowler, Robert C. Martin, Sam Newman.
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Fundamentals of Software Architecture&lt;/em&gt; is also a great book.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the Spring ecosystem itself has some excellent educators as well. Josh Long and the Spring team often explain not just &lt;em&gt;how to use the tool&lt;/em&gt;, but the ideas behind it. Which is the important part.&lt;/p&gt;

&lt;p&gt;And yes, tools like Claude, ChatGPT, Gemini and friends can also be very helpful for learning these concepts — &lt;strong&gt;as long as you know how to ask the question&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But that is also the catch. Knowing how to phrase the question usually means you already understand the foundations you are asking about. Which is why I still tend to recommend the other sources first.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>software</category>
      <category>career</category>
      <category>java</category>
    </item>
    <item>
      <title>We Built a Wall Between Dev and Ops</title>
      <dc:creator>Sara A.</dc:creator>
      <pubDate>Fri, 27 Feb 2026 22:24:54 +0000</pubDate>
      <link>https://forem.com/srsandrade/we-built-a-wall-between-dev-and-ops-2741</link>
      <guid>https://forem.com/srsandrade/we-built-a-wall-between-dev-and-ops-2741</guid>
      <description>&lt;h2&gt;
  
  
  Two Villages and a Very Professional Wall
&lt;/h2&gt;

&lt;p&gt;There were once two villages in a valley.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On one side lived the Developers.&lt;/strong&gt;&lt;br&gt;
They wrote code. They shipped features. They believed progress was measured in commits per hour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the other side lived the Ops tribe.&lt;/strong&gt;&lt;br&gt;
They guarded uptime. They feared outages. They believed progress was measured in how little changed.&lt;/p&gt;

&lt;p&gt;They were not enemies. But they were not friends either.&lt;/p&gt;

&lt;p&gt;The Developers would build something magnificent and launch it over the wall with confidence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“It works on my machine!”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Ops tribe would catch it, stare at it quietly, and reply:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“That is… concerning.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Between the villages stood a large wall. A very capable wall.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The wall had departments.&lt;/li&gt;
&lt;li&gt;The wall had KPIs.&lt;/li&gt;
&lt;li&gt;The wall had separate leadership, separate objectives, and separate bonus structures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Developers were rewarded for speed. Ops were rewarded for stability. These goals rarely shook hands.&lt;/p&gt;

&lt;p&gt;At the top of the wall, the chiefs sometimes met:&lt;br&gt;
Chief Trumpur of Delivery and Chief Ronald of Infrastructure.&lt;/p&gt;

&lt;p&gt;They would nod seriously and discuss improvements.&lt;/p&gt;

&lt;p&gt;Occasionally someone proposed making the wall bigger. &lt;br&gt;
“Just to clarify responsibilities.”&lt;br&gt;
And so more tickets were added; more processes; more approvals.&lt;/p&gt;

&lt;p&gt;Sometimes one tribe suggested the other should pay for it. This arrangement was considered professional.&lt;/p&gt;

&lt;p&gt;Inside each village, unity was also… aspirational.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Among Developers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Backend Guild spoke in APIs and schemas.&lt;/li&gt;
&lt;li&gt;The Frontend Circle debated pixels and state.
They met frequently for “alignment” and usually left slightly less aligned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Across the valley, Ops had its own clans:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudOps spoke in regions and availability zones.&lt;/li&gt;
&lt;li&gt;SysOps guarded servers like ancient relics.&lt;/li&gt;
&lt;li&gt;Database Administrators believed — not entirely unfairly — that everyone else was reckless.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each group possessed sacred knowledge. Each quietly suspected the others did not understand how fragile everything truly was.&lt;/p&gt;

&lt;p&gt;Then one day, a prophet arrived. He spoke strange words:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agile.&lt;/li&gt;
&lt;li&gt;Collaboration.&lt;/li&gt;
&lt;li&gt;Shared ownership.&lt;/li&gt;
&lt;li&gt;Continuous delivery.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And most shocking of all:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Why not remove the wall?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The villages gasped.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Workshops were held.&lt;/li&gt;
&lt;li&gt;Slides were presented.&lt;/li&gt;
&lt;li&gt;Arrows pointed in both directions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Eventually, the wall fell. From its rubble emerged a new tribe:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DevOps.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It was a beautiful name. A new era had begun. But demolition is easier than integration.&lt;br&gt;
The wall disappeared. The distance did not.&lt;/p&gt;

&lt;p&gt;People stayed where they had always stood — only now without bricks to blame.&lt;/p&gt;

&lt;p&gt;Questions replaced stone:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who owns the pipeline?&lt;/li&gt;
&lt;li&gt;Who approves production access?&lt;/li&gt;
&lt;li&gt;Who responds at 3am?&lt;/li&gt;
&lt;li&gt;Who broke this?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Old Man Ronald would sometimes sigh:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“At least when there was a wall, we knew whose fault it was.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And slowly, quietly, new fences appeared. They had modern names:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Platform Team&lt;/li&gt;
&lt;li&gt;SRE&lt;/li&gt;
&lt;li&gt;Infrastructure Service Team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The wall was gone. The instinct to rebuild it was not.&lt;/p&gt;

&lt;h2&gt;
  
  
  DevOps as a Concept vs DevOps as a Renamed Silo
&lt;/h2&gt;

&lt;p&gt;Let’s leave the valley for a moment. A while ago, we were hiring.&lt;/p&gt;

&lt;p&gt;We needed people with experience in CI/CD, monitoring, Infrastructure as Code, and cloud environments. Not wizards. Just engineers who understood that software does not end at &lt;code&gt;git push&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What we found was… instructive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Candidate 1&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Had “CI/CD experience” on their CV.&lt;/li&gt;
&lt;li&gt;It turned out they had mostly worked with Ansible playbooks.&lt;/li&gt;
&lt;li&gt;They had never written application code.&lt;/li&gt;
&lt;li&gt;They did not know what a build lifecycle looked like.&lt;/li&gt;
&lt;li&gt;Cloud? No.&lt;/li&gt;
&lt;li&gt;Monitoring? Not really.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conclusion: In their previous company, they ran playbooks. That was the job. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Candidate 2&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Listed themselves as “CloudOps – Mid Level.”&lt;/li&gt;
&lt;li&gt;When we asked what that meant, the answer was: “I’ve set up EC2 instances.” Which is fine.&lt;/li&gt;
&lt;li&gt;But that was the extent of it.&lt;/li&gt;
&lt;li&gt;No discussion of networking.&lt;/li&gt;
&lt;li&gt;No IAM.&lt;/li&gt;
&lt;li&gt;No scaling considerations.&lt;/li&gt;
&lt;li&gt;No CI/CD integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conclusion: Cloud meant “launching machines.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Candidate 3&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Had “DevOps Engineer” in bold.&lt;/li&gt;
&lt;li&gt;Impressive stack. Impressive terminology.&lt;/li&gt;
&lt;li&gt;After a few questions, it turned out their role was mostly maintaining a Jenkins instance someone else had designed years ago.&lt;/li&gt;
&lt;li&gt;They could restart it. They could update plugins.&lt;/li&gt;
&lt;li&gt;They had never written a pipeline from scratch.&lt;/li&gt;
&lt;li&gt;They had never debugged a failing deployment across environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conclusion: The pipeline worked. What it actually deployed was someone else’s concern.&lt;/p&gt;




&lt;p&gt;One thing common across all interviews:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Limited participation in development, little exposure to the product lifecycle, and sometimes only a surface-level understanding of the system itself.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;And this is not about mocking individuals. It is about what happens when we turn a philosophy into a job title.&lt;/p&gt;

&lt;p&gt;The mirror image exists on the development side. For many developers, infrastructure means writing a &lt;code&gt;docker-compose.yml&lt;/code&gt; file and hoping for the best, with zero understanding of networking, ports, certificates, or what actually happens once the containers leave their laptop.&lt;/p&gt;

&lt;p&gt;Abstraction is useful. Blindness is not.&lt;br&gt;
The original idea of DevOps was not to create a new role. It was to stop pretending development and operations were separate problems.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Code decisions affect operations.&lt;br&gt;
Operational decisions affect code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Logging choices influence monitoring.&lt;/li&gt;
&lt;li&gt;Framework capabilities influence observability.&lt;/li&gt;
&lt;li&gt;Infrastructure shapes authentication, scaling, and resilience.&lt;/li&gt;
&lt;li&gt;Application design determines how deployments succeed — or fail.&lt;/li&gt;
&lt;/ul&gt;
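&lt;p&gt;To make the first point concrete, here is a minimal sketch (hypothetical service and field names) of why a logging choice is also a monitoring choice: a structured log line can be filtered and alerted on directly, while a free-text one has to be regex-parsed.&lt;/p&gt;

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("payments")

# Free-text: a human can read it, but alert rules must regex-parse it.
log.info("payment 4711 failed after 3 retries")

# Structured: monitoring can filter, aggregate, and alert on fields directly.
entry = {"event": "payment_failed", "payment_id": 4711, "retries": 3}
log.info(json.dumps(entry))
```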

&lt;p&gt;A developer does not need to become an infrastructure specialist.&lt;br&gt;
An operations engineer does not need to become a feature developer.&lt;/p&gt;

&lt;p&gt;But neither can be blind to the other side.&lt;/p&gt;

&lt;p&gt;Because there is no “other side.” The system is continuous.&lt;br&gt;
And yes — foundations matter.&lt;/p&gt;

&lt;p&gt;A developer should understand what a pipeline is doing when it runs.&lt;br&gt;
Not necessarily how to rebuild Jenkins from scratch, but what stages exist, why tests run where they run, what a deployment actually means, and what can go wrong between commit and production.&lt;br&gt;
And yes — sometimes that also means being able to tweak a Jenkinsfile, adjust a pipeline configuration, or make a small change to a CloudFormation template. Not as an infrastructure expert, but with enough understanding to know &lt;em&gt;what the file is for&lt;/em&gt;, what it controls, and why the change matters.&lt;/p&gt;

&lt;p&gt;Otherwise, code is written in a vacuum and surprises appear later — usually at 5pm on a Friday.&lt;/p&gt;

&lt;p&gt;The same applies in reverse. Operations should understand what is being deployed well enough to recognise when the platform is solving a problem that the application already solved — or could solve more simply.&lt;br&gt;
They should also understand what the application actually does. Not every line of code, but which modules are critical, which services carry business risk, and how configuration changes might affect behaviour. &lt;/p&gt;

&lt;p&gt;Infrastructure decisions make more sense when you understand the system running on top of it. Some modules deserve deeper monitoring because they are business-critical. Some alerts matter more than others. Some failures are noise; others stop revenue.&lt;/p&gt;

&lt;p&gt;You don’t need to be the person who wrote the feature.&lt;br&gt;&lt;br&gt;
But you do need enough context to understand what you are keeping alive.&lt;/p&gt;

&lt;p&gt;This is not about everyone knowing everything. It is about everyone knowing enough.&lt;/p&gt;

&lt;p&gt;Because without shared foundations, collaboration becomes ticket exchange.&lt;br&gt;
And ticket exchange was exactly the wall DevOps was supposed to remove.&lt;/p&gt;

&lt;h2&gt;
  
  
  DevOps (Now With New Paint)
&lt;/h2&gt;

&lt;p&gt;To be clear: I am not saying real DevOps does not exist.&lt;/p&gt;

&lt;p&gt;There are certainly companies where shared ownership actually happens. Where developers understand how their software behaves in production, and operations engineers understand the applications they support. Where pipelines, monitoring, deployments, and code evolve together instead of being negotiated through tickets.&lt;/p&gt;

&lt;p&gt;Those places exist. They are just… less common than the conference talks might suggest.&lt;/p&gt;

&lt;p&gt;And there &lt;em&gt;are&lt;/em&gt; engineers who naturally work this way — people who move comfortably across domains, ask questions outside their specialty, and try to understand systems as a whole instead of protecting a narrow slice of responsibility.&lt;/p&gt;

&lt;p&gt;This is not a claim that DevOps failed everywhere. It’s a claim that many organisations kept the exact same structure… and simply changed the name.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The wall stayed.&lt;/li&gt;
&lt;li&gt;The tickets stayed.&lt;/li&gt;
&lt;li&gt;The handoffs stayed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only the org chart changed. What used to be &lt;em&gt;Development&lt;/em&gt; and &lt;em&gt;Operations&lt;/em&gt; became something else, but… new name, same queue.&lt;/p&gt;

&lt;p&gt;And honestly, I don’t particularly care what we call it. Call it DevOps. Call it Platform Engineering. Call it Deployment Happiness Engineering if that helps morale.&lt;/p&gt;

&lt;p&gt;But if work still crosses organisational borders as requests instead of shared responsibility, then the wall is still there — just painted a different colour.&lt;/p&gt;

&lt;p&gt;And if the plan now is “the AI will handle it,” just remember: AI outputs are just your inputs… with confidence added.&lt;/p&gt;

&lt;p&gt;In conclusion: now everyone agrees the wall is gone… while quietly adding new bricks: more ownership boundaries, more approval layers, more specialised silos.&lt;/p&gt;

&lt;p&gt;We demolished the wall in theory. Then rebuilt it in Jira.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>discuss</category>
      <category>management</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The Day the AI Took My Requirements Literally</title>
      <dc:creator>Sara A.</dc:creator>
      <pubDate>Fri, 20 Feb 2026 00:22:33 +0000</pubDate>
      <link>https://forem.com/srsandrade/the-day-the-ai-took-my-requirements-literally-24a1</link>
      <guid>https://forem.com/srsandrade/the-day-the-ai-took-my-requirements-literally-24a1</guid>
      <description>&lt;h2&gt;
  
  
  The AI Built Exactly What Was Asked For. That Was the Problem.
&lt;/h2&gt;

&lt;p&gt;A new project was starting.&lt;/p&gt;

&lt;p&gt;The proposal described a distributed, multi-domain system responsible for processing financial transactions, aggregating operational metrics, and exposing analytical insights to multiple internal stakeholders.&lt;/p&gt;

&lt;p&gt;There were compliance considerations. There were integration points with legacy services. There were performance expectations. There was the word “real-time” in bold.&lt;/p&gt;

&lt;p&gt;The requirements included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An event-driven real-time streaming pipeline.&lt;/li&gt;
&lt;li&gt;CQRS with event sourcing for future scalability.&lt;/li&gt;
&lt;li&gt;A service mesh for observability and traffic control.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It sounded serious. It sounded modern. It sounded like something you would proudly present in a boardroom.&lt;br&gt;
The company had recently rolled out its new AI-enabled development infrastructure. Internal agents. Architecture generators. Prompt templates. “Accelerated delivery pipelines.”&lt;/p&gt;

&lt;p&gt;The team fed the requirements into it.&lt;br&gt;
The AI did not hesitate.&lt;/p&gt;

&lt;p&gt;An architecture emerged.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message brokers coordinating streams of domain events.
&lt;/li&gt;
&lt;li&gt;Dedicated command and query models backed by separate data stores.
&lt;/li&gt;
&lt;li&gt;Event sourcing to maintain an immutable audit trail.
&lt;/li&gt;
&lt;li&gt;Sidecars injected for traffic control.
&lt;/li&gt;
&lt;li&gt;mTLS between services.
&lt;/li&gt;
&lt;li&gt;Retries. Circuit breakers. Distributed tracing.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It was the best system. So robust, so scalable. The best enterprise application.&lt;br&gt;
And then a preliminary AWS estimate arrived two days later. Projected monthly infrastructure cost: &lt;strong&gt;$42,300&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That included managed streaming clusters, multi-AZ databases for command and query models, service mesh overhead, observability tooling, and three separate environments.&lt;/p&gt;

&lt;p&gt;But there were no hallucinations. No obvious mistakes.&lt;br&gt;
Just a perfectly coherent interpretation of the requirements. The AI built exactly what was asked for.&lt;/p&gt;

&lt;p&gt;That was the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Garbage In. AI Out. Repeat. Or: How We Professionally Automated Our Own Confusion
&lt;/h2&gt;

&lt;p&gt;Let’s rewind.&lt;br&gt;
Before the architecture. Before the AWS estimate. Before the “enterprise-grade” diagram.&lt;/p&gt;

&lt;p&gt;There was a stakeholder. Stakeholder wanted to build an internal operations platform to monitor transaction processing and generate insights for management.&lt;/p&gt;

&lt;p&gt;Reasonable goal. They opened ChatGPT.&lt;/p&gt;

&lt;p&gt;They typed something like:&lt;br&gt;
&lt;code&gt;What architecture should I use to build a scalable, real-time financial monitoring platform that might grow in the future?&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;ChatGPT did what it does best. It delivered ambition. The model mentioned event-driven architecture, streaming pipelines and service meshes for observability and control!&lt;/p&gt;

&lt;p&gt;“Service mesh,” the stakeholder thought.&lt;br&gt;&lt;br&gt;
That sounds important.&lt;/p&gt;

&lt;p&gt;They typed a follow-up question:&lt;br&gt;
&lt;code&gt;How do we make sure the system scales well if usage grows?&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;ChatGPT responded confidently:&lt;br&gt;
&lt;code&gt;Modern systems often adopt patterns such as CQRS and event sourcing to separate concerns, improve scalability, and support future growth.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;CQRS. The stakeholder had heard that word before. Someone from engineering had mentioned it in a meeting once.&lt;br&gt;
It sounded serious, modern and safe. &lt;/p&gt;

&lt;p&gt;So they asked:&lt;br&gt;
&lt;code&gt;Would CQRS make our platform more future-proof?&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;ChatGPT, still helpful:&lt;br&gt;
&lt;code&gt;Yes, CQRS is commonly used in systems that anticipate growth and evolving requirements.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Future-proof. There it was again. And it sounded very reassuring.&lt;/p&gt;

&lt;p&gt;So the stakeholder wrote a very serious proposal.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It contained “real-time streaming architecture.”
&lt;/li&gt;
&lt;li&gt;It contained “CQRS with event sourcing.”
&lt;/li&gt;
&lt;li&gt;It contained “service mesh for resiliency and governance.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The requirements were vague. The ambition was abstract. The context was thin.&lt;/p&gt;

&lt;p&gt;Garbage in. → AI out.&lt;br&gt;
Then that proposal was fed into another AI-powered system.&lt;/p&gt;

&lt;p&gt;And it produced a perfectly consistent, technically correct, financially enthusiastic architecture.&lt;/p&gt;

&lt;p&gt;Garbage in. → AI out. → Repeat.&lt;/p&gt;




&lt;p&gt;By the time it reached the architecture review, the proposal sounded heavy. It looked substantial.&lt;br&gt;
What no one had written clearly was this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There were 25 users. At most. &lt;/li&gt;
&lt;li&gt;“Real-time” meant “an update every hour or so.”&lt;/li&gt;
&lt;li&gt;The “legacy integration” was a REST API. Poorly documented, yes... but still a REST API.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The expected traffic curve could comfortably fit inside a single moderately-sized instance without breaking a sweat.&lt;/p&gt;
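&lt;p&gt;That claim is cheap to verify. A back-of-envelope estimate, using the numbers above and assuming each user triggers roughly one update per hour:&lt;/p&gt;

```python
# Back-of-envelope load estimate for the actual requirements:
# 25 users, "real-time" actually meaning roughly one update per hour.
users = 25
updates_per_user_per_hour = 1
requests_per_second = users * updates_per_user_per_hour / 3600
print(round(requests_per_second, 4))  # roughly 0.007 requests per second
```

&lt;p&gt;Well under a hundredth of a request per second. No streaming cluster required.&lt;/p&gt;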

&lt;p&gt;The vague ambition travelled further than the concrete constraints. And the AI did exactly what it was asked to do.&lt;/p&gt;

&lt;p&gt;It optimised for the bold words, for growth, for the future. It was not, however, optimised for reality. Of course, this is a slightly exaggerated example.&lt;/p&gt;

&lt;p&gt;But only slightly. I have seen real systems where the architecture slides were more complex than any business case.&lt;br&gt;
And to be fair, the opposite happens too.&lt;/p&gt;

&lt;p&gt;Sometimes the requirement arrives as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“It’s just adding an &lt;code&gt;if&lt;/code&gt; to the tool we already have.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Just an &lt;code&gt;if&lt;/code&gt;. Behind that “if”:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;18,000 lines of legal compliance requirements.&lt;/li&gt;
&lt;li&gt;Regional regulatory variations.&lt;/li&gt;
&lt;li&gt;Audit trail obligations.&lt;/li&gt;
&lt;li&gt;Data retention constraints.&lt;/li&gt;
&lt;li&gt;Twenty bespoke hardware integrations with devices that were configured in 2014 by someone who no longer works here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the proposal says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Minor enhancement.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AI will happily believe that too.&lt;br&gt;
It will produce a neat solution for a simple conditional branch.&lt;/p&gt;

&lt;p&gt;Again, this is slightly exaggerated. But (again) only slightly. I have genuinely heard someone say, “It’s just an &lt;code&gt;if&lt;/code&gt;” for a problem that required eight scalable AWS services, ingestion from thousands of devices, and near real-time reporting.&lt;/p&gt;

&lt;p&gt;It was not, in fact, just an &lt;code&gt;if&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Garbage in → garbage out has always been true. Confusing requirements did not start with AI.&lt;/p&gt;

&lt;p&gt;The difference now is speed and confidence.&lt;/p&gt;

&lt;p&gt;We used to miscommunicate slowly. A developer would question it. A meeting would happen. Someone would sigh. Clarifications would emerge.&lt;/p&gt;

&lt;p&gt;Now we can take vague ambition, feed it into a model, generate a polished architecture, feed that into another system, and deploy it way before anyone asks whether the original requirement made sense.&lt;/p&gt;

&lt;p&gt;Garbage in → AI out → Feed that output back in →  Repeat.&lt;br&gt;
Confusion used to stay human-sized. Now it scales.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Which brings us to the uncomfortable part.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Prompt Engineering Is Just (Mis)Communication With Better Marketing
&lt;/h2&gt;

&lt;p&gt;We have been trying to fix human miscommunication for years.&lt;br&gt;
User stories. BDD. Acceptance criteria. Refinement meetings. Specification templates. Diagrams.&lt;br&gt;
Workshops about workshops.&lt;/p&gt;

&lt;p&gt;Entire methodologies exist because humans are terrible at saying what they want.&lt;/p&gt;

&lt;p&gt;And if anything has been proven, it’s that we remain terrible at communicating and writing.&lt;/p&gt;

&lt;p&gt;Now we have “prompt engineering” — which sounds very fancy and technical — but is essentially communicating by text with a very confident, opinionated rubber duck.&lt;/p&gt;

&lt;p&gt;That, by itself, will solve nothing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We say “scalable” and mean “won’t crash.”&lt;br&gt;
We say “real-time” and mean “doesn’t feel slow.”&lt;br&gt;
We say “future-proof” and mean “I don’t want to revisit this.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We assume everyone shares the same mental model. And they don't.&lt;br&gt;
AI does not fix this. It removes the human buffer. &lt;/p&gt;

&lt;p&gt;There is also the small detail of language. The same prompt written in English, Spanish, or German will not always produce the same output. Nuance shifts. Assumptions shift. Tone shifts.&lt;br&gt;
If “lost in translation” is a problem between humans, it does not disappear with a probabilistic model. It just becomes statistically interesting.&lt;/p&gt;

&lt;p&gt;Ambiguity is often resolved by questioning. A lot of questioning.&lt;br&gt;
The human (ideally) will ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Do we actually need real-time?”&lt;/li&gt;
&lt;li&gt;“Is this a CQRS problem or a CRUD problem?”&lt;/li&gt;
&lt;li&gt;“Do we have enough services to justify a service mesh?”&lt;/li&gt;
&lt;li&gt;“What’s the traffic?”&lt;/li&gt;
&lt;li&gt;“What’s the budget?”&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Yes. Never trust the requirements on their own. Requirements are optimistic by design.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI, however, trusts the requirements.&lt;/p&gt;

&lt;p&gt;If you write:&lt;br&gt;
&lt;code&gt;Let’s build an event-driven real-time streaming pipeline.&lt;/code&gt;&lt;br&gt;
It builds one.&lt;/p&gt;

&lt;p&gt;If you write:&lt;br&gt;
&lt;code&gt;Let’s use CQRS with event sourcing to future-proof.&lt;/code&gt;&lt;br&gt;
It splits your system in two and prepares you for scale you may never reach.&lt;/p&gt;

&lt;p&gt;If you write:&lt;br&gt;
&lt;code&gt;We should introduce a service mesh for observability and control&lt;/code&gt;&lt;br&gt;
It configures sidecars, mTLS, traffic policies, retries. &lt;/p&gt;

&lt;p&gt;The AI will trust your ambition and will not even ask if you have the budget to match it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Stakeholders sometimes bring big technical words. They do not always bring matching “big” money.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s be clear: I am very glad AI does not fight back.&lt;br&gt;
I am not prepared to argue with the AI overlords — or with a chatbot that has developed ego.&lt;br&gt;
But someone should. (Fight back, I mean.)&lt;/p&gt;

&lt;p&gt;Because AI does not measure necessity. It measures alignment.&lt;/p&gt;

&lt;p&gt;It will scale as if the users exist.&lt;br&gt;
It will future-proof as if the roadmap exists.&lt;br&gt;
It will architect as if the money exists.&lt;/p&gt;

&lt;p&gt;Prompt engineering is not magic. It is structured communication without interruption.&lt;/p&gt;

&lt;p&gt;And if we were imprecise before, we are now imprecise with acceleration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Human in the Loop, Not AI in the Loop
&lt;/h2&gt;

&lt;p&gt;Let’s un-exaggerate this for a second and gently deflate the enterprise dreams: all the previous examples were somewhat optimistic about the capabilities of AI. &lt;br&gt;
I have yet to see an AI model or agent that can take a full-blown PDF specification, translate it into a complete system, preserve every nuance, and not quietly drop 15% of the context somewhere between page 12 and Annex C.&lt;/p&gt;

&lt;p&gt;I have tried something simpler:&lt;br&gt;&lt;br&gt;
&lt;code&gt;“Here are two versions of a specification as PDFs. Tell me the differences.”&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;What I usually get back is enthusiasm and vibes, instead of a traceable list of changes. A few of the vibes are sometimes correct.&lt;/p&gt;

&lt;p&gt;Which means I cannot trust it blindly. Which means I have to read the specification myself anyway — to confirm that the AI didn’t miss half of it or hallucinate a requirement that never existed.&lt;br&gt;
Which means… I still have to do the work.&lt;/p&gt;

&lt;p&gt;Now let’s imagine a better-case scenario.&lt;br&gt;
Let’s imagine the AI &lt;em&gt;does&lt;/em&gt; understand the specification perfectly and generates all the required code. I still need to review it.&lt;/p&gt;

&lt;p&gt;Line by line; behaviour by behaviour; edge case by edge case.&lt;/p&gt;

&lt;p&gt;If something doesn’t match expectations, I now have two options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tune the prompts, refine the context, restructure the input, clarify assumptions… and try again
&lt;/li&gt;
&lt;li&gt;or just redo it myself — or delegate it to someone who will (we still might use AI to help in smaller increments).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And there is another subtle problem. AI depends heavily on familiarity. Imagine you are working with Java 25. If most publicly available examples, blog posts, and Stack Overflow discussions revolve around Java 8, then statistically speaking, that is the world the model understands best.&lt;br&gt;
You can explicitly ask for Java 25. It will try. But models gravitate toward what they have seen most often.  So now your review job expands again.&lt;/p&gt;

&lt;p&gt;And even in the magical scenario where the AI produces flawless, syntactically perfect, technically coherent and version adherent code — there is still the one uncomfortable truth that initiated this whole article:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the human requirements were vague, confused, or contradictory…&lt;br&gt;&lt;br&gt;
the output will be vague, confused, and contradictory — just very efficiently so.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And no model can compensate for missing intent.&lt;/p&gt;




&lt;p&gt;Now, before this turns into “AI is useless,” let me be clear: I am not against AI. Quite the opposite.&lt;/p&gt;

&lt;p&gt;AI is extremely useful when it is positioned correctly. You probably shouldn’t let it write your documentation from scratch.&lt;/p&gt;

&lt;p&gt;But you absolutely can ask it to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tighten your writing,&lt;/li&gt;
&lt;li&gt;improve clarity,&lt;/li&gt;
&lt;li&gt;suggest structure,&lt;/li&gt;
&lt;li&gt;highlight ambiguities,&lt;/li&gt;
&lt;li&gt;spot inconsistencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It will save you hours of re-reading your own text and wondering why that paragraph “feels off.”&lt;br&gt;
You probably shouldn’t let it replace code reviews entirely.&lt;/p&gt;

&lt;p&gt;But you &lt;em&gt;can&lt;/em&gt; use tools that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;flag suspicious logic,&lt;/li&gt;
&lt;li&gt;detect edge cases,&lt;/li&gt;
&lt;li&gt;suggest refactors,&lt;/li&gt;
&lt;li&gt;answer follow-up questions in pull request threads,&lt;/li&gt;
&lt;li&gt;tell you that, given the version of the tool you are using, there is a simpler way to do something.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Worst case?&lt;br&gt;&lt;br&gt;
You dismiss the comment — just like you would if a colleague misunderstood the context.&lt;/p&gt;

&lt;p&gt;Examples of other areas where it shines, if expectations are realistic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Suggesting boundary conditions, or pointing out “Have you considered null here?” moments. You still decide what matters. It just saves typing and forgetfulness.&lt;/li&gt;
&lt;li&gt;Generating repetitive scaffolding: templates, boilerplate, basic CRUD layers, migration scripts. The kind of code that is necessary but not intellectually exciting. You still review it. You still adapt it. But you type less.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In all these cases, AI is not replacing the human. It is assisting the human.&lt;br&gt;
That’s the difference. When you put &lt;strong&gt;AI in the loop&lt;/strong&gt;, you risk removing accountability and context.&lt;/p&gt;

&lt;p&gt;AI will happily suggest the architecture, but &lt;strong&gt;it will not&lt;/strong&gt; sit in the budget review defending it. And a person walking into that meeting and saying:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Well… the AI model told me to.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;…is not exactly a career-enhancing move.&lt;/p&gt;

&lt;p&gt;When you keep &lt;strong&gt;humans in the loop&lt;/strong&gt;, AI becomes leverage.&lt;br&gt;
Every system has loops.&lt;br&gt;
Requirements go in. → Architectures come out. → Costs follow.&lt;/p&gt;

&lt;p&gt;AI can sit inside that loop. But it cannot, or at least should not, own the loop. AI should not be a replacement for thinking.&lt;br&gt;
If you let that happen, confusion will scale better than your system ever will. And the loop becomes something else entirely:&lt;/p&gt;

&lt;p&gt;Garbage in. → AI out. → Repeat.&lt;/p&gt;

&lt;p&gt;Faster each time.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>ai</category>
      <category>software</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Jack VS the AI Machine</title>
      <dc:creator>Sara A.</dc:creator>
      <pubDate>Fri, 13 Feb 2026 20:53:35 +0000</pubDate>
      <link>https://forem.com/srsandrade/jack-vs-the-ai-machine-1o1l</link>
      <guid>https://forem.com/srsandrade/jack-vs-the-ai-machine-1o1l</guid>
      <description>&lt;h2&gt;
  
  
  The Specialist and the Box
&lt;/h2&gt;

&lt;p&gt;There once was a developer who knew Kubernetes. Let’s call him Bruno.&lt;/p&gt;

&lt;p&gt;Not “has deployed something once.” Not “can read a tutorial.”&lt;/p&gt;

&lt;p&gt;Bruno knew it. If there was a question about Kubernetes, Bruno was already answering it. If a project involved Kubernetes, there would inevitably be a moment where Bruno stood in front of a diagram explaining how everything worked. When something behaved strangely, everyone turned to Bruno.&lt;/p&gt;

&lt;p&gt;And slowly, without anyone announcing it, a box formed.&lt;/p&gt;

&lt;p&gt;The box had boundaries: containers, clusters, networking, scaling policies, resource limits. Inside those boundaries, Bruno was fluent. He knew the failure modes. He knew the edge cases. Outside the box… that was someone else’s diagram.&lt;/p&gt;

&lt;p&gt;He no longer needed to be fully allocated to projects. He floated. He was invited when “scaling” appeared in a document.&lt;/p&gt;

&lt;p&gt;Eventually, the company made it official. Bruno became the &lt;strong&gt;Kubernetes Titan&lt;/strong&gt;. It appeared in org charts, in email signatures, and at least in one PowerPoint template. The box now had a name.&lt;/p&gt;




&lt;h2&gt;
  
  
  Inside the Box, Life Is Good
&lt;/h2&gt;

&lt;p&gt;Life is good inside the box.&lt;/p&gt;

&lt;p&gt;Bruno does not attend daily stand-ups. He does not debug pipelines. He is summoned when the topic is “orchestration.” He writes the guidelines: platform standards, deployment principles, slides titled “Kubernetes Strategy 2025.” He no longer reviews YAML — that is project work. Bruno defines what “good YAML” means.&lt;/p&gt;

&lt;p&gt;If something doesn’t fit the standards, the answer is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The application must adapt.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After all, the platform is correct. If something fails, it must be elsewhere. Inside the box, responsibility has edges. Bruno owns the platform. Everything else belongs to “the product.” Life is clean when responsibility is clean.&lt;/p&gt;

&lt;p&gt;Lately, though, the questions feel different. They are not about scaling thresholds or deployment patterns. They are about how everything fits together.&lt;/p&gt;

&lt;p&gt;Those are not Kubernetes questions.&lt;/p&gt;

&lt;p&gt;Bruno prefers Kubernetes questions.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Machine Eats the Box
&lt;/h2&gt;

&lt;p&gt;At first, Bruno doesn’t notice. Then he does.&lt;/p&gt;

&lt;p&gt;The meetings are shorter. Fewer architecture reviews. Fewer “Can we align on this?” messages. People still use Kubernetes — they just don’t need Bruno to explain it.&lt;/p&gt;

&lt;p&gt;There is a new habit in the company. Logs go into chat windows. Diagrams go into chat windows. Entire systems get described to agents that reply in seconds. Instead of asking Bruno, they ask the machine.&lt;br&gt;
No meeting required.&lt;/p&gt;

&lt;p&gt;The machine answers quickly — deployment structures, scaling rules, example configurations. The answers are not perfect. But they are fast. And often… good enough.&lt;/p&gt;

&lt;p&gt;Bruno tries it too. When the question stays inside the box, he can judge the answer. He can correct it. But when the question crosses into application design, database structure, integration flows…&lt;/p&gt;

&lt;p&gt;Bruno reads the answer and does not know if it is right.&lt;/p&gt;

&lt;p&gt;He knows Kubernetes. He does not know the system.&lt;/p&gt;

&lt;p&gt;For the first time, that difference matters.&lt;/p&gt;

&lt;p&gt;Jack notices this sooner. Jack has been using the machine for weeks — not for Kubernetes, for everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  Jack Knows Where to Click
&lt;/h2&gt;

&lt;p&gt;Jack works in the same company.&lt;/p&gt;

&lt;p&gt;He is not a Titan. Not a Wizard. Not a Pro Whisperer. Jack just… works on things. He has written Java, Python, and TypeScript, built Docker images, fixed pipelines, written tests, removed tests, shipped things.&lt;/p&gt;

&lt;p&gt;He does not know Kubernetes like Bruno does. But he understands the core ideas. And he understands something else: Kubernetes is powerful. It is also sometimes a cannon. And occasionally we are hunting flies.&lt;/p&gt;

&lt;p&gt;When Jack works on an application, he tries to understand what it is supposed to achieve — not just how it runs, but why it exists. He understands what the data represents, what the business expects to see, what monitoring should detect, what the pipeline guarantees. He reads the requirements — sometimes twice.&lt;/p&gt;

&lt;p&gt;Yes, Jack is kind of a bore. He has opinions about traceability.&lt;/p&gt;

&lt;p&gt;He ties things together: orchestration, data, business rules, monitoring, pipelines, documentation.&lt;/p&gt;

&lt;p&gt;When the machine produces an answer, Jack does not treat it as truth. He treats it as material. He knows which constraints matter and which assumptions cannot break. He doesn’t ask for a solution in isolation. He gives the machine the context — and asks for something that fits.&lt;/p&gt;

&lt;p&gt;And when the machine answers confidently, Jack reads carefully. Sometimes he can’t explain why something feels off. But he can tell. And when he cannot tell, he knows where to look.&lt;/p&gt;

&lt;p&gt;He doesn’t need to master one box.&lt;/p&gt;

&lt;p&gt;He needs to move between them — and recognise when the machine is guessing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Jack Has the Big Red Button
&lt;/h2&gt;

&lt;p&gt;Let’s drop the story for a moment.&lt;/p&gt;

&lt;p&gt;Kubernetes is just an example. This isn’t about Kubernetes. It isn’t about specialists being useless. It’s about boxes — and what happens when we stay inside them too long.&lt;/p&gt;

&lt;p&gt;There’s a lot of noise about AI replacing juniors. Maybe. But there’s another group quietly at risk: the specialist who lives entirely inside one box.&lt;/p&gt;

&lt;p&gt;AI does not just generate code. It connects things. It drafts architecture. It crosses boundaries without asking for permission.&lt;/p&gt;

&lt;p&gt;If you only understand one layer deeply — but not how it interacts with others — you struggle differently. You don’t know what to ask. You don’t know what context matters. You don’t know whether the answer is subtly wrong.&lt;/p&gt;

&lt;p&gt;This didn’t start with AI. The industry has been accelerating for years. Safe niches become narrow corridors.&lt;/p&gt;

&lt;p&gt;The generalist — the so-called “jack of all trades” — never had the comfort of one box. They had to adapt. And AI rewards adaptation. It rewards context. It rewards people who can see across layers.&lt;/p&gt;

&lt;p&gt;For years, I have been skeptical of rigid separation: developers here, validators there, DevOps in another silo. Those structures often create friction, duplicated effort, and systems that feel stitched together rather than designed. With AI in the mix, those walls become even more problematic.&lt;/p&gt;

&lt;p&gt;If the machine sees the whole system, but the engineers only see their slice, the machine will move faster than the humans.&lt;br&gt;
The generalist isn’t powerful because they know everything. They are powerful because they connect things. And the person who connects things decides what gets built.&lt;/p&gt;

&lt;p&gt;The machine can generate. The machine can connect.&lt;/p&gt;

&lt;p&gt;But someone still has to decide what makes sense.&lt;/p&gt;

&lt;p&gt;That’s the button.&lt;br&gt;
Jack presses it.&lt;br&gt;
Carefully.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>My Data Lake Runs on MongoDB and PostgreSQL and I’m Not Sorry</title>
      <dc:creator>Sara A.</dc:creator>
      <pubDate>Sat, 07 Feb 2026 13:34:10 +0000</pubDate>
      <link>https://forem.com/srsandrade/my-data-lake-runs-on-mongodb-and-postgresql-and-im-not-sorry-1g2j</link>
      <guid>https://forem.com/srsandrade/my-data-lake-runs-on-mongodb-and-postgresql-and-im-not-sorry-1g2j</guid>
      <description>&lt;h2&gt;
  
  
  A Brief History of How I Angered Absolutely No One (So Far)
&lt;/h2&gt;

&lt;p&gt;Before we ever talked about technologies, we (a team working on a project) were dealing with a fairly uncomfortable data problem. We were collecting large volumes of data from multiple external sources, many of which we did not fully control or even fully understand. At the point of collection, the data arrived with little reliable context, inconsistent structure, and no clear guarantees about meaning.&lt;/p&gt;

&lt;p&gt;More importantly, the data could not be meaningfully analysed at ingestion time. Individual values were not inherently valid or invalid, useful or useless. Their significance only emerged later, once they were combined with other datasets, configuration, and a set of calculations applied in a subsequent processing phase.&lt;/p&gt;

&lt;p&gt;To make this harder, the same incoming data could eventually belong to different processing “plans”. Each plan defined its own expectations around data types, granularity, frequency, and validation rules. New plans could appear over time, existing ones could evolve, and at the moment the data was collected there was no reliable way to know which plan would ultimately apply.&lt;/p&gt;

&lt;p&gt;There was also a strict traceability requirement: any calculation or recalculation had to be fully attributable to the original input values. This meant that once data was ingested, the values themselves could not be updated or rewritten. Corrections and reinterpretations had to be expressed as new processing steps, not mutations of the original records.&lt;/p&gt;

&lt;p&gt;Faced with this, the textbook answer is a data lake. A “proper” data lake, at least in theory, is a place where raw data can be stored cheaply and indefinitely, in its original form, without forcing early decisions about structure or schema. The promise is simple: ingest first, understand later.&lt;/p&gt;

&lt;p&gt;In practice, this usually means object storage at the core: large, inexpensive buckets where data is written once and rarely touched directly. Data is organised by conventions rather than rigid models, and meaning is derived at read time using analytical engines rather than enforced at write time. Around this storage layer sits an ecosystem of supporting technology - distributed query engines, batch processing frameworks, metadata catalogues, and orchestration tools - while the lake itself remains deliberately passive.&lt;/p&gt;

&lt;p&gt;On paper, this fits our problem almost perfectly. We had uncertain data, evolving interpretation rules, and no safe way to apply strict schemas up front. Deferring meaning was a requirement.&lt;/p&gt;

&lt;p&gt;And yet, this is also where the theory starts to fray. Our problem was not just storing raw data for later analytics; it was needing to interact with that raw data operationally, correct it, investigate it, and understand it while the system was running. The data lake model explains where to put uncertain data, but it says much less about how to live with it day to day.&lt;/p&gt;

&lt;p&gt;So the question stopped being “where do we dump raw data cheaply?” and became “how do we live with raw data every day?”. We needed to inspect and query raw data operationally, without introducing extra ETL processes that would exist only to make the data usable; we needed to correlate records to plans once the missing context became available; and we needed to support corrections and investigations without rewriting history. The point was to keep the original values immutable for traceability, while still being able to attach new metadata and interpretations close to the original information - not as a separate universe that requires rebuilding pipelines every time we learn something new.&lt;/p&gt;

&lt;p&gt;In a classic data lake stack, that usually means object storage plus a surrounding ecosystem: file formats like Parquet or Avro; table layers such as Iceberg, Delta, or Hudi; catalogues like Glue, Hive, or Unity; batch processing engines like Spark or Flink; and interactive query engines such as Trino or Presto, all held together by orchestration on top. That world is powerful, but it also tends to move complexity into “process”: jobs, compaction, reprocessing, schema evolution, and the operational overhead of making raw data convenient to interrogate. Cheap storage is great, but we also cared about cheap processes and cheap debugging - and getting all three (cheap storage, cheap compute, cheap operations) is basically an engineering utopia.&lt;/p&gt;

&lt;p&gt;So we chose a simpler operational centre of gravity: store raw, schema-flexible data in MongoDB in a way that remains queryable day one, and treat plan correlation and corrections as additional metadata and derived layers rather than rewrites of the original values. It’s not the canonical data lake implementation, but it matched the shape of the problem we actually had: uncertainty first, meaning later, and a constant need to inspect and evolve the story without losing the original facts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Databases Walk Into a Bar
&lt;/h2&gt;

&lt;p&gt;Data lakes, despite their reputation for being “just storage”, almost never exist without some notion of a catalogue. Once raw data starts accumulating, you very quickly need a way to answer basic questions: what datasets exist, what they represent, where they live, and how they are meant to be used. Without that layer, a lake stops being flexible and starts being opaque.&lt;/p&gt;

&lt;p&gt;In most standard data lake architectures, this catalogue emerges as a separate system. Technologies like Hive Metastore, AWS Glue, Unity Catalog, or similar services exist to map logical datasets to physical storage, track schemas, and help query engines make sense of otherwise passive files. The catalogue doesn’t replace the lake; it makes it navigable.&lt;/p&gt;

&lt;p&gt;PostgreSQL is not what usually comes to mind when people talk about data lake catalogues. But at its core, a catalogue is simply structured metadata: names, identifiers, relationships, and lifecycle information that need to be queryable, consistent, and understandable by both humans and systems.&lt;/p&gt;

&lt;p&gt;Seen through that lens, PostgreSQL works exceptionally well. It gives us strong consistency, a rich query model, transactional updates, and a familiar interface for expressing relationships and constraints. Instead of discovering metadata by scanning storage or inferring schemas after the fact, we explicitly record what exists and how it should be interpreted. The result is not a less capable catalogue, but a more intentional one that is built around accessibility and correctness rather than engine integration.&lt;/p&gt;

&lt;p&gt;PostgreSQL became the place where we indexed meaning: which datasets exist, what context applies, and how raw collections should be interpreted at a given point in time. &lt;/p&gt;

&lt;p&gt;And that, more or less, is how two databases walked into a bar and agreed to share custody of the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Every Data Lake Is a Zoo, Mine Just Has Signs
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Yes, I could discover everything by parsing collection names.&lt;br&gt;
I could also parse raw bytes in hex.&lt;br&gt;
I choose civilisation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A data lake without a catalogue is not empty, it’s just loud. All the information is technically there, but understanding it requires effort, context, and a tolerance for archaeology. Storage can be self-describing in theory, yet still deeply unfriendly in practice.&lt;/p&gt;

&lt;p&gt;Our raw data is, by design, physically self-describing. Collection names encode the same information you would normally express through object-storage paths. For example, a MongoDB collection like:&lt;br&gt;
&lt;code&gt;raw_123_foo_2026-01&lt;/code&gt;&lt;br&gt;
carries the same meaning as a more traditional data lake layout using partitioned object storage, such as: &lt;code&gt;~/qualifier=123/dataset=foo/date=2026-01/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In both cases, the dataset, scope, and time window are embedded directly into the storage structure. With enough convention and enough discipline, the data can always be rediscovered by inspecting storage alone.&lt;/p&gt;

&lt;p&gt;The problem is not whether this works, but where that logic lives. Without a catalogue, every consumer needs to know how to parse collection names, reconstruct time ranges, and apply the same concatenation rules consistently. That logic inevitably leaks into multiple services, scripts, and mental models.&lt;/p&gt;
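&lt;p&gt;To make that cost concrete, here is a minimal sketch of the parsing logic every consumer would otherwise have to carry around. The convention and names are illustrative, not our production code:&lt;/p&gt;

```python
import re

# Illustrative convention: raw_{qualifier}_{dataset}_{YYYY-MM}
RAW_COLLECTION = re.compile(r"^raw_(\d+)_([a-z0-9]+)_(\d{4}-\d{2})$")

def parse_collection_name(name):
    """Recover qualifier, dataset and month from a raw collection name."""
    match = RAW_COLLECTION.match(name)
    if match is None:
        raise ValueError(f"not a raw collection name: {name!r}")
    qualifier, dataset, month = match.groups()
    return {"qualifier": qualifier, "dataset": dataset, "month": month}

def as_object_storage_path(name):
    """Express the same facts as a partitioned object-storage prefix."""
    parts = parse_collection_name(name)
    return (f"qualifier={parts['qualifier']}/"
            f"dataset={parts['dataset']}/date={parts['month']}/")
```

&lt;p&gt;Harmless in one place; a liability once it has been copy-pasted into five services.&lt;/p&gt;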

&lt;p&gt;Instead, we centralise that knowledge in the catalogue. Rather than forcing every consumer to understand naming conventions or storage layouts, we record the relationships explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| plan | dataset_type | from       | to         | collection          |
| 123  | raw          | 2026-01-01 | 2026-01-31 | raw_123_foo_2026-01 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Consumers no longer need to care whether the underlying data lives in MongoDB collections or in partitioned object storage (for example AWS S3 or Azure Data Lake Storage). They simply ask the catalogue what data exists for a given plan and time range, and receive the location that matches. Storage remains self-describing; the catalogue just makes that description immediately accessible.&lt;/p&gt;

&lt;p&gt;This does not replace self-description with abstraction. If the catalogue disappears, the data is still there and still interpretable by inspecting collection names directly. The system degrades into inconvenience, not failure. What the catalogue removes is the need to repeatedly rediscover the same meaning through convention and duplication.&lt;/p&gt;
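&lt;p&gt;A catalogue lookup of this kind is easy to sketch. The example below uses SQLite purely to stay self-contained (in our system the catalogue lives in PostgreSQL), and the schema and function names are illustrative:&lt;/p&gt;

```python
import sqlite3

# In-memory stand-in for the Postgres catalogue; schema mirrors the
# illustrative table above (plan, dataset_type, date range, location).
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE catalogue (
        plan TEXT, dataset_type TEXT,
        date_from TEXT, date_to TEXT, collection TEXT
    )
""")
db.execute(
    "INSERT INTO catalogue VALUES "
    "('123', 'raw', '2026-01-01', '2026-01-31', 'raw_123_foo_2026-01')"
)

def locate(plan, day):
    """Ask the catalogue which collection holds raw data for a plan on a day."""
    row = db.execute(
        "SELECT collection FROM catalogue "
        "WHERE plan = ? AND dataset_type = 'raw' "
        "AND ? BETWEEN date_from AND date_to",
        (plan, day),
    ).fetchone()
    return row[0] if row else None
```

&lt;p&gt;Consumers call &lt;code&gt;locate&lt;/code&gt; instead of parsing names, which is the whole point: the convention lives in exactly one place.&lt;/p&gt;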

&lt;p&gt;The same principle applies inside the data. Raw values are immutable, but their meaning over time is not. When a value needs to be corrected, we do not overwrite it. Instead, we attach context and let the data describe its own history:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "timestamp": "2026-01-01T00:00:00Z",
  "name": "my_value",
  "value": "12",
  "metadata": {
    "status": "REPLACED",
    "collected_at": "2026-01-20T00:00:00Z",
    "...":  {... a set of audit information}
  }
},
{
  "timestamp": "2026-01-01T00:00:00Z",
  "name": "my_value",
  "value": "2",
  "metadata": {
    "status": "CURRENT",
    "collected_at": "2026-01-22T00:00:00Z",
    "...": {... a set of audit information}
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The original value still exists, unchanged. What changes is not the data itself, but the context around it: a status that marks when it was superseded, an audit trail that records when and why that happened, and a new entry that assumes the role of the current value. Nothing is erased, nothing is rewritten, and every recalculation can always be traced back to the original facts.&lt;/p&gt;

&lt;p&gt;This is immutability with signage. Instead of forcing consumers to infer intent from timestamps or absence, the data makes its own history explicit. It tells you which value was current at any given time, which one replaced it, and under what circumstances - with meaning attached directly to the records themselves.&lt;/p&gt;
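&lt;p&gt;The correction flow itself is small. This sketch uses a plain in-memory list instead of a MongoDB collection (in practice it is a status update plus an insert), and the helper name is illustrative:&lt;/p&gt;

```python
# In-memory stand-in for a raw collection: a list of record dicts,
# with the field names from the example above.
records = [{
    "timestamp": "2026-01-01T00:00:00Z",
    "name": "my_value",
    "value": "12",
    "metadata": {"status": "CURRENT", "collected_at": "2026-01-20T00:00:00Z"},
}]

def correct(records, name, timestamp, new_value, collected_at):
    """Supersede the current value without rewriting it: mark it REPLACED,
    then append a new CURRENT record carrying the corrected value."""
    for rec in records:
        if (rec["name"] == name and rec["timestamp"] == timestamp
                and rec["metadata"]["status"] == "CURRENT"):
            rec["metadata"]["status"] = "REPLACED"
    records.append({
        "timestamp": timestamp,
        "name": name,
        "value": new_value,
        "metadata": {"status": "CURRENT", "collected_at": collected_at},
    })
```

&lt;p&gt;The original record survives with its value intact; only its status changes, so history accumulates instead of being rewritten.&lt;/p&gt;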

&lt;p&gt;Every data lake eventually becomes a zoo: full of valuable, unfamiliar creatures. Some rely on visitors memorising the animals. Mine just puts the names on the enclosures.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;My data lake is self-describing.&lt;br&gt;
PostgreSQL just adds subtitles so humans can watch it without suffering.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  This Was Designed Under Real-World Constraints
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;No data engineers were harmed in the making of this architecture.&lt;br&gt;
Mostly because there were none present to stop me.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This system was not designed by a specialised data platform team or under ideal conditions. It was built under a very concrete set of constraints that shaped most of the architectural decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We are a small team, responsible for both building and maintaining the system, so any solution had to remain understandable and maintainable without requiring deep domain expertise in data platforms.&lt;/li&gt;
&lt;li&gt;There is very limited dedicated data management knowledge within the team, which made highly specialised data platforms (analytics stacks, lakehouse ecosystems, and custom metadata or governance platforms) unrealistic from both a development and an operational perspective.&lt;/li&gt;
&lt;li&gt;The system had to be fully cloud-based, which meant that a baseline level of cost was unavoidable regardless of technology choices, making the focus more about cost predictability than absolute cost minimisation.&lt;/li&gt;
&lt;li&gt;Debuggability was treated as a first-class architectural requirement. When a calculation produces an unexpected result, the system must allow us to inspect the exact raw inputs that contributed to it immediately, not after a batch job finishes or a pipeline is re-run.&lt;/li&gt;
&lt;li&gt;The primary goal was therefore not to implement a “perfect” data lake architecture, but to build something that could be operated reliably, debugged easily, and evolved incrementally as the system and the team mature.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many established data lake patterns make sense in organisations optimising for large-scale analytics and long-running batch workloads. Our constraints were different: operational clarity mattered more than theoretical optimality.&lt;/p&gt;

&lt;h2&gt;
  
  
  This Is Not Free, I Just Know What I Am Paying For
&lt;/h2&gt;

&lt;p&gt;The primary trade-off in this design is storage cost. Storing raw, immutable data in MongoDB is more expensive per gigabyte than using object storage, and that cost is not always trivial to estimate upfront. The volume, shape, and retention of data vary significantly depending on the plans applied to it, and those plans can evolve or appear over time. As a result, the exact storage footprint cannot be predicted with complete confidence at the outset.&lt;/p&gt;

&lt;p&gt;However, this uncertainty is still easier to reason about than the alternative. Storage growth is largely linear and visible: data arrives, it is stored, and its cost accumulates in a predictable way over time. There are no hidden bursts of compute, no surprise cluster spin-ups, and no indirect costs tied to how often questions are asked. While the total cost may be higher, it is simpler to analyse, easier to attribute, and more transparent to operate than a model where storage is cheap but every interaction with the data incurs additional, variable processing cost.&lt;/p&gt;

&lt;p&gt;The comparison below reflects a conscious trade-off: higher storage costs, offset by more predictable spending and simpler day-to-day operations.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;MongoDB-based Lake&lt;/th&gt;
&lt;th&gt;Object Storage + Lake Stack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage cost&lt;/td&gt;
&lt;td&gt;Higher per GB&lt;/td&gt;
&lt;td&gt;Low per GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute model&lt;/td&gt;
&lt;td&gt;Always-on database&lt;/td&gt;
&lt;td&gt;On-demand distributed compute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query latency&lt;/td&gt;
&lt;td&gt;Low (interactive)&lt;/td&gt;
&lt;td&gt;Medium to high (job/cluster spin-up)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query pattern&lt;/td&gt;
&lt;td&gt;Point lookups, filtered queries&lt;/td&gt;
&lt;td&gt;Large scans, batch analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Update model&lt;/td&gt;
&lt;td&gt;Metadata/status updates, new records&lt;/td&gt;
&lt;td&gt;Rewrite / compaction jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data mutability&lt;/td&gt;
&lt;td&gt;Immutable raw values, contextual updates&lt;/td&gt;
&lt;td&gt;Append-only, rewrite-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reprocessing&lt;/td&gt;
&lt;td&gt;Selective, record-level logic&lt;/td&gt;
&lt;td&gt;Batch pipeline re-execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metadata management&lt;/td&gt;
&lt;td&gt;Explicit catalogue (Postgres)&lt;/td&gt;
&lt;td&gt;External metastore (Glue/Hive/Unity)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance&lt;/td&gt;
&lt;td&gt;Explicit, application-level&lt;/td&gt;
&lt;td&gt;Platform-assisted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational complexity&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;High (multiple systems)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost predictability&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Variable (compute-driven)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling pattern&lt;/td&gt;
&lt;td&gt;Vertical + sharding&lt;/td&gt;
&lt;td&gt;Horizontal compute clusters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging&lt;/td&gt;
&lt;td&gt;Direct data access&lt;/td&gt;
&lt;td&gt;Indirect via jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traceability&lt;/td&gt;
&lt;td&gt;Record-level, immediate&lt;/td&gt;
&lt;td&gt;Pipeline- and job-level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to investigate issues&lt;/td&gt;
&lt;td&gt;Minutes to hours&lt;/td&gt;
&lt;td&gt;Hours to days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical team size&lt;/td&gt;
&lt;td&gt;Small to medium&lt;/td&gt;
&lt;td&gt;Medium to large&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Primary optimisation&lt;/td&gt;
&lt;td&gt;Operational access&lt;/td&gt;
&lt;td&gt;Analytical throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  I Didn’t Build a Perfect Data Lake, I Built One I Can Explain
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Future me will hate parts of this system.&lt;br&gt;
Present me at least knows where the data is.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This architecture is not an attempt to redefine what a data lake should be, nor a manifesto against established patterns. In fact, we didn’t even come up with the idea of calling it a data lake. Using a data lake was a requirement. The name arrived fully formed, and our job was simply to make something exist behind it.&lt;/p&gt;

&lt;p&gt;With that constraint in place, we did not spend much time debating what a data lake should look like in theory. Our job was to make something that fit the description without making the system unnecessarily painful to build, operate, or explain. That’s where the cheekiness started.&lt;/p&gt;

&lt;p&gt;We kept the promises usually associated with a data lake: raw data retention, deferred interpretation, immutability, traceability - and quietly ignored the assumption that this automatically combined with large-scale distributed processing stacks and a supporting cast of coordination, orchestration, and metadata services. Not because those tools are bad, but because they optimise for a different class of problems.&lt;/p&gt;

&lt;p&gt;There is a persistent idea that data lakes are cheap. In practice, only the storage is cheap. Everything required to make that storage usable, such as distributed query engines, batch processing frameworks, orchestration layers, catalogues, and the expertise needed to operate them, carries a real and often variable cost. Object storage is inexpensive precisely because it does nothing; the moment you want to understand your data, you start paying in compute, coordination, and operational overhead.&lt;/p&gt;

&lt;p&gt;In our case, a fully cloud-based solution was also a requirement, so a baseline level of cost was unavoidable regardless of the architecture. The real choice was therefore not between cheap and expensive, but between different kinds of expense. We chose higher storage costs in exchange for simpler processes, faster feedback, and the ability to inspect and explain data directly, without spinning up clusters or waiting for pipelines to finish.&lt;/p&gt;

&lt;p&gt;So yes, this is a data lake. It stores raw data, defers meaning, preserves immutability, and supports recalculation. It just does so while being a little irreverent about the tooling, and very serious about day-to-day operability.&lt;/p&gt;

&lt;p&gt;If that makes purists uncomfortable, that’s fine.&lt;br&gt;
They weren’t in the room.&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>database</category>
      <category>datascience</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The Intellectual Junior Syndrome</title>
      <dc:creator>Sara A.</dc:creator>
      <pubDate>Sat, 24 Jan 2026 17:24:18 +0000</pubDate>
      <link>https://forem.com/srsandrade/the-intellectual-junior-syndrome-1dm3</link>
      <guid>https://forem.com/srsandrade/the-intellectual-junior-syndrome-1dm3</guid>
      <description>&lt;p&gt;The intellectual junior syndrome is a condition that primarily affects junior developers. Fresh graduates, people with a couple of years of experience, and especially the most motivated ones.&lt;/p&gt;

&lt;p&gt;It usually appears shortly after someone realises that software engineering is not just about making things work, but about doing things properly. They start reading. The whole lot: the books, the blogs, the forums.&lt;/p&gt;

&lt;p&gt;They now know what a visitor is. They know what a monad is. They know what hexagonal architecture is. They know the SOLID principles by heart. &lt;/p&gt;

&lt;p&gt;And then they get a job and a problem to solve. Not a university assignment. Not a kata. A real, messy, slightly boring business problem.&lt;/p&gt;

&lt;p&gt;And they solve it.&lt;br&gt;
In ten lines.&lt;br&gt;
Very smart lines.&lt;/p&gt;

&lt;p&gt;It’s not just a solution. It’s a framework. It’s generic. It’s extensible. It solves not just this problem, but fifty potential future problems.&lt;br&gt;
There is recursion. There is reflection. There is an interface to define the contract. There is a generic type parameter with a name nobody understands. And the whole thing lives inside a beautifully abstract method called something like &lt;code&gt;process&lt;/code&gt;, &lt;code&gt;handle&lt;/code&gt;, or &lt;code&gt;execute&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It’s elegant. It’s minimal. And it solves fifty problems.&lt;/p&gt;

&lt;p&gt;Forty-nine and a half of them are imaginary.&lt;br&gt;
The remaining half is the actual problem you asked them to solve.&lt;br&gt;
And only half of that behaviour is correct.&lt;/p&gt;

&lt;p&gt;The intellectual junior syndrome is a serious condition. In most cases it fades with experience, but in some cases it becomes chronic and persists well into senior years.&lt;/p&gt;

&lt;h2&gt;
  
  
  Clinical Observations
&lt;/h2&gt;

&lt;p&gt;Over the past months, I’ve observed several patients presenting clear symptoms of the intellectual junior syndrome. Names have been changed to protect the innocent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Junior A&lt;/strong&gt;&lt;br&gt;
Junior A is confronted with a simple conditional flow: a switch statement with three cases.&lt;br&gt;
Their immediate thought: replace it with a strategy pattern.&lt;/p&gt;

&lt;p&gt;Now we have three classes, one interface, a factory, and a dependency injection configuration.&lt;/p&gt;

&lt;p&gt;The original problem had three branches.&lt;br&gt;
The new solution has eight files.&lt;br&gt;
The code is now “more extensible”. Nothing has been extended.&lt;/p&gt;

&lt;p&gt;Junior A never got the chance to finish the implementation. They were given medication and told to rest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Junior B&lt;/strong&gt;&lt;br&gt;
Junior B notices that two different flows share the same two lines of code.&lt;/p&gt;

&lt;p&gt;They decide to redesign the entire module around an abstract base class.&lt;br&gt;
The abstract class is used like a utility class. Nothing is overridden. No polymorphism is needed.&lt;/p&gt;

&lt;p&gt;Two weeks later, a third flow appears.&lt;br&gt;
It doesn’t fit the hierarchy.&lt;br&gt;
Now the module needs to be redesigned again.&lt;/p&gt;

&lt;p&gt;Junior B is currently under observation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Senior A&lt;/strong&gt;&lt;br&gt;
Senior A proposes introducing a strategy pattern for something that could be solved with a single if.&lt;/p&gt;

&lt;p&gt;A new analysis is requested. Similar incidents have been reported before.&lt;br&gt;
Senior A was also given medication and told to rest.&lt;/p&gt;

&lt;p&gt;At this point, it becomes clear that the syndrome is not always cured by time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Senior A (again)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Senior A, in a Spring Boot application, presents an urgent proposal: implement a bespoke JSON parser.&lt;/p&gt;

&lt;p&gt;Clinical interview reveals no current requirement. The justification is purely prophylactic:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What if one day we need special handling?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Diagnosis is confirmed. Quarantine protocol has been applied.&lt;/p&gt;

&lt;h2&gt;
  
  
  Books Are Not the Pathogen
&lt;/h2&gt;

&lt;p&gt;It’s important to clarify one thing: the problem is not theory.&lt;br&gt;
The intellectual junior does not suffer from reading too much, but from not yet knowing what to do with what they have read.&lt;/p&gt;

&lt;p&gt;Understanding design patterns is essential. Knowing architectural styles is useful. Learning about functional programming, object orientation, or system design is valuable. None of that is wasted effort.&lt;/p&gt;

&lt;p&gt;In fact, much of what modern frameworks do is exactly this: they embed decades of design patterns so you don’t have to reimplement them every time. Frameworks like Spring Boot or Quarkus, for example, hide a huge amount of complexity behind sensible defaults and conventions.&lt;/p&gt;

&lt;p&gt;The problem starts when theory becomes a hammer and every problem starts looking like a nail.&lt;/p&gt;

&lt;p&gt;Sometimes you really do have fifty different concerns in the same piece of code. Sometimes you really do need multiple layers of abstraction. Sometimes you really do need to introduce patterns explicitly.&lt;br&gt;
But recognising those situations requires something theory alone does not provide: context, experience, and judgement.&lt;/p&gt;

&lt;p&gt;And that is precisely what the intellectual junior has not accumulated yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cure Is Simplicity
&lt;/h2&gt;

&lt;p&gt;The uncomfortable truth behind all of this is that the cure is not more knowledge. It is simplicity.&lt;/p&gt;

&lt;p&gt;And simplicity is hard. Not trivially simple code, but code that is simple because it only solves the real problem and nothing else.&lt;/p&gt;

&lt;p&gt;Creating that kind of simplicity requires knowing exactly what you need. And to know that, you need to understand the world you’re working in.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maybe you won’t over-engineer something that the framework already does for you, but for that, you need to actually understand the framework.&lt;/li&gt;
&lt;li&gt;Maybe you won’t invent imaginary problems, but for that, you need to understand the business context and what problems actually exist.&lt;/li&gt;
&lt;li&gt;Maybe you won’t design complex infrastructure abstractions, but for that, you need to understand how your deployment, platforms, and tooling already behave.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simplicity is not a lack of knowledge.&lt;br&gt;
It is the result of a lot of knowledge applied with restraint.&lt;br&gt;
And for a junior, that’s genuinely hard. Not impossible, but hard. It requires mentoring, exposure, and a willingness to learn things that go far beyond the tasks they are assigned.&lt;/p&gt;

&lt;p&gt;That part is normal.&lt;/p&gt;

&lt;p&gt;The bigger concern is the seniors who never respond to the treatment. They accumulate patterns, frameworks, and rules - but not context.&lt;/p&gt;

&lt;p&gt;They stay inside a very narrow technical world and become evangelists of practices rather than observers of reality. At that point, the syndrome takes a form of its own. It no longer just affects the host - it starts trying to contaminate others.&lt;/p&gt;

&lt;p&gt;And then the final symptom appears: future-ology.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What if we need this later?”&lt;br&gt;
“What if this grows?”&lt;br&gt;
“What if requirements change?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The problem is that humans are terrible at predicting the future. If we were good at it, we’d probably be better off buying lottery tickets.&lt;br&gt;
In practice, when a simple case becomes complex, it is far easier to refactor a switch with three branches than to untangle a whole architecture that was built for problems that never existed.&lt;/p&gt;

&lt;p&gt;The senior who is never cured often doesn’t even get the chance to be cured.&lt;/p&gt;

&lt;p&gt;They move from project to project, design a skeleton, introduce patterns, create abstractions, and then leave to evangelise somewhere else.&lt;/p&gt;

&lt;p&gt;And the team that stays behind is left with a solution that was never really designed for their world, their constraints, or their actual problems.&lt;/p&gt;

&lt;p&gt;Just for someone else’s imagination of the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discharge Notes
&lt;/h2&gt;

&lt;p&gt;There is no shame in juniors who over-engineer. Many of us have suffered from the syndrome at some point. I was clinically diagnosed.&lt;br&gt;
And honestly, I would still much rather work with a junior who reads too much and overthinks everything than with one who couldn’t care less. The first one at least has the curiosity and the motivation. The second one has already given up before learning anything.&lt;/p&gt;

&lt;p&gt;As a former patient, I only really got a chance to recover when I was moved to projects full of dead code left behind by seniors like the ones described above. After repeatedly hitting my head trying to fit a simple if into a hundred files of unnecessary abstractions, I slowly learned something important: most of the time, I would save far more time by understanding the business and removing code than by adding more.&lt;/p&gt;

&lt;p&gt;And when removing it was not possible, I was left with a quiet, internal resentment towards the people who had already moved on.&lt;/p&gt;

&lt;p&gt;Now, as a senior - cured of this syndrome, but probably afflicted with others - I try to push a slightly different perspective.&lt;/p&gt;

&lt;p&gt;Writing code is the easiest part of our job.&lt;br&gt;
Working out the minimal code we need to write is the hard part.&lt;/p&gt;

&lt;p&gt;And to get there, they don’t just need to learn languages, patterns, and frameworks. They need to understand a bit of everything: the business, the infrastructure, the constraints, the people, and the history of the system.&lt;/p&gt;

&lt;p&gt;Because anyone can build something.&lt;br&gt;
Figuring out what not to build takes longer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1s1ey4so53186b903fdq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1s1ey4so53186b903fdq.png" alt=" " width="800" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>productivity</category>
      <category>beginners</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>No Dogma: Applying BDD to real systems</title>
      <dc:creator>Sara A.</dc:creator>
      <pubDate>Thu, 22 Jan 2026 21:47:58 +0000</pubDate>
      <link>https://forem.com/srsandrade/no-dogma-applying-bdd-to-real-systems-3i52</link>
      <guid>https://forem.com/srsandrade/no-dogma-applying-bdd-to-real-systems-3i52</guid>
      <description>&lt;h1&gt;
  
  
  Behaviour Is Not a Methodology
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;How BDD actually works once you stop treating it like one&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When I first tried to implement Behaviour-Driven Development, it didn’t go particularly well.&lt;br&gt;
Not because the team resisted it, but because I treated it too literally.&lt;br&gt;
This article is about what actually made it work - especially in low-level and technical systems.&lt;/p&gt;

&lt;p&gt;I focused on Gherkin, on syntax, on “proper” scenarios, and on reproducing what the books and talks described.&lt;br&gt;
What I got was a lot of feature files, a lot of glue code, and very little shared understanding.&lt;/p&gt;

&lt;p&gt;In practice, I was spending a lot of time translating low-level concepts - things like firmware updates, communication protocols, or system states - into high-level “natural language” scenarios that didn’t really help anyone. They were too abstract to guide implementation, and at the same time not concrete enough to give confidence when discussing behaviour with the client or product owner.&lt;/p&gt;

&lt;p&gt;The second time I tried BDD, I stopped trying to implement BDD as described, and started trying to implement communication that actually worked.&lt;/p&gt;

&lt;p&gt;Since then, I’ve seen BDD become genuinely useful but only when it adapts to the reality of the project and the people involved. I’ve used classic Gherkin, modified Gherkin, Excel sheets with formulas as behavioural models, diagrams, and UI mocks.&lt;/p&gt;

&lt;p&gt;All of them worked. None of them were “pure”.&lt;br&gt;
And that’s the core lesson behind everything in this article:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;BDD fails as a framework and succeeds as a forcing function for communication. It only works once you stop following it and start bending it.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Behaviour Is Not a Language Problem
&lt;/h2&gt;

&lt;p&gt;Most BDD material assumes that behaviour should be expressed in “natural language”.&lt;/p&gt;

&lt;p&gt;That sounds reasonable until you remember one uncomfortable fact: natural language is subjective.&lt;/p&gt;

&lt;p&gt;Humans struggle to express what they want even in everyday life. Ask two people to agree on dinner and you’ll get friction. Yet we expect the same people to write “clear, universal, executable specifications” in English.&lt;/p&gt;

&lt;p&gt;This gets even worse when not everyone comes from the same country or culture. Even the same word can carry different meanings depending on context (British and American English are full of subtle differences). And once you add people working in a second language, the idea of a single, shared “natural” language becomes even more fragile.&lt;/p&gt;

&lt;p&gt;Natural language is ambiguous, culturally biased, emotionally loaded, and interpreted differently by different people.&lt;/p&gt;

&lt;p&gt;So the real rule is not “use natural language”.&lt;/p&gt;

&lt;p&gt;The real rule is: Use whatever medium creates shared understanding fastest.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Sometimes that’s text.  &lt;br&gt;
Sometimes it’s tables.  &lt;br&gt;
Sometimes it’s diagrams.  &lt;br&gt;
Sometimes it’s pictures.  &lt;br&gt;
Sometimes it’s Excel with formulas.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;All of these are valid expressions of behaviour.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Myth of Universal Natural Language
&lt;/h2&gt;

&lt;p&gt;BDD literature often implies that scenarios should be understandable by everyone; we even have a name for the idea: ubiquitous language. That is impossible.&lt;/p&gt;

&lt;p&gt;There is no universal natural language. Language is always cultural and contextual - and in our work, business-driven.&lt;/p&gt;

&lt;p&gt;“Natural” for a lawyer is not natural for a developer. “Natural” for a control engineer is already technical. And that’s fine.&lt;/p&gt;

&lt;p&gt;If all stakeholders understand protocols, standards, schemas, or state machines, then those are the natural language of the business.&lt;/p&gt;

&lt;p&gt;A perfectly valid BDD scenario can be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Given the device is in state CONNECTED
When a READ request with OBIS code X is sent
Then the system must respond with frame Y within 200ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;That is technical.  &lt;br&gt;
That is business-relevant.  &lt;br&gt;
That is natural for that domain.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Simplifying it would make it less true, not more accessible.&lt;/p&gt;
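&lt;p&gt;To make that concrete, here is a minimal sketch of binding such a scenario to an executable test. &lt;code&gt;DeviceClient&lt;/code&gt;, its methods, and the frame values are hypothetical stand-ins for whatever protocol stack the real project uses:&lt;/p&gt;

```python
import time

# A sketch, not a real DLMS client: DeviceClient and the frame values
# are hypothetical stand-ins so the example is self-contained.
class DeviceClient:
    def __init__(self):
        self.state = "CONNECTED"

    def read(self, obis_code):
        # A real client would send a READ request over the wire.
        return {"frame": "Y", "obis": obis_code}

def test_read_responds_with_frame_y_within_200ms():
    device = DeviceClient()
    assert device.state == "CONNECTED"          # Given

    start = time.monotonic()
    response = device.read(obis_code="X")       # When
    elapsed_ms = (time.monotonic() - start) * 1000

    assert response["frame"] == "Y"             # Then
    # ...within the 200 ms budget (the fake answers instantly;
    # against real hardware this is the actual timing constraint).
    assert max(elapsed_ms, 200.0) == 200.0
```

&lt;p&gt;The comments map each assertion back to the Given/When/Then steps, which is what keeps the test traceable to the agreed behaviour.&lt;/p&gt;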




&lt;h2&gt;
  
  
  Behaviour Can Be Expressed in Many Forms
&lt;/h2&gt;

&lt;p&gt;In one project or business area, behaviour can be best expressed as Excel:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given these inputs → apply these formulas → this is the output.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Rows are scenarios.
&lt;/li&gt;
&lt;li&gt;Columns are inputs and expected results.
&lt;/li&gt;
&lt;li&gt;Clients understand it. Developers understand it. Tests can be generated from it.&lt;/li&gt;
&lt;/ul&gt;
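&lt;p&gt;As an illustration, that spreadsheet model translates almost mechanically into a table-driven test. The tariff formula and the numbers below are hypothetical; the shape - rows as scenarios, columns as inputs and expected results - is the point:&lt;/p&gt;

```python
# Sketch: a spreadsheet export as the behavioural model. Each tuple
# mirrors one row the client signed off on; monthly_bill() and the
# tariff numbers are hypothetical stand-ins for the real formula.
ROWS = [
    # (consumption_kwh, unit_price, standing_charge, expected_total)
    (0,   0.25, 5.00,  5.00),
    (100, 0.25, 5.00, 30.00),
    (400, 0.20, 5.00, 85.00),
]

def monthly_bill(consumption_kwh, unit_price, standing_charge):
    return consumption_kwh * unit_price + standing_charge

def run_rows(rows):
    """Every row is one scenario; the whole sheet is the test suite."""
    failures = []
    for kwh, price, standing, expected in rows:
        got = monthly_bill(kwh, price, standing)
        if round(got - expected, 9) != 0:
            failures.append((kwh, price, standing, expected, got))
    return failures
```

&lt;p&gt;In a real project the rows would be loaded from the actual workbook (for example with &lt;code&gt;openpyxl&lt;/code&gt;), so the client keeps editing the behavioural model directly and the test suite follows.&lt;/p&gt;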

&lt;p&gt;In another project, behaviour may be best expressed with UI mocks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given you are here (picture), with specific options selected, configuration values set, feature flags enabled, and the system already in a particular state,&lt;br&gt;&lt;br&gt;
when you click this (picture),&lt;br&gt;&lt;br&gt;
then this appears (picture).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All of that context was captured naturally in a single visual. Trying to express the same thing purely in text would have resulted in long scenarios full of &lt;em&gt;and&lt;/em&gt;, &lt;em&gt;and&lt;/em&gt;, &lt;em&gt;and&lt;/em&gt; connectors, and still with less precision.&lt;/p&gt;

&lt;p&gt;BDD does not require prose.&lt;br&gt;&lt;br&gt;
It requires a shared behavioural model.&lt;/p&gt;

&lt;p&gt;Language is just one possible encoding. Ultimately you might even reach the conclusion that you need multiple types of media to best convey your system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Automating BDD Without Cucumber
&lt;/h2&gt;

&lt;p&gt;BDD is not about making scenarios executable.&lt;/p&gt;

&lt;p&gt;It is about making behaviour traceable and verifiable.&lt;br&gt;
The only real automation requirements are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you can trace tests to agreed behaviour,
&lt;/li&gt;
&lt;li&gt;you can produce (good) reports showing scenario coverage and status.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If tests are a perfect 1:1 translation of scenarios, great.&lt;br&gt;&lt;br&gt;
If not, but they still give traceability and evidence, also great.&lt;/p&gt;

&lt;p&gt;BDD automation is an evidence system, not a syntax engine.&lt;/p&gt;
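&lt;p&gt;A minimal sketch of that idea, assuming nothing more than a way to tag tests with scenario IDs. The IDs, titles, and registry format below are illustrative, not any standard tooling:&lt;/p&gt;

```python
# Sketch of automation as an evidence system: each test declares which
# agreed scenario it verifies, and a small report shows coverage and
# status. Scenario IDs and titles here are illustrative assumptions.
AGREED_SCENARIOS = {
    "SC-01": "READ request returns frame Y while CONNECTED",
    "SC-02": "Firmware update resumes after power loss",
}

results = {}

def verifies(scenario_id):
    """Tag a test with the behaviour it is evidence for."""
    def wrap(test_fn):
        def run():
            try:
                test_fn()
                results[scenario_id] = "pass"
            except AssertionError:
                results[scenario_id] = "fail"
        return run
    return wrap

@verifies("SC-01")
def test_read_returns_frame_y():
    assert "Y" == "Y"  # stand-in for a real protocol-level check

def coverage_report():
    """One line per agreed scenario: covered or not, passing or not."""
    return [
        f"{sid}  {title}: {results.get(sid, 'NOT COVERED')}"
        for sid, title in AGREED_SCENARIOS.items()
    ]
```

&lt;p&gt;Running the tagged test and then &lt;code&gt;coverage_report()&lt;/code&gt; surfaces SC-02 as NOT COVERED - exactly the kind of gap a client-facing report should make visible.&lt;/p&gt;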




&lt;h2&gt;
  
  
  Language Decay Is Avoidable
&lt;/h2&gt;

&lt;p&gt;BDD assumes a shared language exists. Reality: shared language drifts over time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;People join.
&lt;/li&gt;
&lt;li&gt;People leave.
&lt;/li&gt;
&lt;li&gt;Terms get overloaded.
&lt;/li&gt;
&lt;li&gt;Meanings become implicit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Six months later, the same word means different things to different people and nobody notices. So BDD requires something most teams never build: a maintained glossary.&lt;/p&gt;

&lt;p&gt;Not documentation for its own sake, but a semantic source of truth. Every time someone asks “what does this mean?”, there must be a place to consult. And if that place does not exist, that’s not a discussion - that’s a missing artefact.&lt;/p&gt;

&lt;p&gt;Language is infrastructure. If you don’t maintain it, behaviour becomes ambiguous again.&lt;/p&gt;
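&lt;p&gt;The glossary does not have to stay prose, either. A small sketch, assuming the glossary and scenarios live in files a script can read (the terms and the scenario text below are illustrative):&lt;/p&gt;

```python
# Sketch: the glossary as a machine-checkable artefact. A real project
# would load both the glossary and the scenario texts from files.
GLOSSARY = {
    "frame": "One protocol data unit as sent on the wire.",
    "connected": "Association established and keys agreed.",
}

def undefined_terms(scenario_text, jargon):
    """Return jargon words used in a scenario but missing a glossary entry."""
    used = {w.strip(".,").lower() for w in scenario_text.split()}
    return sorted(t for t in jargon if t in used and t not in GLOSSARY)

scenario = "Given the device is connected, it replies with one frame per request."
# 'request' is domain jargon here but has no glossary entry yet:
missing = undefined_terms(scenario, jargon={"frame", "connected", "request"})
# missing == ["request"]
```

&lt;p&gt;Run in CI, a check like this turns “what does this mean?” from a recurring discussion into a failing build with a named missing artefact.&lt;/p&gt;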




&lt;h2&gt;
  
  
  Development and Testing Cannot Be Isolated
&lt;/h2&gt;

&lt;p&gt;BDD assumes that behaviour is a shared concern: that design decisions, implementation, and validation all revolve around the same understanding of how the system should behave.&lt;/p&gt;

&lt;p&gt;Most organisations, however, are structured around strong role separation: design happens in one place, implementation in another, and validation in yet another, often across different teams and phases.&lt;/p&gt;

&lt;p&gt;These two ideas are fundamentally incompatible.&lt;/p&gt;

&lt;p&gt;The emulator problem is a perfect illustration of this.&lt;/p&gt;

&lt;p&gt;In complex systems, especially those that integrate with external platforms or hardware, meaningful testing often requires emulators or other types of test doubles. Testers depend on them to validate behaviour, but only developers usually have the technical knowledge to build them. Product owners don’t own them, budget rarely plans for them, and testers cannot realistically implement them.&lt;/p&gt;

&lt;p&gt;So emulators become a kind of no-man’s land: critical for system validation, but owned by no one. They are built late, are often incomplete, and quickly become fragile and outdated.&lt;/p&gt;

&lt;p&gt;At that point, system behaviour is something that is discovered afterwards. You cannot have shared ownership of behaviour and at the same time isolate responsibility for validation. The moment behaviour is validated only after “development is done”, it stops driving design and becomes post-mortem verification.&lt;/p&gt;

&lt;p&gt;Which defeats the purpose.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Knowledge Gap Problem
&lt;/h2&gt;

&lt;p&gt;There is another pattern that often appears in organisations with strong role separation: dedicated validation teams frequently have little to no deep technical knowledge of the systems they are validating. In some cases, these teams are not technical at all, by design.&lt;/p&gt;

&lt;p&gt;This is often justified using the classic test pyramid: developers are responsible for unit and integration tests, while testers/validators focus on system and acceptance testing at the top.&lt;br&gt;
The pyramid itself is not the problem. Having multiple layers of testing is generally a good idea. The problem arises when the pyramid is used to justify a separation of people, rather than a separation of concerns.&lt;/p&gt;

&lt;p&gt;On paper, this model looks clean. In practice, it creates a structural knowledge gap.&lt;/p&gt;

&lt;p&gt;Validation teams are expected to validate end-to-end behaviour, but are rarely given the technical context required to do so. They often cannot realistically automate tests themselves. At best, they can adjust pre-made scripts or operate tools prepared by others, but they are structurally unable to design test infrastructure, build emulators, or reason about system-level behaviour.&lt;br&gt;
This can work reasonably well for user-facing applications, where behaviour is mostly observable through interfaces and workflows. But it breaks down completely in more technical domains.&lt;/p&gt;

&lt;p&gt;If you are testing an embedded system, for example, you cannot meaningfully validate behaviour without understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the underlying protocols and communication patterns,
&lt;/li&gt;
&lt;li&gt;the difference between technologies (e.g. DLMS, Zigbee, or similar standards),
&lt;/li&gt;
&lt;li&gt;the configuration and state of the device,
&lt;/li&gt;
&lt;li&gt;the constraints and failure modes of the hardware itself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these contexts, black-box testing is often not just insufficient - it is actively misleading. The most important behaviours are not visible at the UI level and cannot be reasoned about without technical context.&lt;/p&gt;

&lt;p&gt;Which again contradicts the idea that behaviour can be validated without understanding the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Behaviour Is Necessary, Not Sufficient
&lt;/h2&gt;

&lt;p&gt;Scenarios define what the system should do.&lt;br&gt;
They rarely define performance, resilience, security, or regulatory constraints.&lt;br&gt;
Yet those are often the real reasons systems fail.&lt;/p&gt;

&lt;p&gt;For example, contracts sometimes explicitly or implicitly require that the system comply with ISO 27001 or some other framework.&lt;/p&gt;

&lt;p&gt;Very rarely do behavioural scenarios capture what that actually means in practice: audit trails, access control policies, incident response procedures, data retention rules, and similar constraints.&lt;/p&gt;

&lt;p&gt;This is a structural blind spot in many BDD examples: non-functional and regulatory aspects are often omitted, even though they are frequently the most critical requirements in real systems.&lt;/p&gt;

&lt;p&gt;In practice, scenarios alone are not enough to drive engineering decisions.&lt;/p&gt;

&lt;p&gt;Ultimately, these aspects can also be modelled using scenarios, but doing so often becomes an exercise in translating existing requirements into a different format, rather than improving understanding.&lt;br&gt;
Architecture notes, protocol references, regulatory standards, and technical constraints must live alongside behavioural scenarios.&lt;/p&gt;

&lt;p&gt;Not in a separate universe.&lt;/p&gt;




&lt;h2&gt;
  
  
  Behaviour Beyond the System
&lt;/h2&gt;

&lt;p&gt;One last point that is often overlooked: behaviour should not only condition development, but everything around it.&lt;/p&gt;

&lt;p&gt;If BDD is really about shared understanding of behaviour, then that understanding should apply not just to the system, but also to the way people work together.&lt;/p&gt;

&lt;p&gt;That includes the team. But it also includes the client/product-owner.&lt;/p&gt;

&lt;p&gt;We all know about scope creep. We all know about fixed-price projects that claim to be agile in scope, but not in budget. A lot of conflict in software projects does not come from technical failure, but from mismatched expectations about how collaboration itself should work.&lt;/p&gt;

&lt;p&gt;In fixed-price or highly constrained projects, one way to address this is to explicitly define behavioural expectations as part of the contract or engagement model.&lt;/p&gt;

&lt;p&gt;Not in abstract terms, but very concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scenarios are defined collaboratively in workshops,&lt;/li&gt;
&lt;li&gt;client input is required within specific timeframes,&lt;/li&gt;
&lt;li&gt;test evidence and traceability are provided as part of each delivery by delivering X and Y reports,&lt;/li&gt;
&lt;li&gt;demos happen at an agreed frequency and in a specific format,&lt;/li&gt;
&lt;li&gt;and anything outside of that flow is explicitly considered out of scope.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;Does the client require a specific documentation template? If so, it should be captured in this document. And if the client does not require any specific template, that should be captured as well.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Will the client do any testing themselves? If yes, write down which level is their responsibility. And write down which levels of testing are covered by your team.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The same idea can apply internally within a team: documenting code practices, review standards, how demos are conducted, how decisions are made, and what “done” actually means.&lt;/p&gt;

&lt;p&gt;Not as bureaucracy, but as shared behavioural expectations.&lt;/p&gt;

&lt;p&gt;Because in the end, most project failures are not caused by missing features. They are caused by people having different mental models of what “collaboration” was supposed to look like.&lt;/p&gt;

&lt;p&gt;Making that behaviour explicit is, in many cases, more important than any technical scenario.&lt;br&gt;
This is just one possible format; other approaches can work just as well. The underlying principle is the same: behaviour should be explicit, shared, and continuously validated.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to keep from all this rambling
&lt;/h2&gt;

&lt;p&gt;If there is anything worth keeping from all of this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Behaviour first - but there are many valid ways to express it.
&lt;/li&gt;
&lt;li&gt;Natural language - only within a specific context.
&lt;/li&gt;
&lt;li&gt;Tests - traceable to behaviour. Tests, tests, tests.
&lt;/li&gt;
&lt;li&gt;Tests as living documentation - the best ramp-up material.
&lt;/li&gt;
&lt;li&gt;No strict phase separation - a feature is only done when behaviour works and is validated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If, after reading all this, you realise that your team already does most of it, just without calling it BDD, then congratulations.&lt;/p&gt;

&lt;p&gt;You were never missing a framework. You were already doing Behaviour-Driven Development.&lt;/p&gt;

&lt;p&gt;And if someone tells you that what you are doing is “not really BDD”, that’s fine too. Labels matter much less than outcomes.&lt;/p&gt;

&lt;p&gt;And if this approach doesn’t work in your context, that’s also fine. The point is not to copy a process, but to find your own way of following the same principles.&lt;/p&gt;

&lt;p&gt;That, in the end, is what BDD was always meant to be about.&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>architecture</category>
      <category>bdd</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
