Forem: Davide Mibelli

Flutter vs React Native in 2026: I Built the Same App in Both

Davide Mibelli — Thu, 21 May 2026 11:10:43 +0000

I spent three weekends building the same app twice. Not as an experiment — I had a real decision to make. We're adding a mobile companion to one of our internal tools at work, a task management system that runs on Spring Boot microservices. My team has touched Dart before; I'm more comfortable with JavaScript. Neither of us had shipped serious mobile work in the last two years. The only honest way to pick a stack was to build something real in both frameworks and compare what I found.

The app I built isn't a toy. It pulls tasks from a REST API, caches them locally with SQLite, fires scheduled push notifications when deadlines approach, and has smooth list animations on a bottom-nav layout with three tabs. Simple enough to build in a weekend. Complex enough to expose real differences between the two frameworks.

Here's what I found — including a result that surprised me.

The App

Before comparing, let me be specific about what I built. Both versions:

Fetch tasks from a paginated REST endpoint with auth headers
Cache responses locally using SQLite
Support optimistic updates when marking a task complete
Schedule local push notifications 1 hour before each task's due time
Use a bottom navigation bar with three tabs

This isn't production code — it's a controlled comparison. Both apps hit the same mock API running on my MacBook. Same UI design, same feature set, different frameworks.

Setup and Developer Experience

Flutter's setup is more involved upfront. You run flutter doctor, chase down Android SDK versions, configure Xcode command-line tools. On a clean Mac it took me about 45 minutes to get a green build on both simulators. The error messages are usually specific enough to guide you, but it's mechanical work.

React Native with Expo took 15 minutes. npx create-expo-app, pick a template, run npx expo start, scan the QR code. Done. The friction delta is real, especially if you're onboarding a team that's new to mobile development.

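The catch: Expo's managed workflow works brilliantly until you need a native module that isn't in Expo's SDK and has no config plugin. Then you generate the native projects with npx expo prebuild and switch to a development build (the old expo eject command is gone), and you're roughly back to the same complexity as a bare React Native setup. For this test app, I stayed in managed Expo and hit no walls.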

Writing UI Code

This is where the frameworks feel most different day-to-day.

Flutter uses widgets — everything is a widget, composable and explicit. Here's the task list item:

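import 'package:flutter/material.dart';

// Task is the app's own model class (title, dueAt, completed), defined elsewhere.
class TaskTile extends StatelessWidget {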
  final Task task;
  final VoidCallback onComplete;

  const TaskTile({super.key, required this.task, required this.onComplete});

  @override
  Widget build(BuildContext context) {
    return ListTile(
      title: Text(
        task.title,
        style: task.completed
            ? const TextStyle(decoration: TextDecoration.lineThrough)
            : null,
      ),
      subtitle: Text(task.dueAt.toLocal().toString().substring(0, 16)),
      trailing: task.completed
          ? const Icon(Icons.check_circle, color: Colors.green)
          : IconButton(
              icon: const Icon(Icons.radio_button_unchecked),
              onPressed: onComplete,
            ),
    );
  }
}

React Native with TypeScript:

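import { Text, TouchableOpacity, View } from 'react-native';
import { Ionicons } from '@expo/vector-icons';

// Task, styles, and formatDate are defined elsewhere in the app.
type TaskTileProps = {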
  task: Task;
  onComplete: () => void;
};

export function TaskTile({ task, onComplete }: TaskTileProps) {
  return (
    <TouchableOpacity
      style={styles.row}
      onPress={onComplete}
      disabled={task.completed}
    >
      <View style={styles.info}>
        <Text style={[styles.title, task.completed && styles.completed]}>
          {task.title}
        </Text>
        <Text style={styles.due}>{formatDate(task.dueAt)}</Text>
      </View>
      {task.completed ? (
        <Ionicons name="checkmark-circle" size={22} color="#4CAF50" />
      ) : (
        <Ionicons name="radio-button-off" size={22} color="#999" />
      )}
    </TouchableOpacity>
  );
}

Both are readable. The Flutter version is more verbose but more structured — you're always in an explicit tree of typed widgets. The React Native version feels immediately familiar if you write React for the web, which is either an advantage or a liability depending on your team's background.

One thing I noticed: Flutter's hot reload was faster and more reliable than React Native's Fast Refresh throughout a full day of development. Not dramatically, but consistently.

State Management

I used Riverpod in Flutter and Zustand in React Native. Both are lightweight and explicit. I deliberately avoided Provider or Redux for a project this size.

Flutter with Riverpod:

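import 'package:flutter_riverpod/flutter_riverpod.dart';

// Task, TasksApi, and _fetchTasks are defined elsewhere in the app.
final tasksProvider = AsyncNotifierProvider<TasksNotifier, List<Task>>(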
  TasksNotifier.new,
);

class TasksNotifier extends AsyncNotifier<List<Task>> {
  @override
  Future<List<Task>> build() => _fetchTasks();

  Future<void> completeTask(String id) async {
    final previous = state;
    state = AsyncData(
      state.value!
          .map((t) => t.id == id ? t.copyWith(completed: true) : t)
          .toList(),
    );
    try {
      await TasksApi.complete(id);
    } catch (_) {
      state = previous;
    }
  }
}

React Native with Zustand:

import { create } from 'zustand';

interface TaskStore {
  tasks: Task[];
  loading: boolean;
  fetchTasks: () => Promise<void>;
  completeTask: (id: string) => Promise<void>;
}

export const useTaskStore = create<TaskStore>((set, get) => ({
  tasks: [],
  loading: false,
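  fetchTasks: async () => {
    set({ loading: true });
    try {
      const tasks = await TasksApi.fetchAll();
      set({ tasks });
    } finally {
      // Clear the flag even when the request fails, so the UI never stays stuck on loading.
      set({ loading: false });
    }
  },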
  completeTask: async (id) => {
    const previous = get().tasks;
    set({
      tasks: previous.map((t) => (t.id === id ? { ...t, completed: true } : t)),
    });
    try {
      await TasksApi.complete(id);
    } catch {
      set({ tasks: previous });
    }
  },
}));

The patterns are nearly identical — optimistic update, rollback on failure. The ergonomics are a wash. If you already know one of these libraries, it takes an afternoon to feel comfortable with the other.

Performance

This is where the narrative shifted for me.

Flutter renders through its own engine. Impeller is now the default on both iOS and Android in 2026. It doesn't use native UI components — it draws everything itself. This means frame-perfect consistency across platforms and animations that don't depend on a JavaScript thread.

React Native's New Architecture (JSI, Fabric, and TurboModules), enabled by default since RN 0.76, replaced the old asynchronous bridge. JavaScript can now call into native code directly, and synchronously when it needs to. This closed a significant performance gap in 2024-2025.

In practice, for this app, I couldn't reliably tell the difference on a modern device. Both hit 60fps on the list scroll. The gap appears when you push harder — complex custom animations, heavy computation, very long lists. Flutter still wins there, but you have to be building something demanding to care.

Build sizes: Flutter release APK was 21MB. React Native bare was 18MB. Flutter carries its engine everywhere.

Ecosystem and Third-Party Libraries

React Native benefits from the entire JavaScript ecosystem. Need a date picker, a chart library, a PDF generator? There's an npm package. The question is always whether it has real native bindings or is a pure-JS fallback.

Flutter's pub.dev is smaller but more curated. Packages tend to be higher quality and better maintained because the community is more focused. For common needs — HTTP clients, SQLite, push notifications, state management — the Flutter ecosystem is solid.

Where Flutter still lags: some SDKs ship their React Native version first and Flutter second, sometimes months later. Analytics tools, some payment gateways, specific third-party integrations. If your app depends on one of these, check the Flutter SDK availability before committing.

For push notifications specifically, both flutter_local_notifications and Expo Notifications worked cleanly in my test. No meaningful difference in API quality. Flutter's setup requires touching native config files directly; Expo abstracts that away.

What I Shipped

I shipped the Flutter version. The decision came down to factors specific to my situation.

UI consistency — our app has custom list animations and a design language that needs to look identical on iOS and Android. Flutter's renderer guaranteed that without platform divergence.

Existing Dart exposure — my team had briefly used Flutter in 2023. The ramp-up cost was lower than switching to a mobile-specific React/TypeScript setup everyone would need to learn from scratch.

No exotic SDK requirements — we don't depend on any third-party SDK that would put us at risk of the "Flutter SDK coming soon" problem.

Desktop is on the roadmap — Flutter's single codebase across mobile, desktop, and web matters for us. We already run a Spring Boot backend; adding a Flutter desktop client for internal ops is straightforward.

If our team had been JS-heavy, or if we'd needed to share components with a web frontend, React Native would have been the right call.

The Honest Summary

Flutter makes more sense when:

Pixel-perfect custom UI — you control the renderer entirely
Performance-heavy animations — no JS thread bottleneck
Targeting beyond mobile — Flutter Desktop and Web are real options in 2026
Team is willing to learn Dart — the investment is smaller than it looks

React Native makes more sense when:

JS-heavy team — zero ramp-up if your team already knows React
Sharing logic with web — React Native Web and shared business logic are mature
Expo speed for MVPs — managed Expo is genuinely the fastest path to a working app
Broad SDK availability — some third-party SDKs still prioritize RN bindings

Neither framework is a mistake in 2026. The "Flutter vs React Native" war of 2019-2022 produced a lot of hot takes that aged poorly. Both ship real production apps. Both have active communities. Both have converged enough that your team's background matters more than the framework's raw capabilities.

The one thing I'd push back on: don't pick React Native just because everyone on your team knows JavaScript. Mobile development has enough platform-specific quirks — permission flows, background tasks, notification entitlements, keychain storage — that the framework choice is secondary to understanding how iOS and Android actually work. That learning is required either way.

What are you running in mobile production in 2026? And if you switched frameworks at some point — Flutter to React Native or the reverse — what finally pushed you over the line?

Originally published on Medium.

JWT vs Session Tokens in Spring Boot: A Senior Dev's Decision Guide

Davide Mibelli — Thu, 21 May 2026 11:09:52 +0000

Three years ago I gave the same answer every time someone asked me about authentication in Spring Boot: "use JWT, it's stateless, it scales." I was half right and half wrong, and it took inheriting two production codebases — one broken in a very specific way — to understand which half was which.

This is not a tutorial on how to implement either one. It's the decision guide I wish I'd had before I started recommending JWT by default.

What tutorials actually teach you

Most Spring Boot security tutorials walk you through JWT because it makes for a cleaner demo. You add a filter, validate a signature, set the SecurityContext, done. No database calls, no shared state, stateless by construction. It feels architecturally clean.

What they rarely show: what happens when a user changes their password. Or gets their account suspended. Or logs out on one device and expects that to mean something on all devices. With a pure JWT setup and no blocklist, the answer to all three is "nothing happens until the token expires."

Sessions, on the other hand, feel old-fashioned. "That doesn't scale." "You need sticky sessions." Neither of those is true anymore, and I'll show you why.

How sessions actually work in Spring Boot

Spring Session with Redis is two annotations and a dependency:

@EnableRedisHttpSession
@Configuration
public class SessionConfig {
    // Spring Session handles everything else
}

<dependency>
    <groupId>org.springframework.session</groupId>
    <artifactId>spring-session-data-redis</artifactId>
</dependency>

The client gets an opaque session ID (by default a random UUID, Base64-encoded in the cookie value) stored in an HttpOnly; Secure cookie. Every request sends the cookie, Spring looks up the session in Redis, deserializes it, and populates the SecurityContext. The session data lives in Redis, not in the token itself.

Revocation is sessionRepository.deleteById(sessionId). Instant, no exceptions.

Horizontal scaling works out of the box — every instance connects to the same Redis. No sticky sessions needed. This is not 2009 anymore.
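
The lookup-and-revoke cycle described above can be sketched in plain Java. This is a toy under stated assumptions: a ConcurrentHashMap stands in for Redis, and InMemorySessionStore with its create and lookup methods are hypothetical names for illustration (only deleteById mirrors the Spring Session call mentioned above).

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the shared store; in the real setup Redis plays this role.
class InMemorySessionStore {
    private final Map<String, String> userBySessionId = new ConcurrentHashMap<>();

    // Issue an opaque, random session ID; the user data stays on the server side.
    String create(String username) {
        String sessionId = UUID.randomUUID().toString();
        userBySessionId.put(sessionId, username);
        return sessionId;
    }

    // What happens on every request: look the ID up in the shared store.
    Optional<String> lookup(String sessionId) {
        return Optional.ofNullable(userBySessionId.get(sessionId));
    }

    // Revocation: delete the entry, and the very next lookup fails.
    void deleteById(String sessionId) {
        userBySessionId.remove(sessionId);
    }
}

class SessionStoreDemo {
    public static void main(String[] args) {
        InMemorySessionStore store = new InMemorySessionStore();
        String id = store.create("alice");
        System.out.println(store.lookup(id).isPresent()); // true
        store.deleteById(id);
        System.out.println(store.lookup(id).isPresent()); // false
    }
}
```

Because every application instance would talk to the same store, it does not matter which instance handles the next request, which is why sticky sessions are not needed.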

JWT: what you actually get

A JWT is three base64url-encoded parts joined by dots: a JSON header, a JSON payload (the claims), and a signature over the first two, computed with a secret or private key. The server does not store it. Verification happens locally by checking the signature: no database call, no network hop.
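
To make "verification happens locally" concrete, here is a minimal HS256 sketch using only the JDK. MiniJwt is a hypothetical class for illustration, not a library API, and it deliberately skips everything a real decoder must also do (checking the alg header, expiry, issuer, audience), so treat it as a picture of the signature step only.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class MiniJwt {
    private static final Base64.Encoder B64 = Base64.getUrlEncoder().withoutPadding();

    // HMAC-SHA256 over "header.payload", base64url-encoded without padding.
    static String signature(String signingInput, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return B64.encodeToString(mac.doFinal(signingInput.getBytes(StandardCharsets.UTF_8)));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Build header.payload.signature from a JSON payload string.
    static String sign(String payloadJson, byte[] secret) {
        String header = B64.encodeToString(
            "{\"alg\":\"HS256\",\"typ\":\"JWT\"}".getBytes(StandardCharsets.UTF_8));
        String payload = B64.encodeToString(payloadJson.getBytes(StandardCharsets.UTF_8));
        String signingInput = header + "." + payload;
        return signingInput + "." + signature(signingInput, secret);
    }

    // Local verification: recompute the signature and compare in constant time.
    static boolean verify(String token, byte[] secret) {
        String[] parts = token.split("\\.");
        if (parts.length != 3) {
            return false;
        }
        String expected = signature(parts[0] + "." + parts[1], secret);
        return MessageDigest.isEqual(
            expected.getBytes(StandardCharsets.UTF_8),
            parts[2].getBytes(StandardCharsets.UTF_8));
    }
}
```

Nothing in verify touches a database or the network; all it needs is the token and the key. That is the whole benefit, and also the reason a valid token cannot be taken back.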

This matters in two specific situations:

Microservices that verify tokens independently. If you have five services and each needs to know who the caller is, sessions require every service to call a shared store or a central auth service. JWT lets each service verify the token locally with just the public key or shared secret.

Third-party and mobile clients. Browsers handle cookies automatically. Native apps and third-party API clients do not. JWT in the Authorization header works everywhere without cookie configuration.

Outside these two cases, you are paying the JWT costs without getting the JWT benefits.

The costs are real:

No revocation without a blocklist. A 15-minute access token cannot be invalidated early. A stolen token is valid until expiry. If you add a Redis blocklist to check on every request, you have just re-added the database call you were trying to avoid.
Token size. A session cookie is around 50 bytes. A JWT with a handful of claims is 300–600 bytes in every request header, forever. In high-frequency internal APIs this adds up.
Implementation surface. JWT has a long history of security bugs: alg: none attacks, weak HMAC secrets, missing expiry validation, incorrect audience checks. Spring Security handles most of this correctly, but the complexity budget is higher than sessions.

The real decision framework

Stop asking "which is better?" and ask "what do I actually need?"

Use sessions when:

You control the frontend — a browser-based app using your own backend
You need immediate revocation: logout means logout, password change means all sessions die
You are building a monolith or a small service that owns its own auth
You are already running Redis for caching or queuing

Use JWT when:

Multiple independent services need to verify identity without calling a central store
You have non-browser clients — mobile apps, CLI tools, third-party integrations — that cannot easily handle cookies
You need federated identity: a token issued by an external IdP (Auth0, Keycloak, Cognito) that your service validates

Do not use JWT because:

"It's stateless and scales" — sessions on Redis scale just as well across any number of instances
"Everyone uses it" — cargo-culting security decisions is how you end up with 7-day non-revocable tokens in production

The hybrid setup nobody talks about

Most production systems I've seen that do this well use a combination: sessions for the browser frontend, JWT for service-to-service calls and mobile clients.

Spring Security supports this cleanly with multiple SecurityFilterChain beans:

@Bean
@Order(2)
public SecurityFilterChain jwtFilterChain(HttpSecurity http) throws Exception {
    http
        .securityMatcher("/api/v1/**")
        .sessionManagement(s -> s.sessionCreationPolicy(SessionCreationPolicy.STATELESS))
        .csrf(AbstractHttpConfigurer::disable)
        .addFilterBefore(jwtAuthFilter, UsernamePasswordAuthenticationFilter.class)
        .authorizeHttpRequests(auth -> auth.anyRequest().authenticated());
    return http.build();
}

@Bean
@Order(1)
public SecurityFilterChain sessionFilterChain(HttpSecurity http) throws Exception {
    http
        .securityMatcher("/web/**")
        .sessionManagement(s -> s.sessionCreationPolicy(SessionCreationPolicy.IF_REQUIRED))
        .formLogin(Customizer.withDefaults())
        .logout(logout -> logout.logoutSuccessUrl("/web/login?logout"))
        .authorizeHttpRequests(auth -> auth
            .requestMatchers("/web/login").permitAll()
            .anyRequest().authenticated());
    return http.build();
}

Two chains, different matchers, different strategies. The browser app gets sessions and full revocation. API and service calls get JWT. Spring applies them in @Order sequence — the first matching chain wins.
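
The "first matching chain wins" rule is easy to picture outside Spring. The sketch below is plain Java, not Spring Security's implementation: Chain and select are hypothetical names, and startsWith is a rough stand-in for the real path matchers.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

class ChainSelection {
    // Hypothetical stand-in for a filter chain: an order, a name, and a path matcher.
    static final class Chain {
        final int order;
        final String name;
        final Predicate<String> matcher;

        Chain(int order, String name, Predicate<String> matcher) {
            this.order = order;
            this.name = name;
            this.matcher = matcher;
        }
    }

    // Sort by order, then return the first chain whose matcher accepts the path.
    static Optional<String> select(List<Chain> chains, String path) {
        return chains.stream()
            .sorted(Comparator.comparingInt((Chain c) -> c.order))
            .filter(c -> c.matcher.test(path))
            .map(c -> c.name)
            .findFirst();
    }

    public static void main(String[] args) {
        List<Chain> chains = List.of(
            new Chain(2, "jwt", p -> p.startsWith("/api/v1/")),
            new Chain(1, "session", p -> p.startsWith("/web/")));
        System.out.println(select(chains, "/web/login"));    // Optional[session]
        System.out.println(select(chains, "/api/v1/tasks")); // Optional[jwt]
    }
}
```

The practical consequence: if two matchers overlap, the chain with the lower order value silently takes the request, so keep the matchers disjoint or order them from most to least specific.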

What the performance difference actually looks like

Sessions add one Redis round-trip per request — typically 0.5–2ms on a well-configured local Redis, 2–5ms if Redis is in a separate availability zone. For a request that already takes 50–200ms to process, that is noise.

JWT validation is in-memory: parse the base64, verify the HMAC signature, check expiry. Sub-millisecond. If you are building an API that needs to handle thousands of requests per second with microsecond budgets, this difference matters. If you are building a standard web application or a business API, it does not.

The token size difference matters more than you would expect in aggregate. A session cookie is SESSION=<session-id>, about 50 bytes in the Cookie header. A JWT is Authorization: Bearer <base64>, typically 400–700 bytes in the Authorization header, on every single request. In an application with 10,000 active users making 20 requests per hour (200,000 requests per hour in total), that is roughly 80–140MB/hour in header overhead versus about 10MB with sessions. On internal microservice APIs with high call frequency, this adds up to real cost.
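
The header overhead is simple multiplication, so it is worth running with your own traffic numbers. A small sketch, using the user count, request rate, and header sizes quoted above as inputs; HeaderOverhead is a hypothetical helper name.

```java
class HeaderOverhead {
    // Total credential bytes sent per hour: users x requests per user x bytes per request.
    static long bytesPerHour(long activeUsers, long requestsPerUserPerHour, long bytesPerRequest) {
        return activeUsers * requestsPerUserPerHour * bytesPerRequest;
    }

    public static void main(String[] args) {
        long users = 10_000;
        long requestsPerUserPerHour = 20;
        // 50-byte session cookie, then the low and high end of the JWT size range.
        System.out.println(bytesPerHour(users, requestsPerUserPerHour, 50) / 1_000_000 + " MB/hour");  // 10 MB/hour
        System.out.println(bytesPerHour(users, requestsPerUserPerHour, 400) / 1_000_000 + " MB/hour"); // 80 MB/hour
        System.out.println(bytesPerHour(users, requestsPerUserPerHour, 700) / 1_000_000 + " MB/hour"); // 140 MB/hour
    }
}
```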

Neither of these is a reason to choose one over the other by itself. They are factors to weigh against the architectural fit, not arguments that make the decision for you.

Security mistakes that are easy to make with JWT

Spring Security protects you from many JWT pitfalls if you use it correctly, but I have seen all of these in production codebases:

Symmetric secret too short. HS256 requires at least 256 bits (32 bytes). A short secret is brute-forceable. Generate it with openssl rand -base64 32 and store it in your secrets manager, not in application.yml.

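No audience or issuer validation. If you have multiple services accepting the same JWT, a token issued for service A can be replayed against service B unless you validate the aud and iss claims. Spring Security supports this: express the checks as JwtClaimValidator instances (JwtValidators.createDefaultWithIssuer(...) covers iss) and register them on the decoder with NimbusJwtDecoder.setJwtValidator(...).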
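
Stripped of framework wiring, the check itself is small. The sketch below works on already-decoded claims held in a plain Map; ClaimChecks, the issuer URL, and the service names are hypothetical, and in a real application this logic belongs in the decoder's validators rather than in hand-rolled code.

```java
import java.util.List;
import java.util.Map;

class ClaimChecks {
    // Accept a token only if it comes from the expected issuer AND is addressed to this service.
    static boolean isAcceptable(Map<String, Object> claims, String expectedIssuer, String thisService) {
        Object iss = claims.get("iss");
        Object aud = claims.get("aud");
        boolean issuerOk = expectedIssuer.equals(iss);
        // The aud claim may be a single string or a list of strings.
        boolean audienceOk = (aud instanceof String && aud.equals(thisService))
            || (aud instanceof List && ((List<?>) aud).contains(thisService));
        return issuerOk && audienceOk;
    }

    public static void main(String[] args) {
        Map<String, Object> tokenForA = Map.of(
            "iss", "https://auth.example.internal",
            "aud", List.of("service-a"));
        System.out.println(isAcceptable(tokenForA, "https://auth.example.internal", "service-a")); // true
        System.out.println(isAcceptable(tokenForA, "https://auth.example.internal", "service-b")); // false
    }
}
```

A token with no aud claim at all is rejected here, which is the safe default: a missing audience should never be read as "valid everywhere".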

Logging the token. Access logs, debug statements, error traces. A JWT is a credential. Treat it like a password in your logging configuration.

Using RS256 in a monolith. RS256 (asymmetric) makes sense when multiple services need to verify tokens issued by a single auth service. In a monolith where only one service issues and verifies tokens, RS256 adds key management complexity with no security benefit. HS256 is the right default.

The mistake I see most often

Teams start with JWT because a tutorial recommended it. Six months later, they need to implement "logout from all devices" or "force re-authentication after a password change." At that point they bolt on a Redis blocklist to invalidate tokens before expiry — which means every JWT validation now hits Redis on every request. They have kept all the JWT complexity and added the session store on top.

I have done this. The resulting code is harder to reason about than either pure sessions or pure JWT would have been.
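
A toy version of the bolted-on blocklist shows why it defeats the purpose. This is a sketch, not production code: a Set stands in for the Redis blocklist, signatureValid stands in for the local signature check, and BlocklistedJwtCheck is a hypothetical name. The counter only exists to make the per-request store lookup visible.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class BlocklistedJwtCheck {
    // Stand-in for the Redis blocklist of revoked token IDs.
    private final Set<String> revokedTokenIds = ConcurrentHashMap.newKeySet();
    private final AtomicInteger storeLookups = new AtomicInteger();

    void revoke(String tokenId) {
        revokedTokenIds.add(tokenId);
    }

    // signatureValid represents the local, in-memory check that JWT was chosen for.
    boolean accept(String tokenId, boolean signatureValid) {
        if (!signatureValid) {
            return false;
        }
        // The round-trip JWT was supposed to avoid, now paid on every valid request.
        storeLookups.incrementAndGet();
        return !revokedTokenIds.contains(tokenId);
    }

    int storeLookups() {
        return storeLookups.get();
    }

    public static void main(String[] args) {
        BlocklistedJwtCheck check = new BlocklistedJwtCheck();
        System.out.println(check.accept("token-1", true)); // true
        check.revoke("token-1");
        System.out.println(check.accept("token-1", true)); // false
        System.out.println(check.storeLookups());          // 2
    }
}
```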

If you need revocation, use sessions. If you genuinely need stateless cross-service verification, use JWT and design around the revocation limitation deliberately — 15-minute access tokens, refresh token rotation, a clear session invalidation policy documented before you ship — not as an afterthought six months later.

The boring answer is often the right one. Spring Session with Redis has been solving this problem correctly since 2015. JWT solves a specific distributed systems problem. Know which one you actually have before you choose.

What does your current setup use, and was it a deliberate choice or did it come from a tutorial?

Originally published on Medium.

The AI Coding Agent Workflow That Actually Works After 1,000 Hours

Davide Mibelli — Tue, 19 May 2026 07:57:45 +0000

The first time I gave an AI agent real autonomy on a production codebase, it confidently refactored a utility method that happened to share a name with a method in a Feign client interface six modules away. The code compiled cleanly. My unit tests passed. Staging broke in a way that took two hours to trace because the JSON serialization behavior had subtly changed.

That was roughly hour 200. I've now crossed 1,000 hours of daily use across projects — Spring Boot microservices, a Flutter mobile app, Python data pipelines, some Go tooling. The workflow I use today is unrecognizable compared to what I started with, and most of what I thought I knew in those first few months was wrong.

The gap between "AI assistant that occasionally saves time" and "force multiplier that ships reliably" isn't about which model you use or which IDE plugin you install. It's almost entirely about how you structure the work before you hand it off.

I'm going to describe what actually works. Not the ideal case — the real case, including where it fails.

The task scoping problem nobody talks about

The most common mistake I see from developers new to agents is giving them goals instead of tasks. "Add authentication to this service" is a goal. An agent handed a goal will make dozens of implicit decisions: which library to use, where to put the filter, how to handle token expiry, whether to add a config property or hardcode something. Each decision is individually reasonable. Collectively, they often produce something that doesn't fit your codebase at all.

The mental model shift that helped me most: treat the agent like an extremely fast junior developer who has read your entire codebase but has no knowledge of your team's unwritten conventions. They'll do exactly what you said, quickly, and be confused why you're upset.

A well-scoped task has a specific file or class to modify, the exact behavior that needs to change (not the outcome you're hoping for), explicit constraints ("don't change the method signature", "stay within this module", "don't add new dependencies"), and a clear definition of done you can verify yourself in under two minutes.

Instead of "add rate limiting to the user endpoints", I write: "Add a rate limiting filter to UserController.java. Use Bucket4j; it's already in the pom. Rate limit to 100 requests per minute per IP using the X-Forwarded-For header. Add a test in UserControllerTest that verifies a 429 is returned on the 101st request within the same minute. Don't touch any other controllers." That's a task. The first thing was a wish.

Loading context is not the same as prompting

Early on I treated context like a formality — paste in the file, ask the question. What I've learned is that the context you load shapes the entire response, not just the part you're asking about.

For anything non-trivial, I now explicitly load the file being modified, its direct dependencies (the interfaces it implements, the classes it calls), the relevant test file, and any configuration that affects behavior — the relevant application.yml sections, env variables, that kind of thing.

I also say out loud what the agent should NOT need to look at. "Ignore the other controllers. The auth logic is handled upstream in the filter chain — you don't need to worry about it." This sounds redundant but it prevents the agent from going exploring in directions that add noise to the output.

When working in a large Spring Boot monolith, I'll often start a session by describing the module structure explicitly: "This project has five modules. We're only working in user-service. The common module has shared DTOs — you can read it but don't modify it." A few sentences of orientation saves many paragraphs of correction later.

Always plan before you code

The pattern that changed my output quality the most: never go straight to code on anything that touches more than one file.

With Claude Code I use /plan or just ask explicitly for a plan before any implementation. Not because I don't trust the agent to code — but because catching a wrong assumption at the plan stage costs ten seconds. Catching it after the agent has modified seven files costs twenty minutes of untangling and a lot of git checkout.

A plan review also surfaces things I forgot to mention. If the agent's plan includes "add a new UserRepository method", I realize I forgot to say we use a custom JPQL query and we don't add raw methods to the repository interface. That correction takes one sentence before coding. After coding, it's a rewrite.

For tasks that span more than one logical step, I'll ask for the plan broken into phases: "Phase 1: create the new DTO. Phase 2: update the service. Phase 3: update the controller. Phase 4: update the tests." Then I review each phase before proceeding. This is slower than letting it run, but "slower" means an extra three minutes, and it eliminates the whole class of errors where step 4 assumes something about step 2 that's already wrong.

What to never delegate

This is the part most productivity takes leave out.

Database migrations. I write every Flyway migration script myself. An agent will generate syntactically correct SQL that does the wrong thing to your data, and you often won't catch it until you run it. The cost of a wrong migration is too high.

Security logic. JWT validation, permission checks, role hierarchies. I'll let the agent scaffold the structure, but I write the actual predicate logic myself. It's not that agents are bad at it — it's that I need to understand every line of security-sensitive code personally, and "the agent wrote it and I reviewed it" isn't the same as understanding it.

Anything touching shared state in a concurrent context. Thread pool sizing, cache invalidation, queue consumer configuration. I've watched agents write perfectly reasonable-looking code in this area that had subtle race conditions surfacing only under load. Spring's @Async behavior has enough gotchas around exception handling and thread context propagation that I don't trust generated code here without very careful review.

API contracts published to other teams. If I'm changing a REST endpoint or a Kafka message schema that another service consumes, I write the change myself. Contract changes need human intent.

The actual workflow

Here's what a typical feature task looks like now:

1. Write the task spec (bounded, with explicit constraints)
2. Load relevant context explicitly
3. Ask for a plan — review it, correct it
4. Execute phase by phase, reviewing output between phases
5. Run the tests the agent wrote, then run the broader test suite
6. Read the diff, not the agent's summary
7. Commit with a message I write myself

Step 6 is worth repeating: read the actual diff, not the agent's description of what it did. Agents are optimistic narrators. The diff is the ground truth.

The phase execution loop is where most of the real work happens. On a task touching four files, I'll typically have one or two corrections mid-way. That's normal. The correction looks like: "The service method is correct but don't call userRepository.save() directly — use UserService.update() which already handles audit logging. Revise phase 3."

Patterns that work in Java/Spring Boot land

Test first, always. When I ask an agent to add a feature, I almost always ask it to write the test first. Not for TDD philosophy — because a test forces the agent to think about the interface before the implementation. The test spec is a contract. I review it, approve it, then ask for the implementation.

// Ask for this first:
@Test
void shouldReturnUnauthorizedWhenTokenExpired() {
    String expiredToken = tokenGenerator.generateExpiredToken(userId);

    mockMvc.perform(get("/api/users/me")
            .header("Authorization", "Bearer " + expiredToken))
           .andExpect(status().isUnauthorized())
           .andExpect(jsonPath("$.code").value("TOKEN_EXPIRED"));
}

If the agent can write that test clearly, it understands the requirement. If it hedges or makes assumptions in the test itself, I go back and clarify before going further.

Name your constraints explicitly in the prompt. Spring's ecosystem has a lot of ways to do the same thing. "Add caching" could mean @Cacheable, Redis directly, Caffeine, a manual ConcurrentHashMap. I name the technology: "Use @Cacheable with our existing Redis CacheManager bean. Cache name is user-profiles. TTL is already configured in CacheConfig.java."

Ask for the unhappy path. Default agent output handles the happy path well and glosses over error cases. I ask explicitly: "Also handle the case where the external payment service returns a 503. Retry once after 500ms using @Retryable, then throw a PaymentServiceUnavailableException that the controller maps to a 502."
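
Spelling the unhappy path out this precisely pays off because the behavior is then easy to check. As an illustration, here is the retry-once shape from that prompt in plain Java, without Spring Retry. RetryOnce and UpstreamUnavailableException are hypothetical names; the delay is a parameter so a test can pass 0 instead of waiting 500ms.

```java
import java.util.function.Supplier;

class RetryOnce {
    // Hypothetical exception standing in for "the payment service returned a 503".
    static class UpstreamUnavailableException extends RuntimeException {}

    static class PaymentServiceUnavailableException extends RuntimeException {
        PaymentServiceUnavailableException(Throwable cause) {
            super(cause);
        }
    }

    // Try once, wait, try once more, then give up with a domain exception.
    static <T> T callWithOneRetry(Supplier<T> call, long delayMillis) {
        try {
            return call.get();
        } catch (UpstreamUnavailableException first) {
            sleep(delayMillis);
            try {
                return call.get();
            } catch (UpstreamUnavailableException second) {
                throw new PaymentServiceUnavailableException(second);
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Preserve the interrupt flag for callers that care about it.
            Thread.currentThread().interrupt();
        }
    }
}
```

Two cases are worth asserting when reviewing the agent's version: a call that fails once and then succeeds returns normally after exactly two attempts, and a call that always fails surfaces the domain exception after exactly two attempts, not three.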

When it breaks — and what to do

Agents go off-rails. It happens less at hour 1,000 than it did at hour 200, but it still happens. The failure modes I see most:

Scope creep. The agent "fixes" something nearby that it noticed while working. The fix is usually not wrong, but it's unexpected and untested. My defense: explicit "do not change anything outside of [specific files]" language in the task, plus reading the diff carefully.

Hallucinated APIs. Especially in less-common libraries, agents will confidently use methods that don't exist. In Spring, this tends to happen with newer APIs or module-specific features. Running the code is the only reliable check — code review misses it sometimes.

The test that tests nothing. An agent writes a test that passes trivially because it's testing a mock returning a mock. I check by asking: "What production behavior would break this test if I deleted it?" If the answer is "nothing," the test is useless.
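
A stripped-down illustration of the difference, in plain Java with a hand-written stub instead of a mocking library. DiscountService, PriceSource, and both checks are hypothetical examples, not code from a real project.

```java
class DiscountService {
    interface PriceSource {
        int priceInCents(String sku);
    }

    private final PriceSource prices;

    DiscountService(PriceSource prices) {
        this.prices = prices;
    }

    // Production behavior under test: 10% off, rounded down to whole cents.
    int discountedPrice(String sku) {
        return prices.priceInCents(sku) * 90 / 100;
    }
}

class TestsThatTestNothing {
    // Useless: it only proves the stub returns what it was told to return.
    // No production code can break it.
    static boolean stubEchoCheck() {
        DiscountService.PriceSource stub = sku -> 1000;
        return stub.priceInCents("any-sku") == 1000;
    }

    // Useful: it fails if discountedPrice() is removed or its arithmetic changes.
    static boolean discountCheck() {
        DiscountService service = new DiscountService(sku -> 1000);
        return service.discountedPrice("any-sku") == 900;
    }

    public static void main(String[] args) {
        System.out.println(stubEchoCheck()); // true, and it always will be
        System.out.println(discountCheck()); // true only while the discount logic is intact
    }
}
```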

When a session goes badly wrong — multiple phases deep into a mess — I don't try to patch it. I discard, go back to the last clean commit, and restart with a more constrained task spec. Fighting a bad trajectory is slower than resetting. This took me too long to learn.

The honest productivity picture

After 1,000 hours, my throughput on certain task types is genuinely higher. Boilerplate-heavy work — DTOs, controller scaffolding, Flyway migration stubs, test setup code — goes roughly four times faster. Complex logic involving domain rules, concurrency, or security decisions goes maybe 20% faster because the agent handles the typing while I handle the thinking.

There's also a category where it's slower: anything where I spend more time specifying and reviewing than I'd spend just coding. For a ten-line method in a well-understood domain, writing a good prompt takes longer than writing the method. So I just write the method.

The productivity gains are real. They compound only when you stop treating the agent as a magic box and start treating it like a capable collaborator who needs clear direction, bounded scope, and explicit verification at each step.

What's your experience with task scoping? I'm curious whether the "goals vs tasks" distinction resonates with other teams, or whether there's a completely different framing that works better in your context.

Originally published on Medium.

RAG in Production: What the Tutorials Don't Tell You

Davide Mibelli — Thu, 14 May 2026 13:00:00 +0000

I built a RAG system that scored 91% on our internal eval suite. It retrieved the right chunks four out of five times in every benchmark we ran. We shipped it. Users thought it was broken.

The gap between "works in evaluation" and "works in production" is the thing every RAG tutorial skips. This article is what I learned closing that gap across three different production deployments — a customer support bot, an internal knowledge base, and a document Q&A tool for a legal team.

Why your evals lie to you

The typical RAG eval flow: take 50 question-answer pairs, run retrieval, score chunk relevance, measure answer quality. The benchmark looks good. Production does not.

The problem is that evaluation datasets are clean and real user questions are not. Users ask ambiguous things, reference context from earlier in the conversation, use company-specific jargon your embedding model has never seen, and ask questions that span multiple documents. Your 50-pair eval dataset does not cover any of this.

The more subtle problem: retrieval correctness is not the same as answer usefulness. A chunk can be semantically relevant to the query but contain outdated information, contradict another retrieved chunk, or be missing the specific number the user actually needs. Cosine similarity does not catch any of this.

Before you optimize retrieval metrics, instrument what users actually do. In my customer support deployment, the clearest signal was not retrieval recall — it was how often users rephrased their question immediately after getting an answer. That rephrasing rate was the real quality metric.
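A minimal sketch of how a rephrasing rate could be computed from session logs. The word-level similarity check and the 0.6 threshold are assumptions for illustration, not what the deployment above used; an embedding-based similarity would likely be more robust in practice.

```python
from difflib import SequenceMatcher


def is_rephrase(prev: str, curr: str, threshold: float = 0.6) -> bool:
    """True when two consecutive questions look like the same question reworded."""
    # Compare word sequences rather than characters so small typos matter less.
    ratio = SequenceMatcher(None, prev.lower().split(), curr.lower().split()).ratio()
    return ratio >= threshold


def rephrasing_rate(sessions: list[list[str]], threshold: float = 0.6) -> float:
    """Share of follow-up questions that closely repeat the question before them."""
    follow_ups = 0
    rephrases = 0
    for questions in sessions:
        # Walk consecutive (previous, current) question pairs within one session.
        for prev, curr in zip(questions, questions[1:]):
            follow_ups += 1
            if is_rephrase(prev, curr, threshold):
                rephrases += 1
    return rephrases / follow_ups if follow_ups else 0.0
```

Each inner list is one user session in chronological order. Sessions with a single question contribute nothing, since there is no follow-up to judge.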

Chunking is where it actually breaks

The default chunking strategy in most tutorials: split every 512 tokens with 50-token overlap. This is almost always wrong.

The problem is that 512 tokens is an arbitrary number based on older embedding model limits, not on the structure of your documents. A 512-token chunk cut out of the middle of a legal clause or a technical procedure is often meaningless without the surrounding context.

What actually works depends on your document type:

For structured documents (FAQs, product docs, knowledge base articles): chunk by logical unit — one question-answer pair, one procedure step, one concept section. Use your document's own structure as the chunking boundary. If your docs use consistent heading patterns, split on those.

For long-form prose (contracts, reports, research papers): hierarchical chunking. Keep a parent chunk of 1000–2000 tokens for context, and child chunks of 150–300 tokens for precise matching. At retrieval time, score relevance against the child chunks, but pass each matching child's parent chunk to the LLM as context.

For code documentation or READMEs: file-level or function-level chunks, never mid-function splits.
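The hierarchical scheme for long-form prose can be sketched roughly as follows. This is an illustration under simplifying assumptions: sizes are counted in whitespace-separated words as a stand-in for tokens, splits are fixed-width rather than structure-aware, and the names `build_hierarchy` and `context_for` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    text: str
    parent_id: int  # index of the parent chunk this child was cut from


def split_words(words: list[str], size: int) -> list[list[str]]:
    """Cut a word list into consecutive windows of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def build_hierarchy(
    text: str, parent_size: int = 1500, child_size: int = 200
) -> tuple[list[str], list[ChildChunk]]:
    """Return (parents, children); every child remembers which parent it came from."""
    parents: list[str] = []
    children: list[ChildChunk] = []
    for parent_id, parent_words in enumerate(split_words(text.split(), parent_size)):
        parents.append(" ".join(parent_words))
        for child_words in split_words(parent_words, child_size):
            children.append(ChildChunk(" ".join(child_words), parent_id))
    return parents, children


def context_for(matched: list[ChildChunk], parents: list[str]) -> list[str]:
    """Map matched children to their parents, de-duplicated, in first-hit order."""
    seen: list[int] = []
    for child in matched:
        if child.parent_id not in seen:
            seen.append(child.parent_id)
    return [parents[i] for i in seen]
```

Only the children would be embedded and searched. The de-duplication in `context_for` matters because several top-ranked children often share one parent, and sending that parent twice wastes context.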

The overlap parameter also matters more than people expect. Overlap exists to avoid losing information at chunk boundaries, but it inflates your vector store size and retrieves duplicate context. I found that semantic chunking — splitting at sentence boundaries that mark topic shifts — eliminated the need for overlap almost entirely.

Retrieval is rarely the problem you think it is

When RAG produces wrong answers, the instinct is to improve retrieval. Better embeddings, more chunks, higher top-k. This is usually the wrong lever.

In my experience, retrieval is failing only about 30% of the time when users report bad answers. The other 70% is one of:

The right chunk was retrieved but the LLM ignored it — this is a prompt engineering problem, not a retrieval problem
The answer requires synthesizing across multiple chunks — retrieval returned individually correct chunks but the LLM could not connect them
The question is genuinely unanswerable from the knowledge base — the document does not exist or is outdated

To separate these, add logging at both retrieval and generation time. Log the top-k chunks and the final answer separately. A human spot-check of 20 failure cases per week will show you very quickly which category you are actually in.
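One way to keep retrieval and generation separable in the logs is a single JSON line per request with the two stages in distinct fields. This is a sketch only: the field names, the `chunk_id`/`score` keys, and the `failure_category` slot for the weekly spot-check are assumptions, not a standard schema.

```python
import json
import time
import uuid


def format_rag_trace(query: str, retrieved: list[dict], answer: str) -> str:
    """Build one JSON log line that records retrieval and generation separately."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        # What retrieval returned, in rank order, independent of the answer.
        "retrieval": [
            {"chunk_id": c["chunk_id"], "score": c["score"], "rank": rank}
            for rank, c in enumerate(retrieved, start=1)
        ],
        # What the LLM produced from those chunks.
        "generation": {"answer": answer},
        # Filled in later by a human reviewer, e.g. "retrieval_miss",
        # "chunk_ignored", "needs_synthesis", or "unanswerable".
        "failure_category": None,
    }
    return json.dumps(record)
```

Appending these lines to a file or log stream is enough to pull 20 failure cases a week and label each one by category.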

When retrieval genuinely is the problem, the highest-leverage fix is hybrid search: dense retrieval (embeddings + cosine similarity) combined with sparse retrieval (BM25 keyword matching). Dense retrieval handles semantic similarity. BM25 handles exact matches — product names, error codes, version numbers, any term where "sounds like" is the wrong answer.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_chroma import Chroma

dense_retriever = Chroma(...).as_retriever(search_kwargs={"k": 10})
sparse_retriever = BM25Retriever.from_documents(docs, k=10)

ensemble = EnsembleRetriever(
    retrievers=[dense_retriever, sparse_retriever],
    weights=[0.6, 0.4]  # tune based on your query distribution
)

The weight split between dense and sparse depends on your documents. Technical documentation with lots of product names and version numbers benefits from higher BM25 weight. Conversational knowledge bases lean toward dense.
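To build intuition for what the weights do, here is a standalone sketch of weighted reciprocal rank fusion, which as far as I understand is the general approach EnsembleRetriever takes to merge result lists. This is not LangChain's code; the function name and the constant `c = 60` (a commonly used default for rank fusion) are assumptions for illustration.

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns weight / (c + rank) from every list it appears in,
    so a document ranked well by both retrievers beats one ranked first by only one.
    """
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_hits = ["a", "b", "c"]   # semantic matches, best first
sparse_hits = ["c", "a", "d"]  # keyword matches, best first
fused = weighted_rrf([dense_hits, sparse_hits], weights=[0.6, 0.4])
```

Because fusion works on ranks rather than raw scores, the dense and sparse scores never need to be on the same scale, which is what makes the weights easy to tune.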

Reranking changes the answer quality more than anything else

If there is one thing to add to a RAG pipeline that makes the biggest difference in production answer quality, it is a cross-encoder reranker between retrieval and generation.

The retrieval step uses bi-encoder embeddings — query and document are embedded independently, similarity is a dot product. This is fast but imprecise. A cross-encoder takes the query and a candidate chunk together as a single input and scores their relevance jointly. Much more accurate, but too slow to run over your entire corpus.

The standard pattern: retrieve top-20 with the fast bi-encoder, rerank with the cross-encoder, pass top-3 or top-5 to the LLM.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)  # sort on score only
    return [chunk for _, chunk in ranked[:top_n]]

In the legal document system, adding reranking reduced hallucinations in answers by roughly 40% — not because retrieval improved, but because the LLM stopped having to sort through marginally relevant context and could focus on the actually relevant chunks.

The infrastructure issues no tutorial shows

Context window budget. Top-5 chunks at 500 tokens each is 2500 tokens before you add the system prompt and the user message. With GPT-4 this is fine. With smaller models or high-volume APIs where you want to minimize token cost, you need explicit context budgeting. Know your LLM's context window, subtract your fixed prompt overhead, and set your chunk count and size to fit.

Stale chunks. Documents in production change. Your chunk embeddings do not update automatically. You need a pipeline that detects document changes (by checksum or last-modified timestamp) and re-embeds only the changed documents. I've seen teams manually re-index their entire corpus monthly as a workaround. That is not a solution for any corpus over a few thousand documents.

Conflicting information. If the same concept is documented in multiple places with different details, retrieval will return both. The LLM will either pick one arbitrarily or produce a contradictory answer. The fix is upstream — deduplicate your knowledge base and establish a single source of truth. Retrieval cannot save you from bad source data.
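The checksum-based change detection mentioned under stale chunks can be sketched as below. It assumes you persist one content hash per document at embed time; the function names and the two-dictionary interface are hypothetical.

```python
import hashlib


def checksum(text: str) -> str:
    """Stable content hash used to detect document changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents: dict[str, str], indexed: dict[str, str]) -> tuple[list[str], list[str]]:
    """Decide which documents need work.

    documents: doc_id -> current text
    indexed:   doc_id -> checksum stored when the document was last embedded
    Returns (to_embed, to_delete).
    """
    # New documents have no stored checksum; changed ones have a different one.
    to_embed = sorted(
        doc_id for doc_id, text in documents.items() if indexed.get(doc_id) != checksum(text)
    )
    # Documents that disappeared from the source should lose their chunks too.
    to_delete = sorted(doc_id for doc_id in indexed if doc_id not in documents)
    return to_embed, to_delete
```

Unchanged documents never reach the embedding model, so the cost of a sync scales with the amount of change rather than with corpus size.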

Context window management

One problem that grows slowly and then all at once: as your knowledge base expands and you increase top-k to improve recall, your context window fills up. You pass 10 chunks at 500 tokens each, plus system prompt, plus conversation history, and suddenly you are at 7000 tokens per request on a model with an 8000-token limit.

The naive fix is to switch to a model with a larger context window and hope it attends to the right parts. The correct fix is explicit context budgeting:

MAX_CONTEXT_TOKENS = 4000  # reserve headroom for prompt + answer
TOKENS_PER_CHUNK = 400     # approximate after chunking

max_chunks = MAX_CONTEXT_TOKENS // TOKENS_PER_CHUNK  # = 10

chunks = rerank(query, retrieve(query, k=20), top_n=max_chunks)

Calculate the budget before retrieval, not after. If you are hitting the limit regularly, reduce chunk size rather than reducing top-k: smaller chunks at the same top-k keep the same number of distinct passages in front of the model while using fewer tokens.
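A fixed tokens-per-chunk estimate drifts once chunk sizes vary, so a variant that counts each chunk and fills the budget greedily may be safer. This is a sketch under assumptions: `count_tokens` uses whitespace-separated words as a rough stand-in and should be replaced with your model's tokenizer, and the parameter names are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_budget(
    ranked_chunks: list[str],
    context_window: int,
    prompt_overhead: int,
    answer_reserve: int,
) -> list[str]:
    """Keep the best-ranked chunks that fit after fixed overhead is subtracted."""
    budget = context_window - prompt_overhead - answer_reserve
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:  # assumed sorted best-first, e.g. by the reranker
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit, preserving rank order
        selected.append(chunk)
        used += cost
    return selected
```

Stopping at the first chunk that overflows keeps the selection a strict prefix of the ranking. Skipping it and trying smaller lower-ranked chunks is a possible alternative, at the price of passing weaker context.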

What to actually monitor

Stop measuring retrieval precision in isolation. In production, instrument:

Rephrasing rate: how often a user asks a follow-up that is essentially the same question reworded. High rate means the answer was not useful, regardless of retrieval metrics.
Answer rejection rate: if your UI has thumbs-down feedback, track it. Correlate failures with retrieved chunk IDs to identify which documents produce bad answers consistently.
Latency by pipeline stage: retrieval, reranking, and generation each have different latency profiles and different optimization paths. Aggregate P95 latency tells you nothing useful about where to look.
Unanswerable rate: how often the LLM says "I don't have information on this." Below 5% and your system is probably hallucinating answers it should refuse. Above 30% and your knowledge base has coverage gaps.
Chunk age: track when each chunk was last re-indexed. Any chunk older than your document update frequency is potentially stale. This one takes ten minutes to add to your ingestion pipeline and saves hours of debugging mysterious wrong answers three months from now.
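Per-stage latency from the list above needs very little machinery. A minimal sketch, assuming an in-process timer and a nearest-rank percentile; the `StageTimer` class is hypothetical, and a real deployment would more likely export these samples to its metrics system.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations per pipeline stage."""

    def __init__(self) -> None:
        self.samples: dict[str, list[float]] = defaultdict(list)  # stage -> durations in ms

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even when the stage raises.
            self.samples[name].append((time.perf_counter() - start) * 1000)

    def p95(self, name: str) -> float:
        """Nearest-rank 95th percentile for one stage."""
        values = sorted(self.samples[name])
        if not values:
            raise ValueError(f"no samples for stage {name!r}")
        rank = (95 * len(values) + 99) // 100  # integer ceiling of 0.95 * n
        return values[rank - 1]
```

Usage is one `with timer.stage("retrieval"):` block around each step, repeated for reranking and generation, so each stage gets its own P95 instead of one aggregate number.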

The RAG tutorial teaches you to build a pipeline. Production teaches you to instrument one. The teams I've seen succeed were instrumenting before they had users, not after.

What is the failure mode you hit first in your own RAG deployment?

Originally published on Medium.

Spring Boot JWT Authentication: The Complete Setup Most Tutorials Get Wrong

Davide Mibelli — Tue, 12 May 2026 05:18:42 +0000

I've read probably forty Spring Boot JWT tutorials over the years. They all show you the same thing: how to generate a token on login, how to validate it on each request, and how to wire up SecurityFilterChain. And they all stop right there.

What they skip is everything that matters in production: refresh token rotation, token revocation, and keeping tokens out of JavaScript-readable storage when you don't have to expose them. I've inherited two codebases where a "working JWT implementation" turned out to be a security hole you could drive a truck through. This article is the setup I now use by default.

What the typical tutorial gives you

The standard walkthrough produces something like this:

@Bean
public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
    http.csrf(AbstractHttpConfigurer::disable)
        .sessionManagement(s -> s.sessionCreationPolicy(SessionCreationPolicy.STATELESS))
        .authorizeHttpRequests(auth -> auth
            .requestMatchers("/auth/**").permitAll()
            .anyRequest().authenticated())
        .addFilterBefore(jwtAuthFilter, UsernamePasswordAuthenticationFilter.class);
    return http.build();
}

This is fine. The filter reads the Authorization: Bearer <token> header, validates the signature, sets the SecurityContext. You can deploy this. The problem is what comes next.

The access token has an expiry — typically 15 to 60 minutes. What happens when it expires? In most tutorials: the user logs in again. In production: users complain, sessions die mid-work, and someone "fixes" it by setting expiry to 7 days. Now you have a long-lived token you can't revoke.

The right structure: short-lived access + rotating refresh

The pattern I use in every Spring Boot project now:

Access token: 15 minutes, stored in memory (JS variable or React state), never in localStorage
Refresh token: 7 days, stored in an HttpOnly; Secure; SameSite=Strict cookie, not accessible to JavaScript
Rotation: every refresh request issues a new refresh token and invalidates the old one
Revocation store: a small table (or Redis set) of invalidated refresh token IDs

Here's the refresh token entity:

@Entity
@Table(name = "refresh_tokens")
@Getter
@Setter // Lombok accessors; the service below relies on setId(), isRevoked(), etc.
public class RefreshToken {
    @Id
    private String id; // UUID, stored in the cookie value

    @ManyToOne(fetch = FetchType.LAZY)
    private User user;

    private Instant expiresAt;
    private boolean revoked;

    @CreationTimestamp
    private Instant createdAt;
}

And the service that handles rotation:

@Service
@Transactional
@RequiredArgsConstructor // Lombok: constructor-injects the final repo field
public class RefreshTokenService {

    private final RefreshTokenRepository repo;
    private final Duration refreshExpiry = Duration.ofDays(7);

    public RefreshToken create(User user) {
        RefreshToken token = new RefreshToken();
        token.setId(UUID.randomUUID().toString());
        token.setUser(user);
        token.setExpiresAt(Instant.now().plus(refreshExpiry));
        token.setRevoked(false);
        return repo.save(token);
    }

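    // Without noRollbackFor, the exception thrown on reuse would roll back
    // the family revocation along with the rest of the transaction.
    @Transactional(noRollbackFor = InvalidTokenException.class)
    public RefreshToken rotate(String oldTokenId) {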
        RefreshToken old = repo.findById(oldTokenId)
            .orElseThrow(() -> new InvalidTokenException("Refresh token not found"));

        if (old.isRevoked() || old.getExpiresAt().isBefore(Instant.now())) {
            // Possible token reuse attack — revoke entire family
            revokeAllForUser(old.getUser());
            throw new InvalidTokenException("Refresh token expired or revoked");
        }

        old.setRevoked(true);
        repo.save(old);

        return create(old.getUser());
    }

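    public void revokeAllForUser(User user) {
        repo.revokeAllByUser(user);
    }

    // Used by logout: revoke a single token if it exists.
    public void revoke(String tokenId) {
        repo.findById(tokenId).ifPresent(token -> {
            token.setRevoked(true);
            repo.save(token);
        });
    }
}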

The reuse detection is the part most tutorials omit entirely. If an attacker steals a refresh token and uses it, you now have two parties trying to use it. When the legitimate client tries to rotate and finds its token already consumed, you revoke everything and force a new login. This is the refresh token family pattern from the OAuth 2.0 Security Best Current Practice spec.

The cookie setup

The refresh token goes out in the response as a cookie, not in the JSON body:

@PostMapping("/auth/login")
public ResponseEntity<AccessTokenResponse> login(
        @RequestBody LoginRequest req,
        HttpServletResponse response) {

    User user = authService.authenticate(req.email(), req.password());

    String accessToken = jwtService.generateAccessToken(user);
    RefreshToken refreshToken = refreshTokenService.create(user);

    ResponseCookie cookie = ResponseCookie.from("refresh_token", refreshToken.getId())
        .httpOnly(true)
        .secure(true)
        .sameSite("Strict")
        .path("/auth/refresh")
        .maxAge(Duration.ofDays(7))
        .build();

    response.addHeader(HttpHeaders.SET_COOKIE, cookie.toString());

    return ResponseEntity.ok(new AccessTokenResponse(accessToken));
}

Note .path("/auth/refresh"): the cookie is scoped to that path, so the browser won't send it on requests outside /auth/refresh. This shrinks the CSRF attack surface (though SameSite=Strict already handles most of that).

The refresh endpoint reads the cookie:

@PostMapping("/auth/refresh")
public ResponseEntity<AccessTokenResponse> refresh(
        @CookieValue(name = "refresh_token", required = false) String refreshTokenId,
        HttpServletResponse response) {

    if (refreshTokenId == null) {
        return ResponseEntity.status(HttpStatus.UNAUTHORIZED).build();
    }

    RefreshToken newRefreshToken = refreshTokenService.rotate(refreshTokenId);
    String newAccessToken = jwtService.generateAccessToken(newRefreshToken.getUser());

    ResponseCookie cookie = ResponseCookie.from("refresh_token", newRefreshToken.getId())
        .httpOnly(true)
        .secure(true)
        .sameSite("Strict")
        .path("/auth/refresh")
        .maxAge(Duration.ofDays(7))
        .build();

    response.addHeader(HttpHeaders.SET_COOKIE, cookie.toString());

    return ResponseEntity.ok(new AccessTokenResponse(newAccessToken));
}

The JWT service itself

Nothing exotic here, but I'll include it for completeness. I use io.jsonwebtoken:jjwt-api (JJWT 0.12.x) with Spring Boot 3:

@Service
public class JwtService {

    @Value("${app.jwt.secret}")
    private String secret;

    private static final Duration ACCESS_EXPIRY = Duration.ofMinutes(15);

    private SecretKey key() {
        return Keys.hmacShaKeyFor(Decoders.BASE64.decode(secret));
    }

    public String generateAccessToken(User user) {
        return Jwts.builder()
            .subject(user.getId().toString())
            .claim("email", user.getEmail())
            .claim("roles", user.getRoles())
            .issuedAt(new Date())
            .expiration(Date.from(Instant.now().plus(ACCESS_EXPIRY)))
            .signWith(key())
            .compact();
    }

    public Claims validateAndParse(String token) {
        return Jwts.parser()
            .verifyWith(key())
            .build()
            .parseSignedClaims(token)
            .getPayload();
    }
}

The secret must be at least 256 bits (32 bytes) for HS256. Generate it once and store it in your secrets manager, not in application.yml committed to git:

openssl rand -base64 32

The filter

@Component
@RequiredArgsConstructor
public class JwtAuthFilter extends OncePerRequestFilter {

    private final JwtService jwtService;
    private final UserDetailsService userDetailsService;

    @Override
    protected void doFilterInternal(
            HttpServletRequest request,
            HttpServletResponse response,
            FilterChain chain) throws ServletException, IOException {

        String header = request.getHeader(HttpHeaders.AUTHORIZATION);
        if (header == null || !header.startsWith("Bearer ")) {
            chain.doFilter(request, response);
            return;
        }

        String token = header.substring(7);
        try {
            Claims claims = jwtService.validateAndParse(token);
            String userId = claims.getSubject();

            if (userId != null && SecurityContextHolder.getContext().getAuthentication() == null) {
                UserDetails userDetails = userDetailsService.loadUserByUsername(userId);
                UsernamePasswordAuthenticationToken auth = new UsernamePasswordAuthenticationToken(
                    userDetails, null, userDetails.getAuthorities());
                auth.setDetails(new WebAuthenticationDetailsSource().buildDetails(request));
                SecurityContextHolder.getContext().setAuthentication(auth);
            }
        } catch (JwtException e) {
            // Token invalid or expired — let the request proceed unauthenticated
            // The security config will reject it at the endpoint level
        }

        chain.doFilter(request, response);
    }
}

I catch JwtException broadly here rather than differentiating expired vs. tampered. From the filter's perspective the right behavior is the same: don't authenticate. The client should catch the 401 and attempt a refresh.

Logout

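Logout needs to revoke the refresh token and clear the cookie. Because the cookie's path is /auth/refresh, the logout route has to sit under that path, otherwise the browser never sends the cookie and there is nothing to revoke:

@PostMapping("/auth/refresh/logout")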
public ResponseEntity<Void> logout(
        @CookieValue(name = "refresh_token", required = false) String refreshTokenId,
        HttpServletResponse response) {

    if (refreshTokenId != null) {
        refreshTokenService.revoke(refreshTokenId);
    }

    ResponseCookie cleared = ResponseCookie.from("refresh_token", "")
        .httpOnly(true)
        .secure(true)
        .sameSite("Strict")
        .path("/auth/refresh")
        .maxAge(0)
        .build();

    response.addHeader(HttpHeaders.SET_COOKIE, cleared.toString());

    return ResponseEntity.noContent().build();
}

Setting maxAge(0) tells the browser to delete the cookie immediately.

What I skip in this setup

A few things I deliberately leave out of a standard deployment:

JWT blocklist for access tokens. Once you have 15-minute expiry and a working refresh flow, blocking access tokens before expiry is usually not worth the latency of a blocklist check on every request. If you need immediate revocation (e.g. "terminate all sessions now"), revoke all refresh tokens — the access tokens will expire naturally within 15 minutes.

RS256 asymmetric signing. Useful if multiple services need to verify tokens without sharing the secret. In a monolith, or a small microservices setup where the service that issues tokens is also the one that verifies them, HMAC-SHA256 is simpler and faster.

Remember me / device management. Out of scope here, but the refresh token table already gives you the foundation: add a deviceName column, show the user their active sessions, let them revoke specific ones.

The only thing preventing most teams from shipping this instead of the tutorial version is the extra thirty minutes it takes to add the refresh token table and rotation logic. It's worth it.

What does your current JWT setup look like — are you doing refresh rotation, or relying on long-lived tokens?

Originally published on Medium.

I Tested Claude Code, OpenCode, and Codex for 30 Days — Here's My Verdict

Davide Mibelli — Wed, 06 May 2026 10:02:19 +0000

In March 2026, I made a rule: every piece of non-trivial code I write has to go through at least one AI coding agent before I commit it. Not for autocomplete — for the whole thing. Architecture decisions, test generation, refactoring, debugging. The full workflow.

I used three tools in rotation: Claude Code, OpenCode, and OpenAI Codex CLI. All on the same projects — a Spring Boot microservice, a Flutter mobile app, and a personal side project in Python. Thirty days, 100+ hours of active coding across all three.

A note on scope: I deliberately chose terminal-first agents, not IDE-integrated tools like Cursor or GitHub Copilot. My workflow already lives in the terminal, and I wanted agents that reason about entire codebases rather than just the file I have open. If you're evaluating Cursor, this comparison won't map directly.

Here's what actually happened.

This is a preview. Read the full article on Medium: I Tested Claude Code, OpenCode, and Codex for 30 Days — Here's My Verdict

Spring Boot 4.0 Migration: What Nobody Tells You About the Breaking Changes

Davide Mibelli — Wed, 29 Apr 2026 13:14:24 +0000

I upgraded two production applications to Spring Boot 4.0 the week it went GA. I read the migration guide, skimmed the release notes, and thought the hardest part would be the Jackson 3 change everyone was talking about.

The Jackson 3 change was not the hardest part.

Two applications, a few days of debugging, and one very confusing test suite later, I have a clear picture of what actually breaks — and what the official guide glosses over.

This is not a walkthrough of the migration guide. This is the stuff that costs you real time.

This is a preview. Read the full article on Medium: Spring Boot 4.0 Migration: What Nobody Tells You About the Breaking Changes

Stop Fighting CSS: The Google Stitch + Antigravity Stack for Solo Developers

Davide Mibelli — Fri, 24 Apr 2026 09:07:34 +0000

Most developers I know have the same problem: the logic is solid, but the UI looks like it was built in a hurry — because it was.
I spent two days on a login flow for a side project before I gave up and tried a different approach: Google Stitch for the UI, Antigravity as the IDE, and the MCP bridge between them. Here's what the workflow actually looks like.

The core idea

Google Stitch gives you pre-configured UI patterns — "Stitches" — that are already accessible and mathematically sound. You pick a functional category (Hero, Card, List), set one primary color and one font, and it handles the rest. The MCP connection to Antigravity means you never manually export manifests: change a button style in the Stitch web UI, and it reflects in your running simulator instantly.

The implementation

Once you've connected Stitch to Antigravity via MCP: Configure Server in the command palette, your components become requestable by name:

import { useStitch } from '@antigravity/react-hooks';

function DashboardCard({ user }) {
  const { Component, loading } = useStitch('UserCard');
  if (loading) return <Placeholder />;
  return (
    <Component
      username={user.name}
      onAction={(ev) => console.log('User tapped:', ev)}
    />
  );
}

Your React component has no idea what UserCard looks like. It just requests it. If you decide a List works better than a Card, you change it in Stitch — the code stays the same.

What I didn't expect

The MCP connection is bidirectional. You can send performance feedback from Antigravity back to Stitch, so the design tool knows which components are causing frame drops on real devices. The documentation barely mentions this — it took me a while to find it.

The full guide

I wrote a more detailed walkthrough on Medium covering the full setup, the Gravity Fields fetch pattern, and the telemetry configuration: https://medium.com/p/02aa7c97e131

What's your approach for UI as a solo dev — do you start from the design or the logic?

From DALL-E to gpt-image-2: The Architectural Bet That Finally Fixed AI Text

Davide Mibelli — Thu, 23 Apr 2026 21:04:44 +0000

This article was originally published on Medium.

Two years ago, if you asked an AI to design a menu for a Mexican restaurant, you’d get a beautiful layout of “enchuita” and “churiros.” It looked like food, and the font looked like letters, but it was essentially a visual fever dream. The “burrto” became a classic meme in dev circles — a reminder that while AI could paint like Caravaggio, it had the literacy of a toddler.

Yesterday, OpenAI launched ChatGPT Images 2.0 (gpt-image-2). I ran the same test. The menu was perfect. Not just the spelling, but the hierarchy, the prices, and the specialized diacritics. It is no longer just “generating pixels.” It is communicating.

This isn’t a minor version bump or a better training set. It’s a total architectural pivot that signals the end of an era. If you’ve spent the last three years building workflows around diffusion models, it’s time to rethink your pipeline.

1. Why text was broken (and how they fixed it)

To understand why gpt-image-2 works, you have to understand why DALL-E 3 failed at spelling. Diffusion models — the tech behind almost every major generator until now — work by denoising. They start with static and try to “find” an image. Because text pixels make up a tiny fraction of a training image, the model learned the texture of text rather than the logic of characters. To a diffusion model, an “A” is just a specific arrangement of lines, not a semantic unit.

OpenAI has quietly abandoned diffusion. While they won’t officially confirm the guts of the system, the PNG metadata and the model’s behavior tell the story: this is an autoregressive model.

It generates images the same way GPT-4 generates code — by predicting the next token. By integrating image generation directly into the language model pipeline, the model isn’t “drawing” a word; it’s “writing” an image. When the architecture treats a pixel and a letter as parts of the same conceptual stream, the “enchuita” problem simply vanishes.

2. The end of the CSS overlay hack

For those of us in agency work or product dev, AI images have always been a “background only” tool. If a client wanted a marketing banner with a specific CTA, we’d generate the art, then use a graphics library or CSS to overlay the text. It was the only way to ensure the brand name wasn’t spelled “Gooogle.”

Gpt-image-2 changes that calculus. With near-perfect rendering of Latin, Kanji, and Hindi scripts, the “post-processing” stage of the workflow is suddenly on the chopping block. You can now generate multi-paneled assets or social media posts where the text is baked into the composition with proper lighting and perspective.

But there’s a catch for your budget. At approximately $0.21 per high-quality 1024x1024 render, this is roughly 60% more expensive than the previous generation. If you’re at a high-volume startup, that’s a significant line item.

3. Thinking before rendering

The most impressive part of the new model isn’t the resolution — it’s the “thinking mode.” Borrowed from reasoning models like o3, the generator now spends compute time planning the layout before it touches a single pixel.

I watched it handle a prompt for “a grid of six distinct objects, each with a label in a different language.” Previous models would lose count by object four and turn the labels into Sanskrit-flavored gibberish. Gpt-image-2 paused, “thought” (generating reasoning tokens), and then executed. It can count. It can follow layout constraints.

This moves AI generation from “creative toy” to “reliable infrastructure.” Reliability is what we actually need in production. I’d much rather pay more for a single correct image than spend credits on ten “cheap” re-rolls.

4. The DALL-E eulogy

OpenAI is shutting down DALL-E 2 and 3 on May 12, 2026. Not moving them to a legacy tier — shutting them down.

This is a massive signal. It’s an admission that the diffusion approach hit a ceiling that no amount of fine-tuning could break. By retiring the DALL-E brand in favor of a unified ChatGPT Image model, OpenAI is betting that the future of multimodality is a single, unified architecture.

The wall between “thinking” and “seeing” is being torn down. We used to have a brain (LLM) that sent instructions to a hand (Diffusion model). Now, the brain is doing the drawing itself.

5. What I’m still worried about

Despite the polish, there are gaps. The knowledge cutoff is December 2025. If you need a render involving a trend or news event from early 2026, you’re reliant on the web search tool, which adds latency and even more cost.

Furthermore, the pricing model is now “tokenized” for images. Thinking mode adds a variable cost based on how many reasoning tokens the model uses to plan the composition. This makes it incredibly hard to predict API costs for complex apps. You aren’t just paying for an image; you’re paying for the “brain power” required to frame it.

6. The 2026 reality check

If you are building a simple placeholder tool, stick to cheaper, older models. But for any workflow where the image is the content — marketing, UI prototyping, or localized assets — the shift to autoregressive generation is a one-way door.

We’re entering a phase where the term “image model” feels dated. We just have models. They happen to output pixels sometimes and Python code at other times. The fact that it can finally spell “Burrito” is just the first sign that the gap between human intent and machine execution is finally closing.