Forem: Juan Torchia

Prisma Server Actions in Next.js 16: the patterns that work and the N+1 that sneaks up on you

Juan Torchia — Mon, 18 May 2026 12:31:25 +0000

Next.js 16 shipped recently with App Router improvements, and Server Actions, stable since Next.js 14, are by now a first-class primitive. The community is adopting them as the natural replacement for API routes on mutations. The migration looks obvious: less boilerplate, co-location with the component, shared types between client and server. I started moving in that direction too. Somewhere along the way I ran into an N+1 that didn't come from Prisma: it came from how I was composing the Actions.

My thesis is this: Prisma ORM 5 doesn't introduce N+1 in Server Actions. Action composition does — the pattern of calling multiple independent Actions from the same component, or chaining them without collapsing the queries. It's an architecture problem, not an ORM problem. And it has a solution, but you have to know where to look.

Classic N+1 vs. composition N+1 in Server Actions

The classic N+1 with Prisma is well-known: you iterate over a list and fire a separate query for each item because you forgot the include. The official Prisma docs on query optimization cover it precisely — use include or select with nested relations, or for more complex cases, findMany with relational filters instead of queries in a loop.

The composition N+1 in Server Actions is different. It doesn't show up inside the body of a single Action. It shows up when the component calls multiple Actions in sequence or in parallel, and each Action issues its own queries, each of which checks a connection out of Prisma's pool. Under SSR load, that becomes connection pool pressure that never appears in local tests.

Look at this problematic pattern:

// app/dashboard/page.tsx
// ⚠️ Problematic pattern: three independent Actions
// each one runs its own query and takes its own connection from the pool

import { getUserProfile } from "@/actions/user"
import { getRecentOrders } from "@/actions/orders"
import { getNotifications } from "@/actions/notifications"

export default async function DashboardPage() {
  // Three sequential round-trips: each await waits for the previous one
  const profile = await getUserProfile()
  const orders = await getRecentOrders()
  const notifications = await getNotifications()

  return <Dashboard profile={profile} orders={orders} notifications={notifications} />
}

Each of those Actions has its own prisma.user.findUnique, its own prisma.order.findMany, its own prisma.notification.findMany. Three queries that could be resolved with a single well-designed call — or at minimum with Promise.all to parallelize them.

The connection pool under SSR load

Prisma uses an internal connection pool. In Next.js App Router with SSR, each request can fire multiple Server Actions in the same render. If every component on the page calls its own Action, the pool receives a short but intense burst of connections per user visit.

The most common cause is instantiating PrismaClient in each module that needs it instead of sharing one global instance. Every instance brings its own pool. Prisma's documentation explicitly recommends using a singleton instance in serverless and SSR environments:

// lib/prisma.ts
// Singleton pattern recommended by Prisma for Next.js
// Source: https://www.prisma.io/docs/orm/prisma-client/queries/query-optimization-performance

import { PrismaClient } from "@prisma/client"

const globalForPrisma = globalThis as unknown as {
  prisma: PrismaClient | undefined
}

export const prisma =
  globalForPrisma.prisma ??
  new PrismaClient({
    log: process.env.NODE_ENV === "development" ? ["query", "warn", "error"] : ["error"],
  })

if (process.env.NODE_ENV !== "production") globalForPrisma.prisma = prisma

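If you skip this pattern, every hot reload in development can instantiate a fresh PrismaClient with its own pool, because the module is re-evaluated while the old client's connections stay open. In production on serverless providers the singleton only holds within one function instance: each cold start still creates its own client and pool. The result: exhausted connections with no obvious warning in the logs.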
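
The globalThis trick is not specific to Prisma. As a minimal sketch, assuming a hypothetical helper named getOrCreateGlobal (it is not part of Prisma or Next.js) and a FakeClient class standing in for PrismaClient, the idea can be checked without a database:

```typescript
// Hypothetical helper: caches one instance per key on globalThis so that
// re-evaluating the module (as a dev hot reload does) reuses it.
type GlobalCache = { [key: string]: unknown };

function getOrCreateGlobal<T>(key: string, factory: () => T): T {
  const cache = globalThis as unknown as GlobalCache;
  if (cache[key] === undefined) {
    cache[key] = factory();
  }
  return cache[key] as T;
}

// Stand-in for PrismaClient: counts how many times it gets constructed.
let constructed = 0;
class FakeClient {
  constructor() {
    constructed += 1;
  }
}

// Simulates the module being evaluated twice, as in a hot reload.
const first = getOrCreateGlobal("__fakeClient", () => new FakeClient());
const second = getOrCreateGlobal("__fakeClient", () => new FakeClient());

console.log(first === second, constructed);
```

Both calls return the same object and the constructor runs once, which is exactly what keeps a single pool alive across reloads.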

The patterns that work: collapsing queries into a single Action

The antidote to the composition N+1 is simple to state but requires discipline: one Action per use case, not one Action per entity. Instead of three independent Actions for the dashboard, one single Action that groups the three queries with Promise.all:

// actions/dashboard.ts
// ✅ Correct pattern: one Action that collapses the queries
// Promise.all runs them concurrently; each in-flight query may hold its own pool connection

"use server"

import { prisma } from "@/lib/prisma"
import { auth } from "@/lib/auth"

export async function getDashboardData() {
  const session = await auth()
  if (!session?.user?.id) throw new Error("Not authenticated")

  const userId = session.user.id

  // One Action invocation: three queries in flight at once
  const [profile, orders, notifications] = await Promise.all([
    prisma.user.findUnique({
      where: { id: userId },
      select: { name: true, email: true, avatarUrl: true },
    }),
    prisma.order.findMany({
      where: { userId, createdAt: { gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) } },
      orderBy: { createdAt: "desc" },
      take: 10,
    }),
    prisma.notification.findMany({
      where: { userId, read: false },
      orderBy: { createdAt: "desc" },
      take: 5,
    }),
  ])

  return { profile, orders, notifications }
}

The difference isn't just about queries — it's about design. An Action that groups data for a specific use case is easier to cache, easier to test, and more honest about what problem it's actually solving.
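
It helps to be precise about what Promise.all buys here: lower wall-clock time, not fewer connections. The toy model below is only a sketch. FakePool is a hypothetical stand-in for Prisma's pool that tracks how many queries are in flight at once, and it compares sequential awaits with Promise.all:

```typescript
// Toy stand-in for a connection pool: tracks how many queries are in flight.
class FakePool {
  inFlight = 0;
  peak = 0;

  async query<T>(result: T): Promise<T> {
    this.inFlight += 1;
    this.peak = Math.max(this.peak, this.inFlight);
    // Yield to the event loop so concurrent queries can overlap.
    await new Promise<void>((resolve) => setTimeout(resolve, 5));
    this.inFlight -= 1;
    return result;
  }
}

// One query at a time: the next starts only after the previous resolves.
async function sequential(pool: FakePool): Promise<string[]> {
  const profile = await pool.query("profile");
  const orders = await pool.query("orders");
  const notifications = await pool.query("notifications");
  return [profile, orders, notifications];
}

// All three queries start before any of them resolves.
async function parallel(pool: FakePool): Promise<string[]> {
  return Promise.all([
    pool.query("profile"),
    pool.query("orders"),
    pool.query("notifications"),
  ]);
}

async function comparePeaks(): Promise<{ sequentialPeak: number; parallelPeak: number }> {
  const a = new FakePool();
  await sequential(a);
  const b = new FakePool();
  await parallel(b);
  return { sequentialPeak: a.peak, parallelPeak: b.peak };
}

comparePeaks().then((peaks) => console.log(peaks));
```

In this model the sequential version peaks at one query in flight and the Promise.all version at three. If the goal is to hold exactly one connection, Prisma's $transaction([...]) runs the queries in order on a single connection, at the cost of the parallelism.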

The forgotten include and the query that multiplied

The classic N+1 still lives inside Actions. If you iterate over results and fire a nested query per item, Prisma isn't going to save you — that's on you. The most frequent pattern I see in codebases just starting with Server Actions:

// ⚠️ Classic N+1 inside an Action
// One query per order to fetch the product

"use server"

import { prisma } from "@/lib/prisma"

export async function getOrdersWithProducts(userId: string) {
  const orders = await prisma.order.findMany({ where: { userId } })

  // ❌ N+1: one query per order
  const ordersWithProduct = await Promise.all(
    orders.map(async (order) => {
      const product = await prisma.product.findUnique({
        where: { id: order.productId },
      })
      return { ...order, product }
    })
  )

  return ordersWithProduct
}

The correct fix is to collapse with include:

// ✅ Correct include: the relation is loaded for all orders at once
// Prisma issues a fixed number of queries (by default one batched lookup per relation) instead of one per order

"use server"

import { prisma } from "@/lib/prisma"

export async function getOrdersWithProducts(userId: string) {
  return prisma.order.findMany({
    where: { userId },
    include: {
      product: {
        select: { name: true, price: true, imageUrl: true },
      },
    },
    orderBy: { createdAt: "desc" },
    take: 20,
  })
}

The select inside the include matters: you're not pulling the full product object, you're pulling exactly the fields the component needs. That reduces the serialized payload Next.js has to transfer between server and client.
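
When the schema declares no relation and include isn't available, the same collapse can be done by hand: collect the ids, fetch them in one batch, and join in memory. The sketch below assumes hypothetical Order and Product shapes and a loadProducts callback standing in for something like prisma.product.findMany({ where: { id: { in: ids } } }):

```typescript
interface Order { id: string; productId: string }
interface Product { id: string; name: string }

// Joins each order to its product with one batched lookup instead of one per order.
async function attachProducts(
  orders: Order[],
  loadProducts: (ids: string[]) => Promise<Product[]>,
): Promise<Array<Order & { product: Product | null }>> {
  // De-duplicate ids so each product is requested once.
  const ids = Array.from(new Set(orders.map((o) => o.productId)));
  const products = ids.length > 0 ? await loadProducts(ids) : [];
  const byId = new Map<string, Product>();
  for (const p of products) byId.set(p.id, p);
  return orders.map((o) => ({ ...o, product: byId.get(o.productId) ?? null }));
}

// In-memory stand-in for the database that counts batched calls.
let batchCalls = 0;
const catalog: Product[] = [
  { id: "p1", name: "Keyboard" },
  { id: "p2", name: "Mouse" },
];
async function fakeLoadProducts(ids: string[]): Promise<Product[]> {
  batchCalls += 1;
  return catalog.filter((p) => ids.includes(p.id));
}

const sampleOrders: Order[] = [
  { id: "o1", productId: "p1" },
  { id: "o2", productId: "p2" },
  { id: "o3", productId: "p1" },
];

attachProducts(sampleOrders, fakeLoadProducts).then((rows) => {
  console.log(rows.length, batchCalls);
});
```

Three orders come back joined after a single call to the loader, however many orders there are.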

Real gotchas: what the 15-minute tutorial doesn't cover

"use server" doesn't guarantee that Prisma errors reach the client in a usable form. If an Action throws a PrismaClientKnownRequestError (say, a constraint violation), production builds of Next.js replace the message of an uncaught server error with a generic one, so the client can't tell a duplicate email from a crash. Wrap the call in try/catch and return a serializable error explicitly:

// actions/user.ts
// Explicit Prisma error handling in Server Actions

"use server"

import { prisma } from "@/lib/prisma"
import { Prisma } from "@prisma/client"

export async function createUser(data: { email: string; name: string }) {
  try {
    return await prisma.user.create({ data })
  } catch (error) {
    // Unique constraint violation (P2002 in Prisma)
    if (error instanceof Prisma.PrismaClientKnownRequestError) {
      if (error.code === "P2002") {
        return { error: "That email is already registered" }
      }
    }
    // Unexpected error: log it, don't expose it
    console.error("[createUser]", error)
    return { error: "Internal error. Please try again." }
  }
}
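
One wrinkle in the snippet above is that the success and failure branches return different shapes. A small discriminated union makes that explicit for the caller. This is a sketch: ActionResult, toActionError and runAction are hypothetical names, and the error check only looks for a `code` property rather than importing Prisma's error class, so it runs standalone:

```typescript
// Hypothetical result type: the caller must check `ok` before reading `data`.
type ActionResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: string };

// Maps a thrown value to a message that is safe to show to the user.
function toActionError(thrown: unknown): { ok: false; error: string } {
  const code =
    typeof thrown === "object" && thrown !== null && "code" in thrown
      ? (thrown as { code: unknown }).code
      : undefined;
  // P2002 is Prisma's code for a unique constraint violation.
  if (code === "P2002") {
    return { ok: false, error: "That email is already registered" };
  }
  return { ok: false, error: "Internal error. Please try again." };
}

// Wraps an async operation so nothing is thrown across the Action boundary.
async function runAction<T>(operation: () => Promise<T>): Promise<ActionResult<T>> {
  try {
    return { ok: true, data: await operation() };
  } catch (thrown) {
    return toActionError(thrown);
  }
}
```

Inside an Action this would be used as runAction(() => prisma.user.create({ data })), and the client branches on result.ok instead of guessing from the shape.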

Query logging in development is your best diagnostic tool. The singleton above already includes log: ["query"] in development — that lets you see exactly how many queries each render fires. If you see the same SELECT repeated N times in the terminal, you have an N+1 and you can attack it before it hits production.
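
Reading the terminal by eye works for small pages. For anything larger, the same check can be automated by grouping logged statements by shape. In the sketch below, normalizeSql, findRepeatedQueries and the threshold are hypothetical; the input is just an array of SQL strings. Prisma can emit query events through prisma.$on("query", ...) when the client is created with log: [{ emit: "event", level: "query" }], but treat that wiring as something to verify against your Prisma version:

```typescript
// Replaces literal values so that statements differing only in parameters
// collapse to the same shape.
function normalizeSql(sql: string): string {
  return sql
    .replace(/'[^']*'/g, "?") // quoted string literals
    .replace(/\$\d+/g, "?") // positional parameters such as $1
    .replace(/\b\d+\b/g, "?") // bare numeric literals
    .replace(/\s+/g, " ")
    .trim();
}

// Returns the shapes that occur at least `threshold` times, most frequent first.
function findRepeatedQueries(
  queries: string[],
  threshold = 3,
): Array<{ shape: string; count: number }> {
  const counts = new Map<string, number>();
  for (const q of queries) {
    const shape = normalizeSql(q);
    counts.set(shape, (counts.get(shape) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .map(([shape, count]) => ({ shape, count }))
    .filter((entry) => entry.count >= threshold)
    .sort((a, b) => b.count - a.count);
}

// Sample log: one query for the orders, then one per order for its product.
const logged = [
  'SELECT * FROM "Order" WHERE "userId" = $1',
  'SELECT * FROM "Product" WHERE "id" = 11 LIMIT 1',
  'SELECT * FROM "Product" WHERE "id" = 12 LIMIT 1',
  'SELECT * FROM "Product" WHERE "id" = 13 LIMIT 1',
];

console.log(findRepeatedQueries(logged));
```

With the sample log, only the Product lookup is reported, with a count of 3. A check like this can run in a development-only hook and warn when a single render crosses the threshold.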

Server Actions and React 19 useOptimistic can mask the problem. If you use useOptimistic to update the UI before the Action resolves, perceived latency drops — but the queries are still there. Don't confuse improved UX with optimized queries.

This connects to something I already documented when looking at how OpenTelemetry in Spring Boot reveals the real problem when the log says OK: observability surface matters. In Next.js 16, if you don't have query traces, the Action log can look healthy while queries multiply underneath.

FAQ: Prisma Server Actions Next.js 16 N+1

Why does N+1 appear in Server Actions when it didn't in my API routes?
In API routes, the natural pattern was one route = one handler = one query. In Server Actions, co-location with the component invites you to create one Action per entity, and components end up calling several Actions in the same render. That composition generates multiple round-trips that never existed in an API route because the query was centralized.

Does Prisma ORM 5 have any mechanism to automatically detect N+1?
Not automatically at runtime, but you can enable query logging (log: ["query"]) to see them in development. There are community proposals for a native N+1 detector, but as of this post it's not a stable feature. The official optimization docs document the patterns to avoid, but detection is still manual or via external tooling.

How many PrismaClient instances should I have in a Next.js 16 project?
One, using the singleton pattern with globalThis. More than one instance means more than one connection pool, which under SSR load can exhaust available database connections. This is especially critical on serverless providers where each function can have its own process.

Is Promise.all inside an Action enough to fix the pool problem?
Partly. For multiple independent queries inside a single Action, Promise.all runs them concurrently within the same invocation, so total latency drops to roughly that of the slowest query. It does not reduce connection usage: each in-flight query can hold its own pool connection, up to the pool limit. What Promise.all also does not fix is multiple independent Actions fired from different components in the same render. That needs consolidation at the architecture level.

How does this affect Next.js 16 caching?
Next.js 16 has Data Cache and Full Route Cache. If you use fetch or unstable_cache, you can cache the result of a Server Action. But the N+1 happens before the cache — if the Action isn't cached (mutations, data with no-store), every request executes the queries. The right pattern is to cache the entire Action with unstable_cache when the data allows it, not to cache individual queries inside it.

Does this pattern also apply to Prisma with pure Server Components (no Actions)?
Yes, but with a difference: in Server Components without Actions, queries live directly in the component and Next.js can do component-level caching more easily. The composition problem is more acute with Server Actions because the mental model of "one Action = one button or form" leads to excessive granularity that multiplies round-trips.

What I'm keeping and what I'm not buying

I'm keeping this pattern: one Action per use case, not one Action per entity. That's the most important mindset shift when migrating from API routes to Server Actions with Prisma.

What I'm not buying is the narrative that Server Actions automatically simplify the data model. They simplify the boilerplate — shared types, no explicit endpoint — but the responsibility to not multiply queries is still yours. If you were coming from API routes where one route = one well-considered query, jumping to Actions can lead to query sprawl that's actually worse.

The honest trade-off: Server Actions win on DX and co-location. They lose on visibility into which queries fire per render if you don't have logging active. Before deploying any page with multiple Actions, pull up the dev terminal with log: ["query"] running and count how many SELECTs appear per render. If the number surprises you, you have work to do.

This connects directly to what I documented in Prisma vs JDBC: the benchmark that almost made me blame the wrong ORM — the ORM is rarely the problem. Query shape is. And in Next.js 16 with Server Actions, shape is defined by the Action architecture, not by Prisma.

For those coming from the Spring Boot world, there's an interesting parallel with retry budget and amplification: every abstraction that looks like a simplification introduces its own amplification vector. In Server Actions, that vector is granular query composition.

Sources:

Prisma Docs — Query optimization & performance

This article was originally published on juanchi.dev

Spring Boot 2026: Why Measuring Only Startup Time Is a Trap

Juan Torchia — Sun, 17 May 2026 21:22:44 +0000

There's a question that surfaces every time someone mentions GraalVM or Spring AOT in a technical meeting: how long does it take to start? It's the first metric that hits the screen, the number that closes the debate in five minutes. The problem is that question alone isn't enough to make any serious architecture decision, and in 2026 we have enough evidence to prove it with a reproducible lab.

I built JuanTorchia/springboot-jvm-2026 (tag editorial-final-startup-matrix) around exactly that working hypothesis: if you only look at startup time, you're ignoring half the costs that actually matter in production.

The lab backend is not a Hello World

Choosing what to measure matters as much as measuring it. A GET /ping endpoint that returns {"status":"ok"} doesn't activate the same bean graph or the same JIT behavior as a real application. So the lab backend has concrete surface area:

POST /api/orders with Jakarta Validation on a record
GET /api/orders/{id} with Spring Data JDBC on PostgreSQL 17
POST /api/work with deterministic work (iterative CRC32, up to 5,000 iterations)
Flyway for migrations, Actuator for readiness/liveness
HikariCP with the pool explicitly configured in the benchmark profile

The WorkService deserves its own paragraph because it backs the only endpoint that mixes real CPU work with a database query (countOrders()). That matters: without that endpoint, native and classic JVM look practically identical on warm latency because the JIT has nothing interesting to optimize.

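// WorkService.java: deterministic work to force real differences between modes
// Needs java.nio.charset.StandardCharsets and java.util.zip.CRC32;
// longToBytes is a helper that is not shown in this excerpt.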
public long calculateScore(String input, int iterations) {
    byte[] seed = input.getBytes(StandardCharsets.UTF_8);
    long score = 17;
    for (int i = 0; i < iterations; i++) {
        CRC32 crc = new CRC32();
        crc.update(seed);
        crc.update(longToBytes(score + i));
        // rotation + golden Fibonacci constant for dispersion
        score = Long.rotateLeft(score ^ crc.getValue(), 7) + 0x9E3779B97F4A7C15L;
    }
    return score & Long.MAX_VALUE;
}

The 5,000-iteration cap isn't arbitrary: I cover it with WorkServiceTest so the work per request stays bounded and predictable, which keeps the benchmark from accidentally becoming a throughput test.

Four modes, four distinct operational surfaces

The lab compares:

jvm: java -jar on Eclipse Temurin 21, the baseline for every team that hasn't touched anything
cds: JVM with a dynamic AppCDS archive prepared in a separate phase
aot-jvm: Spring Boot AOT on JVM, with -Dspring.aot.enabled=true verified in the container
native: GraalVM Native Image compiled inside ghcr.io/graalvm/native-image-community:21

That last point about AOT has a story. In the editorial run on May 17, 2026 (17:31–17:44 Buenos Aires time), the aot-jvm results made no sense until I confirmed the flag was actually reaching the container. Without spring.aot.enabled=true verified in the runtime env, AOT mode is indistinguishable from classic JVM on startup. The results/environment.json captures exactly that so anyone reproducing the lab knows what was actually running.

The Dockerfile.native does the full build inside the builder container:

# Dockerfile.native — the native build happens inside the builder, no local GraalVM required
FROM ghcr.io/graalvm/native-image-community:21 AS builder
WORKDIR /workspace
RUN microdnf install -y maven && microdnf clean all
COPY .mvn/ .mvn/
COPY mvnw pom.xml ./
COPY src/ src/
RUN chmod +x ./mvnw && ./mvnw -Pnative -DskipTests native:compile

FROM ubuntu:24.04
# final image with no JRE: just the compiled binary
COPY --from=builder /workspace/target/startup-lab /workspace/startup-lab
ENTRYPOINT ["/workspace/startup-lab"]

That means the startup-lab binary runs without a JRE in the final image. Smaller image, much faster startup, but the cost shifted entirely to build time. That's the central trade-off of native mode: you don't eliminate work, you move it from runtime to build time.

What the startup number doesn't capture

In this local matrix, native reduced startup time and RSS compared to JVM modes. That's true and reproducible on the editorial-final-startup-matrix tag. But that number alone doesn't tell the full story.

Build time for native is an order of magnitude higher than a classic mvn package. If you're on a CI pipeline with frequent deploys, that cost shows up on every merge to main. It's not a startup cost: it's a development cycle cost.

First-request latency can differ materially from warm latency. On classic JVM, the first request pays the cost of unloaded classes and a cold JIT. On native there's no JIT, so the first request and request number one thousand have a similar profile. That can be an advantage or a disadvantage depending on your actual load profile.

The AppCDS preparation cost is a third dimension that only appears in cds mode: there's an archive dump phase that runs before the container is ready for traffic. Operationally that means an initialization step that doesn't exist in the other modes, and that you need to model in your deploy pipeline if CDS is the option.

Warm latency under sustained load, GC behavior under high memory pressure, and scheduling on Kubernetes are dimensions this lab intentionally doesn't measure. Running three iterations on Docker Desktop over WSL2 on Windows is not production. What the lab does guarantee is local reproducibility: anyone can clone the repo and reproduce the matrix with:

# Windows — full editorial run with 3 runs per mode and native enabled
powershell -NoProfile -ExecutionPolicy Bypass -File .\scripts\run-lab.ps1 -Preset editorial

The decision startup time can't make on its own

My position after building this: startup time is useful as a tiebreaker when everything else is even. Using it as the primary metric to choose between classic JVM, AppCDS, AOT-JVM, and native is making an architecture decision on a single axis.

What I can claim with evidence from this matrix:

If the requirement is startup around 1.4 seconds and controlled RSS in this matrix, native delivers that, but you pay with higher build time and the loss of JIT optimization once the service is warm.
If the team needs fast CI cycles and current startup is tolerable, AOT-JVM with -Dspring.aot.enabled=true improves boot time without changing the deploy artifact.
AppCDS has the lowest operational change cost of all four, but it has that preparation phase that needs to be explicitly modeled.
Classic JVM is still the correct baseline for any comparison. Dropping it without measuring the other three axes is pure vibes.

There's no universal winner. There are trade-offs that depend on how many times per hour the service scales, how heavy the CI pipeline is, and whether the team can take on the additional operational complexity of native.
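To make that dependency concrete, here is a deliberately crude cost model. It is a sketch, not a measurement: `dailyWaitSeconds`, the workload shapes, and every number except the 1.4 second native startup mentioned above are placeholders I made up for illustration.

```typescript
// Illustrative cost model. Build times, JVM startup, and workloads are placeholders.
interface ModeProfile {
  name: string;
  buildSeconds: number;   // CI time spent per merge
  startupSeconds: number; // time to readiness per new instance
}

interface Workload {
  mergesPerDay: number;
  scaleEventsPerDay: number; // new instances started per day
}

// Seconds per day spent building plus waiting for instances to become ready.
function dailyWaitSeconds(mode: ModeProfile, load: Workload): number {
  return mode.buildSeconds * load.mergesPerDay + mode.startupSeconds * load.scaleEventsPerDay;
}

const jvm: ModeProfile = { name: "jvm", buildSeconds: 60, startupSeconds: 6 };
const native: ModeProfile = { name: "native", buildSeconds: 600, startupSeconds: 1.4 };

const ciHeavy: Workload = { mergesPerDay: 40, scaleEventsPerDay: 10 };
const scaleHeavy: Workload = { mergesPerDay: 2, scaleEventsPerDay: 2000 };

for (const load of [ciHeavy, scaleHeavy]) {
  console.log(load, dailyWaitSeconds(jvm, load), dailyWaitSeconds(native, load));
}
```

With these placeholder numbers the ranking flips between the two workloads, which is the point: the same startup figure leads to opposite choices depending on how often you merge and how often you scale.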

The repo is at JuanTorchia/springboot-jvm-2026, tag editorial-final-startup-matrix. Raw results are in results/raw/*.json and the aggregated matrix in results/comparison.md. If you're going to cite it, use the wording from the README: "In the editorial-final-startup-matrix tag of JuanTorchia/springboot-jvm-2026, measured locally on Windows Docker Desktop/WSL2..." — that environment context isn't a decorative disclaimer, it's part of the data.

What's the dimension that drives your decision most between these four modes? Build time, warm latency, or library compatibility on native?

This article was originally published on juanchi.dev

Spring Boot 2026: why measuring only startup time is a trap

Juan Torchia — Sun, 17 May 2026 21:22:37 +0000

There's a question that comes up every time someone brings GraalVM or Spring AOT into a technical meeting: how long does it take to start? It's the first metric that lands on the screen, the number that closes the debate in five minutes. The problem is that this question alone isn't enough to make any serious architecture decision, and in 2026 we have enough evidence to show it with a reproducible lab.

I built JuanTorchia/springboot-jvm-2026 (tag editorial-final-startup-matrix) around exactly that working hypothesis: if you only look at startup time, you're ignoring half of the costs that matter in production.

The lab backend is not a Hello World

Choosing what to measure matters as much as measuring. A GET /ping endpoint that returns {"status":"ok"} doesn't exercise the same bean graph or the same JIT behavior as a real application. That's why the lab backend has a concrete surface:

POST /api/orders with Jakarta Validation on a record
GET /api/orders/{id} with Spring Data JDBC on PostgreSQL 17
POST /api/work with deterministic work (iterative CRC32, up to 5,000 iterations)
Flyway for migrations, Actuator for readiness/liveness
HikariCP with an explicitly configured pool in the benchmark profile

The WorkService deserves its own paragraph because it's the only endpoint that mixes real CPU work with a database query (countOrders()). That matters: without that endpoint, native and classic JVM look practically the same in warm latency because the JIT has nothing interesting to optimize.

// WorkService.java: deterministic work to force real differences between modes
// (longToBytes is a helper that is not shown in this excerpt)
public long calculateScore(String input, int iterations) {
    byte[] seed = input.getBytes(StandardCharsets.UTF_8);
    long score = 17;
    for (int i = 0; i < iterations; i++) {
        CRC32 crc = new CRC32();
        crc.update(seed);
        crc.update(longToBytes(score + i));
        // rotation + 64-bit golden-ratio (Fibonacci hashing) constant for dispersion
        score = Long.rotateLeft(score ^ crc.getValue(), 7) + 0x9E3779B97F4A7C15L;
    }
    return score & Long.MAX_VALUE;
}

The 5_000 iteration limit isn't arbitrary: I validated it with WorkServiceTest so the cap is predictable and the benchmark doesn't turn into an accidental throughput test.

Four modes, four distinct operational surfaces

The lab compares:

jvm: java -jar on Eclipse Temurin 21, the baseline of every company that hasn't changed anything
cds: JVM with a dynamic AppCDS archive prepared in a separate phase
aot-jvm: Spring Boot AOT on the JVM, with -Dspring.aot.enabled=true verified in the container
native: GraalVM Native Image compiled inside ghcr.io/graalvm/native-image-community:21

The aot-jvm entry has a story. In the editorial run on May 17, 2026 (17:31–17:44 Buenos Aires time), the aot-jvm results made no sense until I confirmed the flag was actually reaching the container. Without spring.aot.enabled=true verified in the runtime environment, AOT mode is indistinguishable from classic JVM on startup. The results/environment.json file captures exactly that, so anyone reproducing the lab knows what was actually running.

The Dockerfile.native does the full build inside the builder container:

# Dockerfile.native: the native build happens inside the builder, no local GraalVM required
FROM ghcr.io/graalvm/native-image-community:21 AS builder
WORKDIR /workspace
RUN microdnf install -y maven && microdnf clean all
COPY .mvn/ .mvn/
COPY mvnw pom.xml ./
COPY src/ src/
RUN chmod +x ./mvnw && ./mvnw -Pnative -DskipTests native:compile

FROM ubuntu:24.04
# final image with no JRE: just the compiled binary
COPY --from=builder /workspace/target/startup-lab /workspace/startup-lab
ENTRYPOINT ["/workspace/startup-lab"]

That means the startup-lab binary runs without a JRE in the final image. Smaller image, much faster startup, but the cost shifted entirely to the build. That's the central trade-off of native mode: you don't eliminate work, you move it from runtime to build time.

What the startup number doesn't capture

In this local matrix, native reduced startup time and RSS compared to the JVM modes. That's true and reproducible on the editorial-final-startup-matrix tag. But that number alone doesn't tell the full story.

Build time for native is an order of magnitude higher than a classic mvn package. If you're on a CI pipeline with frequent deploys, that cost shows up on every merge to main. It's not a startup cost: it's a development cycle cost.

First-request latency can differ materially from warm latency. On classic JVM, the first request pays the cost of unloaded classes and a cold JIT. On native there's no JIT, so the first request and request number one thousand have a similar profile. That can be an advantage or a disadvantage depending on the real load profile.

The AppCDS preparation cost is a third dimension that only appears in cds mode: there's an archive dump phase that runs before the container is ready for traffic. Operationally, that means an initialization step that doesn't exist in the other modes, and one you need to model in your deploy pipeline if you choose CDS.

Warm latency under sustained load, GC behavior under high memory pressure, and scheduling on Kubernetes are dimensions this lab intentionally doesn't measure. Running three iterations on Docker Desktop over WSL2 on Windows is not production. What the lab does guarantee is local reproducibility: anyone can clone the repo and reproduce the matrix with:

# Windows: full editorial run with 3 runs per mode and native enabled
powershell -NoProfile -ExecutionPolicy Bypass -File .\scripts\run-lab.ps1 -Preset editorial

The decision startup time can't make on its own

My position after building this: startup time is useful as a tiebreaker when everything else is even. Using it as the primary metric to choose between classic JVM, AppCDS, AOT-JVM, and native is making an architecture decision on a single axis.

What I can claim with evidence from this matrix:

If the requirement is startup around 1.4 seconds and controlled RSS in this matrix, native delivers that, but you pay with higher build time and the loss of JIT optimization once the service is warm.
If the team needs fast CI cycles and current startup is tolerable, AOT-JVM with -Dspring.aot.enabled=true improves boot time without changing the deploy artifact.
AppCDS has the lowest operational change cost of all four, but it has that preparation phase that needs to be explicitly modeled.
Classic JVM is still the correct baseline for any comparison. Dropping it without measuring the other three axes is pure vibes.

There's no universal winner. There are trade-offs that depend on how many times per hour the service scales, how heavy the CI pipeline is, and whether the team can take on the additional operational complexity of native.

The repo is at JuanTorchia/springboot-jvm-2026, tag editorial-final-startup-matrix. Raw results are in results/raw/*.json and the aggregated matrix is in results/comparison.md. If you're going to cite it, use the wording from the README: "In the editorial-final-startup-matrix tag of JuanTorchia/springboot-jvm-2026, measured locally on Windows Docker Desktop/WSL2..." That environment context isn't a decorative disclaimer, it's part of the data.

What's the dimension that drives your decision most between these four modes? Build time, warm latency, or library compatibility on native?

This article was originally published on juanchi.dev

Show HN: Needle distilled Gemini tool calling into 26M parameters — technical read, zero hype

Juan Torchia — Sun, 17 May 2026 12:30:43 +0000

I was in the middle of reviewing my Ollama pipeline when the HN post appeared: Needle, a 26M parameter model distilled from Gemini specifically for tool calling. My first reaction was skeptical. 26M sounds like a toy. Then I read more carefully and understood that the interesting point isn't the size — it's the problem they're actually attacking.

Here's my technical read. No euphoria, no easy dismissal.

The real problem behind Needle and Gemini tool calling distillation

My thesis is this: the bottleneck in systems with external tools isn't the LLM's general reasoning — it's the parsability of the output. If the model produces malformed JSON, calls functions with wrong arguments, or hallucinates tool names that don't exist, the whole system breaks — doesn't matter how "intelligent" the model is at other tasks.

I ran into this directly while building agent loops with Claude Code. The most fragile part was never the reasoning; it was the reliability of the data contract. It reminded me of when I resisted TypeScript for years thinking types were bureaucracy. Then I understood that most avoidable failures start as poorly expressed data contracts. Tool calling is exactly the same: a model can be brilliant in prose and terrible at respecting a strict JSON schema under latency pressure.

Needle attacks that specific point: it takes Gemini's tool calling behavior — which is consistent and well-structured — and distills it into a small, specialized model. The hypothesis is that for this specific task, 26M parameters trained on the right behavior can outperform giant generalist models that were never fine-tuned to respect function schemas with precision.

Is it true? In the project's own benchmarks, according to its repo, yes. In my own production workloads, I don't know yet, and that difference matters.

What knowledge distillation is and why it matters here

Knowledge distillation is a technique where a large model — the teacher — generates outputs that are then used to train a smaller model — the student. The student doesn't learn from raw data: it learns to imitate the teacher's behavior on the distributions that matter most.

# Simplified concept of the distillation pipeline for tool calling:
# 1. Teacher (Gemini) generates thousands of correct tool calling examples
# 2. Student (Needle, 26M) trains on those examples
# 3. The student learns the teacher's output distribution, not hand-written rules
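Steps 1 and 2 can be sketched as a data-collection loop. This is a minimal illustration under stated assumptions: `Teacher` is a stand-in function rather than a real Gemini client, `toTrainingExample` and `buildDistillationSet` are hypothetical names, and nothing here claims to be Needle's actual pipeline. Training the student is out of scope.

```typescript
// Hypothetical sketch of steps 1 and 2: collect teacher outputs, keep only valid ones.
interface TrainingExample {
  prompt: string;
  completion: string;
}

// Stand-in for a call to the large model; not a real client.
type Teacher = (prompt: string) => Promise<string>;

// Keeps a teacher output only if it parses as a call to a known tool.
function toTrainingExample(
  prompt: string,
  teacherOutput: string,
  knownTools: Set<string>,
): TrainingExample | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(teacherOutput);
  } catch {
    return null; // malformed teacher output never reaches the student
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const call = parsed as { name?: unknown; arguments?: unknown };
  if (typeof call.name !== "string" || !knownTools.has(call.name)) return null;
  if (typeof call.arguments !== "object" || call.arguments === null || Array.isArray(call.arguments)) {
    return null;
  }
  return { prompt, completion: JSON.stringify({ name: call.name, arguments: call.arguments }) };
}

async function buildDistillationSet(
  teacher: Teacher,
  prompts: string[],
  knownTools: Set<string>,
): Promise<TrainingExample[]> {
  const examples: TrainingExample[] = [];
  for (const prompt of prompts) {
    const example = toTrainingExample(prompt, await teacher(prompt), knownTools);
    if (example) examples.push(example);
  }
  return examples;
}
```

Filtering at collection time matters because the student imitates whatever it is shown: a malformed example in the training set becomes a learned behavior.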

For tool calling, this makes particular sense. You don't need the model to know world history. You need it to take this schema:

// Tool definition — the model has to respect this 100%
const tools = [
  {
    name: "search_product",
    description: "Searches for a product by ID in the catalog",
    parameters: {
      type: "object",
      properties: {
        product_id: { type: "string" },
        include_stock: { type: "boolean" }
      },
      required: ["product_id"]
    }
  }
]

And produce exactly this output:

{
  "name": "search_product",
  "arguments": {
    "product_id": "SKU-4821",
    "include_stock": true
  }
}

Not some creative variation with renamed keys, wrong types, or invented fields. Small generalist models fail at this constantly. If Needle solves it reliably, the use case exists.
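Those three failure modes (renamed keys, wrong types, invented fields) are cheap to detect mechanically. The sketch below is a minimal validator under stated assumptions: it only covers flat schemas with primitive types, and `validateToolCall` and `ToolSchema` are hypothetical names. For real schemas, a full JSON Schema validator is the better tool.

```typescript
// Minimal validator sketch: required keys, unknown keys, and primitive types only.
type PrimitiveType = "string" | "boolean" | "number";

interface ToolSchema {
  name: string;
  parameters: {
    properties: Record<string, { type: PrimitiveType }>;
    required: string[];
  };
}

// Own-property check, so inherited keys like "toString" don't count as present.
const has = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

function validateToolCall(
  schema: ToolSchema,
  call: { name: string; arguments: Record<string, unknown> },
): string[] {
  const errors: string[] = [];
  if (call.name !== schema.name) errors.push(`unknown tool: ${call.name}`);
  const { properties, required } = schema.parameters;
  for (const key of required) {
    if (!has(call.arguments, key)) errors.push(`missing required argument: ${key}`);
  }
  for (const [key, value] of Object.entries(call.arguments)) {
    if (!has(properties, key)) {
      errors.push(`invented argument: ${key}`);
    } else if (typeof value !== properties[key].type) {
      errors.push(`wrong type for ${key}: expected ${properties[key].type}`);
    }
  }
  return errors;
}

const searchProduct: ToolSchema = {
  name: "search_product",
  parameters: {
    properties: { product_id: { type: "string" }, include_stock: { type: "boolean" } },
    required: ["product_id"],
  },
};

const ok = validateToolCall(searchProduct, {
  name: "search_product",
  arguments: { product_id: "SKU-4821", include_stock: true },
});
const bad = validateToolCall(searchProduct, {
  name: "search_product",
  arguments: { productId: "SKU-4821", include_stock: "yes" },
});
console.log(ok, bad);
```

The second call shows the "creative variation" case: a camelCase key counts both as a missing required argument and as an invented one, and the string "yes" fails the boolean check.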

How to test it in Ollama: a reproducible checklist

If you want to validate whether a model like Needle has a place in your stack, the criterion shouldn't be someone else's benchmark. It should be your own set of tools under your system's real conditions.

# Step 1: Install Ollama if you haven't
curl -fsSL https://ollama.com/install.sh | sh

# Step 2: When the model is available in the Ollama registry, pull directly
# (check availability at https://ollama.com/search)
ollama pull needle  # tentative name — verify the official registry

# Step 3: Prepare your own tool calling test suite
# Don't use the model README's examples; use YOUR real tools

// tool-calling-test.ts
// Validation criteria I'd use to evaluate any small model

interface TestResult {
  case: string;
  expected: object;
  received: string;
  validJson: boolean;
  schemaRespected: boolean;
  latencyMs: number;
}

async function evaluateToolCallingModel(
  model: string,
  cases: Array<{ prompt: string; expectedSchema: object }>
): Promise<TestResult[]> {
  const results: TestResult[] = [];

  for (const testCase of cases) {
    const start = Date.now();

    // Call the model via Ollama API
    const response = await fetch("http://localhost:11434/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: model,
        messages: [{ role: "user", content: testCase.prompt }],
        // Pass tools as part of the request
        tools: [testCase.expectedSchema],
        stream: false,
      }),
    });

    const data = await response.json();
    const latency = Date.now() - start;

    // Validate if the JSON is parseable and if it respects the schema
    let validJson = false;
    let schemaRespected = false;
    let received = "";

    try {
      // The tool_call should be in message.tool_calls[0]
      const toolCall = data.message?.tool_calls?.[0];
      received = JSON.stringify(toolCall ?? data.message?.content ?? "");
      validJson = !!toolCall;
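      // Basic schema validation: every key in parameters.required must be present.
      // expectedSchema is the tool definition itself, so read its required list
      // (also handles the { type: "function", function: {...} } wrapper).
      const args = toolCall?.function?.arguments;
      if (args && typeof args === "object") {
        const schema = testCase.expectedSchema as {
          parameters?: { required?: string[] };
          function?: { parameters?: { required?: string[] } };
        };
        const requiredKeys = (schema.function ?? schema).parameters?.required ?? [];
        schemaRespected = requiredKeys.every((k) => k in args);
      }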
    } catch {
      received = "parse error";
    }

    results.push({
      case: testCase.prompt.slice(0, 50),
      expected: testCase.expectedSchema,
      received,
      validJson,
      schemaRespected,
      latencyMs: latency,
    });
  }

  return results;
}

My minimum acceptance criteria for any tool calling model in a real system:

Metric	Minimum acceptable	Why
Valid JSON	99%+	A parse error in production breaks the entire flow
Schema respected	95%+	Wrong arguments are silently dangerous
p95 latency	< 500ms local	If it's slower than an external API, you've lost the point
Tool name hallucination	0%	An invented name is a non-recoverable error
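The thresholds in the table can be checked mechanically once a test run has produced per-case results. This is a sketch under stated assumptions: `CaseResult` adds a `toolName` field that the harness above does not record, `meetsAcceptance` is a hypothetical name, and the p95 uses the nearest-rank method.

```typescript
// Sketch: turns per-case results into the four acceptance numbers from the table.
interface CaseResult {
  validJson: boolean;
  schemaRespected: boolean;
  latencyMs: number;
  toolName: string | null; // name the model called, if any (assumed extra field)
}

// Nearest-rank percentile over a non-empty list.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

function meetsAcceptance(results: CaseResult[], registeredTools: Set<string>) {
  if (results.length === 0) throw new Error("no results to evaluate");
  const total = results.length;
  const validJsonRate = results.filter((r) => r.validJson).length / total;
  const schemaRate = results.filter((r) => r.schemaRespected).length / total;
  const p95LatencyMs = percentile(results.map((r) => r.latencyMs), 95);
  const hallucinatedNames = results.filter(
    (r) => r.toolName !== null && !registeredTools.has(r.toolName),
  ).length;
  const passed =
    validJsonRate >= 0.99 && schemaRate >= 0.95 && p95LatencyMs < 500 && hallucinatedNames === 0;
  return { validJsonRate, schemaRate, p95LatencyMs, hallucinatedNames, passed };
}
```

A single hallucinated name fails the run regardless of the other three numbers, which matches the 0% row in the table.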

The limits that the hype doesn't mention

There are three limitations that don't show up in the headlines and that I consider essential before betting on a distilled model in a real system.

First, the teacher's distribution defines the ceiling. If Gemini has biases in how it generates tool calls — certain argument patterns, certain naming conventions — the student inherits them unfiltered. This matters if your API has conventions that drift from Gemini's style.

Second, generalization to unseen schemas is an open question. A distilled model can be excellent on the patterns it learned and brittle against complex schemas with anyOf, nested $refs, or conditional validations. You have to test it explicitly against your own schemas — don't assume the general benchmark applies.

Third, a model with only 26M parameters likely has limited capacity to handle long contexts. In systems where the prompt includes many tools simultaneously (common in backends with dozens of endpoints exposed as tools), degradation can be significant. That's a hypothesis to validate, not to assume.

None of this invalidates the project. It locates it. The same discipline I applied when reviewing pnpm workspaces cache issues in CI applies here: understand the limit first, then decide if it fits.

Where Needle makes sense and where it doesn't

Scenarios where it makes sense to try Needle:

Local agent pipelines where network latency to external APIs is the bottleneck
Edge devices or resource-constrained environments where a 26M model fits in memory comfortably
Systems with a bounded and stable set of tools — not dozens of shifting schemas
As a local fallback when external APIs are unavailable

Scenarios where it probably doesn't cut it:

Systems where reasoning between tool calling steps is complex — deciding when to call which tool, not just how to call it
APIs with deeply nested or polymorphic schemas
Flows where long conversational context matters: a model this small is likely to struggle there
Environments that need auditable safety guarantees — a privately distilled model is a considerably more opaque box

The tension that surfaced in the Spring Boot Actuator in production post applies differently here: the comfort of "it works in the demo" can hide surface risks that only show up under load or with unexpected inputs.

What this signals for the small model ecosystem

The uncomfortable thing about Needle isn't the model itself. It's what it confirms: functional specialization is going to pressure the hegemony of large general models on structured tasks.

Tool calling, intent classification, entity extraction with fixed schemas — these are tasks where a well-trained distilled model can beat GPT-4 or Claude on cost and latency without sacrificing reliability. That changes the architecture calculation.

In my current stack with Claude Code for complex reasoning and Ollama for local tasks, there's a gap exactly where Needle would aim: the tool router that decides which function to call and with what arguments, without needing the overhead of a 70B model for that. I'm not saying I'll adopt it tomorrow. I'm saying the category makes sense and the experiment deserves follow-through.

Same as when I evaluated Jakarta EE vs Spring Boot tradeoffs or compared package managers in real monorepos, the honest answer isn't "adopt it now" or "ignore it" — it's "test it against your own criteria before committing."

FAQ: Needle, distillation, and tool calling in small models

What exactly is model distillation in the LLM context?
It's a process where a large model (teacher) generates a dataset of correct behavior — in this case, well-formed tool calling examples — which is used to train a small model (student). The student learns to imitate the teacher's output distribution on the specific tasks it was distilled for, without needing the teacher's full architecture.

Is 26M parameters enough for reliable tool calling?
Depends on the scope. For a bounded set of tools with simple schemas, probably yes. For systems with dozens of complex tools, long contexts, or multi-step reasoning, it's an open hypothesis. The project's own benchmark is optimistic; validation against your own schemas is mandatory before betting on it.

How do I test it locally without risking a production system?
With Ollama, if the model is available in the registry, it's as simple as ollama pull [name] and then evaluating with your own script against the schemas you already use. The validation checklist in this post is a starting point. Always against your real tools — never against the README examples.

What's the practical difference between Needle and using function calling from OpenAI or Anthropic?
Latency, cost, and privacy. A local model has no network RTT, no per-token cost, and doesn't send your tool schemas to an external API. The tradeoff is that reliability depends entirely on the local model's training quality, without the backing of a provider with an SLA.

Is it worth it for an individual stack or only for companies with infrastructure?
A 26M model runs on a MacBook with 8GB of RAM without drama. This isn't enterprise infrastructure. If you're already using Ollama for other tasks — like I am — adding a specialized model is operationally trivial. The real cost is evaluation time, not hardware.

What happens if the model hallucinates a tool name that doesn't exist in my system?
That's the worst case and you have to design for it as an expected failure. The routing layer that consumes the model's output has to validate that the tool call name corresponds to a registered tool before executing anything. If it doesn't exist, the error has to be explicit and not silent. This is basic defensive design, independent of which model you use.
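That routing layer can be very small. The sketch below is one possible shape, not a prescribed design: `ToolRouter`, `UnknownToolError`, and the registered handler are hypothetical names used only for illustration.

```typescript
// Sketch of a defensive routing layer; the registry contents are hypothetical.
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

class UnknownToolError extends Error {
  readonly toolName: string;

  constructor(toolName: string) {
    super(`Model requested unregistered tool: ${toolName}`);
    this.name = "UnknownToolError";
    this.toolName = toolName;
  }
}

class ToolRouter {
  private readonly handlers = new Map<string, ToolHandler>();

  register(name: string, handler: ToolHandler): void {
    this.handlers.set(name, handler);
  }

  // Looks up the handler; an unknown name fails loudly instead of being ignored.
  resolve(name: string): ToolHandler {
    const handler = this.handlers.get(name);
    if (!handler) throw new UnknownToolError(name);
    return handler;
  }

  // The name is validated before any handler code runs.
  async dispatch(call: { name: string; arguments: Record<string, unknown> }): Promise<unknown> {
    return this.resolve(call.name)(call.arguments);
  }
}

const router = new ToolRouter();
router.register("search_product", async (args) => ({ found: true, args }));
```

Because `dispatch` resolves the handler first, a hallucinated name surfaces as an `UnknownToolError` the caller can log or retry on, and no side effect ever runs for it.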

Conclusion: test it with your eyes open

I'm not going to say Needle is the future or that it's noise. My position is more specific: functional distillation of large model behavior into small specialized models is a legitimate direction, and tool calling is a use case where it makes genuine technical sense.

What I don't buy is enthusiasm without friction. A 26M model has real limits around context, generalization, and reliability on unseen schemas. Those limits don't appear in the HN post and they will appear in production.

My concrete recommendation: if you have an agent pipeline with a stable set of tools and latency is a problem, build a test harness with your own schemas, run it against the acceptance criteria in this post, and measure. If it clears 99% valid JSON and 95% schema respected on your own cases, you have something useful. If not, you know exactly why.

That's more useful than any benchmark someone else wrote.

Are you using local models for tool calling? Tell me at juanchi.dev what stack you built and where you hit the limits.

This article was originally published on juanchi.dev

Show HN: Needle distilled Gemini tool calling into 26M parameters: a technical read without hype

Juan Torchia — Sun, 17 May 2026 12:30:38 +0000

I was reviewing my Ollama pipeline when the post showed up on HN: Needle, a 26M-parameter model distilled from Gemini specifically for tool calling. My first reaction was skeptical. 26M sounds like a toy. Then I read more carefully and understood that the interesting point isn't the size: it's the problem they're attacking.

Here's my technical read, without euphoria and without easy dismissal.

The real problem behind Needle and Gemini tool calling distillation

My thesis is this: the bottleneck in systems with external tools isn't the LLM's general reasoning but the parsability of the output. If the model produces malformed JSON, calls functions with wrong arguments, or hallucinates tool names that don't exist, the whole system breaks, no matter how "intelligent" the model is at other tasks.

I experienced this directly while building agent loops with Claude Code. The most fragile part was never the reasoning; it was the reliability of the data contract. It reminded me of when I resisted TypeScript for years, thinking types were bureaucracy. Later I understood that many avoidable failures start as poorly expressed data contracts. Tool calling is exactly the same: a model can be brilliant in prose and terrible at respecting a strict JSON schema under latency pressure.

Needle attacks that specific point: it takes Gemini's tool calling behavior, which is consistent and well structured, and distills it into a small, specialized model. The hypothesis is that for this concrete task, 26M parameters trained on the right behavior can outperform giant generalist models that were never tuned to respect function schemas with precision.

Is it true? In the project's own benchmarks, according to its repository, yes. In my own production workloads, I don't know yet, and that difference matters.

What knowledge distillation is and why it matters here

Knowledge distillation is a technique where a large model (the teacher) generates outputs that are then used to train a small model (the student). The student doesn't learn from raw data: it learns to imitate the teacher's behavior on the distributions that matter most.

# Simplified concept of the distillation pipeline for tool calling:
# 1. Teacher (Gemini) generates thousands of correct tool calling examples
# 2. Student (Needle, 26M) trains on those examples
# 3. The student learns the teacher's output distribution, not hand-written rules

For tool calling, this makes particular sense. You don't need the model to know world history. You need it to take this schema:

// Tool definition: the model has to respect this 100%
const tools = [
  {
    name: "buscar_producto",
    description: "Searches for a product by ID in the catalog",
    parameters: {
      type: "object",
      properties: {
        producto_id: { type: "string" },
        incluir_stock: { type: "boolean" }
      },
      required: ["producto_id"]
    }
  }
]

And produce exactly this output:

{
  "name": "buscar_producto",
  "arguments": {
    "producto_id": "SKU-4821",
    "incluir_stock": true
  }
}

Not some creative variation with renamed keys, wrong types, or invented fields. Small generalist models fail at this quite often. If Needle solves it reliably, the use case exists.

How to test it in Ollama: a reproducible checklist

If you want to validate whether a model like Needle has a place in your stack, the criterion shouldn't be someone else's benchmark. It should be your own set of tools under your system's real conditions.

# Step 1: Install Ollama if you don't have it
curl -fsSL https://ollama.com/install.sh | sh

# Step 2: When the model is available in the Ollama registry, pull it directly
# (check availability at https://ollama.com/search)
ollama pull needle  # tentative name; verify against the official registry

# Step 3: Prepare your own tool calling test suite
# Don't use the examples from the model's README; use YOUR real tools

// prueba-tool-calling.ts
// Validation criteria I'd use to evaluate any small model

interface ResultadoPrueba {
  caso: string;
  esperado: object;
  obtenido: string;
  jsonValido: boolean;
  schemaRespetado: boolean;
  latenciaMs: number;
}

async function evaluarModeloToolCalling(
  modelo: string,
  casos: Array<{ prompt: string; schemaEsperado: object }>
): Promise<ResultadoPrueba[]> {
  const resultados: ResultadoPrueba[] = [];

  for (const caso of casos) {
    const inicio = Date.now();

    // Call the model via the Ollama API
    const respuesta = await fetch("http://localhost:11434/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: modelo,
        messages: [{ role: "user", content: caso.prompt }],
        // Pass the tools as part of the request
        tools: [caso.schemaEsperado],
        stream: false,
      }),
    });

    const data = await respuesta.json();
    const latencia = Date.now() - inicio;

    // Validate whether the JSON is parseable and whether it respects the schema
    let jsonValido = false;
    let schemaRespetado = false;
    let obtenido = "";

    try {
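      // The tool_call should be in message.tool_calls[0]
      const toolCall = data.message?.tool_calls?.[0];
      obtenido = JSON.stringify(toolCall ?? data.message?.content ?? "");
      jsonValido = !!toolCall;
      // Basic schema validation: every key in parameters.required must be present.
      // schemaEsperado is the tool definition itself, so read its required list
      // (also handles the { type: "function", function: {...} } wrapper).
      const args = toolCall?.function?.arguments;
      if (args && typeof args === "object") {
        const schema = caso.schemaEsperado as {
          parameters?: { required?: string[] };
          function?: { parameters?: { required?: string[] } };
        };
        const requiredKeys = (schema.function ?? schema).parameters?.required ?? [];
        schemaRespetado = requiredKeys.every((k) => k in args);
      }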
    } catch {
      obtenido = "parse error";
    }

    resultados.push({
      caso: caso.prompt.slice(0, 50),
      esperado: caso.schemaEsperado,
      obtenido,
      jsonValido,
      schemaRespetado,
      latenciaMs: latencia,
    });
  }

  return resultados;
}

My minimum acceptance criteria for any tool calling model in a real system:

Metric	Minimum acceptable	Why
Valid JSON	99%+	A parse error in production breaks the entire flow
Schema respected	95%+	Wrong arguments are silently dangerous
p95 latency	< 500ms local	If it's slower than an external API, you've lost the point
Tool name hallucination	0%	An invented name is a non-recoverable error

The limits that the hype doesn't mention

There are three limitations that don't show up in the headlines and that I consider central before betting on a distilled model in a real system.

First, the teacher's distribution defines the ceiling. If Gemini has biases in how it generates tool calls (certain argument patterns, certain naming conventions), the student inherits them unfiltered. This matters if your API has conventions that drift from Gemini's style.

Second, generalization to unseen schemas is an open question. A distilled model can be excellent on the patterns it learned and brittle against complex schemas with anyOf, nested $refs, or conditional validations. You have to test it explicitly against your own schemas rather than assume the general benchmark applies.

Third, a model with only 26M parameters likely has limited capacity to handle long contexts. In systems where the prompt includes many tools at the same time (common in backends with dozens of endpoints exposed as tools), degradation can be significant. That's a hypothesis to validate, not to assume.

None of this invalidates the project. It locates it. The same discipline I applied when reviewing cache problems in CI with pnpm workspaces applies here: understand the limit first, then decide if it fits.

Where Needle makes sense and where it doesn't

Scenarios where it makes sense to try Needle:

Local agent pipelines where network latency to external APIs is the bottleneck
Edge devices or resource-constrained environments where a 26M model fits comfortably in memory
Systems with a bounded and stable set of tools, not dozens of shifting schemas
As a local fallback when external APIs are unavailable

Scenarios where it probably doesn't cut it:

Systems where reasoning between tool calling steps is complex: deciding when to call which tool, not just how to call it
APIs with deeply nested or polymorphic schemas
Flows where long conversational context matters: a model this small is likely to struggle there
Environments that need auditable safety guarantees: a privately distilled model is a more opaque box

The tension raised in the Spring Boot Actuator in production post applies differently here: the comfort of "it works in the demo" can hide surface risks that only show up under load or with unexpected inputs.

What this anticipates for the small-model ecosystem

The uncomfortable thing about Needle isn't the model itself. It's what it confirms: functional specialization is going to put pressure on the hegemony of large general-purpose models in structured tasks.

Tool calling, intent classification, entity extraction with a fixed schema: these are tasks where a well-trained distilled model can beat GPT-4 or Claude on cost and latency without sacrificing reliability. That changes the architecture calculus.

In my current stack, with Claude Code for complex reasoning and Ollama for local tasks, there's a gap exactly where Needle would aim: the tool router that decides which function to call and with which arguments, without needing the overhead of a 70B model for that. I'm not saying I'll adopt it tomorrow. I'm saying the category makes sense and the experiment is worth following.

Just as when I evaluated the tradeoffs of Jakarta EE vs Spring Boot or compared package managers in real monorepos, the honest answer is neither "adopt it now" nor "ignore it": it's "try it against your own criteria before committing".

FAQ: Needle, distillation, and tool calling in small models

What exactly is model distillation in the context of LLMs?
It's a process where a large model (teacher) generates a dataset of correct behavior, in this case well-formed tool-calling examples, which is then used to train a small model (student). The student learns to imitate the teacher's output distribution on the specific tasks it was distilled for, without needing the teacher's full architecture.

Are 26M parameters enough for reliable tool calling?
It depends on the scope. For a bounded set of tools with simple schemas, probably yes. For systems with dozens of complex tools, long contexts, or multi-step reasoning, it's an open hypothesis. The project's benchmark is optimistic; validating against your own schemas is mandatory before betting on it.

How do I try it locally without compromising a production system?
With Ollama, if the model is available in the registry, it's as simple as ollama pull [name] and then evaluating with your own script against the schemas you already use. The validation checklist in this post is a starting point. Always test against your real tools, never against the README examples.

What is the practical difference between Needle and using OpenAI's or Anthropic's function calling?
Latency, cost, and privacy. A local model has no network RTT, no per-token cost, and doesn't send your tool schemas to an external API. The trade-off is that reliability depends entirely on the training quality of the local model, without the backing of a provider with an SLA.

Is it worth it for an individual stack, or only for companies with infrastructure?
A 26M model fits on a MacBook with 8GB of RAM without drama. This is not enterprise infrastructure. If you already use Ollama for other tasks, as I do, adding a specialized model is operationally trivial. The real cost is evaluation time, not hardware.

What happens if the model hallucinates a tool name that doesn't exist in my system?
That's the worst case, and it has to be designed for as an expected failure. The routing layer that consumes the model's output has to validate that the name in the tool call matches a registered tool before executing anything. If it doesn't exist, the error has to be explicit, not silent. This is basic defensive design, independent of the model you use.
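A minimal sketch of that defensive routing layer, under stated assumptions: the ToolCall shape, the RouteResult type, and the get_weather tool are hypothetical names for illustration, not Needle's actual output format. The point it shows is that an unknown name produces an explicit, typed failure and never reaches a handler.

```typescript
// Hypothetical shapes: illustrative only, not Needle's real output format.
type ToolCall = { name: string; arguments: Record<string, unknown> };
type ToolHandler = (args: Record<string, unknown>) => unknown;

type RouteResult =
  | { ok: true; value: unknown }
  | { ok: false; reason: "invalid_json" | "malformed_call" | "unknown_tool"; detail: string };

function routeToolCall(rawOutput: string, registry: Map<string, ToolHandler>): RouteResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawOutput);
  } catch {
    return { ok: false, reason: "invalid_json", detail: rawOutput.slice(0, 80) };
  }

  // Check the basic shape before trusting any field.
  if (
    typeof parsed !== "object" ||
    parsed === null ||
    typeof (parsed as { name?: unknown }).name !== "string"
  ) {
    return { ok: false, reason: "malformed_call", detail: "missing string field 'name'" };
  }
  const call = parsed as Partial<ToolCall> & { name: string };

  // The key check: a hallucinated name is rejected explicitly, never executed.
  const handler = registry.get(call.name);
  if (handler === undefined) {
    return { ok: false, reason: "unknown_tool", detail: call.name };
  }
  const args =
    typeof call.arguments === "object" && call.arguments !== null ? call.arguments : {};
  return { ok: true, value: handler(args) };
}

// Usage with a one-tool registry (get_weather is a made-up tool).
const registry = new Map<string, ToolHandler>([
  ["get_weather", (args) => `weather for ${String(args.city)}`],
]);
const good = routeToolCall('{"name":"get_weather","arguments":{"city":"Rosario"}}', registry);
const bad = routeToolCall('{"name":"get_wether","arguments":{}}', registry);
```

Returning a discriminated union instead of throwing keeps the failure visible at the call site, which makes it easy to count unknown_tool results in an evaluation harness.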

Conclusion: try it with your eyes open

I'm not going to say Needle is the future, nor that it's noise. My position is more specific: functional distillation of large-model behavior into small specialized models is a legitimate direction, and tool calling is a use case where it makes genuine technical sense.

What I don't buy is frictionless enthusiasm. A 26M model has real limits on context, on generalization, and on reliability under unseen schemas. Those limits don't show up in the HN post, and they will show up in production.

My concrete recommendation: if you have an agent pipeline with a stable set of tools and latency is a problem, build a test harness with your own schemas, run it against the acceptance criteria in this post, and measure. If it clears the threshold of 99% valid JSON and 95% schema compliance on your own cases, you have something useful. If it doesn't, you know exactly why.
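Those thresholds can be turned into a small pass/fail check. This is a sketch, assuming a homegrown harness that records one CaseResult per test case; the CaseResult and Verdict types and the evaluate function are hypothetical names, and how each flag gets computed (JSON parsing, schema validation, registry lookup) is left to your own harness. The zero-tolerance rule for invented tool names comes from the checklist earlier in the post.

```typescript
// Hypothetical record for one test case in a homegrown harness.
type CaseResult = { validJson: boolean; schemaRespected: boolean; toolNameKnown: boolean };

type Verdict = {
  validJsonRate: number;
  schemaRate: number;
  hallucinatedNames: number;
  passes: boolean;
};

function evaluate(results: CaseResult[], minValidJson = 0.99, minSchema = 0.95): Verdict {
  if (results.length === 0) throw new Error("no test cases");
  const rate = (pred: (r: CaseResult) => boolean): number =>
    results.filter(pred).length / results.length;

  const validJsonRate = rate((r) => r.validJson);
  // A case only counts as schema-compliant if it parsed in the first place.
  const schemaRate = rate((r) => r.validJson && r.schemaRespected);
  const hallucinatedNames = results.filter((r) => r.validJson && !r.toolNameKnown).length;

  return {
    validJsonRate,
    schemaRate,
    hallucinatedNames,
    passes: validJsonRate >= minValidJson && schemaRate >= minSchema && hallucinatedNames === 0,
  };
}

// Illustrative data, not real measurements: 200 cases, 1 invalid JSON, 8 schema failures.
const cases: CaseResult[] = Array.from({ length: 200 }, (_, i) => ({
  validJson: i !== 0,
  schemaRespected: i >= 8,
  toolNameKnown: true,
}));
const verdict = evaluate(cases);
```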

That's more useful than any third-party benchmark.

Are you using local models for tool calling? Tell me at juanchi.dev what stack you put together and where you hit the limits.

This article was originally published on juanchi.dev

OpenTelemetry on Spring Boot 3: when logs say OK and traces show the problem

Juan Torchia — Sat, 16 May 2026 19:06:15 +0000

There's a question I've asked myself many times while debugging backend systems: did the request take long because the DB was slow, because the downstream kept us waiting, or because some internal loop fired 60 queries to fetch 60 records? The log says duration_ms=340 and status=200. That's it. You start guessing.

That moment of uncertainty is where this lab came from. Not to measure OpenTelemetry overhead, not to compare Jaeger against Tempo, but to answer something more concrete: what signals do you lose when you only have good logs, and what shows up when you add a trace?

The repo is at github.com/JuanTorchia/opentelemetry-spring-boot-lab, commit c12ea4e848dc431c8bbd324318399172302fe053, tag editorial-final-diagnosis-comparison-v2.

The setup: a lab that produces evidence, not benchmarks

The stack is Spring Boot 3.5.7, Java 21, PostgreSQL 16, OpenTelemetry API 1.43.0, OpenTelemetry Java Agent 2.9.0, and Jaeger all-in-one. Everything starts with Docker Compose. To reproduce it:

# Quick smoke test with small dataset (1k tasks)
.\scripts\run-lab.ps1 -Mode smoke -Size small

# Full editorial run (50k tasks, 200 requests, warmup 20, concurrency 8)
.\scripts\run-lab.ps1 -Mode editorial -Size editorial -Runs 3 -Requests 200 -Warmup 20 -Concurrency 8

The runner starts Compose, downloads the agent into tools/, packages the jar, seeds Postgres with synthetic tables (organizations, users, projects, tasks, comments), runs the scenarios, queries Jaeger by traceId, and regenerates the reports in results/.

Jaeger was chosen for local simplicity: one image, web UI, REST API to query traces by traceId. Tempo is also valid, but needs more moving parts for a local editorial demo. This is not a production stack recommendation.

The editorial dataset has 50,000 tasks. The small dataset has 1,000. The size difference matters: on the larger dataset the N+1 produces visible fan-out rather than a microsecond gap that disappears into noise.

The instrumentation decision I care about most

The pom.xml has opentelemetry-api as a compile dependency, but the agent arrives at runtime. That means HTTP server, HTTP client, and JDBC are instrumented automatically without touching business code.

Manual spans are used only for business stages that the agent can't infer:

// LabService.java — manual span to mark business intent
Span span = tracer.spanBuilder("business.n_plus_one.load_tasks_then_comments").startSpan();
try (var ignored = span.makeCurrent()) {
    // first fetches tasks, then runs one query per task
    List<Map<String, Object>> tasks = jdbcTemplate.queryForList(
        "select t.id, t.title, u.display_name as assignee from tasks t "
        + "join users u on u.id = t.assignee_id order by t.id limit ?",
        limit);
    for (Map<String, Object> task : tasks) {
        Long taskId = ((Number) task.get("id")).longValue();
        // this query repeats per task → fan-out
        Integer comments = jdbcTemplate.queryForObject(
            "select count(*) from comments where task_id = ?",
            Integer.class, taskId);
        // ...
    }
    span.setAttribute("lab.n_plus_one.expected_extra_queries", enriched.size());
} finally {
    span.end();
}

That mix is more honest for the post: auto-instrumentation for infrastructure, manual spans to explain intent. If I had used only manual spans, the lab would require observability-specific code in every layer. If I had relied only on the agent, business spans would be invisible.

The logback-spring.xml injects traceId and spanId into every log line:

<!-- logback-spring.xml -->
<pattern>%d{yyyy-MM-dd'T'HH:mm:ss.SSSXXX} %-5level traceId=%X{trace_id:-none} spanId=%X{span_id:-none} %logger{36} - %msg%n</pattern>

That's what connects both worlds. A log with traceId lets you jump directly to the trace in Jaeger. Without it, logs and traces are islands.
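As a sketch of that jump from log to trace: the function below pulls the traceId out of a line written with the pattern above and builds a Jaeger UI link. The base URL is an assumption (Jaeger all-in-one serves its UI on port 16686 by default), and the sample log line is made up to match the pattern, not copied from the lab's output.

```typescript
// Assumption: Jaeger all-in-one UI on its default port.
const JAEGER_UI = "http://localhost:16686";

// Returns a link to the trace, or null when the line has no usable trace id
// (for example when the pattern's fallback value "none" was written).
function traceUrlFromLogLine(line: string, baseUrl: string = JAEGER_UI): string | null {
  // W3C trace ids are 32 lowercase hex characters.
  const match = /\btraceId=([0-9a-f]{32})\b/.exec(line);
  return match ? `${baseUrl}/trace/${match[1]}` : null;
}

// Made-up log line in the shape produced by the logback pattern.
const sampleLine =
  "2026-05-16T19:06:15.123+00:00 INFO  traceId=4bf92f3577b34da6a3ce929d0e0e4736 " +
  "spanId=00f067aa0ba902b7 c.e.l.RequestCompletionLoggingFilter - status=200 duration_ms=340";
const traceUrl = traceUrlFromLogLine(sampleLine);
```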

The matrix that summarizes the diagnosis

Scenario	p95	Avg spans	Avg DB spans	Error spans/request	Defensible diagnosis
baseline	55 ms	3.04	1.04	0	Healthy request, no weird story.
optimized	59 ms	3.04	1.04	0	Same functional shape, no DB fan-out.
n-plus-one	209 ms	63.38	61.38	0	DB fan-out visible inside one request.
downstream-slow	374 ms	4	0	0	Time concentrates in downstream.
mixed	395 ms	7.57	1.57	0	DB, downstream, and transformation compete.
partial-error	184 ms	6.27	1.27	3	Downstream error inside a partial response.

This table is not trying to crown a tool. It summarizes which signals are available for diagnosis. The strong point is not that one number is universal: it is that N+1 leaves a very different shape than the optimized case, and that shape does not appear in a flat log unless you enable SQL debug.

What the six scenarios reveal

The lab has six endpoints: baseline, n-plus-one, optimized, downstream-slow, mixed, and partial-error. Each produces different signals that the runner consolidates into results/comparison.md and results/diagnosis-comparison.md.

The finding I most want to defend:

N+1 vs optimized: both return the same response shape. The log for both says status=200. The difference lives in the trace: n-plus-one generates an average of 63.38 spans per request in the editorial run; optimized generates 3.04. That's not a universal performance claim — it's a diagnostic signal. With only logs and no SQL debug enabled, the difference is ambiguous. With the trace, DB fan-out is visible without extra configuration.

Downstream-slow: p95 sits at 374 ms, very close to the configured 300 ms delay. Logs show total duration and traceId. What they don't show is where that time went: was it DB? was it the downstream? was it in-memory transformation? The trace separates it: the downstream HTTP client span dominates the hierarchy. The local DB appears as a secondary span with low duration.

Mixed: this is where flat logs fail the most. Three stages compete (DB, downstream, transformation) and none is obviously dominant. p95 reaches 395 ms. The trace shows the temporal distribution per stage. The log just says it was slow.

Partial-error: the endpoint responds with HTTP 206 (partial content). The log records traceId, status, and error type. The trace goes further: the downstream span is marked with error, nested under a request that technically responded. Logs and trace don't replace each other here — they complement. The log alerts and lets you correlate. The trace places the error in the causal hierarchy.

The screenshot that changed the diagnosis

In Jaeger, n-plus-one does not look like a request that is merely a bit slower. It looks like a request with DB fan-out: many repeated spans under the same business operation.

The optimized case, on the other hand, keeps a compact shape. I do not need to read the code to suspect that the previous case was not "Postgres is slow" in the abstract, but the query shape.

The partial-error case matters for another reason: the request can respond, while the downstream span is marked as errored. That nuance is exactly where logs and traces complement each other: the log alerts, the trace locates.
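The "many repeated spans" shape can also be checked mechanically instead of by eye. The sketch below groups the spans of one trace by their SQL statement and reports statements that repeat. It makes two assumptions: the Span type is a simplified stand-in for what a trace backend exports (not Jaeger's full JSON schema), and DB spans carry a db.statement attribute, which depends on the semantic-convention version the agent emits. The threshold of 10 is arbitrary.

```typescript
// Simplified span shape: an assumption for this sketch, not Jaeger's full schema.
type Span = { operationName: string; tags: Record<string, string> };

// Groups spans by db.statement and returns statements repeated at least
// `threshold` times within a single trace, most frequent first.
function findRepeatedStatements(
  spans: Span[],
  threshold = 10,
): Array<{ statement: string; count: number }> {
  const counts = new Map<string, number>();
  for (const span of spans) {
    const statement = span.tags["db.statement"];
    if (statement === undefined) continue; // not a DB span
    counts.set(statement, (counts.get(statement) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, count]) => count >= threshold)
    .map(([statement, count]) => ({ statement, count }))
    .sort((a, b) => b.count - a.count);
}

// Synthetic trace mimicking the n-plus-one scenario: one list query, then one count per task.
const listSql = "select t.id, t.title, u.display_name as assignee from tasks t join users u on u.id = t.assignee_id order by t.id limit ?";
const countSql = "select count(*) from comments where task_id = ?";
const trace: Span[] = [
  { operationName: "GET /n-plus-one", tags: { "http.request.method": "GET" } },
  { operationName: "business.n_plus_one.load_tasks_then_comments", tags: {} },
  { operationName: "SELECT tasks", tags: { "db.statement": listSql } },
  ...Array.from({ length: 60 }, () => ({
    operationName: "SELECT comments",
    tags: { "db.statement": countSql },
  })),
];
const repeated = findRepeatedStatements(trace);
```

On the synthetic trace, only the per-task count query crosses the threshold; the list query appears once and is ignored.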

The honest limits of the metrics

The *_vs_root_pct fields in results/diagnosis-comparison.md are cumulative percentages of span durations exported by Jaeger. They can exceed 100% when there are nested spans, client/server pairs, or overlap. The duration_denominator_type field indicates what was used as the denominator: root_span, http_request_span, or largest_observed_span if the trace was ambiguous.

These are not overhead numbers. They are not an exact distribution of real request time. They are cumulative diagnostic signals. Treating them like CPU percentages would be a misread that this lab doesn't try to encourage.
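A small worked example of why a cumulative percentage can exceed 100%. The numbers are illustrative, not taken from the lab's results, and cumulativePctOfRoot is a hypothetical helper, not the lab's actual computation: it simply sums child span durations against the root, which double-counts time whenever one span is nested inside another.

```typescript
type SpanDuration = { name: string; durationMs: number };

// Sums every non-root span duration and divides by the root duration.
// Nested spans are counted in full at each level, so the result is cumulative.
function cumulativePctOfRoot(rootMs: number, spans: SpanDuration[]): number {
  const total = spans.reduce((sum, s) => sum + s.durationMs, 0);
  return (total * 100) / rootMs;
}

// Illustrative nesting: root request 100 ms
//   └─ business stage 90 ms
//        └─ HTTP client call 80 ms (fully inside the business stage)
const pct = cumulativePctOfRoot(100, [
  { name: "business stage", durationMs: 90 },
  { name: "HTTP client call", durationMs: 80 },
]);
// pct is 170: the 80 ms of the client call is counted twice, once on its own
// and once as part of the 90 ms business stage that contains it.
```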

Similarly, diagnosis_confidence_* is an editorial classification coded in ScenarioDiagnosis.java, not an automatically measured metric. For N+1, diagnosisConfidenceLogs is low and diagnosisConfidenceTrace is high. That reflects the fact that without SQL debug, the log is ambiguous. It's not a universal benchmark of which tool is better.

My position: what I accept and what I don't buy

I accept that OpenTelemetry with the Java Agent is a reasonable way to add structural visibility to a Spring Boot 3 app without polluting business code. JDBC and HTTP client auto-instrumentation works well for common scenarios.

I don't buy the narrative that traces replace logs. The lab's RequestCompletionLoggingFilter is a Servlet filter that records every completed request with scenario, method, path, status, and duration. Those logs are operationally useful even when Jaeger is unavailable. The traceId in the log is the bridge, not the replacement.

I also don't buy that Jaeger is the only valid option. It was chosen because it starts with one image and has a ready web UI. Tempo, Zipkin, or any OTLP-compatible backend would solve the same problem in this context.

The honest trade-off is this: auto-instrumentation reduces accidental work but adds an agent, attached to the JVM with -javaagent rather than placed on the classpath, that exports data in the background. In a local lab that's trivial. In production, agent overhead depends on load, exporter configuration, and sampling. This lab doesn't measure that, and claiming otherwise would be misleading.

What to do with this

If you already have structured logs in production with traceId and spanId, the next step isn't replacing anything. It's adding a trace backend and connecting both worlds. The lab shows that Spring Boot 3 auto-instrumentation with the Java Agent is enough for common scenarios, and that manual spans only make sense when you want to name business intent that the agent can't infer.

If you're evaluating whether the effort is worth it: the case where it's most clearly justified isn't the healthy baseline. It's the mixed scenario or the N+1, where logs give you a number and the trace gives you a shape. The difference between guessing and diagnosing.

After this lab, my rule is simple: logs tell you what happened; traces help you understand how it happened. If the flat log only gives you total duration, you do not have an explanation yet. You have a clue.

This article was originally published on juanchi.dev

OpenTelemetry in Spring Boot 3: when the log says OK and the trace shows the problem

Juan Torchia — Sat, 16 May 2026 19:06:09 +0000

There's a question I asked myself many times while debugging backend systems: did the request take long because the DB was slow, because the downstream left us hanging, or because some internal loop fired 60 queries to fetch 60 records? The log says duration_ms=340 and status=200. That's all. You start guessing.

That moment of uncertainty was the origin of this lab. Not to measure OpenTelemetry overhead, not to compare Jaeger against Tempo, but to answer something more concrete: what signals do you lose when you only have good logs, and what shows up when you add a trace?

The repo is at github.com/JuanTorchia/opentelemetry-spring-boot-lab, commit c12ea4e848dc431c8bbd324318399172302fe053, tag editorial-final-diagnosis-comparison-v2.

The setup: a lab that produces evidence, not benchmarks

The stack is Spring Boot 3.5.7, Java 21, PostgreSQL 16, OpenTelemetry API 1.43.0, OpenTelemetry Java Agent 2.9.0, and Jaeger all-in-one. Everything comes up with Docker Compose. To reproduce it:

# Quick smoke test with the small dataset (1k tasks)
.\scripts\run-lab.ps1 -Mode smoke -Size small

# Full editorial run (50k tasks, 200 requests, warmup 20, concurrency 8)
.\scripts\run-lab.ps1 -Mode editorial -Size editorial -Runs 3 -Requests 200 -Warmup 20 -Concurrency 8

The runner starts Compose, downloads the agent into tools/, packages the jar, seeds Postgres with synthetic tables (organizations, users, projects, tasks, comments), runs the scenarios, queries Jaeger by traceId, and regenerates the reports in results/.

Jaeger was chosen for local simplicity: one image, a web UI, and a REST API to query traces by traceId. Tempo is also valid, but it needs more moving parts for a local editorial demo. This is not a production stack recommendation.

The editorial dataset has 50,000 tasks. The small one has 1,000. The difference matters: it lets the N+1 produce visible fan-out rather than a microsecond gap that disappears into noise.

The instrumentation decision I care about most

The pom.xml has opentelemetry-api as a compile dependency, but the agent arrives at runtime. That means HTTP server, HTTP client, and JDBC are instrumented automatically without touching business code.

Manual spans are used only for business stages that the agent can't infer:

// LabService.java: manual span to mark business intent
Span span = tracer.spanBuilder("business.n_plus_one.load_tasks_then_comments").startSpan();
try (var ignored = span.makeCurrent()) {
    // first fetches tasks, then runs one query per task
    List<Map<String, Object>> tasks = jdbcTemplate.queryForList(
        "select t.id, t.title, u.display_name as assignee from tasks t "
        + "join users u on u.id = t.assignee_id order by t.id limit ?",
        limit);
    for (Map<String, Object> task : tasks) {
        Long taskId = ((Number) task.get("id")).longValue();
        // this query repeats for each task → fan-out
        Integer comments = jdbcTemplate.queryForObject(
            "select count(*) from comments where task_id = ?",
            Integer.class, taskId);
        // ...
    }
    span.setAttribute("lab.n_plus_one.expected_extra_queries", enriched.size());
} finally {
    span.end();
}

That mix is more honest for the post: auto-instrumentation for infrastructure, manual spans to explain intent. If I had used only manual spans, the lab would require observability-specific code in every layer. If I had relied only on the agent, the business spans would be invisible.

The logback-spring.xml injects traceId and spanId into every log line:

<!-- logback-spring.xml -->
<pattern>%d{yyyy-MM-dd'T'HH:mm:ss.SSSXXX} %-5level traceId=%X{trace_id:-none} spanId=%X{span_id:-none} %logger{36} - %msg%n</pattern>

That's what connects both worlds. A log with a traceId lets you jump straight to the trace in Jaeger. Without it, logs and traces are islands.

The matrix that summarizes the diagnosis

Scenario	p95	Avg spans	Avg DB spans	Error spans/request	Defensible diagnosis
baseline	55 ms	3.04	1.04	0	Healthy request, no weird story.
optimized	59 ms	3.04	1.04	0	Same functional shape, no DB fan-out.
n-plus-one	209 ms	63.38	61.38	0	DB fan-out visible inside one request.
downstream-slow	374 ms	4	0	0	Time concentrates in downstream.
mixed	395 ms	7.57	1.57	0	DB, downstream, and transformation compete.
partial-error	184 ms	6.27	1.27	3	Downstream error inside a partial response.

This table is not trying to crown a tool. It summarizes which signals remain available for diagnosis. The strong point is not that any number is universal: it is that N+1 leaves a very different shape from the optimized case, and that shape does not appear in a flat log unless you enable SQL debug.

What the six scenarios reveal

The lab has six endpoints: baseline, n-plus-one, optimized, downstream-slow, mixed, and partial-error. Each produces different signals that the runner consolidates into results/comparison.md and results/diagnosis-comparison.md.

The finding I most want to defend:

N+1 vs optimized: both return the same response shape. The log for both says status=200. The difference is in the trace: n-plus-one generates an average of 63.38 spans per request in the editorial run; optimized generates 3.04. That is not a universal performance claim, it is a diagnostic signal. With logs alone and no SQL debug enabled, the difference is ambiguous. With the trace, the DB fan-out is visible without extra configuration.

Downstream-slow: p95 sits at 374 ms, very close to the configured 300 ms delay. Logs show total duration and traceId. What they don't show is where that time went: was it the DB? The downstream? In-memory transformation? The trace separates it: the downstream HTTP client span dominates the hierarchy. The local DB appears as a secondary span with low duration.

Mixed: this is where flat logs fail the most. Three stages compete (DB, downstream, transformation) and none is obviously dominant. p95 reaches 395 ms. The trace shows the temporal distribution per stage. The log just says it was slow.

Partial-error: the endpoint responds with HTTP 206 (partial content). The log records the traceId, the status, and the error type. The trace goes further: the downstream span is marked with an error, nested under a request that technically responded. Logs and trace don't replace each other here, they complement each other. The log alerts and lets you correlate. The trace places the error in the causal hierarchy.

The screenshot that changed the diagnosis

In Jaeger, n-plus-one does not look like a request that is merely a bit slower. It looks like a request with DB fan-out: many repeated spans under the same business operation.

The optimized case, by contrast, keeps a compact shape. I don't need to read the code to suspect that the problem in the previous case was not "slow Postgres" in the abstract, but the query shape.

The partial-error case is also worth it for another reason: the request can respond, but the downstream span ends up marked as errored. That nuance is exactly where logs and traces complement each other: the log alerts, the trace locates.

The honest limit of the metrics

The *_vs_root_pct fields in results/diagnosis-comparison.md are cumulative percentages of span durations exported by Jaeger. They can exceed 100% when there are nested spans, client/server pairs, or overlap. The duration_denominator_type field indicates what was used as the denominator: root_span, http_request_span, or largest_observed_span if the trace was ambiguous.

They are not overhead. They are not an exact distribution of real request time. They are cumulative diagnostic signals. Using them as if they were CPU percentages would be a misreading that this lab does not try to encourage.

In the same way, diagnosis_confidence_* is an editorial classification coded in ScenarioDiagnosis.java, not an automatically measured metric. For N+1, diagnosisConfidenceLogs is low and diagnosisConfidenceTrace is high. That reflects the fact that without SQL debug, the log is ambiguous. It is not a universal benchmark of which tool is better.

My position: what I accept and what I don't buy

I accept that OpenTelemetry with the Java Agent is a reasonable way to add structural visibility to a Spring Boot 3 app without polluting business code. JDBC and HTTP client auto-instrumentation works well for common scenarios.

I don't buy the narrative that traces replace logs. The lab's RequestCompletionLoggingFilter is a Servlet filter that records every completed request with scenario, method, path, status, and duration. Those logs are operationally useful even when Jaeger is unavailable. The traceId in the log is the bridge, not the replacement.

I also don't buy that Jaeger is the only valid option. It was chosen because it starts with one image and has a ready web UI. Tempo, Zipkin, or any OTLP-compatible backend would solve the same problem in this context.

The honest trade-off is this: auto-instrumentation reduces accidental work but adds an agent, attached to the JVM with -javaagent rather than placed on the classpath, that exports data in the background. In a local lab that's trivial. In production, agent overhead depends on load, exporter configuration, and sampling. This lab doesn't measure that, and claiming otherwise would be misleading.

What to do with this

If you already have structured logs in production with traceId and spanId, the next step isn't replacing anything. It's adding the trace backend and connecting both worlds. The lab shows that Spring Boot 3 auto-instrumentation with the Java Agent is enough for common scenarios, and that manual spans only make sense when you want to name business intent that the agent can't infer.

If you're evaluating whether the effort is worth it: the case that justifies it most clearly isn't the healthy baseline. It's the mixed scenario or the N+1, where logs give you a number and the trace gives you a shape. The difference between guessing and diagnosing.

After this lab, my rule is this: logs to know what happened; traces to understand how it happened. If the flat log only gives you total duration, you don't have an explanation yet. You have a clue.

This article was originally published on juanchi.dev

Prisma vs JDBC: the benchmark that almost made me blame the wrong ORM

Juan Torchia — Sat, 16 May 2026 01:36:53 +0000

There's a discussion that surfaces every time someone posts an ORM benchmark: "of course JDBC is faster, you're measuring the abstraction". They're right, but only halfway. What nobody says is that the abstraction isn't the only culprit — sometimes the culprit is you, because you let an N+1 slip through without noticing.

I built prismavsjdbc to test this in a controlled way. It's not a benchmark about who wins. It's a lab where the same PostgreSQL 16, the same 50k-task dataset, and the same business scenarios run against two stacks: Node.js 24 LTS + TypeScript + Prisma 5 on one side, and Spring Boot 3 + Java 21 LTS + JdbcTemplate on the other. The analyzed commit is 2cd33e32bd29a1d4b46a26af0b56d6a912f5e4f5, tag best-effort-editorial-final.

The thesis I'm defending is this: query shape, SQL/request, and N+1 explain more than the slogan "ORM vs raw SQL". When you optimize the shape, both stacks improve. When you don't, both stacks charge you.

The problem that almost made me draw the wrong conclusion

The first version of the lab had an obvious trap, even though I didn't see it at first. It compared the most comfortable Prisma implementation — using include to fetch relations — against a manual join in JDBC. The result was predictable: JDBC measured 1 SQL/request, idiomatic Prisma measured 4 SQL/request on read-by-id, and latency reflected that.

Incorrect conclusion I almost published: "Prisma is slower because it emits more queries".

Correct conclusion: I was comparing different shapes. Prisma's include fires separate queries per relation — that's not a bug, it's the documented contract of the API. JDBC did a join because I wrote it that way. It's not fair to compare them without acknowledging that.

That friction changed the entire lab design: I needed three levels within each stack.

Three levels: naive, idiomatic, best-effort

Adding the level column to results/comparison.csv was the most important decision in the project. Without it, any results table is a trap for the reader.

naive: the most direct implementation possible, with no thought given to performance. In both stacks, this includes deliberate N+1 — per-task queries inside a loop.
idiomatic: the normal, maintainable way to write code in each stack. Prisma with include and _count, JDBC with the join any Java dev would write without obsessing over micro-optimizations.
best-effort: the tightest code the team would accept without it becoming a hack. For Prisma, this means dropping to $queryRaw when the shape is aggregational.

The read-by-id scenario with idiomatic Prisma measured 4 SQL/request due to include. The read-by-id-best-effort variant with $queryRaw dropped to 1 SQL/request — the same join JDBC uses. The PostgreSQL plan for that query is clean:

-- read-by-id-best-effort: same SQL in Prisma $queryRaw and in JdbcTemplate
select t.id, t.title, t.status, t.created_at as "createdAt",
       p.id as "projectId", p.name as "projectName",
       o.id as "organizationId", o.name as "organizationName",
       u.id as "assigneeId", u.display_name as "assigneeName"
from tasks t
join projects p on p.id = t.project_id
join organizations o on o.id = p.organization_id
join users u on u.id = t.assignee_id
where t.id = '00000000-0000-4000-0100-000000000001'::uuid
limit 1;
-- Execution Time: 0.242 ms, Buffers: shared hit=9

When Prisma and JDBC emit the same SQL, the PostgreSQL plan is identical. That closes the runtime debate: the bottleneck was the shape, not the client.

N+1 is the usual villain, but the lab shows it with numbers

The n-plus-one-trap scenario exists to make explicit something every developer knows in theory but underestimates in practice. The naive level in both stacks fires individual queries per task — on a 50k-task dataset with concurrency 16, that scales brutally.

The biggest jump in the lab wasn't between Prisma and JDBC. It was between naive and idiomatic within Prisma. When you go from N+1 to include/_count, the reduction in SQL/request is immediate and visible in latency. After that, if you want to squeeze more, $queryRaw gives you another jump — but smaller than the first.
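To make the naive-to-idiomatic jump concrete without a database, here is a sketch with an in-memory stand-in that counts "queries". FakeDb and its methods are hypothetical and synchronous for brevity; real Prisma or JDBC calls go over the network, which is exactly why the extra round trips cost latency. The naive function issues 1 + N queries, the batched one issues 2 regardless of N.

```typescript
// In-memory stand-in for a database client that counts queries (illustration only).
class FakeDb {
  queries = 0;
  private comments = new Map<number, number>([[1, 3], [2, 0], [3, 5]]); // taskId → comment count

  listTaskIds(limit: number): number[] {
    this.queries++;
    return [1, 2, 3].slice(0, limit);
  }

  countComments(taskId: number): number {
    this.queries++; // one query per task
    return this.comments.get(taskId) ?? 0;
  }

  countCommentsFor(taskIds: number[]): Map<number, number> {
    this.queries++; // one grouped query, e.g. "where task_id in (...) group by task_id"
    return new Map(taskIds.map((id): [number, number] => [id, this.comments.get(id) ?? 0]));
  }
}

// Naive level: 1 list query + N per-task queries.
function naive(db: FakeDb, limit: number): number[] {
  return db.listTaskIds(limit).map((id) => db.countComments(id));
}

// Batched shape: 2 queries total, independent of N.
function batched(db: FakeDb, limit: number): number[] {
  const ids = db.listTaskIds(limit);
  const counts = db.countCommentsFor(ids);
  return ids.map((id) => counts.get(id) ?? 0);
}

const naiveDb = new FakeDb();
const naiveResult = naive(naiveDb, 3);
const batchedDb = new FakeDb();
const batchedResult = batched(batchedDb, 3);
```

Both functions return the same data; only the query count differs, which is the SQL/request signal the lab tracks.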

The interesting part on the Java side is that CountingJdbc — the wrapper over JdbcTemplate in apps/jdbc-service/src/main/java/com/example/jdbclab/CountingJdbc.java — uses an AtomicLong to count queries. That allows an objective SQL/request comparison without relying on logs or pg_stat_statements as the primary source:

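// CountingJdbc.java: instrumentation with no magic, easy to audit
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowMapper;
import org.springframework.stereotype.Component;

@Component
public class CountingJdbc {
  private final JdbcTemplate jdbc;
  private final AtomicLong queryCount = new AtomicLong();

  // constructor injection: Spring supplies the JdbcTemplate bean
  public CountingJdbc(JdbcTemplate jdbc) {
    this.jdbc = jdbc;
  }

  public <T> List<T> query(String sql, RowMapper<T> mapper, Object... args) {
    // each call to the wrapper adds 1 to the counter
    queryCount.incrementAndGet();
    return jdbc.query(sql, mapper, args);
  }

  public long count() {
    return queryCount.get();
  }
}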

On the Prisma side, the equivalent lives in apps/prisma-client/src/db.ts: it hooks into the client's query event to count. That symmetry in instrumentation is what makes the SQL/request numbers comparable across stacks.
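The lab's db.ts is not reproduced here, so the following is only a sketch of the event-hook idea. It counts through a minimal structural interface (QueryEventSource, a name invented for this example) and uses a FakeClient stand-in so it runs without a database. With a real Prisma client, the query event has to be enabled through the log option, as shown in the comment.

```typescript
// Minimal structural type for what this sketch needs from a client (hypothetical name).
type QueryEvent = { query: string };
interface QueryEventSource {
  $on(event: "query", listener: (e: QueryEvent) => void): void;
}

// With a real client (not executed here), event-based query logging is enabled like this:
//   const prisma = new PrismaClient({ log: [{ emit: "event", level: "query" }] });
//   prisma.$on("query", (e) => { /* e.query, e.params, e.duration */ });

function attachQueryCounter(client: QueryEventSource) {
  let count = 0;
  client.$on("query", () => {
    count += 1;
  });
  return {
    count: () => count,
    // Queries issued since a snapshot. Only attributable to one request when
    // nothing else runs concurrently; under load, divide totals by request count.
    since: (snapshot: number) => count - snapshot,
  };
}

// Stand-in emitter so the sketch runs without a database.
class FakeClient implements QueryEventSource {
  private listeners: Array<(e: QueryEvent) => void> = [];
  $on(_event: "query", listener: (e: QueryEvent) => void): void {
    this.listeners.push(listener);
  }
  emit(query: string): void {
    this.listeners.forEach((listener) => listener({ query }));
  }
}

// Simulates one request that issues 4 SQL statements.
const client = new FakeClient();
const counter = attachQueryCounter(client);
const before = counter.count();
["select 1", "select 2", "select 3", "select 4"].forEach((q) => client.emit(q));
const sqlForRequest = counter.since(before);
```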

When $queryRaw makes sense and when it's a surrender

This is the part where a lot of Prisma posts aren't honest. $queryRaw exists and is valid, but using it for everything is admitting you don't want to use Prisma — you're using PostgreSQL with a fancy TypeScript client.

The decision in the lab was clear: best-effort with $queryRaw makes sense in relation-summary and report-aggregation because the shape is genuinely aggregational. Prisma groupBy doesn't cleanly express date_trunc + join by organization, and forcing it would be worse than writing SQL.

By contrast, paginated-list has no best-effort variant because idiomatic Prisma already emits 1 SQL/request with findMany and filters. Adding $queryRaw there wouldn't change anything meaningful — it would be complexity with no benefit.

The table in docs/brief-post.md models this well: the level column isn't a scale of "how much effort you put in" but of "how much the SQL shape changes when you apply the variant".

What the lab can't guarantee

The HTTP runner is homegrown, not k6 or wrk. The hardware is local. Docker Desktop, GC, plan cache, and indexes can shift absolute latencies between runs. The editorial run used 3 runs, 300 requests per run, 30 warmup requests, concurrency 16, and a 50k-task dataset, but the same settings on different hardware can produce different latencies.

The version matrix (docs/java-version-matrix.md) shows Java 21 vs Java 25: there are differences, but the main argument — that N+1 and SQL/request dominate — holds on both JVMs. Java 25 improved read-by-id by ~20% over Java 21 in the local run, but that doesn't change the fact that the problem in relation-summary-naive was the shape, not the JVM.

I wouldn't publish those absolute numbers as universal truth. I publish them as evidence of a pattern: when you change the shape, the delta is orders of magnitude larger than when you change the runtime.

The position I landed on

Prisma is not slow. Prisma with include emitting 4 queries where you could emit 1 is an ergonomics trade-off with an observable cost — and that cost is worth it for most endpoints in an API that isn't under extreme pressure. When shape genuinely matters, $queryRaw exists and works well.

JDBC with JdbcTemplate is not superior just because it's raw SQL. It's predictable because the developer controls the shape from the start. The risk is on the other side: that nobody checks whether those Java loops are also doing N+1 without an ORM to blame.

The lab is reproducible. If you have Docker, Node 24 LTS, and Java 21 or 25, you can run it:

# full editorial run — Bash
bash scripts/run-lab.sh --mode editorial --size editorial --runs 3 --requests 300 --warmup 30 --concurrency 16

And if you just want to verify the scenarios run without errors before committing time:

# quick smoke test to validate the setup
bash scripts/run-lab.sh --mode smoke --size small

The code is at github.com/JuanTorchia/prismavsjdbc. Editorial results are in results/comparison.csv and results/comparison.md.

What I'd like to know: in the stack you're using right now, do you have real visibility into the SQL/request count for each endpoint? Or do you assume the ORM handles it on its own?

This article was originally published on juanchi.dev.

Prisma vs JDBC: the benchmark that almost made me blame the wrong ORM

Juan Torchia — Sat, 16 May 2026 01:36:44 +0000

There's an argument that shows up every time someone posts an ORM benchmark: "of course JDBC is faster, you're measuring the abstraction." And they're right, but only halfway. What nobody says is that the abstraction isn't the only culprit: sometimes the culprit is you, because you let an N+1 slip through without noticing.

I built prismavsjdbc to test this in a controlled way. It's not a benchmark about who wins. It's a lab where the same PostgreSQL 16, the same 50k-task dataset, and the same business cases run against two stacks: Node.js 24 LTS + TypeScript + Prisma 5 on one side, and Spring Boot 3 + Java 21 LTS + JdbcTemplate on the other. The commit analyzed is 2cd33e32bd29a1d4b46a26af0b56d6a912f5e4f5, tag best-effort-editorial-final.

The thesis I defend is this: query shape, SQL/request, and N+1 explain more than the slogan "ORM vs raw SQL". When you optimize the shape, both stacks improve. When you don't, both make you pay.

The problem that almost led me to the wrong conclusion

The first version of the lab had an obvious trap, though I didn't see it at first. It compared the most comfortable Prisma implementation, using include to fetch relations, against a hand-written join in JDBC. The result was predictable: JDBC measured 1 SQL/request, idiomatic Prisma measured 4 SQL/request on read-by-id, and latency reflected it.

The wrong conclusion I almost published: "Prisma is slower because it emits more queries."

The right conclusion: I was comparing different shapes. Prisma's include issues separate queries per relation; that's not a bug, it's the documented contract of the API. JDBC did a join because I wrote it that way. It isn't fair to compare them without acknowledging that.

That's the friction that changed the whole design of the lab: I needed three levels within each stack.

Three levels: naive, idiomatic, best-effort

Adding the level column to results/comparison.csv was the most important decision in the project. Without it, any results table is a trap for the reader.

naive: the most direct implementation possible, with no thought for performance. In both stacks this includes a deliberate N+1: per-task queries inside a loop.
idiomatic: the normal, maintainable way to write the code in each stack. Prisma with include and _count, JDBC with the join any Java dev would write without obsessing over micro-optimizations.
best-effort: the tightest code the team accepts without it turning into a hack. For Prisma, this means dropping down to $queryRaw when the shape is aggregational.

The read-by-id scenario with idiomatic Prisma measured 4 SQL/request because of the include. The read-by-id-best-effort variant with $queryRaw dropped to 1 SQL/request, the same join JDBC uses. The PostgreSQL plan for that query is clean:

-- read-by-id-best-effort: same SQL in Prisma $queryRaw and in JdbcTemplate
select t.id, t.title, t.status, t.created_at as "createdAt",
       p.id as "projectId", p.name as "projectName",
       o.id as "organizationId", o.name as "organizationName",
       u.id as "assigneeId", u.display_name as "assigneeName"
from tasks t
join projects p on p.id = t.project_id
join organizations o on o.id = p.organization_id
join users u on u.id = t.assignee_id
where t.id = '00000000-0000-4000-0100-000000000001'::uuid
limit 1;
-- Execution Time: 0.242 ms, Buffers: shared hit=9

When Prisma and JDBC emit the same SQL, the PostgreSQL plan is identical. That settles the runtime debate: the bottleneck was the shape, not the client.

N+1 is the usual villain, but the lab shows it with numbers

The n-plus-one-trap scenario exists to make explicit something every developer knows in theory but underestimates in practice. The naive level in both stacks issues individual queries per task; on a 50k-task dataset with concurrency 16, that scales brutally.

The biggest jump in the lab wasn't between Prisma and JDBC. It was between naive and idiomatic within Prisma. When you go from N+1 to include/_count, the reduction in SQL/request is immediate and visible in latency. After that, if you want to squeeze more, $queryRaw gives you another jump, but a smaller one than the first.

The interesting part on the Java side is that CountingJdbc (the wrapper over JdbcTemplate in apps/jdbc-service/src/main/java/com/example/jdbclab/CountingJdbc.java) uses an AtomicLong to count queries. That allows an objective SQL/request comparison without relying on logs or pg_stat_statements as the primary source:

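// CountingJdbc.java: instrumentation with no magic, easy to audit
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowMapper;
import org.springframework.stereotype.Component;

@Component
public class CountingJdbc {
  private final JdbcTemplate jdbc;
  private final AtomicLong queryCount = new AtomicLong();

  // Constructor injection; not shown in the original excerpt, needed for the final field
  public CountingJdbc(JdbcTemplate jdbc) {
    this.jdbc = jdbc;
  }

  public <T> List<T> query(String sql, RowMapper<T> mapper, Object... args) {
    // each call to the wrapper adds 1 to the counter
    queryCount.incrementAndGet();
    return jdbc.query(sql, mapper, args);
  }

  public long count() {
    return queryCount.get();
  }
}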

On the Prisma side, the equivalent lives in apps/prisma-client/src/db.ts: it hooks into the client's query event to count. That symmetry in instrumentation is what makes the SQL/request numbers comparable across stacks.

When $queryRaw makes sense and when it's a surrender

This is the part where a lot of Prisma posts aren't honest. $queryRaw exists and is valid, but using it for everything is admitting you don't want to use Prisma: you're using PostgreSQL with a fancy TypeScript client.

The decision in the lab was clear: best-effort with $queryRaw makes sense in relation-summary and report-aggregation because the shape is genuinely aggregational. Prisma groupBy doesn't cleanly express date_trunc + join by organization, and forcing it would be worse than writing SQL.

By contrast, paginated-list has no best-effort variant because idiomatic Prisma already emits 1 SQL/request with findMany and filters. Adding $queryRaw there wouldn't change anything meaningful; it would be complexity with no benefit.

The table in docs/brief-post.md models this well: the level column isn't a scale of "how much effort you put in" but of "how much the SQL shape changes when you apply the variant".

What the lab can't guarantee

The HTTP runner is homegrown, not k6 or wrk. The hardware is local. Docker Desktop, GC, plan cache, and indexes can shift absolute latencies between runs. The editorial run used 3 runs, 300 requests per run, 30 warmup requests, concurrency 16, and a 50k-task dataset, but the same settings on different hardware can produce different latencies.

The version matrix (docs/java-version-matrix.md) shows Java 21 vs Java 25: there are differences, but the main argument (that N+1 and SQL/request dominate) holds on both JVMs. Java 25 improved read-by-id by ~20% over Java 21 in the local run, but that doesn't change the fact that the problem in relation-summary-naive was the shape, not the JVM.

I wouldn't publish those absolute numbers as universal truth. I publish them as evidence of a pattern: when you change the shape, the delta is orders of magnitude larger than when you change the runtime.

The position I landed on

Prisma is not slow. Prisma with include emitting 4 queries where you could emit 1 is an ergonomics trade-off with an observable cost, and that cost is worth it for most endpoints in an API that isn't under extreme pressure. When shape genuinely matters, $queryRaw exists and works well.

JDBC with JdbcTemplate is not superior just because it's raw SQL. It's predictable because the developer controls the shape from the start. The risk is on the other side: that nobody checks whether those Java loops are also doing N+1 without an ORM to blame.

The lab is reproducible. If you have Docker, Node 24 LTS, and Java 21 or 25, you can run it:

# full editorial run (Bash)
bash scripts/run-lab.sh --mode editorial --size editorial --runs 3 --requests 300 --warmup 30 --concurrency 16

And if you just want to verify the scenarios run without errors before committing time:

# quick smoke test to validate the setup
bash scripts/run-lab.sh --mode smoke --size small

The code is at github.com/JuanTorchia/prismavsjdbc. Editorial results are in results/comparison.csv and results/comparison.md.

What I'd like to know: in the stack you're using right now, do you have real visibility into the SQL/request count for each endpoint? Or do you assume the ORM handles it on its own?

This article was originally published on juanchi.dev.

Retry isn't free: budget, amplification, and the cost that never shows up in p95

Juan Torchia — Fri, 15 May 2026 15:55:35 +0000

There's a decision I've gotten wrong more than once: adding retry as if it were a free improvement. Configure three attempts with exponential backoff, the system looks more stable on the dashboard, done. What I wasn't watching was how many extra calls I was sending to the downstream on every failure.

This post comes from an experiment I built to measure exactly that: when retry buys real availability, when it multiplies pressure, and when it simply changes nothing because the problem isn't transient. The repo is retry-resilience-experiment, commit bdfc350, with Spring Boot 3.3.5, Java 21, Resilience4j 2.2.0, and k6 as the load generator.

My thesis is simple: retry is budget. Each extra attempt consumes user wait time, hits the real downstream, and can accelerate a degradation that was already in progress. It's not a feature you flip on and call it done.

The problem with only looking at success rate

When the downstream has simulated random failures at 35%, the difference between policies is visible. With no-retry-standard-timeout, the success rate in that run was 0.6529. With immediate-retry, it climbed to 0.955. That looks like a clear win.

But the number that matters is right next to it: retry_amplification_factor. With immediate-retry on random-failures it reached 1.465. That means for every user request, the system made 1.465 real calls to the downstream. In jitter-random-failures it was 1.471. The downstream received almost 47% more traffic than k6 generated.

For transient failures that might be acceptable. The downstream is failing for external reasons, retries land at different moments, and the outcome improves. But that 47% extra isn't abstract: downstream capacity has to exist to absorb it. If the service is already at its limit, that overhead is the nudge that tips it over.

The metric the repo defines as a contract for not fooling yourself is exactly that:

// MetricSnapshot.java — this line exists to prevent self-deception
double retryAmplificationFactor, // downstream_calls / total_requests

If you only look at successRate and errorRate, you can believe you won when you actually pushed 47% more load onto a system that was already struggling.
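
Both measured numbers line up with a simple probabilistic model. As a sketch, assume each attempt fails independently with probability p = 0.35 and the policy allows at most 3 attempts. The attempt limit is my assumption here, not a value quoted from the repo's configuration. Under that model the expected calls per request are 1 + p + p^2 and the success rate is 1 - p^3.

```java
class RetryModel {
    // Expected downstream calls per request: attempt k only happens if the previous k-1 all failed.
    static double expectedCalls(double failureProbability, int maxAttempts) {
        double calls = 0.0;
        double allPreviousFailed = 1.0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            calls += allPreviousFailed;
            allPreviousFailed *= failureProbability;
        }
        return calls;
    }

    // A request succeeds unless every allowed attempt fails.
    static double successRate(double failureProbability, int maxAttempts) {
        return 1.0 - Math.pow(failureProbability, maxAttempts);
    }

    public static void main(String[] args) {
        // maxAttempts = 3 is an assumed policy setting, used only for illustration.
        System.out.println("no retry:   calls=" + expectedCalls(0.35, 1) + " success=" + successRate(0.35, 1));
        System.out.println("3 attempts: calls=" + expectedCalls(0.35, 3) + " success=" + successRate(0.35, 3));
    }
}
```

Under those assumptions the model gives roughly 1.47 calls per request and a success rate near 0.957, close to the 1.465 and 0.955 observed in the run; the no-retry case gives 0.65 against the observed 0.6529. The extra load is a direct consequence of the failure rate, not an artifact of the run.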

progressive-degradation: where retry can accelerate the collapse

This scenario is the most interesting one methodologically, and also the one with the most important warning.

The PROGRESSIVE_DEGRADATION downstream implements this:

// DownstreamScenario.java — delay grows with each real call received
case PROGRESSIVE_DEGRADATION ->
    Duration.ofMillis(Math.min(900, 80 + callNumber * 3));

The delay isn't external or fixed: it grows with callNumber, which is the counter of real calls to the downstream. That means a policy with more retries generates more calls, and those calls accelerate the degradation. It's not the same failure for everyone: policies with retry degrade faster because they push harder.
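
To get a feel for how quickly that formula bites, this sketch evaluates it for a few call numbers. It assumes callNumber only grows during a run, and it uses the 260 ms caller timeout that this post quotes for STANDARD_TIMEOUT. firstCallOver is a hypothetical helper, not something from the repo.

```java
import java.time.Duration;

class ProgressiveDelay {
    // Same expression as the PROGRESSIVE_DEGRADATION case quoted in the post.
    static Duration delayFor(long callNumber) {
        return Duration.ofMillis(Math.min(900, 80 + callNumber * 3));
    }

    // Hypothetical helper: first call number whose delay is strictly above the threshold.
    static long firstCallOver(long thresholdMs) {
        if (thresholdMs >= 900) {
            throw new IllegalArgumentException("delay is capped at 900 ms");
        }
        long call = 0;
        while (delayFor(call).toMillis() <= thresholdMs) {
            call++;
        }
        return call;
    }

    public static void main(String[] args) {
        System.out.println(delayFor(1).toMillis());  // 83
        System.out.println(delayFor(60).toMillis()); // 260
        System.out.println(firstCallOver(260));      // 61
        System.out.println(firstCallOver(899));      // 274
    }
}
```

So from the 61st downstream call onward, every attempt needs more than 260 ms, and the 900 ms cap is reached at call 274. A policy that issues calls faster gets there sooner.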

The numbers from the run show this clearly. With no-retry-standard-timeout, 7720 total requests were processed and 7720 downstream calls were initiated. With immediate-retry, total requests dropped to 2939 but downstream calls went up to 8699, with an amplification factor of 2.96. The retry policy processed fewer user requests but made more downstream calls.

To be clear: this isn't a design flaw, it's the point of the experiment. The lab documents it explicitly in docs/brief-post.md: progressive-degradation should be read as load-sensitive degradation, not as an identical external failure for all policies. If you treat it as a direct comparison between policies under the same conditions, the conclusion is framed wrong from the start.

What you can conclude: in scenarios where the degradation rate depends on the volume of calls received, retries can be an accelerant. That has a name in production: retry storm. And the lab reproduces it in a controlled way.

The percentiles that lie to you when there are timeouts

There's a technical detail that changed how I read the results, and the README documents it honestly.

The caller timeout is implemented with future.cancel(true) in the RetryExecutor:

// RetryExecutor.java — cancel(true) interrupts the attempt from the caller side
try {
    future.get(policy.timeout().toMillis(), TimeUnit.MILLISECONDS);
    return new AttemptResult(true, elapsedMs(started), "ok", true);
} catch (TimeoutException timeout) {
    future.cancel(true);
    return new AttemptResult(false, elapsedMs(started), "timeout", true);
}

When an attempt exceeds the timeout, the latency recorded for that attempt is capped by the caller timeout: STANDARD_TIMEOUT = Duration.ofMillis(260). That's why in progressive-degradation almost all all_attempt_p95_ms and all_attempt_p99_ms values show exactly 260. It's not that the downstream responded in 260 ms: it's that the caller stopped waiting at 260 ms and recorded that as the attempt latency.
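
The capping effect is easy to reproduce with plain arithmetic. The sketch below assumes a nearest-rank percentile and made-up downstream latencies; the lab's own percentile method may differ.

```java
import java.util.Arrays;

class CappedPercentiles {
    // Latency as the caller records it: never more than its own timeout.
    static long[] capAtTimeout(long[] actualMs, long timeoutMs) {
        return Arrays.stream(actualMs).map(ms -> Math.min(ms, timeoutMs)).toArray();
    }

    // Nearest-rank percentile (an assumption; other definitions interpolate).
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical downstream latencies: 10 attempts, 3 of them far beyond the timeout.
        long[] actual = {90, 100, 110, 120, 130, 140, 150, 400, 650, 900};
        long[] recorded = capAtTimeout(actual, 260);
        System.out.println(percentile(actual, 95));   // 900
        System.out.println(percentile(recorded, 95)); // 260
    }
}
```

Once more than 5% of attempts time out, the recorded p95 is pinned to the timeout value and says nothing about how slow the downstream actually was.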

What happens after the cancel(true) in the simulated downstream isn't fully modeled. In a real system with HTTP, a database, or a queue, the downstream may keep executing work even after the client has given up. The lab counts initiated calls but can't guarantee there's no residual work post-cancellation.
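
A JDK-only sketch of that limitation: cancel(true) only delivers an interrupt, so work that never checks the interrupt flag keeps running. The busy loop below is a deliberately uncooperative stand-in for a downstream, and the 50 ms and 200 ms timings are arbitrary.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

class ResidualWorkDemo {
    // Returns true if the task ran to completion even though the caller cancelled it.
    static boolean workFinishedAfterCancel() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);
        AtomicBoolean completed = new AtomicBoolean(false);
        try {
            Future<?> future = pool.submit(() -> {
                started.countDown();
                long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(200);
                // Uncooperative work: never checks the thread's interrupt flag.
                while (System.nanoTime() < deadline) {
                    Thread.onSpinWait();
                }
                completed.set(true);
                done.countDown();
            });
            started.await();
            try {
                future.get(50, TimeUnit.MILLISECONDS); // caller-side timeout
            } catch (TimeoutException timeout) {
                future.cancel(true); // caller gives up; the worker is only interrupted
            }
            boolean finished = done.await(2, TimeUnit.SECONDS);
            return finished && completed.get() && future.isCancelled();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("work finished after cancel: " + workFinishedAfterCancel());
    }
}
```

The future reports itself as cancelled while the work still completes in the background. With a real HTTP client or database driver, whether the remote side stops depends on that client's cancellation support, which this sketch does not model.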

This also matters for reading successful_requests_per_second. The value of 0.95 that appears across several progressive-degradation scenarios isn't the system's maximum capacity: it's the useful work observed under that closed k6 load. With a different VU configuration, a different duration, or a real network, the numbers would differ.

circuit-breaker and bulkhead: visible rejections as a protection signal

In progressive-degradation, the circuit breaker produces something that looks contradictory at first glance. The 13-circuit-breaker-progressive-degradation run has total_requests = 44777 and circuit_breaker_rejected = 44718. The error rate is 0.9987. That looks catastrophic.

But look at the downstream calls: 198. Amplification factor: 0.004. The circuit breaker almost completely stopped sending calls to the downstream. The rejections are visible to the client, but the downstream is protected.

Compare that with immediate-retry-progressive-degradation, which has downstream_calls = 8699 and keeps failing at the same rate, and the trade-off becomes obvious. The circuit breaker chooses to reject fast rather than multiply pressure on something that can no longer respond.
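
The mechanism behind those numbers can be shown with a deliberately tiny breaker. This is a hypothetical, single-threaded sketch that opens after a fixed number of consecutive failures and never closes again; it is not how Resilience4j works internally, and the threshold of 5 is arbitrary.

```java
import java.util.function.BooleanSupplier;

class TinyBreakerDemo {
    // Hypothetical count-based breaker: opens after N consecutive failures and stays open.
    static final class TinyBreaker {
        private final int failureThreshold;
        private int consecutiveFailures;
        long downstreamCalls;
        long rejected;

        TinyBreaker(int failureThreshold) {
            this.failureThreshold = failureThreshold;
        }

        // Returns true only if the downstream was called and succeeded.
        boolean call(BooleanSupplier downstream) {
            if (consecutiveFailures >= failureThreshold) {
                rejected++; // fail fast: no downstream call at all
                return false;
            }
            downstreamCalls++;
            boolean ok = downstream.getAsBoolean();
            consecutiveFailures = ok ? 0 : consecutiveFailures + 1;
            return ok;
        }
    }

    // Sends `requests` requests to a downstream that always fails; returns {calls, rejected}.
    static long[] runAgainstDeadDownstream(int requests, int failureThreshold) {
        TinyBreaker breaker = new TinyBreaker(failureThreshold);
        for (int i = 0; i < requests; i++) {
            breaker.call(() -> false);
        }
        return new long[] {breaker.downstreamCalls, breaker.rejected};
    }

    public static void main(String[] args) {
        long[] result = runAgainstDeadDownstream(1000, 5);
        System.out.println("downstream calls: " + result[0]); // 5
        System.out.println("rejected: " + result[1]);         // 995
        System.out.println("amplification: " + (double) result[0] / 1000); // 0.005
    }
}
```

Almost every request is rejected, yet the amplification factor falls far below 1, which is the same shape as the 198 downstream calls out of 44777 requests in the run.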

The bulkhead in the same run shows a different variant: bulkhead_rejected = 22122 with downstream_calls = 3668. It limits concurrency instead of opening the circuit, but the effect is similar: it reduces downstream pressure at the cost of visible rejections.
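
A bulkhead can be sketched with a plain Semaphore. The class below is hypothetical and is not the Resilience4j implementation; the limit of 16 is assumed from the max_inflight_downstream value reported for the bulkhead run, and the burst of 40 arrivals is an arbitrary figure.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;

class TinyBulkheadDemo {
    // Hypothetical semaphore-based bulkhead that rejects instead of queueing.
    static final class TinyBulkhead {
        private final Semaphore permits;
        private final AtomicLong rejected = new AtomicLong();

        TinyBulkhead(int maxConcurrentCalls) {
            this.permits = new Semaphore(maxConcurrentCalls);
        }

        // Admits the call only if a slot is free; never waits.
        boolean tryEnter() {
            if (permits.tryAcquire()) {
                return true;
            }
            rejected.incrementAndGet();
            return false;
        }

        // Must be called once per admitted call, when that call finishes.
        void exit() {
            permits.release();
        }

        long rejectedCount() {
            return rejected.get();
        }
    }

    // Simulates `arrivals` calls showing up before any admitted call has finished.
    static long[] burst(int maxConcurrentCalls, int arrivals) {
        TinyBulkhead bulkhead = new TinyBulkhead(maxConcurrentCalls);
        long admitted = 0;
        for (int i = 0; i < arrivals; i++) {
            if (bulkhead.tryEnter()) {
                admitted++;
            }
        }
        return new long[] {admitted, bulkhead.rejectedCount()};
    }

    public static void main(String[] args) {
        long[] result = burst(16, 40);
        System.out.println("admitted: " + result[0]); // 16
        System.out.println("rejected: " + result[1]); // 24
    }
}
```

Unlike the breaker, the bulkhead keeps sending work as slots free up, so the downstream still sees traffic, only never more than the limit at once.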

Those concurrency signals (max_inflight_downstream = 16 for bulkhead, 40 for most other runs) are observations, not proof of saturation. The lab renamed the metric from saturationObservation to concurrencyObservation for exactly that reason: high max_inflight doesn't prove CPU, network, or connection pool saturation. It's a signal that invites investigation, not a conclusion.

What I conclude and what I don't

This experiment is a local simulation, a single published run, against a simulated downstream with in-memory delays. The numbers don't represent production, don't represent any real provider, and don't support claiming "this policy scales to X RPS". If you want to publish exact values with strong claims, the README says it clearly: run at least three editorial runs and look for consistency, not a single pass.

What I think can be sustained:

In transient failures, retry can improve success rate but always has an amplification factor greater than 1. That overhead exists and has to fit within the system.
In load-sensitive degradation, more retries can accelerate the degradation because they generate more calls. This isn't universal, but the scenario is real and the experiment reproduces it.
p95 and p99 of attempts don't tell you the real downstream latency when there are timeouts: they tell you how long the caller waited before giving up.
Circuit breaker and bulkhead produce visible rejections that can be exactly the right decision to protect the system.

What I don't conclude: that one policy is better than another in the abstract, that these numbers apply to a different system, or that max_inflight_downstream proves saturation.

The question I'm leaving open for further exploration: how much real residual work actually remains in the downstream after a future.cancel(true) in a system with an HTTP connection pool? The lab notes it as a known limitation. In production that's exactly where the difference lies between a timeout that protects and one that only hides the problem.

The repo is at github.com/JuanTorchia/retry-resilience-experiment. If you run it and get different numbers, I want to know.

This article was originally published on juanchi.dev.

Retry isn't free: budget, amplification, and the cost that never shows up in p95

Juan Torchia — Fri, 15 May 2026 15:55:26 +0000

There's a decision I've gotten wrong more than once: adding retry as if it were a free improvement. I configure three attempts with exponential backoff, the system looks more stable on the dashboard, and that's it. What I wasn't watching was how many extra calls I was sending to the downstream on every failure.

This post comes from an experiment I built to measure exactly that: when retry buys real availability, when it multiplies pressure, and when it simply changes nothing because the problem isn't transient. The repo is retry-resilience-experiment, commit bdfc350, with Spring Boot 3.3.5, Java 21, Resilience4j 2.2.0, and k6 as the load generator.

My thesis is simple: retry is budget. Each extra attempt consumes user wait time, hits the real downstream, and can accelerate a degradation that was already in progress. It's not a feature you flip on and call it done.

The problem with only looking at success rate

When the downstream has simulated random failures at 35%, the difference between policies is visible. With no-retry-standard-timeout, the success rate in that run was 0.6529. With immediate-retry, it climbed to 0.955. That looks like a clear win.

But the number that matters is right next to it: retry_amplification_factor. With immediate-retry on random-failures it reached 1.465. That means for every user request, the system made 1.465 real calls to the downstream. In jitter-random-failures it was 1.471. The downstream received almost 47% more traffic than k6 generated.

For transient failures that might be acceptable. The downstream is failing for external reasons, retries land at different moments, and the outcome improves. But that 47% extra isn't abstract: downstream capacity has to exist to absorb it. If the service is already at its limit, that overhead is the nudge that tips it over.

The metric the repo defines as a contract for not fooling yourself is exactly that:

// MetricSnapshot.java: this line exists to prevent self-deception
double retryAmplificationFactor, // downstream_calls / total_requests

If you only look at successRate and errorRate, you can believe you won when you actually pushed 47% more load onto a system that was already struggling.

progressive-degradation: where retry can accelerate the collapse

This scenario is the most interesting one methodologically, and also the one with the most important warning.

The PROGRESSIVE_DEGRADATION downstream implements this:

// DownstreamScenario.java: delay grows with each real call received
case PROGRESSIVE_DEGRADATION ->
    Duration.ofMillis(Math.min(900, 80 + callNumber * 3));

The delay isn't external or fixed: it grows with callNumber, which is the counter of real calls to the downstream. That means a policy with more retries generates more calls, and those calls accelerate the degradation. It's not the same failure for everyone: policies with retry degrade faster because they push harder.

The numbers from the run show this clearly. With no-retry-standard-timeout, 7720 total requests were processed and 7720 downstream calls were initiated. With immediate-retry, total requests dropped to 2939 but downstream calls went up to 8699, with an amplification factor of 2.96. The retry policy processed fewer user requests but made more downstream calls.

To be clear: this isn't a design flaw, it's the point of the experiment. The lab documents it explicitly in docs/brief-post.md: progressive-degradation should be read as load-sensitive degradation, not as an identical external failure for all policies. If you treat it as a direct comparison between policies under the same conditions, the conclusion is framed wrong from the start.

What you can conclude: in scenarios where the degradation rate depends on the volume of calls received, retries can be an accelerant. That has a name in production: retry storm. And the lab reproduces it in a controlled way.

The percentiles that lie to you when there are timeouts

There's a technical detail that changed how I read the results, and the README documents it honestly.

The caller timeout is implemented with future.cancel(true) in the RetryExecutor:

// RetryExecutor.java: cancel(true) interrupts the attempt from the caller side
try {
    future.get(policy.timeout().toMillis(), TimeUnit.MILLISECONDS);
    return new AttemptResult(true, elapsedMs(started), "ok", true);
} catch (TimeoutException timeout) {
    future.cancel(true);
    return new AttemptResult(false, elapsedMs(started), "timeout", true);
}

When an attempt exceeds the timeout, the latency recorded for that attempt is capped by the caller timeout: STANDARD_TIMEOUT = Duration.ofMillis(260). That's why in progressive-degradation almost all all_attempt_p95_ms and all_attempt_p99_ms values show exactly 260. It's not that the downstream responded in 260 ms: it's that the caller stopped waiting at 260 ms and recorded that as the attempt latency.

What happens after the cancel(true) in the simulated downstream isn't fully modeled. In a real system with HTTP, a database, or a queue, the downstream may keep executing work even after the client has given up. The lab counts initiated calls but can't guarantee there's no residual work post-cancellation.

This also matters for reading successful_requests_per_second. The value of 0.95 that appears across several progressive-degradation scenarios isn't the system's maximum capacity: it's the useful work observed under that closed k6 load. With a different VU configuration, a different duration, or a real network, the numbers would differ.

circuit-breaker and bulkhead: visible rejections as a protection signal

In progressive-degradation, the circuit breaker produces something that looks contradictory at first glance. The 13-circuit-breaker-progressive-degradation run has total_requests = 44777 and circuit_breaker_rejected = 44718. The error rate is 0.9987. That looks catastrophic.

But look at the downstream calls: 198. Amplification factor: 0.004. The circuit breaker almost completely stopped sending calls to the downstream. The rejections are visible to the client, but the downstream is protected.

Compare that with immediate-retry-progressive-degradation, which has downstream_calls = 8699 and still fails anyway, and the trade-off becomes obvious. The circuit breaker chooses to reject fast rather than multiply pressure on something that can no longer respond.

The bulkhead in the same run shows a different variant: bulkhead_rejected = 22122 with downstream_calls = 3668. It limits concurrency instead of opening the circuit, but the effect is similar: it reduces downstream pressure at the cost of visible rejections.

Those concurrency signals (max_inflight_downstream = 16 for bulkhead, 40 for most other runs) are observations, not proof of saturation. The lab renamed the metric from saturationObservation to concurrencyObservation for exactly that reason: high max_inflight doesn't prove CPU, network, or connection pool saturation. It's a signal that invites investigation, not a conclusion.

What I conclude and what I don't

This experiment is a local simulation, a single published run, against a simulated downstream with in-memory delays. The numbers don't represent production, don't represent any real provider, and don't support claiming "this policy scales to X RPS". If you want to publish exact values with strong claims, the README says it clearly: run at least three editorial runs and look for consistency, not a single pass.

What I think can be sustained:

In transient failures, retry can improve success rate but always has an amplification factor greater than 1. That overhead exists and has to fit within the system.
In load-sensitive degradation, more retries can accelerate the degradation because they generate more calls. This isn't universal, but the scenario is real and the experiment reproduces it.
p95 and p99 of attempts don't tell you the real downstream latency when there are timeouts: they tell you how long the caller waited before giving up.
Circuit breaker and bulkhead produce visible rejections that can be exactly the right decision to protect the system.

What I don't conclude: that one policy is better than another in the abstract, that these numbers apply to a different system, or that max_inflight_downstream proves saturation.

The question I'm leaving open for further exploration: how much real residual work actually remains in the downstream after a future.cancel(true) in a system with an HTTP connection pool? The lab notes it as a known limitation. In production that's exactly where the difference lies between a timeout that protects and one that only hides the problem.

The repo is at github.com/JuanTorchia/retry-resilience-experiment. If you run it and get different numbers, I want to know.

This article was originally published on juanchi.dev.