Forem: Hideki Mori

Nobody knows when a job will finish. I'd still like to report it accurately.

Hideki Mori — Mon, 04 May 2026 13:00:00 +0000

Most async APIs commit to one thing: starting your job. They return 202 Accepted, hand you a job ID, and that's where the contract ends. The rest is your problem.

I do something different. I make one promise:

When your job is done, I'll tell you accurately. Until then, I'll keep retrying.

That's the entire contract for everything I've ever shipped. It sounds small. In practice, it's the only thing I actually do.

The shape every job in my system shares

You hand me work.

You wait.

I retry as hard as I can.

I report when it's done.

That's it. Whether the job is OCR on a scanned PDF, structured extraction from a long document, or refining the translation of an XLIFF file — the shape is identical. You give me an input. You don't watch the screen. I come back when I have something honest to report.

This sounds obvious until you try to actually deliver it.

Why "started" is easier than "finished"

Returning 202 Accepted is easy. The hard part starts right after that.

Real jobs hit things like:

Vendor APIs that occasionally throw 503. No reason. Just sometimes.
Native binaries that core dump. Twice in a row, then fine for a week.
Subprocesses that go zombie. Not crashed. Not finished. Just defunct. The OS still holds them.
Disks that fill up with stale debug files because something somewhere wrote them and forgot.

If you ship "started, here's a job ID, good luck" and call that an API, you're outsourcing all of the above to your user.

I'm not willing to do that. So I take the work back inside.

What that looks like in code

I'm not going to name any vendor. They don't matter. What matters is the shape. The code below is a simplified sketch — the production version handles a lot more (PDF library version quirks, fallback engines when the first one rejects the input, demo-mode page limits, and a long list of vendor-specific error codes that mean "retry," "skip," or "stop"). The shape is what survives.

Here's a sketch of the inside of one of my conversion services:

public JobResult runJob(Input input) throws Exception {
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        // Run the engine in its own JVM so a crash only takes down the child.
        Process child = new ProcessBuilder(
                "java", "-cp", classpath, EngineMain.class.getName())
            .redirectErrorStream(true)
            .start();
        passInputToStdin(child, input);

        long started = System.currentTimeMillis();
        while (child.isAlive()) {
            // Kill a child that has run past its time budget.
            if (System.currentTimeMillis() - started > MAX_RUNTIME_MS) {
                child.destroyForcibly();
            }
            // Neither running nor exited: reap it and stop waiting.
            if (isDefunct(child)) {
                reap(child);
                break;
            }
            // Keep crash dumps from filling the disk while we wait.
            sweepStaleCoreFiles(workDir, MAX_CORE_AGE_MS);
            Thread.sleep(POLL_INTERVAL_MS);
        }

        ChildOutcome outcome = readOutcome(child);
        if (outcome.isTransientError()) continue; // retry
        if (outcome.isIrrelevantError()) {
            // Known engine noise that does not affect the result.
            log.info("irrelevant error, treating as success: {}", outcome);
            return outcome.toSuccessResult();
        }
        if (outcome.hasResult()) return outcome.toResult();
        // Anything else falls through and uses up another attempt.
    }
    return JobResult.failedAfterRetries(MAX_RETRIES);
}

A few things in there are worth pointing at.

new ProcessBuilder("java", ..., EngineMain.class.getName()). Not "call a library function." Not "use the SDK." I literally re-enter main from another process. The reason is that the underlying engine, in its native form, is unreliable enough that I want process-level isolation. If it dies, only the child dies.
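To make the isolation idea concrete, here is a minimal standalone sketch, assuming Java 11 or later. The class name IsolatedRunner, the TIMED_OUT sentinel, and the javaBinary() helper are illustrative names that do not appear in the service above; only ProcessBuilder and Process are real JDK APIs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Runs a command in a separate OS process so a crash cannot take the caller down. */
final class IsolatedRunner {
    /** Sentinel (hypothetical) returned when the child was killed for running too long. */
    static final int TIMED_OUT = -1;

    static int run(List<String> command, long timeoutMs) {
        try {
            Process child = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so a chatty child cannot block on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!child.waitFor(timeoutMs, TimeUnit.MILLISECONDS)) {
                child.destroyForcibly();
                child.waitFor(); // make sure the child is gone before returning
                return TIMED_OUT;
            }
            return child.exitValue();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }

    /** Path of the java launcher belonging to the JVM that is currently running. */
    static String javaBinary() {
        return Path.of(System.getProperty("java.home"), "bin", "java").toString();
    }

    public static void main(String[] args) {
        int exit = run(List.of(javaBinary(), "-version"), 10_000);
        System.out.println("child exit code: " + exit);
    }
}
```

Whatever happens inside the child, the parent only ever sees an exit code or a timeout, which is the property the paragraph above is after.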

if (isDefunct(child)) { reap(child); break; }. Native binaries don't always exit cleanly. Sometimes they're not crashed and not running — they're stuck. The parent has to notice, decide, and clean up.
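The body of isDefunct is not shown above. One way such a check could work on Linux is to read the state letter from /proc/<pid>/stat, where Z marks a zombie. The sketch below is Linux-only and assumes Java 11 or later; the class name ProcState and both method signatures are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Linux-only sketch: reads the process state from /proc/<pid>/stat. */
final class ProcState {
    /**
     * Extracts the one-letter state from a /proc/<pid>/stat line.
     * The format is "pid (comm) state ...", and comm may itself contain
     * spaces or parentheses, so the state is located after the LAST ')'.
     */
    static char parseState(String statLine) {
        int close = statLine.lastIndexOf(')');
        if (close < 0 || close + 2 >= statLine.length()) {
            throw new IllegalArgumentException("not a /proc stat line: " + statLine);
        }
        return statLine.charAt(close + 2);
    }

    /** True when the kernel reports the process as a zombie ('Z'). */
    static boolean isDefunct(long pid) {
        Path stat = Path.of("/proc", Long.toString(pid), "stat");
        try {
            return parseState(Files.readString(stat)) == 'Z';
        } catch (IOException e) {
            return false; // no /proc entry: already reaped, or not Linux
        }
    }
}
```

A zombie is only cleared when its parent waits on it. For a direct child started from Java, calling Process.waitFor() does that.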

sweepStaleCoreFiles(workDir, MAX_CORE_AGE_MS). When a child crashes hard and core dumps are enabled, the OS writes a core file. That file can be huge. If you don't sweep it, the disk fills up. There is no clever solution here. You sweep.
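A sweep like this can be small. The sketch below is one possible shape, not the production code: it assumes core files land directly in the work directory with names starting with "core" (the real name depends on the system's core pattern setting), and the class name CoreSweeper and the int return value are illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class CoreSweeper {
    /** Deletes regular files named core* in workDir that are older than maxAgeMs. */
    static int sweepStaleCoreFiles(Path workDir, long maxAgeMs) {
        long cutoff = System.currentTimeMillis() - maxAgeMs;
        int deleted = 0;
        // The glob is matched against file names, not full paths.
        try (DirectoryStream<Path> cores = Files.newDirectoryStream(workDir, "core*")) {
            for (Path core : cores) {
                if (Files.isRegularFile(core)
                        && Files.getLastModifiedTime(core).toMillis() < cutoff
                        && Files.deleteIfExists(core)) {
                    deleted++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return deleted; // number of files actually removed
    }
}
```

The age check matters: a core file younger than the cutoff may still be being written, or may be the one someone wants to inspect.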

outcome.isTransientError() → continue. Some vendor errors come and go. The fix is to wait and try again. If you don't try again, your user sees failed. If you do try again, your user sees "took a bit longer." I pick the second one.
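The "wait and try again" part can be sketched as a small helper with exponential backoff. This is a generic illustration under stated assumptions, not the service's code: the Retry class, the TransientException marker type, and the delay schedule are all hypothetical.

```java
import java.util.function.Supplier;

final class Retry {
    /** Marker for failures that are worth retrying (hypothetical type for this sketch). */
    static final class TransientException extends RuntimeException {
        TransientException(String message) {
            super(message);
        }
    }

    /** Calls task until it succeeds or maxAttempts is used up, doubling the wait each time. */
    static <T> T withBackoff(Supplier<T> task, int maxAttempts, long baseDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        TransientException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (TransientException e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    sleep(baseDelayMs << Math.min(attempt, 16)); // 1x, 2x, 4x, ...
                }
            }
        }
        throw last; // every attempt failed: report it rather than hide it
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", e);
        }
    }
}
```

Only TransientException is retried. Any other exception propagates immediately, so a genuinely broken input is not retried as if it were a flaky vendor.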

outcome.isIrrelevantError() → log and return success. This is the part that surprises people. Some errors aren't actually errors for the use case. They're noise the engine emits. Knowing which is which takes years, and is most of the actual product.
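The retry, skip, or stop decision can be pictured as a lookup table. The sketch below is only an illustration: the error codes are invented for the example, and a real table is vendor-specific and built up from experience. The ErrorPolicy name and the choice to stop on unknown codes are assumptions.

```java
import java.util.Map;

final class ErrorPolicy {
    enum Action { RETRY, SKIP, STOP }

    // Hypothetical codes, invented for this example.
    private static final Map<String, Action> KNOWN = Map.of(
            "HTTP_503", Action.RETRY,          // vendor hiccup: try again
            "ENGINE_TIMEOUT", Action.RETRY,    // slow run: try again
            "FONT_SUBSTITUTED", Action.SKIP,   // noise: result is still usable
            "INPUT_CORRUPT", Action.STOP);     // genuinely broken input

    /** Unknown codes stop the job: reporting a failure beats reporting a false success. */
    static Action classify(String code) {
        return KNOWN.getOrDefault(code, Action.STOP);
    }
}
```

Defaulting to STOP keeps the accuracy promise: an error only gets treated as noise after someone has decided it is noise.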

None of this is elegant. None of it shows up in an architecture diagram. It all lives in the gap between "the job was submitted" and "the job is done, here's the result."

That gap is what I do.

What I gave up

I don't promise low latency. I can't. The thing I'm waiting on isn't predictable.

I don't promise the job will always succeed. Sometimes the input is genuinely broken. Then I report that, accurately, instead of pretending.

I don't promise streaming partial results. I keep the user out of the loop until I have something stable to hand back. The cost is they wait. The benefit is they don't see noise.

These trade-offs aren't sophisticated. They're just consistent.

I didn't design this. It survived.

Looking back, this is how every job-shaped API I've ever built has worked. I didn't sit down one day and decide on a contract. I kept ending up here.

Each time I tried to ship something where the API said "started" and then stopped caring, the user came back asking what happened. So I started caring. Each time I tried to surface every transient error to the user, the user got scared. So I started absorbing them. Each time I tried to make jobs faster by skipping the cleanup, the disks filled up. So I started sweeping.

After enough years of this, what's left is a single rule:

When the job is done, I'll tell you accurately. Until then, I'll keep retrying.

Whether that's the right contract for your system, I genuinely don't know. It's just the only one I've found that survives.

Earlier in this series: The Accordion Pattern: Why I stopped writing one fat LLM prompt

The Accordion Pattern: Why I stopped writing one fat LLM prompt

Hideki Mori — Wed, 29 Apr 2026 07:51:20 +0000

Most structured-extraction tutorials look the same. Take a document, write one big prompt that says "extract A, B, C, D, E, F", get JSON back. Done.

This works on short inputs.

It quietly breaks on long ones.

After running this in production for a while, I stopped doing it. Here's what I switched to and why.

The fat prompt problem

Say you have a 50-page report and you want a structured summary out of it. The natural first move is something like:

Extract:
- title
- sections (with headings)
- purpose
- mentioned services
- acceptance criteria
- ...
Return JSON in this shape: { ... }

You hand the whole document to the model. It returns JSON. It looks fine on the first try.

Then you scale it up and three things happen:

Quality drifts. The model "forgets" mid-document. Later sections are summarized worse than earlier ones, or fields go missing.
One bad field poisons the whole call. If "acceptance criteria" hallucinates, you don't just lose that field — the whole record gets quarantined for review.
Latency goes up, parallelism goes down. A single 30k-token call takes what it takes. You can't shard it.

You can fight this with longer prompts, more examples, stricter formatting rules. I did. It buys you maybe 10% more reliability and costs you a lot of prompt-engineering time.

The structural problem doesn't go away.

What I do now: split it

The pattern I use looks like an accordion that expands:

[ document ]
     │
     ▼
[ Stage 1: segment ]   ← one prompt, one job: produce a list
     │
     ▼
[ array of segments ]
     │
     ▼ (fan out)
[ Stage 2: extract ]   ← one prompt, runs per segment
     │
     ▼
[ structured records ]

Stage 1 reads the whole document and returns a clean array of segments — sections, paragraphs, line items, whatever the right unit is for the task.

Stage 2 takes one segment at a time and extracts the structured fields you actually want.

Two prompts, each doing one thing.
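In code, the two stages reduce to two functions and a loop. The sketch below assumes Java 11 or later and uses plain string functions as stand-ins for the two model calls, so it runs without any API. The names Accordion, splitOnBlankLines, and firstLineAsTitle are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class Accordion {
    /** Stage 1 turns the document into segments; Stage 2 runs once per segment. */
    static List<Map<String, String>> run(
            String document,
            Function<String, List<String>> segment,
            Function<String, Map<String, String>> extract) {
        List<Map<String, String>> records = new ArrayList<>();
        for (String seg : segment.apply(document)) {
            records.add(extract.apply(seg));
        }
        return records;
    }

    /** Stand-in for the Stage 1 model call: split on blank lines. */
    static List<String> splitOnBlankLines(String doc) {
        List<String> out = new ArrayList<>();
        for (String part : doc.split("\\n\\s*\\n")) {
            if (!part.isBlank()) {
                out.add(part.strip());
            }
        }
        return out;
    }

    /** Stand-in for the Stage 2 model call: first line is the title, the rest is the body. */
    static Map<String, String> firstLineAsTitle(String seg) {
        String[] lines = seg.split("\\n", 2);
        return Map.of("title", lines[0], "body", lines.length > 1 ? lines[1] : "");
    }
}
```

A call such as Accordion.run(doc, Accordion::splitOnBlankLines, Accordion::firstLineAsTitle) returns one record per segment. Swapping either stand-in for a real model call does not change the shape.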

Why this works better

Each prompt has a single job.
Stage 1 is "find the boundaries". Stage 2 is "extract the schema". Neither prompt has to hold both ideas at once. You can write each one tightly. Examples are shorter and more on-point.

Errors localize.
If Stage 2 fails on segment 7, you re-run segment 7. You don't redo the whole document. Bad fields get isolated to one record instead of contaminating the whole batch.
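Re-running only the failed segments needs nothing more than tracking results by index. A minimal sketch, with the SegmentRunner and Batch names invented for illustration and any RuntimeException treated as a failed segment:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class SegmentRunner {
    /** Results keyed by segment index, plus the indices that still need work. */
    static final class Batch<R> {
        final Map<Integer, R> done = new LinkedHashMap<>();
        final List<Integer> failed = new ArrayList<>();
    }

    /** First pass: every segment counts as "still to do". */
    static <R> Batch<R> runAll(List<String> segments, Function<String, R> extract) {
        Batch<R> todo = new Batch<>();
        for (int i = 0; i < segments.size(); i++) {
            todo.failed.add(i);
        }
        return retryFailed(segments, todo, extract);
    }

    /** Later passes: keep earlier successes, call extract only for failed indices. */
    static <R> Batch<R> retryFailed(
            List<String> segments, Batch<R> previous, Function<String, R> extract) {
        Batch<R> next = new Batch<>();
        next.done.putAll(previous.done);
        for (int index : previous.failed) {
            try {
                next.done.put(index, extract.apply(segments.get(index)));
            } catch (RuntimeException e) {
                next.failed.add(index); // isolated to this one segment
            }
        }
        return next;
    }
}
```

Successful records are copied forward untouched, so a retry pass costs one call per failed segment and nothing else.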

Stage 2 parallelizes naturally.
The output of Stage 1 is an array. Fan it out. Run 50 small extractions in parallel instead of one big one. Total wall-clock time drops, and so does the variance.
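The fan-out can be sketched with a fixed thread pool from the JDK. The FanOut class name and the parallelism parameter are assumptions for the example; extract stands in for the Stage 2 call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class FanOut {
    /** Runs extract on every segment in parallel and returns results in segment order. */
    static <R> List<R> extractAll(
            List<String> segments, Function<String, R> extract, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String seg : segments) {
                Callable<R> task = () -> extract.apply(seg);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // read in submission order
            }
            return results;
        } catch (ExecutionException e) {
            throw new IllegalStateException("segment extraction failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for segments", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The pool size is the knob that matters in practice: it caps how many requests hit the model provider at once, so it should stay under whatever rate limit applies.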

Cache hits go up.
If the same segment shows up twice (templates, standard headers, repeated forms), Stage 2 sees the same input and you can cache. The fat-prompt version sees the entire document as one unique input every time.
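One simple way to get those cache hits is to key results by a hash of the segment text. The sketch below is an assumption about how such a cache could look, not a description of any particular service; SegmentCache and sha256 are illustrative names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class SegmentCache<R> {
    private final Map<String, R> byHash = new ConcurrentHashMap<>();
    private final Function<String, R> extract;

    SegmentCache(Function<String, R> extract) {
        this.extract = extract;
    }

    /** Returns the cached record for identical segment text, extracting only on a miss. */
    R get(String segmentText) {
        return byHash.computeIfAbsent(sha256(segmentText), key -> extract.apply(segmentText));
    }

    static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every JVM", e);
        }
    }
}
```

Two caveats for real use: the key should also cover the prompt and model, since changing either changes the answer, and a slow extract call inside computeIfAbsent can hold up other writers to the same map, so a production cache may want to compute outside the map and insert afterwards.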

Long documents stop being scary.
The hard limit on a fat prompt is the model's context window. The accordion pattern doesn't remove that ceiling, because Stage 1 still has to read the whole document. What changes is how much is asked of that one long call: Stage 1 only has to find boundaries, and Stage 2 only ever sees one segment.

What it costs

It's not free.

You're making more LLM calls — one for Stage 1 plus N for Stage 2 instead of one. On short inputs that's wasteful. The accordion pattern is for documents long enough that fat prompts start failing, not for two-paragraph emails.

You also need to think a little harder about what a "segment" is for your task. Sometimes it's a section heading. Sometimes it's a row in a table. Sometimes it's a logical unit that doesn't map to any visible boundary. That's a design decision and it matters.

When to use it

Reach for the accordion when:

The document is long enough that you've seen the model lose the thread mid-way.
The output schema has more than ~5 fields and they don't all care about the same context.
You need to retry failed records without redoing successful ones.
You want parallelism.

Stick with one fat prompt when:

The input is short and the schema is small.
The fields are tightly coupled (extracting one needs context from another).
You're prototyping and don't care yet.

A small concrete example

I run this on a service called StructFlow. The shape of the calls is roughly:

# Stage 1: segment
curl -X POST https://gw.ldxhub.io/structflow/jobs \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-3-flash-preview",
    "system_prompt": "Split this document into logical sections. Return one JSON record per section.",
    "example_output": { "section_title": "...", "section_text": "..." },
    "inputs": [{ "id": "doc1", "data": { "text": "..." } }]
  }'

The response gives you back an array. Then Stage 2:

# Stage 2: extract (one call per segment, run in parallel)
curl -X POST https://gw.ldxhub.io/structflow/jobs \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-3-flash-preview",
    "system_prompt": "From this section, extract: purpose, mentioned services, acceptance criteria.",
    "example_output": { "purpose": "...", "mentioned_services": [], "acceptance_criteria": [] },
    "inputs": [{ "id": "sec1", "data": { "section_text": "..." } }]
  }'

Two kinds of call, each focused. One returns segments. The other runs once per segment and turns it into structured fields.

That's the whole pattern.

Why I'm posting this

I built LDX hub partly to make this pattern easy to run — one API, async jobs, file-based input/output so Stage 1's output is directly usable as Stage 2's input. But the pattern itself doesn't depend on any specific tool. You can do it with raw OpenAI calls, Anthropic calls, anything that takes a prompt and returns text.

The takeaway isn't "use my API". It's: if your structured extraction is getting flaky on long inputs, the answer probably isn't a longer prompt. It's two prompts.

If you've tried something similar — or if you've got a case where this falls apart — I'd be curious to hear it.