Forem: Jovann Thompson

Debug Log #2 — The Off-By-One That Didn’t Crash (It Just Lied)

Jovann Thompson — Tue, 26 May 2026 03:57:20 +0000

I built a local pipeline that takes long chat transcripts saved as PDFs and turns them into structured, cleaned output in which every conversational turn is rewritten into paired labels:

INPUT 1 / OUTPUT 1
INPUT 2 / OUTPUT 2

That pairing is the contract. It’s what makes the transcript auditable instead of just scrollable.

The Symptom

During a final integrity pass, I opened the cleaned PDF to confirm the labeling held from start to finish. Right at the beginning, the artifact told a different story:

INPUT 2 / OUTPUT 1
INPUT 3 / OUTPUT 2

The labels still alternated input/output and the pipeline completed, but the numbering was shifted from the very first turn. The system runs, the output exists, and the output is quietly lying by one. That lie ripples into every downstream count, integrity check, and assumption built on top of it.

The real question became: where is the first place the system starts lying?

Initial Confusion

At first I kept framing it as a counting issue: maybe something in the missing-input/missing-output analysis, maybe a reporting mismatch, maybe an integrity summary that was slightly off. I didn’t want to rerun the entire dataset just to test a small correctness problem, so I tried to do it the right way: build a small sample input, isolate the stage, and validate expected versus actual.

That immediately raised practical questions I couldn’t dodge. Where do I even inject a sample? If my entrypoint starts at PDFs, how do I test a mid-stage without breaking the whole flow? If I create a CSV, which CSV does the stage actually expect?

The framing itself was the problem. I was treating it like a reporting bug when it was actually a contract bug.

What the Bug Really Was

The system was never meant to count like:

INPUT 1, OUTPUT 2, INPUT 3, OUTPUT 4...

It was meant to preserve paired conversational turns:

INPUT 1 / OUTPUT 1
INPUT 2 / OUTPUT 2

So if the cleaned PDF starts at INPUT 2 / OUTPUT 1, the core failure isn’t in downstream analysis. The numbering contract is being violated somewhere upstream, and everything else is just inheriting the damage. Reframing it that way collapsed the search space immediately: stop looking at reporting and trace back to wherever the labels get written in the first place.
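To make the contract concrete, here is a minimal sketch of a checker that reports the first label that breaks it. It is an illustration, not the project's code: the function name first_contract_violation and the assumption that each label sits alone on its own line (for example "INPUT 3") are hypothetical. It allows an INPUT with no OUTPUT, since missing outputs are a separate question from shifted numbering.

```python
import re

# Assumed label format: a line containing only "INPUT <n>" or "OUTPUT <n>".
LABEL_RE = re.compile(r"^(INPUT|OUTPUT)\s+(\d+)\s*$")


def first_contract_violation(lines, start=1):
    """Return (line_index, message) for the first label that breaks pairing, else None.

    Contract checked here:
      * INPUT numbers increase by exactly one, beginning at `start`.
      * every OUTPUT carries the number of the most recent INPUT.
    """
    last_input = start - 1
    for idx, line in enumerate(lines):
        match = LABEL_RE.match(line.strip())
        if not match:
            continue  # body text, not a label
        kind, number = match.group(1), int(match.group(2))
        if kind == "INPUT":
            expected = last_input + 1
            if number != expected:
                return idx, f"expected INPUT {expected}, found INPUT {number}"
            last_input = number
        elif number != last_input:
            return idx, f"expected OUTPUT {last_input}, found OUTPUT {number}"
    return None
```

Run against the shifted labels from the symptom, a check like this would flag the very first label, which is the point: the violation is at the start, not somewhere in the reporting.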

The Trap I Almost Fell Into

Before that reframe landed, I tried to build a debug input using raw “you said / chatgpt said” style text, because that’s what I visually associate with the PDF source. But a test fixture only helps if it matches the contract of the stage you’re actually testing. Some stages in the pipeline don’t consume raw conversational text; they consume already-columnized CSV data. Feed the wrong-shaped input into the wrong layer and you’re not debugging the system anymore. You’re debugging a mismatch you created.

That was one of the real lessons of this log: if your mental model of the pipeline layers is even slightly off, you can do a lot of work that produces zero signal.
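One cheap guard against that trap is to check a fixture's shape before feeding it to a stage. The sketch below is an assumption-heavy illustration: the column names turn, role, and text are hypothetical, and the real stages may expect something different.

```python
import csv
import io


def missing_fixture_columns(csv_file, required_columns):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []  # None when the file is empty
    return [col for col in required_columns if col not in header]


# Hypothetical columns a mid-pipeline stage might require.
REQUIRED = ["turn", "role", "text"]

raw_fixture = io.StringIO("You said: hello\nChatGPT said: hi\n")
columnized_fixture = io.StringIO("turn,role,text\n1,input,hello\n1,output,hi\n")

if __name__ == "__main__":
    print(missing_fixture_columns(raw_fixture, REQUIRED))         # all three are missing
    print(missing_fixture_columns(columnized_fixture, REQUIRED))  # nothing is missing
```

If the list comes back non-empty, the fixture belongs to a different layer, and any failure it produces says nothing about the stage under test.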

Tracing Back to the First Lie

The way it became solvable was tracing backward from the artifact I trusted until I found the first divergence.

Start with the cleaned PDF, where the numbering is wrong at the first turn. Work backward through the pipeline outputs and stage boundaries. At each boundary ask: is the numbering still correct here, or did it break here? The moment a layer is confirmed correct, stop blaming it and move earlier.
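The backward walk can be sketched as a small loop. This is a generic illustration under stated assumptions: the stage names and the per-boundary check callables are hypothetical, and each check is assumed to answer "is the numbering correct in this stage's output?".

```python
def first_broken_stage(boundaries):
    """Walk the stage boundaries from last to first.

    `boundaries` is a list of (stage_name, check) in pipeline order, where
    check() returns True when the numbering is still correct in that stage's
    output. Returns the earliest stage in the trailing run of bad outputs,
    or None when the final artifact is already correct.
    """
    culprit = None
    for name, check in reversed(boundaries):
        if check():
            break  # output is good here, so the break happened after this stage
        culprit = name
    return culprit


if __name__ == "__main__":
    # Hypothetical stage names with hard-coded check results.
    boundaries = [
        ("extract", lambda: True),
        ("ingest_and_clean", lambda: False),
        ("analysis", lambda: False),
        ("cleaned_pdf", lambda: False),
    ]
    print(first_broken_stage(boundaries))
```

The walk stops at the first good boundary it meets, so it never spends effort on layers that sit upstream of a confirmed-correct output.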

That tracing forced a clear outcome. The numbering wasn’t being broken by the analysis layer. It wasn’t something happening at the end. It was being introduced in the ingestion and cleaning step, the part of the system that writes the labels in the first place.

Root Cause

The offset wasn’t random drift. It was a systematic base shift baked in from the start.

I remembered why: sometimes when copying a thread, the first “You said:” label doesn’t exist the way the parser expects, so I had added logic to bootstrap the first input anyway. The intention was correct: recover from messy real-world formatting. But the implementation created a permanent misalignment. Input and output were being advanced out of sync at the very beginning, so everything after stayed consistently off by one.

The bug didn’t need to crash to be real. It just needed to violate the contract once.

The Fix

The fix was structural, not a patch. Instead of two separate counters drifting against each other, the labeling logic was rebuilt around a single turn counter. That counter increments only when an INPUT is encountered or injected, every OUTPUT is labeled with that same turn number, and the edge-case injection can no longer double-increment the first real turn. The goal was to make it structurally impossible for OUTPUT numbering to drift away from INPUT numbering, regardless of what the source formatting looks like.
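A minimal sketch of that single-counter shape follows. It is not the repository's implementation: the function name label_turns, the start_turn parameter, and the exact marker strings "You said:" and "ChatGPT said:" are assumptions made for illustration.

```python
INPUT_MARK = "You said:"       # assumed source marker for a user turn
OUTPUT_MARK = "ChatGPT said:"  # assumed source marker for a reply


def label_turns(lines, start_turn=1):
    """Rewrite source lines into INPUT n / OUTPUT n labels using one counter."""
    turn = start_turn - 1  # the only counter; OUTPUT never has its own
    seen_input = False
    labeled = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        is_input = line.startswith(INPUT_MARK)
        is_output = line.startswith(OUTPUT_MARK)
        if is_input or not seen_input:
            # Either a real input label, or content arriving before any input
            # label: inject the first INPUT. Both paths share this one
            # increment, so injection cannot double-count the first turn.
            turn += 1
            seen_input = True
            labeled.append(f"INPUT {turn}")
        if is_output:
            labeled.append(f"OUTPUT {turn}")  # reuses the current turn number
        if is_input:
            text = line[len(INPUT_MARK):].strip()
        elif is_output:
            text = line[len(OUTPUT_MARK):].strip()
        else:
            text = line
        if text:
            labeled.append(text)
    return labeled
```

Because OUTPUT reads the counter and never writes it, there is no second number that could fall out of step with the first.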

Proof

I didn’t jump straight into a full run. I validated the fix in isolation first, using a small harness that calls the labeling function directly against three cases: normal format, missing first label, and continuation from a higher turn number. Only after the harness proved the contract held did I rerun the full pipeline and spot-check the cleaned PDF from beginning to end.
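A harness of that kind might look like the sketch below. The case data, the assumed call shape label_fn(lines, start_turn=...), and the marker strings are hypothetical. To show the harness catching something, it is run here against a deliberately broken two-counter stand-in, which is an invented example of how such a shift can arise, not the original buggy code.

```python
FOUR_LABELS = ["INPUT 1", "OUTPUT 1", "INPUT 2", "OUTPUT 2"]

# (case name, source lines, start_turn, expected labels)
CASES = [
    ("normal format",
     ["You said: a", "ChatGPT said: b", "You said: c", "ChatGPT said: d"],
     1, FOUR_LABELS),
    ("missing first label",
     ["a", "ChatGPT said: b", "You said: c", "ChatGPT said: d"],
     1, FOUR_LABELS),
    ("continuation from a higher turn",
     ["You said: a", "ChatGPT said: b", "You said: c", "ChatGPT said: d"],
     7, ["INPUT 7", "OUTPUT 7", "INPUT 8", "OUTPUT 8"]),
]


def labels_only(lines):
    """Keep just the label lines so body text does not affect the comparison."""
    return [ln for ln in lines if ln.startswith(("INPUT ", "OUTPUT "))]


def run_cases(label_fn):
    """Return a list of (case, expected, got) for every case that fails."""
    failures = []
    for name, source, start, expected in CASES:
        got = labels_only(label_fn(source, start_turn=start))
        if got != expected:
            failures.append((name, expected, got))
    return failures


def double_counting_labeler(lines, start_turn=1):
    """Deliberately broken stand-in: two counters, input pre-advanced by a bootstrap."""
    input_n = start_turn       # bootstrap already consumed the first input number
    output_n = start_turn - 1
    out = []
    for line in lines:
        if line.startswith("You said:"):
            input_n += 1
            out.append(f"INPUT {input_n}")
        elif line.startswith("ChatGPT said:"):
            output_n += 1
            out.append(f"OUTPUT {output_n}")
    return out


if __name__ == "__main__":
    for name, expected, got in run_cases(double_counting_labeler):
        print(f"FAIL {name}: expected {expected}, got {got}")
```

Passing the real labeling function to run_cases should return an empty list; anything else names the case and shows the exact labels that came back.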

The output stayed aligned. The labeling read sharper because it was finally consistent.

That’s what closed the loop: not “it seems fixed,” but the invariant proven in isolation, then proven again end-to-end.

What This Log Is Really About

This was a quiet failure mode: a system that runs fine, produces output, and misleads you the whole time.

The takeaway is simple: if an artifact looks slightly wrong, don’t argue with it and don’t patch randomly. Trace backward until you find the first layer where the contract breaks. Fix the smallest layer that owns the contract. Prove it in isolation. Then reintegrate.

That’s how you stop a system from merely running and start making it trustworthy.

Project

GitHub Repository:
https://github.com/Jt-Thompson

Debug Log #1 — The Pipeline That Looked Broken

Jovann Thompson — Tue, 26 May 2026 02:40:24 +0000

I had been building a local ETL pipeline designed to process long conversational PDFs into structured datasets. The system extracted dialogue, cleaned it, generated QA artifacts, and loaded the results into SQLite for downstream analysis.

By the time this debugging process started, the core extract-transform-load flow already worked. Data could move end-to-end through the system successfully.

The problems started showing up once I added the QA and diagnostics stages around it. During development, parts of those systems seemed to work. But when I came back and reran the full pipeline, execution would appear to stop somewhere around diagnostics. Long stretches of silence. No artifacts I could confirm. No database state I fully trusted.

At that point I knew something was wrong, but I didn’t yet have enough experience to understand what kind of wrong it was. I tried a few patches based on the first advice I got, targeting potential path issues or output mismatches. None of them solved it. So the project sat for a while. I had to spend time away from the build learning how to read my own codebase, trace execution, and navigate the system well enough to come back and debug it properly.

Initial Understanding

When I came back, I started with a simpler question: was the pipeline actually broken, or did it only look broken because I couldn’t see what was happening?

Part of what triggered that question was staring at the terminal during long runs and realizing the process was still alive even though nothing visible was happening. At the time I was still learning basic operational ideas like what a “hang” even meant in practice. I had been treating long silence like proof that the system was dead, when in reality some pipeline states are just slow, blocked, waiting, or stuck behind expensive work.

That reframe changed the investigation immediately. Instead of treating it like one giant broken object, I started seeing it as a chain of expectations between stages. One script writes outputs. Another script expects those outputs somewhere specific. One stage assumes a schema already exists. Another assumes a naming format already matches. If those assumptions drift even slightly, the whole pipeline can look broken from the outside even when parts of it are still functioning correctly.

So the debugging process became: isolate the stopping point, check what was actually produced, compare it against what the next stage expected, then narrow the mismatch before changing anything.

Runtime Visibility

Earlier I had added a short timeout during debugging attempts, but I eventually realized the timeout logic was only surfacing a warning state, not actually terminating the process itself. The run would hit QA and diagnostics, I’d see the timeout, and everything after that seemed to disappear.

Instead of patching again, I started watching the runtime more carefully. I checked CPU and RAM usage to see whether the process was actually dead or just slow under load. I watched where execution appeared to stall. Then I did the thing I usually avoid during long runs: I waited long enough for the system to reveal more information on its own.

That changed the picture completely. The timeout message was not the same thing as “the pipeline stops.” It was just one event earlier in the run. Once I let the process continue, the pipeline moved into later stages successfully. The question stopped being “why does the pipeline die at QA?” and became “what is actually happening after QA finishes?”

Eventually the runtime progressed far enough for the real failure to surface:

sqlite3.OperationalError: no such column: missing_input

That changed the debugging process again. Now the issue was no longer vague runtime ambiguity. There was a concrete failure tied to a specific schema mismatch much later in execution. The pipeline was running farther than it looked. The runtime visibility had just been too weak to make that obvious earlier.
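A schema mismatch like this can be surfaced at the start of a stage instead of deep into a long run. The sketch below uses SQLite's PRAGMA table_info to compare a table's actual columns against what a stage needs. The helper name, the table name turns, and every column other than missing_input are hypothetical.

```python
import sqlite3


def missing_table_columns(conn, table, required_columns):
    """Return the required columns that the table does not have.

    PRAGMA table_info yields one row per column; index 1 is the column name.
    The table name is interpolated into the statement, so pass only trusted,
    hard-coded names.
    """
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    existing = {row[1] for row in rows}
    return [col for col in required_columns if col not in existing]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical table that lacks the column a later stage will query.
    conn.execute("CREATE TABLE turns (turn_id INTEGER, input_text TEXT)")
    print(missing_table_columns(conn, "turns", ["turn_id", "missing_input"]))
    conn.close()
```

Failing fast on a non-empty result turns a late OperationalError into an immediate, named mismatch.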

Instrumentation and Tracing

Once I understood the pipeline was continuing much farther than I originally thought, I stopped treating it like a mysterious crash and started measuring directly.

I added timing instrumentation at the stage boundaries. The ambiguity disappeared immediately. Diagnostics wasn’t “a little slow.” It was taking around fifty minutes:

[TIMING] diagnostics stage completed in 3036.21s
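Boundary timing of this kind can be added with a small context manager. This is a sketch rather than the project's instrumentation: the name timed_stage and the log parameter are assumptions, and only the success message mirrors the log line shown above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_stage(name, log=print):
    """Log how long the wrapped stage took, on success and on failure."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        elapsed = time.perf_counter() - start
        log(f"[TIMING] {name} stage failed after {elapsed:.2f}s")
        raise
    elapsed = time.perf_counter() - start
    log(f"[TIMING] {name} stage completed in {elapsed:.2f}s")


if __name__ == "__main__":
    with timed_stage("diagnostics"):
        time.sleep(0.05)  # stand-in for the real stage work
```

Wrapping each stage call this way means a silent terminal still produces one line per boundary, which is enough to tell slow from dead.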

At that point the problem stopped feeling random. One part of the system was repeatedly doing expensive work. So I followed the runtime cost from stage to module to function to loop, until the slowdown had a specific address.

That narrowing led to write_missing_outputs_csv, which accounted for nearly the entire diagnostics runtime.

The Bottleneck

Tracing deeper into that function showed repeated calls to extract_pdf_context(...) inside the row-level loop. Following the call chain into quality_utils.py confirmed what was happening: the function was calling pdfplumber.open(pdf_path), iterating through every page, and rebuilding the extracted text from scratch, then doing the exact same thing on the next row.

Before changing anything, I added a counter to verify the scale of it directly:

extract_pdf_context called: 82

Then the math:

~3081 seconds total / 82 calls ≈ 37.6 seconds per call

At that point the issue stopped being a hypothesis. I knew exactly how to reason about a fix.

The Structural Refactor

The fix was about changing the shape of the system so the work couldn’t repeat itself.

The workload changed from:

rows × (open PDF + parse all pages)

to:

(open PDF + parse all pages once) + rows × (string search)

In practice: a new function extract_full_pdf_text() opens and parses the PDF one time before the loop and returns the full text as a string. A second function extract_pdf_context_from_text() takes that cached string and does a lightweight search against it, with no file I/O and no page iteration. Inside the loop, only the search runs. I applied the same refactor pattern across both QA and diagnostics stages, then removed the temporary profiling scaffolding and kept the durable timing instrumentation that was still useful operationally.
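The shape of that refactor can be sketched as follows. The two function names come from the description above, but their parameters, the window size, and the search behavior are assumptions, and contexts_for_rows is a hypothetical helper added only to show the loop.

```python
def extract_full_pdf_text(pdf_path):
    """Open and parse the PDF once; return all page text joined by newlines."""
    import pdfplumber  # third-party; imported here so the search helper runs without it

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for a page with no text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_pdf_context_from_text(full_text, needle, window=200):
    """Return the text around the first occurrence of needle, or '' if absent."""
    pos = full_text.find(needle)
    if pos == -1:
        return ""
    start = max(0, pos - window)
    end = min(len(full_text), pos + len(needle) + window)
    return full_text[start:end]


def contexts_for_rows(pdf_path, needles):
    """Parse once before the loop, then run only string searches per row."""
    full_text = extract_full_pdf_text(pdf_path)
    return [extract_pdf_context_from_text(full_text, n) for n in needles]
```

The design choice is that the expensive call sits outside the loop by construction, so adding more rows adds only string searches, never more PDF parses.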

First Clean Full Run

Up until this point, one of the hardest parts of debugging the pipeline was not being able to tell whether it was actually finishing at all. Long stretches of silence made the runtime feel ambiguous. It looked dead, stalled, or half-working. I couldn’t fully trust what I was seeing.

Then I finally got a clean full run. The important part wasn’t just that the pipeline completed. It was that the system explained itself clearly at the end:

811 total inputs
770 outputs
0 missing inputs
41 missing outputs
94.94% coverage

The SQLite load completed with 811 rows. The database-side missing outputs count matched what the diagnostics and CSV-side reporting were showing. Earlier in the investigation, different artifacts often contradicted each other and created more ambiguity. Now the outputs were reinforcing each other.
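That kind of cross-check can be automated. The sketch below recomputes the summary from the two raw counts and then compares the same metric across sources. Defining coverage as outputs divided by total inputs is an assumption, although it does reproduce the reported 94.94%. The function names and the source labels are hypothetical.

```python
def coverage_summary(total_inputs, outputs):
    """Derive missing outputs and coverage from the two raw counts."""
    coverage_pct = (outputs / total_inputs * 100) if total_inputs else 0.0
    return {
        "total_inputs": total_inputs,
        "outputs": outputs,
        "missing_outputs": total_inputs - outputs,
        "coverage_pct": round(coverage_pct, 2),
    }


def disagreements(reports):
    """Return the metrics whose values differ between sources.

    `reports` maps a source name to a dict of metric -> value.
    """
    metrics = sorted({m for report in reports.values() for m in report})
    conflicts = {}
    for metric in metrics:
        values = {source: report.get(metric) for source, report in reports.items()}
        if len(set(values.values())) > 1:
            conflicts[metric] = values
    return conflicts


if __name__ == "__main__":
    print(coverage_summary(811, 770))
    # Hypothetical per-source reports built from the numbers in the run summary.
    print(disagreements({
        "sqlite": {"rows": 811, "missing_outputs": 41},
        "diagnostics_csv": {"rows": 811, "missing_outputs": 41},
    }))
```

An empty result from disagreements is the machine-checkable version of "the outputs were reinforcing each other".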

The full run took around 470 seconds, just under eight minutes. That reframed a lot of the earlier fear around the pipeline hanging. Now I had a runtime I could measure and reason about.

The pipeline was no longer a broken system somewhere in the middle of execution. It had become a system that could complete end-to-end, validate itself, load the database successfully, and expose the remaining problems as specific issues instead of vague uncertainty.

What Changed

Before this investigation, debugging felt mostly reactive. Change something, rerun it, hope the behavior improves. This process introduced a different sequence: wait long enough for the system to reveal itself, instrument at the boundaries, follow the runtime cost by layers, verify assumptions with concrete measurements, then refactor the structure rather than the logic.

The lack of observability wasn’t a side issue. It was half the problem. Once timing existed at stage boundaries, the pipeline stopped feeling like a black box and started feeling like something I could reason about directly.
