<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Anjankumar LN</title>
    <description>The latest articles on Forem by Anjankumar LN (@anjankumar_ln_41a980a9fd).</description>
    <link>https://forem.com/anjankumar_ln_41a980a9fd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3840577%2F285a1f2c-5f12-45f3-8fa6-5a3ce941142d.jpg</url>
      <title>Forem: Anjankumar LN</title>
      <link>https://forem.com/anjankumar_ln_41a980a9fd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/anjankumar_ln_41a980a9fd"/>
    <language>en</language>
    <item>
      <title>I tuned Hindsight for long conversations</title>
      <dc:creator>Anjankumar LN</dc:creator>
      <pubDate>Mon, 23 Mar 2026 17:34:30 +0000</pubDate>
      <link>https://forem.com/anjankumar_ln_41a980a9fd/i-tuned-hindsight-for-long-conversations-46k4</link>
      <guid>https://forem.com/anjankumar_ln_41a980a9fd/i-tuned-hindsight-for-long-conversations-46k4</guid>
      <description>&lt;p&gt;Last night I wondered if an agent could learn what not to recommend in a job-matching pipeline; by morning, ours was blacklisting patterns it had only seen fail once—and getting better every run.&lt;br&gt;
job sense ai&lt;/p&gt;

&lt;p&gt;Last night I wondered if an agent could learn what not to recommend in a job-matching pipeline; by morning, ours was blacklisting patterns it had only seen fail once—and getting better every run.&lt;/p&gt;

&lt;p&gt;What I built here isn’t another resume–job similarity tool. It’s a loop: ingest resumes and job descriptions, generate matches, evaluate those matches, and then feed the failures back into the system so it stops making the same mistake twice.&lt;/p&gt;

&lt;p&gt;At a high level, the repo is pretty simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;main.py wires the pipeline together&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;matcher/ handles embedding + similarity scoring&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;agent/ wraps the LLM logic (ranking, reasoning, critique)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;memory/ is where Hindsight comes in&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;evaluation/ defines what “bad recommendation” actually means&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting part isn’t matching. It’s what happens after a bad match.&lt;/p&gt;




&lt;p&gt;The thing I got wrong about “agent memory”&lt;/p&gt;

&lt;p&gt;My initial assumption was embarrassingly common: memory = more context.&lt;/p&gt;

&lt;p&gt;So I started with something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def rank_candidates(job, resumes):
    context = build_context(job, resumes)
    return llm.generate_rankings(context)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It worked fine for obvious matches. It completely fell apart on edge cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Overweighting keyword overlap (“Python” everywhere)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ignoring disqualifiers buried in experience&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recommending overqualified or irrelevant candidates&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My first instinct was to “improve prompts” and “add more examples.” That helped, but it didn’t stick. The same class of mistake kept coming back.&lt;/p&gt;

&lt;p&gt;What I actually needed was not better context—but persistent negative feedback.&lt;/p&gt;




&lt;p&gt;Turning mistakes into data (with Hindsight)&lt;/p&gt;

&lt;p&gt;The shift happened when I integrated Hindsight on GitHub.&lt;/p&gt;

&lt;p&gt;Instead of trying to prevent bad outputs upfront, I let the system fail—and then recorded why it failed.&lt;/p&gt;

&lt;p&gt;The core idea: every bad recommendation becomes a structured memory.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def record_failure(job_id, resume_id, reason):
    memory.store({
        "type": "negative_match",
        "job_id": job_id,
        "resume_id": resume_id,
        "reason": reason,
        "timestamp": now()
    })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This isn’t just logging. These records are indexed and retrieved later during ranking.&lt;/p&gt;

&lt;p&gt;If you haven’t seen it, the Hindsight documentation explains the retrieval model pretty well—but the real insight is what you choose to store.&lt;/p&gt;

&lt;p&gt;I don’t store full conversations. I store compressed lessons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;“Rejected: frontend-heavy profile for backend-only role”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Mismatch: required 5+ years, candidate has 1.5”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“False positive due to keyword overlap (React vs React Native backend tooling)”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That compression step matters more than anything else.&lt;/p&gt;
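
&lt;p&gt;To make the compression step concrete, here’s a minimal sketch. The helper name and failure categories are mine, not the repo’s or Hindsight’s; the point is only that a failure gets reduced to one retrievable sentence before it ever reaches memory.store().&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical helper: collapse a failed match into a one-line lesson.
# The fields on job, resume, and verdict are assumptions for illustration.
def compress_failure(job: dict, resume: dict, verdict: dict) -&gt; str:
    if verdict.get("kind") == "experience_gap":
        return (f"Mismatch: required {job['min_years']}+ years, "
                f"candidate has {resume['years']}")
    if verdict.get("kind") == "domain_mismatch":
        return (f"Rejected: {resume['primary_domain']}-heavy profile "
                f"for {job['primary_domain']}-only role")
    # Fall back to whatever short reason the evaluator already produced.
    return verdict.get("reason", "Unclassified bad recommendation")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;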




&lt;p&gt;Injecting “don’t do this again” into the agent&lt;/p&gt;

&lt;p&gt;Once failures are stored, the next step is using them during ranking.&lt;/p&gt;

&lt;p&gt;Here’s the shape I ended up with:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def rank_with_memory(job, resumes):
    past_failures = memory.retrieve(
        query=job.description,
        filter={"type": "negative_match"},
        top_k=5
    )

    context = build_context(job, resumes, past_failures)

    return llm.generate_rankings(context)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The important part is that past_failures are semantically retrieved. I’m not just filtering by job ID—I’m asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What past mistakes look similar to this job?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where something like the Vectorize agent memory layer becomes useful. You’re not building a database—you’re building a memory system that can generalize.&lt;/p&gt;
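
&lt;p&gt;If you want a mental model of what “semantically retrieved” means here, the sketch below does it with plain cosine similarity over stored embeddings. Hindsight and Vectorize handle this for you, so treat the function and field names as illustrative, not their actual APIs.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def retrieve_similar_failures(job_embedding, stored, top_k=5):
    # Return the stored negative-match memories closest to this job's embedding.
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scored = [
        (cosine(job_embedding, np.asarray(m["embedding"])), m)
        for m in stored
        if m["type"] == "negative_match"
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;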




&lt;p&gt;What surprised me: the agent started arguing with itself&lt;/p&gt;

&lt;p&gt;I didn’t expect this, but once failures were injected into context, the agent started doing something interesting:&lt;/p&gt;

&lt;p&gt;It began preemptively rejecting candidates before ranking them.&lt;/p&gt;

&lt;p&gt;I made that explicit by adding a critique step:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def critique_candidate(job, resume, failures):
    return llm.generate({
        "job": job,
        "resume": resume,
        "past_failures": failures,
        "task": "Should this candidate be rejected? Why?"
    })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then the pipeline became:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Generate candidate scores&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run critique step&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adjust ranking or drop candidates&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This effectively turned the agent into a two-pass system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Pass 1: “Who looks good?”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pass 2: “Why might this be wrong?”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That second pass is where most improvements came from.&lt;/p&gt;
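
&lt;p&gt;Here’s roughly how the two passes fit together. The return shapes are assumptions (the repo’s real ranking output will differ); this is the flow, not the implementation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch of the two-pass flow. Assumes pass 1 yields (resume, score) pairs
# and that the critique returns a dict with "reject" and "reason" keys.
def two_pass_rank(job, resumes, failures):
    ranked = rank_with_memory(job, resumes)   # Pass 1: who looks good?

    survivors = []
    for resume, score in ranked:              # Pass 2: why might this be wrong?
        verdict = critique_candidate(job, resume, failures)
        if verdict.get("reject"):
            record_failure(job.id, resume.id, verdict.get("reason", ""))
            continue
        survivors.append((resume, score))

    return survivors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;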




&lt;p&gt;The blacklist isn’t static—and that’s the whole point&lt;/p&gt;

&lt;p&gt;At some point, I considered just building a rule engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If experience &amp;lt; required → reject&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If domain mismatch → penalize&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If keywords mismatch → drop&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But that quickly turns into a brittle mess.&lt;/p&gt;

&lt;p&gt;Instead, I let the blacklist emerge dynamically from failures.&lt;/p&gt;

&lt;p&gt;A typical memory entry looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "type": "negative_match",
  "embedding": "...",
  "reason": "Candidate has strong frontend experience but no backend systems exposure",
  "job_features": ["backend", "distributed systems"],
  "timestamp": 1710000000
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Over time, the system builds a soft blacklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Not explicit rules&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Not hard filters&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;But patterns the agent learns to avoid&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the part that actually feels like “learning,” even though it’s just retrieval + prompting.&lt;/p&gt;
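
&lt;p&gt;To show what “retrieval + prompting” means in practice, here is a hypothetical context builder that inlines the retrieved reasons as explicit guidance. The function name, prompt wording, and field names are mine; the repo’s real build_context will look different.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical context builder: surface the soft blacklist as plain text
# the ranking prompt can act on. Field names are assumptions for illustration.
def build_ranking_context(job, resumes, past_failures=None):
    lines = [f"Job description:\n{job['description']}", "", "Candidates:"]
    lines += [f"- {r['id']}: {r['summary']}" for r in resumes]
    if past_failures:
        lines.append("")
        lines.append("Patterns that produced bad recommendations before (avoid repeating them):")
        lines += [f"- {m['reason']}" for m in past_failures]
    return "\n".join(lines)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;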




&lt;p&gt;A concrete before/after&lt;/p&gt;

&lt;p&gt;Before adding Hindsight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Job: Backend Engineer (Go, distributed systems)&lt;br&gt;
Candidate: React-heavy frontend dev with some Node.js&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system ranked this candidate #2.&lt;/p&gt;

&lt;p&gt;Why? Keyword overlap: “JavaScript”, “APIs”, “microservices”.&lt;/p&gt;

&lt;p&gt;After adding failure memory:&lt;/p&gt;

&lt;p&gt;A previous failure stored:&lt;br&gt;
“Frontend-heavy profile incorrectly matched to backend system role”&lt;/p&gt;

&lt;p&gt;Now the same candidate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Gets flagged in the critique step&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Drops to #7 or is removed entirely&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing in the model changed. Just the memory.&lt;/p&gt;




&lt;p&gt;Where the design broke (and what I changed)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Storing too much detail&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At first, I stored entire LLM outputs as memory.&lt;/p&gt;

&lt;p&gt;Bad idea.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Retrieval got noisy&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Context bloated&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Signal got diluted&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix: store one-line reasons, not transcripts.&lt;/p&gt;




&lt;ol start="2"&gt;
&lt;li&gt;Over-retrieval&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I tried feeding 10–15 past failures into context.&lt;/p&gt;

&lt;p&gt;That made the agent overly conservative—it started rejecting everything.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;past_failures = memory.retrieve(..., top_k=15)  # too much
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;past_failures = memory.retrieve(..., top_k=3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Less context, sharper signal.&lt;/p&gt;




&lt;ol start="3"&gt;
&lt;li&gt;No feedback loop validation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Initially, every failure was treated equally.&lt;/p&gt;

&lt;p&gt;But some “failures” were actually debatable (e.g., borderline candidates).&lt;/p&gt;

&lt;p&gt;Fix: I added a lightweight scoring layer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def validate_failure(reason):
    return llm.generate({
        "task": "Is this a valid rejection reason?",
        "reason": reason
    })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Only high-confidence failures get stored.&lt;/p&gt;
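
&lt;p&gt;A sketch of what that gate can look like, assuming the validator returns a confidence score. The "valid"/"confidence" fields and the 0.8 threshold are assumptions, not the repo’s actual output.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Only persist failures the validator is confident about.
# The verdict fields and threshold are assumptions for illustration.
def maybe_record_failure(job_id, resume_id, reason, threshold=0.8):
    verdict = validate_failure(reason)
    if verdict.get("valid") and verdict.get("confidence", 0.0) &gt;= threshold:
        record_failure(job_id, resume_id, reason)
        return True
    return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;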




&lt;p&gt;What this system actually feels like to use&lt;/p&gt;

&lt;p&gt;It doesn’t feel like a smarter model.&lt;/p&gt;

&lt;p&gt;It feels like a model that remembers being wrong.&lt;/p&gt;

&lt;p&gt;That’s a subtle but important difference.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It still makes mistakes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;But it rarely repeats the same mistake&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And when it does, it’s usually because retrieval missed something&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The behavior is closer to a junior engineer who keeps notes on what went wrong last time.&lt;/p&gt;




&lt;p&gt;Lessons I’d carry forward&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Memory is about compression, not accumulation&lt;br&gt;
Storing everything is useless. Storing the right abstraction of failure is what matters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Negative examples are more valuable than positive ones&lt;br&gt;
“Don’t do this again” shaped behavior more than “do this well.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Retrieval quality &amp;gt; model quality&lt;br&gt;
A slightly worse model with good failure retrieval outperformed a better model with none.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Two-pass systems are underrated&lt;br&gt;
Generate → critique is dramatically more stable than single-pass ranking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Don’t build rules when you can build feedback loops&lt;br&gt;
Static rules age poorly. Feedback systems adapt.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;If I were to rebuild this&lt;/p&gt;

&lt;p&gt;I’d double down on the memory layer earlier.&lt;/p&gt;

&lt;p&gt;Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A better schema for failure types&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explicit clustering of similar mistakes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Decay or pruning of outdated memories&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Right now it works—but it’s still naive.&lt;/p&gt;
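
&lt;p&gt;For the pruning piece specifically, even a crude time-based policy would be a step up. Something like the sketch below, assuming memories carry the Unix timestamp shown earlier; Hindsight may offer its own retention controls, so treat this as a placeholder.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

# Placeholder decay policy: drop negative-match memories older than the
# retention window. Assumes each memory carries the "timestamp" field.
def prune_memories(memories, max_age_days=90):
    cutoff = time.time() - max_age_days * 86400
    return [m for m in memories if m["timestamp"] &gt;= cutoff]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;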




&lt;p&gt;Closing thought&lt;/p&gt;

&lt;p&gt;If you’re building anything that ranks, recommends, or decides—don’t just ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How do I make it better?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How do I make it remember being wrong?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That one shift changed this project from a brittle matcher into something that actually improves over time.&lt;/p&gt;

&lt;p&gt;And if you want to go deeper into how this kind of memory layer works, it’s worth exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The Hindsight GitHub repository&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Hindsight documentation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The agent memory approach from Vectorize&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The implementation details matter—but the bigger idea is simple:&lt;/p&gt;

&lt;p&gt;Don’t just build systems that predict.&lt;/p&gt;

&lt;p&gt;Build systems that regret—and remember why.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
