<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Anjankumar LN</title>
    <description>The latest articles on Forem by Anjankumar LN (@anjankumar_ln_41a980a9fd).</description>
    <link>https://forem.com/anjankumar_ln_41a980a9fd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3840577%2F285a1f2c-5f12-45f3-8fa6-5a3ce941142d.jpg</url>
      <title>Forem: Anjankumar LN</title>
      <link>https://forem.com/anjankumar_ln_41a980a9fd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/anjankumar_ln_41a980a9fd"/>
    <language>en</language>
    <item>
      <title>I tuned Hindsight for long conversations</title>
      <dc:creator>Anjankumar LN</dc:creator>
      <pubDate>Mon, 23 Mar 2026 17:34:30 +0000</pubDate>
      <link>https://forem.com/anjankumar_ln_41a980a9fd/i-tuned-hindsight-for-long-conversations-46k4</link>
      <guid>https://forem.com/anjankumar_ln_41a980a9fd/i-tuned-hindsight-for-long-conversations-46k4</guid>
      <description>&lt;p&gt;Last night I wondered if an agent could learn what not to recommend in a job-matching pipeline; by morning, ours was blacklisting patterns it had only seen fail once—and getting better every run.&lt;br&gt;
job sense ai&lt;/p&gt;

&lt;p&gt;Last night I wondered if an agent could learn what not to recommend in a job-matching pipeline; by morning, ours was blacklisting patterns it had only seen fail once—and getting better every run.&lt;/p&gt;

&lt;p&gt;What I built here isn’t another resume–job similarity tool. It’s a loop: ingest resumes and job descriptions, generate matches, evaluate those matches, and then feed the failures back into the system so it stops making the same mistake twice.&lt;/p&gt;

&lt;p&gt;At a high level, the repo is pretty simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;main.py wires the pipeline together&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;matcher/ handles embedding + similarity scoring&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;agent/ wraps the LLM logic (ranking, reasoning, critique)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;memory/ is where Hindsight comes in&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;evaluation/ defines what “bad recommendation” actually means&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting part isn’t matching. It’s what happens after a bad match.&lt;/p&gt;




&lt;p&gt;The thing I got wrong about “agent memory”&lt;/p&gt;

&lt;p&gt;My initial assumption was embarrassingly common: memory = more context.&lt;/p&gt;

&lt;p&gt;So I started with something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def rank_candidates(job, resumes):
    context = build_context(job, resumes)
    return llm.generate_rankings(context)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It worked fine for obvious matches. It completely fell apart on edge cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Overweighting keyword overlap (“Python” everywhere)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ignoring disqualifiers buried in experience&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recommending overqualified or irrelevant candidates&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My first instinct was to “improve prompts” and “add more examples.” That helped, but it didn’t stick. The same class of mistake kept coming back.&lt;/p&gt;

&lt;p&gt;What I actually needed was not better context—but persistent negative feedback.&lt;/p&gt;




&lt;p&gt;Turning mistakes into data (with Hindsight)&lt;/p&gt;

&lt;p&gt;The shift happened when I integrated Hindsight on GitHub.&lt;/p&gt;

&lt;p&gt;Instead of trying to prevent bad outputs upfront, I let the system fail—and then recorded why it failed.&lt;/p&gt;

&lt;p&gt;The core idea: every bad recommendation becomes a structured memory.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def record_failure(job_id, resume_id, reason):
    memory.store({
        "type": "negative_match",
        "job_id": job_id,
        "resume_id": resume_id,
        "reason": reason,
        "timestamp": now()
    })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This isn’t just logging. These records are indexed and retrieved later during ranking.&lt;/p&gt;

&lt;p&gt;If you haven’t seen it, the Hindsight documentation explains the retrieval model pretty well—but the real insight is what you choose to store.&lt;/p&gt;

&lt;p&gt;I don’t store full conversations. I store compressed lessons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;“Rejected: frontend-heavy profile for backend-only role”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Mismatch: required 5+ years, candidate has 1.5”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“False positive due to keyword overlap (React vs React Native backend tooling)”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That compression step matters more than anything else.&lt;/p&gt;
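
&lt;p&gt;To make the compression step concrete, here’s a minimal sketch. The helper name and failure categories are mine, not the repo’s or Hindsight’s; the point is only that a failure gets reduced to one retrievable sentence before it ever reaches memory.store().&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical helper: collapse a failed match into a one-line lesson.
# The fields on job, resume, and verdict are assumptions for illustration.
def compress_failure(job: dict, resume: dict, verdict: dict) -&gt; str:
    if verdict.get("kind") == "experience_gap":
        return (f"Mismatch: required {job['min_years']}+ years, "
                f"candidate has {resume['years']}")
    if verdict.get("kind") == "domain_mismatch":
        return (f"Rejected: {resume['primary_domain']}-heavy profile "
                f"for {job['primary_domain']}-only role")
    # Fall back to whatever short reason the evaluator already produced.
    return verdict.get("reason", "Unclassified bad recommendation")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;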




&lt;p&gt;Injecting “don’t do this again” into the agent&lt;/p&gt;

&lt;p&gt;Once failures are stored, the next step is using them during ranking.&lt;/p&gt;

&lt;p&gt;Here’s the shape I ended up with:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def rank_with_memory(job, resumes):
    past_failures = memory.retrieve(
        query=job.description,
        filter={"type": "negative_match"},
        top_k=5
    )

    context = build_context(job, resumes, past_failures)

    return llm.generate_rankings(context)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The important part is that past_failures are semantically retrieved. I’m not just filtering by job ID—I’m asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What past mistakes look similar to this job?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where something like the Vectorize agent memory layer becomes useful. You’re not building a database—you’re building a memory system that can generalize.&lt;/p&gt;
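
&lt;p&gt;If you want a mental model of what “semantically retrieved” means here, the sketch below does it with plain cosine similarity over stored embeddings. Hindsight and Vectorize handle this for you, so treat the function and field names as illustrative, not their actual APIs.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def retrieve_similar_failures(job_embedding, stored, top_k=5):
    # Return the stored negative-match memories closest to this job's embedding.
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scored = [
        (cosine(job_embedding, np.asarray(m["embedding"])), m)
        for m in stored
        if m["type"] == "negative_match"
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;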




&lt;p&gt;What surprised me: the agent started arguing with itself&lt;/p&gt;

&lt;p&gt;I didn’t expect this, but once failures were injected into context, the agent started doing something interesting:&lt;/p&gt;

&lt;p&gt;It began preemptively rejecting candidates before ranking them.&lt;/p&gt;

&lt;p&gt;I made that explicit by adding a critique step:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def critique_candidate(job, resume, failures):
    return llm.generate({
        "job": job,
        "resume": resume,
        "past_failures": failures,
        "task": "Should this candidate be rejected? Why?"
    })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then the pipeline became:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Generate candidate scores&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run critique step&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adjust ranking or drop candidates&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This effectively turned the agent into a two-pass system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Pass 1: “Who looks good?”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pass 2: “Why might this be wrong?”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That second pass is where most improvements came from.&lt;/p&gt;
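
&lt;p&gt;Here’s roughly how the two passes fit together. The return shapes are assumptions (the repo’s real ranking output will differ); this is the flow, not the implementation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch of the two-pass flow. Assumes pass 1 yields (resume, score) pairs
# and that the critique returns a dict with "reject" and "reason" keys.
def two_pass_rank(job, resumes, failures):
    ranked = rank_with_memory(job, resumes)   # Pass 1: who looks good?

    survivors = []
    for resume, score in ranked:              # Pass 2: why might this be wrong?
        verdict = critique_candidate(job, resume, failures)
        if verdict.get("reject"):
            record_failure(job.id, resume.id, verdict.get("reason", ""))
            continue
        survivors.append((resume, score))

    return survivors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;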




&lt;p&gt;The blacklist isn’t static—and that’s the whole point&lt;/p&gt;

&lt;p&gt;At some point, I considered just building a rule engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If experience &amp;lt; required → reject&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If domain mismatch → penalize&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If keywords mismatch → drop&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But that quickly turns into a brittle mess.&lt;/p&gt;

&lt;p&gt;Instead, I let the blacklist emerge dynamically from failures.&lt;/p&gt;

&lt;p&gt;A typical memory entry looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "type": "negative_match",
  "embedding": "...",
  "reason": "Candidate has strong frontend experience but no backend systems exposure",
  "job_features": ["backend", "distributed systems"],
  "timestamp": 1710000000
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Over time, the system builds a soft blacklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Not explicit rules&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Not hard filters&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;But patterns the agent learns to avoid&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the part that actually feels like “learning,” even though it’s just retrieval + prompting.&lt;/p&gt;
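
&lt;p&gt;To show what “retrieval + prompting” means in practice, here is a hypothetical context builder that inlines the retrieved reasons as explicit guidance. The function name, prompt wording, and field names are mine; the repo’s real build_context will look different.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical context builder: surface the soft blacklist as plain text
# the ranking prompt can act on. Field names are assumptions for illustration.
def build_ranking_context(job, resumes, past_failures=None):
    lines = [f"Job description:\n{job['description']}", "", "Candidates:"]
    lines += [f"- {r['id']}: {r['summary']}" for r in resumes]
    if past_failures:
        lines.append("")
        lines.append("Patterns that produced bad recommendations before (avoid repeating them):")
        lines += [f"- {m['reason']}" for m in past_failures]
    return "\n".join(lines)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;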




&lt;p&gt;A concrete before/after&lt;/p&gt;

&lt;p&gt;Before adding Hindsight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Job: Backend Engineer (Go, distributed systems)&lt;br&gt;
Candidate: React-heavy frontend dev with some Node.js&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system ranked this candidate #2.&lt;/p&gt;

&lt;p&gt;Why? Keyword overlap: “JavaScript”, “APIs”, “microservices”.&lt;/p&gt;

&lt;p&gt;After adding failure memory:&lt;/p&gt;

&lt;p&gt;A previous failure stored:&lt;br&gt;
“Frontend-heavy profile incorrectly matched to backend system role”&lt;/p&gt;

&lt;p&gt;Now the same candidate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Gets flagged in the critique step&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Drops to #7 or is removed entirely&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing in the model changed. Just the memory.&lt;/p&gt;




&lt;p&gt;Where the design broke (and what I changed)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Storing too much detail&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At first, I stored entire LLM outputs as memory.&lt;/p&gt;

&lt;p&gt;Bad idea.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Retrieval got noisy&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Context bloated&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Signal got diluted&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix: store one-line reasons, not transcripts.&lt;/p&gt;




&lt;ol start="2"&gt;
&lt;li&gt;Over-retrieval&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I tried feeding 10–15 past failures into context.&lt;/p&gt;

&lt;p&gt;That made the agent overly conservative—it started rejecting everything.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;past_failures = memory.retrieve(..., top_k=15)  # too much
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;past_failures = memory.retrieve(..., top_k=3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Less context, sharper signal.&lt;/p&gt;




&lt;ol start="3"&gt;
&lt;li&gt;No feedback loop validation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Initially, every failure was treated equally.&lt;/p&gt;

&lt;p&gt;But some “failures” were actually debatable (e.g., borderline candidates).&lt;/p&gt;

&lt;p&gt;Fix: I added a lightweight scoring layer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def validate_failure(reason):
    return llm.generate({
        "task": "Is this a valid rejection reason?",
        "reason": reason
    })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Only high-confidence failures get stored.&lt;/p&gt;
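
&lt;p&gt;A sketch of what that gate can look like, assuming the validator returns a confidence score. The "valid"/"confidence" fields and the 0.8 threshold are assumptions, not the repo’s actual output.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Only persist failures the validator is confident about.
# The verdict fields and threshold are assumptions for illustration.
def maybe_record_failure(job_id, resume_id, reason, threshold=0.8):
    verdict = validate_failure(reason)
    if verdict.get("valid") and verdict.get("confidence", 0.0) &gt;= threshold:
        record_failure(job_id, resume_id, reason)
        return True
    return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;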




&lt;p&gt;What this system actually feels like to use&lt;/p&gt;

&lt;p&gt;It doesn’t feel like a smarter model.&lt;/p&gt;

&lt;p&gt;It feels like a model that remembers being wrong.&lt;/p&gt;

&lt;p&gt;That’s a subtle but important difference.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It still makes mistakes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;But it rarely repeats the same mistake&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And when it does, it’s usually because retrieval missed something&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The behavior is closer to a junior engineer who keeps notes on what went wrong last time.&lt;/p&gt;




&lt;p&gt;Lessons I’d carry forward&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Memory is about compression, not accumulation&lt;br&gt;
Storing everything is useless. Storing the right abstraction of failure is what matters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Negative examples are more valuable than positive ones&lt;br&gt;
“Don’t do this again” shaped behavior more than “do this well.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Retrieval quality &amp;gt; model quality&lt;br&gt;
A slightly worse model with good failure retrieval outperformed a better model with none.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Two-pass systems are underrated&lt;br&gt;
Generate → critique is dramatically more stable than single-pass ranking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Don’t build rules when you can build feedback loops&lt;br&gt;
Static rules age poorly. Feedback systems adapt.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;If I were to rebuild this&lt;/p&gt;

&lt;p&gt;I’d double down on the memory layer earlier.&lt;/p&gt;

&lt;p&gt;Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A better schema for failure types&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explicit clustering of similar mistakes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Decay or pruning of outdated memories&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Right now it works—but it’s still naive.&lt;/p&gt;
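
&lt;p&gt;For the pruning piece specifically, even a crude time-based policy would be a step up. Something like the sketch below, assuming memories carry the Unix timestamp shown earlier; Hindsight may offer its own retention controls, so treat this as a placeholder.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

# Placeholder decay policy: drop negative-match memories older than the
# retention window. Assumes each memory carries the "timestamp" field.
def prune_memories(memories, max_age_days=90):
    cutoff = time.time() - max_age_days * 86400
    return [m for m in memories if m["timestamp"] &gt;= cutoff]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;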




&lt;p&gt;Closing thought&lt;/p&gt;

&lt;p&gt;If you’re building anything that ranks, recommends, or decides—don’t just ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How do I make it better?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How do I make it remember being wrong?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That one shift changed this project from a brittle matcher into something that actually improves over time.&lt;/p&gt;

&lt;p&gt;And if you want to go deeper into how this kind of memory layer works, it’s worth exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The Hindsight GitHub repository&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Hindsight documentation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The agent memory approach from Vectorize&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The implementation details matter—but the bigger idea is simple:&lt;/p&gt;

&lt;p&gt;Don’t just build systems that predict.&lt;/p&gt;

&lt;p&gt;Build systems that regret—and remember why.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
