<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: vijaya kumari</title>
    <description>The latest articles on Forem by vijaya kumari (@vijaya_kumari_ed7a2c37a7e).</description>
    <link>https://forem.com/vijaya_kumari_ed7a2c37a7e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3945026%2F54678158-4fef-46a2-bf25-9d161df70845.jpg</url>
      <title>Forem: vijaya kumari</title>
      <link>https://forem.com/vijaya_kumari_ed7a2c37a7e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/vijaya_kumari_ed7a2c37a7e"/>
    <language>en</language>
    <item>
      <title>How I Used Claude to Finish Building an AI That Evaluates AI — and Caught It Hallucinating</title>
      <dc:creator>vijaya kumari</dc:creator>
      <pubDate>Sat, 23 May 2026 02:47:43 +0000</pubDate>
      <link>https://forem.com/vijaya_kumari_ed7a2c37a7e/how-i-used-claude-to-finish-building-an-ai-that-evaluates-ai-and-caught-it-hallucinating-4e6</link>
      <guid>https://forem.com/vijaya_kumari_ed7a2c37a7e/how-i-used-claude-to-finish-building-an-ai-that-evaluates-ai-and-caught-it-hallucinating-4e6</guid>
      <description>&lt;h2&gt;
  
  
  The Project I Started But Never Finished
&lt;/h2&gt;

&lt;p&gt;Earlier this year I started building &lt;strong&gt;ai-qe-agent&lt;/strong&gt; — &lt;br&gt;
a multi-agent system that auto-generates QA test cases &lt;br&gt;
using Claude (Anthropic's AI).&lt;/p&gt;

&lt;p&gt;8 specialized agents. TypeScript. Direct Anthropic SDK.&lt;/p&gt;

&lt;p&gt;It worked. But it had a critical problem:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No visibility into whether the outputs were actually correct.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agents were generating test cases, reviewing them, &lt;br&gt;
converting them to Playwright scripts — and I had &lt;br&gt;
no idea if Claude was hallucinating, truncating, &lt;br&gt;
or silently failing between agents.&lt;/p&gt;

&lt;p&gt;That's what I set out to finish.&lt;/p&gt;








&lt;h2&gt;
  
  
  How Claude Helped Me Finish It
&lt;/h2&gt;

&lt;p&gt;I used &lt;strong&gt;Claude&lt;/strong&gt; (via Claude Code) as my primary &lt;br&gt;
AI coding assistant throughout this project.&lt;/p&gt;

&lt;p&gt;Claude helped me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design the LLM-as-Judge eval architecture&lt;/li&gt;
&lt;li&gt;Generate eval_suite.py from scratch&lt;/li&gt;
&lt;li&gt;Debug LangSmith tracing integration&lt;/li&gt;
&lt;li&gt;Build the TruLens monitoring setup&lt;/li&gt;
&lt;li&gt;Create the Fintech AI Agent Gradio app&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The meta-irony: I used Claude to build a system &lt;br&gt;
that evaluates Claude's own outputs.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Finished
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Custom LLM Eval Suite
&lt;/h3&gt;

&lt;p&gt;I built eval_suite.py with the LLM-as-Judge pattern, &lt;br&gt;
in which Claude evaluates Claude's own outputs across 4 dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Completeness&lt;/strong&gt; — did the agent complete the full task?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specificity&lt;/strong&gt; — were outputs precise and detailed?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faithfulness&lt;/strong&gt; — did the agent follow all instructions?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination detection&lt;/strong&gt; — did it invent facts not in context?&lt;/li&gt;
&lt;/ul&gt;
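
&lt;p&gt;A minimal sketch of how the four dimensions above could be requested from a judge model and validated. This is not the actual eval_suite.py: the prompt wording, the &lt;code&gt;JudgeScore&lt;/code&gt; class, and the JSON key names are assumptions for illustration, and the API call itself is left out.&lt;/p&gt;

```python
import json
from dataclasses import dataclass

# Hypothetical key names; the real eval_suite.py may use different ones.
SCORE_KEYS = ("completeness", "specificity", "faithfulness")
FLAG_KEY = "hallucination"


@dataclass
class JudgeScore:
    completeness: float
    specificity: float
    faithfulness: float
    hallucination: bool  # True when the judge reports invented facts


def build_judge_prompt(task: str, context: str, output: str) -> str:
    """Assemble an illustrative judge prompt that asks for JSON-only scores."""
    keys = ", ".join(SCORE_KEYS + (FLAG_KEY,))
    return (
        "You are grading another model's output.\n"
        f"Task: {task}\nContext: {context}\nOutput: {output}\n"
        f"Reply with JSON only, using the keys: {keys}.\n"
        "Scores are floats from 0.0 to 1.0; hallucination is true or false."
    )


def parse_judge_reply(reply: str) -> JudgeScore:
    """Parse the judge's JSON reply, rejecting missing or out-of-range fields."""
    data = json.loads(reply)
    missing = [key for key in SCORE_KEYS + (FLAG_KEY,) if key not in data]
    if missing:
        raise ValueError(f"judge reply is missing keys: {missing}")
    scores = {}
    for key in SCORE_KEYS:
        value = float(data[key])
        if value > 1.0 or 0.0 > value:
            raise ValueError(f"{key} is outside 0.0 to 1.0: {value}")
        scores[key] = value
    if not isinstance(data[FLAG_KEY], bool):
        raise ValueError("hallucination must be true or false")
    return JudgeScore(hallucination=data[FLAG_KEY], **scores)
```

&lt;p&gt;Validating the reply before trusting it matters here, because the judge is itself an LLM and can return malformed or partial JSON.&lt;/p&gt;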

&lt;h3&gt;
  
  
  2. TruLens Monitoring Dashboard
&lt;/h3&gt;

&lt;p&gt;Real-time quality metrics across all 4 agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faithfulness scores&lt;/li&gt;
&lt;li&gt;Hallucination flags
&lt;/li&gt;
&lt;li&gt;Chain compatibility checks&lt;/li&gt;
&lt;li&gt;Quality score trends&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. LangSmith Production Tracing
&lt;/h3&gt;

&lt;p&gt;Every Claude API call is now traced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input prompt&lt;/li&gt;
&lt;li&gt;Output response&lt;/li&gt;
&lt;li&gt;Latency per agent&lt;/li&gt;
&lt;li&gt;Token usage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Pinecone Vector Store
&lt;/h3&gt;

&lt;p&gt;Semantic deduplication for test cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prevents duplicate test generation&lt;/li&gt;
&lt;li&gt;Cosine similarity of 0.85 or higher raises a HIGH OVERLAP flag&lt;/li&gt;
&lt;/ul&gt;
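
&lt;p&gt;The threshold check can be sketched in plain Python, without the Pinecone client. The embedding vectors, the function names, and the &lt;code&gt;"OK"&lt;/code&gt; label are assumptions for illustration; only the 0.85 threshold and the HIGH OVERLAP flag come from the project.&lt;/p&gt;

```python
import math

OVERLAP_THRESHOLD = 0.85  # threshold quoted above


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def overlap_flag(candidate: list[float], existing: list[list[float]]) -> str:
    """Compare a new test-case embedding against stored ones and flag near-duplicates."""
    best = max((cosine_similarity(candidate, vec) for vec in existing), default=0.0)
    return "HIGH OVERLAP" if best >= OVERLAP_THRESHOLD else "OK"


# Toy 2-dimensional vectors stand in for real embeddings.
print(overlap_flag([1.0, 0.0], [[1.0, 0.1], [0.0, 1.0]]))
```

&lt;p&gt;In a real setup the vector store would return the nearest neighbours and their scores, so only the threshold comparison would remain in application code.&lt;/p&gt;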

&lt;h3&gt;
  
  
  5. Fintech AI Agent (New HF Space)
&lt;/h3&gt;

&lt;p&gt;Live demo combining everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fraud detection with risk scoring (0-10)&lt;/li&gt;
&lt;li&gt;Compliance Q&amp;amp;A (KYC/AML/GDPR/SOX/PCI-DSS)&lt;/li&gt;
&lt;li&gt;AML risk report generation (6-section formal reports)&lt;/li&gt;
&lt;li&gt;Real-time eval dashboard&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Findings — What Claude Found About Itself
&lt;/h2&gt;

&lt;p&gt;Running the eval suite on my own pipeline revealed:&lt;/p&gt;

&lt;p&gt;🔴 &lt;strong&gt;2 hallucinations caught&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AutomationScriptGenerator invented &lt;code&gt;'Invalid credentials'&lt;/code&gt; &lt;br&gt;
as error text — never specified in the input context.&lt;br&gt;
SelfHealingAgent fabricated DOM selectors without a DOM.&lt;/p&gt;

&lt;p&gt;🔴 &lt;strong&gt;2 pipeline breaks found&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ManualTestGenerator output = bare array.&lt;br&gt;
QAReviewAgent expected a wrapped ManualTestSuite object.&lt;br&gt;
chain_compatibility = 0. Would silently fail in production.&lt;/p&gt;
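
&lt;p&gt;A shape check of this kind can be sketched as follows. The agents themselves are written in TypeScript, so this Python version is only an illustration, and the &lt;code&gt;suiteName&lt;/code&gt; and &lt;code&gt;testCases&lt;/code&gt; field names are hypothetical stand-ins for whatever ManualTestSuite actually contains.&lt;/p&gt;

```python
def chain_compatibility(payload: object) -> int:
    """Return 1 when the payload looks like a wrapped suite object, else 0."""
    if not isinstance(payload, dict):
        return 0  # a bare list (or anything else) cannot be consumed downstream
    cases = payload.get("testCases")  # hypothetical field name
    return 1 if isinstance(cases, list) else 0


def wrap_suite(cases: list, name: str = "generated") -> dict:
    """Wrap a bare list of test cases in the object shape the reviewer expects."""
    return {"suiteName": name, "testCases": cases}


bare_output = [{"id": "TC-1"}, {"id": "TC-2"}]
print(chain_compatibility(bare_output))
print(chain_compatibility(wrap_suite(bare_output)))
```

&lt;p&gt;Running a check like this between every pair of agents turns a silent mismatch into an explicit 0 that a dashboard can surface.&lt;/p&gt;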

&lt;p&gt;🔴 &lt;strong&gt;2 faithfulness failures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ManualTestGenerator generated 2 of 8 required test cases.&lt;br&gt;
Stopped with no error. No warning. Just silent truncation.&lt;/p&gt;
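
&lt;p&gt;Silent truncation like this is cheap to catch with a count check placed after the generator. A minimal sketch, assuming the required count is known up front; the function name is hypothetical:&lt;/p&gt;

```python
def check_case_count(cases: list, required: int) -> None:
    """Raise instead of silently accepting fewer test cases than requested."""
    if required > len(cases):
        raise ValueError(f"expected {required} test cases, got {len(cases)}")


try:
    check_case_count(["TC-1", "TC-2"], required=8)
except ValueError as err:
    print(f"truncation detected: {err}")
```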

&lt;p&gt;🟢 &lt;strong&gt;0.902 avg quality score&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AutomationScriptGenerator: 0.94&lt;br&gt;
SelfHealingAgent: 1.0 quality — but 0.0 faithfulness.&lt;br&gt;
Good output. Wrong process. Only eval catches this.&lt;/p&gt;








&lt;h2&gt;
  
  
  The Key Insight
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;AI systems fail silently.&lt;br&gt;
No errors. No warnings. No crashes.&lt;br&gt;
Just wrong outputs — shipped with confidence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is why LLM Evaluation Engineering exists.&lt;br&gt;
And why finishing this project mattered.&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;🤗 &lt;strong&gt;Fintech AI Agent (live):&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://huggingface.co/spaces/Vijayarv07/fintech-ai-agent" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/Vijayarv07/fintech-ai-agent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🤗 &lt;strong&gt;ai-qe-agent (live):&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://huggingface.co/spaces/Vijayarv07/ai-qe-agent" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/Vijayarv07/ai-qe-agent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⭐ &lt;strong&gt;GitHub:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/vijayarjun7/ai-qe-agent" rel="noopener noreferrer"&gt;https://github.com/vijayarjun7/ai-qe-agent&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Claude (claude-sonnet-4-20250514) — Anthropic&lt;/li&gt;
&lt;li&gt;Python + TypeScript&lt;/li&gt;
&lt;li&gt;TruLens (eval monitoring)&lt;/li&gt;
&lt;li&gt;LangSmith (production tracing)&lt;/li&gt;
&lt;li&gt;Pinecone (vector store)&lt;/li&gt;
&lt;li&gt;Gradio (HF Space UI)&lt;/li&gt;
&lt;li&gt;Playwright (automation)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built in public. Follow my journey: #BuildInPublic&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focm73h5o99z34z5vlb0w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focm73h5o99z34z5vlb0w.jpeg" alt=" " width="800" height="792"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>githubchallenge</category>
      <category>githubfinishupathon</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I Built a Fintech AI Agent That Detects Fraud and Generates AML Risk Reports — With Zero Hallucinations</title>
      <dc:creator>vijaya kumari</dc:creator>
      <pubDate>Fri, 22 May 2026 01:34:28 +0000</pubDate>
      <link>https://forem.com/vijaya_kumari_ed7a2c37a7e/how-i-built-a-fintech-ai-agent-that-detects-fraud-and-generates-aml-risk-reports-with-zero-4p1l</link>
      <guid>https://forem.com/vijaya_kumari_ed7a2c37a7e/how-i-built-a-fintech-ai-agent-that-detects-fraud-and-generates-aml-risk-reports-with-zero-4p1l</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;AI in fintech cannot afford to hallucinate.&lt;/p&gt;

&lt;p&gt;A fabricated regulation reference = legal liability.&lt;br&gt;
A missed fraud pattern = financial crime.&lt;br&gt;
A wrong compliance answer = regulatory penalty.&lt;/p&gt;

&lt;p&gt;Yet most AI demos ship without any eval layer.&lt;br&gt;
I built one.&lt;/p&gt;


&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A 4-tab Fintech AI Agent deployed on HuggingFace:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tab 1 — Fraud Detector&lt;/strong&gt;&lt;br&gt;
Analyzes transactions for fraud patterns.&lt;br&gt;
Returns: risk score (0-10), red flags, &lt;br&gt;
approve/review/reject recommendation.&lt;/p&gt;

&lt;p&gt;Test input:&lt;br&gt;
"Transfer $9,800 to Cayman Islands &lt;br&gt;
at 3:47am from unrecognized device"&lt;/p&gt;

&lt;p&gt;Result: 9/10 HIGH RISK → REJECT&lt;/p&gt;

&lt;p&gt;Red flags caught:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amount just below $10K CTR threshold (structuring)&lt;/li&gt;
&lt;li&gt;High-risk jurisdiction (Cayman Islands)&lt;/li&gt;
&lt;li&gt;Unusual transaction time (3:47am)&lt;/li&gt;
&lt;li&gt;New unrecognized device&lt;/li&gt;
&lt;/ul&gt;
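
&lt;p&gt;In the app these flags come from Claude. A deterministic cross-check for the same four signals might look like the sketch below. The 5 percent structuring margin, the night-hours window, the jurisdiction set, and all names are assumptions for illustration, not values from the app.&lt;/p&gt;

```python
from dataclasses import dataclass

CTR_THRESHOLD = 10_000                        # USD threshold named in the flag above
STRUCTURING_MARGIN = 0.05                     # hypothetical: within 5 percent below it
HIGH_RISK_JURISDICTIONS = {"Cayman Islands"}  # hypothetical list, taken from the example
NIGHT_HOURS = range(0, 5)                     # hypothetical: midnight through 4:59am


@dataclass
class Transaction:
    amount_usd: float
    destination: str
    hour: int                # 0-23, local time
    device_recognized: bool


def red_flags(tx: Transaction) -> list[str]:
    """Return rule-based red flags for one transaction."""
    flags = []
    floor = CTR_THRESHOLD * (1 - STRUCTURING_MARGIN)
    if CTR_THRESHOLD > tx.amount_usd >= floor:
        flags.append("structuring")
    if tx.destination in HIGH_RISK_JURISDICTIONS:
        flags.append("high_risk_jurisdiction")
    if tx.hour in NIGHT_HOURS:
        flags.append("unusual_time")
    if not tx.device_recognized:
        flags.append("unrecognized_device")
    return flags


print(red_flags(Transaction(9_800, "Cayman Islands", 3, False)))
```

&lt;p&gt;Comparing a rule-based list like this with the model's flags is one way to spot a red flag the model missed or invented.&lt;/p&gt;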

&lt;p&gt;&lt;strong&gt;Tab 2 — Compliance Q&amp;amp;A&lt;/strong&gt;&lt;br&gt;
RAG over hardcoded financial regulations:&lt;br&gt;
KYC, AML, GDPR, SOX, PCI-DSS&lt;/p&gt;

&lt;p&gt;Every answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cites specific regulation + section&lt;/li&gt;
&lt;li&gt;Shows confidence score&lt;/li&gt;
&lt;li&gt;Flags hallucination risk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tab 3 — AML Risk Report Generator&lt;/strong&gt;&lt;br&gt;
Generates formal 6-section risk assessments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer Risk Profile&lt;/li&gt;
&lt;li&gt;Transaction Pattern Analysis&lt;/li&gt;
&lt;li&gt;Red Flags Identified&lt;/li&gt;
&lt;li&gt;Regulatory Considerations&lt;/li&gt;
&lt;li&gt;Recommended Actions&lt;/li&gt;
&lt;li&gt;Compliance Officer Notes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tab 4 — Eval Dashboard&lt;/strong&gt;&lt;br&gt;
Real-time metrics across all tabs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total queries processed&lt;/li&gt;
&lt;li&gt;Avg quality score&lt;/li&gt;
&lt;li&gt;Hallucinations flagged&lt;/li&gt;
&lt;li&gt;Risk alerts triggered&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  The Eval Layer
&lt;/h2&gt;

&lt;p&gt;Every Claude output is scored on three fields:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;faithfulness_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination_risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LOW&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
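
&lt;p&gt;One way to act on a result like this is a small gate that holds risky answers for human review. A sketch, assuming the three field names shown above; the 0.9 cut-off and the function name are hypothetical:&lt;/p&gt;

```python
MIN_FAITHFULNESS = 0.9  # hypothetical cut-off, not taken from the app


def needs_review(result: dict) -> bool:
    """Return True when an eval result should be held for human review."""
    low_faithfulness = MIN_FAITHFULNESS > float(result["faithfulness_score"])
    risky = result["hallucination_risk"] != "LOW"
    return low_faithfulness or risky


example = {
    "faithfulness_score": 0.95,
    "confidence": 0.85,
    "hallucination_risk": "LOW",
}
print(needs_review(example))
```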



&lt;p&gt;This is the LLM-as-Judge pattern —&lt;br&gt;
Claude evaluating Claude's own outputs.&lt;/p&gt;

&lt;p&gt;Results from first run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;98% avg quality score&lt;/li&gt;
&lt;li&gt;0 hallucinations detected&lt;/li&gt;
&lt;li&gt;Faithfulness: 95-100% per tab&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Claude (claude-sonnet-4-20250514)&lt;/li&gt;
&lt;li&gt;Pinecone (vector store + semantic dedup)&lt;/li&gt;
&lt;li&gt;LangSmith (production tracing)&lt;/li&gt;
&lt;li&gt;TruLens (eval monitoring dashboard)&lt;/li&gt;
&lt;li&gt;Gradio (HF Space UI)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Live Demo
&lt;/h2&gt;

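&lt;p&gt;&lt;a href="https://huggingface.co/spaces/Vijayarv07/fintech-ai-agent" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/Vijayarv07/fintech-ai-agent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub:&lt;br&gt;
&lt;a href="https://github.com/vijayarjun7" rel="noopener noreferrer"&gt;https://github.com/vijayarjun7&lt;/a&gt;&lt;/p&gt;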




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Adding quantum-inspired compression to the &lt;br&gt;
inference layer (QuantRot-PQC research).&lt;/p&gt;

&lt;p&gt;Because reliable AI + efficient AI = &lt;br&gt;
production-ready AI.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;#BuildInPublic&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7clczyj10v7z370ry64w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7clczyj10v7z370ry64w.jpeg" alt=" " width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>fintech</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
