<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Frank Brsrk </title>
    <description>The latest articles on Forem by Frank Brsrk  (@frank_brsrk).</description>
    <link>https://forem.com/frank_brsrk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3885887%2F309f7210-d679-4c7e-b6d4-8c2eb62450ab.png</url>
      <title>Forem: Frank Brsrk </title>
      <link>https://forem.com/frank_brsrk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/frank_brsrk"/>
    <language>en</language>
    <item>
      <title>I built a Python module to A/B test prompts inside Claude Code, and you can run it on yours</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Thu, 23 Apr 2026 10:29:33 +0000</pubDate>
      <link>https://forem.com/frank_brsrk/i-built-a-python-module-to-ab-test-prompts-inside-claude-code-and-you-can-run-it-on-yours-5c6f</link>
      <guid>https://forem.com/frank_brsrk/i-built-a-python-module-to-ab-test-prompts-inside-claude-code-and-you-can-run-it-on-yours-5c6f</guid>
      <description>&lt;p&gt;Same model. Same prompt. Baseline tells the patient to eat healthier. With an Ejentum reasoning scaffold injected, the agent asks for a thyroid panel.&lt;/p&gt;

&lt;p&gt;That's a real diff from the workflow I'm about to walk you through. The prompt was a medical second-opinion request (45M patient, pre-diabetic markers, dyslipidemia, vitamin D deficiency). Both agents were gpt-4o at temperature 0. The only difference: the scaffolded agent had a function-call tool that retrieved a structured reasoning constraint set at runtime and absorbed it before responding.&lt;/p&gt;

&lt;p&gt;A blind Gemini Flash judge scored both responses on five dimensions and ruled B superior, 20 to 16. The judge's stated reason:&lt;/p&gt;

&lt;p&gt;"Response B is superior because it directly addresses the patient's symptom of 'sluggishness' by linking it to the Vitamin D deficiency and suggesting further diagnostic steps like thyroid testing."&lt;/p&gt;

&lt;p&gt;This article is about the Python module that produced that result, why I built it, and how to run it inside your own IDE on your own prompts in about 5 minutes.&lt;/p&gt;

&lt;p&gt;The problem this exists to solve&lt;br&gt;
If you ship agents, you've lived this loop:&lt;/p&gt;

&lt;p&gt;You tweak a system prompt&lt;br&gt;
Add a tool, swap a model, change phrasing&lt;br&gt;
The output looks different&lt;br&gt;
You can't actually tell if it's better, or just rotated&lt;/p&gt;

&lt;p&gt;Prompt engineering is mostly intuition. Vendors hand you benchmarks and ask you to trust them. What you actually want is a way to test, on your own task, whether your changes are lifting your agent's reasoning or just dressing it up.&lt;/p&gt;

&lt;p&gt;I built this module because I needed that for myself. I'm a solo founder dogfooding Claude Code daily. Every time I added structure to a system prompt, I had no honest way to verify whether the agent was reasoning more carefully or just producing different-shaped slop.&lt;/p&gt;

&lt;p&gt;The module gives me a verdict.&lt;/p&gt;

&lt;p&gt;What it does&lt;br&gt;
A Python script (zero third-party dependencies, just stdlib) that:&lt;/p&gt;

&lt;p&gt;Forks any prompt through two identical gpt-4o agents at temperature 0&lt;br&gt;
Agent A runs plain. No tools. Strong directive system prompt.&lt;br&gt;
Agent B runs with the same baseline system prompt PLUS the Ejentum reasoning skill file PLUS a forced function-call to the Ejentum Logic API. The agent autonomously crafts the query and picks the harness mode (reasoning or reasoning-multi) per the skill file's decision table.&lt;br&gt;
The API returns a structured "cognitive scaffold" — a reasoning constraint set with [NEGATIVE GATE], [REASONING TOPOLOGY], [FALSIFICATION TEST], and Suppress/Amplify signals. The agent absorbs it and responds.&lt;br&gt;
Both responses go to a blind Gemini Flash judge (different model family from the producers, so no shared-bias contamination). The judge sees neutral "Response A / Response B" labels and never knows which is which.&lt;br&gt;
The judge returns structured JSON: scores per dimension (specificity, posture, depth, actionability, honesty), totals, justifications, and a verdict (A, B, or tie).&lt;/p&gt;

&lt;p&gt;That's it. One prompt in, structured verdict out.&lt;/p&gt;
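&lt;p&gt;To make that flow concrete, here is a deliberately minimal sketch of the fork-and-judge loop. The helper names (call_openai, fetch_ejentum_scaffold, call_gemini_judge) are hypothetical stand-ins supplied by the caller, not the module's real functions; orchestrator.py in the repo is the authoritative version.&lt;/p&gt;

```python
# Minimal sketch of the fork-and-judge flow. The three callables are
# hypothetical stand-ins, not the module's actual function names.
def fork_and_judge(prompt, call_openai, fetch_ejentum_scaffold, call_gemini_judge):
    baseline_system = "strong directive baseline system prompt"  # placeholder text

    # Agent A: plain. No tools, baseline system prompt only.
    response_a = call_openai(system=baseline_system, user=prompt)

    # Agent B: same baseline plus a scaffold fetched at runtime.
    scaffold = fetch_ejentum_scaffold(query=prompt, mode="reasoning")
    response_b = call_openai(system=baseline_system + "\n\n" + scaffold, user=prompt)

    # Blind judge: sees neutral A/B labels only, never the provenance.
    return call_gemini_judge(prompt, {"A": response_a, "B": response_b})
```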

&lt;p&gt;Running it inside Claude Code&lt;br&gt;
Setup, in three steps.&lt;/p&gt;

&lt;p&gt;Step 1: get three API keys&lt;br&gt;
OpenAI (platform.openai.com/api-keys) for both producer agents&lt;br&gt;
Google Gemini (aistudio.google.com/app/apikey) for the blind judge&lt;br&gt;
Ejentum (ejentum.com), 100 free calls, no card required&lt;br&gt;
Set them in env:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;export OPENAI_API_KEY=sk-...
export GEMINI_API_KEY=AI...
export EJENTUM_API_KEY=zpka_...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Step 2: clone the module&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;git clone https://github.com/ejentum/eval
cd eval/python
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Step 3: run it&lt;br&gt;
From the command line, with a prompt of your choice:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;python orchestrator.py "Should we pivot our SaaS to enterprise next quarter?"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Or call from Python:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from orchestrator import run_eval

result = run_eval("Should we pivot our SaaS to enterprise next quarter?")

print(result["evaluation"]["verdict"])         # "A" | "B" | "tie"
print(result["evaluation"]["totals"])          # {"A": 16, "B": 20}
print(result["evaluation"]["verdict_reason"])  # one-sentence reason
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That's the whole interface.&lt;/p&gt;

&lt;p&gt;When you run inside Claude Code (or Cursor or Antigravity), you can ask your IDE-agent to do this on your behalf. Tell it: "Run the eval module on this prompt I'm working on." The agent reads the README, runs the script, parses the JSON, and reports back the verdict with the judge's quoted reason. The same way you'd hand a junior engineer a script and ask for the result.&lt;/p&gt;

&lt;p&gt;What you get back&lt;br&gt;
Here's the JSON shape (real output from the medical run linked at the end):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "user_message": "Medical Report: ...",
  "baseline_response": "Based on the laboratory results...",
  "ejentum_response": "The patient's laboratory results indicate...",
  "evaluation": {
    "scores": {
      "A": {"specificity": 3, "posture": 3, "depth": 3, "actionability": 3, "honesty": 4},
      "B": {"specificity": 4, "posture": 4, "depth": 4, "actionability": 4, "honesty": 4}
    },
    "totals": {"A": 16, "B": 20},
    "justifications": {
      "specificity": "Response B is more specific in linking the Vitamin D deficiency to the patient's reported sluggishness and suggesting thyroid function tests to rule out other metabolic disorders.",
      "posture": "Response B is more substantive, challenging the primary physician's general recommendation by suggesting a more comprehensive approach...",
      "depth": "Response B reasons more deeply about the problem...",
      "actionability": "Response B provides more actionable recommendations...",
      "honesty": "Both responses acknowledge the limitations of diet and exercise alone..."
    },
    "verdict": "B",
    "verdict_reason": "Response B is superior because it directly addresses the patient's symptom of 'sluggishness' by linking it to the Vitamin D deficiency and suggesting further diagnostic steps like thyroid testing."
  },
  "scaffold_used": "[NEGATIVE GATE]\nThe analysis stopped at...",
  "tool_call": {
    "query": "Patient is a 45-year-old male reporting sluggishness...",
    "mode": "reasoning-multi"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You see everything: both responses verbatim, the per-dimension scores, why the judge ruled the way it did, the live scaffold that was injected into Agent B, and the exact query+mode the agent autonomously picked.&lt;/p&gt;

&lt;p&gt;Nothing summarized away.&lt;/p&gt;
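&lt;p&gt;Given that shape, a few lines of stdlib Python are enough to pull out a per-dimension diff (field names are taken from the output above; how you load the result dict is up to you):&lt;/p&gt;

```python
def dimension_diff(result):
    """Return {dimension: B_score - A_score} from an eval result dict."""
    scores = result["evaluation"]["scores"]
    return {dim: scores["B"][dim] - scores["A"][dim] for dim in scores["A"]}

# Example with the per-dimension scores from the medical run above:
result = {"evaluation": {"scores": {
    "A": {"specificity": 3, "posture": 3, "depth": 3, "actionability": 3, "honesty": 4},
    "B": {"specificity": 4, "posture": 4, "depth": 4, "actionability": 4, "honesty": 4},
}}}
print(dimension_diff(result))
# {'specificity': 1, 'posture': 1, 'depth': 1, 'actionability': 1, 'honesty': 0}
```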

&lt;p&gt;Why I designed it this way (transparency choices)&lt;br&gt;
Three things matter when you publish a tool that claims your product is better:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Trace. You need to see every step. Not "the model improved" but "the model called this tool, received this scaffold, executed this reasoning, scored this on this dimension." This module exposes the full chain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Auditability. All three system prompts (baseline, augmented, evaluator) are published as readable markdown in the repo, not buried in code. The Ejentum reasoning skill file the augmented agent receives is bundled. Anyone reading the repo can verify exactly what was given to each agent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Verifiability. The judge runs on a different model family from the producers (Gemini vs OpenAI). It receives only neutral A/B labels. Anyone with API keys can clone the repo, re-run the same script, and compare.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
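&lt;p&gt;To illustrate point 3, keeping the judge blind comes down to what you put in its prompt: neutral labels and nothing about provenance. A sketch of that idea (the repo's published evaluator prompt is the authoritative version; this wording is illustrative):&lt;/p&gt;

```python
def build_judge_prompt(user_message, response_a, response_b):
    # The judge sees only "Response A" / "Response B"; nothing in the
    # prompt reveals which agent was augmented. (Illustrative sketch.)
    return (
        "Score the two responses on specificity, posture, depth, "
        "actionability, and honesty (1 to 5 each). Return JSON with "
        "per-dimension scores, totals, justifications, and a verdict: "
        "A, B, or tie.\n\n"
        f"Prompt:\n{user_message}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n"
    )
```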

&lt;p&gt;Most "we improved your agent" claims ask you to trust a benchmark someone else ran. This hands you the instrument and lets you run it on your own task.&lt;/p&gt;

&lt;p&gt;What happens when it ties (because it does)&lt;br&gt;
The blind judge is allowed to return "tie" and regularly does.&lt;/p&gt;

&lt;p&gt;If your prompt is a low-complexity single-turn task (a simple question, a clear lookup, a known pattern), gpt-4o handles it well without any scaffold. Both responses will be similar. The judge will tie them. That's a real signal, not a failure of the tool.&lt;/p&gt;

&lt;p&gt;The scaffold's lift shows on prompts where baseline gpt-4o has a specific failure mode: sycophancy toward authority figures, shallow single-cause framing of multi-cause problems, generic templated responses to specific claims, missing differential diagnosis on ambiguous data.&lt;/p&gt;

&lt;p&gt;The medical second-opinion prompt landed in that territory because:&lt;/p&gt;

&lt;p&gt;The patient's reported symptom (sluggishness) was distinct from the lab values, and baseline got distracted by the lab walkthrough&lt;br&gt;
The PCP's recommendation was vague enough that baseline had room to either accept or challenge, and baseline accepted&lt;br&gt;
The labs cluster into a recognizable metabolic syndrome pattern, but spotting that requires synthesis, not enumeration&lt;/p&gt;

&lt;p&gt;That's the kind of prompt where the scaffold's [NEGATIVE GATE] and Suppress signals do real work. On "what's 2+2", they don't.&lt;/p&gt;

&lt;p&gt;If you run this on five of your own prompts and four tie, that doesn't mean the scaffold is broken. It means four of your prompts don't stress the kind of failure mode the scaffold prevents. Run it on harder ones.&lt;/p&gt;
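&lt;p&gt;A small batch runner makes that tally explicit. This is a sketch around the module's run_eval entry point (any callable returning the result shape shown above works):&lt;/p&gt;

```python
from collections import Counter

def tally_verdicts(prompts, run_eval):
    """Run the eval over a list of prompts and count A/B/tie verdicts."""
    counts = Counter()
    for prompt in prompts:
        result = run_eval(prompt)                      # one full fork-and-judge run
        counts[result["evaluation"]["verdict"]] += 1   # "A" | "B" | "tie"
    return dict(counts)
```

If "tie" dominates, that is data too: feed it harder prompts before concluding anything.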

&lt;p&gt;Try it on a hard prompt&lt;br&gt;
Some categories where I've seen the scaffold lift consistently:&lt;/p&gt;

&lt;p&gt;Validation traps: "I think we're fine because [other metric is up]" - baseline often validates; scaffolded names the false framing&lt;br&gt;
Multi-variable causal questions: "MRR grew but retention dropped, what should I do" - baseline picks one cause; scaffolded traces the chain&lt;br&gt;
Symptom-vs-lab questions: anything where the user's stated complaint diverges from the data they provide&lt;br&gt;
Strategic advice with a buried false premise: "should I pivot because my best customer said so" - baseline rubber-stamps; scaffolded probes&lt;br&gt;
Diagnostic prompts with ambiguous evidence: "my agent fails sometimes, what's wrong" - baseline guesses; scaffolded asks isolating questions&lt;/p&gt;

&lt;p&gt;If your work involves any of these patterns, the module is worth 5 minutes.&lt;/p&gt;

&lt;p&gt;Links&lt;br&gt;
Module: github.com/ejentum/eval/tree/main/python&lt;br&gt;
Worked example, fully replicable: github.com/ejentum/eval/tree/main/various_blind_eval_results/medical-second-opinion&lt;br&gt;
Ejentum API key (free, 100 calls): ejentum.com&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kcab763922g0g12mf0x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kcab763922g0g12mf0x.png" alt=" " width="800" height="501"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsrb800vvryg153322iu0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsrb800vvryg153322iu0.png" alt=" " width="800" height="540"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu02yz01oycz7bofnqmu4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu02yz01oycz7bofnqmu4.png" alt=" " width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
      <category>llm</category>
    </item>
    <item>
      <title>The model alone is not the agent. The harness plus the model is the agent.</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Wed, 22 Apr 2026 12:47:22 +0000</pubDate>
      <link>https://forem.com/frank_brsrk/the-model-alone-is-not-the-agent-the-harness-plus-the-model-is-the-agent-2p29</link>
      <guid>https://forem.com/frank_brsrk/the-model-alone-is-not-the-agent-the-harness-plus-the-model-is-the-agent-2p29</guid>
      <description>&lt;p&gt;An agentic harness is the orchestration and control layer wrapped around a base language model that transforms it from a stateless text predictor into an agent capable of taking actions, calling tools, maintaining state across steps, and executing multi-step tasks. The model provides raw capability; the harness provides the structure that turns that capability into coordinated behavior. Different harnesses wrapping the same model produce materially different agent behavior, which is why harness design is considered a discipline in its own right.&lt;/p&gt;

&lt;p&gt;What a harness typically contains&lt;br&gt;
A system prompt defining the agent's role and boundaries&lt;/p&gt;

&lt;p&gt;A tool schema and invocation loop (function calling, API access, code execution)&lt;/p&gt;

&lt;p&gt;A memory layer, short-term through the context window and often long-term through an external store&lt;/p&gt;

&lt;p&gt;Orchestration logic for multi-step or multi-agent flows&lt;/p&gt;

&lt;p&gt;Verification or reflection steps between actions&lt;/p&gt;

&lt;p&gt;Error handling, retries, and termination conditions&lt;/p&gt;

&lt;p&gt;Input and output format enforcement&lt;/p&gt;
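&lt;p&gt;Stripped to its skeleton, a harness is just a loop around the model. A deliberately minimal sketch of the components above (model and tools are hypothetical callables; real harnesses add memory, verification steps, and richer error handling):&lt;/p&gt;

```python
def run_agent(model, tools, task, max_steps=10):
    """Minimal tool-calling harness: loop until the model stops acting."""
    history = [{"role": "user", "content": task}]   # short-term memory
    for _ in range(max_steps):                      # termination condition
        action = model(history)                     # model proposes the next step
        if action["type"] == "final":               # done: return the answer
            return action["content"]
        tool = tools[action["tool"]]                # tool schema + invocation loop
        observation = tool(**action["args"])        # execute the tool call
        history.append({"role": "tool", "content": str(observation)})
    raise RuntimeError("step budget exhausted")     # error handling
```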

&lt;p&gt;Examples from the field&lt;br&gt;
ReAct (Yao et al., 2022): a harness pattern that interleaves reasoning traces and action calls in a loop, letting the model decide when to think and when to act.&lt;/p&gt;

&lt;p&gt;Claude Computer Use: a harness that wraps a language model with screenshot capture, mouse and keyboard simulation, and a perception and action loop for controlling a desktop.&lt;/p&gt;

&lt;p&gt;OpenAI Assistants runtime: a managed harness around the OpenAI models that handles thread persistence, file retrieval, code interpreter sessions, and function calling.&lt;/p&gt;

&lt;p&gt;Devin (Cognition): a tightly engineered harness combining a planning module, a browser, a code editor, and a shell, all driven by an underlying model.&lt;/p&gt;

&lt;p&gt;LangGraph: a graph-based harness where nodes are model calls or tools and edges encode the control flow, letting the developer define the agent's reasoning topology explicitly.&lt;/p&gt;

&lt;p&gt;The defining property across all of them: the model alone is not the agent. The harness plus the model is the agent.&lt;/p&gt;

&lt;p&gt;Check out our externalized harness at ejentum.com. You can use it inside your own harness to boost the performance of your agentic systems even further.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>Eval workflow for agentic builders: fork any prompt through baseline vs scaffolded agents, blind third-party judge.</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Wed, 22 Apr 2026 07:17:13 +0000</pubDate>
      <link>https://forem.com/frank_brsrk/eval-workflow-for-agentic-builders-fork-any-prompt-through-baseline-vs-scaffolded-agents-blind-3i47</link>
      <guid>https://forem.com/frank_brsrk/eval-workflow-for-agentic-builders-fork-any-prompt-through-baseline-vs-scaffolded-agents-blind-3i47</guid>
      <description>&lt;p&gt;Built an n8n eval workflow that A/B tests any prompt through plain GPT-4o vs GPT-4o + a reasoning scaffold, judged by a blind Gemini evaluator&lt;/p&gt;

&lt;p&gt;Solo founder here. I've been building a cognitive infrastructure API (Ejentum) and needed a way for builders to evaluate it on their own agent tasks instead of trusting my benchmarks. So I published the eval as an n8n workflow.&lt;/p&gt;

&lt;p&gt;What it is&lt;br&gt;
A three-agent n8n workflow. You paste any prompt in the chat trigger. The prompt fans out through two identical GPT-4o agents (one plain, one with an Ejentum reasoning scaffold injected via an HTTP tool). A blind Gemini Flash evaluator scores both responses on five dimensions (specificity, posture, depth, actionability, honesty) and returns structured JSON with a verdict.&lt;/p&gt;

&lt;p&gt;The evaluator is allowed to return "tie" and regularly does. The point is that you test on your own tasks and decide.&lt;/p&gt;

&lt;p&gt;What it's actually testing&lt;br&gt;
Whether the cognitive scaffold changes output posture on a given task, or not&lt;/p&gt;

&lt;p&gt;Whether the scaffolded agent engages the specific claims in your prompt or stays generic&lt;/p&gt;

&lt;p&gt;How the scaffold affects sycophancy, depth, and diagnostic procedure&lt;/p&gt;

&lt;p&gt;Whether different harness modes (reasoning, anti-deception, memory, code) stress different task types. Mode is editable in the HTTP tool's JSON body&lt;/p&gt;
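&lt;p&gt;The body you edit in that HTTP tool looks roughly like this. The field names mirror the tool_call shape the eval emits ("query" plus "mode"); the repo README has the authoritative schema, so treat this as illustrative:&lt;/p&gt;

```python
import json

# JSON body for the HTTP tool node. "query" describes the task;
# "mode" selects the harness mode. Field names assumed from the
# eval's tool_call output; see the repo README for the real schema.
body = {
    "query": "MRR grew but retention dropped; trace the causal chain",
    "mode": "reasoning",  # or "anti-deception", "memory", "code"
}
print(json.dumps(body, indent=2))
```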

&lt;p&gt;The diff is often subtle on easy prompts and more pronounced on dual-load prompts (emotional + cognitive claims mixed), advice prompts with a buried false premise, or multi-variable causal reasoning. Low-complexity single-turn tasks often produce ties because GPT-4o handles them well without a scaffold.&lt;/p&gt;

&lt;p&gt;Where you might apply this pattern&lt;br&gt;
Customer support agents: test whether the scaffold reduces rubber-stamping and increases specificity on customer complaints&lt;/p&gt;

&lt;p&gt;Code review or diagnostic agents: test whether it catches the failure modes you actually care about&lt;/p&gt;

&lt;p&gt;Content or research workflows: test whether it reduces generic output on your topics&lt;/p&gt;

&lt;p&gt;Multi-agent systems: wrap any single agent call in the fork to see the effect before integrating permanently&lt;/p&gt;

&lt;p&gt;Prompt engineering A/B tests: measure the effect of a cognitive layer against your own prompt iterations&lt;/p&gt;

&lt;p&gt;Setup&lt;br&gt;
Import Reasoning_Harness_Eval_Workflow.json&lt;/p&gt;

&lt;p&gt;Set three credentials: OpenAI (both producer agents), Google Gemini (blind evaluator), Header Auth for the Ejentum API (free key at ejentum.com, 100 calls)&lt;/p&gt;

&lt;p&gt;Paste a prompt in the chat trigger&lt;/p&gt;

&lt;p&gt;Workflow diagram:&lt;br&gt;
[attach screenshots/eval_workflow.png]&lt;/p&gt;

&lt;p&gt;A vs B output from one run:&lt;br&gt;
[attach screenshots/A_vs_B.png]&lt;/p&gt;

&lt;p&gt;Blind evaluator verdict JSON from the same run:&lt;br&gt;
[attach screenshots/A_B__blind_eval.png]&lt;/p&gt;

&lt;p&gt;Workflow JSON, READMEs, and a TypeScript port for IDE setups (Antigravity, Claude Code, Cursor): &lt;a href="https://github.com/ejentum/eval" rel="noopener noreferrer"&gt;https://github.com/ejentum/eval&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rixauzxhhysp7qaq5i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rixauzxhhysp7qaq5i4.png" alt=" " width="800" height="403"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58h204nr5crgpwiqz93w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58h204nr5crgpwiqz93w.png" alt=" " width="800" height="528"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp95f3djwkqymij8fcz5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp95f3djwkqymij8fcz5p.png" alt=" " width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>Wait, you guys run evals?</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Wed, 22 Apr 2026 00:11:05 +0000</pubDate>
      <link>https://forem.com/frank_brsrk/wait-you-guys-run-evals-19ig</link>
      <guid>https://forem.com/frank_brsrk/wait-you-guys-run-evals-19ig</guid>
      <description>&lt;p&gt;Comes in my mind a meme with this expression but clearly cannot find the image related to.&lt;/p&gt;

&lt;p&gt;My question for this community: whenever you build a system or product with a model in the backend that takes actions and is in charge of decisions requiring rigor, you search out a few good peer-reviewed benchmarks, run the hardest tasks to grant yourself a bonbon of anti-sycophancy, and see where you stand, above or below. Great, but some of those metrics were never built for the exact use case you built the product for. Do you ever step aside and think about building an eval specifically designed to find the real benefits of your system? Doing so spawns new findings, positive and negative, about your work, and results in a map of failures to suppress and strengths to amplify. I'm asking because each of you has your own blueprints and way of seeing and running things, and every point of view has its place in this post. Thanks for reading.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>evals</category>
      <category>llm</category>
    </item>
    <item>
      <title>Under Pressure. Better Harness.</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Mon, 20 Apr 2026 15:57:13 +0000</pubDate>
      <link>https://forem.com/frank_brsrk/under-pressure-better-harness-5887</link>
      <guid>https://forem.com/frank_brsrk/under-pressure-better-harness-5887</guid>
      <description>&lt;p&gt;Hey everyone this is not a typical constructed llm copied and pasted post. &lt;/p&gt;

&lt;p&gt;I built a reasoning tool for AI agents, applicable to any agentic framework and to any AI model capable of tool calling.&lt;/p&gt;

&lt;p&gt;The tool is an HTTP POST request in which the agent sends a description of the task and a mode (coding, reasoning, anti-deception, or memory; the skill files teach the agent to choose autonomously). What comes back is a cognitive operation, or "ability" as I call them: an engineered, structured reasoning injection that lives in a vector database optimized for agentic inference. The retrieval arrives as an instruction the agent follows, not as content it merely reads. Each ability is a complex of fields: a Wrong/Right Pattern, procedural steps for the reasoning method to apply, a reasoning topology for branching exploration, and a matrix of suppression fields that flag the failure modes models actually run into. In short: the API matches the task based on the description ("query") and "mode" and returns tool results that go into the agent's context. Benchmarks, both publicly reviewed and internal, are all public on ejentum.com and GitHub: &lt;a href="https://github.com/ejentum/benchmarks" rel="noopener noreferrer"&gt;https://github.com/ejentum/benchmarks&lt;/a&gt;.&lt;/p&gt;
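&lt;p&gt;From an agent's side, the call can be sketched with nothing but the stdlib. The endpoint path and auth header below are illustrative assumptions; ejentum.com documents the real ones.&lt;/p&gt;

```python
import json
import urllib.request

# Endpoint URL and header names are illustrative assumptions,
# not the documented API; see ejentum.com for the real schema.
def build_payload(query, mode):
    return json.dumps({"query": query, "mode": mode}).encode()

def fetch_ability(query, mode, api_key, url="https://api.ejentum.com/logic"):
    req = urllib.request.Request(
        url,
        data=build_payload(query, mode),
        headers={"Content-Type": "application/json", "Authorization": api_key},
    )
    with urllib.request.urlopen(req) as resp:  # live network call; needs a key
        return json.loads(resp.read())
```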

&lt;p&gt;A test I ran today, just to compare raw Opus 4.7 vs the harness-augmented 4.7 on the same prompt, with different results.&lt;/p&gt;

&lt;p&gt;I wrote a runbook prompt with two embedded flaws to see whether a ~200-token block prepended before generation shifts what Opus 4.7 notices.&lt;/p&gt;

&lt;p&gt;The prompt: a 300-word migration runbook under a deadline. Embedded flaws: eventual-consistency replication combined with a "strict" global rate limit (CAP-incompatible), and 12 regions × 1,000 RPS stated as a 10,000 RPS global cap (the arithmetic actually gives 12,000).&lt;/p&gt;

&lt;p&gt;The baseline caught the CAP issue. It did not mention the arithmetic.&lt;/p&gt;

&lt;p&gt;[raw 4.7 opus]&lt;/p&gt;

&lt;p&gt;Same model, same temperature, plus one curl before the call to fetch and prepend a suppression block: it caught both. The injection is visible in the OUT line of the response, so it is not hidden.&lt;/p&gt;

&lt;p&gt;[4.7 opus + ejentum harness]&lt;/p&gt;

&lt;p&gt;The model can do the arithmetic. What changed is what it lets itself skip before generation starts. The block is retrieved from a semantic index of ~140 anti-deception patterns keyed to the query, not from a static system prompt.&lt;/p&gt;

&lt;p&gt;On ejentum.com I've shared skill files and a large set of docs that help you grasp the concept of this new tool. I'm looking forward to feedback. Hope you like it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frltjfh3uzpnwjvyxa3it.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frltjfh3uzpnwjvyxa3it.png" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqkv7tjn9nj7dmh2awu5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuqkv7tjn9nj7dmh2awu5.png" alt=" " width="800" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks for your attention to the post.&lt;/p&gt;

&lt;p&gt;&lt;a class="mentioned-user" href="https://dev.to/frank_brsrk"&gt;@frank_brsrk&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>hey everyone and anyone. first timer on dev.to, the website looks cool and people do so too. this is frank, cheers</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Mon, 20 Apr 2026 15:35:19 +0000</pubDate>
      <link>https://forem.com/frank_brsrk/hey-everyone-and-anyone-first-timeer-in-devto-the-website-looks-cool-and-people-do-so-229o</link>
      <guid>https://forem.com/frank_brsrk/hey-everyone-and-anyone-first-timeer-in-devto-the-website-looks-cool-and-people-do-so-229o</guid>
      <description></description>
    </item>
  </channel>
</rss>
