Forem: Scotty G

Testing AI Agents Like Code: the `oa test` Harness

Scotty G — Thu, 23 Apr 2026 00:39:21 +0000

You wouldn't ship code without tests. But most AI agents ship with nothing — a handful of manual prompts in a notebook, a screenshot of "it worked once," and a prayer that production inputs don't look too different from the test ones.

OAS 1.4 ships oa test, a test harness that runs eval cases against real models, asserts on output shape and content, and emits CI-friendly JSON. Your agents get tested like code, because they are code.

What a test file looks like

Tests live alongside the spec. One YAML file per agent:

# .agents/summariser.test.yaml
spec: ./summariser.yaml

cases:
  - name: summarises short documents
    task: summarise
    input:
      document: "The sky is blue. The grass is green. Water is wet."
    expect:
      output.summary: { type: string, min_length: 10 }

  - name: handles empty facts gracefully
    task: summarise
    input:
      document: ""
    expect:
      output.summary: { contains: "no content" }

  - name: smoke test only
    task: summarise
    input:
      document: "..."
    # no expect block — passes if the model returns anything valid

Three cases, one file. Each case targets a task in the spec, provides the input, and optionally asserts on the output.

The assertion vocabulary

oa test supports a small, practical set of assertions, enough to catch real bugs without turning tests into a DSL.

| Assertion | Example | Checks |
| --- | --- | --- |
| `contains` | `{ contains: "welcome" }` | Substring match (case-insensitive by default) |
| `equals` | `{ equals: "greeting" }` | Exact value equality |
| `type` | `{ type: array }` | Value type: `string`, `number`, `boolean`, `object`, `array` |
| `min_length` | `{ min_length: 1 }` | Lower bound on length for strings or arrays |
| `max_length` | `{ max_length: 500 }` | Upper bound on length for strings or arrays |

You can combine them:

expect:
  output.items: { type: array, min_length: 1, max_length: 10 }
  output.items[0].id: { type: string }
  output.summary: { contains: "sky", case_sensitive: false }
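To make the semantics of a combined assertion concrete, here is a minimal Python sketch of how one `expect` entry might be evaluated. It is not the harness's actual implementation: the name `check_assertion` is hypothetical, and the behaviour (every key in the block must hold, `contains` ignores case unless `case_sensitive: true` is set) is inferred from the table above.

```python
from typing import Any

# Type names used in test files, mapped to Python types.
TYPE_MAP = {"string": str, "boolean": bool, "object": dict, "array": list}


def check_assertion(value: Any, assertion: dict) -> list[str]:
    """Return failure messages for one expect entry; an empty list means it passed."""
    failures = []

    expected = assertion.get("type")
    if expected == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            failures.append(f"expected type number, got {type(value).__name__}")
    elif expected is not None:
        if expected not in TYPE_MAP:
            failures.append(f"unknown type {expected!r}")
        elif not isinstance(value, TYPE_MAP[expected]):
            failures.append(f"expected type {expected}, got {type(value).__name__}")

    if "equals" in assertion and value != assertion["equals"]:
        failures.append(f"expected {assertion['equals']!r}, got {value!r}")

    if "contains" in assertion:
        needle, haystack = assertion["contains"], str(value)
        if not assertion.get("case_sensitive", False):
            needle, haystack = needle.lower(), haystack.lower()
        if needle not in haystack:
            failures.append(f"expected to contain {assertion['contains']!r}")

    if "min_length" in assertion or "max_length" in assertion:
        if not isinstance(value, (str, list)):
            failures.append("length assertions apply only to strings and arrays")
        else:
            size = len(value)
            if size < assertion.get("min_length", 0):
                failures.append(f"length {size} is below min_length")
            if size > assertion.get("max_length", size):
                failures.append(f"length {size} is above max_length")

    return failures


print(check_assertion(["a", "b"], {"type": "array", "min_length": 1, "max_length": 10}))
print(check_assertion("The Sky is blue", {"contains": "sky"}))
print(check_assertion("", {"type": "string", "min_length": 10}))
```

Returning a list of messages rather than a single boolean keeps every violated constraint visible when several assertions are combined on one path.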

Paths support dotted access and array indexing (output.items[0].id). The parser is deliberately simple. If you need richer assertions, drop to a post-processing step in CI rather than extending the harness.
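As an illustration of what dotted access with array indexing amounts to, the sketch below walks a nested result along such a path. The function name `resolve_path` is hypothetical, and it assumes the path is rooted at a result object with an `output` key; the real parser may differ in the syntax it accepts.

```python
import re
from typing import Any

# One path segment: a key name followed by zero or more [index] suffixes.
_SEGMENT = re.compile(r"^([^\[\].]+)((?:\[\d+\])*)$")


def resolve_path(data: Any, path: str) -> Any:
    """Walk `data` along a dotted path such as 'output.items[0].id'.

    Raises KeyError, IndexError, TypeError or ValueError if the path cannot be followed.
    """
    current = data
    for segment in path.split("."):
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"malformed path segment: {segment!r}")
        key, indexes = match.group(1), match.group(2)
        current = current[key]
        # Apply each [n] suffix in order, e.g. items[0][2].
        for index in re.findall(r"\[(\d+)\]", indexes):
            current = current[int(index)]
    return current


result = {"output": {"items": [{"id": "a1"}, {"id": "b2"}], "summary": "The sky is blue."}}
print(resolve_path(result, "output.items[0].id"))
print(resolve_path(result, "output.summary"))
```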

Running the tests

From the terminal:

oa test .agents/summariser.test.yaml

You get human-readable output — green ticks, red crosses, which case failed and why.

For CI, flip to JSON mode:

oa test .agents/summariser.test.yaml --quiet

{
  "spec": ".agents/summariser.yaml",
  "total": 3,
  "passed": 2,
  "failed": 1,
  "cases": [
    {
      "name": "summarises short documents",
      "passed": true,
      "duration_ms": 842
    },
    {
      "name": "handles empty facts gracefully",
      "passed": false,
      "reason": "output.summary: expected to contain 'no content', got 'The document is empty'",
      "duration_ms": 512
    },
    {
      "name": "smoke test only",
      "passed": true,
      "duration_ms": 654
    }
  ]
}

Pipe this into whatever CI system you use. The exit code is non-zero on any failure, so oa test plays nicely with standard test-runner conventions.
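If you want a post-processing step on top of the exit code, the report is easy to consume. The following is a small sketch that assumes the JSON shape shown above; `summarise_report` is a hypothetical helper, not part of the CLI, and the sample is a trimmed copy of that report.

```python
import json


def summarise_report(report_text: str) -> tuple[int, list[str]]:
    """Return (exit_code, lines) for a JSON report shaped like the one above."""
    report = json.loads(report_text)
    lines = [f"{report['passed']}/{report['total']} cases passed"]
    for case in report["cases"]:
        if not case["passed"]:
            # "reason" only appears on failed cases in the sample report.
            reason = case.get("reason", "no reason given")
            lines.append(f"FAILED {case['name']}: {reason}")
    exit_code = 1 if report["failed"] else 0
    return exit_code, lines


sample = """
{
  "spec": ".agents/summariser.yaml",
  "total": 3,
  "passed": 2,
  "failed": 1,
  "cases": [
    {"name": "summarises short documents", "passed": true, "duration_ms": 842},
    {"name": "handles empty facts gracefully", "passed": false,
     "reason": "output.summary: expected to contain 'no content', got 'The document is empty'",
     "duration_ms": 512},
    {"name": "smoke test only", "passed": true, "duration_ms": 654}
  ]
}
"""

code, lines = summarise_report(sample)
print("\n".join(lines))
```

A step like this is also a natural place for the richer checks that the harness deliberately leaves out.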

Testing in CI

Drop it into a GitHub Actions workflow:

# .github/workflows/test-agents.yml
name: Test agents

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pipx install open-agent-spec
      - name: Run agent tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          for test in .agents/*.test.yaml; do
            oa test "$test" --quiet
          done

Agents now have the same test discipline as the rest of your codebase. Break a prompt? The test case catches it before merge. Swap models? Run the suite and see what drifted.
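One caveat about the loop above: GitHub Actions runs `run` steps with `bash -e` by default, so the loop stops at the first failing file and later test files never run. If you would rather see every failure in a single CI run, a small wrapper can run all files and fail at the end. This is only a sketch: `run_all` and `suite_exit_code` are hypothetical helpers, and the command is passed in as a prefix and suffix so that nothing about the CLI beyond the invocation shown above is assumed.

```python
import glob
import subprocess


def run_all(pattern: str, prefix: list[str], suffix: list[str]) -> dict[str, int]:
    """Run `prefix + [path] + suffix` for each matching file, without stopping early.

    Returns a mapping of file path to process exit code.
    """
    results = {}
    for path in sorted(glob.glob(pattern)):
        completed = subprocess.run([*prefix, path, *suffix])
        results[path] = completed.returncode
    return results


def suite_exit_code(results: dict[str, int]) -> int:
    """Return 1 if any file failed, otherwise 0."""
    return 1 if any(code != 0 for code in results.values()) else 0
```

In a workflow step this could be called as `sys.exit(suite_exit_code(run_all(".agents/*.test.yaml", ["oa", "test"], ["--quiet"])))`, mirroring the command used in the loop.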

What to actually test

Model outputs are non-deterministic, so your tests need to assert on shape and invariants, not exact strings.

Do test:

Output schema conformance: required fields present, types correct
Structural invariants: "the summary is always under 500 chars," "the category is always one of these enum values"
Refusal handling: empty or adversarial inputs don't crash the pipeline
Tool interaction: tool-using agents produce the expected tool calls for known inputs
Delegated spec integration: a spec pulling oa://prime-vector/summariser still works after the registry updates

Don't test:

Exact phrasing — "the response should be 'Hello, Alice!'" — brittle and wrong
Creative output quality — that's a human eval problem, not a test-suite problem
Token counts or latency — monitor these in production, don't gate PRs on them

Test invariants, not novelty. That's where agent tests earn their keep.
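The assertions listed earlier do not include an enum-membership check, so an invariant such as "the category is always one of these values" is a candidate for the post-processing route. Here is a minimal sketch of such a check; the field names `summary` and `category` and the allowed values are hypothetical and would come from your own output schema.

```python
ALLOWED_CATEGORIES = {"bug", "feature", "question"}  # hypothetical enum values
MAX_SUMMARY_CHARS = 500


def check_invariants(output: dict) -> list[str]:
    """Return a list of violated invariants for one task output."""
    problems = []

    summary = output.get("summary")
    if not isinstance(summary, str):
        problems.append("summary must be a string")
    elif len(summary) >= MAX_SUMMARY_CHARS:
        problems.append(f"summary has {len(summary)} chars, expected under {MAX_SUMMARY_CHARS}")

    category = output.get("category")
    if category not in ALLOWED_CATEGORIES:
        problems.append(f"category {category!r} is not one of {sorted(ALLOWED_CATEGORIES)}")

    return problems


print(check_invariants({"summary": "Login fails on Safari.", "category": "bug"}))
print(check_invariants({"summary": "x" * 600, "category": "rant"}))
```

Neither check depends on the exact wording the model produced, which is what keeps tests of this kind stable across reruns and model swaps.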

The bigger picture

Agents-as-code only works if the agents are actually code-like. That means:

Version-controlled — ✅ YAML in your repo
Reviewable — ✅ prompts and schemas in a PR diff
Reusable — ✅ spec delegation and the OAS registry
Testable — ✅ oa test

oa test was the last piece missing. With it, agents get the same discipline as any other component of your system: change them, test them, merge them, deploy them.

Define what your agents do. Let the runtime be someone else's problem.

Getting started

pipx install open-agent-spec

# Add a test file next to your spec
cat > .agents/example.test.yaml <<'EOF'
spec: ./example.yaml
cases:
  - name: greets by name
    task: greet
    input: { name: "CI" }
    expect:
      output.response: { contains: "CI" }
EOF

# Run it
oa test .agents/example.test.yaml

One command. One YAML file. Your agents now have a test suite.

Open Agent Spec is MIT-licensed and maintained by Prime Vector. If you're running agents in CI, we'd love to hear what broke — issues welcome on GitHub.

Composable Agent Specs: Spec Delegation and the OAS Registry

Scotty G — Mon, 20 Apr 2026 00:31:19 +0000

Most agent frameworks solve reuse the way libraries do: write a class, import it, hope the abstractions line up. That works inside one codebase. Between teams or across organisations? It breaks down fast.

Open Agent Spec 1.4 takes a different approach. One agent spec can delegate a task to another spec, loaded from a local path, a URL, or a public registry:

tasks:
  summarise:
    spec: oa://prime-vector/summariser@1.0.0   # version-pinned, from registry
    task: summarise

One line. Version-pinned. No copy-paste. Your pipeline gets a battle-tested summariser without importing anything.

Why spec composition matters

An agent that fits in one YAML file is easy. But as soon as you build a second agent that needs the same summarisation step, or the same sentiment classifier, or the same document-extraction task — you start copy-pasting.

The usual fix is to wrap the shared logic in a Python function. But now the agent definition is split across YAML and Python, the function drifts from the spec over time, and you're back to the problem OAS was meant to solve: agent logic spread across framework abstractions you can't review in a PR diff.

Spec composition keeps the contract in YAML.

Three ways to delegate

A delegated task declares spec: + task: instead of its own prompts and output schema. The runtime loads the referenced spec, runs the target task, and surfaces the result transparently.

1. Local file

For intra-repo reuse:

tasks:
  summarise:
    spec: ./specs/summariser.yaml
    task: summarise

The path resolves relative to the calling spec's directory. Useful when you have a shared specs directory in a monorepo.

2. HTTP/HTTPS URL

For cross-repo reuse without a registry:

tasks:
  summarise:
    spec: https://raw.githubusercontent.com/your-org/agents/main/summariser.yaml
    task: summarise

Any spec reachable over HTTP works. Good for internal GitHub-hosted specs behind SSO.

3. Registry reference (`oa://`)

For public, versioned reuse:

tasks:
  summarise:
    spec: oa://prime-vector/summariser@1.0.0
    task: summarise

oa://namespace/name expands to https://openagentspec.dev/registry/namespace/name/latest/spec.yaml. With @version, it pins to an exact version. Drop the version for "latest" semantics — useful in development, risky in production.
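The three reference styles can be told apart by their prefix. Below is a rough Python sketch of that resolution step. The name `resolve_spec_reference` is hypothetical, and one detail is an assumption rather than something stated here: the floating form expands to the `latest` URL given above, and the sketch assumes a pinned version simply replaces the `latest` path segment.

```python
import posixpath
import re

REGISTRY_BASE = "https://openagentspec.dev/registry"

# oa://namespace/name with an optional @version suffix.
_OA_REF = re.compile(r"^oa://([^/@]+)/([^/@]+)(?:@(.+))?$")


def resolve_spec_reference(reference: str, calling_spec: str) -> str:
    """Turn a `spec:` value into a URL, or into a path for local files."""
    match = _OA_REF.match(reference)
    if match:
        namespace, name, version = match.groups()
        # ASSUMPTION: a pinned version takes the place of 'latest' in the URL.
        return f"{REGISTRY_BASE}/{namespace}/{name}/{version or 'latest'}/spec.yaml"
    if reference.startswith(("http://", "https://")):
        return reference
    # Local file: resolve relative to the calling spec's directory.
    base_dir = posixpath.dirname(calling_spec)
    return posixpath.normpath(posixpath.join(base_dir, reference))


caller = ".agents/pipeline.yaml"
print(resolve_spec_reference("./specs/summariser.yaml", caller))
print(resolve_spec_reference("oa://prime-vector/summariser", caller))
print(resolve_spec_reference("oa://prime-vector/summariser@1.0.0", caller))
```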

What lives on the registry

The OAS registry is open for community contributions. A handful of Prime Vector-authored specs are already published as a baseline:

| Reference | Purpose |
| --- | --- |
| `oa://prime-vector/summariser` | Document summarisation |
| `oa://prime-vector/classifier` | Multi-class document classification |
| `oa://prime-vector/sentiment` | Sentiment analysis with confidence scores |
| `oa://prime-vector/keyword-extractor` | Key phrase extraction |
| `oa://prime-vector/code-reviewer` | Code review task for PR diffs |
| `oa://prime-vector/memory-retriever` | Retrieve relevant context from a memory store |

You can compose these into your own pipelines without writing the prompts yourself. Each is a single YAML file with a typed output schema — so when you call it, you know exactly what you're going to get back.

Composing delegated tasks with `depends_on`

Delegation stacks with depends_on. You can chain multiple delegated tasks together:

open_agent_spec: "1.4.0"

agent:
  name: document-processor
  description: "Extract, summarise, and classify a document"

intelligence:
  type: llm
  engine: openai
  model: gpt-4o

tasks:
  extract_keywords:
    spec: oa://prime-vector/keyword-extractor@1.0.0
    task: extract

  summarise:
    spec: oa://prime-vector/summariser@1.0.0
    task: summarise
    depends_on: [extract_keywords]

  classify:
    spec: oa://prime-vector/classifier@1.0.0
    task: classify
    depends_on: [summarise]

Three delegated tasks. Three different authors possible. One pipeline. The runtime handles the load → run → merge flow, and the result envelope tells you exactly which spec each task delegated to:

{
  "task": "classify",
  "delegated_to": "oa://prime-vector/classifier@1.0.0#classify",
  "output": {"category": "technical-documentation", "confidence": 0.94}
}
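How the runtime schedules tasks internally is not described here, but the ordering that `depends_on` implies is a plain topological sort. As a sketch, using Python's standard `graphlib` and a dictionary that mirrors the `tasks` block of the pipeline above (the helper name `execution_order` is hypothetical):

```python
from graphlib import TopologicalSorter

# Mirrors the tasks block above; only the dependency edges matter here.
tasks = {
    "extract_keywords": {"depends_on": []},
    "summarise": {"depends_on": ["extract_keywords"]},
    "classify": {"depends_on": ["summarise"]},
}


def execution_order(task_map: dict) -> list[str]:
    """Return task names ordered so that each task follows its dependencies."""
    graph = {name: set(cfg.get("depends_on", [])) for name, cfg in task_map.items()}
    # static_order() raises graphlib.CycleError if depends_on forms a loop.
    return list(TopologicalSorter(graph).static_order())


print(execution_order(tasks))
```

For a linear chain like this one the order is unique. With independent branches, any order that respects the edges would be valid.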

Safety: cycle detection

One risk with delegation: A delegates to B which delegates to A. An infinite loop waiting to happen.

The runtime detects this before any model call is made and raises DELEGATION_CYCLE_ERROR:

{
  "error": "Circular spec delegation detected: './summariser.yaml' is already in the delegation stack",
  "code": "DELEGATION_CYCLE_ERROR",
  "stage": "delegation",
  "task": "summarise"
}

This is specified in the formal OAS standard as a MUST requirement — any conforming runtime has to detect cycles before spending tokens.
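The check itself can be small. The sketch below keeps a stack of the specs on the current delegation path and fails as soon as one reappears, before anything would be executed. It is an illustration, not the reference runtime: `check_delegation` and the in-memory `delegations` map (standing in for loading specs from disk or the registry) are hypothetical.

```python
class DelegationCycleError(Exception):
    """Raised when a spec reappears in its own delegation stack."""

    code = "DELEGATION_CYCLE_ERROR"


def check_delegation(spec: str, delegations: dict, stack: tuple = ()) -> None:
    """Depth-first walk over delegations, raising on the first cycle found.

    `delegations` maps a spec reference to the list of specs it delegates to.
    """
    if spec in stack:
        raise DelegationCycleError(
            f"Circular spec delegation detected: '{spec}' is already in the delegation stack"
        )
    for target in delegations.get(spec, []):
        # The stack holds only the current path, so a spec reused on two
        # separate branches is not mistaken for a cycle.
        check_delegation(target, delegations, stack + (spec,))


cyclic = {"./pipeline.yaml": ["./summariser.yaml"], "./summariser.yaml": ["./pipeline.yaml"]}
try:
    check_delegation("./pipeline.yaml", cyclic)
except DelegationCycleError as error:
    print(error.code, "-", error)
```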

Version pinning: why you should care

The registry supports both floating (oa://.../summariser) and pinned (oa://.../summariser@1.0.0) references.

In production, pin your versions. Same reason you pin npm or PyPI packages. A spec author can update prompts, tighten schemas, or change defaults in ways that look like minor improvements but subtly alter your pipeline's output.

Semantic Versioning applies:

Patch bumps (1.0.0 → 1.0.1) — prompt tweaks, typo fixes, no schema changes
Minor bumps (1.0.0 → 1.1.0) — added optional output fields, expanded input acceptance
Major bumps (1.0.0 → 2.0.0) — schema changes, prompt overhauls, model swaps

For CI agents or anything that fans out to production data, pin. Always.
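A pinning policy is easy to enforce mechanically. Here is a hedged sketch of a lint step that flags floating registry references. `floating_references` is a hypothetical helper, and it assumes the `tasks` mapping has already been loaded from the spec's YAML and that versions follow the three-part `MAJOR.MINOR.PATCH` form used in the examples.

```python
import re

# A registry reference pinned to an exact MAJOR.MINOR.PATCH version.
_PINNED = re.compile(r"^oa://[^/@]+/[^/@]+@\d+\.\d+\.\d+$")


def floating_references(tasks: dict) -> list[str]:
    """Return the names of tasks whose oa:// reference is not version-pinned."""
    floating = []
    for name, cfg in tasks.items():
        reference = cfg.get("spec", "")
        if reference.startswith("oa://") and not _PINNED.match(reference):
            floating.append(name)
    return floating


tasks = {
    "extract_keywords": {"spec": "oa://prime-vector/keyword-extractor@1.0.0", "task": "extract"},
    "summarise": {"spec": "oa://prime-vector/summariser", "task": "summarise"},
    "local_step": {"spec": "./specs/cleanup.yaml", "task": "clean"},
}
print(floating_references(tasks))
```

Local paths and plain URLs are ignored here because the version suffix only applies to registry references.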

Publishing your own

The registry is open for contributions. If you build a well-scoped agent spec that others could reuse — a good test generator, a bug-triager, a changelog writer — open a PR to add it.

Specs that make good registry citizens:

Single-responsibility — do one thing, not five
Stable I/O schema — typed inputs, typed outputs, documented fields
Engine-agnostic — work across OpenAI, Claude, Grok, or local models
Minimal prompts — the less cleverness, the less breakage

Getting started

Add a delegated task to any existing spec:

tasks:
  summarise:
    spec: oa://prime-vector/summariser@1.0.0
    task: summarise

Run it:

oa run --spec .agents/my-pipeline.yaml --task summarise \
  --input '{"document":"..."}' --quiet

The runtime resolves the registry reference, loads the spec, runs the delegated task, and returns the result. No extra install. No framework to adopt.

Open Agent Spec is MIT-licensed and maintained by Prime Vector. The registry is open — we'd love to see what you build.
