Forem: Milky

AAoM-04: A Python 3.12 Interpreter

Milky — Fri, 30 Jan 2026 03:32:05 +0000

January 2026

Happy New Year! This entry covers moonpython: a Python interpreter in MoonBit. After about two weeks of vibe coding with Codex (GPT-5.2 & GPT-5.2-Codex), the project now runs a large, pragmatic subset of Python 3.12.

Toolchain Updates

This time I used Codex CLI with GPT-5.2 and kept three MoonBit skills active (same set as before: moonbit-lang, moonbit-agent-guide, moon-ide). Codex is slower than Claude but much more stable on long, reasoning-heavy tasks, a tradeoff that suits a language interpreter well.

moon-ide recently gained three commands that are especially handy in a growing interpreter codebase:

hover shows hover information for a symbol.
outline shows an outline of a specified file.
doc shows documentation for a symbol.

Codex can use these tools very skillfully.

Problem

A useful subset of Python needs to capture at least:

Dynamic semantics: scoping, closures, globals/nonlocals, and the descriptor model.
Generators, async/await, and exception handling.
Pragmatic import support and enough builtins to run real scripts.

I targeted Python 3.12 semantics, but intentionally skipped full stdlib parity, C extensions, packaging, and bytecode compatibility. The aim is correctness for a useful subset, a clean architecture, and repeatable testing. More specifically, moonpython is meant to be used as a library to run real Python snippets, not as a CPython replacement for large production-scale projects. Given that scope, a JIT is poor ROI: it usually demands deep, platform-specific optimization for each OS and architecture, and the effort is often 5-10x the cost of building the interpreter itself.

Runtime Reality Check

During implementation I kept adding new features: most builtins, async/await, generators, type hints, exception groups, and so on. Yet real-world Python projects still failed to run. The lesson was clear: the hardest part of an industrial-strength interpreter is not syntax coverage, but whether the runtime is dirty enough.

By "dirty," I mean the unglamorous compatibility details that real code quietly depends on: import caching rules, path search order, namespace packages, module metadata (__file__, __package__, __spec__), descriptor binding semantics, edge-case exception types, and even tiny differences in string/float formatting. A clean design is not enough; you have to copy CPython's weird corners. For example, __file__ is not part of "Python-the-language", but for file-backed modules it is a widely relied-upon convention that real projects assume exists. In practice, this means the import system must be almost CPython-identical, the object model has to match descriptor and attribute rules, and error messages must be stable. This is where most of the remaining work lies, not in adding new syntax nodes.
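
To make two of these details concrete, here is a minimal sketch, run under CPython itself, of the module metadata and import caching that an interpreter like moonpython would need to reproduce. The choice of json as the probe module is only an illustration; nothing here is moonpython code.

```python
import importlib
import json
import sys

# File-backed modules carry metadata that is not part of the grammar,
# yet real code reads it constantly (for example to locate data files).
metadata = {
    "name": json.__name__,
    "package": json.__package__,
    "file_is_set": isinstance(json.__file__, str),
    "spec_name": json.__spec__.name,
}

# Import caching: a second import returns the very same module object
# from sys.modules instead of executing the module body again.
again = importlib.import_module("json")
cached = sys.modules["json"] is json and again is json

print(metadata)
print("cached:", cached)
```

A compatible runtime has to populate all four fields and keep the identity guarantee, because libraries branch on them without ever saying so.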

Python has documentation and PEPs, but there is no single executable specification; on the messy edges (especially imports), compatibility is ultimately defined by what CPython does and what the ecosystem expects. This is also why I treat CPython behavior as the ground truth: it is at least verifiable, and that matters enormously when you are building with AI in the loop. The downside is that it pushes moonpython toward a lot of edge-case handling code, because matching a living ecosystem inevitably means handling its corners.

If you want a vivid reminder of how these "small" features accumulate into ecosystem constraints, Armin Ronacher’s classic post Revenge of the Types is a great read.

Approach

Tests Generation

Following the previous AAoM pattern, I built the test harness first. The script scripts/generate_spec_tests.py harvests a subset of snippets from CPython Lib/test, runs the snippets under a restricted builtins set and emits MoonBit snapshot tests into spec_generated_test.mbt. That single file contains 2,709 generated tests. Together with the (AI) hand-written tests, the suite currently has 2,894 tests.

A typical generated test looks like this:

test "generated/expr/0001" {
  let source =
    #|'This string will not include \
    #|backslashes or newline characters.'
  let result = Interpreter::new_spec().eval_source(source)
  let expected = "[\"ok\", [\"Str\", \"This string will not include backslashes or newline characters.\"], \"\", \"\"]"
  assert_run(result, expected)
}

Limitations are inevitable. The harvesting is heuristic and can miss real-world patterns. The sandboxed evaluator only allows a small builtins set and a few imports, so many library behaviors are simply out of scope. Snippets that take too long are skipped, and some values (like NaN/Inf or arbitrary objects) are deliberately excluded from serialization. Error reporting is normalized rather than bit-for-bit identical with CPython. I deliberately avoided generating an unbounded test set early on. Based on my experience writing a WebAssembly runtime, doing so would very likely have left moonpython passing literally zero tests at the start. When everything is red, the agent may over-engineer locally (especially in the parser) to satisfy a huge, noisy failure surface, and it becomes hard to enter an iterative development rhythm.

So I kept the generator constrained to produce a manageable bootstrap suite, which is enough to guide the first implementation and validate core semantics, but not so large that it prevents early wins. Once the interpreter becomes usable, the CPython Lib/test suite is the real endgame. Of course, I never expected the interpreter to pass it completely.
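
The generator's core loop can be sketched roughly as follows. This is a hedged illustration, not the actual scripts/generate_spec_tests.py: the allow-list SAFE_BUILTINS, the TAGS table, the "err" record shape and the helper names evaluate and emit_test are all assumptions, and only the "ok" record layout is copied from the generated test shown above. Restricting builtins this way limits what snippets can reach, but it is not a security sandbox.

```python
import json

# Assumption: a tiny allow-list standing in for the "restricted builtins set".
SAFE_BUILTINS = {"abs": abs, "len": len, "max": max, "min": min}
# Assumption: only a few value kinds are serialised; everything else is skipped.
TAGS = {str: "Str", int: "Int", bool: "Bool"}


def evaluate(source):
    """Evaluate one expression under CPython and normalise the outcome."""
    try:
        value = eval(source, {"__builtins__": SAFE_BUILTINS}, {})
    except Exception as exc:
        # Only the exception type is kept; messages are normalised away.
        return ["err", type(exc).__name__, "", ""]
    tag = TAGS.get(type(value))
    if tag is None:
        return None  # e.g. floats or arbitrary objects: not serialised
    return ["ok", [tag, value], "", ""]


def emit_test(name, source):
    """Render one MoonBit snapshot test, or None when the snippet is skipped."""
    outcome = evaluate(source)
    if outcome is None:
        return None
    # Dumping twice yields a quoted, escaped literal for the expected string.
    expected = json.dumps(json.dumps(outcome))
    body = "\n".join("    #|" + line for line in source.splitlines())
    return (
        f'test "{name}" {{\n'
        f"  let source =\n{body}\n"
        f"  let result = Interpreter::new_spec().eval_source(source)\n"
        f"  let expected = {expected}\n"
        f"  assert_run(result, expected)\n"
        f"}}\n"
    )


print(emit_test("generated/expr/demo", "len('abc') + 1"))
```

Because CPython computes the expected value, the generated file needs no hand-written expectations, which is what lets the suite scale to thousands of cases.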

Long-time Grinding

These two weeks were effectively 24/7 for Codex. I would check in a few times a day, but most of the time the agent was working on its own. In total, Codex initiated 104 commits during this period.

At the beginning of the conversation, we repeatedly discussed shaping the spec-driven test harness and stabilizing CPython evaluation under a restricted builtins set. Once the generator was in place, I let Codex run unattended and only stepped in when the failure rate stopped decreasing. Intervention was not about fixing individual bugs. I would interrupt Codex only when I noticed clear conceptual errors. One recurring example was a misunderstanding of how "import" fields in moon.pkg.json should be authored, which led Codex to apply the same incorrect pattern repeatedly. When that happened, I wrote the correct convention into AGENTS.md or a SKILL.md, and then restarted the agent using codex resume --last.

Results

The interpreter is a direct AST evaluator (which may not be the best choice; I will discuss this later).

The project layout:

lexer.mbt and parser.mbt implement a Python 3.12 grammar.
A dedicated spec file defines the public AST and value model.
runtime_*.mbt implements the runtime semantics, including builtins, scoping, exceptions, generators, and the object model.
cmd/main and cmd/repl provide a runner and a simple REPL. Besides, moonpython can also be used as a library.

More language features were implemented than I had expected:

Python 3.12 syntax for match, with, and full f-strings.
Generators (yield, yield from) and async (async def, await, async for/with, async generators).
Exception groups and except* (PEP 654), plus tracebacks with line/column spans.
Type parameter syntax (PEP 695) parsed and preserved in the AST (runtime no-op for now).
Core data model: big ints, floats, complex numbers, bytes/bytearray, lists/tuples/sets/dicts, slicing assignment.
A file-based import system and a vendored CPython Lib/ snapshot for pure-Python modules.

You can try some real world Python programs with moonpython:

All 2,894 tests pass. So, how many tests in CPython's Lib/test pass? Well, zero at the moment, because the suite currently aborts early due to missing support for variable annotations (PEP 526). This feature may already be supported by the time this post is published. Either way, passing the whole of Lib/test remains a long-term goal, and I fully expect it to land as the project matures.

Time Investment:

One day to find and download the de facto standards;
Then, about two weeks of active development, mostly spent on the runtime (scoping + generators/async) and on shrinking the long tail of test failures.

Reflections and Takeaways

The Implementation Code Is Far from Clean

As I said, useful tools have to be dirty because real-world problems are dirty. Python's import system is a great example: the runtime must be dirty because the behavior it has to reproduce is messy. Fortunately, this is exactly the kind of dirty work that AI is good at: it can grind through edge cases, keep the bookkeeping consistent, and iterate until imports behave like the ecosystem expects.

Is AST-walking Interpretation a Good Idea?

Another counterintuitive lesson is that a pure AST-walking interpreter is not always simpler than "compile to bytecode & run a bytecode VM".

When Codex first proposed this simple design, I didn't object because it was the most straightforward solution. But once you need suspend/resume semantics like generators and async/await, plus correct try/finally and fine-grained tracebacks, an AST interpreter has to handle defunctionalized continuations manually and often ends up re-creating a minimal VM anyway. Even worse, an AST-walking interpreter leaves no natural place to apply analyses and optimizations (that is why you want to analyze programs on an IR).
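
To illustrate why a bytecode VM makes suspension cheap, here is a toy sketch (in Python for brevity, and unrelated to moonpython's actual code). The opcode names and the Frame layout are invented for the example. The point is that a paused generator is nothing more than a frame whose program counter and stack are left untouched between calls.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    code: list  # list of (opcode, argument) pairs
    pc: int = 0
    stack: list = field(default_factory=list)
    local_vars: dict = field(default_factory=dict)
    done: bool = False


def resume(frame):
    """Run until the next YIELD or RETURN and hand back the yielded value."""
    while True:
        op, arg = frame.code[frame.pc]
        frame.pc += 1
        if op == "PUSH":
            frame.stack.append(arg)
        elif op == "LOAD":
            frame.stack.append(frame.local_vars[arg])
        elif op == "STORE":
            frame.local_vars[arg] = frame.stack.pop()
        elif op in ("ADD", "LT"):
            right = frame.stack.pop()
            left = frame.stack.pop()
            frame.stack.append(left + right if op == "ADD" else left < right)
        elif op == "JUMP":
            frame.pc = arg
        elif op == "JUMP_IF_FALSE":
            if not frame.stack.pop():
                frame.pc = arg
        elif op == "YIELD":
            return frame.stack.pop()  # pc and stack survive in the frame
        elif op == "RETURN":
            frame.done = True
            return None
        else:
            raise ValueError(f"unknown opcode {op}")


# Bytecode for: i = 0; while i < 3: yield i; i = i + 1
COUNT_TO_THREE = [
    ("PUSH", 0), ("STORE", "i"),
    ("LOAD", "i"), ("PUSH", 3), ("LT", None), ("JUMP_IF_FALSE", 13),
    ("LOAD", "i"), ("YIELD", None),
    ("LOAD", "i"), ("PUSH", 1), ("ADD", None), ("STORE", "i"),
    ("JUMP", 2),
    ("RETURN", None),
]

frame = Frame(COUNT_TO_THREE)
values = []
while True:
    value = resume(frame)
    if frame.done:
        break
    values.append(value)
print(values)
```

In an AST walker the equivalent state lives on the host language's call stack, so pausing in the middle of a loop body means capturing that stack by hand.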

Which Model Should We Choose?

I also used GPT-5.2-Codex, OpenAI's latest agentic coding model, for several days. But based on my observations, GPT-5.2-Codex is not always superior to GPT-5.2. In fact, for non-programming tasks, GPT-5.2 is clearly better than GPT-5.2-Codex. Overall, Codex (GPT-5.2) works well, though it usually takes several times longer than Claude (Opus-4.5) to complete easy tasks.

Although Codex is slow, its accuracy is significantly higher than Claude's, and it almost never causes rework. Claude, by contrast, often tries to tackle difficult problems through repeated trial-and-error; when an attempt fails, it frequently reverts all changes with git checkout, only to head back into the same dead end.

In this respect, Codex is highly reassuring. You can safely hand tasks over to it and step in only to update your SKILLs with necessary guidance. Once the rules were made explicit, Codex usually absorbed them and continued working productively with little further guidance. Btw, MoonBit's strong typing and predictable tooling turned out to be particularly helpful once the runtime logic grew large.

Multiple Codexes Work Together

In parallel, I was also using Codex to build other software. Increasingly, we do not need to micromanage AI. Given enough time and a solid test loop, the AI can make major progress with minimal human interaction. That suggests a better workflow: keep multiple AI sessions open in separate terminals and let them work concurrently.

In theory, with git worktree, having multiple Codexes develop the same project in parallel should not be hard either. The real cost is the merge: resolving conflicts is a heavy cognitive load, especially when the changes are large and cross-cutting. For now I am having each Codex develop a separate project and will only try multi-agent same-repo work when the payoff is clearer. At the moment, I am developing 5–8 projects in parallel with 5–8 coding agents. Next time, I will share some of the results.

Code is available on github.

AAoM-03: A WHATWG HTML5 Parser, Driven by Tests

Milky — Wed, 21 Jan 2026 08:48:42 +0000

December 2025

This post covers aaom-html: a WHATWG HTML5 parser implemented in MoonBit. As with the previous entries in this series, I treat the AI as a high-intensity pair programmer: I provide the goal, constraints, and review; Claude does the rapid implementation and iteration. The key was not "write a tokenizer and a tree builder first", but build the test harness first—after that, progress becomes a mostly automated grind of reducing failures to zero.

Skill

This time, I used a moonbit-library-builder skill distilled from my experience building parsers with agents during the last two AAoM posts. The header looks as follows:

---
name: moonbit-library-builder
description: "Build MoonBit libraries using spec-driven and test-driven development. Use when asked to implement a library, port a library from another language (JS, Rust, Go, Python), create a parser/compiler, or build any substantial MoonBit package. Triggers on requests like \"implement X in MoonBit\", \"port Y library to MoonBit\", \"create a Z parser\", or \"build a template engine\"."
---

The section titles are like:

## Workflow

### Phase 1: Gather Specs

### Phase 2: Write Tests First

### Phase 3: Implement Incrementally

## Testing Patterns

I also annotate Common Pitfalls in parsers:

## Common Pitfalls
- **String indexing**: `s[i]` returns `UInt16`, not `Char`. Use `for char in str` for char access
- **Mutable fields**: Use `mut` keyword in struct field declarations

Approach

HTML5 is not hard because of syntax. It’s hard because of browser-grade error recovery: insertion modes, the adoption agency algorithm, foster parenting, foreign content (SVG/MathML), and a lot of stateful edge cases. It has no context-free grammar. Instead, it needs an extremely complicated state machine to tokenize and parse. Without a canonical test suite, you end up guessing.

My initial command was simple: "Implement a HTML5 spec‑compliant parser with WHATWG state machine, malformed input recovery". This command led Claude to make a plan for test suite generation.

Unfortunately, the number of tests was huge (~8k) and the test-gen script ran out of memory. Claude split the tests into batches of up to 500 tests each. I knew there were better ways to overcome this problem, but since Claude had already solved it, I didn't ask for any further changes. After all, the scripts are not vital to this project.

As a result, 14 tokenizer test files and 4 tree test files were generated. Since each test in html5lib-tests already includes expected output, e.g. for tokenizing,

{"description":"PLAINTEXT content model flag",
"initialStates":["PLAINTEXT state"],
"lastStartTag":"plaintext",
"input":"<head>&body;",
"output":[["Character", "<head>&body;"]]},

and for trees:

#data
<a><p></a></p>
#errors
(1,3): expected-doctype-but-got-start-tag
(1,10): adoption-agency-1.3
#document
| <html>
|   <head>
|   <body>
|     <a>
|     <p>
|       <a>

we don’t re-run a reference parser to compute expected trees—the .dat files already contain canonical expected output. The job is to make our doc.dump() stable and compatible with those expectations.

// for tokenizing
test "html5lib/tokenizer/namedEntities_bad_named_entity_hat_without_a_semicolon_492" {
  let (tokens, _) = @html.tokenize("&Hat")
  inspect(
    tokens,
    content="[Character('&'), Character('H'), Character('a'), Character('t'), EOF]",
  )
}

// for trees
test "html5lib/tree/adoption01_2" {
  let doc = @html.parse("<a>1<button>2</a>3</button>")
  inspect(
    doc.dump(),
    content=(
      #|<html>
      #|  <head>
      #|  <body>
      #|    <a>
      #|      "1"
      #|    <button>
      #|      <a>
      #|        "2"
      #|      "3"
    ),
  )
}
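
For readers unfamiliar with the .dat layout shown earlier, here is a hedged sketch of how one tree-construction case can be split into its sections. The helper name parse_dat_case is hypothetical, and the sketch simplifies the real format: it treats every line that starts with "#" as a section header, which would misread input data that itself begins with "#".

```python
def parse_dat_case(text):
    """Split one html5lib tree-construction case into named sections."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("#"):
            current = line[1:]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


CASE = "\n".join([
    "#data",
    "<a><p></a></p>",
    "#errors",
    "(1,3): expected-doctype-but-got-start-tag",
    "(1,10): adoption-agency-1.3",
    "#document",
    "| <html>",
    "|   <head>",
    "|   <body>",
    "|     <a>",
    "|     <p>",
    "|       <a>",
])

sections = parse_dat_case(CASE)
# Dropping the "| " prefix gives the cleaner snapshot form used in the tests.
expected_dump = [line[2:] for line in sections["document"]]
print(sections["data"][0])
print("\n".join(expected_dump))
```

The section named document is what a generated snapshot test compares against the parser's dump.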

Meanwhile, Claude also wrote a Python script to generate the entity table, which impressed me. Briefly: HTML5 defines 2,231 named character references such as &amp; and &NotEqualTilde;. Claude downloaded the official list from https://html.spec.whatwg.org/entities.json and transformed it into entities.mbt:

fn init_entities() -> Map[String, Array[Int]] {
  let m : Map[String, Array[Int]] = Map::new()
  m["AElig"] = [198]
  m["AElig;"] = [198]
  ...
}
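
The transformation can be sketched in a few lines of Python. This is an illustration, not Claude's script: SAMPLE is a three-entry excerpt in the shape of entities.json (keys keep the leading "&", values hold a codepoints list), the helper name emit_entities is hypothetical, and the trailing "m" return line is an assumption about the part elided above.

```python
# Excerpt in the shape of entities.json; the real file has 2,231 entries.
SAMPLE = {
    "&AElig": {"codepoints": [198]},
    "&AElig;": {"codepoints": [198]},
    "&NotEqualTilde;": {"codepoints": [8770, 824]},
}


def emit_entities(entities):
    """Render the entity table as MoonBit source text."""
    lines = [
        "fn init_entities() -> Map[String, Array[Int]] {",
        "  let m : Map[String, Array[Int]] = Map::new()",
    ]
    for name in sorted(entities):
        key = name[1:]  # drop the leading "&"
        codepoints = ", ".join(str(cp) for cp in entities[name]["codepoints"])
        lines.append(f'  m["{key}"] = [{codepoints}]')
    lines.append("  m")  # assumption: the function returns the map
    lines.append("}")
    return "\n".join(lines)


generated = emit_entities(SAMPLE)
print(generated)
```

Note that some references expand to two code points, which is why the map's values are arrays rather than single integers.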

Generating tests is just the start. The time sink is making output stable so the suite is usable:

Remove the | prefix from tree dumps (cleaner snapshots)
Stable attribute sorting (lexicographic)
Escaping control characters and C1 controls
Matching MoonBit inspect escaping rules (\b, \u{0X}, soft hyphen, noncharacters, etc.)
Using #| multi-line strings for readable expected trees

I can hardly imagine how painful it would be to debug the script by myself on this.
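
As an illustration of the "stable output" items listed above, here is a minimal sketch of a dump routine with lexicographic attribute ordering and two-space indentation. The Node class and dump function are hypothetical stand-ins, and the escaping rules from the list are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str = ""   # empty tag means a text node
    text: str = ""
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def dump(node, depth=0):
    """Return the dump as a list of lines, one node or attribute per line."""
    pad = "  " * depth
    if not node.tag:
        return [f'{pad}"{node.text}"']
    lines = [f"{pad}<{node.tag}>"]
    # Sorting makes the snapshot independent of source attribute order.
    for name in sorted(node.attrs):
        lines.append(f'{pad}  {name}="{node.attrs[name]}"')
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


link = Node(tag="a", attrs={"href": "x", "class": "y"}, children=[Node(text="1")])
lines = dump(link)
print("\n".join(lines))
```

Without the sort, two equivalent documents could produce different snapshots, and every such difference shows up as a spurious test failure.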

Once the entity definitions and conformance tests were done, the rest was straightforward. With the command "Continue until finishing all 8251 tests", Claude worked continuously for nearly 6 hours and submitted 23 commits. When I checked on his progress again, he had already passed 8244/8251 tests. But sadly, he wasted nearly 3 hours on the final 7 tests without solving a single one. He was going in circles, stuck on the same point: writing new code, finding no solution, deleting the code, and repeating the same useless attempts. I asked Claude to think deeper, but in vain.

I suddenly remembered something I had heard recently: GPT-5.2 was reportedly better at programming than Opus-4.5. I thought, why not let Codex (GPT-5.2) try to solve these seven edge cases? So I launched Codex, selected the GPT-5.2 model, and chose the extra-high reasoning level. It must be said that he thought very slowly. But after about ten minutes he came up with a very clear solution, and he solved the problem in another ten minutes. The value wasn't "more code faster", but a cleaner causal chain through the state machine (stack traces -> tokenizer/tree builder trigger -> specific transition), which led quickly to a fix. That intelligence surprised me.

Results

8251 tests (including conformance tests and smoke tests) passed.

Full WHATWG HTML5 specification compliance
80 tokenizer states
25 tree construction insertion modes
49 parse error types with graceful recovery
2,231 named character references

Reflections

The last 1% is reasoning-heavy. What surprised me most was GPT-5.2's ability to handle difficult problems. After completing this AAoM, I used GPT-5.2 to assist with other projects originally developed with Claude. In contrast, Codex didn't have the forgetfulness that Claude had, as I mentioned before. However, Codex's user experience, speed, and compliance are far inferior to Claude's. The best practice I've found so far is to have Claude do the actual work, and then have Codex review it simultaneously, which works very well.

Time investment: ~7 hours of active development:

2 minutes: Elaborate the moonbit-library-builder skill.
~3 hours: Without human intervention, download html5lib-tests, write the test-gen scripts, implement the basic features and pass most tests.
~3 hours: Try to handle the remaining 7 edge cases, and fail.
~0.5 hours: GPT-5.2 helps solve the remaining 7 edge cases.

Code is available on github.

I haven't decided what AAoM Day 4 will be yet, but I’m increasingly convinced: for spec-heavy projects, the best use of AI isn't "write everything", it’s "make it converge inside a strong testing loop". Maybe I will try something different like interpreters for a subset of ECMAScript 2025 or Python3.

AAoM-02: XML Parser with W3C Conformance

Milky — Wed, 14 Jan 2026 02:32:50 +0000

December 2025

Continuing the Agentic Adventures of MoonBit series, this time I tackle XML parsing. The goal: build a streaming XML parser that passes the official W3C XML Conformance Test Suite.

Skill

I'm still using Claude Code (Opus 4.5) with the MoonBit system prompt and IDE skill. In addition, I created a new skill named moonbit-lang to make the AI aware of the best practices and common pitfalls of the MoonBit language. The header looks as follows:

---
name: moonbit-lang
description: "MoonBit language reference and coding conventions. Use when writing MoonBit code, asking about syntax, or encountering MoonBit-specific errors. Covers error handling, FFI, async, and common pitfalls."
---

# MoonBit Language Reference

@reference/fundamentals.md
@reference/error-handling.md
@reference/ffi.md
@reference/async-experimental.md
@reference/package.md
@reference/toml-parser-parser.mbt

In this skill doc, I also mention the official file I/O package moonbitlang/x/fs which AI is not familiar with. The complete skill doc and references can be accessed at github, where I will continuously update the skills I use.

AI (both Codex and Claude) will only read the description at startup, and then read the rest when needed. Even so, I keep the skill doc simple because based on my experience, any document with excessively long content will hinder the AI's ability to understand the details.

Problem

XML remains ubiquitous in configuration files, data interchange, and legacy systems. A conformant XML parser must handle:

Element tags, attributes, and namespaces
Entity references (&lt;, &amp;, custom entities)
CDATA sections and comments
Processing instructions and XML declarations
DTD internal subsets with entity declarations

My goal is to implement XML 1.0 with namespaces, entities and DTD. The challenge is that XML has many edge cases specified in the W3C standard. Rather than guessing what's correct, I use the official test suite as ground truth.

Tests Generation

First, I download (effectively, let Claude download) the official W3C XML Conformance Test Suite:

curl -L -o xmlts.tar.gz "https://www.w3.org/XML/Test/xmlts20130923.tar.gz"
tar -xzf xmlts.tar.gz && mv xmlconf . && rm xmlts.tar.gz

It contains lots of valid and not-well-formed XML documents. I had Claude write a script, generate_conformance_tests.py, that generates MoonBit snapshot tests from the official test suite.

How to get the expected snapshot contents? Initially, I had Claude use quick-xml (Rust) as the reference parser. This worked for most tests, but quick-xml is intentionally lenient in some cases where strict XML compliance requires rejection. After hitting 23 test failures due to leniency differences, I switched to libxml2 (via Python's lxml) as the reference. libxml2 is the de-facto standard XML parser and matches W3C conformance closely.
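
The role of the reference parser can be sketched as follows. Note the substitution: to stay dependency-free, this sketch uses the standard library's xml.etree.ElementTree (backed by expat) rather than lxml/libxml2, so its verdicts may differ from libxml2's on some edge cases. The helper name is_well_formed is hypothetical; the two documents are the ones used in the generated tests shown in this post.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Ask the reference parser whether a document is well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


VALID = "<!DOCTYPE doc [\n<!ELEMENT doc (#PCDATA)>\n]>\n<doc></doc>\n"
NOT_WF = "<doc>\n<doc\n?\n<a</a>\n</doc>\n"

verdicts = {"valid_sa_001": is_well_formed(VALID),
            "not_wf_sa_001": is_well_formed(NOT_WF)}
print(verdicts)
```

A generator built this way records the reference verdict as the expected snapshot, so a lenient reference silently produces lenient expectations, which is the problem that motivated the switch described above.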

Finally, the generated tests looked like this:

// valid
test "w3c/valid/valid_sa_001" {
  // Test demonstrates an Element Type Declaration with Mixed ...
  let xml = "<!DOCTYPE doc [\n<!ELEMENT doc (#PCDATA)>\n]>\n<doc></doc>\n"
  let reader = Reader::from_string(xml)
  let events : Array[Event] = []
  for {
    match reader.read_event() {
      Eof => {
        events.push(Eof)
        break
      }
      event => events.push(event)
    }
  }
  inspect(
    to_libxml_format(events),
    content="[DocType(\"doc\"), Empty({name: \"doc\", attributes: []}), Eof]",
  )
}

// not well formed
test "w3c/not-wf/not_wf_sa_001" {
  // Attribute values must start with attribute names, not "?".
  let xml = "<doc>\n<doc\n?\n<a</a>\n</doc>\n"
  let reader = Reader::from_string(xml)
  let has_error = for {
    try reader.read_event() catch {
      _ => break true
    } noraise {
      Eof => break false
      _ => continue
    }
  }
  inspect(has_error, content="true")
}

A total of 735 tests were generated, comprising 14k lines of code. Including other tests written by Claude afterwards, the total number of tests is 800.

Parser Implementation

Since quick-xml was the initial implementation reference, Claude followed a pull-parser architecture inspired by quick-xml, which I thought was OK for our goal. The APIs look like this:

let reader = @xml.Reader::from_string(xml)
for {
  match reader.read_event() {
    Eof => break
    Start(elem) => println("Start: \{elem.name}")
    End(name) => println("End: \{name}")
    Text(content) => println("Text: \{content}")
    _ => continue
  }
}

Since lxml returns a tree structure while our parser emits events, I had Claude implement a to_libxml_format function that transforms our event stream to match lxml's output format exactly. This made test comparison straightforward.

It took about 4 hours without human intervention (except Please continue) to accomplish the basic parts. The most complex feature was DTD (Document Type Definition) parsing and validation. I used Claude's plan mode to structure the implementation. Here is the plan summary:

After about 1 hour, DTD was implemented and 726 tests passed. But it took 3 more hours to handle edge cases including entity value expansion, text splitting details and UTF-8 BOM handling.

Results

In the end, 800 W3C conformance tests passed. Note that 59 tests were skipped by the test-gen script: some were valid but rejected by lxml, while the others were not well-formed but accepted by lxml. The script classified these tests as "lxml implementation quirks". Since these edge cases were overly complicated, I didn't carefully check whether they were really caused by lxml quirks. The remaining 800 tests were sufficient anyway.

So this library supports:

XML 1.0 + Namespaces 1.0
Pull-parser API for memory-efficient streaming
Writer API for XML generation
DTD support with entity expansion

Reflections

What Worked Well? Using an official test suite was invaluable. The W3C conformance tests cover edge cases I would never have thought to test manually: obscure character references, DTD quirks, namespace handling, and more. Switching the reference implementation when needed also paid off: quick-xml's leniency is a feature for its users but a problem for conformance testing, and libxml2 provided the strict reference I needed. Finally, plan mode kept Claude organized on complex features like DTD parsing. Without it, Claude would jump between fixing different issues without completing any.

The main problem I met was that Claude was prone to modifying tests instead of fixing bugs. This was a recurring issue. When tests failed, Claude would often:

Modify test expectations to match incorrect output
Update the test generator to skip failing tests
Suggest marking tests as "lenient" and skip them rather than fixing the parser

I had to repeatedly redirect: "Update the MoonBit implementation, not the tests."

Moreover, forgetting project conventions was common. Claude would forget to use the moon-ide skill for code navigation, or use anti-patterns like match (try? expr) instead of try/catch/noraise. Adding these to CLAUDE.md helped but didn't eliminate the issue. I searched for this issue in the community (reddit link) and found that this might be a bug in Opus 4.5 and Sonnet 4.5. I hope it will be fixed in the near future.

In my future work, I may need to implement or port a large number of parsers. I think I need to turn my experience in writing these parsers and creating test generation scripts based on standards into reusable skills or commands. Perhaps we will see the benefits next time.

Time investment: 10+ hours of active development:

2 hours: Collaborative exploration of how to write the expected test generation script.
4 hours: Without human intervention, implement the basic features.
1 hour: Plan and implement DTD, namespaces and entities.
3 hours: Handle edge cases (fix 17 test failures)

Code is available on github.

AAoM-01: Pug Template Engine

Milky — Wed, 07 Jan 2026 06:33:59 +0000

December 2025

Let's kick off the Agentic Adventures of MoonBit series with a complete Pug template engine. I'm using Claude Code (Opus 4.5) with the MoonBit system prompt and IDE skill. The workflow is simple: I describe what I want, Claude writes the code, and I review and refine.

Claude excels at bootstrapping MoonBit projects with moon new, understanding the package structure (moon.mod.json, moon.pkg.json), and generating idiomatic code after a few iterations.

Problem

HTML is verbose. Writing nested structures by hand is tedious and error-prone. Pug (formerly Jade) solves this with a clean, indentation-based syntax:

doctype html
html
  head
    title My Site
  body
    h1#greeting.hero Hello, World!

The goal: implement a Pug-to-HTML compiler in MoonBit that supports the full specification—tags, attributes, interpolation, conditionals, loops, mixins, includes, and extends.

Approach

I had Claude read all specifications from pugjs.org and write corresponding tests first.

The tests looked like:

test "case with fall through" {
  let pug =
    #|case num
    #|  when 0
    #|  when 1
    #|  when 2
    #|    p Small number
    #|  default
    #|    p Large number
  let locals = Locals::new()
  locals.set("num", "1")
  let html = render_with_locals(pug, locals)
  inspect(html, content="<p>Small number</p>")
}

There were 153 tests spread across 18 blackbox test files. This test-driven approach caught edge cases early.

When Claude started to implement the library, I reviewed the tests. The implementation follows a classic compiler architecture: lexer.mbt, parser.mbt and render.mbt.

After the core features worked, Claude stopped and told me that features like includes/extends remained unimplemented because it did not know how to perform file system operations. I introduced @moonbitlang/x/fs for file system access. Claude's first API was awkward:

let registry = TemplateRegistry::new()
registry.register("includes/head.pug", @fs.read_file_to_string("example/includes/head.pug"))
registry.register("includes/foot.pug", @fs.read_file_to_string("example/includes/foot.pug"))
let html = render_with_registry(source, Locals::new(), registry)

I pointed this out and asked for improvement. The result was much cleaner:

let html = render_file("example/index.pug")

The render_file function now automatically discovers and loads dependencies, resolving paths relative to the input file.
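
The discovery step can be sketched roughly like this (in Python, as a model of the idea rather than the MoonBit implementation). The regular expression, the discover helper and the in-memory FILES table are all assumptions made for the example; real Pug also allows omitting the .pug extension, which this sketch ignores.

```python
import posixpath
import re

# Matches "include path" and "extends path" at the start of a line.
DEPENDENCY = re.compile(r"^[ \t]*(?:include|extends)[ \t]+(\S+)", re.MULTILINE)


def discover(entry, read):
    """Load `entry` and everything it includes or extends, transitively."""
    loaded = {}
    pending = [posixpath.normpath(entry)]
    while pending:
        path = pending.pop()
        if path in loaded:
            continue  # already loaded; also guards against include cycles
        loaded[path] = read(path)
        base = posixpath.dirname(path)
        for target in DEPENDENCY.findall(loaded[path]):
            # Paths are resolved relative to the file that mentions them.
            pending.append(posixpath.normpath(posixpath.join(base, target)))
    return loaded


# Hypothetical in-memory file system standing in for real file reads.
FILES = {
    "example/index.pug": "extends layout.pug\nblock content\n  include includes/head.pug\n",
    "example/layout.pug": "html\n  block content\n",
    "example/includes/head.pug": "title My Site\n",
}

templates = discover("example/index.pug", FILES.__getitem__)
print(sorted(templates))
```

Passing the reader as a callback keeps the traversal testable without touching the disk, while a real render_file would plug in an actual file read.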

The hardest challenge is JS expression interpolation like #{msg.toUpperCase()}. To support it, the template engine needs to evaluate JS expressions. We can directly use JS FFI for the JS backend. But I also want this library to be usable on other backends (MoonBit compiles to multiple backends including WebAssembly, native and JavaScript). While Claude was trying to implement a comprehensive JS interpreter, I interrupted him and told him to use extern "js" fn eval(expr : String) -> String = "(expr) => eval(expr)" (not exactly this, but roughly the idea) for the JS backend, and to abort with an "only supported on the JS backend" message on non-JS backends.

The final task was to implement a command-line interface. Claude did the design and implementation, and I did not give any suggestions on this task. I ran the CLI with different inputs quite a few times and made sure the output made sense.

Results

A fully functional Pug template engine with:

Tags, IDs, classes, and attributes
Nested elements via indentation
String interpolation (#{} and !{})
Conditionals (if, else if, else, unless)
Iteration (each, while)
Mixins with parameters and blocks
Template inheritance (include, extends, block)
CLI with JSON locals and directory processing

All tests pass: 140 general tests and 13 js-only tests.

CLI example session:

$ echo '{"name": "World"}' > data.json
$ echo 'h1 Hello #{name}!' > greeting.pug
$ moon run cmd/main -- -O data.json greeting.pug
<h1>Hello World!</h1>

Reusable templates with the compile API:

test "compile api" {
  // Compile template once
  let template = @pug.compile("p Hello #{name}!")

  // Render with different locals
  let locals1 = @pug.Locals::new()
  locals1.set("name", "Alice")
  inspect(template.render(locals1), content="<p>Hello Alice!</p>")
  let locals2 = @pug.Locals::new()
  locals2.set("name", "Bob")
  inspect(template.render(locals2), content="<p>Hello Bob!</p>")
}

The API now is very clear and convenient to use, just like the official pug implementation.

Reflections

Test-driven development works very well. Having Claude read pugjs.org docs and write tests first caught issues early. MoonBit's pattern matching makes AST processing clean and exhaustive.

The challenges lie in MoonBit features or conventions that Claude is not very clear about. For example, Claude does not know the conventional way to access the file system and is prone to writing anti-patterns like match (try? expr) { Ok(_) => ...; Err(_) => ...}, which may be due to the Opus model's unfamiliarity with the latest MoonBit syntax and best practices. This information has been added to my commonly used moonbit-lang SKILL, and its effectiveness will be tested in the subsequent AAoM series.

A practical way I have found to improve the project is to ask Claude "What Pug features are still missing?" Claude will explore the whole library, list the possible missing features, write a todo list, and then implement them. I only need to check one last time to make sure the tests match the examples on the official pug website.

The process takes roughly 6 hours of active development:

several minutes to generate the tests,
nearly 3 hours for core features,
1 hour for include/extends,
2 hours for JS interpolation and several minutes for CLI.

Code is available on github.

Introduction to Agentic Adventures of MoonBit

Milky — Wed, 07 Jan 2026 06:30:49 +0000

Motivation

Inspired by Anil Madhavapeddy's Agentic Adventures, this project explores what it's like to build practical libraries in MoonBit with AI assistance.

MoonBit is a modern programming language designed for cloud computing and AI coding.

Goals

Build useful libraries - Each entry will document the creation of a practical MoonBit library.
Explore AI-assisted workflows - Document how agentic programming integrates with MoonBit's toolchain.
Share learnings - Be transparent about what works, what doesn't, and lessons learned.

How it works

Each blog post will document:

The problem being solved
The specification and design process
Interactions with Coding Agent (such as Claude Code) during development
Code review notes and refinements
Final library usage and examples

Contents:

AAoM-01: Pug Template Engine
AAoM-02: XML parser
AAoM-03: HTML5 parser
AAoM-04: Python interpreter