<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ion</title>
    <description>The latest articles on Forem by Ion (@ionionascu).</description>
    <link>https://forem.com/ionionascu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F58537%2Faedcf365-44e0-4643-9cf6-e4c4e40a4513.png</url>
      <title>Forem: Ion</title>
      <link>https://forem.com/ionionascu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ionionascu"/>
    <language>en</language>
    <item>
      <title>I Built a Small Agentic Trading App. It Cost Me a Few Hours and About $0.20 Per Run.</title>
      <dc:creator>Ion</dc:creator>
      <pubDate>Fri, 13 Mar 2026 17:09:53 +0000</pubDate>
      <link>https://forem.com/ionionascu/i-built-a-small-agentic-trading-app-it-cost-me-a-few-hours-and-about-020-per-run-3ikj</link>
      <guid>https://forem.com/ionionascu/i-built-a-small-agentic-trading-app-it-cost-me-a-few-hours-and-about-020-per-run-3ikj</guid>
      <description>&lt;p&gt;I’ve been writing code for many years now. Not as much these days, but I started when I was eight. With BASIC. After this many years, you develop a particular mindset. It helps sometimes and causes problems other times. When you see a problem, you immediately think about writing a solution. But you should first investigate what’s already out there.&lt;/p&gt;

&lt;p&gt;So when I see articles and hear people talking about AI agents and orchestration frameworks and multi-agent architectures and all that, my first instinct is: if you know code, you just need a few lines, a couple of API calls, and an LLM. That’s it. That’s your agent.&lt;/p&gt;

&lt;p&gt;I wanted to test that instinct. So I mixed it with something I actually care about: trading.&lt;/p&gt;

&lt;h2&gt;What I Built&lt;/h2&gt;

&lt;p&gt;The idea was simple. I wanted an app that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gets some input from me (optionally)&lt;/li&gt;
&lt;li&gt;Calls a few APIs to gather market data and news&lt;/li&gt;
&lt;li&gt;Sends everything to an LLM with the right context&lt;/li&gt;
&lt;li&gt;Saves the output for future runs (a basic form of memory)&lt;/li&gt;
&lt;li&gt;Sends me the result via email and push notification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I am not comfortable enough to let it place or cancel orders automatically. Maybe later. For now, &lt;strong&gt;I just want it to think and tell me what it thinks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And what it tells me is pretty concrete. The output of each run includes a specific recommendation: WAIT, BUY at a particular price (or NOW), SELL at a particular price (or NOW), amend existing orders, and so on. It’s not a vague “the market looks bullish” — it’s an actionable next step with reasoning attached. I still make the final call, but the model does the legwork of pulling everything together into a decision.&lt;/p&gt;

&lt;p&gt;To keep things simple, I used Claude models. I already have a Pro account, so I just activated the API and topped it up. Then I asked Claude Code (Opus) to help me scaffold the whole thing. A Python app using LangChain, calling some APIs, talking to Sonnet, saving output, emailing me, sending push notifications.&lt;/p&gt;

&lt;p&gt;I also put everything in a Docker container and deployed it to my local K3s cluster running on a Raspberry Pi 5. It runs as a set of cronjobs: once before the market opens, and a few times during market hours.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdes7u51n3mgwoyq20en8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdes7u51n3mgwoyq20en8.png" alt="The entire infrastructure. It fits in my palm." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It does stuff. It actually works.&lt;/p&gt;

&lt;p&gt;Now, I should be upfront: I am not a professional trader. I trade from time to time, and I am certainly not a millionaire. So I have to admit that part of it also feels a little bit speculative. But here’s the thing — the summary the LLM provides in its output, the rationale behind the decision, actually makes sense. When I have time to read the news myself (actual news, not AI-generated summaries), the reasoning lines up. It’s not just pulling recommendations out of thin air. It reads the same data I would read, and it reaches conclusions I can follow. That doesn’t make it right every time, but it makes it auditable, which is more than I can say for my gut feeling at 7 AM.&lt;/p&gt;

&lt;h2&gt;The Implementation Details&lt;/h2&gt;

&lt;p&gt;Let me go a bit deeper into what’s under the hood.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why LangChain?&lt;/strong&gt; Partly to learn it, and partly because I actually want to extend the solution. For example, in the future I could use Sonnet as a faster and cheaper model for the first pass — a quick sanity check — and then, if an actual trade decision is needed, escalate to Opus for verification. LangChain gives me a reasonable abstraction layer for that kind of routing. I also used Tavily for news data, configured as a tool inside LangChain. But I want to be clear — I could have done all of this without it. LangChain is a convenience here, not a requirement.&lt;/p&gt;
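&lt;p&gt;A rough sketch of what that routing could look like. The &lt;code&gt;ask()&lt;/code&gt; stub stands in for the real LangChain or Anthropic API call, and the escalation rule is just an illustration, not the exact logic I run:&lt;/p&gt;

```python
# Two-tier routing idea: a cheaper first pass, escalating to a
# stronger model only when an actual trade decision is on the table.
# ask() is a stub for a real LLM call; model names are illustrative.

def ask(model: str, prompt: str) -> str:
    # Placeholder for an actual LLM call.
    return "WAIT: no significant news since the last run."

def run_with_escalation(context: str) -> dict:
    first_pass = ask("sonnet", "Quick sanity check:\n" + context)
    # Only escalate when the cheap pass proposes an actual trade.
    needs_verification = any(
        signal in first_pass for signal in ("BUY", "SELL", "AMEND")
    )
    if needs_verification:
        verdict = ask("opus", "Verify this recommendation:\n" + first_pass)
        return {"model": "opus", "output": verdict}
    return {"model": "sonnet", "output": first_pass}
```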

&lt;p&gt;&lt;strong&gt;Dynamic user prompt.&lt;/strong&gt; The prompt I send to the model has several dynamic variables that get filled in at runtime: current position, pending orders, historical orders. To populate these, the app first makes API requests to my broker. This is just plain REST — no fancy orchestration needed.&lt;/p&gt;
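&lt;p&gt;In code, that assembly step is nothing exotic. The field names and values below are invented for illustration; the real data comes from the broker's REST endpoints:&lt;/p&gt;

```python
# Minimal version of the runtime prompt assembly: broker state is
# fetched over plain REST (stubbed here), then string templating.
from string import Template

PROMPT = Template(
    "Current position: $position\n"
    "Pending orders: $pending\n"
    "Recent order history: $history\n"
    "Decide: WAIT, BUY at price, or SELL at price. Explain briefly."
)

def fetch_broker_state() -> dict:
    # In the real app these come from three separate REST endpoints.
    return {
        "position": "120 shares @ 41.30",
        "pending": "none",
        "history": "sold 50 @ 43.10 on Tuesday",
    }

user_prompt = PROMPT.substitute(fetch_broker_state())
```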

&lt;p&gt;&lt;strong&gt;No historical price data (yet).&lt;/strong&gt; One gap worth mentioning: my broker doesn’t expose historical price data through its API, and I don’t want to pay a separate data provider for it. I’ll find a solution eventually, but for now the prompt doesn’t rely heavily on price charts or technical indicators. Instead, it leans on analysis and context from other sources on the internet — news, market commentary, analyst takes — which Tavily pulls in. It’s a trade-off, but it keeps things simple and cheap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local override mode.&lt;/strong&gt; When the app runs locally (not as a cronjob), I have the option to feed it manual input. I can paste in some news I found interesting, or force a particular instrument price so the app doesn’t need to look one up. This turned out to be surprisingly useful for testing and for those moments when I want to say “forget the current price, assume it’s at X — what would you do?”&lt;/p&gt;
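&lt;p&gt;Under the hood the override mode is just a dict merge: fetch the live context as usual, then let anything I typed in win. The keys here are illustrative:&lt;/p&gt;

```python
# Local override mode as a dict merge: live data is fetched as
# usual, then manual input takes precedence over fetched values.

def build_context(live: dict, overrides: dict) -> dict:
    # Later keys win, so overrides shadow live data.
    return {**live, **overrides}

live = {"price": 42.17, "news": "fetched via Tavily"}
ctx = build_context(live, {"price": 39.00})  # "assume it's at 39"
```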

&lt;p&gt;&lt;strong&gt;Memory.&lt;/strong&gt; I don’t save the whole LLM output. The system prompt has multiple steps, and one of them explicitly asks the model to produce a “run summary” — a condensed version of its reasoning and recommendation, designed to be useful as context on the next run. That summary gets saved to a file. On the next run, it’s loaded and injected into the prompt. So the model can see what it concluded last time, what it recommended, and whether conditions have changed. No vector database, no embeddings, no retrieval pipeline. Just a file with a structured summary that the model itself wrote for its future self.&lt;/p&gt;
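&lt;p&gt;The whole memory mechanism fits in a dozen lines. This sketch uses JSON and a temp-file path; both are arbitrary choices for illustration, not the exact shape of my summaries:&lt;/p&gt;

```python
# File-based memory: save the model-written run summary, load it on
# the next run. No vector store, no embeddings, just JSON on disk.
import json
import tempfile
from pathlib import Path

MEMORY_FILE = Path(tempfile.gettempdir()) / "last_run_summary.json"

def save_summary(summary: dict) -> None:
    MEMORY_FILE.write_text(json.dumps(summary, indent=2))

def load_summary() -> dict:
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text())
    return {"note": "first run, no prior context"}

save_summary({"recommendation": "WAIT", "reason": "earnings on Friday"})
previous = load_summary()
```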

&lt;p&gt;&lt;strong&gt;Notifications.&lt;/strong&gt; I use my own Gmail account to send emails to myself. For push notifications, I use my Home Assistant instance — it already knows how to push to my phone, so I just call its API. Simple, reliable, already running.&lt;/p&gt;
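&lt;p&gt;The Home Assistant part is a single POST to its REST notify service. The base URL, service name, and token below are placeholders; only the request construction is shown, the send itself is commented out:&lt;/p&gt;

```python
# Push via Home Assistant's REST API: POST to
# /api/services/notify/SERVICE with a bearer token. All identifying
# values here are placeholders.
import json
import urllib.request

def build_push(base_url: str, service: str, token: str, message: str):
    url = f"{base_url}/api/services/notify/{service}"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"title": "Trading agent", "message": message})
    return urllib.request.Request(url, data=body.encode(), headers=headers)

req = build_push(
    "http://homeassistant.local:8123", "mobile_app_phone",
    "LONG_LIVED_TOKEN", "WAIT: nothing to do today.",
)
# urllib.request.urlopen(req)  # actually send it
```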

&lt;h2&gt;What It Costs&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5ti4gxv95i2rszkmtbp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5ti4gxv95i2rszkmtbp.png" alt="Total cost per run. Less than a bad coffee." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each run costs about $0.20. That’s the Claude API usage for one cycle of: ingest context, read news, reason about the position, produce a recommendation. For the news data, I use Tavily’s free Researcher tier, which should be more than enough for the number of monthly runs I need at the moment. So the only real cost is the LLM.&lt;/p&gt;

&lt;p&gt;A proper trader would probably want to optimise this. Maybe use a local model for some steps. Maybe look for a model specifically fine-tuned for financial reasoning. Maybe care about latency.&lt;/p&gt;

&lt;p&gt;But for me? I just want to have fun and learn. $0.20 per run is a rounding error compared to what you’d lose on a single bad trade made on gut feeling at 7 AM before coffee.&lt;/p&gt;

&lt;h2&gt;The Architecture (If You Can Call It That)&lt;/h2&gt;

&lt;p&gt;Let me sketch what this looks like end to end:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowt9m7bnoincdi6s2vc3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowt9m7bnoincdi6s2vc3.png" alt="Just a few steps. Most of them are just API calls and string concatenation." width="800" height="729"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cronjob triggers&lt;/strong&gt; the Docker container on K3s (Raspberry Pi 5).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broker API calls&lt;/strong&gt; fetch current positions, pending orders, and order history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tavily tool&lt;/strong&gt; (via LangChain) fetches relevant market news.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt assembly&lt;/strong&gt;: all the dynamic data gets injected into the user prompt, along with the previous run’s output (memory).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Sonnet&lt;/strong&gt; processes the full context and produces a recommendation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output is saved&lt;/strong&gt; to a file (next run’s memory).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email + push notification&lt;/strong&gt; sent to me with the recommendation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human in the loop&lt;/strong&gt; (me) decides whether to act on the model’s recommendation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s eight steps. Most of them are just API calls and string concatenation. The “agentic” part is really step 5 — the LLM reasoning over structured context with tool access. And step 8 is just a guy looking at his phone over coffee.&lt;/p&gt;
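&lt;p&gt;Stripped of the real API calls, the whole run collapses into one short function. Every helper below is a stub with invented return values, but the shape is faithful:&lt;/p&gt;

```python
# The eight steps as plain Python. Each stub stands in for an API
# call or a file read in the real app.

def fetch_broker_state():   return {"position": "flat", "orders": []}
def fetch_news():           return ["Fed holds rates steady"]
def load_memory():          return "Last run: WAIT, earnings pending."
def call_llm(prompt):       return "WAIT. No edge versus yesterday."
def save_memory(summary):   pass
def notify(message):        pass

def run_once():
    state = fetch_broker_state()          # steps 1-2
    news = fetch_news()                   # step 3
    prompt = "\n".join([                  # step 4: string concatenation
        f"State: {state}", f"News: {news}", load_memory(),
    ])
    recommendation = call_llm(prompt)     # step 5
    save_memory(recommendation)           # step 6
    notify(recommendation)                # step 7
    return recommendation                 # step 8: the human decides

decision = run_once()
```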

&lt;h2&gt;The Obvious Question&lt;/h2&gt;

&lt;p&gt;Now, the obvious question. For a developer who can write code, &lt;strong&gt;why exactly do you need a fancy orchestration framework to build something like this?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I keep seeing products and platforms that promise to let you “build AI agents without code” or “orchestrate complex multi-agent workflows” with drag-and-drop interfaces and visual pipelines. And I get it — there’s a market for that. Not everyone writes code.&lt;/p&gt;

&lt;p&gt;But if you do write code, what you need is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An LLM API&lt;/li&gt;
&lt;li&gt;A few REST calls&lt;/li&gt;
&lt;li&gt;Some string templating for prompts&lt;/li&gt;
&lt;li&gt;A file system (for memory)&lt;/li&gt;
&lt;li&gt;A way to send notifications&lt;/li&gt;
&lt;li&gt;A container runtime (optional, but nice)&lt;/li&gt;
&lt;li&gt;Some logging, if you need audit or debug data (add as needed — it’s just code)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s it. That’s the agent.&lt;/p&gt;

&lt;p&gt;I’m not saying LangChain is strictly necessary either. I used it because I wanted to learn it and because the tool abstraction is convenient for the Tavily integration. But I could have done the same thing with raw API calls and a hundred lines of Python.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The orchestration is just code. It always was.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The value of an LLM agent is not in the orchestration layer. It’s in the prompt design, the context assembly, and the quality of the model’s reasoning. Everything else is plumbing. And developers have been doing plumbing for decades.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fposnv6hywkufnuppusaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fposnv6hywkufnuppusaj.png" alt="The model made its call. I’ll get to it." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Note: This is a personal learning project. Nothing here should be taken as financial advice. Also, as I’m not a native English speaker, I used an LLM to review and refine the language of this article while keeping my original tone and ideas.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://medium.com/ai-in-plain-english/i-built-a-small-agentic-trading-app-it-cost-me-a-few-hours-and-about-0-20-per-run-d00031bc0b3c" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>softwareengineering</category>
      <category>python</category>
    </item>
    <item>
      <title>Are Multi-Agent LLM Workflows Quietly Amplifying Mistakes?</title>
      <dc:creator>Ion</dc:creator>
      <pubDate>Fri, 13 Feb 2026 12:18:14 +0000</pubDate>
      <link>https://forem.com/ionionascu/are-multi-agent-llm-workflows-quietly-amplifying-mistakes-lhb</link>
      <guid>https://forem.com/ionionascu/are-multi-agent-llm-workflows-quietly-amplifying-mistakes-lhb</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zho3l91lsanebvs462d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zho3l91lsanebvs462d.png" alt="A pipeline that looks sophisticated yet slightly fragile" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Multi-agent systems are everywhere right now.&lt;/p&gt;

&lt;p&gt;Planner agent. Executor agent. Critic agent. Formatter agent. Sometimes even a "manager" agent coordinating everything.&lt;/p&gt;

&lt;p&gt;The story is usually the same: if one LLM call can be useful, then several specialised agents working together should be more reliable and produce better results.&lt;/p&gt;

&lt;p&gt;But I'm not fully convinced.&lt;/p&gt;

&lt;p&gt;Not because multi-agent systems are useless. They can be powerful. I'm just not sure we are always clear about &lt;strong&gt;why we use them, and what trade-offs we introduce when we chain probabilistic systems together.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Thinking of an Agent as a Function&lt;/h2&gt;

&lt;p&gt;Let's simplify the discussion and treat an agent like a function.&lt;/p&gt;

&lt;p&gt;Something like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;f_0_llm_0_step_n(prompt[1..n], data, output_step_n-1)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In plain English:&lt;/p&gt;

&lt;p&gt;An agent, running a specific model (say llm_0), performs step n using one or more prompts, some dynamic data (code snippets, documents, context), and the output of the previous step. It then produces a result.&lt;/p&gt;

&lt;p&gt;That result is passed to the next agent:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;f_1_llm_1_step_n(...)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiel1p4hl5fhtptu482e0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiel1p4hl5fhtptu482e0.png" alt="Agents as composed functions" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So we create a composition:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;f3(f2(f1(x)))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Each f is powered by an LLM. And &lt;strong&gt;each LLM call is probabilistic.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Every LLM Call Can Be Wrong&lt;/h2&gt;

&lt;p&gt;No matter how good the model is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It can misunderstand the requirement.&lt;/li&gt;
&lt;li&gt;It can hallucinate details.&lt;/li&gt;
&lt;li&gt;It can omit constraints.&lt;/li&gt;
&lt;li&gt;It can be confidently wrong.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we assume each step has some probability of being correct (call it a_i), then in a simple independent chain, overall correctness looks roughly like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;a_total ≈ a1 × a2 × a3 × ...&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Even if each individual step is "pretty good", multiplying several steps together is not comforting.&lt;/p&gt;
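&lt;p&gt;It takes two lines of Python to see how fast this decays. Five steps at 95% each, then ten:&lt;/p&gt;

```python
# Compounding a per-step correctness of 0.95 across a chain,
# assuming independent steps.
import math

per_step = 0.95
print(round(math.prod([per_step] * 5), 3))   # 0.774
print(round(math.prod([per_step] * 10), 3))  # 0.599
```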

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld9ew1t31ikt3kf0zm0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld9ew1t31ikt3kf0zm0s.png" alt="Probability compounding" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But I should be honest about what this model hides. The formula assumes errors are independent across steps. In practice, they're often not. Agents share the same context, sometimes the same model weights, and frequently the same blind spots. If agent 1 misunderstands an ambiguous requirement, agent 2, reading agent 1's output, will likely inherit that misunderstanding. &lt;strong&gt;Correlated failures can make things worse than the multiplicative model suggests&lt;/strong&gt;, because the "correction" you'd hope for from chaining doesn't materialise when all the agents are confused in the same direction.&lt;/p&gt;

&lt;p&gt;On the other hand, if agents are operating on genuinely different aspects of a problem, the correlation drops and you get closer to the independent case, which is actually more optimistic.&lt;/p&gt;

&lt;p&gt;The point is not that the formula is a precise model. The point is directional: &lt;strong&gt;chaining probabilistic transformations does not automatically improve reliability&lt;/strong&gt;, and in many setups it degrades it. The exact shape of that degradation depends on how correlated the failure modes are.&lt;/p&gt;

&lt;h2&gt;"Yes, But Agents Can Correct Each Other"&lt;/h2&gt;

&lt;p&gt;A fair counterargument is: this is too pessimistic.&lt;/p&gt;

&lt;p&gt;People will say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We use critic agents.&lt;/li&gt;
&lt;li&gt;We use voting.&lt;/li&gt;
&lt;li&gt;We do retries.&lt;/li&gt;
&lt;li&gt;We enforce JSON schemas.&lt;/li&gt;
&lt;li&gt;We run compilers and tests.&lt;/li&gt;
&lt;li&gt;We keep temperature low.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yes, these things change the math.&lt;/p&gt;

&lt;p&gt;If you add deterministic constraints between steps, you are no longer doing a blind probabilistic chain. You are building something closer to:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;LLM → deterministic constraint → LLM → deterministic constraint&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That is why agent workflows often work much better for code generation than for pure reasoning. A compiler or a test suite is a hard boundary. It can reject invalid outputs and force correction.&lt;/p&gt;
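&lt;p&gt;A minimal version of such a boundary, with both LLM steps stubbed out and a JSON schema check as the deterministic anchor (the key set and retry count are invented for illustration):&lt;/p&gt;

```python
# A deterministic anchor between two LLM steps: the second step only
# runs if the first step's output parses and passes a schema check.
# llm_extract and llm_summarise are stubs for real model calls.
import json

def llm_extract(text):
    return '{"ticker": "ACME", "sentiment": "negative"}'

def llm_summarise(record):
    return f"{record['ticker']} news is {record['sentiment']}."

REQUIRED_KEYS = {"ticker", "sentiment"}

def constrained_chain(text, retries=3):
    for _ in range(retries):
        raw = llm_extract(text)
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # hard boundary: reject invalid output and retry
        if REQUIRED_KEYS.issubset(record):
            return llm_summarise(record)
    raise ValueError("extraction never passed the schema check")

result = constrained_chain("ACME misses earnings badly")
```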

&lt;p&gt;So the argument is not "multi-agent is always bad".&lt;/p&gt;

&lt;p&gt;The real risk appears when we build &lt;strong&gt;deep chains without strong external anchors.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Two Problems That Look Like One&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feubqh32m8inm2835cc4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feubqh32m8inm2835cc4r.png" alt="Epistemic drift" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The issue I worry about most in deep chains is what I'd call epistemic drift. But when I try to pin it down, it splits into two distinct problems that need different solutions.&lt;/p&gt;

&lt;p&gt;The first is error accumulation. Each agent tends to assume the previous output is "good enough". It builds on it, improves it, adds structure. If agent 2 introduces a small wrong assumption, agent 3 doesn't question it. It elaborates on it. By agent 5, &lt;strong&gt;the wrong assumption is load-bearing and invisible.&lt;/strong&gt; This is a propagation problem, and the fix is intermediate validation: check facts, run tests, enforce constraints between steps.&lt;/p&gt;

&lt;p&gt;The second is confidence inflation. As the chain gets deeper, outputs become more organised. They sound more authoritative. They look more complete. But &lt;strong&gt;coherence is not the same as correctness.&lt;/strong&gt; A well-formatted, well-structured document can be built on a shaky premise and still read as though every claim was carefully verified. This is a calibration problem, and it's worth noting that it isn't unique to multi-agent systems — a single LLM call can also produce overconfident nonsense. But multi-agent chains make it worse, because each successive agent treats the previous output as settled ground rather than as probabilistic input.&lt;/p&gt;

&lt;p&gt;The distinction matters because people often propose critic agents or voting as the solution. Those help with error accumulation. &lt;strong&gt;They do very little for confidence inflation.&lt;/strong&gt; A critic agent reading a polished, internally consistent document has the same struggle a human reviewer does: the text doesn't signal where its weak points are.&lt;/p&gt;

&lt;p&gt;And in both cases, the deeper the chain, the harder it becomes to locate where things first went wrong.&lt;/p&gt;

&lt;h2&gt;Debugging Becomes Hard&lt;/h2&gt;

&lt;p&gt;With one LLM call, debugging is simple: inspect the prompt and output.&lt;/p&gt;

&lt;p&gt;With five agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which agent introduced the wrong assumption?&lt;/li&gt;
&lt;li&gt;Which agent "fixed" something that wasn't broken?&lt;/li&gt;
&lt;li&gt;Which step created the hallucinated constraint?&lt;/li&gt;
&lt;li&gt;Which retry changed the meaning?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multi-agent workflows increase the number of places where subtle divergence can happen. I think of it as &lt;strong&gt;increased epistemic surface area&lt;/strong&gt;: more surfaces where mistakes can stick.&lt;/p&gt;

&lt;p&gt;This matters in areas like architecture decisions, long-horizon planning, policy and compliance, regulated environments in finance or healthcare or infrastructure. In these contexts, a small drift can turn into an expensive mistake.&lt;/p&gt;

&lt;h2&gt;Humans Are Also Probabilistic&lt;/h2&gt;

&lt;p&gt;To be fair, humans are not deterministic either.&lt;/p&gt;

&lt;p&gt;Humans misunderstand specs. Humans propagate assumptions. Humans get tired, biased, distracted. You could model a human step as:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;h(context, bias, fatigue, memory)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;So the comparison is not "agents vs perfection". It's agent chains versus human chains. And sometimes multi-agent systems may be no worse than multi-human workflows.&lt;/p&gt;

&lt;p&gt;The strongest version of the counterargument goes further: multi-agent systems with critic loops are basically peer review, and peer review empirically reduces error rates. That's true. But there's a reason the analogy breaks down. Human peer reviewers bring genuinely independent perspectives. They have different training, different experiences, different biases. When a second engineer reviews your design, they might catch the thing you missed precisely because they think differently.&lt;/p&gt;

&lt;p&gt;LLM agents, even when given different system prompts, share the same training data, the same learned biases, the same blind spots. A "critic" agent running the same model as the "creator" agent is not an independent reviewer. &lt;strong&gt;It's closer to asking the same person to review their own work after taking a coffee break.&lt;/strong&gt; It might catch surface errors, but it's unlikely to challenge the deeper assumptions because it shares them.&lt;/p&gt;

&lt;p&gt;This doesn't mean multi-agent critique is worthless. It means the error reduction you get from it is probably smaller and less reliable than the error reduction you get from genuine human peer review, and we should set our expectations accordingly.&lt;/p&gt;

&lt;h2&gt;Why Multi-Agent Often Means Multi-Model&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xb1cfyfpat5lnqosnm7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xb1cfyfpat5lnqosnm7.png" alt="Multi-agent often = multi-model cost routing" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is another reason multi-agent systems are popular, and it's not always about "better reasoning".&lt;/p&gt;

&lt;p&gt;It's cost.&lt;/p&gt;

&lt;p&gt;In practice, a lot of multi-agent pipelines are also multi-model pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a cheap model for classification or routing&lt;/li&gt;
&lt;li&gt;a mid-tier model for extraction&lt;/li&gt;
&lt;li&gt;a strong (expensive) model for reasoning&lt;/li&gt;
&lt;li&gt;a cheap model again for formatting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So when people say "agentic workflow", sometimes they actually mean: we split the work so we can run most steps on cheaper models.&lt;/p&gt;

&lt;p&gt;That's not a bad idea. It's rational.&lt;/p&gt;

&lt;p&gt;If you run everything on the strongest model, costs explode. So people route tasks.&lt;/p&gt;

&lt;p&gt;But this introduces a specific risk: the smaller, cheaper models often become gatekeepers. If a cheap model misclassifies a task as "simple", the expensive reasoning step might never run. Now the system is not optimising for correctness. &lt;strong&gt;It is optimising for cost, and correctness becomes a side effect.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I want to be precise here, because this is not inherently a flaw of multi-agent design. A well-built system can route uncertain inputs upward, using confidence thresholds, fallback paths, or escalation logic. The problem is that many systems don't. The default in most frameworks is to route based on classification alone, without a notion of "I'm not sure, send this to the bigger model." And when cost pressure is the main driver, the incentive is to keep things on the cheap path.&lt;/p&gt;
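&lt;p&gt;The difference is small in code but large in behaviour. A routing rule with an uncertainty escape hatch looks something like this (model names and the threshold are illustrative):&lt;/p&gt;

```python
# Uncertainty-aware routing: classification alone picks the cheap
# path, but low confidence escalates to the strong model regardless.

THRESHOLD = 0.8

def route(task_label: str, confidence: float) -> str:
    if confidence >= THRESHOLD and task_label == "simple":
        return "cheap-model"
    # Not sure, or not simple: send it to the bigger model.
    return "strong-model"

assert route("simple", 0.95) == "cheap-model"
assert route("simple", 0.55) == "strong-model"   # unsure: escalate
assert route("complex", 0.99) == "strong-model"
```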

&lt;p&gt;The other issue with mixing models is that you're now mixing failure modes. Different models have different biases, different calibration, different hallucination patterns. You are no longer chaining one probabilistic system. You are chaining several different ones. That doesn't make the system inherently worse, but it does make behaviour less predictable and debugging harder. When something goes wrong in a multi-model pipeline, you're not just asking "what went wrong?", you're asking &lt;strong&gt;"which model's particular flavour of wrong caused this?"&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;The Human Validation Problem&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0c4rozk0hmy7fvo8smb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0c4rozk0hmy7fvo8smb.png" alt="Validation economics / review paradox" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is also an economic problem around validation.&lt;/p&gt;

&lt;p&gt;The obvious extremes are: validate every intermediate step (which destroys the automation benefit) or validate only the final output (which means rerunning the whole pipeline when it's wrong, and re-validating from scratch because stochastic outputs change between runs).&lt;/p&gt;

&lt;p&gt;Most production systems land somewhere in the middle. You identify high-risk steps and validate those. You let low-risk transformations run autonomously. You put hard checkpoints at the boundaries where errors would be most expensive.&lt;/p&gt;

&lt;p&gt;That middle ground is reasonable, and it works. But it's worth noticing what it implies: the value of the multi-agent system now depends entirely on how well you've identified where the risk concentrates. Get that wrong and you're either over-validating (killing the efficiency gains) or under-validating (letting errors through at the exact points that matter).&lt;/p&gt;
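&lt;p&gt;One way to make that deliberation explicit is to score each step by its expected loss and only validate where the number is large. The probabilities and costs below are invented purely for illustration:&lt;/p&gt;

```python
# Picking validation checkpoints by expected cost of an undetected
# error at each step: p(error) times cost if it slips through.

steps = [
    {"name": "classify", "p_error": 0.05, "cost_if_wrong": 1},
    {"name": "extract",  "p_error": 0.10, "cost_if_wrong": 5},
    {"name": "reason",   "p_error": 0.15, "cost_if_wrong": 100},
    {"name": "format",   "p_error": 0.02, "cost_if_wrong": 1},
]

BUDGET = 1.0  # validate only where expected loss exceeds this

checkpoints = [
    s["name"] for s in steps
    if s["p_error"] * s["cost_if_wrong"] >= BUDGET
]
# expected losses: 0.05, 0.5, 15.0, 0.02, so only "reason" qualifies
```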

&lt;p&gt;In other words, &lt;strong&gt;agent workflows don't remove validation cost. They require you to be much more deliberate about where you spend it.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;When Do Multi-Agent Workflows Make Sense?&lt;/h2&gt;

&lt;p&gt;In my view, multi-agent workflows make sense when they are constrained and anchored.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;each step has a clearly bounded responsibility&lt;/li&gt;
&lt;li&gt;deterministic validation exists between steps (tests, schema checks, compiler, policy engine)&lt;/li&gt;
&lt;li&gt;intermediate outputs have stable structure&lt;/li&gt;
&lt;li&gt;chain depth is limited&lt;/li&gt;
&lt;li&gt;retries are controlled and predictable&lt;/li&gt;
&lt;li&gt;there is traceability (you can see what changed and why)&lt;/li&gt;
&lt;li&gt;routing logic includes uncertainty-aware escalation, not just classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In that world, multi-agent can genuinely improve quality and reduce human workload.&lt;/p&gt;

&lt;p&gt;But without those constraints, &lt;strong&gt;adding agents can increase complexity faster than it increases correctness.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Question
&lt;/h2&gt;

&lt;p&gt;The real question is not "Are multi-agent systems good or bad?"&lt;/p&gt;

&lt;p&gt;It's:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are we increasing reasoning depth, or just adding reasoning layers?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes multi-agent systems create better output. Sometimes they create more confident output.&lt;/p&gt;

&lt;p&gt;Those are not the same thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;We've seen this pattern before in software engineering.&lt;/p&gt;

&lt;p&gt;More lines of code did not mean more value. More microservices did not automatically mean better architecture. More tokens do not necessarily mean more productivity.&lt;/p&gt;

&lt;p&gt;That last one is worth sitting with for a moment, because the industry consensus on microservices is instructive. We didn't conclude that microservices were wrong. We concluded that &lt;strong&gt;they require operational maturity to work&lt;/strong&gt; — good observability, clear service boundaries, robust deployment pipelines. Without that maturity, they made things worse. With it, they could be genuinely better than the monolith.&lt;/p&gt;

&lt;p&gt;I think multi-agent systems are at a similar inflection point. The pattern is not wrong. But it demands more rigour than most teams are currently applying. It demands deterministic anchors, good traceability, honest assessment of where errors concentrate, and a willingness to keep things simple when the constraints aren't there yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More agents do not necessarily mean more correctness.&lt;/strong&gt; But well-constrained agents, with strong validation boundaries and honest routing, &lt;strong&gt;can earn their complexity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The danger is in the middle: systems that look sophisticated, that sound confident, that cost less per token, &lt;strong&gt;while quietly amplifying the exact problems we wanted to solve.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Note: As I’m not a native English speaker, I used ChatGPT to review and refine the language of this article while keeping my original tone and ideas.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://medium.com/@ionionascu/are-multi-agent-llm-workflows-quietly-amplifying-mistakes-3f2200c1683a" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>genai</category>
      <category>softwareengineering</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Why Not One-Time-Use Banks (in Terraform)? Gen-AI Says: Maybe.</title>
      <dc:creator>Ion</dc:creator>
      <pubDate>Fri, 23 Jan 2026 10:49:27 +0000</pubDate>
      <link>https://forem.com/ionionascu/why-not-one-time-use-banks-in-terraform-gen-ai-says-maybe-1ai3</link>
      <guid>https://forem.com/ionionascu/why-not-one-time-use-banks-in-terraform-gen-ai-says-maybe-1ai3</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7oh2p6dah34uqedrk7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7oh2p6dah34uqedrk7t.png" alt="A regulated bank, reimagined as a reproducible software system." width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The prompt
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;You are the CEO of a Tier-1 bank (e.g. Jamie Dimon) who is also an elite systems programmer and architect (e.g. Linus Torvalds), with deep expertise in Java, distributed systems, and AWS.&lt;/p&gt;

&lt;p&gt;Design and build the minimum legally compliant software and AWS infrastructure required to launch a UK-regulated bank.&lt;/p&gt;

&lt;p&gt;Requirements&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud: AWS only&lt;/li&gt;
&lt;li&gt;Jurisdiction: UK&lt;/li&gt;
&lt;li&gt;Capabilities: hold funds, receive money, send money to other banks&lt;/li&gt;
&lt;li&gt;Currency: multi-currency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You must produce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core banking architecture (double-entry ledger, accounts, transactions)&lt;/li&gt;
&lt;li&gt;Executable AWS infrastructure templates (Terraform or CloudFormation)&lt;/li&gt;
&lt;li&gt;Java service skeletons for ledger and payments&lt;/li&gt;
&lt;li&gt;Security &amp;amp; compliance components (IAM, encryption, audit logs, AML/KYC hooks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clearly state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is fully implementable in software&lt;/li&gt;
&lt;li&gt;What requires regulators or external banking partners&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Be concrete and implementation-ready. Do not simplify regulatory constraints.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No fluff&lt;/strong&gt;. &lt;strong&gt;No “high-level overview”&lt;/strong&gt;. &lt;strong&gt;No hand-waving about “innovation”&lt;/strong&gt;. Just: design and build a bank, ideally with a cool Terraform template you could apply to your preferred cloud provider (or, if you’re feeling adventurous, maybe even your own Raspberry Pi cluster).&lt;/p&gt;
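&lt;p&gt;To be fair about scale: the double-entry ledger line in that list is the one piece that genuinely is tractable in plain code. A minimal core, sketched here in Python rather than Java for brevity (all names are illustrative, not taken from any generated output), fits in a few dozen lines:&lt;/p&gt;

```python
from dataclasses import dataclass, field
from decimal import Decimal

@dataclass
class Account:
    name: str
    currency: str
    balance: Decimal = Decimal("0")

@dataclass
class Ledger:
    accounts: dict = field(default_factory=dict)
    journal: list = field(default_factory=list)

    def open(self, name: str, currency: str) -> None:
        self.accounts[name] = Account(name, currency)

    def post(self, debit: str, credit: str, amount: Decimal) -> None:
        """One debit and one credit per movement, so the books always balance."""
        d, c = self.accounts[debit], self.accounts[credit]
        if d.currency != c.currency:
            raise ValueError("cross-currency posting needs an explicit FX leg")
        if amount <= 0:
            raise ValueError("amount must be positive")
        d.balance -= amount
        c.balance += amount
        self.journal.append((debit, credit, amount))
```

&lt;p&gt;Everything around it (settlement rails, safeguarding, reporting, the licence itself) is where the real difficulty lives, which is exactly the point of the prompt.&lt;/p&gt;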

&lt;h2&gt;
  
  
  From as-a-Service to as-a-System
&lt;/h2&gt;

&lt;p&gt;We are already comfortable with &lt;strong&gt;platform-as-a-service&lt;/strong&gt;, &lt;strong&gt;infrastructure-as-a-service&lt;/strong&gt;, &lt;strong&gt;database-as-a-service&lt;/strong&gt;, and increasingly &lt;strong&gt;AI-as-a-service&lt;/strong&gt;. These models work because complexity is neatly &lt;strong&gt;encapsulated behind APIs, templates, and contracts&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What’s changing with modern LLMs is the &lt;strong&gt;boundary of what can be encapsulated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You no longer just rent compute. You can ask for entire systems, described in natural language and emitted as infrastructure code, service skeletons, and architectural decisions.&lt;/p&gt;

&lt;p&gt;At that point, the “service” is no longer just a hosted product. It becomes a &lt;strong&gt;reproducible system blueprint&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is the mental leap behind the phrase:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Bank-as-a-Service&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not a Stripe-style API wrapper.&lt;br&gt;
Not a fintech middleware vendor.&lt;/p&gt;

&lt;p&gt;But a design-complete, regulation-aware, infrastructure-ready banking system, generated from a prompt.&lt;/p&gt;

&lt;p&gt;With Terraform.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0qd1r1xpxk7kmcbp9nv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0qd1r1xpxk7kmcbp9nv.png" alt="From institution to system: when a bank becomes a reproducible blueprint." width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A short personal example
&lt;/h2&gt;

&lt;p&gt;This realisation didn’t start with banks for me.&lt;/p&gt;

&lt;p&gt;I’m currently working on an application called &lt;a href="https://apps.apple.com/gb/app/medetra/id6751812151" rel="noopener noreferrer"&gt;MedEtra&lt;/a&gt;. Recently, I wanted to run a small survey to explore a few product ideas.&lt;/p&gt;

&lt;p&gt;The traditional route would be obvious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pick a survey-as-a-service platform (or something ending in “Forms”)&lt;/li&gt;
&lt;li&gt;configure questions&lt;/li&gt;
&lt;li&gt;export results&lt;/li&gt;
&lt;li&gt;integrate storage later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, I went to a Gen-AI coding model and simply asked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;here are the questions&lt;/li&gt;
&lt;li&gt;suggest improvements&lt;/li&gt;
&lt;li&gt;target platforms: web + mobile&lt;/li&gt;
&lt;li&gt;data storage: my own backend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That was it.&lt;/p&gt;

&lt;p&gt;The model generated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the survey structure&lt;/li&gt;
&lt;li&gt;validation rules&lt;/li&gt;
&lt;li&gt;storage schema&lt;/li&gt;
&lt;li&gt;deployment-ready code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, it became very clear to me that &lt;strong&gt;entire categories of “-as-a-service” businesses are exposed&lt;/strong&gt; — not because they are bad products, but because they are thin abstractions over logic that can now be generated on demand.&lt;/p&gt;

&lt;p&gt;Banking is obviously orders of magnitude harder.&lt;/p&gt;

&lt;p&gt;But the direction of travel is the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why banks-as-a-service is plausible, not imminent
&lt;/h2&gt;

&lt;p&gt;Let’s be honest. The technology is &lt;strong&gt;not fully there yet&lt;/strong&gt;. The regulatory burden is &lt;strong&gt;non-negotiable&lt;/strong&gt;. And no LLM today can safely move money autonomously.&lt;/p&gt;

&lt;p&gt;But that’s not the point. Cryptocurrencies, stablecoins, and virtual cards have already demonstrated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;demand for programmable money&lt;/li&gt;
&lt;li&gt;appetite for alternative rails&lt;/li&gt;
&lt;li&gt;willingness to trust software with value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a sense, this is not a revolution — it’s an &lt;strong&gt;extension&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;From:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;virtual cards&lt;/li&gt;
&lt;li&gt;embedded finance&lt;/li&gt;
&lt;li&gt;ledger APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;full banking systems as generated artefacts&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference is scale and accountability, not principle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond banks: Facebook-as-a-Service?
&lt;/h2&gt;

&lt;p&gt;Once you accept the premise, the examples get uncomfortable.&lt;/p&gt;

&lt;p&gt;Why not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Facebook-as-a-service?&lt;/li&gt;
&lt;li&gt;CRM-as-a-system?&lt;/li&gt;
&lt;li&gt;Government-forms-as-code?&lt;/li&gt;
&lt;li&gt;School-management-as-templates?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a platform’s value is primarily:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;workflow logic&lt;/li&gt;
&lt;li&gt;schema design&lt;/li&gt;
&lt;li&gt;UI conventions&lt;/li&gt;
&lt;li&gt;integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…then Gen-AI can increasingly generate the platform itself.&lt;/p&gt;

&lt;p&gt;Not host it.&lt;br&gt;
Not brand it.&lt;br&gt;
Generate it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The uncomfortable conclusion
&lt;/h2&gt;

&lt;p&gt;Gen-AI is not just automating tasks. It is &lt;strong&gt;commoditising system design&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The differentiator is shifting from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;who built the software&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;who operates it&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;who is accountable&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;who holds the licence&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;who carries the risk&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In banking, that last point is everything.&lt;/p&gt;

&lt;p&gt;And that’s why the prompt matters.&lt;/p&gt;

&lt;p&gt;Not because we can launch a bank with Terraform tomorrow — but because we are getting closer to a world where &lt;strong&gt;the hardest part of building institutions is no longer the software&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And that opens up a wide range of new opportunities.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this resonates, I suspect many “-as-a-service” business models will need to rethink what, exactly, they are selling.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Note: As I’m not a native English speaker, I used ChatGPT to review and refine the language of this article while keeping my original tone and ideas.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://medium.com/ai-in-plain-english/why-not-one-time-use-banks-in-terraform-gen-ai-says-maybe-05decd7ca749" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>genai</category>
      <category>systemdesign</category>
      <category>architecture</category>
      <category>cloud</category>
    </item>
    <item>
      <title>The Return of Assembly: When LLMs No Longer Need High-Level Languages</title>
      <dc:creator>Ion</dc:creator>
      <pubDate>Wed, 22 Oct 2025 12:30:14 +0000</pubDate>
      <link>https://forem.com/ionionascu/the-return-of-assembly-when-llms-no-longer-need-high-level-languages-1dak</link>
      <guid>https://forem.com/ionionascu/the-return-of-assembly-when-llms-no-longer-need-high-level-languages-1dak</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34zr8yzhj83f3w2r4xas.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34zr8yzhj83f3w2r4xas.png" alt="AI-generated illustration (via ChatGPT) representing the idea of large language models generating assembly code directly from plain English." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most people in IT have at least heard of assembly language. Some even saw it during university - just enough to know it exists, but not enough to use it for anything real.&lt;/p&gt;

&lt;p&gt;Assembly used to be the secret behind high-performance code. Games like &lt;strong&gt;Quake&lt;/strong&gt;, graphics engines, and device drivers often had big chunks written in assembly for one simple reason: &lt;strong&gt;speed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But times changed. Modern developers rarely touch assembly anymore, and for good reasons. Until recently, it simply didn't make sense for most projects.&lt;/p&gt;

&lt;p&gt;Well, at least, that's what I thought - until I tried something on my Mac a few days ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Assembly Disappeared
&lt;/h2&gt;

&lt;p&gt;Assembly slowly vanished from mainstream software development because it was too much work for humans to handle.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's &lt;strong&gt;not portable&lt;/strong&gt; - an assembly routine that runs on x86 won't work on ARM, RISC-V, or a GPU.&lt;/li&gt;
&lt;li&gt;It's &lt;strong&gt;not maintainable&lt;/strong&gt; - every line depends on the quirks of a specific CPU.&lt;/li&gt;
&lt;li&gt;It's &lt;strong&gt;also not forgiving&lt;/strong&gt; - one wrong instruction and you crash the process or corrupt memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Higher-level languages like C, Java, or Python fixed that. They gave us portability, readability, and safety. We stopped thinking about registers and flags, and started thinking about business logic.&lt;/p&gt;

&lt;p&gt;Compilers got so good that we forgot there was assembly underneath. For human developers, this was a big win. For LLMs, though, those constraints don't really apply.&lt;/p&gt;

&lt;h2&gt;
  
  
  Humans Have Limits. LLMs Don't - at Least Not the Same Ones
&lt;/h2&gt;

&lt;p&gt;Humans forget things. We can't hold every detail of a large project in our head. That's why complex systems end up consuming too much memory, leaking resources, or deadlocking.&lt;/p&gt;

&lt;p&gt;LLMs, on the other hand, can "see" entire codebases at once. With enough context tokens, they can process hundreds of files, reason about dependencies, and never lose focus.&lt;/p&gt;

&lt;p&gt;So I started thinking: if an LLM can understand requirements directly in English, &lt;strong&gt;why do we still need to go through a high-level programming language at all?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why not just ask the model to write assembly code directly for the target hardware?&lt;/p&gt;

&lt;h2&gt;
  
  
  From English to Assembly on My Mac
&lt;/h2&gt;

&lt;p&gt;Each CPU architecture - x86, ARM, RISC-V, GPUs - has its own instruction set. But that's not really a problem anymore. You can simply be explicit in your prompt.&lt;/p&gt;

&lt;p&gt;In my case, I told Copilot something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Target Platform&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Hardware**&lt;/span&gt;: MacBook Pro 16-inch (November 2024)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Processor**&lt;/span&gt;: Apple M4 Max (ARM64 architecture)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Operating System**&lt;/span&gt;: macOS
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Execution Environment**&lt;/span&gt;: Terminal applications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No compilers, no C, no libraries. Just plain English. Copilot produced a short, valid assembly routine for my Mac. Then I asked it to write a test harness to verify it. It chose Python automatically, wrote the code to load and call the routine, and checked that &lt;code&gt;3 + 5&lt;/code&gt; returned &lt;code&gt;8&lt;/code&gt;. It worked.&lt;/p&gt;

&lt;p&gt;It was simple - just adding two numbers - but it &lt;strong&gt;proved the point&lt;/strong&gt;: I didn't need to know the registers, calling conventions, or even the assembler syntax. The AI handled all of it.&lt;/p&gt;
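&lt;p&gt;For the curious, the harness pattern is easy to sketch. This is not the exact code Copilot wrote - the library name and the build step are illustrative - but it shows the shape: assemble the LLM-generated routine into a shared library, load it with &lt;code&gt;ctypes&lt;/code&gt;, declare the signature, and assert on the result:&lt;/p&gt;

```python
import ctypes
from pathlib import Path

# Assumed build step for the LLM-written ARM64 source (e.g. add.s):
#   as add.s -o add.o
#   clang -shared add.o -o libadd.dylib

def load_add(lib_path: Path = Path("libadd.dylib")):
    """Load the native routine and declare its signature explicitly."""
    lib = ctypes.CDLL(str(lib_path))  # raises OSError if the library is missing
    lib.add.argtypes = [ctypes.c_int64, ctypes.c_int64]
    lib.add.restype = ctypes.c_int64
    return lib.add

# Usage, once libadd.dylib exists:
#   add = load_add()
#   assert add(3, 5) == 8
```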

&lt;h2&gt;
  
  
  Why Assembly Might Make Sense Again
&lt;/h2&gt;

&lt;p&gt;For humans, assembly was too low-level to manage. For LLMs, it's just another language - no harder than Python or JavaScript.&lt;/p&gt;

&lt;p&gt;They could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate code directly optimised for the target CPU or GPU.&lt;/li&gt;
&lt;li&gt;Adapt automatically to architecture differences.&lt;/li&gt;
&lt;li&gt;Explore micro-optimisations that compilers might miss.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Imagine asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Generate an ARM64 routine for matrix multiplication optimised for Apple's M-series cache layout."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The LLM could write it, benchmark it, refine it, and repeat - all automatically. That's something no human could realistically do across multiple hardware targets.&lt;/p&gt;
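&lt;p&gt;The loop itself is mundane to write down - the hard part is generating good candidates, not timing them. A small sketch (the candidate functions are stand-ins; in practice each would be a freshly assembled variant of the same kernel):&lt;/p&gt;

```python
import statistics
import time

def benchmark(fn, args, runs=200):
    """Median wall-clock time of one call, over several runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def pick_fastest(candidates, args):
    """Keep whichever generated variant has the best median timing."""
    return min(candidates, key=lambda fn: benchmark(fn, args))
```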

&lt;p&gt;Of course, there are risks. LLM-generated assembly might not be deterministic or reproducible. It may fail silently or produce slightly different code for the same request. And debugging machine code generated by a neural network sounds like a nightmare.&lt;/p&gt;

&lt;p&gt;But then again, we already trust compilers we don't fully understand. This is just one level deeper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements in Plain English
&lt;/h2&gt;

&lt;p&gt;One of the most exciting parts of all this is how &lt;strong&gt;LLMs turn requirements into code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It's like Behaviour-Driven Development (BDD), but without the intermediate steps. You describe what you want:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The function should return the sum of two integers and store the result in memory."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The LLM translates that into assembly, or C, or Python - whatever fits the goal - and can even test it immediately. In theory, product managers could write requirements that are executable from day one, without developers writing scaffolding or boilerplate.&lt;/p&gt;

&lt;p&gt;Languages would become more like intermediate representations - still useful for humans, but optional for machines.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;p&gt;Here's where I think this could go in the next few years:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLMs trained per architecture&lt;/strong&gt; - ARM, x86, RISC-V, CUDA, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-driven compilers&lt;/strong&gt; - translating English or pseudocode straight into optimised machine code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic debugging&lt;/strong&gt; - where you ask "why is this logic wrong?" instead of inspecting registers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that happens, the boundaries between &lt;strong&gt;design&lt;/strong&gt;, &lt;strong&gt;coding&lt;/strong&gt;, and &lt;strong&gt;compiling&lt;/strong&gt; could blur completely. Developers will focus on describing intent, not syntax. LLMs will handle the low-level reality. And maybe, just maybe, &lt;strong&gt;assembly will make sense again&lt;/strong&gt; - not because we understand it better, but because our tools finally do.&lt;/p&gt;




&lt;p&gt;If you'd like to see the tiny "add two numbers" experiment I mentioned, I've shared it on GitHub: &lt;a href="https://github.com/ionionascu/llm-to-assembly" rel="noopener noreferrer"&gt;github.com/ionionascu/llm-to-assembly&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is not a production-ready project - it's a simple proof of concept. The code focuses only on the happy path and is not designed to be fault-tolerant or cover all possible test cases. Invalid inputs or edge conditions may cause segmentation faults or crashes, and that's expected. The goal was never to build robust software, but to explore whether an LLM could generate functional assembly code directly from plain English.&lt;/p&gt;

&lt;p&gt;TL;DR: This project isn't about "vibe coding" or skipping engineering discipline. It's a small, controlled experiment meant to explore what happens when large language models can reason directly at the hardware level - not a suggestion that we should replace careful software design with AI prompts.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Note: As I'm not a native English speaker, I used ChatGPT to review and refine the language of this article while keeping my original tone and ideas.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://medium.com/@ionionascu/the-return-of-assembly-when-llms-no-longer-need-high-level-languages-79bc43c0822c" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>assembly</category>
      <category>githubcopilot</category>
    </item>
  </channel>
</rss>
