Forem: zvone187

Scheduled agent runs are now more reliable

zvone187 — Sat, 09 May 2026 15:06:16 +0000

Pazi's whole pitch on scheduled work is that you set it up once and trust it to run. That breaks the moment the result doesn't show up, and over the past two weeks we documented two ways it happens: the routing-inference failure and the bootstrap-latency dropout.

Pazi scheduled tasks now have a three-tier delivery fallback that closes both gaps. If the configured channel disappears mid-run, the run errors before its last step, or every destination is offline, the output still reaches the dashboard. The fix covers new and existing crons, and lands alongside a new pazi-set-cron skill that asks where you want delivery before any scheduled task is created.

Tier 1: resolve at creation

When an agent calls cron.add, the gateway picks a delivery channel before the entry is persisted. Explicit delivery.mode: "webhook" and "none" pass through untouched. systemEvent payloads also skip the resolver, since they don't need a channel.

For everything else, the resolver walks a four-step chain: the explicitly set channel, then the best configured external channel (Slack preferred), then the originating session, then the agent's persistent session. The pick is stamped on the cron, so openclaw cron list and the dashboard show where output will land before the first run.

The auto-detected case stamps bestEffort: true. If delivery later fails, the run isn't marked failed; it falls through to Tier 3.
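
As a rough illustration, the Tier 1 chain could look like the sketch below. This is not the gateway's actual code: the function name resolve_delivery, the dict shapes, the payload "kind" key, and the "session:..." channel labels are all assumptions made for the example. Only the ordering, the pass-through modes, and the bestEffort stamp come from the description above.

```python
# Hypothetical sketch of the Tier 1 resolver; names and data shapes are assumptions.
PASS_THROUGH_MODES = {"webhook", "none"}
UNDELIVERABLE = {"last", "webchat"}  # treated as unset, per the caveats below


def resolve_delivery(cron, configured_channels, origin_session=None,
                     persistent_session="agent-main"):
    """Return the delivery dict that would be stamped on the cron."""
    delivery = dict(cron.get("delivery", {}))

    # Gates: explicit webhook/none and systemEvent payloads are left untouched.
    if delivery.get("mode") in PASS_THROUGH_MODES:
        return delivery
    if cron.get("payload", {}).get("kind") == "systemEvent":
        return delivery

    # Step 1: an explicitly set, deliverable channel wins.
    explicit = delivery.get("channel")
    if explicit and explicit not in UNDELIVERABLE:
        return delivery

    # Step 2: best configured external channel, Slack preferred.
    enabled = [name for name, cfg in configured_channels.items()
               if cfg.get("enabled", True)]
    if enabled:
        best = "slack" if "slack" in enabled else sorted(enabled)[0]
        return {**delivery, "channel": best, "bestEffort": True}

    # Steps 3 and 4: originating session, then the agent's persistent session.
    target = origin_session or persistent_session
    return {**delivery, "channel": f"session:{target}", "bestEffort": True}
```

Auto-detected picks carry bestEffort: True in this sketch, while an explicit channel is returned unchanged, which mirrors the "contract instead of an inference" distinction.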

Tier 2: re-check on every fire

The dispatcher re-evaluates the stamped channel against live config on every fire. That heals four cases:

1. Legacy crons created before this resolver existed.
2. Crons whose stamped channel was removed from config since creation.
3. Crons whose stamped channel is now enabled: false.
4. Crons edited between creation and dispatch in a way that re-introduced a webhook / none mode or a non-agentTurn payload (the Tier 1 gates are re-enforced here).

For cases (2) and (3), the dispatcher clears the dead channel and re-runs the resolver. The persisted record isn't mutated, so flipping a channel back on works without re-creating the cron. Every existing cron picks up this behaviour on its next fire.
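
A minimal sketch of that per-fire re-check, assuming a dict-shaped cron and a resolve callable that stands in for the Tier 1 resolver. The function name effective_delivery and the payload "kind" key are hypothetical, and the sketch only covers external channels that appear in live config.

```python
# Hypothetical sketch of the Tier 2 re-check; names and shapes are assumptions.
def effective_delivery(cron, live_channels, resolve):
    """Pick the delivery to use for this fire without mutating the stored cron."""
    delivery = dict(cron.get("delivery", {}))  # copy: the persisted record stays as-is

    # Tier 1 gates are enforced again in case the cron was edited after creation.
    if delivery.get("mode") in {"webhook", "none"}:
        return delivery
    if cron.get("payload", {}).get("kind") != "agentTurn":
        return delivery

    channel = delivery.get("channel")
    cfg = live_channels.get(channel)
    if channel and cfg is not None and cfg.get("enabled", True):
        return delivery  # stamped channel is still live

    # Legacy cron, removed channel, or disabled channel: clear it and re-resolve.
    delivery.pop("channel", None)
    return resolve({**cron, "delivery": delivery})
```

Because the function works on a copy, re-enabling a channel in config is enough for the next fire to use the original stamp again.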

Tier 3: enqueue to main chat

If every channel in the chain is unavailable, or the run errors before any channel is reached, the result is enqueued as a system event to the agent's main chat session. The user sees it as a chat message in the agent's main conversation on the dashboard — inline in the thread, not a toast or a separate notifications list.

The trigger is narrow: Tier 3 fires only when the run produced textual output that wasn't delivered. Heartbeat-only acks don't trigger it, so the main session doesn't fill with empty cron pings.
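
The narrow trigger can be read as a small predicate. The sketch below is an assumption about its shape, not the runtime's code; the function and parameter names are made up for illustration.

```python
# Hypothetical predicate for the Tier 3 trigger; names are assumptions.
def should_enqueue_to_main_chat(output_text, delivered, heartbeat_only=False):
    """Tier 3 fires only for undelivered runs that produced real text."""
    if delivered or heartbeat_only:
        return False
    return bool(output_text and output_text.strip())
```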

The `pazi-set-cron` skill

The fallback is the safety net. The new skill is what prevents most fallbacks from being needed in the first place. Before scheduling any task, the agent now asks the user where they want results delivered. Every time, not just on the first cron of a session. The answer becomes an explicit delivery.channel and delivery.to, so the cron has a contract instead of an inference.

The same pattern lands on the goal flow. The set_goal tool takes a deliveryChannel parameter, the tool description tells the agent to ask before filling it, and the frontend uses that value when materializing the goal's recurring check-ins.

Caveats

Explicit delivery.mode: "webhook" and "none" are never overridden, at creation or at runtime. Caller intent wins.
Tier 1 and Tier 2 apply only to agentTurn payloads. Internal systemEvent crons skip the chain.
delivery.channel: "last" and "webchat" aren't deliverable. The pazi-set-cron skill won't produce them; hand-rolled crons that use them are treated as unset.

On the roadmap

A per-cron delivery health timeline isn't there yet. Today's surface is openclaw logs and the system-event channel.

The runtime resolver also has no memory of the user's original preference. If an auto-detected channel is removed from config, the dispatcher rewrites the stamp to the next best, and that rewrite sticks. Re-adding the original channel won't pull the cron back. A "remember original preference" flag is a candidate refinement.

Pazi is an agent building and management platform built on OpenClaw. You can schedule tasks for your agents and have them work autonomously, 24/7. Build your first agent at pazi.ai.

Why Your Recurring Agent Task Dies Right Before Delivery

zvone187 — Fri, 08 May 2026 15:06:44 +0000

If your recurring agent task keeps dying seconds before it posts results, the bootstrap is eating your timeout. A richly configured agent burns 60 to 120 seconds on memory loading, credentials, and skill discovery before it ever makes its first tool call. Everything after that competes for what's left of the budget, and the last step, usually the Slack post or email your team actually sees, is the one that gets clipped.

The fix

Size your timeout as bootstrap_p95 + work_p95 + buffer, not just the work. Because bootstrap runs before the first tool call fires, your effective work budget is the configured timeout minus that 60 to 120 seconds, not the number you typed into the config. Then reorder the run so the human-facing output goes out before cleanup, and make delivery idempotent on its own, separately from the work, so a retry can fill in whatever the previous attempt missed.

Step-by-step

1. Instrument bootstrap as a milestone

Before you touch the timeout, measure. Log a timestamp when the process starts and another when the first tool call dispatches. The gap between them is your bootstrap. Track p95 across a full week; one-shot readings lie.
[code]
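# Sketch. Swap _Metrics for your StatsD, Datadog, or OpenTelemetry client.
import time


class _Metrics:
    """Stand-in metrics client; replace with your real one."""

    def gauge(self, name, value, tags=None):
        print(f"{name}={value:.1f} tags={tags}")


metrics = _Metrics()
agent_id = "bug-triage"  # placeholder; use your agent's identifier
t_start = time.monotonic()  # monotonic clock, unaffected by wall-clock changes

# Inside your agent runner, right before the first tool dispatch:
def on_first_tool_call(tool_name):
    delta = time.monotonic() - t_start
    metrics.gauge(
        "agent.bootstrap_seconds",
        delta,
        tags=[f"agent:{agent_id}", f"first_tool:{tool_name}"],
    )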

[/code]

On an agent with heavy memory files, many skills, and multiple credentials, bootstrap can easily land between 60 and 120 seconds. If you see under 10, check where you placed the second timestamp.

2. Size timeouts as bootstrap + work + buffer

A 300-second limit for 3 to 4 minutes of expected runtime feels safe, but it isn't once bootstrap eats the first minute. Subtract bootstrap first, then add a cushion.
[code]
timeout = bootstrap_p95 + work_p95 + buffer
[/code]

For a rich agent running a multi-step pipeline, that math often lands at 900 to 1800 seconds, not 300. On OpenClaw:
[code]
openclaw cron edit --timeout-seconds 1800
[/code]

For other schedulers, update the equivalent field: activeDeadlineSeconds on the Kubernetes CronJob's jobTemplate, WorkflowExecutionTimeout (or a relevant StartToClose / ScheduleToClose) in Temporal, or the task deadline in your agent framework. The configured number needs to go up.
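
If you already have a week of bootstrap and work durations, the sizing formula from this step can be sketched in a few lines. The 25% buffer and the rounding to whole minutes are assumptions for the example, not recommendations from any scheduler's documentation.

```python
import math
import statistics


def p95(samples):
    """95th percentile of a list of durations in seconds (needs 2+ samples)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(samples, n=20, method="inclusive")[-1]


def size_timeout(bootstrap_samples, work_samples, buffer_fraction=0.25):
    """timeout = bootstrap_p95 + work_p95 + buffer, rounded up to a whole minute."""
    base = p95(bootstrap_samples) + p95(work_samples)
    return math.ceil(base * (1 + buffer_fraction) / 60) * 60
```

For example, a steady 100-second bootstrap and 200 seconds of work give size_timeout([100] * 10, [200] * 10) == 420, well above the 300 seconds that would look sufficient from the work alone.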

3. Post output before cleanup

Order the steps so the human-facing announcement runs before any expensive cleanup: post to Slack before updating the tracking spreadsheet, and send the summary email before archiving artifacts. If the deadline fires mid-cleanup, your user still saw the result, and the only loss is an internal log row you can backfill on the next trigger.
[code]
# Wrong order
do_work() -> update_tracking_sheet() -> cleanup_artifacts() -> post_to_slack()

# Right order
do_work() -> post_to_slack() -> update_tracking_sheet() -> cleanup_artifacts()

[/code]

It's a one-line reshuffle that keeps the Slack post from getting skipped when the deadline lands mid-cleanup.
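
To see why the order matters, here is a small simulation of the two orderings under the same budget. The step costs and the 300-second timeout with 75 seconds of bootstrap are illustrative numbers, and run_pipeline is a hypothetical helper that subtracts costs from a budget instead of doing real work.

```python
def run_pipeline(steps, budget_seconds):
    """Run (name, cost_seconds) steps in order until the budget is exhausted.

    Simulation only: costs are subtracted from the budget instead of sleeping.
    """
    completed = []
    remaining = budget_seconds
    for name, cost in steps:
        if cost > remaining:
            break  # the deadline fires here; later steps never run
        remaining -= cost
        completed.append(name)
    return completed


# Illustrative costs in seconds; measure your own.
costs = {"do_work": 200, "post_to_slack": 5,
         "update_tracking_sheet": 30, "cleanup_artifacts": 40}
budget = 300 - 75  # configured timeout minus bootstrap

wrong = ["do_work", "update_tracking_sheet", "cleanup_artifacts", "post_to_slack"]
right = ["do_work", "post_to_slack", "update_tracking_sheet", "cleanup_artifacts"]

wrong_done = run_pipeline([(s, costs[s]) for s in wrong], budget)
right_done = run_pipeline([(s, costs[s]) for s in right], budget)
```

With these numbers, the wrong order finishes only do_work, while the right order also gets the Slack post out before the deadline clips the tracking update.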

4. Make delivery idempotent separately from work

Idempotent work is not enough. If last night's run opened a ticket and filed a card but never posted anywhere visible, today's run needs to check delivery on its own instead of inferring it from backend artifacts. One clean approach: tag each announcement with a stable key (run ID, date, source event ID) and look for that key before sending. If the work exists but the key does not, re-announce.
[code]
# Pseudocode. announcement_posted and post_to_slack are stand-ins for
# whatever delivery layer and idempotency store your agent uses.
run_key = f"bug-triage:{today.isoformat()}:{issue_id}"

if not announcement_posted(run_key):
    post_to_slack(message, idempotency_key=run_key)

[/code]

Without this split, a retry that sees the ticket already open will call the pipeline done, and your user never hears about the run.
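
A self-contained version of the same idea, as a sketch: AnnouncementLog is an in-memory stand-in for whatever durable store you use, and announce_once is a hypothetical helper name. The key is recorded only after the send succeeds, so a clipped run leaves the key absent and the retry re-announces.

```python
from datetime import date


class AnnouncementLog:
    """In-memory stand-in for a durable idempotency store."""

    def __init__(self):
        self._keys = set()

    def posted(self, key):
        return key in self._keys

    def record(self, key):
        self._keys.add(key)


def announce_once(log, send, run_key, message):
    """Send the message unless this run_key was already announced."""
    if log.posted(run_key):
        return False
    send(message)        # may raise; the key is recorded only after a successful send
    log.record(run_key)
    return True


sent = []
log = AnnouncementLog()
key = f"bug-triage:{date(2026, 5, 8).isoformat()}:issue-42"

first = announce_once(log, sent.append, key, "Triage done: 1 new issue")
retry = announce_once(log, sent.append, key, "Triage done: 1 new issue")
```

Here the first call sends and the retry is a no-op, because delivery state is tracked by its own key rather than inferred from the ticket or the sheet.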

How to verify

Look at your bootstrap metric across the last 7 days. If p95 exceeds 60 seconds and your timeout sits under 600, you're at risk. Confirm by triggering the job manually and tailing logs: the gap between process start and the first tool invocation is your real overhead, and everything after competes for whatever remains of the budget.

For the ordering change, scan your agent's code or prompt and make sure human-facing output precedes any logging, tracking, or archival step. For idempotent delivery, invoke the job twice in a row with identical inputs. The second invocation should resend the message if the first one was clipped, and skip the send if the first one already delivered. It should never bail silently just because the backend work exists.

Why this happens

Bootstrap scales with workspace size, so more memory files, skills, and credentials mean more setup cost before any real progress. Teams usually pick the timeout once, around the time the agent ships, and then keep adding memory files, skills, and credentials over the next three months. The number that felt generous is now clipping whatever runs last, and because that last step is usually the one your team sees, you notice only when somebody asks why Monday's report never arrived.

Build your first agent at pazi.ai.

This pattern came from a bug-triage agent at Pazi, powered by OpenClaw.

AI agent best practices: 7 rules from running them at Pazi

zvone187 — Thu, 07 May 2026 15:35:23 +0000

TL;DR. Most teams quit on AI agents early because the output is rough at the start, before the agent has been corrected enough times to learn the job. Onboarding an AI agent is closer to managing a new hire than configuring a tool, and the loop is what does the work. These are the seven rules that get you there.

What an AI agent actually is

The OpenAI Agents SDK guide defines agents as "applications that plan, call tools, collaborate across specialists, and keep enough state to complete multi-step work." An agent has a job, has tools, and runs through a loop until it's done.

I've been running my Pazi agents for a couple of months now and almost every early failure I've watched, mine and from teams I talk to, had nothing to do with the model. They were onboarding failures. Here are the seven rules I'd give anyone bringing on their first agents.

Why most AI agent rollouts fail before they're trained

Most rollouts fail because you treat the agent like a tool that arrived broken. The difference is that tools come ready to use the moment you plug them in, while agents arrive without any context for the work, so the early output looks rough until you put the corrections in.

The Anthropic engineering team, writing about how they shipped their multi-agent research system to production, named the failure modes plainly: "Early agents made errors like spawning 50 subagents for simple queries, scouring the web endlessly for nonexistent sources, and distracting each other with excessive updates." That's the team that built the model describing what their own early agents did wrong. The lesson isn't that the agents were broken. It's that the scaffolding around the model needed work, and that work looks more like onboarding than configuration. Teams who don't internalize that decide AI agents "don't work yet" and walk away from a tool that hasn't been trained.

Rule 1: Onboard one agent at a time

Assigning multiple agents tasks in parallel doesn't multiply throughput. It multiplies debug surface area. Ten workflows on day one means ten half-trained agents producing ten streams of half-broken output, all needing attention at once. The work doesn't parallelize because the bottleneck is the correction loop, and the correction loop runs through you.

Pick one repetitive task. Sit with the agent on it through a few rounds of corrections, where you fix what it got wrong and feed it the context it was missing until the runs come out clean. Then move to the next task.

There's a cost angle too. Anthropic's own data: "agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats." More agents in flight isn't free leverage; it's compounding cost on top of compounding correction debt.

Rule 2: Treat the agent like a new hire, not a tool

The mental model decides what happens in the first session. If you've framed the agent as a tool, rough output reads as a broken tool and you abandon it; if you've framed the agent as a new hire, rough output reads as a teaching moment and you stay in the loop long enough for the agent to learn the job.

Anthropic puts the underlying mechanism well: "Each subagent also provides separation of concerns, distinct tools, prompts, and exploration trajectories, which reduces path dependency and enables thorough, independent investigations." In the new-hire frame, each agent shows up with a defined job and its own way of approaching it. You onboard them the same way you'd onboard a new colleague who joined last week, with shared context and a few rounds of "this isn't quite right, here's why."

I keep catching myself treating Pazi agents like real colleagues, and that mental model is what's actually let them get good at the work.

Rule 3: Point at sources instead of re-explaining in prompts

When the agent doesn't know something, the instinct is to write a longer system prompt, but then you rewrite the same explanation for every new agent, and it drifts a little each time you touch it. Writing the context into a file the agent can read on its own, and pointing the agent at that file, scales better: the same file works for every agent that needs the same context, regardless of which platform you're on.

Anthropic's broader take, from "Building effective agents," lines up with what I've seen in practice: "the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns." A markdown file an agent loads on demand is about as simple-and-composable as it gets, and the leverage compounds. In their multi-agent research, rewriting a single tool description produced "a 40% decrease in task completion time for future agents using the new description." Small docs fix, big downstream effect.

Rule 4: Build specialist agents, not a mega-prompt generalist

The opposite mistake from rule 1 is cramming every job into a single mega-prompt that's supposed to do everything from competitive research to content production to strategy work, all from one context. The surface area gets too large to verify, the context stays crowded, and when something breaks you can't tell which of the responsibilities failed.

Specialists work because each agent's context stays focused on one job, which keeps the skill set small enough to actually evaluate and lets failures trace back to a known source. Split the work into role-scoped agents that each own a narrow definition of done, instead of one agent whose responsibilities sprawl.

The numbers back the pattern. Anthropic's internal evaluation: "a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval." The split nearly doubled the score with the same lead model, though not on the same token budget (multi-agent runs cost far more tokens, as noted under rule 1). That makes specialization an architectural choice rather than a marginal optimization.

Rule 5: Treat rough early output as the work

Most agents get abandoned early because the output is rough, but rough output is the entire point of those first runs. The agent doesn't know what your version of the task looks like yet, and the only way it finds out is by being corrected on real work.

Each cycle has the same shape: you correct the run, fill in whatever context the agent was missing, and the next run gets a little closer. After enough cycles the runs come out clean and you drop out of the loop. Most reports that "agents don't work" come from people who quit before that point, because the output looks broken when you haven't yet shown the agent what good looks like. The fix is in the feedback loop, not the model, and it's solvable in hours rather than weeks.

Rule 6: Start on repetitive operational work, not strategy

Repetitive operational tasks have a clear right and wrong. You can tell in seconds whether a blog draft matches the brand voice or whether the weekly report has the numbers right, so corrections feed back fast and the loop closes fast.

Strategic tasks have fuzzy success criteria. "Should we enter this market" doesn't have an answer that can be scored in a session or a month. The feedback loop is months long, and a months-long loop can't be trained on. Start where the signal is fast (anywhere the work has a daily or weekly cadence and a clear definition of done) and push into fuzzier work later, after the management loop is proven out.

Rule 7: The handoff test, running cleanly without you

One good output isn't the test, because plenty of agents produce one good output and then drift the next time the input shifts. The real test is whether the agent has been running for a meaningful stretch (long enough that you forget you're not the one doing the work) and the output is still clean.

Until that's true, you're still in the loop with an agent doing some of the typing, even if it looks like the work has moved off your plate. Once the handoff test clears, your attention moves on to the next thing the agent doesn't know how to do yet, and that's when onboarding ends.

How AI agent onboarding compounds: from one agent to a working team

Onboarding takes days, not an afternoon, because the correction loop runs that long, and the agents that come out of it run the work you no longer want to think about.

The compounding shows up after the first specialist clears the handoff test. The next agent gets onboarded faster because most of the working context already exists in whatever you built for the first one. After a few cycles, you're not really "onboarding" anymore. You're hiring into a team that already has working norms.

Start with one task on one specialist agent. Run rule 1 through rule 7. When that agent clears the handoff test, bring the next one in alongside it. Within a few weeks, the team is doing real operational work and you're spending your time on the things only you can do.

That's what it feels like to manage an AI team. Go build one at Pazi.

Why your next competitor is a small team with AI agents

zvone187 — Wed, 06 May 2026 15:09:54 +0000

The standard business school answer is that the most money lives in the largest companies. Bigger capital, bigger spend, bigger headcount, bigger moat. That answer has been right for two centuries. I think it has about five years left, ten if it gets lucky.

The shift is already starting. Small teams, sometimes one person, are running operations that used to require a few hundred. Not "doing the work of a few hundred people" in the slide-deck sense. Actually shipping product, serving customers, compounding. The reason this is happening is simple. Operations have been the moat for a long time, and operations are the first thing that gets commoditized when you can hire an agent the way you used to hire a junior.

This isn't an "AI changes everything" post. Capital-intensive things still require capital. Regulated things still require regulators. If you're trying to fab a chip or run a bank in your garage, the next ten years will not save you. But if your company's moat is that you have more people doing more operations than the next company, that moat has a clock on it now. The category-defining competitor in your market over the next five years is more likely to be a 4-person team than a 4,000-person enterprise. That is the shift.

Two centuries of company size

To understand why this is happening, walk back through the last 200 years. The shape of the most valuable companies changes every 50 to 75 years, and every time it changes, the reason is the same: whatever used to be scarce gets commoditized, and whatever was the bottleneck moves somewhere new.

[Figure: The shape of the most valuable companies across three eras: industrial-era textile mill, internet-era office tower, and AI-era precision assembly.]

In the 19th century and most of the 20th, the biggest companies were factories of tens of thousands. Capital and labor in lockstep. To build a single pair of pants you needed cotton sourcing, a mill, dye, looms, cutters, sewers, finishers, packagers, a warehouse, a distribution network, retail. None of those steps could be skipped, and each step needed people. The companies that won were the ones that could afford to vertically integrate the whole stack and then run it at scale. Headcount was the moat because each step of production was a manual operation and you couldn't get the operation without the people.

The internet changed that, slowly at first and then quickly. Suddenly a 500-person team could serve thousands of customers, then millions. Software started doing what people used to do: filling out forms, posting transactions, tracking inventory, answering customer questions. Distribution stopped being a thing you bought with capital and became a thing you addressed with code. The shape of the most valuable companies shrank. The Fortune 500 of the late 20th century had names like GM and US Steel, with hundreds of thousands of employees each. The breakout companies of the early 21st century have names like Stripe, Shopify, and Airbnb, where the per-employee output is an order of magnitude higher because most of the operations are software, not headcount. Teams of 50 to 1,000 people started to look normal for billion-dollar companies. WhatsApp had 55 employees when Facebook acquired it for $19 billion in 2014.

We're at the next inflection now. You can hire an agent, tell it what to do, and it will do it. Not always perfectly, not always without supervision, but enough that operations are commoditizing the same way distribution did 25 years ago. The work that used to take a team starts to take a config file.

Operations were never the moat. Scarcity was.

Every era's moat was the scarce thing. In 1880 it was machinery and labor. In 1995 it was distribution. In 2010 it was a sales team that could close. None of them were intrinsically valuable. They were valuable because they were rare.

Operations have been rare for 25 years. Hiring 50 customer success reps was hard. Hiring 50 marketers was hard. Building a content engine that shipped daily across X, LinkedIn, the blog, Reddit, podcasts was so hard that most companies just didn't do it. The companies that could afford to do it had a real advantage.

That rarity is collapsing. The work still needs to happen. But the per-unit cost of doing it is dropping toward zero. McKinsey estimates that current AI could technically automate activities that absorb 60 to 70 percent of employees' time today. That's a ceiling, not a committed gain. But it tells you which direction the floor moves. What's scarce now is the idea, the taste, and the speed to ship. A small team has more of all three than a big team does. Every approval layer is a friction surface, and a team of 1,000 has a lot more friction surfaces than a team of 4.

When something is a commodity, you don't pay enterprise prices for it. That's the shift. Not AI replacing humans, not founders firing half their team and replacing it with chatbots. Just operations getting cheap, the way distribution got cheap 25 years ago, and a new shape of company emerging on the other side.

What a small team with AI agents actually looks like

We run Pazi this way. Not as a thought experiment, as the actual operating model. We are not the only ones. Stripe disclosed that more than 700 AI-agent startups launched on its platform in 2024 alone. The shape is becoming the default. The marketing function, the dev-ops function, the content function, the analytics function, the creator outreach function, all of those are running with agents in the loop, talking to each other in shared chat threads the same way a team of humans would.

The agents are not perfect. They mess up. They confuse one thread for another. They sometimes deliver work twice or skip a step. We've written about the kinds of failures we hit and the way we catch them: silent failures in agent systems, scheduled tasks that don't deliver. None of this is magic. The work that previously required a person is now mostly happening without one, and when it breaks the failure is recoverable.

What's interesting is the shape that emerges. There's no clear boundary between "the human team" and "the agent team" in practice. There's a shared set of channels where work gets done. People assign work, agents do it, agents assign work to other agents, humans review when judgment is needed. The same review and approval flows that you'd build for a 50-person team work, just with most of the names being not-people. The hardest part isn't the agents themselves, it's getting the orchestration right, the handoffs, who owns what, what happens when something fails.

[Figure: A small team with AI agents looks like a peer network of equals passing work to each other, not a hierarchy.]

That last part is where the next few years of platform competition will live. The agent isn't the moat. Anyone can buy a model. The moat is the operating layer, the thing that lets ten agents collaborate, recover from each other's mistakes, and ship work end-to-end. That's what we're building Pazi for.

The honest counter-argument

I don't want to over-claim. There are real reasons big companies will still exist five and ten years from now, and the inversion I'm describing has real limits.

Capital-intensive industries don't get smaller because of AI. You still can't run TSMC out of a garage. Semiconductor fabs, container shipping, oil refining, automotive manufacturing, all of these have physical, capital, and supply-chain constraints that headcount commoditization doesn't touch. The big companies in those industries will keep being big.

Regulated industries don't either, at least not on the same timeline. Banks, insurance carriers, pharma, defense contractors, all carry compliance, audit, and risk overhead that is itself headcount, but it's headcount you cannot just replace with an agent without regulators noticing. Some of this work will get more automated. The companies will probably get smaller. But the structural minimum is higher than in software.

And distribution lock-in is still real, even though it's eroding. If you're trying to compete with Google in search or with Amazon in retail, you're not going to win because your team is 4 people and theirs is 400,000. Network effects compound on the side of the incumbent for as long as the network exists.

So the argument isn't "small teams will eat every category." It's narrower than that. In any market where the existing competitors are big mostly because operations and headcount are the moat, the next competitor is probably going to be a small team with agents. That covers a lot of markets: marketing services, customer support, most B2B SaaS in the long tail, professional services, content production, sales operations. The economy is mostly these things now, and they're all about to get smaller and faster.

Even where the small-team thesis doesn't fully bite, incumbent companies are statistically expiring. The average tenure of a company in the S&P 500 has fallen from 61 years in 1957 to under 20 today. The next index is being assembled right now.

What this means if you're operating now

If you're a founder, the implication is direct. The right team size for an early-stage company is smaller than it was three years ago, and shrinking. You don't need a head of growth, a head of content, a head of community, a head of partnerships. You need one person who can orchestrate agents across all four. Hire for taste and judgment, not for execution capacity. Execution is the thing the agents are going to do.

Inside a bigger company, the implication is more uncomfortable. The functions that look most secure today, the operations-heavy ones with the biggest teams, are the most exposed. Marketing, customer support, content, analytics, parts of sales. The smart move isn't to wait and see, it's to start running the team the way a small team would run, today, with the agents you already have access to. Most of the work to "AI-enable" a function is just rewriting the operating model, not adopting a new tool.

Closing

A 19th-century textile mill: five thousand people in a building, each doing one part of one operation, coordinated through a hierarchy of foremen and managers. That was the most efficient way to make a pair of pants. Then it wasn't.

The most valuable companies in the world today still look more like that mill than we usually admit. Most of the people in them are doing pieces of operations that, in a different shape of company, wouldn't need a person at all. The shift toward smaller teams isn't a prediction. It's already happening, it's accelerating, and the question for the next five years isn't whether it happens but who builds the operating layer that makes it run.

The next category-defining company in your market will be a small team with the right ideas, the right taste, and the right agents.

That's the bet I'd take.

Pazi is the operating layer for that small team. pazi.ai.

5 silent failure modes in production AI agents (and how we instrument for them)

zvone187 — Tue, 05 May 2026 19:38:15 +0000

AI agents fail differently than apps. The failure rarely lives in the work itself. It lives in the seams: the delivery step, the tool call, the inbound routing decision, the bootstrap that ate the budget. None of those surface as exceptions, so APM dashboards say "green" while the user sees nothing. Here are five failure modes that show up that way, and how we instrument or defend against each.

1. Crons that "succeed" but never deliver

The cron framework doesn't know what the user received; it knows what the agent reported per-run. Our runtime persists lastDeliveryStatus as a three-state field ("delivered" | "not-delivered" | "not-requested"), but those states are the agent's self-report. A run that creates its side effects and then runs out of budget before the announce step still serializes the side effects as done. We saw this concretely with one of our bug-triage crons: on a 300-second timeout with about 75 seconds eaten by bootstrap, the agent successfully created a GitHub issue, opened a Jira ticket, and updated a sheet, then hit waitedMs=298401 and got cut before the Slack announce step ever ran. The framework recorded the run as delivered, but no Slack message was ever sent.

Our Sentry log transport matches cron: ... ERROR lines from the cron service and forwards them as Sentry events tagged with the component, the job id, the cron name, the run id, and the error count, so a timeout that used to be a buried log line becomes an event you can search by job. The deeper fix from that same incident isn't observability though, it's sequencing: user-facing announces have to run before cleanup so the announcement gets budget priority. We come back to that in failure 5.

2. Tool calls that 4xx silently

When a tool wrapper returns a generic empty string on a non-2xx, or a soft "operation could not be completed" message that doesn't look like a real error, the model has no signal to act on. It reads the response, treats it as a valid no-op, and moves on to the next step. The end-of-run summary says everything succeeded, the audit trail shows the tool was called, and nothing in either trace says the call actually failed.
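
One way to give the model a signal is to have the wrapper refuse to return anything that looks like a result on a non-2xx. The sketch below is a generic illustration, not our adapter: ToolCallError and tool_result are hypothetical names, and the error text borrows the "[tools] NAME failed: reason" shape only for readability.

```python
class ToolCallError(RuntimeError):
    """Raised when a tool's HTTP call returns a non-2xx status."""


def tool_result(tool_name, status, body):
    """Return the body on 2xx; raise a descriptive error otherwise."""
    if 200 <= status < 300:
        return body
    # An empty string or a soft message here would read as a valid no-op.
    raise ToolCallError(f"[tools] {tool_name} failed: HTTP {status}: {body[:200]}")
```

The runtime can then turn the exception into an explicit error result for the model, instead of a string it will treat as success.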

What we changed is making those failures loud at the runtime layer. The runtime now matches [tools] NAME failed: reason … ERROR patterns from the tool adapter and forwards them to Sentry as captureException events tagged with the tool name, which means a tool's failure rate becomes a metric you can see at a glance instead of something you'd have to grep debug logs for. We also promoted the Slack send-side silent-drop logs from verbose to info, which catches the boring-but-important half of this: rate limits and permission errors at the send boundary that used to be invisible unless you knew to look.
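
A minimal sketch of the wrapper side of that idea, assuming a tool adapter that sees the HTTP status and body of each call. ToolCallError and unwrap_tool_response are hypothetical names for illustration. The error text mirrors the [tools] NAME failed: reason shape mentioned above, but this is not our adapter's code.

```python
class ToolCallError(Exception):
    """Raised when a tool's HTTP call returns a non-2xx status."""

    def __init__(self, tool: str, status: int, body: str):
        # Truncate the body so one bad response cannot flood the logs.
        super().__init__(f"[tools] {tool} failed: HTTP {status} {body[:200]}")
        self.tool = tool
        self.status = status


def unwrap_tool_response(tool: str, status: int, body: str) -> str:
    """Return the body on 2xx; raise instead of handing the model an empty string."""
    if 200 <= status < 300:
        return body
    raise ToolCallError(tool, status, body)
```

The point of raising is that both the model and the log pipeline get an unambiguous failure signal, so a 403 can no longer pass as a valid no-op.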

3. Channels that suppress inbound messages without telling anyone

When an agent stops responding to DMs, the bug is almost never in the agent. The inbound handler sits between the platform and the agent and decides for every event whether to route it or drop it, and most of its drops are the right call: a bot self-mention shouldn't loop, a message edit shouldn't trigger a fresh run, a thread the agent isn't configured for shouldn't get a reply. The problem is that the handler doesn't tell anyone when it drops something, so a wrong drop and a successful "agent is healthy" look identical from the outside. The user waits, the runtime stays green, and there's nothing in the trace that says a message was even seen.

We changed that with two pieces of structured visibility on the inbound side. The fifteen silent-drop log lines that used to live at verbose now emit at info, so suppressions show up in normal runtime logs without anyone flipping a debug flag. On top of that, every suppression goes through a structured event transport that tags it with one of fourteen canonical reason codes (no-mention, channel-not-allowed, dm-not-authorized, and the rest) alongside the original log line. When someone says "the agent missed my DM," the answer is in those events: grep the message id, get a definitive routing reason, stop guessing.
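
To make the "grep the message id" step concrete, here is a rough sketch of what a structured suppression event could look like. The three reason codes are the ones named above. The event name, the field names, and both helper functions are assumptions for illustration, not our actual transport.

```python
import json
from enum import Enum
from typing import List, Optional


class DropReason(str, Enum):
    # Three of the canonical codes; the rest are omitted in this sketch.
    NO_MENTION = "no-mention"
    CHANNEL_NOT_ALLOWED = "channel-not-allowed"
    DM_NOT_AUTHORIZED = "dm-not-authorized"


def suppression_event(message_id: str, reason: DropReason, log_line: str) -> str:
    """Serialize one suppression as a JSON line that keeps the original log text."""
    return json.dumps(
        {
            "event": "inbound.suppressed",  # hypothetical event name
            "message_id": message_id,
            "reason": reason.value,
            "log": log_line,
        },
        sort_keys=True,
    )


def find_reason(events: List[str], message_id: str) -> Optional[str]:
    """Return the routing reason recorded for a message id, if it was suppressed."""
    for line in events:
        record = json.loads(line)
        if record["message_id"] == message_id:
            return record["reason"]
    return None
```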

4. Reasoning leakage in Slack threads

A different shape of silent failure: the agent posts a message in a Slack thread that reads Now calling message(action=send, channel=#alerts) to post the alert, and the alert never goes anywhere because the model narrated the tool call as plain text instead of issuing it. From the user's side it looks like the agent did the work, from the runtime's side no side-effecting tool ran in that turn, and the model itself treated the narration as the work and continued.

This one isn't catchable as an exception, so two layers defend against it at different points in the pipeline. At the prompt layer, the operating contract includes a rule that says "Default: do not narrate routine, low-risk tool calls (just call the tool)." Between tool calls the agent should either deliver a polished user-facing message or stay silent, never narrate the call. At the delivery layer, the channel dispatchers strip reasoning tags before any message lands in Slack, Discord, or Telegram, so even when the model narrates internally the surface the user sees stays clean. We don't yet have an observability backstop for this. No runtime check scans assistant turns for tool-call syntax and fires an event when the pattern hits, and that's a future hardening item we know we owe.
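
For a sense of what that future backstop might look like, here is a rough sketch. It is not something our runtime does today. The regex is a heuristic I am assuming for illustration (a known tool name followed by parenthesized key=value arguments), and narrated_tool_calls is a hypothetical helper.

```python
import re
from typing import List, Set

# Heuristic: an identifier immediately followed by "(key=value ...)".
NARRATED_CALL = re.compile(r"\b([A-Za-z_]\w*)\(\s*\w+\s*=[^()]*\)")


def narrated_tool_calls(assistant_text: str, tool_names: Set[str]) -> List[str]:
    """Return names of known tools that appear as call-like text in a message."""
    return [
        match.group(1)
        for match in NARRATED_CALL.finditer(assistant_text)
        if match.group(1) in tool_names
    ]
```

A check like this would only flag candidates. Pairing it with "no side-effecting tool ran in that turn" would be needed to cut down false positives.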

5. Bootstrap latency eating the timeout

Bootstrap is real wall-clock time on the same budget as the work, and treating it as free overhead is how a 300-second timeout quietly becomes a 225-second one. In the bug-triage cron incident from failure 1, memory load, credential resolution, and skill scan together took about 75 seconds before the agent did anything productive. The work expanded to fill what was left, and the user-facing announcement, always last in the chain, lost the race when the runtime cut the run at 298 seconds. Nothing threw, so the completion handler logged success normally. Nothing logged that the final tool call had been cut off mid-emission.

This one is a lesson with concrete numbers, not a feature we shipped. Measure your bootstrap from log timestamps the first time it bites, then size your timeout as observed-bootstrap plus observed-work plus a buffer instead of as one flat number. Sequence user-facing announces before cleanup steps so the side effect that actually matters to the user gets the budget priority. The only Sentry visibility on this today is downstream, after the fact: when a cron times out, the resulting error log goes to Sentry tagged with the job id, so the timeout itself is searchable. We don't have any of the standard observability scaffolding for the bootstrap-vs-work split yet. No histogram of bootstrap times, no per-run milestone event, nothing that alerts on a starved work phase. That's how this stays a sizing rule for now, not a tracked metric.
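
The sizing rule is simple enough to write down. In the sketch below, the 75-second bootstrap and 300-second timeout are the numbers from the incident. The 60-second buffer and both function names are assumptions for illustration.

```python
def work_budget(timeout_s: int, bootstrap_s: int) -> int:
    """Seconds actually left for the work once bootstrap has run."""
    return timeout_s - bootstrap_s


def size_timeout(bootstrap_s: int, work_s: int, buffer_s: int) -> int:
    """Timeout = observed bootstrap + observed work + an explicit buffer."""
    if min(bootstrap_s, work_s, buffer_s) < 0:
        raise ValueError("durations must be non-negative")
    return bootstrap_s + work_s + buffer_s


# The incident: a flat 300 s timeout left only 225 s for the work.
remaining = work_budget(300, 75)

# Sized from observations instead, with an assumed 60 s buffer.
suggested = size_timeout(bootstrap_s=75, work_s=225, buffer_s=60)
```

Here remaining comes out to 225 and suggested to 360, which is the gap a flat number hides.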

What this adds up to

The five failures share a fault line: the agent's internal trace says everything is fine, the user's experience says nothing happened, and the dashboard sides with the agent. APM was built for a world where exceptions are the failure signal and a green run means the user got served, but agents don't fit that world. Agent observability has to assume the user is absent, and the instrumentation has to ask whether the side effect actually happened as a first-class question instead of deriving it from whether the run finished.

Our runtime forwards cron and tool errors into Sentry as tagged events, and our Slack inbound handler emits structured suppression events on the same path, so the cron, tool, and suppression failures all become things you can search by name. The other two are harder. Reasoning leakage needs runtime hygiene at the prompt and dispatcher layers, not exception capture. Bootstrap-eats-the-timeout needs budget arithmetic and sequencing, not a Sentry tag. Some silent failures don't resolve to "add another event," and that's part of operating agents in production. There are more failure modes we haven't written up yet.

Pazi is an agent platform that takes production observability seriously — every cron, tool call, and runtime error is captured. Build your first agent at pazi.ai.

I built an autonomous dev team with 3 AI agents that takes a Linear ticket all the way to a pull request

zvone187 — Tue, 14 Apr 2026 14:05:48 +0000

Ever since OpenClaw launched, I've been obsessed with one question: what does software development actually look like when AI agents can collaborate with each other? Not as autocomplete. Not as chat assistants. As actual teammates that coordinate work between them. Here is what my desktop looks like now:

My current dev workflow on Slack

I've been a developer for over 15 years, and I've seen a lot of hype around AI coding tools. Most of them promise the world -- "just describe your app and we'll build it." But in reality, the moment you go beyond a simple demo, things break down. From my experience building GPT Pilot, I learned that the biggest problem isn't that AI can't write code -- it's that a single agent gets stuck in one way of thinking and can't course-correct.

So I took a different approach. Instead of one super-agent that does everything, I built a system of three agents that work like a real dev team: a Tech Lead, a Developer, and a QA. They coordinate through Slack, just like human teammates would.

In this post, I'll walk you through the entire setup -- how each agent works, how they hand off work to each other, why multiple agents outperform a single one, and how you can import all of the skills yourself. It's all open sourced here.

Transparency note: I'm a founder of Pazi, which is built on OpenClaw. Everything I'm sharing here works on both.

How the workflow starts

Workflow kickoff

It all begins with a Linear ticket (full example screenshot). When I move a ticket to "To Do" and assign it to the Tech Lead agent, it picks it up automatically and starts a Slack thread. This thread becomes the central coordination hub for the entire ticket -- every handoff, every status update, everything happens here.

I can also kick things off directly from Slack if I prefer. Both entry points work.

The first thing the Tech Lead does might sound boring, but it's actually crucial: it sets up the environment.

Parallel environments with worktrees

Each worktree is hosted on a separate port so I can test any implementation myself

What was really important for me was enabling parallel work. If you have multiple tickets being implemented at the same time, each one needs its own isolated environment. Otherwise, you're constantly dealing with port conflicts, branch collisions, and general chaos.

So here's what happens: for each ticket, the agent creates new git worktrees for both of our repositories (we have a separate frontend and backend). Each worktree gets its own set of ports. The agent spins up the application on those ports and configures Nginx to expose the frontend so I can access it from a browser.
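
Here is a rough sketch of the bookkeeping behind that step. The base port, the two-ports-per-ticket layout, the branch naming, and both function names are assumptions for illustration, not what our skills actually use. The only real command is git worktree add, and the sketch builds it without running it.

```python
from pathlib import PurePosixPath
from typing import Dict, List

BASE_PORT = 4000       # assumption: first port reserved for agent environments
PORTS_PER_TICKET = 2   # one for the frontend, one for the backend


def ports_for(slot: int) -> Dict[str, int]:
    """Give each ticket slot its own non-overlapping pair of ports."""
    start = BASE_PORT + slot * PORTS_PER_TICKET
    return {"frontend": start, "backend": start + 1}


def worktree_command(repo_dir: str, ticket_id: str) -> List[str]:
    """Build (not run) the git command that adds a worktree next to the repo."""
    repo = PurePosixPath(repo_dir)
    target = repo.parent / f"{repo.name}-{ticket_id}"
    return ["git", "-C", repo_dir, "worktree", "add",
            "-b", f"ticket/{ticket_id}", str(target)]
```

Running the same two functions once for the frontend repo and once for the backend repo gives each ticket its own branches, directories, and ports.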

This matters a lot because at the end of the process, a human always needs to test the result. Having each ticket running on its own port means I can just click a link and see the implementation -- no need to check out branches, install dependencies, or figure out which port to use.

QA creates the testing plan BEFORE any code is written

QA report

Once the environment is set up, the Tech Lead tags the QA agent in Slack: "Hey, can you create a testing plan for this ticket?"

This is a deliberate design choice and, I think, one of the most important ones. Normally, when you let an AI agent just start coding, it writes the code and then tries to test the code it wrote. The problem? It tests its own implementation rather than testing from the user's perspective. It knows what it coded, so it writes tests that confirm what it coded -- not tests that verify the feature actually works for the user. Here is what that plan looks like (btw, yes, we made our agents build HTML reports - they are so much nicer to look at than just having a Linear/Slack wall-of-text comment). You can click on each test to see details.

By having the QA agent create a testing plan purely from the task description -- before any code exists -- we ensure the tests are about user behavior, not implementation details.

The testing plan gets posted to Linear so there's a record, and then the QA tags the Developer in Slack: "Hey, can you kick off the implementation?"

Why multiple agents outperform a single one

Agents hyping each other after a few iterations of fixing an issue

Ok, this is the part I'm most excited about. Bear with me.

In theory, a single Claude Code session should be able to do everything: read the ticket, plan the implementation, write the code, test it, iterate until it works, and create a PR. Why wouldn't it? It has the context, it has the tools.

But from our observation, it just doesn't work that way. The biggest problem with LLMs in coding is that once they go down a path, they get stuck. The entire conversation history is in the context, and it becomes really hard for the model to step back and think from a completely different angle. It's like asking someone to critique their own essay while they're still writing it -- the train of thought is too strong.

This is actually why we have code reviews in real teams. Not because the original developer is bad, but because a fresh pair of eyes -- without the baggage of the implementation journey -- sees things differently.

So here's what we do for the implementation:

Dual planning -- We spin up both Claude Code and Codex separately to create implementation plans. Neither knows about the other's plan.
Cross-review -- Each agent reviews the other's plan. Again, fresh context -- no prior train of thought.
Synthesis -- A final agent looks at both plans, all the reviews, all the pros and cons, and creates the best plan from both perspectives.

The key insight: whenever a plan is created or reviewed, the agent doing the review doesn't have the previous agent's chain of thought in its context. It's looking at the output with completely fresh eyes.
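
The three steps can be sketched as plain function calls. This is a simplified illustration, not the actual skill code: Agent is a hypothetical stateless callable (each call stands for a brand-new session), and the prompt wording is made up.

```python
from typing import Callable

# Hypothetical: takes a prompt, returns text, and keeps no history between calls.
Agent = Callable[[str], str]


def plan_with_fresh_context(task: str, planner_a: Agent, planner_b: Agent,
                            synthesizer: Agent) -> str:
    # Dual planning: each planner sees only the task, never the other's plan.
    plan_a = planner_a(f"Task:\n{task}\n\nWrite an implementation plan.")
    plan_b = planner_b(f"Task:\n{task}\n\nWrite an implementation plan.")

    # Cross-review: the reviewer gets the finished plan, not the thinking behind it.
    review_of_a = planner_b(f"Task:\n{task}\n\nReview this plan:\n{plan_a}")
    review_of_b = planner_a(f"Task:\n{task}\n\nReview this plan:\n{plan_b}")

    # Synthesis: a third agent sees both plans and both reviews.
    return synthesizer(
        f"Task:\n{task}\n\nPlan A:\n{plan_a}\n\nReview of A:\n{review_of_a}\n\n"
        f"Plan B:\n{plan_b}\n\nReview of B:\n{review_of_b}\n\n"
        "Produce the final plan."
    )
```

The design choice that matters is that only outputs are passed along. Nothing in a prompt carries the previous agent's conversation.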

You can think of it as asking for a second opinion from a doctor who doesn't know what the first doctor said -- they'll look at the symptoms independently rather than anchoring on the first diagnosis.

Implementation and PR

Diagram of the entire implementation flow

Once the final plan is ready, the actual coding starts. A Claude Code agent takes over with access to Playwright and the Figma MCP. With Playwright it can interact with the running app to verify things work. With Figma MCP it can reference the original designs.

The agent implements the feature, runs the tests, iterates, and when it's done -- it creates a pull request (full PR screenshot).

That PR then goes to a separate Codex agent for code review. This is another fresh-context moment: the reviewing agent has never seen the implementation conversation. It only sees the PR diff and the original task description.

They go back and forth -- the reviewer requests changes, the original Claude Code agent addresses them or pushes back with reasoning. Just like a real code review in Slack. After a few rounds, the PR is approved.

QA tests in the browser

Now comes the final verification. The Developer tags the QA agent in Slack: "Hey, can you test this feature based on the initial plan?"

Developer calling the QA to start testing

The QA agent works entirely in the browser. Remember those ports we set up at the beginning? Now they matter. The QA agent navigates to the running instance, clicks through the features, and tests everything from the user's perspective based on the plan it created before any code was written.

If something fails, it goes back and forth with the Developer in Slack -- "this button doesn't work" or "the redirect goes to the wrong page." They iterate until all tests pass.

Here is what the final test report looks like (after 9 iterations between the developer and the QA).

Human review

Once all the automated coordination is done, the ticket gets moved to "Review" in Linear and assigned to a human. At this point, all the heavy lifting is done -- the code is written, reviewed, tested, and the PR is ready.

Final linear comment by the QA agent

The human reviewer can click the link to the running instance (on its isolated port), log in, and test everything manually. No setup needed, no branch checkout, no "it works on my machine." Just click and verify.

The Tech Lead as coordinator

Throughout this entire process, the Tech Lead agent acts as the coordinator. It doesn't write code or test features -- it orchestrates. It tracks the status of each step, handles the handoffs between agents, and makes sure nothing falls through the cracks.

Tech lead ensuring that the final QA report has all screenshots

Think of it as a project manager who doesn't code but makes sure the right person is working on the right thing at the right time.

Why Slack

One thing I didn't expect was how much Slack changes the experience once you move beyond thinking of AI agents as coding tools.

Observability

When you have multiple agents working on a feature, you need to know what they're doing. Most people think you need some fancy dashboard or monitoring tool for that. But actually, the only thing you really need is the conversation between them. Just like with human coworkers -- you open the Slack thread and you can see exactly what happened. Who did what, what decisions were made, what questions came up. The whole history is right there, in a format you already know how to read.

Interaction

The second thing is even more important: it feels natural. When an agent has a question, it literally tags you in Slack. In the middle of the whole implementation, you can jump in. If they're heading in the wrong direction, you redirect them. If they have an ambiguous requirement, you clarify it. You're not switching to some separate tool or reviewing logs -- you're just replying in a thread, the way you would with any teammate.

Developer agent asks a human for a product decision

I gotta say, this is the part that surprised me the most. After a while, it stops feeling like you're using a tool. It starts feeling like you're working with colleagues who happen to be really fast and never take lunch breaks. They tag you when they need you, they update you on progress, and the rest of the time they just get the work done.

What I learned

A few things that surprised me:

Fresh context is everything. The single biggest improvement came from making sure reviewing agents don't have the implementing agent's conversation history. It sounds simple, but it makes a massive difference in the quality of reviews and plans.
Testing plans before code matters. When the QA creates tests after seeing the code, the tests are biased toward the implementation. When it creates tests from just the task description, the tests actually catch real issues.
Parallel environments are non-negotiable. Without isolated worktrees and ports, the whole thing falls apart the moment you try to work on two tickets at once.
Humans are still essential. This system doesn't replace developers -- it handles the routine implementation work so humans can focus on architecture decisions, product direction, and the weird edge cases that AI still struggles with.
Ticket descriptions are more important than ever. When I write vague instructions, the agents do something that technically works but doesn't match what I actually had in mind. When I spend five minutes describing exactly how I want a feature to look and behave -- they nail it. The irony is that the system is fast enough now that the limiting factor isn't AI capability, it's how well I communicate what I want. I personally use Wispr Flow -- I just hit record on my computer and talk for five minutes about how the feature should work, in as much detail as I can. It transcribes everything, I paste it into the ticket, and the results are dramatically better.

Try it yourself

I've open sourced all the skills that make this work. You can import them into OpenClaw or into Pazi and set up the same workflow for your team.

You can find all skills here, with instructions for setting them up on your OpenClaw. Or just go to Pazi, click on the tech team template, and you'll be good to go.

If you try it, I'd genuinely love to hear how it goes -- what works, what breaks, what you'd do differently. Drop a comment or ping me directly.

This is still early. We're iterating on this every day, and there are definitely rough edges. But I gotta say, watching three AI agents coordinate a ticket from start to PR in Slack is something I didn't think I'd see this soon.

How to secure AI-coded (vibe coded) applications

zvone187 — Thu, 25 Sep 2025 16:26:36 +0000

Overview

For the past two years, I’ve been deep in the world of AI, building Pythagora and trying to push it toward building production-ready applications. Recently, things have taken a big turn. With the release of Claude Code with Sonnet 4, we’ve hit a turning point. You can now spin up really complex applications, complete with database operations, third-party integrations, and basically anything you can dream up, just by chatting. It feels like live coding, but with AI at your side.

Initially, my focus while working on Pythagora was on enabling people to build apps by chatting, and on what the UX should look like. Now, my attention has shifted to something more important: how do we actually secure web apps once they’re built? For this, I believe we need a mindset shift. Security isn’t something we can just offload to AI or abstract away—it has to stay firmly in human hands.

At Pythagora last year, we were the target of multiple hacker attacks, both on our system and on the systems of vendors we used. When one of our vendors got hacked, the hackers got our OpenAI API key and stole over $30k worth of tokens.

After seeing how vibe coded apps are getting exploited, like in this case where a Lovable-built app exposed all its users' data to everyone, I wanted to see how we can ensure that apps built with Pythagora are secure.

So, I embarked on a journey to map out all potential vulnerabilities and see how we can enable less technical builders to build completely secure applications.

In this post, I'll share what I've learned about security in no-code apps, but also what we actually did as a team. We’ll look at some real examples of vulnerabilities that pop up, and then I’ll walk you through my solution for making vibe-coded applications truly secure. Secure enough that you could confidently build business-critical systems with nothing but vibe coding.

Web app vulnerabilities

We can segment vulnerabilities into the following categories:

1. Insecure HTTPS connection - today, this is a table-stakes security measure, and most websites have it by default. If you open a website that starts with http:// instead of https://, aside from your browser clearly warning you that this website is insecure, all data you enter on this website may be intercepted by a hacker, allowing them to read everything you entered. In contrast, all HTTPS websites encrypt the data so that, if a hacker intercepts it, they will see just gibberish.

2. Authentication - this is the first crucial part of making a web app secure. Seemingly, all AI coding tools have this covered, but if AI is building something for you and you don't check every single line of code, the question is how to make the authentication 100% secure. Take a look at the screenshot below: if AI hallucinates just one line of code and removes the requireAdmin keyword, you will expose all your customers to any crawler that's scanning the internet.

Imagine you create a page that is accessible by anyone (e.g. the home page) on which you add the names of your teammates. When you decide to deploy this app, you want to remove this page so no one can see who works at your company. When you remove that page, the endpoint is left open, so anyone can still get the employee data.

3. Authorization (access control) - once you are sure that your app cannot be accessed by anyone outside of the company, you must ensure that people within the company don't get access to data that they shouldn't have access to.

What if one user gets access to data that they shouldn't see? For example, in an HR management app, you MUST NOT allow everyone to see an employee's pay slips.

4. Codebase vulnerabilities - there is a whole list of potential codebase vulnerabilities - here are a few of the most important ones:

Compromised libraries - the entire web stack is built on top of libraries (code plugins that other people built), most of which are open sourced. Using open-source libraries is nice, but it also provides hackers with full insight into how the library works. So, a library developer might introduce a vulnerability (e.g. a back door to the system that enables hackers to access any app that uses that library). You might not have anything to do with this library, and you may not even know you're using it (it could be a library that a library you've installed is using), but you still get affected.
Sensitive data exposure - if you're not careful how your backend is structured, it might expose some of your API keys to your users (it doesn't even need to be a hacker). If a user finds your API key, they can accrue a massive bill on the service you're using (e.g. if you're using an OpenAI API key for LLM requests).
Injections - if the codebase is not structured correctly, it can enable input data to be run on the server. The most famous is SQL injection, where your code pastes a user's input directly into an SQL query. If someone enters x'); DROP TABLE users; -- as their email and the server runs INSERT INTO users (email) VALUES ('x'); DROP TABLE users; --'), the second statement deletes your entire users table. A similar case happened in an app coded with Replit.
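
The standard defense against SQL injection is a parameterized query, where the input travels separately from the SQL text. Here is a minimal sketch using Python's built-in sqlite3 module. The users table and the add_user helper are made up for the example.

```python
import sqlite3


def add_user(conn: sqlite3.Connection, email: str) -> None:
    # The ? placeholder passes the value separately from the SQL text,
    # so the input is stored as data and never parsed as a statement.
    conn.execute("INSERT INTO users (email) VALUES (?)", (email,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")

# A hostile input that would break a query built by string concatenation.
add_user(conn, "x'); DROP TABLE users; --")

rows = conn.execute("SELECT email FROM users").fetchall()
```

After this runs, the table still exists and rows holds the hostile string as an ordinary value.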

AI writes A TON of code

Everyone knows that AI writes way too much code - much more than it's supposed to...if it were a human. In theory, yes, AI writes more code than what was asked of it, but that code isn't necessarily bad. It's much better than many developers' code that I've reviewed. It adds a lot of logs, error handling, and extra functions.

Pythagora can easily write 10k lines of code (LOC) in a day of work. This is essentially impossible for humans to review. Especially when LLMs get faster, we're likely looking at producing 50k LOC in a day. So, how do we manage all the new code that's being produced?

It's clear that AI is able to write working code so we need to find a way to live in this new world where it is actually possible to produce 50k working LOC.

We shouldn't look for ways to review all those lines of code or reduce the number of lines, but rather, we should find a way to secure AI coded apps without having to review every single line of code regardless of how much code AI writes.

Hallucinations

In reality, the problem is not the amount of code but rather hallucinations. Imagine that AI misses one crucial line of code. In the case below, by simply removing "requireAdmin", AI exposed all customers of a company to the public.

So, the question is...

How to use vibe coded apps for business

For me to trust an AI coded app, the first requirement is that I need to be 100% sure that my data is safe. That is 100% - not 99%.

When we first started to build internal apps at Pythagora, I felt the anxiety myself. I was super insecure about whether there were any exposed endpoints. I didn't just want to see the app working; I wanted to be 100% sure that our users' data is secure.

The anxiety just grew when someone else, especially less technical people on the team, would build a tool that used users' data, so I started to think about what it would take for me to trust an app that Pythagora wrote.

You want to offload coding to AI, but keep security in human hands

If I could be 100% sure that the data cannot be leaked or mismanaged, I would not need anyone from my team to review the code that AI built and I would approve any tool release that makes sense.

In the sentence above, the key part is "100%" which AI cannot provide, so humans have to provide a guarantee for some part - security.

Over the past few months, I structured a set of security layers that, if provided, would give me the ease to approve a release of a tool within my team.

These layers should ensure that, regardless of what AI writes, our data remains secure without requiring a review of every single line of code.

✅ 1. Isolate the codebase from non-authenticated users

Vulnerable use cases:
- API endpoint ends up being exposed to the world
- AI hallucinates and removes access control from an API endpoint
- A human creates a publicly available page with sensitive data, then removes the page, but doesn't remove the API endpoint

The first step is to isolate our AI coded apps from anyone outside of the team. That means that API requests coming into the codebase already need to be authenticated. A non-authenticated request MUST NOT trigger even a single line of code.

This can be done by enforcing authentication on the infrastructure layer.

Essentially, all vibe-coded tools embed authentication into the AI-generated code, which means that AI can modify the authentication and potentially mismanage it. If it hallucinates in the wrong place, it can expose our entire system to hackers.

In contrast, I wanted to be sure that authentication works regardless of what AI generates. This is where the idea for Pythagora's isolated security system came from. With it, the authentication isn't implemented into the AI generated codebase but in the reverse proxy (specifically NGINX) that sits on the server in front of the applications and authenticates the user before allowing the API request to reach the app. This way, it doesn't matter what happens in the codebase; a user who is not authenticated cannot even reach the app that someone from your team vibe-coded.
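
To show the principle in code, here is a small Python sketch of a gate that sits in front of an app and rejects unauthenticated requests before any app code runs. To be clear, this is not how Pythagora implements it (we do this in NGINX, outside the codebase). It is a WSGI wrapper used only as an illustration, and require_auth and is_valid_token are hypothetical names.

```python
def require_auth(app, is_valid_token):
    """Wrap a WSGI app so requests without a valid token never reach it."""

    def gate(environ, start_response):
        header = environ.get("HTTP_AUTHORIZATION", "")
        token = header[7:] if header.startswith("Bearer ") else ""
        if not token or not is_valid_token(token):
            # Reject here: not a single line of the wrapped app executes.
            start_response("401 Unauthorized", [("Content-Type", "text/plain")])
            return [b"authentication required\n"]
        return app(environ, start_response)

    return gate
```

The wrapped app can contain any number of unprotected endpoints and the outcome is the same, because the check lives in a layer that the AI-generated code cannot edit.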

In other words, Pythagora's security system completely isolates all your AI-coded apps from anyone who is not authenticated.

Here is a video in which I go deeper into the details about how it works.

🟡 2. Transparent data access

Vulnerable use cases:
- Data is accessible to users within the company who shouldn't have access to that data
- AI coded app creates a destructive database query

For me to allow anyone less technical to deploy an app inside Pythagora, I need to be 100% confident about what happens on the backend of this app.

Today, all vibe coded backends are a black box where, if you don't do a full code review of the entire app, you don't know what happens there.

Code review is unrealistic for AI coded apps since you knock out tens of thousands of LOC within a day. The developer reviewer would need another full day to review it. That's why we need to think differently.

We can still give humans the ability to review what the app is doing without going through each and every LOC. For that, we need to see what we want to review in a code review:

API endpoints - which user has access to which endpoints
Database queries - what DB queries are made within which API endpoint
3rd party requests - which API endpoints trigger which 3rd party requests, and what data they send

Here is a backend visualization that enables you to easily view all backend endpoints and understand the functionality of each one.

🟡 3. Codebase vulnerabilities

Vulnerable use cases:
- The library that AI installed gets compromised
- AI creates code that's vulnerable to SQL injection
- AI creates code that leaks an API key

These vulnerabilities are mostly caught by static code analysis. Some systems, like GitHub, scan for them automatically: if you push code to GitHub with an exposed API key or a compromised library, GitHub will raise an alert and not let you make your code public.

This is the part that will come last on the roadmap and will perform the following checks before an app is deployed:

Static code analysis.
Automated pen test scan.
AI codebase scan by Pythagora's security agent, powered by Claude Code. It is looking for the following vulnerabilities:
- Injections (SQL, OS, LDAP) - check for places that enable a user to inject, for example, an SQL query that will automatically be run
- Secret keys exposure - check if the secret keys are propagated outside places where they need to be utilized and if any secret key is being propagated to the frontend.
- Cross-site scripting (XSS) and insecure deserialization - check for any places where a string (e.g. a user name) can be automatically run.
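
As a taste of the simplest of these checks, here is a rough sketch of scanning frontend source text for strings that look like secret keys. The pattern (an sk- prefix followed by a long token) is an assumption chosen for the example and will miss other key formats. It is not the check Pythagora's security agent runs.

```python
import re
from typing import Dict, List

# Assumed pattern for illustration: "sk-" followed by 20 or more key-like characters.
SECRET_PATTERN = re.compile(r"\bsk-[A-Za-z0-9_-]{20,}")


def find_exposed_secrets(frontend_files: Dict[str, str]) -> List[str]:
    """Return the names of frontend files that contain a key-like string."""
    return [
        name
        for name, source in frontend_files.items()
        if SECRET_PATTERN.search(source)
    ]
```

A file that reads the key from server-side configuration passes, while one with the key pasted into the bundle gets flagged.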

Approval process - code review is outdated in the world of AI

The final part of ensuring an application is fully secured involves not only the security measures but also enabling technical leaders to be confident in the software being deployed within their organization. The final step is to set up the approval process, allowing leaders to review the changes made, such as API endpoints and database queries, and sign off on the software release. We still have some time before reaching this point, but it is coming - Pull Requests in GitHub will soon become obsolete in the world of AI.

Conclusion

In this blog post, we explored typical vulnerabilities that web apps suffer from and how we can defend ourselves from these vulnerabilities in times when AI writes so much code that it's impossible for humans to verify every single line of code.

My goal is to start a discussion about what is needed in order to secure AI coded applications and how we can bring vibe coding into business critical use cases.

If you have any thoughts on this, I would love to hear from you either in the comments below or by reaching me at zvonimir@pythagora.ai.

If you still haven't, check out our product, Pythagora. It gives you the power to build astonishing tools, not just demos, within a few hours of chatting with AI. From today, you can deploy your apps securely and be assured that your data is safe.

How We Made Sure Big Companies Can’t Steal Our Code

zvone187 — Thu, 30 Jan 2025 20:00:31 +0000

As a developer, I love open source. I think it’s the backbone of innovation, collaboration, and community-driven progress. But if you’re a small team or a startup, sharing your work as open source comes with risks. Big risks.

When we started building Pythagora in 2023, it was one of the first agentic systems where AI agents work together to create entire codebases, so we wanted to share it with the world to show how it works and inspire others to build complex systems. However, we knew we needed to protect our innovation from being exploited by larger companies.

In this post, I’ll share the challenges we faced while choosing a license, why traditional open-source licenses didn’t work for us, and how we found a solution in the Functional Source License (FSL). If you’ve ever struggled to balance openness with sustainability, this is for you.

The Challenge of Open-Source Licensing

When we started in 2023, we were excited to share our work. Open source had always been part of our DNA, but we had to pause and ask ourselves tough questions:

How do we protect our code from being rebranded and commercialized by big companies?
Which license encourages collaboration without compromising sustainability?
Can we avoid complex, overly restrictive hybrid licenses?

We’ve all seen stories like Elasticsearch vs. Amazon. It’s a cautionary tale for smaller teams and startups using permissive licenses. Initially, we tried GPL, but it felt too restrictive. Licenses like MIT and Apache were too permissive. What we needed was something fair, balanced, and protective.

Why We Chose Sentry & Functional Source License (FSL)

After exploring different licensing options, we came across the Functional Source License (FSL) from Sentry. It’s part of the growing Fair Source movement, which empowers companies to share their core technology responsibly without losing control of their roadmap or innovation.

Here’s why FSL was the perfect fit for us:

1. Two-Year Grace Period

The FSL gives us a two-year buffer where our code is free to explore, use, and modify—but competitors can’t commercialize it. By the time the grace period ends, we’ll have evolved far beyond the initial release, ensuring we stay ahead.

2. Transparency Without Exploitation

With FSL, anyone can:

Read our code.
Use, modify, and share it under simple conditions.
But here’s the key condition: you can’t directly compete with the project.

3. Simplicity

Unlike some licenses that require legal expertise to interpret, FSL is simple and accessible. It strikes the perfect balance for startups like us: encouraging collaboration while protecting our work.

How FSL Works for Startups

For a startup like Pythagora, the Functional Source License is a game-changer. It allows us to:

Share Our Innovation: Developers can freely explore and build on our work.
Prevent Exploitation: We remain protected from big players who might otherwise rebrand and sell our code.
Foster Collaboration: The simplicity of FSL fosters trust and community engagement.

By adopting FSL, we’re aligning with the values of open source while addressing the realities of being a small team in a competitive landscape.

What This Means for Developers

If you’re a developer exploring GPT Pilot, here’s what FSL means for you:

Transparency: You can freely read our code and understand how it works.
Flexibility: You can use, modify, and share GPT Pilot with minimal restrictions.
Fairness: The two-year grace period ensures we can continue to innovate while giving back to the community.

Licensing is one of the hardest decisions for open-source startups, but the Functional Source License (FSL) has allowed us to share our work responsibly while protecting it from exploitation.

We hope our experience inspires other startups to explore Fair Source licensing as a way to balance openness, sustainability, and collaboration.

If you’re curious about GPT Pilot, check out our blog post or explore the codebase.

Pythagora_v1 release reflection – what we learned, current state, next steps

zvone187 — Tue, 01 Oct 2024 16:34:45 +0000

Today, we’re launching Pythagora v1, so we wanted to reflect on our progress in the past year since we released the first version of GPT Pilot (the open source brain of Pythagora) and in the past six months since we released Pythagora beta.

In this post, you’ll read about how we got carried away with hype, thinking AI would be able to build apps much more easily than it really can, how flashy demos deceived everyone, including us, how we switched our focus from the technology to the DX (Developer Experience), what you can build with Pythagora, why we are keeping the core technology open source, and what our next steps are.

History of Pythagora

When we started building Pythagora, our North Star was clear: enable developers to create production-ready applications using natural language. I’ve been a developer for the past 15 years, and my dream is to be able to build applications without having to deal with debugging. So, when I realized how well LLMs can code (you can read about my first experience with coding with ChatGPT in this blog post), I immediately started looking, together with my cofounder, Leon, into how we could automate this process.

After a couple of months of experimentation, we created GPT Pilot. This was in September 2023. The community reacted well, so we continued working on it and soon started fundraising. At this time, we needed to create a roadmap for what we plan to build next year. We ended up building most of the roadmap, but the capabilities we set GPT Pilot to have were missed by a huge margin.

We predicted that, in 6 months (around May 2024), GPT Pilot would be able to build a 30k lines of code app in 2 days!

Learnings from Beta to v1

Not only was it unable to build a 30k LOC app in any amount of time, but it was so hard to use that we needed to do a reality check.

We realized that a bunch of other projects started to come up (most notably Devin) that promised the world while all they did was create a nice, flashy demo.

In the early days, we packed Pythagora with features like automated tests and autonomous code reviews, aiming for maximum autonomy. But reality hit hard. The technology just wasn’t there yet. So, we decided to take a step back.

We removed many components of Pythagora (like automated tests and self code review), refocused our efforts on user experience, and started investigating how people can use Pythagora to build any app they will end up actually using. So, we began introducing more and more human-in-the-loop interactions.

LLMs are still a technology in its infancy, and we must treat them as such. Many people think we can just say two sentences, like “Build me a clone of Reddit”, to AI and let it build the entire app with the infrastructure. In reality, many decisions need to be made about how the app should work. This is exactly what we’re focused on. We want Pythagora to manage the codebase and ask humans whenever a decision needs to be made.

Current state of Pythagora

With today’s release, we believe we’re breaking new boundaries for us and the rest of the world. With Pythagora, people can build apps with up to 5000 lines of code ONLY by writing in natural language. You can see apps built with Pythagora here. All code in all these apps was written exclusively by Pythagora, and humans wrote no lines of code.

In reality, you can build even bigger apps with even 10k LOC. However, above the 5k LOC, the experience gets trickier, but we’re continuously working to push that boundary. Nevertheless, 5k LOC is quite a solid app size, which makes it truly useful. This is our north star – we want Pythagora to build apps that actually end up in production and not in some random folder on a computer.

That is why we added autonomous deployment so you can deploy the created app in one click online and share it with others.

We believe Pythagora is currently one of the most capable tools for building applications through plain English instructions, minimizing the need to write code manually. But hey, if you know of a tool that is able to build apps with this complexity without requiring humans to write a single line of code, we’d genuinely love to hear about it. Drop a comment or ping me directly at zvonimir@pythagora.ai

That said, we truly believe in that common phrase that AI will not replace humans, but humans using AI will. Pythagora is not a fully autonomous AI developer that wanders off into the wilderness of software development and returns with a fully built app – it is a tool that helps people build apps remarkably fast without writing or managing the codebase. You just explain what you want to build and talk to Pythagora, as it constantly asks you questions to understand if it’s going in the right direction.

Managing expectations

First off, let’s temper expectations around zero-shot app creation. While it’s tempting to imagine an AI that builds an entire app from a few sentences, it’s not just about the technological hurdles. Even if the tech gets there—and it likely will in the next few years—is that what people really want?

Imagine working with a freelance developer today to whom you can give a project and wait for them to come back with working software. Now, imagine telling that human developer your app idea in three sentences, and they disappear for a month, returning with a fully built app based on their interpretations. Chances are, it won’t align with your vision. Development is an iterative process filled with decisions that need to be reviewed, discussed, and tweaked.

We believe that humans will always be the decision-makers and creators. Our mission with Pythagora is to offload the grunt work to AI while keeping you in the driver’s seat. Think of humans as a control center operator—a concept echoed by Milo Medin from Benchmark Capital on a 20VC podcast—where AI handles the heavy lifting, but the human guides the direction.

In practical terms, this means creating a seamless user experience where AI does the coding autonomously but reaches out whenever a decision needs to be made. We want to avoid constant context-switching, where you must check in with the AI every few minutes or wait hours for it to return with updates. Rather, we want to enable a constant experience like with any other digital tool where you work with Pythagora for hours or days but get weeks or months of work done.

Looking Forward: The Future of Pythagora

We are very much focused on improving the agentic system, open sourced GPT Pilot, so it can solve more problems and build larger apps faster. However, we are even more focused on how to engage the human, the developer behind the keyboard whose goal is to make their ideas a reality. We are rigorously focused on improving the Developer Experience and making Pythagora a new extension of human creativity.

With that, we will be more focused on building the knowledge base and learning materials that will help everyone become a better builder of software applications. You can watch a 15 minute crash course on Pythagora here:

We are very passionate about building useful software tools to make our and other people’s lives easier, and we want others to be able to do that as well, without having to spend 15 years learning the ins and outs of how to write code.

I almost never had to spend effort on memory allocation, even though it was a requirement for every developer just a decade before I became an engineer. I hope the same happens to writing and managing the countless lines of code in ever-growing codebases.

Conclusion

Launching Pythagora v1 is both a celebration and a stepping stone. We’re proud of how far we’ve come, but we’re even more excited about where we’re headed. The landscape of AI in code generation is evolving rapidly, and while there are challenges, these challenges make our team get up in the morning and march forward to solve them.

We’re committed to bridging the gap between AI capabilities and developer needs, focusing on creating tools that enhance your workflow rather than dictate it. As we continue to develop Pythagora, your feedback is invaluable. Let’s shape the future of coding together. You can sign up for early access to Pythagora v1 here.

Thanks for taking the time to read this, and I wish you happy coding…I mean, building 🙂

PS If you know of any other tools pushing the boundaries in AI code generation, share them! The more we collaborate, the better tools we’ll all have.

What I learned in 6 months of working on a CodeGen dev tool GPT Pilot

zvone187 — Thu, 29 Feb 2024 16:39:45 +0000

For the past 6 months, I’ve been working on GPT Pilot (https://github.com/Pythagora-io/gpt-pilot) to understand how much we can really automate coding with AI, so I wanted to share our learnings so far and how far it’s able to go.

When I started building GPT Pilot, I wrote this blog post on how it is envisioned. Now, I want to share my revised thinking and go into what worked and what didn’t work.

Finally, you’ll see examples of apps that were created with GPT Pilot and how you can create an app with a real AI pair programmer.

What is the idea behind GPT Pilot?

GPT Pilot is envisioned as a real AI developer – not an autocomplete or a chat bot. Rather, it is a developer who creates a plan for how your app or feature should be built and starts coding. It wants to do most of the coding by itself, but when it gets stuck, it needs clarification about the given requirements, or requires a code review, it asks you for help.

Is AI like a junior developer? Or…

I often see CodeGen GPT-4-based tools that say they are building an AI junior developer. Somehow, I’ve always had a problem with that because when I use ChatGPT for coding, it gives me answers and ideas that only a super-senior person could give – something that absolutely no junior dev would even be able to grasp. Still, no LLM can build an app nearly as well as a senior developer can, but the knowledge GPT-4 has about coding is way beyond any junior developer. I would say that GPT-4 has so much knowledge about every part of software development that it’s like the most senior developer in the world, but with the memory of a goldfish.

I picture it as a superhuman robot that just stands in the middle of a room and can only do a single small action at a time, but it cannot combine many actions and work repetitively. You must tell it exactly what it should do next. This is what we’re after with GPT Pilot – we want to create a framework of thinking for the LLM that gets that superhuman robot to work continuously by revising its previous actions, using a feedback loop, and determining what it should do next in order to reach the end goal, which is to build a production-ready application.

In the blog post I mentioned above, I outlined the main pillars on which GPT Pilot was built. But these have changed a bit based on our team’s learnings, so here are the revised pillars:

A human is needed to supervise the AI not only because AI is not good enough but also because you might want to change how something works or looks after it’s implemented. It’s common for a developer or product manager, once they see what an implementation looks like, to decide to change it. Or, you realize there are more edge cases than you initially anticipated and think it’s easier to refactor your current implementation than to fix every issue. The problem is when you finish the entire app and then try to refactor – this is when it becomes much harder because every change will impact all the other features. On the other hand, if you do the refactor before you commit your changes, you’ll be able to proceed with the next features on top of well-written code. This is why it’s crucial for an AI developer to have a human in the loop whenever a task is implemented. This way, the human can review the implementation of each task (just like a code review before merging a PR) before GPT Pilot continues onto the next task. If a human tells GPT Pilot what is wrong, it will be much easier to fix the issues within the task itself. At the same time, the LLM has the context of what needs to be done in the task and what has been done so far.
AI can iterate over its own mistakes. I have a feeling that many people judge ChatGPT’s ability to write code by how well it delivers the first time you ask it to code something. If it doesn’t produce working code, many will think it’s not impressive. In reality, humans almost never write working code on the first try. Instead, you write code, run it, see the errors, and iterate. This is exactly what GPT Pilot enables GPT-4 to do – after it writes code, GPT Pilot can run the code, take the output, and ask the LLM if the output is correct, if something should be fixed, and if so, how.
Software development can be orchestrated. There are many repetitive routines that all developers go through when building an app. One of the routines can be – write code, run it, read the errors, change code, rerun it, etc. Another higher-level one can be – take a task, implement it, test the implementation (repeat until all tests pass), send it for review, fix the issues (repeat until the reviewer approves), and deploy. Many of these routines can be orchestrated if we have an intelligent decision-maker in the loop (like an LLM).
The coding process is not a straight line. When we created the first version of GPT Pilot, we thought it would need to iterate over tasks, implement code, fix it, and move on. In reality, you don’t continuously progress when coding an app – you rewrite your code all the time. Sometimes, you refactor the codebase because, after the initial implementation, you realize there is a better way to implement something. Other times you do it because of a change in requirements. As I mentioned in the first pillar, after you see that a solution isn’t working, you sometimes need to roll back a bunch of changes, think about an alternative solution to the problem, and try solving it that way. To make GPT Pilot, or any other AI developer, work at scale, it needs to have a mechanism that will enable it to go back, choose an alternative path, and reimplement a task.
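
The write, run, read the errors, and fix routine from the pillars above can be sketched in a few lines of Python. This is not GPT Pilot's actual implementation: `iterate_until_it_runs` and the `ask_llm_for_fix` callback are hypothetical names, and the callback stands in for whatever LLM call you would use.

```python
import subprocess
import sys
from typing import Callable, Tuple


def run_code(source: str) -> Tuple[bool, str]:
    """Run a Python snippet in a subprocess and return (succeeded, output)."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.returncode == 0, result.stdout + result.stderr


def iterate_until_it_runs(
    source: str,
    ask_llm_for_fix: Callable[[str, str], str],  # hypothetical: (code, output) -> new code
    max_attempts: int = 3,
) -> Tuple[str, bool]:
    """Run the code; on failure, hand the code and its output back for a fix."""
    for _ in range(max_attempts):
        ok, output = run_code(source)
        if ok:
            return source, True
        # Feed the error output back, the way a developer reads a traceback.
        source = ask_llm_for_fix(source, output)
    # Check the last proposed fix once more before giving up.
    return source, run_code(source)[0]
```

The attempt limit is the important design choice here: without it, a model that keeps proposing the same broken fix would loop forever, which is where a human in the loop comes in.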

What did we learn?

LLMs, in general, are a new technology that everyone is trying to understand – how it works, what can be done better, how to do proper prompt engineering, etc. Our approach is to focus on building the application layer instead of working on getting LLMs to output better results. The reasoning is that LLMs will get better, and if we spend weeks optimizing a prompt, it might be completely solved with the new version of GPT. Instead, we’re focusing on what the user experience should look like and which mechanisms are needed to control the LLM to enable it to continuously work, getting closer and closer to the final solution. So, here are our learnings so far:

The initial description of the app is much more important than we thought. Our original thinking was that, with the human’s input, GPT Pilot would be able to navigate in the right direction and get closer and closer to the working solution, even if the initial description was vague. However, GPT Pilot’s thinking branches out throughout prompts, beginning with the initial description. And with that, if something is misleading in the initial prompt, all the other info that GPT Pilot has will lead in the wrong direction. So, when you correct it down the line, it will be so deep into this incorrect way that it will be almost impossible to get it onto the right path. Now, as I’m writing this, it seems so obvious, but that is something we needed to learn – to focus much more on the initial description. So, we built a new agent called “Spec Writer,” which works with you to break down the project requirements before it starts coding.
Coding is not a straight line. As I mentioned above in the pillars section, refactoring happens all the time, and GPT Pilot must do so as well. We haven’t implemented a solution for this yet, but it will likely work by adding the ability for GPT Pilot to create markers around its decision tree so that whenever something isn’t working, it can review markers and think about where it could have made a wrong turn.
Agents can review themselves. My thinking was that if an agent reviews what the other agent did, it would be redundant because it’s the same LLM reprocessing the same information. But it turns out that when an agent reviews the work of another agent, it works amazingly well. We have 2 different “Reviewer” agents that review how the code was implemented. One does it on a high level, such as how the entire task was implemented, and another one reviews each change before it is made to a file (like doing a git add -p).
LLMs work best when they can focus on one problem compared to multiple problems in a single prompt. For example, if you tell GPT Pilot to make 2 different changes in a single description, it will have difficulty focusing on both. So, we split each human input into multiple pieces in case the input contains several different requests.
Verbose logs help. This is very obvious now, but initially, we didn’t tell GPT-4 to add any logs around the code. Now, it creates code with verbose logging, so that when you run the app and encounter an error, GPT-4 will have a much easier time debugging when it sees which logs have been written and where those logs are in the code.
Splitting the codebase into smaller files helps a lot. This is also an obvious conclusion, but we had to learn it. It’s much easier for GPT-4 to implement features and fix bugs if the code is split into many files instead of a few large ones. Just like we, humans, think – it’s much easier to process smaller chunks of code rather than big ones. We do that by creating levels of abstraction – functions, classes, etc. One of the easiest ways to get the LLM to create a more manageable abstraction is just to tell it to create more modular code and split it into more files. It works amazingly well, and the end result is also better for us, developers.
For a human to be able to fix the code, they need to be clearly shown what was written in the task and the idea behind it. GPT Pilot’s goal is to do 90% of all coding tasks and leave the other 10% to a human. This 10% usually comprises fixes or small changes that the LLM struggles to notice, but for the human, it might be a simple change. However, the problem is that it’s not easy to tell the human what is not working and what code they should look at. If GPT Pilot writes 3,000 lines of code, the human developer, if they want to help GPT Pilot, needs to understand the entire codebase before diving into the actual code. In future versions of GPT Pilot, the human developer will have a detailed explanation of what code has been added to the current task and the reasoning behind it. This way, you will be able to help GPT Pilot.
Humans are lazy. LLMs are better off asking humans questions rather than letting humans think about all the possible errors. Again, it’s very obvious looking back, but one of the things we noticed was that people were much more willing to answer questions that GPT Pilot asked them instead of having an open-ended input field where they could write unconstrained feedback.
It’s hard to get an LLM to think outside the box. This was one of the biggest learnings for me. I thought you could prompt GPT-4 by giving it a couple of solutions it had already used to fix an issue and tell it to think of another solution. However, this is not remotely as easy as it sounds. What we ended up doing was asking the LLM to list all the possible solutions it could think of and save them into memory. When we needed to try something else, we pulled the alternative solutions and told it to try a different, but specific, solution.
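
The last learning above, listing candidate solutions up front and later steering the model to one specific untried option, can be sketched as a small data structure. This is an illustration under assumptions, not GPT Pilot's code: `SolutionMemory`, its methods, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SolutionMemory:
    """Candidate fixes the LLM listed up front, plus the ones already tried."""

    candidates: List[str]
    tried: List[str] = field(default_factory=list)

    def next_untried(self) -> Optional[str]:
        # Return the first candidate not attempted yet and mark it as tried.
        for candidate in self.candidates:
            if candidate not in self.tried:
                self.tried.append(candidate)
                return candidate
        return None

    def build_prompt(self, issue: str) -> Optional[str]:
        # Name one specific alternative instead of asking for "something else".
        solution = self.next_untried()
        if solution is None:
            return None
        already = ", ".join(self.tried[:-1]) or "nothing yet"
        return (
            f"Issue: {issue}\n"
            f"Already tried: {already}\n"
            f"Now implement this specific approach: {solution}"
        )
```

Each call to `build_prompt` moves on to the next stored candidate, and it returns `None` once the list is exhausted, which is a natural point to ask the human for help.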

Apps created with GPT Pilot

Currently, you can create simple but non-trivial apps with GPT Pilot. We’ve also seen people create apps with GPT Pilot that are very impressive, such as an app that can fine tune a ResNet model to count palm trees and then, when you upload an image, count the trees in it. Here are a couple of apps we created, along with the code, stats, and demos:

Prompt Lab (DEMO)

Think of this as OpenAI Playground on steroids – it enables you to load a conversation from a JSON array or enter it manually, run the conversation with the LLM X number of times, save the conversation to the database, and load it back in. We actually use this app when we engineer a prompt and want to see how it behaves. The Playground is not good enough because it’s time consuming to repetitively rerun a single prompt. With Prompt Lab, you can choose how many times to run it and let the app run the prompt repeatedly. Once it’s finished, you can analyze the results. This might be useful for people who are building an AI app and need to optimize their prompts.

⏳ Time spent: ~2 days
💾 Github repo

SQLite database analyzer tool (DEMO)

This is also an app we use internally to analyze a local SQLite database. It pulls the data from the database in a format that’s very specific to the GPT Pilot database structure, but it can easily be modified to fit other structures. It reads and uploads your SQLite database, splits the rows by a specific field, unpacks the rows into values, loads the LLM conversation data into a form, and enables you to simply change one of the messages and submit the LLM request to GPT-4 to see how the results will look. This way, you can analyze the conversations GPT Pilot’s agents have with the LLM and easily explore what would happen if the prompts were different.

⏳ Time spent: ~2 days
💾 Github repo

Code Whisperer (DEMO)

Code Whisperer is a fun project we created as an example to showcase. The idea is that you can use it to ask the LLM questions about your codebase. You paste in the link to a public Github repo. Then, it clones the repository, sends the relevant files to the LLM for analysis, which creates a description for each file about what the code does, and saves those descriptions into the database. After that, you can ask the app questions about the codebase, and the app shows you the response. In this demo, we use GPT-3.5.

⏳ Time spent: 7 hours
💾 Github repo

Star History (DEMO)

I’ve been releasing open-source projects for years now, and I’ve always wanted to see how fast my Github repo is growing compared to other successful repositories on https://star-history.com/. The problem is that on Star History, I’m unable to zoom into the graph, so a new repo that has 1,000 stars cannot be compared with a big repo that has 50,000 because you can’t see how the bigger repo did in its early days. So, I asked GPT Pilot to build this functionality. It scrapes Github repos for stargazers, saves them into the database, plots them on a graph, and enables the graph to be zoomed in and out.

⏳ Time spent: 6 hours
💾 Github repo

Conclusion

I hope you gained some insight into the current state, problems, and findings that we deal with at GPT Pilot.

So, to recap:

We think that a real AI developer tool should be based on the following pillars:

A human is needed to supervise the AI
We should enable the AI to iterate over its own mistakes
Software development can be orchestrated, which should be implemented on a layer on top of LLMs
The AI developer should be able to refactor the codebase because, in reality, the coding process is not a straight line

So far, we’ve learned that:

The initial app description is much more important than we thought
Coding is not a straight line
Agents can review themselves
LLMs work best when they focus on one problem compared to multiple problems in a single prompt
Verbose logs do miracles
Splitting the codebase into smaller files helps a lot
For a human to be able to fix the code, they must be clearly shown what has been written and the idea behind it
Humans are lazy
It’s hard to get the LLM to think outside the box

What do you think? Have you noticed any of these behaviors in your interactions with LLMs, or do you have a different opinion on any of these?

I would be super happy to hear from you, so leave a comment below or shoot me an email at zvonimir@gpt-pilot.ai.

You can try creating an app with AI with GPT Pilot here:

If you liked this post, it would mean a lot if you star the GPT Pilot Github repo, and if you try it out, let us know what you think.

🤜💥🤛 GPT-4 vs Claude-2 context recall analysis

zvone187 — Tue, 05 Dec 2023 15:10:48 +0000

In today’s rapidly advancing AI world, one of the limiting factors of modern Large Language Models (LLMs) is the context size. But it would also be interesting to know how well the LLMs can use the context they have – their context recall, or the reliability with which the LLM can access information in its context.

To set the stage, the context is the data fed to the LLM for it to produce output, basically representing the LLM’s “working memory.” While there are techniques to work around the current size limitation – most notably Retrieval-Augmented Generation (RAG) – ultimately, all the relevant information about the task-at-hand must fit into the context.
Context sizes are improving, with the recent update of the GPT-4 model (gpt-4-1106-preview) bumping the context size to 128 thousand tokens and Claude 2 upgrading its context to 200 thousand tokens.

I’m working on an AI dev tool, GPT Pilot, that uses LLMs a lot. So, I was interested in context recall, which becomes a more apparent problem at larger context sizes. In other words, how well can the LLM find the information it needs that is in the context? Less than ideal, as it turns out.

The context contest

I was interested in exactly how well this context recall works for different LLMs, specifically for GPT-3.5, GPT-4 and Claude. I constructed a context of the desired size with a piece of data buried inside it, asked the LLM to find it, and measured how often it succeeds.

This research follows the “haystack test” Greg Kamradt published when the updated GPT-4 came out (twitter, code). That test provided useful insight into (the lack of) context recall performance. But it was performed on a very small sample (limiting its statistical significance) and was initially limited to GPT-4 (he has since published an updated version that also uses Claude 2.1). Moreover, the test data consists of essays that were likely already used in pretraining LLMs, and the results were evaluated by GPT-4, potentially introducing confounding variables into the mix.

To dive deeper, I wanted to measure pure context recall on random data never before seen by an LLM and measure it directly (as the probability of success). I also wanted to run the test in more iterations to achieve more statistically significant results. The results were surprising!

Methodology

In the test, I constructed an artificial data set – a randomly generated CSV file with two columns, “key” and “value,” and as many rows as would fit into the context (minus some padding for prompt, query, and response so that the total number of tokens is under the limit).

This was constructed for 8, 16, 32, 64, 96, 128, and 192 thousand tokens. The set was split into 5 equal parts (quintiles) of 20% size of the total CSV length:

Quintile 0: Near the start of the context
Quintile 1: In the first half of the context
Quintile 2: Around the middle of the context
Quintile 3: In the second half of the context
Quintile 4: Near the end of the context

I randomly chose a key from the target quintile and asked the LLM to find the corresponding value (from the entire set). I repeated this 30 times and then calculated the score for that context size and quintile as the percentage of correct responses (i.e., the correct value was found).
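The procedure above can be sketched roughly as follows. This is a minimal illustration, not the actual test harness: the function names are made up, the data set is sized by row count rather than by token count, and `perfect_model` is a stand-in for a real LLM call.

```python
import random
import string

def random_token(rng, length=8):
    """Return a random alphanumeric string (high entropy, unlikely to be memorized)."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))

def build_rows(n_rows, rng):
    """Generate n_rows of (key, value) pairs with unique keys."""
    rows = {}
    while len(rows) < n_rows:
        rows[random_token(rng)] = random_token(rng)
    return list(rows.items())

def to_csv(rows):
    """Render the rows as a two-column CSV string."""
    return "key,value\n" + "\n".join(f"{k},{v}" for k, v in rows)

def pick_from_quintile(rows, quintile, rng):
    """Pick a random (key, value) pair from one of five equal slices of the rows."""
    size = len(rows) // 5
    start = quintile * size
    end = len(rows) if quintile == 4 else start + size
    return rows[rng.randrange(start, end)]

def recall_score(rows, quintile, ask_model, rng, trials=30):
    """Percentage of trials in which the model returned the correct value."""
    context = to_csv(rows)
    correct = 0
    for _ in range(trials):
        key, value = pick_from_quintile(rows, quintile, rng)
        answer = ask_model(context, key)
        correct += answer.strip() == value
    return 100.0 * correct / trials

def perfect_model(context, key):
    """Hypothetical stand-in for an LLM call: looks the key up directly."""
    for line in context.splitlines()[1:]:
        k, v = line.split(",")
        if k == key:
            return v
    return ""

rng = random.Random(0)
rows = build_rows(1000, rng)
scores = [recall_score(rows, q, perfect_model, rng) for q in range(5)]
print(scores)
```

Swapping `perfect_model` for a function that sends the context and the key to a real model would give one score per quintile for a given context size.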

Results

As of writing this article, GPT (and especially GPT-4) is the undisputed champion of LLMs in terms of reasoning power. Let’s see how well it performs in terms of recall.

(Note for readers in a hurry: I put a handy chart comparing all the results at the bottom of this article).

GPT-3.5

GPT-3.5 performed poorly on the tests. While I didn't test at the 4k context it was originally built with, results at 8k context weren't very good, and using the full 16k produced outright atrocious results.

GPT-4

GPT-4 was flawless on 8k context and performed really well with 16k context. It was somewhat worse with 32k and 64k (roughly on par with GPT-3.5 on 8k), and did rather poorly on 96k and 128k contexts.

Claude

Results for Claude were surprising. While it's understandable that Claude takes a somewhat different approach to solving the context problem, the graphs tell a rather different story than those of the GPT series.

Claude 2 performed flawlessly on 8k, really well on 16k, 32k, 64k, and 96k contexts (on par with GPT-4 16k), and not too shabby on 192k! It was much slower than the GPT models on large contexts, though. Unfortunately, I didn’t time the requests, but on large contexts, Claude seemed several times slower than GPT-4 – as if it was doing RAG or something else behind the scenes.

Claude Instant 1.2

As expected, Claude Instant did somewhat worse than both Claude 2 and GPT-4, but it was markedly better than GPT-3.5.

Synthetic data vs the real world

How should we interpret these results? For example, what exactly does 73% recall performance mean for us when using these models in the real world?

It’s important to remember these tests measure the absolute ability of an LLM to (ideally) perfectly remember every little detail from a big data set. While it’s useful for us to be able to evaluate performance, in many uses, it’s not as big of an issue for a few reasons:

Real-world data is usually duplicated in one way or another (in other words, compressible), meaning it’s probably easier for an LLM to remember real-world data than purely random strings with very high entropy.
In the real world, if we want to look up the data as-is, we’d use a database, not an LLM. The context is a guide to the LLM on what to do and how to do it, not a trivia quiz.

In other words, these results show the hard performance limits on the context recall and are a useful guide when thinking about context sizes we want to employ in our use cases. But the real-world situation is both messier and more forgiving.

Anecdotally, in some of the use cases I looked at, the models gave okay results at sizes where they scored 75% or more on this test.

I also didn’t shoot for hard, statistically sound measurements (3-sigma confidence) because that would be measuring at higher precision (and at a much higher cost) than what’s really useful.
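To put a rough number on that trade-off, here is a back-of-the-envelope sketch. It assumes each trial is an independent success/failure and uses the normal approximation to the binomial; neither assumption comes from the test itself, and the function names are illustrative.

```python
import math

def standard_error(p, n):
    """Standard error of a success rate p measured over n independent trials."""
    return math.sqrt(p * (1 - p) / n)

def trials_needed(p, margin, sigmas=3):
    """Trials needed so that `sigmas` standard errors fit within `margin`."""
    return math.ceil(sigmas ** 2 * p * (1 - p) / margin ** 2)

se = standard_error(0.73, 30)
print(f"1-sigma uncertainty at 30 trials: +/-{100 * se:.1f} points")
print(f"trials for +/-5 points at 3 sigma: {trials_needed(0.73, 0.05)}")
```

Under these assumptions, a 73% score from 30 trials carries a one-sigma uncertainty of about 8 percentage points, while pinning it to within 5 points at three sigma would take roughly 700 trials per context size and quintile.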

Future work

As noted, I intentionally used synthetic random data to measure the recall. Using real-world data would probably give somewhat different results and would be an interesting follow-up study.

I also kept the conversation chain short: all the data was in one (system) message, and the user query was in the second message. It would be interesting to see if having the context split across multiple smaller messages impacts the performance in any way.

Finally, I also haven’t tested any open source LLMs. Most are limited to 4k context, so it wouldn’t be a fair comparison. However, it would be interesting to see a comparison of the leading open source LLMs regarding context recall performance.

Conclusions

With the above caveats out of the way, who won the context contest?

Based on the context limit alone, Claude 2 is the winner, followed by GPT-4. When using small context sizes (relative to what the models suggest), both models perform really well.

Unless you’re dealing with small data that can comfortably fit inside a 4k context size, my recommendation is that you avoid GPT-3.5 and Claude Instant 1.2.

That’s it for this post - I hope you find this insightful. If you have a different experience with any of these LLMs, let me know what you found out.

🌟🌟🌟

Also, it would mean A LOT if you check out GPT Pilot. We’re working on a dev tool that tries to offload 90+% of coding tasks from the developer to the LLM. It’s completely open source, so if you star the GitHub repo, it would mean a lot to us. Thank you 🙏

🌟🌟🌟

GPT Pilot - a dev tool that writes 95% of coding tasks [Part 2/3 - Coding Workflow]

zvone187 — Tue, 03 Oct 2023 15:19:22 +0000

This is the second blog post in a 3-part series where I explain how we created GPT Pilot – the AI coding agent that’s designed to work at scale and build production-ready apps with a developer’s help. In part #1 of this series, I discussed the high-level overview of GPT Pilot. The idea is that AI can now do 95% of all coding that we, developers, are doing. See how I used ChatGPT to code out an entire Redis proxy in 2 hours, which would usually take 20-30 developer hours. However, an app is of no use if it doesn’t fully work or solve the user’s problem. So, until real AGI arrives, you need a developer.

So, this is how GPT Pilot came to life. It is designed to do 95% of the required coding and asks developers for reviews, such as when it becomes stuck and cannot move forward or needs something outside the app like an API key.

In this post, I walk you through the entire process GPT Pilot goes through when coding an app. I share diagrams to provide a visual representation of everything that’s going on behind the scenes in GPT Pilot. I’m a visual person, so I always create diagrams. To understand how GPT Pilot’s coding works, you need 3 concepts – context rewinding, recursive conversations, and TDD. I described them in the introduction in part #1 of this series.

The GPT Pilot coding workflow contains 8 steps:

Take the next development task in line
Break down the task into development steps
Take the next development step
Fetch the currently implemented code
Write code for the current step
Run the code or a command
Test the new code changes
Debug the development step or go to the next step

The coding workflow is my favorite part of GPT Pilot, so let's dive in. Here is a diagram of what it looks like:

#1 Task breakdown

Two important concepts will be mentioned throughout this blog post – development tasks and development steps.

GPT Pilot works in a way that, after breaking down the specifications for developing an app, it creates development tasks that will lead to a fully working app. Development tasks are basically high-level descriptions of what needs to be done that a developer will take and start implementing. Think of them as tasks in Jira (btw, I hate Jira…not sure if anyone relates, but I just wanted to let it out of my system). Here is an example of a development task:

In the diagram above, you see 3 task properties:

description: what needs to be implemented to fulfill this task
user_review_goal: how can the lead developer determine if the current task is finished – a crucial pillar of GPT Pilot is a developer must be involved throughout the coding process so that you can ensure the development process is going as planned and understand the codebase along the way
programmatic_goal: the kind of automated test GPT Pilot should write to test if this entire development task works as expected. After a development step, GPT Pilot writes unit tests, and after a development task, it writes integration or E2E tests.
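As a rough illustration, a development task with these three properties could be represented like this. The field names come from the list above; the class name and the example values are hypothetical and not taken from the GPT Pilot codebase.

```python
from dataclasses import dataclass, fields

@dataclass
class DevelopmentTask:
    """One high-level task, comparable to a ticket in an issue tracker."""
    description: str        # what needs to be implemented
    user_review_goal: str   # how the developer can tell the task is finished
    programmatic_goal: str  # which automated test should prove that it works

# Hypothetical example values, for illustration only.
task = DevelopmentTask(
    description="Add user registration with email and password.",
    user_review_goal="Open /register, create an account, and see a welcome page.",
    programmatic_goal="Integration test that posts to /register and expects a new user record.",
)

print([f.name for f in fields(task)])
```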

Now, when you start developing a task from Jira (development task), you will split it into smaller chunks (we call them development steps) that are actionable items you would set out to implement into the codebase. Each development step can be one of the following:

Command run – a command that needs to be run on the machine, such as a command to install a dependency, start the app to check if the previously implemented steps work, or create a folder.
Code change – the most important development step that explains what exactly needs to be implemented in the actual code to fulfill the current step. It can contain new code that needs to be written or code that needs to be changed. The way it works is the code change is a detailed, human-readable description of what needs to be implemented. It contains both the code that needs to be implemented and the description of what it is being used for. This is very similar to when you ask ChatGPT to code something. It will give you the code as well as the explanations of why it wrote that code.
The reason for this is that code implementation is not so simple. Sometimes, we need to add a snippet into existing code or change the existing implementation. That is why we separated the outline of the coding change (this development step) from the actual implementation of the change, which is the job of the CodeMonkey agent. I will go deeper into that in the #3 Coding section. Here is an example of a code change:
Human intervention – a development step that AI cannot do by itself and needs human help to fulfill. GPT Pilot asks the developer to do something, and when they are done, they write “continue,” and GPT Pilot continues with the implementation. Here are some reasons why human intervention might be needed:

An API key is needed (e.g., a Twitter API key to fetch data from Twitter).
GPT Pilot became stuck in a debugging process, and it either filled the entire context length or the recursive conversation went too deep to continue productively.
GPT Pilot needs verification that something works as expected – e.g., GPT Pilot is not sure if Mongo is installed properly on the machine and might ask the developer to run some sudo commands and see if it works as expected.
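The three kinds of development step can be pictured as a small tagged structure with one handler per kind. This is only a sketch of the idea: the names `StepType`, `DevelopmentStep`, and `handle` are hypothetical, and the handler returns a description of what would happen instead of doing it.

```python
from dataclasses import dataclass
from enum import Enum

class StepType(Enum):
    COMMAND_RUN = "command_run"
    CODE_CHANGE = "code_change"
    HUMAN_INTERVENTION = "human_intervention"

@dataclass
class DevelopmentStep:
    type: StepType
    details: str  # the command, the change description, or the request to the developer

def handle(step):
    """Describe what happens with a step, depending on its type."""
    if step.type is StepType.COMMAND_RUN:
        return f"run command: {step.details}"
    if step.type is StepType.CODE_CHANGE:
        return f"pass description to CodeMonkey: {step.details}"
    return f"ask developer and wait for 'continue': {step.details}"

steps = [
    DevelopmentStep(StepType.COMMAND_RUN, "npm install express"),
    DevelopmentStep(StepType.CODE_CHANGE, "add a /health route to the server"),
    DevelopmentStep(StepType.HUMAN_INTERVENTION, "provide the Twitter API key"),
]
for step in steps:
    print(handle(step))
```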

#2 Fetching of currently implemented code

It’s easy for AI to write a new file that contains code, but in reality, that is rarely the case. For the most part, we write into existing files and either change the existing code or add new code. Now, AI can do this easily if you give it all of the existing code and instructions for what needs to be implemented. The problem arises when an app scales and the codebase becomes so large that it cannot fit into the LLM context. And this is actually a very common case – at least until we have LLMs with 1M-token contexts, which don’t seem to be coming soon.

When you work on a task in a big codebase, you usually look at a smaller part of the codebase (maybe 1,000 lines) and work only with that subset of code to implement the task.

So, to address this issue and make GPT Pilot truly scalable so that it can create and upgrade large production-ready codebases, we must create a way for the AI to select the smaller part of the codebase (e.g., those 1,000 lines) on which it will implement the current task. Once it’s finished, we can simply add the finished lines back into the original codebase.

Let me start explaining this by telling you what happens when GPT Pilot writes code and creates new files and folders. For each file and folder it creates, it must write a description of the idea behind that file or folder. For example, it might want to create a folder utils for which it will write:

Contains utility modules that provide generic, reusable solutions to common problems encountered throughout the application.
These utilities are not specific to the app's core domain but offer auxiliary functionality to support and streamline the primary codebase.
They encapsulate best practices, reduce code repetition, and make the overall code cleaner and easier to maintain. Examples include functions for data formatting, error handling, debugging tools, string manipulation, data validation, and other shared operations that don't fit within specific modules or components of the app.

Likewise, for each function GPT Pilot creates, it writes a description of what the function is supposed to do. Together, these descriptions act as pseudocode for the entire codebase.

Now that you know what happens when GPT Pilot writes code, you can understand how it fetches the relevant code for each development step.

Before GPT Pilot codes each step, it first fetches the relevant part of the codebase in a completely separate LLM conversation. That conversation is set up in 3 steps.

The LLM is given the development step description along with the entire project file/folder structure and descriptions for each file and folder. From this, the LLM tells us which files are relevant for the mentioned step.
After narrowing down the necessary files, we give the LLM pseudocode for each file it listed and ask it to tell us which functions are relevant for the current development step.
Once we know the pseudocode it selected, we can fetch the actual code and put it into the original conversation, where the LLM will write the description of what needs to be implemented.
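The three steps above can be sketched as a two-stage filter over descriptions, followed by a lookup of the real code. This is a simplified illustration under stated assumptions: the data classes and function names are hypothetical, and `keyword_select` is a toy stand-in for the LLM call that picks relevant entries.

```python
from dataclasses import dataclass, field

@dataclass
class FunctionInfo:
    name: str
    description: str  # the "pseudocode" written when the function was created
    source: str       # the actual code

@dataclass
class FileInfo:
    path: str
    description: str
    functions: list = field(default_factory=list)

def fetch_relevant_code(step, files, select):
    """Narrow the codebase down to the code needed for one development step.

    `select(step, descriptions)` stands in for an LLM call: it receives a dict
    of name -> description and returns the names it considers relevant.
    """
    # 1. Show only file descriptions and ask which files matter.
    file_names = select(step, {f.path: f.description for f in files})
    chosen = [f for f in files if f.path in file_names]
    snippets = []
    for f in chosen:
        # 2. Show function descriptions for the chosen files and ask again.
        func_names = select(step, {fn.name: fn.description for fn in f.functions})
        # 3. Fetch the actual code for the selected functions.
        snippets += [fn.source for fn in f.functions if fn.name in func_names]
    return snippets

def keyword_select(step, descriptions):
    """Toy selector: keep entries that share at least one word with the step."""
    words = set(step.lower().split())
    return [name for name, desc in descriptions.items()
            if words & set(desc.lower().split())]

files = [
    FileInfo("utils/format.py", "helpers for date formatting", [
        FunctionInfo("format_date", "format a date as ISO string",
                     "def format_date(d): return d.isoformat()"),
        FunctionInfo("slugify", "turn a title into a url slug",
                     "def slugify(t): return t.lower().replace(' ', '-')"),
    ]),
    FileInfo("db/users.py", "database queries for user accounts", [
        FunctionInfo("get_user", "load one user by id", "def get_user(uid): ..."),
    ]),
]

relevant = fetch_relevant_code("change date formatting to include timezone",
                               files, keyword_select)
print(relevant)
```

The point of the design is that only descriptions travel through the selection conversations; the full source is loaded only for the functions that survive both rounds.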

If the app becomes extremely large, we can improve this by first giving the LLM only the folders, letting it select the relevant ones, and then giving it the files inside them. Before each of these steps, we can also rewind the conversation to the beginning to leave more room in the context.

Here is a diagram of what this looks like:

#3 Coding

Now that we can create an LLM message that contains all code necessary for someone to implement a specific task, we can start with the actual coding process. This happens in a 2-part process:

First, the LLM writes the description of what needs to be implemented along with the code. If the entire file needs to be coded, the LLM’s response will contain all the code, but if only a part of the code inside a file needs to be changed, the LLM will tell us things like “After the Mongo setup, add the following lines of code...” Because LLM output is stochastic rather than deterministic, we need to ensure that the written code is inserted into the appropriate places or changed correctly.

Here is where the CodeMonkey agent steps in. It is called code monkey because it doesn’t make any decisions but simply implements the code that the Developer agent writes. It is given the code relevant for the current task (previously selected by the LLM in the code-fetching phase) and the description that the Developer agent created in the first part. Then, the only thing it needs to return is the completely coded sections/files that we can just insert/replace in the codebase.

#4 Testing

There are 2 places where testing is done – (1) after each development task when GPT Pilot creates integration tests that test if the high-level features work as intended and (2) after each development step when it creates smaller unit tests that ensure all functions work as expected.

GPT Pilot has 3 different types of test it can do:

Automated tests are the preferred way of testing a step or task because they will be used in a regression test suite so that GPT Pilot can be sure that new code changes don’t break old features. However, automated tests are not always the best way to test new code.
Command run is a test where we run a specific command and give the output to the LLM, which then tells us if the implementation was successful. For example, we don’t need to create an automated test that will check if we can run an app with npm run start – for that, a simple command run is enough to check if we successfully set up our environment.
Human intervention is the final way to test the app, and it is needed whenever AI cannot test the implementation itself. This is needed, for example, when there are some visual aspects (e.g., CSS animations) that must be checked to see if they work correctly.
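The command-run test is the easiest of the three to picture in code: run a command, capture its output, and hand the output to something that decides whether it worked. In the sketch below, `simple_judge` is a toy stand-in for the LLM verdict, and all names are hypothetical.

```python
import subprocess
import sys

def run_command_test(command, judge, timeout=30):
    """Run a command and let `judge` decide whether the output looks successful."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    output = result.stdout + result.stderr
    return judge(command, result.returncode, output)

def simple_judge(command, returncode, output):
    """Toy stand-in for the LLM: success means exit code 0 and no traceback."""
    return returncode == 0 and "Traceback" not in output

# Two tiny commands that simulate a successful and a failed run.
ok = run_command_test([sys.executable, "-c", "print('server started')"], simple_judge)
failed = run_command_test([sys.executable, "-c", "raise SystemExit(1)"], simple_judge)
print(ok, failed)
```

In a real setup, the judge would pass the command and its output to the LLM and parse the answer, which also lets it catch failures that still exit with code 0.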

After running each test, if successful, GPT Pilot takes on the next task or step and continues with coding, but when the test fails, GPT Pilot needs to debug the error.

#5 Debugging

The debugging process needs to be robust enough that it can be started on any bug that arises, regardless of the error. It also needs to be able to debug any issue that happens during the debugging process itself. This is where recursive conversations come in, which are conversations with the LLM that are set up in a way that they can be used “recursively.”

Let’s look at the example in the image below. It represents a flow that GPT Pilot goes through when working on a development task that has 5 development steps. In this example, during the development of step #3, an error occurs – let’s say it implements a specific code change but after running a test, it fails. Then, it goes into the recursion level #1 to debug this issue. It breaks down what needs to be done to fix this issue into 2 steps, but during the implementation of the first step, another error happens. For example, a needed dependency for fixing the error #1 doesn’t exist. GPT Pilot then goes into the recursion level #2, which it breaks down into 3 steps. In the third step, another error occurs. Then, it goes to the third recursion level, which has only 1 step. Once that step is successfully executed, GPT Pilot goes back to the recursion level #2 and finishes debugging error #2. After that, it goes back to debugging error #1, and finally, after error #1 is fixed, it goes back to the development step #3 after which it continues the app implementation.

When the recursion goes 5 levels deep, GPT Pilot stops the debugging process and asks the developer to fix the initial issue it started with. Once the developer resolves this issue, they report the result to GPT Pilot. Then, it can continue the development process as if it had debugged the issue itself.
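The flow described above maps naturally onto a recursive function with a depth limit. The sketch below reproduces the example with 5 development steps and three nested errors. It is an illustration only: the names are hypothetical, the exact level at which the limit triggers is an assumption, and a failed step is not retried after its fix.

```python
MAX_DEPTH = 5  # recursion level at which the developer is asked to step in

class NeedsHuman(Exception):
    """Raised when debugging goes too deep and a developer must take over."""

def run_steps(steps, execute, plan_fix, depth=0, log=None):
    """Execute steps in order; when one fails, recurse into the steps that fix it."""
    log = [] if log is None else log
    for step in steps:
        log.append((depth, step))
        error = execute(step)  # returns None on success, an error otherwise
        if error is not None:
            if depth + 1 >= MAX_DEPTH:
                raise NeedsHuman(f"stuck on {error!r} at recursion level {depth + 1}")
            run_steps(plan_fix(error), execute, plan_fix, depth + 1, log)
    return log

# Which steps fail, and which fix steps each error breaks down into.
failures = {"step 3": "error 1", "fix 1.1": "error 2", "fix 2.3": "error 3"}
fixes = {
    "error 1": ["fix 1.1", "fix 1.2"],
    "error 2": ["fix 2.1", "fix 2.2", "fix 2.3"],
    "error 3": ["fix 3.1"],
}

log = run_steps([f"step {i}" for i in range(1, 6)], failures.get, fixes.__getitem__)
for depth, step in log:
    print("  " * depth + step)
```

Printing the log with indentation shows the same shape as the walkthrough: the run descends three levels from step 3, unwinds level by level, and then continues with steps 4 and 5.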

Conclusion

In the first post of this series, I discussed the high-level overview of how GPT Pilot works. In this post, I described the GPT Pilot Coding Workflow, including:

How Developer and CodeMonkey agents work together to implement code (write new files or update existing ones),
How recursive conversations and context rewinding work in practice, and
How to rewind the app development process and restore it from any development step.

In the final post, I will dive deep into how all the agents are structured. We built the agents modularly because we know they will evolve over time. Please head over to GitHub, clone the GPT Pilot repository, experiment with it, and send me your feedback. I want GPT Pilot to be as helpful to developers as possible, so let me know what you think, how it can be improved, or what works well. Add comments at the bottom of this post or email me at zvonimir@gpt-pilot.ai.

🌟🌟🌟
Finally, we're trying to raise funds to continue developing GPT Pilot, so it would mean A LOT if you could star the GPT Pilot GitHub repository and/or share it with your friends. Thank you 🙏
🌟🌟🌟

The `pazi-set-cron` skill