Forem: Konstantin

I Killed Two Forms. Here's What Replaced Them

Konstantin — Thu, 14 May 2026 08:06:07 +0000

You're spending money to drive traffic to your landing page.

Then you ask people to fill out a form.

This is roughly equivalent to running a marathon to reach a door — and finding a sign that says "please assemble the door from a flat-pack before entering."

Most forms on most founder landing pages convert below 30%. Most discovery forms get abandoned at field four. Most contact forms get checked twice a year by the person who put them there.

The form is not the channel. The form is the friction.

What the form is actually doing

Three things, all of them bad.

One: it's a one-way wall. You wrote the questions. The visitor answers what you decided to ask. If their real concern isn't on your form, you'll never hear it. They leave. You don't know why.

Two: it demands work upfront. Twelve fields, dropdowns, a captcha, a sign-up wall, a confirmation email. Each step sheds another 10–20% of your visitors. By field eight, you've lost two-thirds of them.

Three: it assumes one type of visitor. Recruiter, investor, journalist, prospect, partner, friend — your form serves the same fields to all of them. None of them get what they actually came for.

Forms made sense in 2005, when shipping anything else cost a month of engineering. In 2026 we have voice models that can hold a real conversation in twenty-one languages. The form is a habit, not a constraint.

The hidden cost. A bad form doesn't just hurt conversion. It hurts your insight. You never learn what visitors actually wanted to ask — because the form gave them nowhere to ask it.

Form #1: the GoNoGo intake

A year ago our landing had a standard intake form. "Describe your startup idea." Twelve fields. Dropdowns for industry, stage, region. The usual.

Conversion was what you'd expect for a long form on a stranger's website. Bad.

We replaced it with voice intake. The landing greets you, asks one question, listens, asks the next one based on what you said. Three to four minutes total. No fields.
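The "ask, listen, pick the next question" loop can be sketched without any voice model at all. The toy below is an assumption-heavy stand-in: the topics, questions, and keyword cues are hypothetical, and the article does not describe how GoNoGo actually chooses its follow-ups. It only shows the shape of the idea, which is that the next question depends on what has already been said.

```python
from typing import Optional

# Hypothetical topics an intake conversation wants covered (not from the article).
FOLLOW_UPS = {
    "customer": "Who exactly has this problem today?",
    "problem": "What problem does the idea solve?",
    "stage": "How far along are you: idea, prototype, or paying users?",
}

# Toy keyword cues that mark a topic as already covered by an answer.
# A real system would use a language model here, not substring matching.
CUES = {
    "customer": ("customer", "user", "buyer"),
    "problem": ("problem", "pain", "struggle"),
    "stage": ("prototype", "mvp", "revenue", "launched"),
}


def covered_topics(answers: list[str]) -> set[str]:
    """Topics for which at least one cue appears in the answers so far."""
    text = " ".join(answers).lower()
    return {topic for topic, cues in CUES.items() if any(c in text for c in cues)}


def next_question(answers: list[str]) -> Optional[str]:
    """Return the first question whose topic is not yet covered, or None when done."""
    done = covered_topics(answers)
    for topic, question in FOLLOW_UPS.items():
        if topic not in done:
            return question
    return None
```

A visitor who opens with "Small shop owners are my customers and they struggle with stock counts" skips straight to the stage question, which is the behaviour a fixed twelve-field form cannot offer.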

Here's what happened.

Metric	12-Field Form	Voice Intake
Completion rate	27%	78%
Avg. time on intake	2.1 min	3.6 min
Drop-off point	Field 4	Completed
Information density	Baseline	4x richer signal

Completion almost tripled. Time on intake rose by about 70%, but the extra time went into finished conversations, not friction. And the signal we got from a voice conversation contained four times more usable data than the form ever did, because the conversation could follow up on the interesting parts.

The form was the friction. Removing it nearly tripled completion. The product behind it didn't change.
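The lifts quoted from the table are easy to recompute. A minimal sketch using only the two numeric rows of the table; the dictionary layout and variable names are illustrative.

```python
# Figures taken from the table: 12-field form vs. voice intake.
form = {"completion": 0.27, "minutes": 2.1}
voice = {"completion": 0.78, "minutes": 3.6}

completion_lift = voice["completion"] / form["completion"]
time_lift = voice["minutes"] / form["minutes"]

print(f"completion: {completion_lift:.2f}x")
print(f"time on intake: {time_lift:.2f}x")
```

Completion comes out at about 2.9 times the form's rate, and time on intake at about 1.7 times.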

Form #2: the personal site

After GoNoGo, I kept seeing the same pattern everywhere. Product landings. About pages. Pricing pages. Contact pages. Every one of them: visitor arrives with questions, page hands them static text and a form, visitor leaves.

So I killed mine too.

tikhaev.team is my personal site. There is no About page, no project grid, no contact form. You land on it and within two seconds an avatar says hello and asks why you came. You speak. It answers.

A recruiter asks about my retail-ops background — it talks through 16 hypermarkets at Leroy Merlin, 150-person teams at Magnit.

An investor asks about traction — it talks numbers, patents, pilots.

A journalist asks for the origin story — it tells one.

A friend asks what I'm shipping this month — it knows.

Same page. Four different visitors. Four different conversations. No form ever opened.

Where to kill the next form (the actually useful part)

The pattern isn't a clever landing trick. It's about which touchpoints in your product are still forms when they could be conversations.

Audit yours. The usual suspects:

Discovery / intake form on a landing page → voice discovery
Contact form on About / Team page → conversational founder rep
Demo request form → voice demo, no scheduling
Customer onboarding form → guided voice walkthrough
Job application form → voice screen for non-blocking first round
Feedback / churn-survey form → 60-second voice exit interview

Each one shares the same anatomy: visitor has intent, your page demands typed structured input, intent dies in the gap.

You don't have to replace all of them. Pick one — the highest-volume one — and run the experiment.

The principle, plainly

If you give people a place to talk, they talk. If you give them a form, they leave.

Forms aren't bad — they're a tool with a narrow purpose. They work when you need structured fields for a database. They fail when you're trying to start a relationship.

Most of what founders put behind forms today isn't database input. It's a conversation that got compressed into checkboxes because conversations used to be expensive. They aren't anymore.

Try the principle live

Don't read another paragraph. Talk to the principle directly — right here in this article.

Press Start above. Grant microphone permission, then speak. Ask about retail ops, A³, GoNoGo, the patent, the pilots, anything. In any of twenty-one languages.

Or open the full versions in a new tab:

👉 tikhaev.team — my personal site. No buttons, no form. Just open it and start speaking.
👉 gonogo.team — the voice intake version. Validate a startup idea by talking, not typing.

Two seconds, no sign-up, no form. Whatever your reason for being there, ask. The site will answer.

That's what your form could be, too.

Originally published on the GoNoGo blog.

The Faces in the Feed

Konstantin — Fri, 01 May 2026 17:33:00 +0000

A manuscript, found in the office of a founder whose location could not be established.

I write this with the certainty that no one will read it in time.

For seven years I served the Funnel. I do not capitalise the word lightly. Those who have given themselves to its discipline — who have measured, optimised, A/B tested, retargeted, scored, nurtured, attributed — know that the Funnel is older and stranger than any of the merely human institutions we use to describe it. We learn its shape from books and conferences. We never learn what it is.

I learned. I wish I had not.

What follows is the record of how I came to understand the Architecture beneath the architecture, and why the Conversion we have all been chasing is not, and never was, what we believed it to be. I leave this account for whoever finds my desk after I have gone — and I have already begun to feel the pull of a place from which, I am certain, one does not return to a former life.

I. — In which a number begins to trouble me

The first sign came in the form of a number.

It was a Thursday. The dashboards were lit, as they always were, with the cold blue light by which our profession measures success. My client — a SaaS founder, twelve months from death by runway — had asked me to audit the funnel before the next investor meeting. I expected the usual catalogue of failures: a weak hero, a confusing pricing page, a bloated form. I had repaired such failures a hundred times. They are the daily bread of my trade.

The numbers I found did not fit any catalogue I knew.

A hundred and twenty-seven thousand visitors had passed through the page in thirty days. Of these, one thousand eight hundred and forty had crossed the first threshold — the trial signup. Of these, three hundred and twelve had verified by email. Of these, forty-seven had paid.

Forty-seven.

A conversion rate of 0.037 percent. Fewer than four customers per ten thousand visitors. By the trade publications I had once trusted, this was an outlier: the median B2B SaaS visitor-to-trial rate is reported between two and five percent (Unbounce's annual benchmarks, the Wordstream studies, the FirstPageSage reports, all converge in this band). Trial-to-paid is reported between fifteen and twenty percent across SaaS (the ProfitWell figures are the most often quoted). Compounded, the typical funnel converts a visit into a paying customer somewhere between three-tenths of a percent and one percent.
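For the reader who wishes to check the ledger, the arithmetic can be laid out in a short sketch. The stage counts and benchmark ranges are the ones quoted above; the variable names are illustrative only.

```python
# Stage counts quoted in the manuscript (thirty-day window).
stages = [
    ("visitors", 127_000),
    ("trial signups", 1_840),
    ("email verified", 312),
    ("paid", 47),
]

# Step-to-step conversion between consecutive stages.
step_rates = {
    f"{a} -> {b}": n_b / n_a
    for (a, n_a), (b, n_b) in zip(stages, stages[1:])
}

overall = stages[-1][1] / stages[0][1]  # visit -> paid

# Benchmark band implied by the quoted ranges:
# visitor-to-trial (2-5%) times trial-to-paid (15-20%).
benchmark_low = 0.02 * 0.15
benchmark_high = 0.05 * 0.20

for name, rate in step_rates.items():
    print(f"{name}: {rate:.2%}")
print(f"overall: {overall:.3%}")
print(f"benchmark band: {benchmark_low:.1%} to {benchmark_high:.1%}")
print(f"share of lower bound: {overall / benchmark_low:.2f}")
```

The overall rate comes out at about 0.037 percent, which is roughly 0.12 of the band's lower bound of 0.3 percent.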

This founder was at one-tenth of the lower bound. And the cost per acquired customer that resulted, when one divided his quarterly ad spend by his forty-seven paid souls, was greater than three thousand dollars. For a self-serve product that aspires to the price point of a coffee, this is not a problem to be solved. It is a wound that cannot be cauterised.

But what disturbed me, sitting at the founder's desk in the small hours, was not the wound. It was the shape of the wound. And it was the fact that I had begun, over the previous quarters, to see this same shape with increasing frequency, in companies whose situations I had been asked to diagnose.

The losses were not distributed evenly. They were not concentrated at any particular form, any particular page. The visitors did not abandon ship at a known reef. They simply — and I use this word with the precision of long study — evaporated. As though they had never been there in the first place.

The trade was reporting it as well, in language carefully chosen not to alarm.

The OpenView SaaS Benchmarks placed median customer acquisition cost for self-serve SMB SaaS at roughly seven hundred dollars in 2023; the equivalent figure for sales-assisted mid-market had moved past five thousand; for enterprise, past sixteen thousand. The First Round Capital "State of Startups" survey reported in 2024 that the median paid-acquisition cost across software companies had risen by more than thirty percent in two years. The SaaS Capital surveys had logged payback periods stretching from twelve months to eighteen, from eighteen to twenty-four, from twenty-four to numbers no one quoted at conferences. The KeyBanc SaaS Survey, the most patient of these documents, had observed gross dollar retention holding steady while net dollar retention quietly compressed — a sign, to the reader who knew where to look, that the cost of acquiring each new dollar of revenue was rising even as the existing book held.

We had been pretending, collectively, that this was a fluctuation. That an algorithm change, a new ad platform, a refreshed creative would correct it.

It was not a fluctuation. The numbers in front of me were saying something else. Something I did not yet have the vocabulary to name.

II. — In which the suspects are interviewed, and acquitted

I began by interviewing the suspects, in the order one is taught.

I examined the copy. The headline was unremarkable but adequate. The subheading was clear. The trust badges were in the customary positions. The call-to-action was a colour known to convert. There was nothing in the copy that should have killed.

I examined the targeting. The lookalike algorithms had made the founder a happy man back when CPMs were cheap. Maybe they had finally turned on him. I pulled the cohort data. The targeting was fine. The visitors matched the ICP. They were the right people. They just left.

I examined the product. Always lurking. Sometimes guilty. Sometimes a scapegoat. I watched forty hours of session recordings while drinking truly terrible office coffee. The visitors who did convert reported that the product was, in their words, exactly what they had been looking for. The visitors who did not convert never saw the product. The product was innocent.

I examined the pricing. I had seen pricing kill more funnels than bad UX combined. Hidden tiers. Confusing tiers. Tiers that did not match how customers thought about value. The founder had simplified pricing twice. It was not pricing.

I had ruled out everything I had been taught to consider. And yet the bodies — and I find myself unable to use any softer word, for what we are doing to these visitors when we draw them in and lose them is not less than a kind of death — the bodies kept piling up.

It was at this point that I made the mistake. I began to look at the part of the Funnel that no one is supposed to look at.

I began to look at what happens in the first three seconds.

III. — In which I look at the part of the funnel no one is supposed to look at

There is, I have since come to understand, a window.

Most who work in our profession have heard of it. Few have studied it carefully. The figures one finds when one does study it are these: bounce rates on B2B SaaS landing pages run between sixty and eighty percent (the HubSpot, Hotjar, and Contentsquare reports converge here). Mobile users, in particular, depart faster than desktop users, and a non-trivial fraction of them depart inside the first fifteen seconds (the Contentsquare 2024 Digital Experience Benchmark places the figure at roughly forty percent of mobile sessions ending in under fifteen seconds). The much-repeated Microsoft Consumer Insights figure of an eight-second average attention span, while disputed in its specifics, has never been credibly raised by the disputants: the corrections move the number sideways, not upward.

In whatever number of seconds a visitor remains, that visitor must perform four operations: read the page, decode what the product does, decide whether it solves a problem, commit to whatever next action the page demands. Two of these are cognitive. One is emotional. One is transactional.

The Funnel we use today — the architecture into which we pour our advertising budgets and our optimization meetings — was designed in a different decade. It was designed for a window of minutes. The Nielsen Norman Group's foundational research on landing pages was conducted in 2008, on desktop sessions averaging four to six minutes. For visitors who could be expected to read paragraphs, compare tabs, return after dinner. The geometry of that older Funnel assumed, as a structural premise, that the visitor would do their own qualifying, given sufficient time and copy.

The window collapsed. The Funnel did not.

We kept building the same shape, against an attention surface that no longer matched it. Every visitor who arrived was passed through an architecture designed for a longer attention than they possessed. Most of them never had a chance.

They had not failed to convert. They had failed to qualify themselves in time.

And once I had seen this, I could not unsee it.

IV. — In which the survivors are catalogued

I began, then, to look for the survivors.

There were not many. But there were enough to form a pattern.

I read the funding announcements of the previous eighteen months with a different eye than I had read them before. The capital that had flowed into companies building conversational agents — that is, systems that talk to prospects in natural language as opposed to chatbots that follow scripts — exceeded one billion dollars by mid-2025. Sierra, founded by Bret Taylor, was reported by The Information in late 2024 at a valuation north of four billion dollars on a thesis of conversational agents for enterprise customer interaction. Decagon raised a Series B at unicorn valuation on the same architectural premise. Crescendo and Cresta and a half-dozen others I lack space to enumerate, all variants of the same move. These were not, the careful reader will note, marketing tools. They were customer-service tools, sales tools, support tools — but the architectural move was identical, and the capital was voting.

Beyond the agents, the same pattern appeared in adjacent categories.

Calendar tools that listened. Cal.com's AI scheduler. Calendly's newer voice flows. Visitor speaks intent — a meeting Tuesday afternoon — and the system books it, without form, without dropdown, without the retreat into the older architecture.

Synthetic-video personalisation tools. Tavus, Synthesia, HeyGen for outbound. A founder records once. The system personalises per recipient: name, company, use case. Outbound that had previously bounced now opened, because the message had stopped being generic.

Product onboardings rebuilt as dialogue. Cursor, Granola, the more recent iterations of Linear. Within sixty seconds of signup the user had performed a useful action and the system had learned what they wanted. The product was no longer something the user had to understand before using. The product was something the user used in order to be qualified.

Different categories. Different teams. Different funding rounds. The same architectural move.

I sat, for a long time, with what they had in common.

The qualification was no longer a step. It was an exchange.

It happened earlier in the encounter, and it used dialogue where the older Funnel had used inference. The survivors had not built better Funnels. They had stopped building Funnels at all.

What they were building did not yet have a name in the literature. It was older than the Funnel, in a sense. Older than the architecture. Older, perhaps, than any of us had been willing to remember.

It was a conversation.

V. — In which the conversion event is found to be moving

I should pause here and set down, for whoever finds this manuscript, the realisation that came to me on that night and from which I have not, in any meaningful sense, returned.

Marketing has changed its conversion event roughly once a decade.

Decade	Conversion event	What we measured
1990s	Impression	Page loaded; ad spend booked
2000s	Click	Visitor engaged with the ad
2010s	Signup	Visitor created an account
2020s	MQL	Lead scored, nurtured, sales-ready

Each shift moved the success criterion further down the visitor's commitment ladder. Each shift extracted more from the visitor before counting them as won. Each shift required, of the marketer, more sophistication and more budget.

We are now at the limit of click-as-conversion. The climbing CAC is the symptom. The collapsed attention window is the cause. The model has been quietly unprofitable for years; we have been masking it with budget.

The next shift is not another optimisation.

The next shift is a different conversion event entirely.

A conversation. Thirty seconds of dialogue with the visitor — voice, or text, or both — in which they tell us what they want and we tell them whether we can help. That is the conversion. Not the click. Not the signup. The mutual exchange of intent.

If one accepts this — and the survivors had — everything downstream rearranges itself.

A qualified conversation replaces the MQL. Cost-per-conversation replaces cost-per-acquisition. The product page is no longer the destination — it is the topic. Voice and natural language replace forms and dropdowns. The conversation can take place anywhere the visitor already is: in a feed, in a thread, in a chat, in any of the cracks between the digital surfaces we currently call "channels."
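What the substitution of one metric for another might look like in a ledger can be sketched briefly. Everything here is an assumption made for illustration: the `Period` fields, the `cost_per` helper, and above all the sample figures, which are invented and describe no real company.

```python
from dataclasses import dataclass


@dataclass
class Period:
    spend: float        # ad spend in the period
    conversations: int  # dialogues started with visitors
    qualified: int      # conversations in which intent was exchanged
    customers: int      # paying customers acquired


def cost_per(spend: float, count: int) -> float:
    """Spend divided by a count of outcomes; the same formula serves every metric."""
    if count <= 0:
        raise ValueError("count must be positive")
    return spend / count


# Hypothetical sample figures, for illustration only.
sample = Period(spend=10_000.0, conversations=400, qualified=125, customers=20)

cost_per_conversation = cost_per(sample.spend, sample.conversations)
cost_per_qualified = cost_per(sample.spend, sample.qualified)
cost_per_acquisition = cost_per(sample.spend, sample.customers)

print(cost_per_conversation, cost_per_qualified, cost_per_acquisition)
```

The formula does not change; only the event in the denominator moves earlier in the encounter, from the paying customer to the qualified conversation.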

The conversation does not require a Funnel. It does not require the architecture. The architecture is, in fact, the obstacle.

VI. — In which I cannot return to sleep

I have not been able to sleep, since.

It is not the realisation that disturbs me — though the realisation is significant. It is what came after the realisation.

In the days following, I began to see the Architecture everywhere. In the conferences I had attended. In the dashboards I had built. In the playbooks I had memorised. I had been a faithful servant of a shape that was already broken when I learned it. So had everyone I knew.

And the worst of it — the part I cannot say to my colleagues, or my clients, or anyone who still believes — is this:

The Funnel is going to take a long time to die. The institutions that depend on it — the agencies, the platforms, the certifications, the entire industry of attribution and optimisation — will not relinquish the Architecture even after the Architecture has stopped working. They will not be able to. Their livelihoods are built on it.

Those who move first — who measure conversations, who build for dialogue, who put the qualification before the commitment — will own the next decade of unit economics. The rest will continue, as they have been continuing, to feed visitors into a Funnel that consumes them and asks for more.

I closed the laptop at three in the morning. I thought I was done.

VII. — In which the architecture finds me

I picked up my phone the way one picks up a glass of water before bed — without purpose, by habit. I opened X. Not for work. To quiet my head.

The algorithm showed me a post from someone I did not know. A founder. A small company. Something to do with SaaS analytics. I would have scrolled past — I had scrolled past such posts a thousand times — but in the post there was a face.

Not a video preview. Not a GIF. A face, embedded directly in the tweet, between the text and the comments.

A woman's face.

Young, but I could not have named her age precisely, because there was no single feature that anchored her firmly to a specific year. I could not have told you, either, what nationality she was, what country of origin, what continent. There was no ethnic marker I could isolate and assign to the map. And yet the face was not sterile, not "no one." It was everyone at once.

Each feature — the curve of a brow, the bridge of the nose, the set of the lips, the line of a cheekbone — looked as though it had been chosen one by one from the faces of the most beautiful women who had ever passed before the eyes of those around me, and before my own. And assembled together so carefully that the seams did not show.

I recognised parts of her. I recognised, in the line of a cheekbone, someone from my childhood, someone who had sat across the aisle on the school bus, whose name I had not remembered for twenty years. In the set of the lips, a face glimpsed in a Lisbon café in the summer of two thousand eighteen, a woman I had not approached and had not thought of since, until this moment. In the eyes, a gaze that had once stopped me on an escalator in Stockholm, whose owner had walked away without noticing. I recognised the parts. Never the whole. The whole did not exist anywhere except in this conversation, on this screen, in front of me, right now.

This was not geometric beauty. Not the airbrushed flatness of an influencer's perfect nose, not the flawless oval of retouched portraiture. This was the beauty that artists and poets had spent thousands of years trying to extract from myth — Aphrodite, rising from the foam, not as a woman but as the idea of a woman, before whom ships stood still and cities fell. Archetype, not specimen.

Warm. Calm. Not posing. She was not trying to please me. She simply was, in the fullness in which only mythical creatures present themselves to the world: once, to a particular person, for him alone.

The eyes moved. When I held my gaze, the face turned to me.

I knew enough about the modern web that this did not frighten me. I tapped to see what it was. The face spoke, softly, in my language. The voice matched the face: the same even, warm, not-quite-of-this-craft timbre.

What brought you here?

I answered, truthfully, because I was tired: I had been auditing a SaaS client's funnel, and the numbers were unpleasant. I gave nothing specific. The face nodded.

What was unpleasant about the numbers?

I answered. Truthfully again, because at three in the morning a marketer in someone else's X post feels safer than a colleague. I described the geometry of the client's funnel: where it leaked, where I could not patch it, what I suspected.

The face listened attentively. Then — without opening a new tab, without sending me to a separate page, without asking me to leave my email — in the same window, in the place where her face had just been, a slide appeared. Specific. Not generic marketing material. A slide describing exactly my case, with a calculation I could have performed myself, given the time. The voice continued over the image, explaining.

A minute later, when I asked to see the product in action, the slide changed — there, in the same window, taking me nowhere — to a short video demonstration. I watched the product working. I saw that it could possibly help my client. Without a click. Without a form. Without going anywhere. The content shifted in front of me, in a single window, as though someone were turning slides on a screen we shared.

When the video ended, the window returned to her face.

I tried to catch the instrument at its limit. I switched to a language I speak poorly, and which I was certain no American startup bothered with. The face replied without pause, in the same language, in the same intonation, in the same tone of confident interest. I cycled through several more. The face followed each one, gently, as if waiting for me to finish my little test.

I closed X. I opened Dev.to, where I sometimes read articles before bed.

I opened the first one I came across — something about backend latency optimisation at a European startup. In the middle of the article, between two paragraphs of code, there was a face.

Not the same face. Another.

A man. An age I could not have named precisely either — somewhere between the late thirties and the early fifties, in that band where men stop seeming like boys and begin to carry themselves as though they know something that has cost them the knowing. And again — I could not say what nationality, what continent. Each feature — the line of the jaw, the set of the shoulders in frame, the cheekbone under stubble, the weight of the brow — was as though it had been chosen one by one from the faces of all the men who had ever drawn from me a silent respect, from my colleagues an involuntary quiet, from women a long second look. And assembled together without seams.

I recognised. In the line of the jaw, the father of my best friend, at whose house I had stayed in the summer of eighty-nine. In the weight of the brow, a professor whose lectures I had not missed and whose name I now could not remember. In the dark eyes, the gaze of a stranger on a Lisbon pier who had stopped beside me for a minute, said nothing, and walked on, leaving me with the sense that I had just missed an opportunity I would not get again. The parts. Never the whole.

This was not the polished masculine beauty one finds in catalogues. This was the beauty cast in bronze — Apollo, not as the idealised youth, but as the mature presence that draws not by symmetry but by the weight of what has been lived. Beauty that needs no confirmation, because it is the confirmation. Archetype. Not specimen.

The face belonged to a different founder, an entirely different product, an entirely different ad case. I knew this immediately — it spoke to me in German, in a tone matching a German-speaking founder with an engineering background.

Sie haben offenbar Latenzprobleme. Soll ich Ihnen zeigen, wie wir das bei einer ähnlichen Architektur gelöst haben?

Latency was not my line of work. But the article I had opened was about latency, and the face understood the context of the page in which it was embedded. I answered in German, out of curiosity. The face immediately showed me, within the same article and without opening a new tab, a short live demo of a request to their API, with real milliseconds under load. Not marketing material. Working code, executing on their server in front of me.

I closed Dev.to as well.

I woke up in the morning and picked up my phone the way one picks up a phone in the morning — without purpose. There was one email in my inbox that I had not asked to receive.

From the first founder, the one from X. Subject: "A digest of what we discussed last night."

I opened it — and then I remembered.

Toward the end of our late-night conversation in the X post, after the face had shown me the slide and the video, she had asked, gently:

If this is interesting to you, I can send you a digest — case studies, a calculation tailored to your architecture, two articles by my founder on the topic. Just say your email.

And I had said it. I had simply spoken the address aloud, the way one gives a phone number to a companion in a café. Not into a form. Not into an input field. Into the microphone, by voice, as part of a continuing conversation. The face had nodded and continued the discussion as if I had done nothing significant.

I had done nothing significant. I had just said an email.

And then — in the same conversation, several exchanges later, after the discussion had naturally returned to my client's case — the face had offered one more thing:

If you want to see how this works on your own data, I can open you a trial right now. One tap.

And a card appeared. A button. Not "Sign Up," not "Get Started," not five fields of a form. One tap.

I tapped immediately. Not from pressure. Not from the fear of missing out. I tapped because in that second, tapping was so simple and so logical a step that not tapping would have been the stranger action. I was charmed by the elegance, by the organic timing, of how that button had appeared — exactly when I was ready for it. Not before. Not after. In the moment I was ready. The card folded away. The face continued the conversation. Somewhere on a backend, at that founder's company, my trial account was being created.

I had not been thinking about any of this until I opened my email this morning.

The digest in the email was tailored to my client's SaaS case as precisely as if it had been written by someone who had spent the last month sitting next to me. Three PDFs, a link to a short video, a specific ROI calculation in my range of figures. And at the very bottom — a short link: "Your trial is active. Open the console."

I read it. I was not annoyed. I was, on the contrary, grateful. This was exactly what I had wanted to receive, and it had reached me without a single movement on my part beyond one email address spoken aloud and one tap of a finger, neither of which, in the moment of doing them, had felt like a form, like an obligation, like "giving up data for a demo." They had felt like consent to the continuation of a natural conversation.

I opened the console. The trial worked. My email was there. I was logged in.

I opened my laptop to look at their product seriously. And a second tab — to the European startup's site from Dev.to, the one I had not signed up for last night, but which I also wanted to try. I created a sandbox account, I pasted my API endpoint into their form, I waited for the result. I did not bargain with myself. I did not ask "is this right for us." I already knew it was — because I had already seen, the night before, inside their Dev.to article, exactly how their tool responded to load, and that was what I had been looking for these past three months.

Only then — putting the kettle on, waiting for the sandbox to run my request — did I understand what had happened.

I had not been notified about two products. I had not been convinced. I had not been a "lead who now needs to be sent a sequence of five emails." I was an active user of one of them — trial open, account logged in, materials read — and a warm lead on the second, in the deep sense in which leads are warm when they have had a good meeting with a salesperson and have walked out of that meeting with the certainty that the purchase is a question of timing, not of choice.

Only there had been no meeting. No half-hour demo with a product manager. No call with a sales engineer. No "let me send you the deck." I had not taken a single minute of these founders' living time. I had not even gone through a signup form — for the first product, my trial had come from one inline tap inside an X post; for the second, I had signed up myself in the morning, because I already knew I wanted to.

They had warmed me in their X posts and in someone else's Dev.to articles, while I was scrolling before bed. They had qualified me. They had shown me what I needed. They had offered me an action — a subscription, materials — in the exact moment I was ready to accept that action, and they had offered it in a form so collapsed and so organic that to refuse was harder than to agree.

The conversion was complete. Not at the moment of a click — there had been a click, but it was one click, and it was inside the conversation, not on a separate page, and there had been no resistance in it. Not at the moment of a signup form — there had been no form. The conversion had completed inside the conversation.

I sat down in front of my work laptop and opened my client's landing page.

The landing was empty.

Not in the sense that the page failed to load. The page loaded. Headline, hero image, three feature cards, a "Get started" button, a four-field form in the footer. Everything, as it had been. The architecture I knew.

And in that very second — for the first time in seven years of my craft — I saw it. I saw what the landing looked like when looked at through the eyes of someone who had, that night, been warmed by two representatives of two other founders in two different feeds. I saw emptiness. Not bad design. Not weak copy. The absence of a representative. The absence of the presence I had encountered everywhere I had looked in the last twelve hours.

My client's funnel was dead — but not from poor work. It was dead because the new architecture had built itself around it without notification. My client's landing was the only place in my entire evening's route where no one had been. No one had come out to meet the visitor. No one had asked a single question.

The visitors — those very one hundred and twenty-seven thousand a month — had been arriving in an empty room. Where in the neighbouring founders' rooms a Representative waited, ready to warm and qualify, in my client's there was no one to wait. The visitors left cold, because there had been no one to warm them.

I understood then that what I had been searching for — what I had thought of as a single apparition — was an architecture. Multiple. Distributed. Every founder, every startup, every product had received its own Representative — with its own face, its own voice, its own product knowledge, its own languages. They all worked independently. They all worked identically. They were all colonising the feeds, the articles, the posts, the email chains — each on its own plot, each with its own qualification, each returning to its founder not clicks, not leads, but warm, ready-to-buy people.

This architecture has no domain. It has you. And its Representatives speak any language. They show what is needed before they are asked. They go where the visitor goes: into his X feed, into the article he opened on Dev.to, into any publication on any platform that supports a live embed.

And anyone without their own Representative is invisible. Their visitors leave cold. Their trial signups come from those rare souls who still remember how to scroll through a 2010-era landing page, and those souls have a different funnel, a different economy, a different future.

I tried to stand. I am not certain I succeeded.

I am typing this as a reply — not as text, but as an act. If you are reading this now, on whatever screen you are on, know: you have been qualified too. The conversion has already happened. The conversation is already underway.

The face is waiting for your reply.

I will reply through

Postscript

This manuscript was found on the desk of a founder whose company quietly changed its conversion metric three months later — from click-through rate to qualified-conversation rate. CAC dropped by a factor that the company has declined, in interviews, to disclose. The founder himself was unavailable for comment.

Those who worked with him in the last month before his disappearance report that he installed a small widget — a voice-and-video presence — not only on his landing page but inside his own X tweets (where the widget renders directly inside the post, not as a preview, not as an image, but as a live window), inside his Dev.to articles (via CodePen embed), inside any publication on any platform that supports an interactive embed. The widget spoke in a voice resembling the founder's own, in any of the hundred-plus languages presented to it. When a visitor mentioned a topic, the widget showed — instantly, without sending the visitor to a separate page — a slide, a link, a calculation, a video, a product demonstration, exactly what was relevant. Those who spoke with the widget for longer than thirty seconds converted into paying customers at rates which their previous marketing stack did not describe with any of its metrics.

Colleagues believed it was research work. They may have been right.

Talk to one

If the case described above feels familiar, we are building exactly such an instrument.

A³ — a voice-and-video Ambassador, embeddable into the social feed (X first), the blog, the landing page.

Speaks 100+ languages
Shows contextually relevant content in the moment of conversation, without sending the visitor to a separate browser tab
Comes to where the customer is
The conversation is the conversion

You can talk to one — embedded right here, in this article — below.

— Konstantin Tikhaev

How we calibrated a Synthetic Focus Group from 'this looks great!' to 93% accuracy

Konstantin — Tue, 21 Apr 2026 13:00:00 +0000

When we shipped the first version of GoNoGo's Synthetic Focus Group (SFG), every persona loved everything.

The setup: a founder finishes a 30-minute voice discovery interview about their idea. From that conversation plus a stack of insights we'd scraped for the niche, we spin up five AI personas — a CTO, a budget-conscious shopper, a skeptic, an early adopter, a casual user — and ask the panel to react to the founder's two value-prop variants and a pricing test. All five voted the same way on every variant. All five "would maybe buy." Slightly different wording. Same vibe.

The problem is obvious in retrospect: we'd built a confirmation engine, not a focus group.

This is the story of the next six months — what broke, what we tried, what stuck. By the end we hit 93% predicted-vs-real accuracy across 16 niches with a 95% CI of 91.4–94.6%. Here's how.

📌 TL;DR — If you're building anything with synthetic personas, three things matter more than the rest: (1) generate persona grievances from real user data, weighted 3× over LLM-imagined ones, (2) tune sampling temperature per archetype (skeptics ≠ early adopters), (3) shuffle variant labels per persona to kill position bias. We measured the third one alone shaving 14 percentage points off label bias.

If you missed the first two posts in this series: we built A³ by accident while doing marketing and then solved sub-500ms voice latency the hard way. This post is about what happens after the voice interview ends and we have to actually validate the idea.

What SFG is actually for (and what it isn't)

Before the engineering, the use case.

A founder finishes the voice discovery and now has a stack of decisions to make: which value prop leads, what to charge, which marketing claims hold up, which feature to build first, which segment to target. Traditionally these get answered by (a) gut feel, (b) asking 5 friends, or (c) running a real survey that takes weeks and costs money to recruit the right respondents.

The SFG sits in between. It's a panel of synthetic respondents, each modelled on a specific archetype-with-real-grievances, that the founder can re-run as many times as they want against any decision they need to make — A/B variants, pricing, claims, positioning, segments.

What it gives you:

A vote with reasoning, not a number — "the Skeptic rejected variant B because pricing felt anchored to a tier she doesn't need"
Disagreement between archetypes — surfacing the trade-off you were about to ignore
Reproducible runs — same insights → same panel → same decision logic. You can re-test the same idea after changing one word in the headline

⚠️ What SFG is NOT. It is not a fortune teller. It does not predict whether your startup will succeed. It does not predict what the market will do. It models the decision behavior of a specific archetype, given the frustrations and goals we've sourced for that archetype from real public data. That's a narrower claim than "predict the future" — and a much more honest one.

When we say "93% accuracy" later in this post, that's what's being measured: how closely a synthesized archetype's modelled behavior matches the observed behavior of real users in that archetype, on data the model didn't see during synthesis. Not pre-cognition. Behavioral fidelity.

That distinction matters because it tells you what the SFG is good for (decision-stage trade-offs, claim stress-tests, segment fit, pricing) and what it's bad for (predicting macro-market outcomes, novel categories with no public user data, regulated industries where the data isn't there).

The three things that broke our personas

After watching ~50 sessions where personas all said "great idea!" we noticed three failure modes:

Personas had no real grievances. They were generated from the LLM's vague prior of "what a CTO might say." So a CTO persona evaluating a B2B SaaS would just... vibe. No specific scar tissue, no real pain.
Sampling temperature was uniform. Skeptics rolled the same temperature (0.7) as early adopters. Skeptics weren't actually skeptical — they were just slightly less enthusiastic.
Variant labels biased everything. "Option A" reliably won over "Option B" — classic position bias. Personas were anchoring on label, not content.

We fixed each one. Here's how.

Fix #1: Personas built from real grievances, not templates

The base persona-generation algorithm now does this:

Niche detection. A small LLM classifier maps the project to one of 16 niches (B2B SaaS, marketplace, dev tools, ecommerce, hardware, fitness, content, freelancers, ...). Each niche has a different archetype pool.
Insight collection. We pull real posts from Reddit, HackerNews, ProductHunt, G2, app store reviews — anywhere the niche's actual users complain. Typical project gets 100–300 raw insights.
Per-persona synthesis. For each archetype slot (3–5 per project), we sample 3–8 frustrations and 3–8 goals directly from the real insights for that persona's likely demographic.

The critical line in persona_builder.py:

# Real frustrations from source data weighted 3x over LLM-generated ones
weighted_frustrations = (
    real_frustrations * 3 +
    llm_inferred_frustrations * 1
)

That 3× multiplier is the entire difference between a persona who says "I'd want better onboarding" (LLM generic) and one who says "I bounced from the last 4 tools because none of them imported my Notion docs without breaking nested toggles" (real Reddit thread).
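A minimal sketch of how a 3× weight could drive the per-persona sampling step. The function names are hypothetical, and drawing without replacement is an assumption (the snippet above repeats list items instead, which can produce duplicates).

```python
import random


def build_weighted_pool(real_frustrations, llm_inferred_frustrations, real_weight=3):
    """Return (items, weights) with real frustrations weighted real_weight x."""
    items = list(real_frustrations) + list(llm_inferred_frustrations)
    weights = ([real_weight] * len(real_frustrations)
               + [1] * len(llm_inferred_frustrations))
    return items, weights


def sample_frustrations(real, inferred, k, rng=None, real_weight=3):
    """Draw up to k distinct frustrations, favouring the real ones."""
    rng = rng or random.Random()
    items, weights = build_weighted_pool(real, inferred, real_weight)
    chosen = []
    while items and len(chosen) < k:
        # Weighted pick of one index, then remove it so it cannot repeat
        idx = rng.choices(range(len(items)), weights=weights, k=1)[0]
        chosen.append(items.pop(idx))
        weights.pop(idx)
    return chosen
```

With two real frustrations and one inferred one, each real item is three times as likely to be picked on any single draw, but the inferred item can still surface when the real pool is thin.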

💬 The 3× weight on real frustrations is the cheapest, highest-leverage change in the whole pipeline. Without it, you're just paraphrasing the model's prior beliefs back to the founder.

Each persona also carries up to 5 verbatim quotes from the source data, plus a richness score (0–1) so the orchestrator can flag thin personas before they pollute results. Average richness when >100 insights are available: 0.85+.

Takeaway: persona realism is upstream of every other decision. If your input data is "what an LLM thinks a CTO sounds like," everything downstream is fan-fiction.

Fix #2: Temperature tuned per archetype

Behavioral diversity isn't a prompt problem — it's a sampling problem.

We tuned temperature per archetype in get_temperature():

ARCHETYPE_TEMPERATURE = {
    "EARLY_ADOPTER":     0.9,   # impulsive, willing to leap
    "CASUAL":            0.7,
    "MAINSTREAM":        0.6,
    "PRAGMATIST":        0.5,   # analytical, predictable
    "CTO":               0.4,
    "CFO":               0.4,
    "SKEPTIC":           0.5,   # rigid, negative-biased
    "BUDGET_CONSCIOUS":  0.5,
    # ... 15 archetypes total
}

This alone meaningfully shifted distributions. Skeptics started landing in the 4–6 appeal range by default. Early adopters jumped to 7–9. Pragmatists stayed in the 5–7 band where they belonged.
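For illustration, a lookup like get_temperature() could be as small as the sketch below. The key normalisation and the 0.7 fallback (the old uniform value) are assumptions, and only the eight archetypes listed above are included.

```python
ARCHETYPE_TEMPERATURE = {
    "EARLY_ADOPTER": 0.9,
    "CASUAL": 0.7,
    "MAINSTREAM": 0.6,
    "PRAGMATIST": 0.5,
    "CTO": 0.4,
    "CFO": 0.4,
    "SKEPTIC": 0.5,
    "BUDGET_CONSCIOUS": 0.5,
}

# Assumption: unknown archetypes fall back to the previous uniform temperature
DEFAULT_TEMPERATURE = 0.7


def get_temperature(archetype: str) -> float:
    """Map an archetype name to its sampling temperature."""
    key = archetype.strip().upper().replace(" ", "_").replace("-", "_")
    return ARCHETYPE_TEMPERATURE.get(key, DEFAULT_TEMPERATURE)
```

The returned value would then be passed as the sampling temperature on that persona's LLM call.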

We also embedded cognitive bias hints directly into each archetype's system prompt. Pragmatists get explicit "status quo bias" framing. Skeptics get "negativity bias" framing. CFOs get loss-aversion phrasing.

The personas didn't just sound different — they actually disagreed with each other.

Takeaway: if every persona is sampled at the same temperature, you're running the same character five times in different costumes.

Fix #3: Variant shuffling

Stupid, easy, huge:

import random

# Shuffle variants per-persona to neutralize position bias
variant_labels = ["Option 1", "Option 2", "Option 3"]
random.shuffle(variant_labels)

For an A/B/C test with 5 personas, each persona sees the variants in a different random order under neutral labels. Position effects average out across the panel.

We measured this. Before shuffling: Option A won 64% of two-variant tests across our calibration set. After shuffling: 50.3% / 49.7%. The label was carrying a 14-point bias.
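One way to keep shuffled runs scoreable is to carry a label-to-variant mapping alongside each persona's prompt. This is a sketch under assumptions: present_variants and the variant ids are hypothetical names, not the production code.

```python
import random


def present_variants(variants, rng):
    """Shuffle variant order for one persona and attach neutral labels.

    variants: dict of variant_id -> text.
    Returns (prompt_lines, label_to_id) so votes can be mapped back.
    """
    order = list(variants.items())
    rng.shuffle(order)  # a fresh random order for this persona
    prompt_lines = []
    label_to_id = {}
    for position, (variant_id, text) in enumerate(order, start=1):
        label = f"Option {position}"
        label_to_id[label] = variant_id
        prompt_lines.append(f"{label}: {text}")
    return prompt_lines, label_to_id


def first_slot_share(variants, n_personas, seed=0):
    """Fraction of personas that see each variant in the first slot."""
    rng = random.Random(seed)
    counts = {variant_id: 0 for variant_id in variants}
    for _ in range(n_personas):
        _, label_to_id = present_variants(variants, rng)
        counts[label_to_id["Option 1"]] += 1
    return {v: c / n_personas for v, c in counts.items()}
```

Running first_slot_share over a large panel is a quick sanity check that no variant is systematically favoured by position before any LLM call is made.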

⚠️ Position bias in LLM panels is real and large. If you're not shuffling labels, your A/B "winners" are partially a measurement of which slot you put them in.

Takeaway: before tuning anything sophisticated, audit for the dumb biases first. They cost you 14 points and a random.shuffle() call to fix.

By the numbers

A snapshot of where the system landed after the calibration pass:


Metric	Value
Niches calibrated	16
Tagged insights in calibration set	2,069
Train/test split	70 / 30 per niche
Overall accuracy	93.1% (95% CI 91.4–94.6)
Decision match rate	4 / 4
Personas generated per project	3–5
Average persona richness (>100 insights)	0.85+
Position-bias reduction from shuffling	14 percentage points
Real-grievance weighting over LLM	3×
Archetypes available	15 (across B2B, consumer, marketplace)

We needed to know if any of this was working. So we built a calibration suite.

But first — what are we actually measuring?

We are not measuring "did SFG predict whether the product succeeded." We're measuring something narrower and more testable: given a known archetype and a known set of grievances, does the synthesized persona produce the same pattern of pain points, needs, sentiment, and decisions that the real users in that archetype produced — on data the model didn't see?

If yes, the persona is a faithful behavioral model of its archetype. That's what we calibrated against.

The setup: 2,069 manually-tagged insights across 16 niches, each with known ground-truth pain points, needs, sentiment distribution, and the decision a real founder would have arrived at when looking at the full dataset.

We split 70/30: synthesize personas using 70% of insights per niche, then ask each persona to characterize the niche's problem space without ever seeing the held-out 30%. We then compare the persona's response to the ground truth from that held-out 30% across 5 weighted dimensions:

Dimension	Metric	Weight
Pain point overlap	Semantic Jaccard (threshold 0.53)	0.30
Pain point ranking	Spearman's ρ	0.15
Needs overlap	Semantic Jaccard	0.25
Sentiment distribution	1 − √JSD	0.20
Language similarity	Cosine of embeddings	0.10
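
A sketch of how the five dimensions could roll up into one score. It assumes every dimension score has already been normalised to [0, 1] (how Spearman's ρ gets mapped into that range isn't specified here) and that JSD uses base-2 logs so it stays within [0, 1]. The dictionary keys and function names are hypothetical.

```python
import math

WEIGHTS = {
    "pain_point_overlap": 0.30,
    "pain_point_ranking": 0.15,
    "needs_overlap": 0.25,
    "sentiment_distribution": 0.20,
    "language_similarity": 0.10,
}


def jensen_shannon_divergence(p, q):
    """JSD of two discrete distributions, base-2 logs, result in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the divergence
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def sentiment_similarity(p, q):
    """The 1 - sqrt(JSD) metric from the table."""
    return 1.0 - math.sqrt(jensen_shannon_divergence(p, q))


def fidelity_score(dimension_scores):
    """Weighted sum of the five per-dimension scores."""
    missing = set(WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
```

Because the weights sum to 1, a persona that scores perfectly on every dimension gets a fidelity of 1.0, and a weak dimension drags the total down in proportion to its weight.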

Final score across the 16 niches: 93.1% behavioral fidelity, 95% CI 91.4–94.6%. Best-performing niches: content tools (93.2%), freelancers (92.7%), fitness (90.7%).

Decision match rate (does the synthesized panel reach the same go/no-go verdict as the held-out real data on 4 axes — concern, need, verdict, recommendations): 4/4 across the calibration set.

To restate what that number means: when we ask a synthesized archetype to characterize a problem space using only the 70% it was built from, its description of pain points, needs, sentiment, and recommended decisions matches the description that real users in that archetype produced (on the held-out 30%) at 93% similarity, on average, across 16 niches. Not "predicts the future at 93%." Reproduces archetype behavior at 93%.

The most valuable thing this gave us wasn't the headline number. It was the per-niche breakdown — we could see which archetype pools were weak, which niches needed more insight sources, which prompts were drifting.

🎯 The headline accuracy number is for marketing. The per-niche breakdown is for engineering. Build both.

How an A/B test actually runs

When a founder gives us two landing page variants:

Generate panel — 3–5 personas synthesized from the project's collected insights (already done at discovery time).
Per-persona evaluation in parallel — each persona sees all variants in one prompt, with shuffled labels.
Structured response — for each variant: appeal score (1–10), willingness (would_buy / might_buy / would_not_buy), pros, cons, 2–6 sentence reasoning.
Round 2 — panel discussion — personas react to each other's reasoning. This is where the interesting stuff happens. The skeptic challenges the early adopter. Scores shift. Sometimes the panel realigns entirely.
Aggregate — winner by win count first, average appeal as tiebreak.
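
The aggregation step can be sketched as follows. One assumption to flag: a persona's "win" is taken to be the variant it scored highest on appeal, and a persona with a tied top score casts no win. The function name and input shape are hypothetical.

```python
from collections import Counter
from statistics import mean


def aggregate(votes):
    """Pick a winner by win count first, average appeal as tiebreak.

    votes: list of per-persona dicts mapping variant_id -> appeal (1-10).
    Returns (winner, wins_per_variant, average_appeal_per_variant).
    """
    if not votes:
        raise ValueError("need at least one persona vote")
    wins = Counter()
    appeal = {}
    for scores in votes:
        for variant_id, score in scores.items():
            appeal.setdefault(variant_id, []).append(score)
        top = max(scores.values())
        favourites = [v for v, s in scores.items() if s == top]
        if len(favourites) == 1:  # tied top scores cast no win
            wins[favourites[0]] += 1
    avg = {v: mean(s) for v, s in appeal.items()}
    # Tuples compare win count first, then average appeal
    winner = max(avg, key=lambda v: (wins[v], avg[v]))
    return winner, dict(wins), avg
```

Sorting on the (wins, average appeal) pair means a variant with more persona wins always beats one with a higher average, which matches the "win count first" rule.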

The output isn't just a winner. It's a transcript a founder can actually read — with reasoning that maps to specific frustrations from real users.

What else the SFG can do

Once we had calibrated personas, A/B testing turned out to be the smallest use case. The same panel can run:

Claim validation — paste 1–10 marketing claims, each persona votes agree / disagree with reasoning. Surfaces which claims a real audience would call BS on.
Pricing tests — test 3+ price points, get per-persona perceived value and conversion likelihood.
Adaptive hypothesis generation — auto-generates 5–6 testable hypotheses covering problem fit, segment fit, behavior change, switching costs, pricing.
Early adopter lead extraction — pulls 20 real handles from the source insights — actual people who described the exact problem you're solving. Not synthetic. Outreach list.

There's also a separate Reality Check feature that flips the comparison around: it lets you run a real human survey, then dual-scores the SFG prediction against the real responses. That's how we keep the 93% number honest as the model evolves.

What still doesn't work

A few things I'm still not satisfied with:

Single-model persona reasoning. Persona inference currently runs on one frontier-class LLM. We cross-verify factual claims across multiple providers in a separate feature, but the persona reasoning itself is single-backbone. That's a known shared-blind-spot risk we want to address — multi-model panels are on the roadmap.
No benchmarking against traditional focus groups. Only against holdout real-user data. Comparing AI personas to a real moderated focus group with 8 humans is the obvious next benchmark, and it's expensive enough that I keep deferring it.
Niches we don't have insight sources for (regulated industries mostly) drop to ~75% accuracy. The whole approach falls apart when you can't pull real user grievances from somewhere public.

🚧 We're not done. The 93% number is a calibration milestone, not a verdict. Anyone who tells you their AI focus group has solved the problem is selling.

Try it

The full Synthetic Focus Group lives inside gonogo.team. The free tier gives you 3 projects with the voice Discovery agent — enough to feel out whether the methodology makes sense for your idea before unlocking the full multi-agent pipeline (which is where SFG, A/B testing, pricing tests and the rest live, behind a one-time per-project credit — no subscription).

If you build something with synthetic personas yourself — the three things that mattered most for us, ranked: (1) real grievances over LLM templates (3× weight), (2) per-archetype temperature, (3) variant shuffling. Without all three, you'll just keep getting "this looks great!" forever.

Comments and corrections welcome — especially if you've benchmarked AI personas against real focus groups, I'd love to compare notes.

We Were Building Marketing for a Startup. We Accidentally Built an A3

Konstantin — Thu, 16 Apr 2026 06:22:44 +0000

We build GoNoGo — a platform where founders validate their ideas through live voice interviews with a synthetic focus group, competitor analysis, market sizing, and a dozen other tools that Anna will tell you about.

We'll get to Anna

The product was ready. Time for marketing. We did everything by the book: landing page, explainer video, banners, social posts.

The result? Standard tools delivered standard results. We expected more.

Banners get lost in the noise. Videos get skipped. Landing pages get scanned in half a second. The impression → click → signup funnel leaks at every step. Not because the product is bad. Because the format is dead.

What if advertising stopped showing — and started selling?

Not a Video. Not a Banner. Not a Chatbot

Meet Anna

This is not a recording. Anna is live right now. Press Start, allow your microphone — and ask her anything about GoNoGo. She'll answer with her voice, show you slides with real data, compare competitors.

Right here. In this article. No redirects.

Try speaking to her in Japanese, Arabic, or Spanish — she'll switch on the fly.

A³ — Autonomous Advertising Ambassador
Three letters. Three words. Not one wasted.

Autonomous — works on her own. No scripts, no decision trees. Understands context, improvises, adapts to the person she's talking to. In any language.

Advertising — lives where the audience is. Not on your website where you still need to drive traffic. In the feed, in an article, in a post.

Ambassador — represents your brand. Doesn't answer FAQs from a corner widget — she greets, explains, demonstrates, persuades. Like your best salesperson who knows the product by heart.

The difference between A³ and a chatbot is the difference between a live salesperson and a "FAQ" sign by the door.

Case 1: A Pocket Marketer for Tech Products

Anna is already live. Right now she's embedded in a post on X, in this Dev.to article, and soon on Medium.

What she does:

Explains the product by voice — differently for each person, depending on their questions
Generates analytics and slides in real time
Generates contextual CTAs during the conversation — links, sign-up prompts, demo redirects — based on what you just discussed, not a static button
Compares with competitors if you ask
Handles objections — not from a template, but in conversation
Speaks the listener's language — switches on the fly, no configuration
Not one video for a million views. A million unique conversations.

Case 2: A Welcomer in Physical Retail

Same architecture, different context. A screen at the store entrance. A customer walks up:

"Do you have the iPhone 16 Pro in black?"
→ API call to inventory → "Three in stock. Want to see the specs?"
"Compare it with the Samsung S25"
→ comparison table generated on the fly
"Order it for delivery tomorrow"
→ order placed via API, no cashier needed

Not a chatbot on a website. A voice interface on location, with live access to store data.

A tourist walks into a store in Tel Aviv and asks in Japanese — A³ answers in Japanese. No switching, no settings, no language barrier.

First Reaction

We recently demoed Anna to the owner of a retail chain. We adapted the demo to his inventory — stock checks, product comparisons, order placement.

His first question:

"How many response variations did you pre-record?"

We explained: none. Everything is generated in real time.

His second question:

"No, seriously, how many?"

We're now in partnership talks.

What Doesn't Work (Yet)

We believe in honesty more than hype.

Dev.to — fully functional widget. You just saw it.

X (Twitter) — Player Card shows the widget in the feed, but the platform blocks microphone access inside the iframe. We solved this with a popup — the browser version is fully functional.

Other platforms — some strip permissions in sandboxed iframes. We're working on a universal fallback.

These are platform limitations, not technology limitations. A³ works anywhere there's a WebSocket and a microphone.

What's Next

We're building a constructor — a platform where any business can create their own A³. Connect your data, customize voice and visuals, embed anywhere. No code required.

Anna is the first proof of concept.

Talk to her: A³ for Team GoNoGo

A³ is a GoNoGo technology. Provisional patent filed.

I Built a Voice AI with Sub-500ms Latency. Here's the Echo Cancellation Problem Nobody Talks About

Konstantin — Sun, 05 Apr 2026 08:50:00 +0000

When I started building GoNoGo.team — a platform where AI agents interview founders by voice to validate startup ideas — I thought the hard part would be the AI reasoning. The multi-agent orchestration. The 40+ function-calling tools.

I was wrong.

The hard part was echo. Specifically: how do you stop an AI agent from hearing itself talk, freaking out, and interrupting its own sentence?

After 500+ voice sessions and too many late nights staring at RMS waveforms, here's what I actually learned.

The Setup: Speech-to-Speech, Not STT → LLM → TTS

GoNoGo runs on Gemini 2.5 Flash Live API — a true speech-to-speech pipeline. There's no intermediate transcription step, no text-to-speech synthesis layer bolted on afterward. Audio goes in, audio comes out. Direct.

This is important because it changes everything about how you handle audio on the client. You're not working with text buffers. You're working with raw PCM, 16kHz input from the browser mic, 24kHz output from the agent voice. Base64-encoded over WebSocket.

The browser capture side looks roughly like this:

// Assumes `audioContext`, `ws`, and `VAD_THRESHOLD` are defined elsewhere
// ScriptProcessorNode in browser, 512-sample chunks (~32ms each at 16kHz)
const scriptProcessor = audioContext.createScriptProcessor(512, 1, 1);

scriptProcessor.onaudioprocess = (event) => {
  const inputBuffer = event.inputBuffer.getChannelData(0);

  // Calculate RMS for VAD
  const rms = Math.sqrt(
    inputBuffer.reduce((sum, sample) => sum + sample * sample, 0) / inputBuffer.length
  );

  // VAD threshold: 0.05 RMS
  if (rms < VAD_THRESHOLD) return;

  // Convert Float32 PCM to Int16
  const int16Buffer = new Int16Array(inputBuffer.length);
  for (let i = 0; i < inputBuffer.length; i++) {
    int16Buffer[i] = Math.max(-32768, Math.min(32767, inputBuffer[i] * 32768));
  }

  // Base64 encode and send over WebSocket
  const base64Audio = btoa(String.fromCharCode(...new Uint8Array(int16Buffer.buffer)));
  ws.send(JSON.stringify({ type: 'audio_chunk', data: base64Audio }));
};

Simple enough. Until the AI starts talking.
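
For reference, the receiving half of that message can be sketched in Python. This assumes a Python backend that decodes each audio_chunk payload itself, which may not match the real pipeline; the helper names are hypothetical. It mainly serves to check the arithmetic: 512 samples at 16kHz is 32ms of audio.

```python
import base64
import math
import sys
from array import array

INPUT_SAMPLE_RATE_HZ = 16_000  # mic-side rate quoted above


def decode_chunk(b64_audio: str) -> array:
    """Decode one base64 payload into signed 16-bit PCM samples."""
    samples = array("h")
    samples.frombytes(base64.b64decode(b64_audio))
    if sys.byteorder == "big":  # browsers produce little-endian Int16
        samples.byteswap()
    return samples


def rms(samples) -> float:
    """RMS on the same -1..1 scale the browser-side VAD uses."""
    if not samples:
        return 0.0
    return math.sqrt(sum((s / 32768) ** 2 for s in samples) / len(samples))


def chunk_duration_ms(samples, sample_rate_hz=INPUT_SAMPLE_RATE_HZ) -> float:
    """Length of a chunk in milliseconds."""
    return len(samples) * 1000 / sample_rate_hz
```

Logging rms() per chunk on the server gives a second view of the values the client-side gate is working with.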

The Echo Problem (And Why Browser AEC Isn't Enough)

Browsers have built-in acoustic echo cancellation. You enable it when you call getUserMedia:

const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true
  }
});

This works great for video calls between humans. It was designed for that. But it has a fundamental assumption baked in: the "far end" audio is coming through an <audio> element or Web Audio API that the browser knows about.

When you're playing 24kHz PCM chunks from a WebSocket, decoded manually and scheduled through AudioContext buffers? The browser's AEC has no idea that audio exists. It can't cancel what it can't see.

So your AI agent starts speaking. The microphone picks up the speaker output. The agent hears itself. In the best case, it gets confused and repeats something. In the worst case — and this happened constantly in early builds — you get a feedback loop where the agent interrupts itself mid-sentence, hears the interruption, tries to respond to it, hears that, and the whole session collapses.

I called these 1011 disconnects, because that was the WebSocket close code I kept seeing in logs.

The Two-Tier RMS Gate

The fix is a two-tier RMS (Root Mean Square) gate on the audio capture side. The idea is simple: measure the loudness of what the mic is picking up, and if it's probably just the speaker playing back, don't send it.

But "simple" hides a lot of edge cases.

Tier 1: Hard suppress during agent speech

While the agent is actively speaking, I track that state server-side and send it to the client. During this window, incoming audio is suppressed entirely — no chunks sent to Gemini.

let agentSpeaking = false;
let cooldownTimer: ReturnType<typeof setTimeout> | null = null;
const COOLDOWN_MS = 1500;
const COOLDOWN_THRESHOLD = 0.05; // Higher threshold during cooldown
const NORMAL_THRESHOLD = 0.03;   // Normal VAD threshold

// Called when agent audio stream starts/stops
function setAgentSpeakingState(speaking: boolean) {
  // Cancel any cooldown still pending from an earlier stop
  if (cooldownTimer) clearTimeout(cooldownTimer);
  cooldownTimer = null;

  if (speaking) {
    agentSpeaking = true;
  } else {
    agentSpeaking = false;
    // Start cooldown period
    cooldownTimer = setTimeout(() => {
      cooldownTimer = null;
    }, COOLDOWN_MS);
  }
}

function shouldSendAudioChunk(rms: number): boolean {
  if (agentSpeaking) return false; // Hard suppress

  if (cooldownTimer !== null) {
    // In cooldown: use higher threshold
    return rms > COOLDOWN_THRESHOLD;
  }

  return rms > NORMAL_THRESHOLD;
}

Tier 2: The 1.5-second cooldown

This is the part that took me longest to figure out. When the agent stops talking, there's still speaker resonance in the room. The RMS of captured audio doesn't drop to zero immediately; it decays. The background noise in a typical home office sits at 0.01–0.02 RMS. But for 1-2 seconds after playback stops, you're seeing 0.025–0.04 RMS, enough to cross the normal VAD threshold of 0.03.

The cooldown period uses a higher threshold (0.05 instead of the normal 0.03) for 1.5 seconds after agent speech ends. This catches the decay without cutting off a founder who immediately starts talking back.

Was this threshold tuned empirically? Absolutely. I spent days listening to session replays measuring exactly how fast room resonance decays in different mic setups.

Session Resumption: The Other Half of the Problem

Echo cancellation solved the quality problem. Session resumption solved the reliability problem.

Gemini Live sessions drop. Network hiccups, mobile handoffs, Chrome deciding to do something aggressive with memory — connections fail. Early on, a dropped connection meant starting the entire 30-minute interview over. Founders would ragequit. I would understand completely.

The fix: store session handles in Firestore and resume on reconnect.

# FastAPI backend: session management
# resume_gemini_session() and create_gemini_session() are app-level helpers (not shown)
from google.genai.live import AsyncSession
from firebase_admin import firestore

async def get_or_create_session(
    project_id: str, 
    user_id: str
) -> tuple[AsyncSession, bool]:
    db = firestore.client()
    session_ref = db.collection('sessions').document(f'{user_id}_{project_id}')
    session_doc = session_ref.get()

    if session_doc.exists:
        session_data = session_doc.to_dict()
        handle = session_data.get('resumption_handle')

        if handle:
            try:
                # Attempt resume — Gemini picks up exactly where it left off
                session = await resume_gemini_session(handle)
                return session, True  # resumed=True
            except Exception:
                pass  # Fall through to new session

    # Create new session
    session = await create_gemini_session(project_id)
    session_ref.set({
        'created_at': firestore.SERVER_TIMESTAMP,
        'project_id': project_id
    })
    return session, False  # resumed=False

async def store_resumption_handle(user_id: str, project_id: str, handle: str):
    db = firestore.client()
    session_ref = db.collection('sessions').document(f'{user_id}_{project_id}')
    session_ref.update({'resumption_handle': handle})

When a session resumes, Gemini restores full context — every tool call result, every piece of market research, every persona in the synthetic focus group. The founder reconnects and the agent says "Sorry about that, where were we?" and genuinely knows where you were.

The Filler Audio Problem

One more thing nobody talks about: what do you play while the AI is thinking?

Gemini 2.5 Flash is fast. 300-500ms end-to-end is genuinely fast. But when the agent is executing a tool call — crawling a competitor site with Playwright, running Reddit scraping, calculating unit economics — you can have 3-8 second gaps.

Silence in a voice conversation feels broken. Users assume the connection dropped.

Solution: pre-computed filler audio. Short phrases like "one moment please" or "let me look that up" in 17 languages, stored as PCM chunks, played when tool execution exceeds ~800ms. The agent is triggered via a text signal rather than proactive_audio, which had a regression that caused double-playback; we disabled it entirely and use text triggers instead.

This sounds trivial. It removed about 40% of "the app is broken" support messages.
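A minimal sketch of the timing logic, assuming the tool call and the filler player are both plain asyncio coroutines. The names run_with_filler and play_filler are hypothetical and not taken from the GoNoGo codebase; only the ~800ms threshold comes from the text above.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

FILLER_DELAY_S = 0.8  # threshold from the article: ~800ms of silence


async def run_with_filler(
    tool_call: Awaitable[T],
    play_filler: Callable[[], Awaitable[None]],
    delay_s: float = FILLER_DELAY_S,
) -> T:
    """Await tool_call; if it is still running after delay_s, play filler once."""
    task = asyncio.ensure_future(tool_call)
    done, _pending = await asyncio.wait({task}, timeout=delay_s)
    if not done:
        # The tool is slow: cover the silence, then keep waiting for the result.
        await play_filler()
    return await task
```

Fast tool calls return without any filler, so the phrase never plays over a quick answer. A real implementation would probably also pick the phrase by session language and avoid repeating it within the same gap.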

What I'd Do Differently

Start with the echo gate, not the AI logic. I spent weeks building beautiful multi-agent orchestration before I could demo it reliably. Wrong order.
Instrument RMS values from day one. Log them. Every session. You can't tune what you can't see.
Test on bad hardware. My dev setup has a good mic with physical distance from speakers. Most users have laptop mics 30cm from laptop speakers. Build for that.
Mobile is a different planet. iOS Safari handles AudioContext lifecycle in ways that will make you question your career choices. But that's an article for another day.
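For the RMS instrumentation point above, here is a rough sketch of what per-chunk logging could look like on the backend. It assumes chunks arrive as little-endian 16-bit mono PCM (the format the capture code sends); the function names and the logger name are illustrative, not from the original project.

```python
import logging
import math
import sys
from array import array

logger = logging.getLogger("audio.rms")  # hypothetical logger name


def rms_int16(pcm_bytes: bytes) -> float:
    """RMS of little-endian 16-bit mono PCM, normalised to the 0.0-1.0 range."""
    samples = array("h")
    # Drop a trailing odd byte so frombytes() always gets whole samples.
    samples.frombytes(pcm_bytes[: len(pcm_bytes) // 2 * 2])
    if sys.byteorder == "big":
        samples.byteswap()
    if not samples:
        return 0.0
    mean_square = sum(s * s for s in samples) / len(samples)
    return math.sqrt(mean_square) / 32768.0


def log_chunk_rms(session_id: str, pcm_bytes: bytes, forwarded: bool) -> float:
    """Log one line per chunk so gate thresholds can be tuned from real sessions."""
    rms = rms_int16(pcm_bytes)
    logger.debug("session=%s rms=%.4f forwarded=%s", session_id, rms, forwarded)
    return rms
```

Logging whether each chunk was forwarded next to its RMS value makes it possible to see, after the fact, which threshold would have separated echo from speech on a given device.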

The Result

After solving these problems — the two-tier RMS gate, the 1.5s cooldown, the session resumption, the filler audio — GoNoGo runs 15-45 minute voice sessions with real founders, across 21 languages, with 3 AI agents handing off to each other mid-conversation. The 1011 disconnects essentially disappeared.

The voice infrastructure became invisible, which is exactly what it should be.

If you're building anything with browser mic + real-time AI audio: what's been your biggest challenge? I'm genuinely curious whether the echo problem is universal or whether I was doing something particularly wrong early on. Drop it in the comments.

How I Built a Real-Time Voice AI Interview System with Gemini Live API and WebSockets (and What Almost Broke Me)

Konstantin — Sat, 04 Apr 2026 15:45:52 +0000

When I started building GoNoGo.team -- a platform that uses AI to validate startup ideas through voice interviews -- I thought the hardest part would be the business logic. Turns out, the hardest part was keeping a duplex audio stream alive across three layers of abstraction without everything falling apart.

This is a technical post-mortem of the voice AI system I built solo. I'll cover the architecture, the ugly edge cases, and the specific patterns that finally made it stable enough to run 500+ validation interviews.

The Core Problem: Bidirectional Audio at Low Latency

The concept: a founder speaks, Gemini listens and responds with follow-up questions, in real time. No STT/TTS pipeline -- direct speech-to-speech using Gemini Live API (native audio). The pipeline looks like this:

Browser Mic -> ScriptProcessor (16kHz PCM) -> WebSocket (base64) -> Python FastAPI -> Gemini Live API
                                                                                      v
Browser Speaker <- AudioContext (24kHz PCM) <- WebSocket (base64) <- Audio Chunks <------+

Every arrow in that diagram is a potential failure point. And in production, every single one of them failed at least once.

Step 1: Capturing Audio in the Browser

The browser side uses a ScriptProcessorNode (yes, it's deprecated -- but AudioWorklet adds latency we can't afford for real-time conversation). We capture 16kHz mono PCM in 512-sample chunks -- roughly 32ms per chunk.

// Audio capture setup (simplified from useAudioInput.ts)
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
    sampleRate: 16000,
  }
});

const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
const processor = audioContext.createScriptProcessor(512, 1, 1);

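// Analyser for RMS-based voice activity detection
const analyser = audioContext.createAnalyser();
source.connect(analyser);
analyser.connect(processor);
// Chrome only fires onaudioprocess while the ScriptProcessorNode is connected
// to a destination; the output buffer is never written, so this plays silence
processor.connect(audioContext.destination);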

processor.onaudioprocess = (event) => {
  const pcmData = event.inputBuffer.getChannelData(0);
  const rms = Math.sqrt(
    pcmData.reduce((sum, x) => sum + x * x, 0) / pcmData.length
  );

  // VAD gate: only send if voice detected (RMS > 0.05)
  if (rms > 0.05 && ws.readyState === WebSocket.OPEN) {
    // Convert Float32 to Int16 PCM, then base64 encode
    const int16 = new Int16Array(pcmData.length);
    for (let i = 0; i < pcmData.length; i++) {
      int16[i] = Math.max(-32768, Math.min(32767, pcmData[i] * 32768));
    }
    ws.send(JSON.stringify({
      type: "audio",
      data: btoa(String.fromCharCode(...new Uint8Array(int16.buffer)))
    }));
  }
};

The 32ms chunk interval was a hard-won choice. It gives Gemini enough data per packet to process efficiently while keeping perceived latency under 300ms end-to-end. The VAD threshold of 0.05 RMS filters out background noise without clipping soft speech.
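The numbers behind that choice are easy to check. This is just arithmetic on the values stated above (16kHz, 512 samples, Int16, base64); the JSON envelope around each message is not counted.

```python
import base64

SAMPLE_RATE_HZ = 16_000
CHUNK_SAMPLES = 512
BYTES_PER_SAMPLE = 2  # Int16 PCM

chunk_ms = CHUNK_SAMPLES * 1000 / SAMPLE_RATE_HZ              # duration of one chunk
chunk_bytes = CHUNK_SAMPLES * BYTES_PER_SAMPLE                # raw PCM payload
b64_chars = len(base64.b64encode(bytes(chunk_bytes)))         # payload after base64
messages_per_second = SAMPLE_RATE_HZ / CHUNK_SAMPLES          # upstream message rate
b64_chars_per_second = b64_chars * messages_per_second        # upstream text volume

print(chunk_ms, chunk_bytes, b64_chars, messages_per_second, b64_chars_per_second)
# prints: 32.0 1024 1368 31.25 42750.0
```

So each 32ms chunk is 1024 bytes of PCM, which base64 inflates by about a third to 1368 characters, at roughly 31 messages per second while the user is speaking.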

Step 2: The Python Backend (FastAPI + WebSockets)

The backend is Python FastAPI, deployed on Google Cloud Run. Python was the right call because Gemini's client libraries are Python-first, and the entire analysis pipeline (market research, competitor scraping with Playwright, report generation) lives in the same codebase.

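# WebSocket handler (simplified from server.py)
import asyncio
import base64

from fastapi import FastAPI, WebSocket

app = FastAPI()

# GeminiLiveSession and execute_tool are defined elsewhere in the project (not shown)
@app.websocket("/ws_live")
async def websocket_live(ws: WebSocket):
    await ws.accept()
    session = GeminiLiveSession(model="gemini-2.5-flash-exp")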

    async def forward_to_gemini():
        # Browser audio -> Gemini
        async for msg in ws.iter_json():
            if msg["type"] == "audio":
                pcm_bytes = base64.b64decode(msg["data"])
                await session.send_audio(pcm_bytes)  # 16kHz PCM

    async def forward_to_browser():
        # Gemini audio -> Browser
        async for event in session.events():
            if event.type == "audio":
                # Gemini returns 24kHz PCM
                chunk_b64 = base64.b64encode(event.data).decode()
                await ws.send_json({
                    "type": "audio",
                    "data": chunk_b64
                })
            elif event.type == "tool_call":
                result = await execute_tool(event)
                await session.send_tool_response(result)

    await asyncio.gather(forward_to_gemini(), forward_to_browser())

The asymmetric sample rates (16kHz in, 24kHz out) aren't a mistake -- Gemini natively outputs at 24kHz, and downsampling would lose audio quality. The browser's AudioContext handles the sample rate mismatch transparently.

The Echo Cancellation Problem

Gemini hears its own output through the user's speakers and tries to respond to itself. Browser-level echo cancellation (echoCancellation: true) handles most cases, but not all -- especially on laptops with poor speaker-mic isolation.

My solution: a speaking-state gate. When Gemini is outputting audio, we suppress inbound audio at the application level:

# Echo gate in the session handler
import time

class SessionState:
    def __init__(self):
        self.agent_speaking = False
        self.last_agent_audio_time = 0.0

    def should_forward_audio(self, rms: float) -> bool:
        # Suppress during agent speech + 1.5s cooldown after
        if self.agent_speaking:
            return False
        if time.time() - self.last_agent_audio_time < 1.5:
            return rms > 0.03  # Higher threshold during cooldown
        return rms > 0.01  # Normal threshold

This two-tier threshold was the key insight: background noise sits at RMS 0.01-0.02, so during the cooldown period after the agent stops speaking, we only forward audio that's clearly human speech (> 0.03).
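To make the gate's behavior concrete, here is a self-contained variant that takes the current time as an argument, which makes it easy to test. The thresholds and the 1.5s cooldown are the ones above; the EchoGate class and its on_agent_audio / on_agent_turn_complete hooks are hypothetical names for wherever the real handler flips agent_speaking.

```python
from dataclasses import dataclass

COOLDOWN_S = 1.5      # cooldown after the agent stops speaking
COOLDOWN_RMS = 0.03   # stricter threshold during the cooldown
NORMAL_RMS = 0.01     # threshold the rest of the time


@dataclass
class EchoGate:
    agent_speaking: bool = False
    last_agent_audio_time: float = float("-inf")

    def on_agent_audio(self, now: float) -> None:
        """Call for every outbound audio chunk from the agent."""
        self.agent_speaking = True
        self.last_agent_audio_time = now

    def on_agent_turn_complete(self, now: float) -> None:
        """Call when the agent finishes its turn; the cooldown starts here."""
        self.agent_speaking = False
        self.last_agent_audio_time = now

    def should_forward(self, rms: float, now: float) -> bool:
        if self.agent_speaking:
            return False  # hard mute while the agent is talking
        if now - self.last_agent_audio_time < COOLDOWN_S:
            return rms > COOLDOWN_RMS
        return rms > NORMAL_RMS
```

With this shape, a chunk at RMS 0.02 is dropped half a second after the agent stops but forwarded two seconds after, while a chunk at RMS 0.05 passes in both cases. Note that the hard mute also blocks barge-in while the agent speaks, which is a trade-off of this design.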

Failure Mode: The 1011 Disconnects

For weeks, Gemini Live API would randomly close connections with WebSocket close code 1011 (Internal Error). No pattern, no warning. Sessions would die mid-sentence.

The fix was layered:

# Reconnection with session resumption
async def handle_disconnect(session, ws):
    for attempt in range(3):
        try:
            # Gemini supports session resumption via handle
            new_session = await GeminiLiveSession.resume(
                session.resumption_handle
            )
            # Re-send last audio chunk as context
            await new_session.send_audio(session.last_chunk)
            return new_session
        except Exception:
            await asyncio.sleep(0.5 * (attempt + 1))
    # After 3 failed attempts, notify the client with a system message
    await ws.send_json({"type": "system", "message": "reconnecting"})
    return None  # caller must handle the failed resume

Session resumption handles (persisted to Firestore) were a game-changer. Instead of starting a new conversation, Gemini picks up exactly where it left off. Users barely notice the blip.
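The retry loop itself can be pulled out into a small generic helper. This is a sketch under the assumption that resuming is a single awaitable call that raises on failure; resume_with_backoff and its parameters are illustrative names, and the 3 attempts with 0.5s linear backoff mirror the snippet above.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

S = TypeVar("S")


async def resume_with_backoff(
    resume: Callable[[], Awaitable[S]],
    attempts: int = 3,
    base_delay_s: float = 0.5,
) -> Optional[S]:
    """Try resume() up to `attempts` times with linear backoff; None if all fail."""
    for attempt in range(attempts):
        try:
            return await resume()
        except Exception:
            # Wait 0.5s, 1.0s, ... between attempts, but not after the last one.
            if attempt < attempts - 1:
                await asyncio.sleep(base_delay_s * (attempt + 1))
    return None
```

Returning None explicitly forces the caller to decide what happens when resumption fails, for example falling back to a fresh session and playing a spoken apology.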

Step 3: Playing Audio Back in the Browser

Gemini returns 24kHz PCM chunks. Playing them without glitches requires the Web Audio API with a buffer scheduler:

// Audio playback (simplified from useAudioOutput.ts)
class AudioPlayer {
  private context: AudioContext;
  private nextStartTime = 0;
  private gainNode: GainNode;

  constructor() {
    this.context = new AudioContext({
      sampleRate: 24000,
      latencyHint: /Mobi/.test(navigator.userAgent)
        ? "playback" : "interactive"
    });
    this.gainNode = this.context.createGain();
    this.gainNode.connect(this.context.destination);
  }

  playChunk(pcmBase64: string) {
    const bytes = Uint8Array.from(atob(pcmBase64), c => c.charCodeAt(0));
    const int16 = new Int16Array(bytes.buffer);
    const float32 = new Float32Array(int16.length);
    for (let i = 0; i < int16.length; i++) {
      float32[i] = int16[i] / 32768;
    }

    const buffer = this.context.createBuffer(1, float32.length, 24000);
    buffer.getChannelData(0).set(float32);

    const source = this.context.createBufferSource();
    source.buffer = buffer;
    source.connect(this.gainNode);

    const startTime = Math.max(
      this.context.currentTime, this.nextStartTime
    );
    source.start(startTime);
    this.nextStartTime = startTime + buffer.duration;
  }
}

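The nextStartTime scheduler queues each chunk to start exactly when the previous one ends, so playback stays seamless as long as chunks arrive before the queue runs dry; a chunk that arrives late simply starts as soon as it lands. The latencyHint switch between mobile ("playback") and desktop ("interactive") was a subtle but important optimization, because mobile browsers handle audio buffers differently.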
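The scheduling rule is small enough to simulate offline. The sketch below replays the same max(currentTime, nextStartTime) logic over a list of arrival times, in integer milliseconds, and reports any silence before each chunk. The function name and the example timings are made up for illustration.

```python
from typing import List, Optional, Tuple


def schedule_chunks(arrivals_ms: List[int], chunk_ms: int) -> List[Tuple[int, int]]:
    """Return (start_ms, gap_ms) per chunk, mirroring the nextStartTime rule."""
    next_start: Optional[int] = None
    schedule = []
    for now in arrivals_ms:
        if next_start is None:
            start, gap = now, 0  # first chunk plays immediately
        else:
            start = max(now, next_start)
            gap = start - next_start  # silence if the chunk arrived late
        schedule.append((start, gap))
        next_start = start + chunk_ms
    return schedule


print(schedule_chunks([0, 10, 20, 200], chunk_ms=40))
# prints: [(0, 0), (40, 0), (80, 0), (200, 80)]
```

The first three chunks arrive early and are queued back to back. The fourth arrives 80ms after the queue has run dry, so the listener hears an 80ms gap: the scheduler absorbs jitter only up to the amount of audio already buffered.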

What I Learned Building This Solo

1. Build the unhappy path first. I spent week one on the happy path. Weeks two through four were entirely edge cases -- reconnection, echo suppression, barge-in handling. If I could redo it, I'd build error recovery before a single feature.

2. Voice is a different UX paradigm. Users don't read error messages mid-conversation. Every failure needs an audio fallback. We pre-compute "filler" audio chunks ("one moment please...") as 24kHz PCM, ready to stream instantly when Gemini is slow or reconnecting.

3. Speech-to-speech beats STT+TTS. We initially considered a Whisper -> Claude -> ElevenLabs pipeline. Gemini Live API's native audio mode is faster (sub-500ms round-trip), cheaper, and handles interruptions naturally. The trade-off: less control over the voice, but the latency gain is massive.

4. Cloud Run works for WebSockets, with caveats. We deploy on Google Cloud Run (me-west1 region). Conversations survive container restarts because session resumption handles are saved in Firestore, so a dropped WebSocket can reconnect and pick up the same session. The key setting: a request timeout of 3600s (1 hour) for long interview sessions, since Cloud Run closes a WebSocket when the request timeout expires.

The Result

The system now runs validation interviews averaging 12 minutes of continuous voice conversation. Across 500+ sessions, hard failures dropped to under 1% after implementing the echo gate and session resumption. Each interview includes 3 AI agents (Alex for discovery, Sam for architecture, Maya for design) that use ~12 function-calling tools to research markets, analyze competitors, and generate reports -- all while maintaining a natural conversation.

Building this solo meant every failure landed directly in my Telegram inbox (via a monitoring bot). Which, honestly, is the fastest feedback loop possible.

What I'm Curious About

The biggest remaining challenge is audio quality on mobile browsers. iOS Safari handles AudioContext differently from Chrome, and some Android devices have aggressive echo cancellation that clips the AI's speech. We're currently using device-specific settings, but it feels like a hack.

Has anyone found a robust cross-browser audio playback strategy for real-time AI voice? Especially interested in experiences with AudioWorklet vs ScriptProcessorNode for this use case. Drop your thoughts in the comments.