<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: marinsky roma</title>
    <description>The latest articles on Forem by marinsky roma (@rmarinsky).</description>
    <link>https://forem.com/rmarinsky</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F70497%2F086d0074-7581-4b43-9f0b-76efcbb01ab8.png</url>
      <title>Forem: marinsky roma</title>
      <link>https://forem.com/rmarinsky</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rmarinsky"/>
    <language>en</language>
    <item>
      <title>QA is dead 2005 vs 2015 vs 2025</title>
      <dc:creator>marinsky roma</dc:creator>
      <pubDate>Tue, 10 Feb 2026 16:24:31 +0000</pubDate>
      <link>https://forem.com/rmarinsky/qa-is-dead-2005-vs-2015-vs-2025-3la6</link>
      <guid>https://forem.com/rmarinsky/qa-is-dead-2005-vs-2015-vs-2025-3la6</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9v6xrw7qyysditnrrvqy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9v6xrw7qyysditnrrvqy.jpg" alt=" " width="544" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every couple of years, the same wave hits LinkedIn: "QA is being eliminated," "testers will be replaced," "QA teams are the first to go."&lt;/p&gt;

&lt;p&gt;In 2025, the boogeyman is AI. Before that, it was shift-left. Before that, automation. Before that, Agile. The framing changes; the panic stays the same.&lt;/p&gt;

&lt;p&gt;But here's the thing - none of this is new. Not even close.&lt;/p&gt;

&lt;p&gt;2011 - Facebook had no dedicated QA team. Simon Stewart (creator of WebDriver) had already explained that developers owned quality. The Quora thread "Is it true that Facebook has no testers?" is from that era.&lt;/p&gt;

&lt;p&gt;2012 - "How Google Tests Software" came out. Google had already transformed QA into Test Engineers and Software Engineers in Tools &amp;amp; Infrastructure. Developers owned their tests. The "traditional QA is dead" narrative was already in full swing.&lt;/p&gt;

&lt;p&gt;2014 - Microsoft eliminated the SDET role during an 18,000-person layoff. Entire testing organizations were restructured overnight. Gergely Orosz documented this firsthand in "How Big Tech does QA."&lt;/p&gt;

&lt;p&gt;2015 - ThoughtWorks published "Is QA Dead?" Yahoo cut its QA department entirely. Slashdot had a 500+ comment thread debating it. Elisabeth Hendrickson introduced "testing = checking + exploring" at OnAgile — addressing the exact same fear we see today.&lt;/p&gt;

&lt;p&gt;2015 - Also the year the Slashdot thread "No More QA: Yahoo's Tech Leaders Say Engineers Are Better Off Coding With No Net" went viral. Read the comments. They could've been written yesterday.&lt;/p&gt;

&lt;p&gt;That was a decade ago. QA didn't die. It adapted. Testers who evolved - survived and thrived. Those who didn't - struggled. Same as in every profession.&lt;/p&gt;

&lt;p&gt;Fast-forward to today, and replace "automation" with "AI agents" - the script is identical. "QA will be replaced by AI." "You won't need testers anymore." "Your team will be cut in half."&lt;/p&gt;

&lt;p&gt;And conveniently, there's always a tool to sell you right after the scare.&lt;br&gt;
Here's what actually happens every single time: the panic fades, the roles evolve, and the people who understand systems, edge cases, risk, and user behavior remain indispensable. Because quality isn't a phase you bolt on; it's a mindset. No tool replaces that.&lt;/p&gt;

&lt;p&gt;If you're a QA engineer reading yet another "your job is disappearing" post, relax. Learn, adapt, stay curious. But don't let recycled fear-mongering from people selling solutions make you question your value.&lt;/p&gt;

&lt;p&gt;This conversation has been going in circles for 15 years. The only thing that changes is who profits from the panic.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Agent Orchestration, Multi-Model Setups, 1M Context Window - It's Marketing for Those Who Haven't Tried</title>
      <dc:creator>marinsky roma</dc:creator>
      <pubDate>Wed, 28 Jan 2026 19:39:31 +0000</pubDate>
      <link>https://forem.com/rmarinsky/agent-orchestration-multi-model-setups-1m-context-window-its-marketing-for-those-who-havent-5c58</link>
      <guid>https://forem.com/rmarinsky/agent-orchestration-multi-model-setups-1m-context-window-its-marketing-for-those-who-havent-5c58</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is aggregated experience - my own, and that of engineers who share publicly, plus current and former colleagues who code daily in enterprise, build their own products, and churn through countless prototypes: Viktor Tulskyi, ThePrimeagen, Theo (t3.chat), Peter Steinberger, Gergely Orosz, and many other pragmatic engineers.&lt;br&gt;
This isn't a research paper with metrics. These are practical conclusions from those who tried everything and came back to simplicity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  AI Can Be Exciting and Useful, But There Will Be "Buts"
&lt;/h2&gt;

&lt;p&gt;I enjoy AI. I use it daily. I can now quickly do things I never understood before - at the level of PoC, MVP, internal utilities that turn a five-minute task into a second.&lt;/p&gt;

&lt;p&gt;I built internal extensions for infrastructure, a &lt;a href="https://github.com/rmarinsky/dictate_to_buffer" rel="noopener noreferrer"&gt;console utility&lt;/a&gt; for transcribing voice to text via clipboard, a &lt;a href="https://github.com/rmarinsky/DictateToBuffer" rel="noopener noreferrer"&gt;macOS utility&lt;/a&gt; for the same but more convenient - &lt;strong&gt;this has become exciting for me&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every day, I follow what's happening in the AI world. Every day I try something new:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool updates, services like Supabase, Vertex, Convex&lt;/li&gt;
&lt;li&gt;New bullshit benchmarks from providers&lt;/li&gt;
&lt;li&gt;New papers on approaches, prompt combinations, contexts, models, and tokenization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So much to learn - and at the same time, so much that doesn't meet expectations. The useful stuff I follow is mostly ideas already worked out by other engineers, companies, and researchers.&lt;/p&gt;

&lt;p&gt;And also - so many promotional, sales, and marketing videos. How someone built multi-agent orchestration, and the agents solve tasks independently, practically without human participation. Demos of full autonomy. Beautiful, polished, convincing.&lt;/p&gt;

&lt;p&gt;I tried this too. Experimented. Temperatures, system prompts, fallbacks, and playgrounds with different models. Multi-model setups, multi-agent systems, sub-agents, RAG.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Everything&lt;/strong&gt;, literally! It's busywork that goes nowhere. Everything falls apart in the details when it meets reality.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  My Prompting Journey Looked Like This:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Stage 1:&lt;/strong&gt; "Fix this, here's the stacktrace"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2:&lt;/strong&gt; Multi-model, multi-agent, orchestration, RAG, complex system prompts, temperatures, fallbacks...&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3:&lt;/strong&gt; "Fix this, here's the stacktrace, here's when it happens, probably the problem is here"&lt;/p&gt;

&lt;p&gt;The difference between first and third stage - I now know exactly &lt;em&gt;what&lt;/em&gt; needs to be fixed. Where to look, what context to give, how to formulate.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Narrowing context always works better than 1 MILLION input tokens for writing code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All that stuff in the middle - multi-agents, orchestration, RAG - was marketing I swallowed. And you can try it too. Try it! Seriously. So you understand firsthand HOW it doesn't work. And when it potentially could work.&lt;/p&gt;

&lt;p&gt;But don't spend too much time on it - in the long run, for MVPs and large projects, it produces changes that are hard to control, each with its own unique consequences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accuracy
&lt;/h2&gt;

&lt;p&gt;One of the key parameters for evaluating LLMs for code generation is accuracy.&lt;/p&gt;

&lt;p&gt;More specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pass@k - did at least 1 of k attempts pass tests&lt;/li&gt;
&lt;li&gt;Pass@1 - did the code complete the task on the first try&lt;/li&gt;
&lt;/ul&gt;
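&lt;p&gt;As a sketch (not from this post), the standard unbiased Pass@k estimator from the HumanEval benchmark paper can be computed from n generated samples, of which c passed the tests:&lt;/p&gt;

```javascript
// Unbiased pass@k estimator (Chen et al., HumanEval paper):
// n = samples generated, c = samples that passed the tests, k = attempt budget.
function passAtK(n, c, k) {
  // If there are fewer than k failing samples, any k picks include a pass.
  if (k > n - c) return 1.0;
  let allFail = 1.0;
  // Probability that k randomly chosen samples all fail:
  // product over i = n-c+1 .. n of (1 - k/i)
  for (let i = n; i > n - c; i--) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}
```

&lt;p&gt;With n = 10 samples of which c = 3 pass, Pass@1 reduces to c/n = 0.3 - exactly the "did it work on the first try" rate.&lt;/p&gt;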

&lt;p&gt;Not "almost works", not "needs tweaking". Works or doesn't.&lt;/p&gt;

&lt;p&gt;Not "how nicely it sounds". Not "how confidently written". But how repeatable and correct.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Accuracy isn't the only metric. For brainstorming, analysis, and review, variability can be useful. But for code that must compile, pass tests, and work in production, accuracy is more critical.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Code either works or it doesn't.&lt;/p&gt;

&lt;p&gt;The interpreter and compiler aren't creative personalities having a discussion with the author, and a "500 error" isn't production abstractionism.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agent/LLM Doesn't "Think"
&lt;/h2&gt;

&lt;p&gt;Sorry to repeat, but an LLM is a probability generator for the next token. And when you put two token generators to communicate with each other, you don't get a "team". You get accumulating hallucinations with each step.&lt;/p&gt;

&lt;p&gt;What happens to "Accuracy" when agent A passes results to agent B, which passes to agent C? It drops. Exponentially. Each step adds variability. Each step moves further from the expected result.&lt;/p&gt;

&lt;p&gt;Multi-agent is a marketing term for "I wrote several prompts, and they call each other".&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Multi-agent isn't an architecture. It's hoping that LLM will magically understand the context you haven't formalized yourself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  "But Devin, Cursor, Windsurf background agents - they work!"
&lt;/h4&gt;

&lt;p&gt;They work. In specific cases. On particular workflows. But compare: how much time will you spend configuring Cursor rules, custom agents, your own RAG system vs. understanding the project architecture, knowing where to look for the problem, and asking AI to brainstorm solution options?&lt;/p&gt;

&lt;p&gt;A person who understands the system + a simple conversation with a "model" = (more often than not) a faster and more accurate solution than juggling multi-agent setups you spent a week configuring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sub-agents?&lt;/strong&gt; A charade - as Peter Steinberger aptly called &lt;a href="https://steipete.me/posts/just-talk-to-it" rel="noopener noreferrer"&gt;them in his blog post&lt;/a&gt; about how you can work with "agents" more simply.&lt;/p&gt;

&lt;p&gt;What others do through sub-agents, you can do through separate terminal windows. Full control. Full context visibility. Less exponential growth of hallucinations. And most importantly - the ability to verify results at each step!&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP? Marketing Token Burner and Hallucination Igniter
&lt;/h2&gt;

&lt;p&gt;Most of them should just be CLI or API clients.&lt;/p&gt;

&lt;p&gt;GitHub MCP eats 23k context tokens from the start. &lt;code&gt;gh&lt;/code&gt; CLI does the same - "for free".&lt;/p&gt;

&lt;p&gt;Or generate a simple script that calls GitHub API for a specific task - you'll get predictable results without the magic of "now the agent will figure it out thanks to MCP".&lt;/p&gt;
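&lt;p&gt;A minimal sketch of that idea - a task-specific script against the public GitHub REST API instead of a general MCP server (the owner/repo values are placeholders):&lt;/p&gt;

```javascript
// Pure helper: turn issue objects into short one-line summaries.
function formatIssues(issues) {
  return issues.map((i) => "#" + i.number + " " + i.title);
}

// One task, one predictable script: list open issues for a repo.
// Uses the real GitHub REST endpoint; needs Node 18+ for global fetch.
async function listOpenIssues(owner, repo) {
  const url =
    "https://api.github.com/repos/" + owner + "/" + repo + "/issues?state=open";
  const res = await fetch(url, {
    headers: { Accept: "application/vnd.github+json" },
  });
  if (!res.ok) throw new Error("GitHub API returned " + res.status);
  return formatIssues(await res.json());
}
```

&lt;p&gt;No tool-selection roulette, no context tax at startup: the script does one thing, and you can read every line of it.&lt;/p&gt;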

&lt;p&gt;MCP provides structure? API does too. CLI with the proper output format, too. The difference is that you fully control a script, while MCP is a black box, often limited, that can return anything because it accidentally called the wrong tool and ate context just because you initialized it in the agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; MCP for Figma is a separate story - for frontend work, it's genuinely useful. I'm more skeptical about MCP for browser/database - the value there is questionable.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG for Code?
&lt;/h2&gt;

&lt;p&gt;In an enterprise, for searching documentation, regulatory requirements, and internal wikis, RAG makes sense. If you have a DEDICATED team maintaining it, write in the comments which cases it actually helps with - and that you come from a "wealthy family".&lt;/p&gt;

&lt;p&gt;But if you're a developer who wants to set up RAG for your own codebase so the "model better understands the project," it's overkill without value. Modern models already search code well when given the right context.&lt;/p&gt;

&lt;p&gt;Your time is better spent on understanding the architecture/structure than on configuring vector indexes.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Separate Illusion - Context Window
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"Gemini has a million tokens! You can throw in the entire codebase!"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There's &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;research&lt;/a&gt; "lost in the middle" - the model loses information in the middle of a large context. Yes, the research is old, and new models show better results in synthetic tests like &lt;a href="https://github.com/gkamradt/LLMTest_NeedleInAHaystack" rel="noopener noreferrer"&gt;"needle in a haystack"&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But even Google Gemini 3.0 Pro &lt;a href="https://ai.google.dev/gemini-api/docs/long-context#long-context-limitations" rel="noopener noreferrer"&gt;declares the same&lt;/a&gt;: the more facts you search for simultaneously (as in real work), the more sharply accuracy drops!&lt;/p&gt;

&lt;p&gt;In practice - mine and my colleagues' in an enterprise - a large context still works worse. Not because the model "forgets", but because more context = more noise = more response variability = lower "Accuracy".&lt;/p&gt;

&lt;p&gt;Compare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Throw 500k tokens of code and say, "find and fix the problem"&lt;/li&gt;
&lt;li&gt;Know where the problem likely is, give the relevant file and context for the fix&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For simple things - find a function, update implementation, see where it's used - agents handle it fairly well. But for more complex changes or refactoring, you often get "shrapnel": changes scattered across the project, extra code created that then needs cleanup.&lt;/p&gt;

&lt;p&gt;When you give more hints about the structure, it works more reliably. But for that, you need to understand the architecture and codebase. A million context tokens won't replace that.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Again:&lt;br&gt;
LLM is a probability generation machine, not a logical processor.&lt;br&gt;
More noise in - more noise out.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Autonomy
&lt;/h2&gt;

&lt;p&gt;Agent autonomy works exactly until the task goes beyond the demo scenario. First edge case - and the whole "orchestration" falls apart.&lt;/p&gt;

&lt;p&gt;Why don't those selling courses and tools talk about this? Because "it works for 70% of cases, and the rest needs manual work" - doesn't sell. "Full autonomy" - sells.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works?
&lt;/h2&gt;

&lt;p&gt;Iterative work!&lt;/p&gt;

&lt;p&gt;Just talk to the model, literally. Check the result. Adjust. Repeat. Stop when something goes wrong.&lt;/p&gt;

&lt;p&gt;Ask: "let's come up with different solutions", "what best practices exist for working with these specific problem domains", "now let's make a prompt for the next iteration" (&lt;a href="https://arxiv.org/abs/2211.01910" rel="noopener noreferrer"&gt;APE&lt;/a&gt;/&lt;a href="https://arxiv.org/abs/2504.04365" rel="noopener noreferrer"&gt;APRO&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Instead of blindly approving every action.&lt;/p&gt;

&lt;p&gt;"Accuracy" grows not from a larger context or a larger number of agents. It grows through iterations under the control of someone who understands the task.&lt;/p&gt;

&lt;p&gt;Most likely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don't need overkill frameworks with tons of abstractions and configurations. Don't need sub-agents.&lt;/li&gt;
&lt;li&gt;Don't need MCP for every service - write a script that does a specific thing, or just use SDK or API.&lt;/li&gt;
&lt;li&gt;Don't need a million input context tokens - need the right context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You need your head. Your technical expertise. Your understanding of the task. And a model you talk to like an engineer - not a magical oracle that will understand, find, and solve everything for you.&lt;/p&gt;

&lt;p&gt;Neither expanded context, nor multi-agent, nor MCP, nor RAG - none of these increases accuracy/stability by itself. These are all tools that amplify an expert. They don't replace one.&lt;/p&gt;

&lt;p&gt;Everything else is silver-bullet pitching for investors and content for LinkedIn - sales presentations of yet another charade of autonomy.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;P.S. This post wasn't written by an LLM. It was transcribed from monologues with my PoC &lt;a href="https://github.com/rmarinsky/DictateToBuffer" rel="noopener noreferrer"&gt;Dictate to Buffer&lt;/a&gt;; the links and proofs come from my bookmarks stash, and certain parts with counterexamples were researched with Claude's help.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;P.P.S. I reached most of these conclusions myself through my own trial and error - and then it turned out others had already tried and done this at a larger scale 😅&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Playwright Quirks — waitForResponse</title>
      <dc:creator>marinsky roma</dc:creator>
      <pubDate>Thu, 20 Nov 2025 15:56:29 +0000</pubDate>
      <link>https://forem.com/rmarinsky/playwright-quirks-waitforresponse-21p6</link>
      <guid>https://forem.com/rmarinsky/playwright-quirks-waitforresponse-21p6</guid>
      <description>&lt;p&gt;Playwright has a convenient feature for waiting on responses from requests - waitForResponse.&lt;/p&gt;

&lt;p&gt;&lt;a href="**https://playwright.dev/docs/api/class-page#page-wait-for-response**"&gt;waitForResponse — playwright.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is helpful when there are no visual changes on the web UI, but you need to verify that a request was actually sent and an entity was successfully created. Instead of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Opening the page with that entity and writing checks to verify all data is correct&lt;/li&gt;
&lt;li&gt;Or directly "poking" a specific endpoint to check its fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can implement response waiting. Here's an example from the documentation:&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: Declare the expectation first
&lt;/h3&gt;

&lt;p&gt;Start waiting, perform the action, then await the response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;responsePromise&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://example.com/resource&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;trigger response&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;responsePromise&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option 2: Use a predicate
&lt;/h3&gt;

&lt;p&gt;Declare the predicate with expectations, perform the action, and get the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;responsePromise&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;url&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;method&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GET&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;trigger response&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;responsePromise&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Promise Advantage
&lt;/h2&gt;

&lt;p&gt;Promises supposedly give us an advantage over languages without async support.&lt;br&gt;
Formally, the logic described above is correct and sound:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First, we declare that we need to wait for a response from the backend, but we don't block the execution of subsequent actions&lt;/li&gt;
&lt;li&gt;Then we perform the action itself, and wait for it to complete - the browser directly tells Playwright that the action is done, the event is complete, roughly speaking&lt;/li&gt;
&lt;li&gt;Only then do we "receive" the result of the async wait that we declared to the browser&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  But Here's the Thing...
&lt;/h3&gt;

&lt;p&gt;Your backend doesn't respond to the frontend in 1 millisecond. At best, it'll respond in 100 milliseconds (unless we're talking about gRPC or WebSocket).&lt;br&gt;
The question arises: How much time passes between one promise completing and the next promise starting?&lt;br&gt;
Answer: Microseconds - it's just the time for the JS event loop to process the microtask queue.&lt;/p&gt;
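&lt;p&gt;You can sketch that claim in plain Node, without Playwright, by measuring the single microtask hop an async function takes to resume after a promise settles:&lt;/p&gt;

```javascript
// Rough illustration: the hand-off between one settled promise and the
// next line of an async function is a single microtask-queue hop,
// orders of magnitude below typical backend response times.
async function measureGapNs() {
  const t0 = process.hrtime.bigint();
  await Promise.resolve(); // already settled; we resume via the microtask queue
  const t1 = process.hrtime.bigint();
  return Number(t1 - t0); // gap in nanoseconds
}
```

&lt;p&gt;On a typical machine, the measured gap is on the order of microseconds - nowhere near the ~100 ms your backend needs to respond.&lt;/p&gt;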

&lt;h3&gt;
  
  
  So You Can Actually Write It Like This
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByTestId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;submit-button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; 
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;**/api/users&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's more compact and simpler!&lt;br&gt;
You're very unlikely to encounter a race condition where a click or fill takes so long that waiting for the response becomes pointless.&lt;br&gt;
So feel free to try writing "synchronous" code like this in your project - it works for me, and it'll work for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proof of Concept
&lt;/h2&gt;

&lt;p&gt;For skeptics, I've prepared an example repository that emulates this behavior on localhost, where everything runs locally and passes successfully:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/rmarinsky/trivial_service_for_experiments" rel="noopener noreferrer"&gt;https://github.com/rmarinsky/trivial_service_for_experiments&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, edge cases can occur depending on your frontend implementation.&lt;/p&gt;

</description>
      <category>playwright</category>
      <category>javascript</category>
      <category>automation</category>
    </item>
    <item>
      <title>Does AI agentic testing or automation really work for anybody?</title>
      <dc:creator>marinsky roma</dc:creator>
      <pubDate>Tue, 23 Sep 2025 11:28:32 +0000</pubDate>
      <link>https://forem.com/rmarinsky/is-it-really-works-for-anybody-ai-agentic-testing-or-automation-2l2a</link>
      <guid>https://forem.com/rmarinsky/is-it-really-works-for-anybody-ai-agentic-testing-or-automation-2l2a</guid>
      <description>&lt;p&gt;I have extensive experience in testing, automation, and setup of all these systems, and have successfully led numerous projects in the past that have fixed and scaled up automation, even for non-technical engineers.&lt;/p&gt;

&lt;p&gt;I have participated in and consulted with around ten companies on how to fix their agentic testing or automation approach.&lt;br&gt;
In most cases, automation and testing were so unreliable that they even produced false positives, prompting complaints from C-level executives, managers, and engineers about the decision to move away from Test Engineers or Automation Engineers.&lt;/p&gt;

&lt;p&gt;Some agentic testing providers claim to develop around 1000 test cases monthly, but after a few months, they deliver around 500 largely flaky tests.&lt;br&gt;
Some providers have claimed zero flakiness: connect Confluence or your documents, and their tool will build excellent test coverage for every feature without requiring human interaction.&lt;/p&gt;

&lt;p&gt;I had a chance to listen to their demos and watch their ads and websites with excellent descriptions.&lt;br&gt;
But most of their demos, sites, and ads are based on just a login flow, or at most a simple user registration setup.&lt;/p&gt;

&lt;p&gt;Moreover, I see lots of articles in the engineering media claiming that "Playwright MCP will blow your mind" - and it's the same: a login scenario, at most a simple registration, and shitty generated code underneath.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;So, does anybody have a success story of shifting human Test Engineering to agentic testing tools on a big product?&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>mcp</category>
      <category>playwright</category>
    </item>
  </channel>
</rss>
