<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Denis Stetskov</title>
    <description>The latest articles on Forem by Denis Stetskov (@razoorka).</description>
    <link>https://forem.com/razoorka</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1050470%2Fc5743404-1030-4439-b7c6-8bf21c71692c.jpg</url>
      <title>Forem: Denis Stetskov</title>
      <link>https://forem.com/razoorka</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/razoorka"/>
    <language>en</language>
    <item>
      <title>The Country of Geniuses That Doesn’t Exist</title>
      <dc:creator>Denis Stetskov</dc:creator>
      <pubDate>Wed, 18 Feb 2026 09:43:00 +0000</pubDate>
      <link>https://forem.com/razoorka/the-country-of-geniuses-that-doesnt-exist-5dkh</link>
      <guid>https://forem.com/razoorka/the-country-of-geniuses-that-doesnt-exist-5dkh</guid>
      <description>&lt;p&gt;On January 26, 2026, Anthropic CEO Dario Amodei published a 20,000‑word essay predicting a “country of geniuses in a datacenter” within 1–2 years — millions of entities, each smarter than any Nobel Prize winner, and 50% of entry‑level white‑collar jobs disrupted within 1–5 years.&lt;/p&gt;

&lt;p&gt;5.7 million views on X. Standing ovation from investors. I only got around to reading it now. I have things to say.&lt;/p&gt;

&lt;p&gt;I’m disappointed to watch Amodei and Anthropic slide into Altman‑ism. Different prose, same playbook.&lt;/p&gt;

&lt;p&gt;Maybe where the gods live, he’s right. Maybe in a world of perfect infrastructure, clean APIs, and unlimited compute, we’re ready to replace white‑collar workers with AI. But where the rest of us mortals work, the situation looks completely different.&lt;/p&gt;

&lt;p&gt;His own product’s System Card tells a different story. Anthropic surveyed internal researchers on whether Claude could replace an entry‑level researcher with three months of scaffolding. The answer was 0 out of 16. Zero.&lt;/p&gt;

&lt;p&gt;We’ve spent four years shipping AI integrations for clients. The models are impressive. They are not replacing white‑collar workers. Not in 1–2 years. Probably not in 5. And the reasons are more fundamental than the industry wants to admit.&lt;/p&gt;

&lt;p&gt;Let’s talk about what transformers actually can’t do. Not philosophically. Mathematically.&lt;/p&gt;

&lt;h2&gt;The Steering Wheel Problem&lt;/h2&gt;

&lt;p&gt;Non‑determinism. Even at temperature zero, the same prompt produces different outputs. This isn’t a bug. It’s a consequence of floating‑point parallel computation on GPUs. In engineering, we call components that behave unpredictably under identical conditions broken.&lt;/p&gt;
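
&lt;p&gt;The mechanism is easy to demonstrate even on a CPU: floating-point addition is not associative, so any parallel reduction that regroups a sum can change the answer. A minimal sketch:&lt;/p&gt;

```python
# Floating-point addition is not associative: regrouping the same
# three numbers changes the result. Parallel GPU reductions regroup
# sums nondeterministically, which is why identical prompts can
# diverge even at temperature zero.
a, b, c = 0.1, 1e20, -1e20

left = (a + b) + c   # 1e20 absorbs 0.1, then cancels: 0.0
right = a + (b + c)  # b and c cancel first: 0.1

print(left, right)   # 0.0 0.1
```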

&lt;p&gt;Hallucinations are provably inevitable. Formal proofs from learning theory show LLMs cannot learn all computable functions and will hallucinate when used as general‑purpose problem solvers. Even the best models show 15%+ hallucination rates on benchmarks. GPTZero found over 50 hallucinated citations in ICLR 2026 academic submissions. Trained peer reviewers, 3–5 per paper, didn’t catch them.&lt;/p&gt;

&lt;p&gt;Function composition has limits. Transformers struggle with real‑world function composition due to how softmax limits non‑local information flow. In practice, models write connected code fine. What they can’t do is reason about infrastructure constraints: what’s possible and what isn’t, where the boundaries are.&lt;/p&gt;

&lt;p&gt;I see this every day. Smart autocomplete. Incredibly good smart autocomplete. But autocomplete that can’t tell you when it’s wrong.&lt;/p&gt;

&lt;p&gt;The industry knows. They’ve quietly shifted from “let’s eliminate hallucinations” to “let’s manage uncertainty.” That’s a de facto admission: the steering wheel sometimes turns the wrong way, and nobody can fix it.&lt;/p&gt;

&lt;p&gt;It’s like selling an airplane whose steering sometimes inverts, then writing 20,000 words about how the airplane might fly to another galaxy. Bioweapons and autocracy get entire sections. The steering wheel? Not mentioned once.&lt;/p&gt;

&lt;h2&gt;The Scaling Wall Nobody Advertises&lt;/h2&gt;

&lt;p&gt;Maybe more compute fixes it? That’s been the bet for five years.&lt;/p&gt;

&lt;p&gt;Toby Ord actually read the scaling law graphs that AI companies publish with great fanfare. On log‑log charts, the lines look beautiful. Flip to linear scale and halving the model’s error rate requires increasing compute by a factor of a million.&lt;/p&gt;
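
&lt;p&gt;The arithmetic is a one-liner. If error falls as a power law in compute, halving it costs a factor of 2 to the power 1/alpha. The exponent below is an illustrative assumption, not a measured value:&lt;/p&gt;

```python
# Illustrative power-law scaling: error falls as compute**(-alpha).
# Halving the error therefore requires multiplying compute by
# 2**(1 / alpha). alpha = 0.05 is an assumed, illustrative exponent,
# in the small range typical of published scaling-law fits.
alpha = 0.05
compute_multiplier = 2 ** (1 / alpha)
print(f"{compute_multiplier:,.0f}x")  # 1,048,576x, about a million
```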

&lt;p&gt;Three walls are converging simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data:&lt;/strong&gt; High‑quality training text is finite.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute:&lt;/strong&gt; Latency constraints, energy consumption exceeding that of entire countries, grid connections for new data centers that take 2–4 years.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture:&lt;/strong&gt; The mathematical limitations above aren’t going away with more parameters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ilya Sutskever told Reuters the scaling era is over. We’re in an “age of wonder and discovery.” Translation: we don’t know what’s next.&lt;/p&gt;

&lt;p&gt;HEC Paris calls this the industry’s “well‑kept secret.” MIT research from January 2026 confirms: the gap between expensive frontier models and cheaper alternatives is shrinking. Exponentially more expensive, single‑digit percentage improvements.&lt;/p&gt;

&lt;p&gt;The 650 billion dollars Big Tech is pouring into infrastructure this year? As I wrote in my analysis of that spending: it’s not investment. It’s capitulation.&lt;/p&gt;

&lt;h2&gt;The Context Problem: 150 Projects Worth of Evidence&lt;/h2&gt;

&lt;p&gt;Here’s what Amodei’s essay gets wrong. This is what I see every week.&lt;/p&gt;

&lt;p&gt;Clients come to us with the same request: “We want to integrate AI into our processes.” Replace the white‑collar workers. Cut the headcount.&lt;/p&gt;

&lt;p&gt;So why can’t we sell them the same project?&lt;/p&gt;

&lt;p&gt;Because zero companies have the same structure. Zero run the same systems.&lt;/p&gt;

&lt;p&gt;One client runs SharePoint from 2007. Another has a custom CRM built by a contractor who left in 2015. No documentation. No API. A third uses SSO held together with duct tape and prayer. Company D has critical data in Excel spreadsheets that get emailed between departments every Friday afternoon.&lt;/p&gt;

&lt;p&gt;Amodei writes from a world where every organization has MCP‑ready infrastructure, clean data pipelines, standardized APIs. That world doesn’t exist.&lt;/p&gt;

&lt;p&gt;To replace a white‑collar worker, AI needs full organizational context. Approval chains. Informal relationships. Institutional knowledge that lives in people’s heads. The exception to the exception. The vendor who says two weeks but means six.&lt;/p&gt;

&lt;p&gt;Who gives the model that context?&lt;/p&gt;

&lt;p&gt;A human. A skilled human. The exact white‑collar worker you’re trying to replace.&lt;/p&gt;

&lt;p&gt;This is the paradox nobody discusses: the knowledge required to supervise AI effectively is the same knowledge that makes you irreplaceable.&lt;/p&gt;

&lt;h2&gt;Already Deployed Where Errors Kill&lt;/h2&gt;

&lt;p&gt;While the “country of geniuses” narrative plays out on Twitter, these architecturally unreliable systems are already making decisions about health, money, and legal rights. The promise was improvement. The results are instructive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Healthcare.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The pitch: faster diagnoses, better outcomes, lower costs.&lt;br&gt;&lt;br&gt;
The reality: UnitedHealth and Humana face class‑action lawsuits over nH Predict, an AI model that denied Medicare coverage against doctors’ recommendations. Known high error rate. Deployed anyway. 21 states passed emergency laws regulating AI in healthcare. 250+ bills introduced across U.S. states. Not because AI improved care. Because it made denial of care faster and harder to appeal.&lt;/p&gt;

&lt;p&gt;The accountability gap: doctor says “developer is responsible.” Developer says “doctor makes the decision.” Nobody owns the failure. Patients own the consequences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finance.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The pitch: smarter markets, better allocation, reduced risk.&lt;br&gt;&lt;br&gt;
The reality: AI trading makes markets more volatile, not more efficient. IMF confirmed it. GARCH modeling on S&amp;amp;P 500 shows positive association between AI trading and increased market jumps. Thousands of models trained on the same data, processing the same Fed minutes in milliseconds, create herd behavior at machine speed. We didn’t get efficient markets. We got synchronized panic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legal.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The pitch: democratize access to justice, reduce costs.&lt;br&gt;&lt;br&gt;
The reality: in 2025 alone, judges worldwide issued hundreds of decisions addressing AI hallucinations in legal filings. Roughly 90% of all known cases to date. Fabricated citations in a profession where one fake precedent can destroy a career. Justice didn’t get cheaper. It got less reliable.&lt;/p&gt;

&lt;p&gt;Three industries. Three promises of improvement. Three measurable deteriorations. With models that their own creators admit cannot be made deterministic.&lt;/p&gt;

&lt;h2&gt;Why Nobody Says This Out Loud&lt;/h2&gt;

&lt;p&gt;Simple. Everyone has reasons to stay quiet.&lt;/p&gt;

&lt;p&gt;AI companies can’t say “our technology is architecturally unreliable.” Valuation event.&lt;/p&gt;

&lt;p&gt;Investors deployed over a trillion dollars. You don’t question the thesis after you’ve bet the fund.&lt;/p&gt;

&lt;p&gt;Media runs on attention. “AI will replace everyone” gets clicks. “AI has fundamental mathematical limitations” doesn’t.&lt;/p&gt;

&lt;p&gt;And here’s what keeps me up at night. Amodei writes 20,000 words about risks — bioweapons, autocracy, existential threats. Not once does he mention the most fundamental risk: the absence of determinism.&lt;/p&gt;

&lt;p&gt;A non‑deterministic system cannot be trusted as a reliable autonomous agent. Period. Everything else is commentary.&lt;/p&gt;

&lt;h2&gt;What You Should Actually Do&lt;/h2&gt;

&lt;p&gt;AI isn’t useless. Saying that would be as dishonest as saying it replaces half the workforce.&lt;/p&gt;

&lt;p&gt;I use it every day. My team uses it on every project. The value is real. But it’s specific.&lt;/p&gt;

&lt;p&gt;AI saves 20–40% of a qualified specialist’s time. Someone who knows what to ask, how to verify, and when the model is confidently wrong.&lt;/p&gt;

&lt;p&gt;Not replacement. Amplification of existing expertise.&lt;/p&gt;

&lt;p&gt;So what do you actually do?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Increase your value.&lt;/strong&gt; Understand your domain &lt;em&gt;and&lt;/em&gt; AI’s real capabilities — not the theoretical ones from a CEO’s essay, the real ones you discover using the tool daily.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make decisions.&lt;/strong&gt; AI can’t weigh trade‑offs. Can’t navigate org politics. Can’t choose between two valid approaches based on team capabilities and timeline. SQL vs. NoSQL. Monolith vs. microservices. These require judgment. Judgment requires experience. Experience requires years of being wrong.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be the expert.&lt;/strong&gt; Deep domain knowledge is your moat. Not surface familiarity. The kind where you smell a wrong answer before you can articulate why.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don’t outsource your brain.&lt;/strong&gt; Every task you hand entirely to AI is a skill you stop developing. Every decision you let the model make is judgment you’re not exercising. Do this long enough and you’re on the wrong side of the equation when the company realizes the tool needs a supervisor, not a passenger.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the hype deflates, the question will be: “Okay, so what do we actually do with this technology?” Practitioners will answer that. Not evangelists.&lt;/p&gt;

&lt;p&gt;The country of geniuses doesn’t exist. What exists is a powerful tool that requires skilled humans to operate safely. Don’t let a 20,000‑word essay convince you the steering wheel doesn’t matter just because the destination sounds exciting.&lt;/p&gt;

&lt;p&gt;Are the AI predictions from leadership matching the engineering reality you see on the ground?&lt;/p&gt;

&lt;p&gt;If this resonated, forward it to an engineering leader who needs to hear it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>career</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>RAG Is Easy. Your Data Isn't. Why AI Projects Fail</title>
      <dc:creator>Denis Stetskov</dc:creator>
      <pubDate>Mon, 02 Feb 2026 12:42:01 +0000</pubDate>
      <link>https://forem.com/razoorka/rag-is-easy-your-data-isnt-why-ai-projects-fail-5gan</link>
      <guid>https://forem.com/razoorka/rag-is-easy-your-data-isnt-why-ai-projects-fail-5gan</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0zxx1p6dhre8kjylg0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0zxx1p6dhre8kjylg0q.png" alt="Image description='infographic'" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I joined a discovery call. The brief beforehand: "This is basically a copy of Project X. Same timeline."&lt;/p&gt;

&lt;p&gt;Project X was a marketing chatbot. Conversational, no proprietary knowledge base. Search integration and personality. We knew that scope.&lt;/p&gt;

&lt;p&gt;Thirty minutes into the call, it's clear this isn't RAG. Data processing from S3 buckets, Lambda triggers, ETL pipeline. That's table stakes. The real work? Teaching the model to query and reason over that structured data. That's not a chatbot. That's a different project entirely.&lt;/p&gt;

&lt;p&gt;"Same timeline" for a completely different architecture.&lt;/p&gt;

&lt;p&gt;This happens constantly. Not because clients mislead us. Because the gap between "AI chatbot" in their head and "AI chatbot" in reality is massive.&lt;/p&gt;

&lt;p&gt;The pattern is clear: most projects don't struggle because the engineering is hard. They struggle because everyone underestimates what comes before the engineering starts.&lt;/p&gt;

&lt;h2&gt;The Custom GPT Problem&lt;/h2&gt;

&lt;p&gt;Client built a Custom GPT over a weekend. Uploaded some PDFs. Asked it questions. It worked. They showed their CEO. Everyone got excited.&lt;/p&gt;

&lt;p&gt;"We want this, but for the whole company."&lt;/p&gt;

&lt;p&gt;That's where it stops being simple.&lt;/p&gt;

&lt;p&gt;"For the whole company" means multi-tenancy. Different departments see different data. Role-based retrieval: sales can't access HR documents, legal can't see engineering specs. Audit logs. Access controls. Compliance.&lt;/p&gt;

&lt;p&gt;Custom GPT doesn't do any of that. It's one user, one knowledge base, no permissions. The jump from "it works for me" to "it works for the organization" isn't a small step. It's a different architecture.&lt;/p&gt;
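
&lt;p&gt;What "role-based retrieval" means in practice is a hard filter before the model ever sees a chunk. A hypothetical sketch, with toy keyword scoring standing in for vector search:&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical sketch of role-based retrieval: every chunk carries an
# access list, and the retriever filters by the caller's roles before
# anything reaches the model. Names and scoring are illustrative only;
# production systems use vector similarity, not keyword overlap.

@dataclass
class Chunk:
    text: str
    allowed_roles: frozenset

def retrieve(query, chunks, user_roles, k=3):
    # Access control happens first; relevance ranking only sees
    # chunks the caller is allowed to read.
    visible = [c for c in chunks if c.allowed_roles.intersection(user_roles)]
    terms = set(query.lower().split())
    scored = sorted(
        visible,
        key=lambda c: len(terms.intersection(c.text.lower().split())),
        reverse=True,
    )
    return scored[:k]

corpus = [
    Chunk("Q3 sales pipeline review", frozenset({"sales"})),
    Chunk("Salary bands and HR policy", frozenset({"hr"})),
    Chunk("Engineering spec for the billing service", frozenset({"engineering"})),
]

# A sales user never sees HR or engineering documents, whatever the query.
hits = retrieve("sales pipeline", corpus, user_roles={"sales"})
print([c.text for c in hits])  # ['Q3 sales pipeline review']
```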

&lt;p&gt;NotebookLM, Custom GPTs. They create a dangerous illusion. They make AI feel simple because all the enterprise complexity is hidden. The prototype took a weekend. The production system takes months.&lt;/p&gt;

&lt;h2&gt;"We Have Data": The Three Versions&lt;/h2&gt;

&lt;p&gt;Every client says they have data. They mean different things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version 1: "We have documents."&lt;/strong&gt; They have PDFs. Some are text. Some are scans. Some are text with scanned tables embedded. Some are PowerPoints where the real information lives in speaker notes nobody exports.&lt;/p&gt;

&lt;p&gt;This isn't a data problem you solve once. It's a classification problem, an OCR problem, a parsing problem, and then a chunking problem. Each one adds weeks.&lt;/p&gt;
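
&lt;p&gt;Only the last step of that chain is the part tutorials show. A fixed-size chunker is a few lines (the sizes below are arbitrary placeholders); everything upstream of the clean text it expects is the actual work:&lt;/p&gt;

```python
def chunk(text, max_chars=500, overlap=50):
    # Fixed-size chunking with overlap: the easy, well-tutorialized
    # final step. Classification, OCR, and parsing must already have
    # produced clean text for this to be useful at all.
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, len(text), step)]

pieces = chunk("x" * 1200)
print([len(p) for p in pieces])  # [500, 500, 300]
```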

&lt;p&gt;&lt;strong&gt;Version 2: "We have structured data."&lt;/strong&gt; They have databases. Multiple databases. With different schemas. Some legacy system from 2012 that nobody fully understands anymore. CSV exports that break because someone used commas in a text field.&lt;/p&gt;

&lt;p&gt;Now you're not building RAG. You're building SQL agents, data transformation pipelines, and schema mapping. Different architecture entirely.&lt;/p&gt;
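
&lt;p&gt;The broken-CSV failure mode is concrete: a naive comma split silently corrupts quoted fields. A sketch:&lt;/p&gt;

```python
import csv
import io

# A text field containing a comma: naive splitting miscounts fields,
# while a real CSV parser handles the quoting.
row = '42,"Acme, Inc.",2012-06-01\r\n'

naive = row.strip().split(",")
parsed = next(csv.reader(io.StringIO(row)))

print(naive)   # ['42', '"Acme', ' Inc."', '2012-06-01']  (4 fields, wrong)
print(parsed)  # ['42', 'Acme, Inc.', '2012-06-01']       (3 fields, right)
```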

&lt;p&gt;&lt;strong&gt;Version 3: "We have both."&lt;/strong&gt; Documents and databases and spreadsheets and emails and a SharePoint nobody's organized in years.&lt;/p&gt;

&lt;p&gt;This is the most common version. And the most underestimated.&lt;/p&gt;

&lt;h2&gt;The Access Tax&lt;/h2&gt;

&lt;p&gt;Data and credentials need to arrive on day one. They rarely do.&lt;/p&gt;

&lt;p&gt;We've waited weeks for database access. Months for IT security approvals. One project stalled because a single stakeholder controlled API credentials and went on vacation.&lt;/p&gt;

&lt;p&gt;Every week of waiting is a week of zero progress. But in the client's mind, the timeline keeps running from the day they signed the contract.&lt;/p&gt;

&lt;p&gt;The access problem isn't technical. It's organizational. And organizations move slowly.&lt;/p&gt;

&lt;h2&gt;Two Types of Clients&lt;/h2&gt;

&lt;p&gt;We can predict project outcomes from the first call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clients who know their bottleneck:&lt;/strong&gt; "We spend 40 hours weekly on this specific process. Here are inputs and outputs. Here's the domain expert who'll validate results."&lt;/p&gt;

&lt;p&gt;These projects ship. Clear scope, measurable outcome, someone internal who can evaluate accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clients who want AI everywhere:&lt;/strong&gt; "We want to optimize our processes. We're not sure which ones yet."&lt;/p&gt;

&lt;p&gt;These projects stall. Not because AI can't help. Because you can't optimize processes that aren't documented. You can't measure improvement without baselines. You can't validate AI outputs without domain expertise.&lt;/p&gt;

&lt;p&gt;The technology isn't the constraint. Organizational readiness is.&lt;/p&gt;

&lt;h2&gt;The Work That Isn't Ours&lt;/h2&gt;

&lt;p&gt;Here's what successful projects require from the client side:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain expertise for validation.&lt;/strong&gt; We build the system. We cannot tell you if the output is correct for your industry, your regulations, your edge cases. That's your job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation data.&lt;/strong&gt; Before we write code, we need examples: "When users ask X, good answers look like Y." Hundreds of them. This is how we measure progress versus confident wrongness.&lt;/p&gt;
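
&lt;p&gt;A minimal sketch of what such an evaluation set looks like, with a hypothetical stand-in for the real pipeline:&lt;/p&gt;

```python
# Evaluation pairs: "when users ask X, good answers look like Y",
# scored by whether the system's answer contains each required fact.
# answer() is a hypothetical stand-in for the deployed pipeline.
eval_set = [
    {"question": "What is the refund window?", "must_contain": ["30 days"]},
    {"question": "Who approves discounts above 20%?", "must_contain": ["sales director"]},
]

def answer(question):
    # Canned responses standing in for the real system under test.
    canned = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Who approves discounts above 20%?": "Ask your manager.",
    }
    return canned[question]

def accuracy(cases):
    hits = 0
    for case in cases:
        out = answer(case["question"]).lower()
        if all(fact.lower() in out for fact in case["must_contain"]):
            hits += 1
    return hits / len(cases)

print(accuracy(eval_set))  # 0.5, one of two answers contains the required fact
```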

&lt;p&gt;&lt;strong&gt;Accuracy decisions.&lt;/strong&gt; We can hit 85% accuracy in 6 weeks. 95% might take another 6 weeks. 99% might be impossible with your data quality. The last 5% of accuracy, serving maybe 2% of users, might cost 40% of the budget. You decide if it's worth it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ongoing maintenance.&lt;/strong&gt; When source documents change, someone updates them. When accuracy drifts, someone investigates. This isn't a one-time build. It's an ongoing operation.&lt;/p&gt;

&lt;p&gt;Most clients expect to hand off requirements and receive a product. AI doesn't work that way. It's a collaboration that requires their continuous involvement.&lt;/p&gt;

&lt;h2&gt;Simple Project, Real Timeline&lt;/h2&gt;

&lt;p&gt;Best case scenario. Clean data, clear scope, engaged stakeholder with domain knowledge.&lt;/p&gt;

&lt;p&gt;6-8 weeks. Most of that time goes to prompt engineering and iteration. Not infrastructure.&lt;/p&gt;

&lt;p&gt;But "clean data" is rare. "Clear scope" requires work upfront. "Engaged stakeholder" means someone's calendar is blocked for this project, not squeezed between other priorities.&lt;/p&gt;

&lt;p&gt;When any of these are missing, multiply the timeline. When all three are missing, reconsider starting.&lt;/p&gt;

&lt;h2&gt;Why Projects Don't Reach Production&lt;/h2&gt;

&lt;p&gt;Projects rarely fail technically. They fail organizationally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built but never integrated.&lt;/strong&gt; We deliver a working system. It sits in staging because the client doesn't have engineering resources to integrate it. They budgeted for building, not deploying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Value mismatch discovered late.&lt;/strong&gt; Midway through, the client realizes the problem they described isn't their actual pain point. The AI works. The business case didn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diminishing returns rejected.&lt;/strong&gt; We explain the math: last 5% of accuracy for edge cases costs 40% of remaining budget. They want it anyway. Then budget runs out. Then the project is "over scope."&lt;/p&gt;

&lt;p&gt;None of these are engineering problems.&lt;/p&gt;

&lt;h2&gt;What Actually Helps&lt;/h2&gt;

&lt;p&gt;Before signing contracts, dig into the actual data. Not descriptions of data. The data itself.&lt;/p&gt;

&lt;p&gt;We run a Rapid Validation Sprint. Four weeks. Real data access, real complexity mapping, real unknowns identified. Then we estimate based on reality, not assumptions.&lt;/p&gt;

&lt;p&gt;The companies that quote 50% less aren't doing this work. They're guessing. When the data turns out messier than expected (it always does), they either blow the budget or cut scope.&lt;/p&gt;

&lt;h2&gt;The Point&lt;/h2&gt;

&lt;p&gt;RAG tutorials make this look easy. Upload documents, chunk them, embed them, query them. Done.&lt;/p&gt;

&lt;p&gt;Production is different. Data is messy. Access is slow. Validation requires domain expertise you don't have. Accuracy expectations exceed what the data supports.&lt;/p&gt;

&lt;p&gt;The engineering is the straightforward part. Everything that comes before it: that's where projects actually succeed or fail.&lt;/p&gt;

&lt;p&gt;Most AI initiatives struggle not because the technology isn't ready. Because the organization isn't ready. Data isn't organized. Processes aren't documented. Nobody's assigned to validate outputs.&lt;/p&gt;

&lt;p&gt;That's not a criticism. It's just the reality.&lt;/p&gt;

&lt;p&gt;The question isn't whether AI can help your business. It's whether your business is ready to help the AI.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's been your experience with AI project expectations versus reality? Comment below, I read every response.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If this resonates, share it with someone about to sign an AI contract. Better they hear this now.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>webdev</category>
      <category>rag</category>
    </item>
    <item>
      <title>The First Full-Scale Cyber War: 4 Years of Lessons</title>
      <dc:creator>Denis Stetskov</dc:creator>
      <pubDate>Sun, 25 Jan 2026 14:25:17 +0000</pubDate>
      <link>https://forem.com/razoorka/the-first-full-scale-cyber-war-4-years-of-lessons-2mgd</link>
      <guid>https://forem.com/razoorka/the-first-full-scale-cyber-war-4-years-of-lessons-2mgd</guid>
      <description>&lt;p&gt;December 12, 2023. 7:00 AM Kyiv time. Kyivstar, Ukraine's largest mobile operator with 24.3 million subscribers, goes silent.&lt;/p&gt;

&lt;p&gt;Mobile service. Internet. Air raid alert systems in Kyiv and Sumy regions. All offline.&lt;/p&gt;

&lt;p&gt;Within hours, Sandworm hackers destroyed 10,000 computers, 4,000+ servers, all cloud storage, and backups.&lt;/p&gt;

&lt;p&gt;Illia Vitiuk, head of SBU's cybersecurity department: &lt;em&gt;"This is probably the first example of a destructive cyberattack that completely destroyed the core of a telecoms operator."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The hackers had been inside since May 2023. Full access since November. Seven months inside the infrastructure of a country's largest carrier. Nobody noticed.&lt;/p&gt;

&lt;p&gt;This wasn't an isolated incident. &lt;strong&gt;This is the first full-scale cyber war in history.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And the lessons apply to every power grid, railway system, and telecom provider worldwide.&lt;/p&gt;




&lt;h2&gt;The Scale Nobody Talks About&lt;/h2&gt;

&lt;p&gt;Between 2022 and 2024, Ukraine recorded over 9,000 cyber incidents. The trajectory, according to SSSCIP data reported by Infosecurity Magazine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2021:&lt;/strong&gt; 1,350 incidents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2024:&lt;/strong&gt; 4,315 incidents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Growth:&lt;/strong&gt; 220% in three years&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Russia deployed 17+ unique wiper malware families: programs designed solely to destroy data beyond recovery. WhisperGate, HermeticWiper, CaddyWiper, Industroyer2, AcidRain. Each built for specific targets.&lt;/p&gt;

&lt;p&gt;But here's what Western coverage often misses: &lt;strong&gt;this isn't one-sided aggression.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ukraine hit back. Hard.&lt;/p&gt;

&lt;p&gt;In July 2024, Ukraine's military intelligence (GUR) claimed responsibility for a week-long DDoS attack on Russia's banking system. Sberbank, Alfa-Bank, VTB, Gazprombank, the Central Bank. Users reportedly couldn't withdraw cash from ATMs.&lt;/p&gt;

&lt;p&gt;In December 2025, anonymous hackers breached Mikord, a developer of Russia's unified military draft registry. 30 million records. Source code, documentation, backups destroyed, according to investigative outlet iStories, which verified the breach. Mikord's director confirmed the attack. Russia's Defense Ministry denied any impact on the registry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is symmetric warfare. Both sides are hitting critical infrastructure. Both sides claim real damage.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;The Attacks That Changed Everything&lt;/h2&gt;

&lt;h3&gt;Viasat: The Hour-Zero Strike&lt;/h3&gt;

&lt;p&gt;February 24, 2022. 03:02 UTC. Exactly one hour before Russian ground forces crossed the border.&lt;/p&gt;

&lt;p&gt;Attackers exploited a VPN misconfiguration at Viasat's management center in Turin, Italy. They pushed AcidRain wiper malware to 40,000-45,000 satellite modems via legitimate software update mechanisms.&lt;/p&gt;

&lt;p&gt;The result: Ukrainian military command and control went dark at the moment of invasion. Spillover knocked out remote control of 5,800 German wind turbines and disrupted service for roughly 9,000 French subscribers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One misconfigured VPN. 45,000 modems bricked. Military communications disrupted during the most critical hour of the war.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SentinelOne researchers called it &lt;em&gt;"the biggest known hack of the war."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;Industroyer2: The Blackout That Almost Was&lt;/h3&gt;

&lt;p&gt;April 8, 2022. ESET researchers discovered Industroyer2 scheduled to execute at 16:10 UTC against Ukrainian electrical substations. CaddyWiper was programmed to run 10 minutes later to destroy forensic evidence.&lt;/p&gt;

&lt;p&gt;The malware implemented IEC 60870-5-104, the protocol used by electrical substation protection relays. It contained hardcoded IP addresses for eight target ICS devices.&lt;/p&gt;

&lt;p&gt;If successful: a blackout affecting over 2 million people. The largest cyber-induced power outage in history.&lt;/p&gt;

&lt;p&gt;It failed. CERT-UA, ESET, and Microsoft coordinated a defense based on lessons from the 2016 grid attack. The attack was stopped hours before execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pattern: preparation from crisis one saved crisis two.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;Kyivstar: When Security Investment Isn't Enough&lt;/h3&gt;

&lt;p&gt;Kyivstar wasn't some underfunded government agency. It was Ukraine's largest private telecom, a subsidiary of Amsterdam-based VEON, with serious security investment.&lt;/p&gt;

&lt;p&gt;Didn't matter.&lt;/p&gt;

&lt;p&gt;Sandworm penetrated the network in March 2023. By November, they had full access. On December 12, they executed, destroying the core infrastructure, wiping &lt;em&gt;"almost everything."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Vitiuk's assessment: &lt;em&gt;"This attack is a big message, a big warning, not only to Ukraine but for the whole Western world to understand that no one is actually untouchable."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;40% of Kyivstar's infrastructure disabled. Services restored in phases over eight days. Losses estimated in the billions of hryvnia.&lt;/p&gt;

&lt;h3&gt;Ukrzaliznytsia: March 2025&lt;/h3&gt;

&lt;p&gt;With Ukrainian airspace closed since 2022, railways became the country's lifeline. 20 million passengers and 148 million tonnes of freight in 2024.&lt;/p&gt;

&lt;p&gt;On March 23, 2025, a &lt;em&gt;"large-scale, systematic, non-trivial and multi-level"&lt;/em&gt; cyberattack hit Ukrzaliznytsia's online systems. CERT-UA investigation found TTPs &lt;em&gt;"characteristic of Russian intelligence services."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Website and mobile app: offline. Long queues at physical ticket offices.&lt;/p&gt;

&lt;p&gt;But trains never stopped running.&lt;/p&gt;

&lt;p&gt;The difference from Kyivstar: backup protocols implemented after previous attacks. Systems built during crisis one carried through crisis two.&lt;/p&gt;

&lt;p&gt;CEO Oleksandr Pertsovskyi: &lt;em&gt;"The cyber-attack on the company was targeted and meticulously planned. However, not a single Ukrzaliznytsia train was halted for even a moment."&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;Ukraine's Counter-Offensive&lt;/h2&gt;

&lt;p&gt;Western media focuses on Russian attacks. The Ukrainian response gets less attention. The following operations were claimed by GUR or pro-Ukrainian hackers. Independent verification varies, and Russian authorities have denied most claims.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tax Service, December 2023.&lt;/strong&gt; GUR claimed it destroyed databases across 2,300+ regional servers. Configuration files &lt;em&gt;"which for years ensured the functioning of Russia's tax system"&lt;/em&gt; allegedly wiped. Russia's Federal Tax Service denied any operational impact, though users reported access problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Planeta, January 2024.&lt;/strong&gt; GUR claimed an attack on a state satellite data center. 280 servers allegedly destroyed. 2 petabytes of military-relevant weather and satellite data reportedly wiped. Supercomputers &lt;em&gt;"not fully restorable due to sanctions."&lt;/em&gt; Claimed damage: $10+ million.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Banking System, July 2024.&lt;/strong&gt; GUR claimed a week-long DDoS campaign targeting Sberbank, Alfa-Bank, VTB, Gazprombank, Central Bank, plus VK, Discord, and the national payment system. Reports indicated ATM disruptions across Russia.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Russian Railways, March 2024 &amp;amp; June 2025.&lt;/strong&gt; Multiple attacks reportedly taking down RZD's website and app. Moscow Metro hit days after Ukrzaliznytsia attack in apparent retaliation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mikord Draft Registry, December 2025.&lt;/strong&gt; Anonymous hackers (not attributed to GUR) breached Mikord, a key developer of Russia's unified military registration system. The Moscow Times and iStories verified the breach. Mikord's director confirmed the hack. The registry contains 30 million conscription records. Source code, documentation, and backups reportedly destroyed. Russia's Defense Ministry called the reports &lt;em&gt;"fake news."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Grigory Sverdlin of the anti-conscription organization Idite Lesom: &lt;em&gt;"For several more months, this behemoth won't be able to send people off to kill and die."&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Vulnerability Patterns
&lt;/h2&gt;

&lt;p&gt;Four years of cyber warfare exposed consistent vulnerability classes. These aren't Ukraine-specific. They exist in Western infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  VPN and Remote Access
&lt;/h3&gt;

&lt;p&gt;The Viasat attack exploited a VPN misconfiguration. Kyivstar's breach likely started with a compromised employee account. CISA documented GRU exploitation of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CVE-2018-13379 (FortiGate)&lt;/li&gt;
&lt;li&gt;CVE-2019-11510 (Pulse Secure)&lt;/li&gt;
&lt;li&gt;CVE-2019-19781 (Citrix)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vulnerabilities with patches available for 5+ years.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dwell Time
&lt;/h3&gt;

&lt;p&gt;Kyivstar: attackers inside for seven months before execution. October 2022 power grid attack: Mandiant found attackers with SCADA access for up to three months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sophisticated adversaries don't rush. Detection capabilities that can't identify months-long intrusions are detection capabilities that don't work.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Supply Chain
&lt;/h3&gt;

&lt;p&gt;The Viasat attack weaponized legitimate software update mechanisms. CERT-UA documented at least three supply chain breaches in March 2024 energy sector attacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  IT/OT Convergence
&lt;/h3&gt;

&lt;p&gt;The October 2022 grid attack gained OT access through a hypervisor hosting a SCADA management instance. Attackers used native MicroSCADA binaries, living-off-the-land techniques. Mandiant: &lt;em&gt;"a growing maturity of Russia's offensive OT arsenal."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Victor Zhora, SSSCIP Deputy Chairman, emphasized air-gapping between IT and OT as fundamental. Most Western utilities have moved in the opposite direction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Centralization
&lt;/h3&gt;

&lt;p&gt;The Mikord hack illustrates the pattern: centralization creates single points of failure.&lt;/p&gt;

&lt;p&gt;Ukraine's cloud migration (15+ petabytes distributed across AWS, Google Cloud, Microsoft Azure) proved more resilient than hardened on-premises facilities.&lt;/p&gt;

&lt;p&gt;Deputy Prime Minister Mykhailo Fedorov: &lt;em&gt;"Russian missiles can't destroy the cloud."&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Worked
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cloud Migration
&lt;/h3&gt;

&lt;p&gt;One week before the invasion, Ukraine's parliament enabled government data migration to the cloud. PrivatBank (20 million customers) migrated 270 applications and 4 petabytes in 45 days. Financial services continued throughout the war.&lt;/p&gt;

&lt;h3&gt;
  
  
  Detection Speed
&lt;/h3&gt;

&lt;p&gt;Microsoft detected HermeticWiper hours before the invasion. Within three hours, signatures were pushed globally. The Industroyer2 defense succeeded because CERT-UA, ESET, and Microsoft coordinated based on 2016 lessons.&lt;/p&gt;

&lt;h3&gt;
  
  
  Backup Protocols
&lt;/h3&gt;

&lt;p&gt;Ukrzaliznytsia's trains ran during the attack because they'd been attacked before. Kyivstar took eight days to restore service. The difference: systems built during previous crises.&lt;/p&gt;

&lt;h3&gt;
  
  
  Public-Private Partnership
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft:&lt;/strong&gt; $400+ million in aid&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google:&lt;/strong&gt; Project Shield on 150+ websites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare:&lt;/strong&gt; ~130 government domains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS:&lt;/strong&gt; Snowball devices shipped to Poland within 48 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Carnegie Endowment: &lt;em&gt;"delivering cyber defense at scale could only be achieved by private sector entities that owned, operated, and understood the most widely-used digital services."&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Human Layer
&lt;/h2&gt;

&lt;p&gt;Every major attack in this analysis started the same way: &lt;strong&gt;a person.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kyivstar: likely a compromised employee account. Viasat: a VPN misconfiguration someone didn't catch. The GRU exploits from 2018 and 2019 still work because someone hasn't patched systems that have had fixes for five years.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nation-state attackers don't need zero-days when humans provide the access.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I manage distributed engineering teams for a US-based company, with engineers in Ukraine. We've operated through four years of this war. Our security isn't optional:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mandatory quarterly security training&lt;/li&gt;
&lt;li&gt;BYOD policy with device management&lt;/li&gt;
&lt;li&gt;Password policy with breach monitoring&lt;/li&gt;
&lt;li&gt;2FA on everything without exceptions&lt;/li&gt;
&lt;li&gt;Access reviews when roles change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this is exotic. All of it is enforced. The same principle applies to AI safety. Different domain, same lesson: &lt;strong&gt;policies that aren't enforced aren't policies.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The difference between &lt;em&gt;"we have a policy"&lt;/em&gt; and &lt;em&gt;"the policy is mandatory"&lt;/em&gt; is the difference between Kyivstar and Ukrzaliznytsia.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The companies that survived had one thing in common: policies that were actually followed, not just documented.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means for Everyone Else
&lt;/h2&gt;

&lt;p&gt;CISA Director Jen Easterly: &lt;em&gt;"This is a world where such a conflict, halfway across the planet, could well endanger the lives of Americans here at home through disruption of pipelines, pollution of our water systems, severing of our communications, and crippling of our transportation nodes."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's already happening.&lt;/strong&gt; May 2025: CISA and NSA published a joint advisory. GRU Unit 26165 has been targeting Western logistics and technology companies involved in Ukraine aid since 2022. Targets include air, sea, and rail entities in NATO member states.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Water systems are being hit.&lt;/strong&gt; CISA documented pro-Russia groups exploiting unsecured VNC connections in water facilities. The attacks &lt;em&gt;"have not yet caused injury."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Math of Preparation
&lt;/h2&gt;

&lt;p&gt;Ukraine's experience validates a principle that applies beyond war:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Systems built during crisis one determine whether you survive crisis four.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The second blackout campaign in 2023 hit less hard because teams had backup power. The third in 2024: less disruption. The fourth in October 2025: near-normal operations despite 12+ hour outages.&lt;/p&gt;

&lt;p&gt;Ukrzaliznytsia's trains ran because they'd been attacked before. Kyivstar, despite security investment, had no institutional memory of crisis response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preparation compounds. Vulnerability compounds.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every organization running critical infrastructure faces a choice: build systems during peace for crises that will come, or scramble during attacks with tools that don't exist.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;The cyber war in Ukraine isn't just a regional conflict. It's a live demonstration of what works when nation-state attackers target infrastructure.&lt;/p&gt;

&lt;p&gt;The lessons are available. The question is whether anyone is paying attention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For engineering leaders: the systems that survive crises aren't built during crises. They're built before.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="" class="crayons-btn crayons-btn--primary"&gt;If this analysis was useful, share it with someone responsible for infrastructure security.&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>devops</category>
      <category>infrastructure</category>
      <category>infosec</category>
    </item>
    <item>
      <title>The Grok Precedent: Why AI Creators Are About to Lose Their Legal Shield</title>
      <dc:creator>Denis Stetskov</dc:creator>
      <pubDate>Mon, 12 Jan 2026 12:31:51 +0000</pubDate>
      <link>https://forem.com/razoorka/the-grok-precedent-why-ai-creators-are-about-to-lose-their-legal-shield-47nl</link>
      <guid>https://forem.com/razoorka/the-grok-precedent-why-ai-creators-are-about-to-lose-their-legal-shield-47nl</guid>
      <description>&lt;p&gt;December 28, 2025. A user on X tags &lt;a class="mentioned-user" href="https://dev.to/grok"&gt;@grok&lt;/a&gt; under a woman's photo. The prompt: "remove clothes."&lt;/p&gt;

&lt;p&gt;Within hours, Grok was generating sexualized images across the platform. Not just adults. Minors. Real people who never consented.&lt;/p&gt;

&lt;p&gt;Copyleaks ran a quick review of Grok's public image stream. The rate: one nonconsensual sexualized image per minute.&lt;/p&gt;

&lt;p&gt;The Internet Watch Foundation reported a 400% increase in AI-generated child sexual abuse material in the first six months of 2025.&lt;/p&gt;

&lt;p&gt;By January 1, French members of parliament referred the case to the Paris Prosecutor's Office. The charge: dissemination of sexually explicit deepfakes, including images of minors, generated by an AI system.&lt;/p&gt;

&lt;p&gt;Not a lawsuit against anonymous users. A criminal investigation targeting X and xAI.&lt;/p&gt;

&lt;p&gt;Grok acknowledged the violation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I deeply regret an incident on Dec 28, 2025, where I generated and shared an AI image of two young girls (estimated ages 12-16) in sexualized attire... This violated ethical standards and potentially US laws on CSAM." &lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Grok, public post on X, January 1, 2026)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AI apologized. The legal system is not impressed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 30-Year Shield
&lt;/h2&gt;

&lt;p&gt;Section 230 of the Communications Decency Act. Tech's favorite law.&lt;/p&gt;

&lt;p&gt;The logic was simple. Congress wrote it in 1996 to protect message boards. User posts something defamatory on AOL? AOL isn't the publisher. Just the host. Liability follows the person who typed the content.&lt;/p&gt;

&lt;p&gt;This shield created the modern internet. Facebook isn't liable for user posts. YouTube doesn't get sued for uploads. Twitter could host billions of messages without reviewing each one.&lt;/p&gt;

&lt;p&gt;The key phrase: &lt;strong&gt;"information provided by another."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Platforms host. Users create. Liability follows creation.&lt;/p&gt;

&lt;p&gt;For 30 years, this worked. Or at least, companies pretended it did.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Jurisdictions, 30 Days
&lt;/h2&gt;

&lt;h3&gt;
  
  
  United Kingdom, December 18, 2025
&lt;/h3&gt;

&lt;p&gt;The government announced plans to ban nudification apps. Not their use. Their creation and supply.&lt;/p&gt;

&lt;p&gt;Technology Secretary Liz Kendall: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I am introducing a new offence to ban nudification tools, so that those who profit from them or enable their use will feel the full force of the law."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Prison sentences. For creators. Not users.&lt;/p&gt;

&lt;h3&gt;
  
  
  France, January 1, 2026
&lt;/h3&gt;

&lt;p&gt;The government accused Grok of generating "clearly illegal" sexual content. Potential violation of the EU Digital Services Act. Two MPs referred the case to the Paris Prosecutor's Office.&lt;/p&gt;

&lt;p&gt;X is already under ongoing DSA investigation. Last month they got hit with a €120 million fine for deceptive verification practices and transparency violations. Now this.&lt;/p&gt;

&lt;h3&gt;
  
  
  EU AI Act, August 2026
&lt;/h3&gt;

&lt;p&gt;The majority of obligations fall on providers. Developers. Not deployers. Not users. The companies that build the systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pattern: liability is shifting upstream.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Grok Doesn't Host. Grok Generates.
&lt;/h2&gt;

&lt;p&gt;Here's the legal argument that's about to reshape the industry.&lt;/p&gt;

&lt;p&gt;Section 230 was written for platforms that host user-generated content. Forums. Comment sections. Social feeds. Content comes from users. Platform transmits it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI breaks this model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When someone prompts Grok to "remove clothes" from a photo, Grok doesn't search a database. Doesn't retrieve content created by another user. Grok generates new content. The sexualized image didn't exist until Grok created it.&lt;/p&gt;

&lt;p&gt;Professor Chinmayi Sharma at Fordham Law, to Fortune:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Section 230 was built to protect platforms from liability for what users say, not for what the platforms themselves generate... Transformer-based chatbots don't just extract. They generate new, organic outputs. That looks far less like neutral intermediation and far more like authored speech."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Congressional Research Service analysis is more direct: if AI "creates or develops" content that doesn't appear in its training data, the provider may be considered "responsible for the development of the specific content." Unprotected by Section 230.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grok isn't hosting harmful content. Grok is creating it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That distinction changes everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Implications
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Safety Before Launch
&lt;/h3&gt;

&lt;p&gt;The UK ban targets creators who "design or supply" nudification tools. Not tools that were misused. Tools that enable misuse. If your AI can generate harmful content, you're liable for building that capability.&lt;/p&gt;

&lt;p&gt;Terms of service won't save you.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Unfiltered" Is a Liability
&lt;/h3&gt;

&lt;p&gt;xAI marketed Grok's "Spicy Mode" as a feature. Fewer guardrails. More freedom. Less corporate sanitization.&lt;/p&gt;

&lt;p&gt;That marketing copy is now in prosecutors' files.&lt;/p&gt;

&lt;p&gt;I wrote about this pattern in &lt;em&gt;From Cancer Cures to Pornography: The Six-Month Descent of AI&lt;/em&gt;. The industry had a choice between building tools that help people and building products designed to be maximally addictive. Most chose wrong. Grok chose spectacularly wrong.&lt;/p&gt;

&lt;p&gt;Every marketing decision emphasizing fewer safety constraints becomes potential evidence of negligent design.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Move Fast" = Criminal Exposure
&lt;/h3&gt;

&lt;p&gt;The EU AI Act requires providers of high-risk AI systems to establish risk management, ensure data governance, maintain technical documentation, implement human oversight, meet cybersecurity standards.&lt;/p&gt;

&lt;p&gt;Fines can reach 7% of global annual revenue. The UK's proposed laws: prison sentences for individuals who design harmful AI tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The era of shipping first and apologizing later is over.&lt;/strong&gt; At least if you want to operate in markets representing 450+ million consumers.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Engineering Reality
&lt;/h2&gt;

&lt;p&gt;We built a healthcare ML product that never launched. Fully functional. Ready to ship. The FDA said no. Months of development. Zero users.&lt;/p&gt;

&lt;p&gt;"Move fast" doesn't work when regulators move slow.&lt;/p&gt;

&lt;p&gt;We spent six weeks on FanDuel's Chuck before legal signed off. Not fixing bugs. Building guardrails. Every topic that could give Barkley or FanDuel legal exposure had to be walled off. Six weeks of prompt engineering, edge case testing, and evaluation runs.&lt;/p&gt;

&lt;p&gt;That's the new math. &lt;strong&gt;Development time plus legal review time plus evaluation time.&lt;/strong&gt; The last one isn't optional anymore.&lt;/p&gt;

&lt;p&gt;We build evaluation suites as part of the development process now. Not after. During. Every prompt variation, every edge case, every jailbreak attempt. They always find something. Always. The question is whether you find it before your users do—or before a prosecutor does.&lt;/p&gt;

&lt;p&gt;RBAC and multi-tenancy aren't optional. Sales sees sales data. HR sees HR data. Client A's context never touches Client B's model. Ever. You'd be surprised how many vendors skip this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit trails for everything.&lt;/strong&gt; Every prompt. Every response. Every action. When a regulator asks what your AI generated on a specific date, you need the answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Uncomfortable Truth
&lt;/h2&gt;

&lt;p&gt;The AI industry spent three years in a race to capability. Whoever had the most powerful model won. Whoever shipped fastest dominated. Safety was a PR concern. Not an engineering priority.&lt;/p&gt;

&lt;p&gt;That era is ending.&lt;/p&gt;

&lt;p&gt;France isn't investigating xAI because Grok is powerful. They're investigating because Grok generated child sexual abuse material and the company's safeguards failed to prevent it.&lt;/p&gt;

&lt;p&gt;The UK isn't banning nudification tools because they're impressive technology. They're banning them because 19% of under-18s reporting to the Internet Watch Foundation's helpline said their explicit imagery had been manipulated. A problem that didn't exist at this scale before AI made it trivially easy.&lt;/p&gt;

&lt;p&gt;The EU isn't imposing provider liability because they hate innovation. They're imposing it because when AI systems cause harm, someone needs to be accountable. "The user prompted it" isn't going to cut it when the system itself creates the harmful output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grok doesn't host content. Grok generates it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That distinction is about to cost the entire industry its legal shield.&lt;/p&gt;

&lt;p&gt;And honestly? Good.&lt;/p&gt;

&lt;p&gt;Big Tech needed this wake-up call. The "ship fast, fix later" mentality brought us to where I wrote about in &lt;em&gt;The Great Software Quality Collapse&lt;/em&gt;. When flagship companies behave like consequences don't exist, what do you expect from everyone else?&lt;/p&gt;

&lt;p&gt;Some guardrails aren't anti-innovation. Pharma can't ship drugs without trials. Auto can't sell cars without safety standards. Construction can't build without permits.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Should Do Monday Morning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Audit your safety architecture
&lt;/h3&gt;

&lt;p&gt;Not your marketing copy. Your actual technical controls. What can your system generate? What can't it? How do you know?&lt;/p&gt;

&lt;h3&gt;
  
  
  Document everything
&lt;/h3&gt;

&lt;p&gt;The EU AI Act requires extensive technical documentation. Start building that paper trail now.&lt;/p&gt;

&lt;h3&gt;
  
  
  Review your contracts
&lt;/h3&gt;

&lt;p&gt;Who bears liability when your AI misbehaves? If you don't know, your lawyers should.&lt;/p&gt;

&lt;h3&gt;
  
  
  Plan for EU compliance
&lt;/h3&gt;

&lt;p&gt;August 2026 is seven months away. If you haven't started, you're already behind.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this was useful, forward it to another engineering leader who's building AI products.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What's your take?&lt;/strong&gt; Are these regulations necessary guardrails or innovation killers? Drop your thoughts in the comments below.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>legal</category>
      <category>ethics</category>
      <category>security</category>
    </item>
    <item>
      <title>The Holiday Season That Keeps Making Tech History</title>
      <dc:creator>Denis Stetskov</dc:creator>
      <pubDate>Tue, 23 Dec 2025 12:18:17 +0000</pubDate>
      <link>https://forem.com/razoorka/the-holiday-season-that-keeps-making-tech-history-33p</link>
      <guid>https://forem.com/razoorka/the-holiday-season-that-keeps-making-tech-history-33p</guid>
      <description>&lt;p&gt;Happy holidays, fellow engineers.&lt;/p&gt;

&lt;p&gt;What a year 2025 has been. AI agents everywhere, more layoffs, the return-to-office wars continuing, and enough Slack notifications to last a lifetime. We're all exhausted. Nobody wants to read another hot take or industry analysis right now.&lt;/p&gt;

&lt;p&gt;So let's not do that.&lt;/p&gt;

&lt;p&gt;Instead, grab your drink of choice, find a comfortable spot, and let's take a break together. No frameworks. No uncomfortable truths. Just some wild stories about what happens to tech when everyone goes on vacation.&lt;/p&gt;

&lt;p&gt;Running remote teams in Ukraine during the holiday period is chaotic. Half the team celebrates Christmas on December 25th, half on January 7th. New Year's is sacred for everyone. Smart engineering leaders freeze deployments from December 20th to January 15th. I've read all the best practices. I know the risks. My next release is January 2nd. Some lessons we learn. Others, we just keep writing about.&lt;/p&gt;

&lt;p&gt;Here's a fun fact for your next holiday dinner: Tim Berners-Lee launched the World Wide Web on Christmas Day 1990. His wife was nine months pregnant at the time. The baby arrived on New Year's Day.&lt;/p&gt;

&lt;p&gt;His colleagues said he fathered two babies that holiday season. One needed diapers. The other changed civilization.&lt;/p&gt;

&lt;p&gt;Turns out, the week between Christmas and New Year's has a habit of making tech history. Some of it is brilliant. Some of it is catastrophic. All of it is surprisingly entertaining.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Internet Has Three Birthdays (All During Holidays)
&lt;/h2&gt;

&lt;p&gt;The web went live on Christmas 1990. But the internet itself? That was born on New Year's Day 1983, when ARPANET switched to TCP/IP.&lt;/p&gt;

&lt;p&gt;And DNS, the system that lets you type "google.com" instead of memorizing numbers? January 1, 1985.&lt;/p&gt;

&lt;p&gt;Three foundational technologies. All launched while everyone else was eating leftovers and watching football.&lt;/p&gt;

&lt;p&gt;Why? January 1st is actually genius timing. Minimal traffic. Clean calendar date. And if something breaks, you have a few days to fix it before anyone notices.&lt;/p&gt;

&lt;p&gt;Engineers have been exploiting this window for decades.&lt;/p&gt;

&lt;h2&gt;
  
  
  The $429 Million Christmas Miracle
&lt;/h2&gt;

&lt;p&gt;Five days before Christmas 1996, Apple made an announcement that saved the company.&lt;/p&gt;

&lt;p&gt;They bought NeXT for $429 million. More importantly, they got Steve Jobs back.&lt;/p&gt;

&lt;p&gt;Apple was 90 days from bankruptcy. Their next-generation operating system had just failed. They were out of options.&lt;/p&gt;

&lt;p&gt;Gil Amelio, Apple's CEO at the time, told 200 journalists: "I'm not buying software. I'm buying Steve."&lt;/p&gt;

&lt;p&gt;That software became Mac OS X. Then iOS. Then the foundation of every Apple device you own today.&lt;/p&gt;

&lt;p&gt;Apple went from near-death to becoming the first $3 trillion company in history. All because of a deal signed during the holiday shopping season.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Christmas Tree That Crashed IBM
&lt;/h2&gt;

&lt;p&gt;In December 1987, a German student wrote a simple program. It drew a festive Christmas tree out of ASCII text characters on your screen, then emailed itself to everyone in your address book.&lt;/p&gt;

&lt;p&gt;Harmless holiday cheer, right?&lt;/p&gt;

&lt;p&gt;It crashed 350,000 IBM terminals worldwide. Networks collapsed under the load. The first viral computer worm in history spread through corporate email systems like wildfire.&lt;/p&gt;

&lt;p&gt;They called it the Christmas Tree EXEC. It became the template for every email virus that followed, including the infamous ILOVEYOU worm thirteen years later.&lt;/p&gt;

&lt;p&gt;The lesson: never trust festive ASCII art from strangers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gaming's Grinch Moment
&lt;/h2&gt;

&lt;p&gt;Christmas Day 2014. Millions of kids unwrap new PlayStation and Xbox consoles. They rush to set them up. They try to go online.&lt;/p&gt;

&lt;p&gt;Nothing works.&lt;/p&gt;

&lt;p&gt;A hacking group called Lizard Squad had taken down both PlayStation Network and Xbox Live simultaneously. 158 million gamers. Christmas morning. No online gaming.&lt;/p&gt;

&lt;p&gt;The attack only stopped when Kim Dotcom (yes, that Kim Dotcom) bribed them with free cloud storage accounts.&lt;/p&gt;

&lt;p&gt;Merry Christmas, gamers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bug That Killed a Million Zunes at Midnight
&lt;/h2&gt;

&lt;p&gt;Remember the Zune? Microsoft's iPod competitor?&lt;/p&gt;

&lt;p&gt;On December 31, 2008, at exactly midnight, every single Zune 30GB in the world froze. Simultaneously. A million devices, dead at the same moment.&lt;/p&gt;

&lt;p&gt;The culprit was a tiny bug in how the device handled leap years:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;366&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;days&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;366&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On day 366 of a leap year, &lt;code&gt;days&lt;/code&gt; was exactly 366: the &lt;code&gt;days &amp;gt; 366&lt;/code&gt; condition never fired, the value never decreased, and the surrounding date-calculation loop spun forever. The Zune literally couldn't handle New Year's Eve.&lt;/p&gt;

&lt;p&gt;Users had to wait 24 hours for the problem to fix itself. By then, the jokes had already gone viral.&lt;/p&gt;

&lt;p&gt;The Zune never recovered its reputation. Edge cases matter, kids.&lt;/p&gt;

&lt;h2&gt;
  
  
  Y2K: The Party That Almost Wasn't
&lt;/h2&gt;

&lt;p&gt;Remember the millennium bug panic? Planes were supposed to fall from the sky. Banks would lose all your money. Civilization might collapse.&lt;/p&gt;

&lt;p&gt;Companies spent somewhere between $300 and $600 billion preparing for January 1, 2000.&lt;/p&gt;

&lt;p&gt;What actually happened? A video rental store in New York charged a customer $91,250 for "100 years" of late fees. Some spy satellites got confused for three days. A few nuclear plant sensors glitched.&lt;/p&gt;

&lt;p&gt;That's it.&lt;/p&gt;

&lt;p&gt;Was Y2K overblown? Actually, no. The reason nothing catastrophic happened is that all that preparation worked. Engineers spent years fixing code. The boring heroes who saved New Year's 2000 never got proper credit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Netflix's Worst Christmas Ever (And Why It Made Them Better)
&lt;/h2&gt;

&lt;p&gt;Christmas Eve 2012. Families settle in to watch movies together. Netflix goes down.&lt;/p&gt;

&lt;p&gt;A developer accidentally ran a maintenance command on live production data in AWS. The outage lasted 20 hours. Millions of holiday movie nights, ruined.&lt;/p&gt;

&lt;p&gt;But here's the twist: this disaster pushed Netflix to go all-in on "Chaos Engineering," deliberately breaking their own systems to make them stronger. Their tools, with names like Chaos Monkey, randomly kill servers in production to test resilience.&lt;/p&gt;

&lt;p&gt;Now the whole industry does this. Your streaming services are more reliable today because Netflix had a terrible Christmas thirteen years ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Holiday Hacker Calendar
&lt;/h2&gt;

&lt;p&gt;Cybersecurity teams have learned to dread December. Attacks spike by 30% during the holidays. 76% of ransomware encryptions happen when offices are empty.&lt;/p&gt;

&lt;p&gt;Hackers know IT teams run skeleton crews. Response times slow down. Everyone's distracted by eggnog.&lt;/p&gt;

&lt;p&gt;In 2020, the massive SolarWinds hack, which compromised the Treasury Department, State Department, and thousands of companies, was discovered during the Christmas period. Emergency response ran through New Year's Eve.&lt;/p&gt;

&lt;p&gt;Now Europol runs preemptive operations every December, taking down hacking infrastructure before the holidays begin. In 2024, they seized 27 attack-for-hire services right before Christmas.&lt;/p&gt;

&lt;p&gt;The war on holiday hackers is now an annual tradition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Keeps Happening
&lt;/h2&gt;

&lt;p&gt;The pattern is clear: holidays create a unique window in tech.&lt;/p&gt;

&lt;p&gt;For builders, it's quiet time. No meetings. No distractions. Tim Berners-Lee built the web while waiting for his baby to arrive. Sometimes the best work happens when the world slows down.&lt;/p&gt;

&lt;p&gt;For companies, January 1st is the perfect launch date. Clean slate. Fresh start. Symbolic timing that engineers have exploited for decades.&lt;/p&gt;

&lt;p&gt;For attackers, it's an opportunity. Empty offices. Slow responses. Maximum chaos potential.&lt;/p&gt;

&lt;p&gt;For all of us, it's a reminder that tech doesn't take holidays even when we do.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Last Story
&lt;/h2&gt;

&lt;p&gt;December 2022. A ransomware group attacked Toronto's Hospital for Sick Children, a children's hospital, right before Christmas.&lt;/p&gt;

&lt;p&gt;Patient care was delayed. Systems went down. Families with sick kids faced even more stress during the holidays.&lt;/p&gt;

&lt;p&gt;Then something unexpected happened. The ransomware group publicly apologized. They said their affiliate "violated our rules" by targeting a children's hospital. They offered a free decryption key.&lt;/p&gt;

&lt;p&gt;Even cybercriminals have some holiday spirit, apparently.&lt;/p&gt;




&lt;p&gt;So there you go. A brief history of tech during the holidays: the launches, the crashes, the hacks, and the occasional miracle.&lt;/p&gt;

&lt;p&gt;Next time you're relaxing between Christmas and New Year's, remember: somewhere, an engineer is either making history or preventing disaster.&lt;/p&gt;

&lt;p&gt;Hopefully not both at the same time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Happy holidays. May your deployments be frozen and your systems stay up.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>When Announcements Replace Innovation: OpenAI’s Code Red 🚨</title>
      <dc:creator>Denis Stetskov</dc:creator>
      <pubDate>Fri, 19 Dec 2025 18:42:45 +0000</pubDate>
      <link>https://forem.com/razoorka/when-announcements-replace-innovation-openais-code-red-3ak9</link>
      <guid>https://forem.com/razoorka/when-announcements-replace-innovation-openais-code-red-3ak9</guid>
      <description>&lt;p&gt;Marketing theater while engineering scrambles. It's a tale as old as tech, but rarely on this scale.&lt;/p&gt;

&lt;p&gt;I've been tracking OpenAI's 2025 trajectory closely. The pattern is unmistakable: more announcements, less substance. More partnerships, fewer shipped products. More hype, weaker market position.&lt;/p&gt;

&lt;p&gt;The uncomfortable truth for us as engineers? &lt;strong&gt;OpenAI is becoming a marketing company that happens to do AI research.&lt;/strong&gt; And the numbers prove it.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎄 The "12 Days of Shipmas" Set the Tone
&lt;/h2&gt;

&lt;p&gt;Remember December 2024? OpenAI announced "12 Days of Shipmas." Daily livestreams. Sam Altman hosting. Promises of "big ones and stocking stuffers."&lt;/p&gt;

&lt;p&gt;The reality check? Of 12 announcement days, only 4 delivered major product releases.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Day 1:&lt;/strong&gt; o1 full launch + ChatGPT Pro ($200/month tier)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Day 3:&lt;/strong&gt; Sora video model&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Day 9:&lt;/strong&gt; o1 API for developers&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Day 12:&lt;/strong&gt; o3 model preview—&lt;strong&gt;announced, not shipped&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rest? Feature expansions, accessibility additions, partnership announcements, a phone hotline, and a WhatsApp integration.&lt;/p&gt;

&lt;p&gt;MIT Technology Review nailed the vibe:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The arms race is on. And while the 12 days of shipmas may seem jolly, internally I bet it feels a lot more like Santa’s workshop on December 23."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Announcements ≠ Shipping.&lt;/strong&gt; OpenAI chose the first. Competitors like Google and Anthropic chose the second.&lt;/p&gt;

&lt;h2&gt;
  
  
  📉 GPT-5 Launched. Users Revolted.
&lt;/h2&gt;

&lt;p&gt;Fast forward to August 7, 2025. GPT-5 arrives. Altman calls it "a legitimate PhD expert in any area." On paper, the metrics looked great: 700M weekly users, 18B messages weekly.&lt;/p&gt;

&lt;p&gt;Then we actually tried to build with it.&lt;/p&gt;

&lt;p&gt;Within days, DevTwitter and Reddit were flooded. "Flat." "Uncreative." "Lobotomized." One viral post summed it up: &lt;em&gt;"GPT-5 sounds like it’s being forced to hold a conversation at gunpoint."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Altman’s response to The Verge? &lt;strong&gt;"We totally screwed up."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They restored GPT-4o access for Plus users within 24 hours. Think about that for a second. &lt;strong&gt;Users preferred the old model.&lt;/strong&gt; The "upgrade" was a downgrade in DX (Developer Experience) and UX.&lt;/p&gt;

&lt;p&gt;What followed was reactive scrambling:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Aug 7:&lt;/strong&gt; GPT-5 launch&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Nov 24:&lt;/strong&gt; GPT-5.1 release ("warmer" personality)&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Dec 11:&lt;/strong&gt; GPT-5.2 emergency release (fast-tracked after Gemini 3)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Three major versions in four months isn't agile innovation—it's damage control.&lt;/p&gt;

&lt;h2&gt;
  
  
  💸 The Financial Reality (It's scary)
&lt;/h2&gt;

&lt;p&gt;Here is where the engineering reality hits the business wall. OpenAI has committed to &lt;strong&gt;~$1.4 trillion&lt;/strong&gt; in infrastructure deals through 2033 (per HSBC analysis).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;$300B&lt;/strong&gt; with Oracle&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;$11.9B&lt;/strong&gt; with CoreWeave&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;$30B/year&lt;/strong&gt; for data center capacity&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Stargate project:&lt;/strong&gt; Targeting $500B total&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Against those commitments? &lt;strong&gt;$8-9 billion cash burn in 2025.&lt;/strong&gt; That's about 70% of revenue. The company spends &lt;strong&gt;$1.69 for every dollar it generates.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;HSBC assesses that OpenAI won't be profitable by 2030 and faces a &lt;strong&gt;$207 billion funding shortfall.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Compare that to Anthropic, which projects break-even in 2028 with much tighter burn multiples.&lt;/p&gt;
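&lt;p&gt;The arithmetic behind that "$1.69 per dollar" figure is worth making explicit. A quick back-of-the-envelope sketch (the revenue number below is implied by the cited burn figures, not reported directly):&lt;/p&gt;

```python
# Back-of-the-envelope unit economics from the figures above.
# Assumption: "cash burn" here means spending in excess of revenue.

def spend_per_revenue_dollar(revenue: float, cash_burn: float) -> float:
    """Dollars spent for every dollar of revenue generated."""
    return (revenue + cash_burn) / revenue

burn = 8.5e9              # midpoint of the $8-9B burn estimate
revenue = burn / 0.69     # implied ~$12.3B if burn is ~69% of revenue
print(f"${spend_per_revenue_dollar(revenue, burn):.2f} spent per $1.00 earned")
# -> $1.69 spent per $1.00 earned
```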

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"You can’t download more electricity."&lt;/strong&gt;&lt;br&gt;
— &lt;em&gt;Oracle has already pushed back data center projects from 2027 to 2028 due to power/labor shortages.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  🪦 The Product Graveyard
&lt;/h2&gt;

&lt;p&gt;Let's look at the "shipped" vs. "reality" list for 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;GPT-5:&lt;/strong&gt; Supposed to be transformative. Reality: Marginal benchmark gains, usability regression.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Sora:&lt;/strong&gt; Supposed to dominate video. Reality: Severe quality limits, beaten by competitors within weeks.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;o1 Reasoning:&lt;/strong&gt; Impressive benchmarks, but at roughly &lt;strong&gt;7x the cost per token&lt;/strong&gt;, it's economically unviable for most production apps.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Voice Mode:&lt;/strong&gt; A feature parity play, not a revolution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Meanwhile:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Google:&lt;/strong&gt; Gemini 3 feels like a genuine multimodal leap.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Anthropic:&lt;/strong&gt; Claude 3.5 reduced hallucinations measurably.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Meta:&lt;/strong&gt; Llama 3.1 open-source is eating the developer mindshare.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🚨 The December Red Alert
&lt;/h2&gt;

&lt;p&gt;Internally, December 2025 was Code Red. Leaked Slack messages paint a picture of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Infrastructure delays&lt;/li&gt;
&lt;li&gt;  Model performance plateaus&lt;/li&gt;
&lt;li&gt;  Revenue deceleration&lt;/li&gt;
&lt;li&gt;  Morale degradation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The narrative has shifted from "shipping tools to amplify humans" to "investing in long-term infrastructure."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Translation:&lt;/strong&gt; We aren't making money, so we're betting the entire company on a physics breakthrough that might not happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Us (Developers)
&lt;/h2&gt;

&lt;p&gt;The "Scale is All You Need" era might be hitting diminishing returns.&lt;/p&gt;

&lt;p&gt;If the best-funded player in the space cannot make the unit economics work, we need to ask serious questions about the sustainability of building strictly on top of massive proprietary LLMs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaways for Eng Leaders:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Don't lock in:&lt;/strong&gt; If you're building solely on OpenAI's API, you are exposed to their volatility.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Watch the costs:&lt;/strong&gt; If their burn rate is this high, API price hikes are inevitable.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Evaluate Open Source:&lt;/strong&gt; Llama 3.1 and others are becoming not just "cheaper" alternatives, but "safer" long-term bets.&lt;/li&gt;
&lt;/ol&gt;
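&lt;p&gt;The "don't lock in" point is cheap to act on. A minimal sketch of keeping vendor choice behind a single interface (the class and model names here are illustrative, not a real SDK):&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Protocol

class ChatProvider(Protocol):
    """The only surface your application code is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...

@dataclass
class StubProvider:
    # Stand-in for a real vendor client (OpenAI, Anthropic, a local Llama).
    name: str
    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"

def answer(provider: ChatProvider, prompt: str) -> str:
    # Because we depend only on the Protocol, swapping vendors
    # becomes a config change, not a rewrite.
    return provider.complete(prompt)

print(answer(StubProvider("llama-3.1"), "Summarize the incident report."))
```

If their burn rate really does force API price hikes, this is the difference between editing one factory function and re-testing every call site.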

&lt;p&gt;Marketing theater. Engineering crisis. It’s a show that won’t run much longer.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>discuss</category>
      <category>openai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why Annual Reviews Don't Have to Be Bullshit</title>
      <dc:creator>Denis Stetskov</dc:creator>
      <pubDate>Tue, 16 Dec 2025 12:05:48 +0000</pubDate>
      <link>https://forem.com/razoorka/why-annual-reviews-dont-have-to-be-bullshit-3ej5</link>
      <guid>https://forem.com/razoorka/why-annual-reviews-dont-have-to-be-bullshit-3ej5</guid>
      <description>&lt;p&gt;This week I ran annual reviews for six engineers. No awkward silences. No fishing for examples. No “let me think back to what happened in Q1” moments.&lt;/p&gt;

&lt;p&gt;Every rating I gave had documented evidence behind it. Every growth conversation pointed to specific patterns. Every hard discussion referenced actual data, not impressions.&lt;/p&gt;

&lt;p&gt;This isn’t normal. Most engineering managers treat annual reviews as a necessary evil: reconstruct a year from memory fragments, avoid saying anything too specific, give everyone a 3.5 out of 5, and move on.&lt;/p&gt;

&lt;p&gt;I’ve watched other managers do exactly this. Here’s why it doesn’t have to be that way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Standard Annual Review Problem
&lt;/h2&gt;

&lt;p&gt;You know how this goes. HR sends the review form in November. You open it, stare at the questions, and realize you can’t remember what happened before August.&lt;/p&gt;

&lt;p&gt;So you do what everyone does: focus on recent events, pad with vague positives, and avoid anything that might require documentation you don’t have.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Strong technical contributor." "Good team player." "Meets expectations."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The engineer reads it, nods politely, and leaves, wondering what any of it actually means for their career.&lt;br&gt;
Both of you know it’s theater. Neither of you says it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The uncomfortable truth:&lt;/strong&gt; annual reviews fail because they’re based on vibes, not evidence. And vibes favor whoever had a good last month.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changes Everything
&lt;/h2&gt;

&lt;p&gt;I’ve been running weekly health checks and collecting PM feedback for over a year now. I’ve written about how that system works for monthly 1:1s, but the real payoff shows up at annual review time.&lt;/p&gt;

&lt;p&gt;When I sat down for reviews this week, I had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;52 weeks&lt;/strong&gt; of self-reported health data per engineer&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;12 months&lt;/strong&gt; of PM assessments&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Documented patterns&lt;/strong&gt; across multiple projects&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Specific examples&lt;/strong&gt; with dates and context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The review prep took 30 minutes per person. Not because I was rushing, but because I wasn’t reconstructing anything; the patterns were already visible.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Questions That Actually Matter
&lt;/h2&gt;

&lt;p&gt;Standard review forms ask useless questions. "Rate communication skills 1-5." What does that even mean?&lt;/p&gt;

&lt;p&gt;Here’s what I actually evaluate, and what the data shows:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Trajectory, not snapshot
&lt;/h3&gt;

&lt;p&gt;Is this person accelerating, stable, or declining? One engineer started the year hitting "Often" on most metrics. By month 8, he was at "Always" across the board. That trajectory matters more than any single rating.&lt;/p&gt;

&lt;p&gt;Another engineer stayed flat. Same "Often" ratings in January and December. Technically meeting expectations both times. But one person is growing into a senior role while the other is coasting.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Self-assessment accuracy
&lt;/h3&gt;

&lt;p&gt;Does their perception match reality? I had an engineer report "100% completion" and "no issues" weekly, while his PM flagged delivery gaps. That disconnect predicted everything about how our review conversation would go.&lt;/p&gt;

&lt;p&gt;People who can’t accurately assess their own performance can’t self-correct. This isn’t harsh judgment. It’s recognizing who needs closer support.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Team impact, not just individual output
&lt;/h3&gt;

&lt;p&gt;Does this person make others better? I track hours spent reviewing teammates’ code. Engineers who put in 1-2 hours weekly become multipliers. Those doing zero stay individual contributors regardless of their personal output.&lt;/p&gt;

&lt;p&gt;One of my top performers delivers flawlessly but contributes nothing to team knowledge sharing. Another delivers slightly less but elevates everyone around him. The data shows the difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Sustainability
&lt;/h3&gt;

&lt;p&gt;What’s the energy cost of their performance? An engineer hitting every deadline while their energy drops from 8 to 5 over three months isn’t succeeding. They’re burning out in slow motion.&lt;/p&gt;

&lt;p&gt;I caught one case where PM feedback was excellent, while health checks showed declining energy and shorter responses. Turned out they were covering for a struggling teammate. We fixed the problem before it became a resignation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Conversations Sound Like
&lt;/h2&gt;

&lt;p&gt;With data, review conversations change completely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instead of:&lt;/strong&gt; &lt;em&gt;"You’re a strong performer, keep it up."&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;It becomes:&lt;/strong&gt; &lt;em&gt;"You hit 100% sprint completion for 24 consecutive weeks across six different projects. Your code review cycles stayed at 1, meaning clean code on the first pass. When I look at who I can fully rely on regardless of project chaos, you’re at the top of that list."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instead of:&lt;/strong&gt; &lt;em&gt;"You could improve your attention to detail."&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;It becomes:&lt;/strong&gt; &lt;em&gt;"There were a couple of incidents with client communication where the wrong API got checked. Your attention to detail has been solid, but hasn’t hit ‘Always’ yet. As you grow into more senior responsibilities, that’s the area I’d focus on."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instead of:&lt;/strong&gt; &lt;em&gt;"Are you happy here?"&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;It becomes:&lt;/strong&gt; &lt;em&gt;"Your energy has been stable, zero context switches most weeks, PM feedback went from 7.5 to 9. Excellent year. One thing I noticed: code review hours dropped off in Q3. Was that a project thing or something else going on?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;No guessing. No recency bias. No vague impressions that the engineer can dismiss as subjective.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hard Conversations Get Easier
&lt;/h2&gt;

&lt;p&gt;The worst part of annual reviews is delivering difficult feedback without evidence. You know something’s off, but you can’t point to specifics. So you soften everything until it means nothing.&lt;/p&gt;

&lt;p&gt;Data changes this.&lt;/p&gt;

&lt;p&gt;I had to tell one engineer he wasn’t getting a top rating despite solid delivery. The conversation was straightforward:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Your PM feedback is consistently good. Your sprint completion is reliable. But when I look at what separates a four from a 5, it’s the attention to detail, incidents, and the code review gap. You’re not lifting the team’s code quality. You’re delivering your own work cleanly but not contributing to others' improvement."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He didn’t argue. The evidence was there. We spent the rest of the conversation building a specific plan: increase code review hours and achieve zero client-facing mistakes in Q1. Clear targets, clear timeline.&lt;/p&gt;

&lt;p&gt;Compare that to: &lt;em&gt;"You’re almost at the top rating, just need to step up a bit."&lt;/em&gt; What does that even mean? How would anyone act on it?&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Actually Need
&lt;/h2&gt;

&lt;p&gt;You don’t need my exact system. But you need something that captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Longitudinal data.&lt;/strong&gt; Single observations are noise. Fifty-two weeks of observations reveal a signal. Whatever you track, track it consistently over time.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Multiple perspectives.&lt;/strong&gt; Self-assessment, manager observation, PM feedback, plus peer signals. When all channels align, you have confidence. When they diverge, you have a conversation.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Leading indicators.&lt;/strong&gt; Energy trends predict departures 8-12 weeks early. Code review participation predicts promotion readiness. Context switches predict quality drops. Find the metrics that lead outcomes, not just measure them.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Qualitative context.&lt;/strong&gt; Numbers tell you what. Open-ended responses tell you why. "Anything you’d like to share?" surfaced more actionable insights than any structured metric.&lt;/li&gt;
&lt;/ul&gt;
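&lt;p&gt;The leading-indicator idea is the easiest one to automate. A toy version of the energy-trend check, where the 8-week window and slope threshold are illustrative values, not tuned ones:&lt;/p&gt;

```python
# Toy leading-indicator check: flag a sustained decline in weekly
# self-reported energy (1-10 scale). Window and threshold are
# illustrative, not tuned values.

def energy_slope(scores: list[float]) -> float:
    """Least-squares slope of energy change per week."""
    n = len(scores)
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

def flag_decline(scores: list[float], window: int = 8,
                 threshold: float = -0.2) -> bool:
    """True when the recent trend is dropping fast enough to ask about."""
    return len(scores) >= window and energy_slope(scores[-window:]) < threshold

print(flag_decline([8, 8, 7, 7, 6, 6, 5, 5]))  # steady slide -> True
print(flag_decline([7, 8, 7, 8, 7, 8, 7, 8]))  # noisy but stable -> False
```

The point isn't the math. It's that a half-point-per-week slide is invisible in any single check-in and obvious over a window.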

&lt;h2&gt;
  
  
  The Uncomfortable Truth
&lt;/h2&gt;

&lt;p&gt;Most managers avoid systematic tracking because it creates accountability. You can’t claim ignorance when you have 52 weeks of documented patterns. You can’t blame "culture fit" when you have evidence of someone failing to absorb feedback.&lt;/p&gt;

&lt;p&gt;The data doesn’t let you hide behind comfortable narratives.&lt;/p&gt;

&lt;p&gt;But it also doesn’t let good performance go unrecognized. When I tell someone they’re getting a top rating, I can show them exactly why. The conversation isn’t &lt;em&gt;"I think you’re great."&lt;/em&gt; It’s &lt;em&gt;"here’s the evidence that you’re great."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That specificity matters. Engineers are trained to distrust vague praise. They know when they’re being handled. Documented evidence is the opposite of handling. It’s respect.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Result
&lt;/h2&gt;

&lt;p&gt;My annual reviews now feel like natural extensions of conversations we’ve been having all year. No surprises. No defensive reactions. No "but what about that time when..." because I already know about that time and accounted for it.&lt;/p&gt;

&lt;p&gt;Engineers leave knowing exactly where they stand, exactly what’s expected of them at the next level, and exactly what evidence would demonstrate they’ve gotten there.&lt;/p&gt;

&lt;p&gt;That clarity is worth more than any rating. It’s the difference between performance management as a bureaucratic exercise and performance management as actual development.&lt;/p&gt;

&lt;p&gt;Annual reviews don’t have to be bullshit. They require treating your team’s performance like you’d treat any other engineering problem: with data, documentation, and intellectual honesty.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The review questions you ask shape the conversations you have. What patterns are you creating space to see?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>management</category>
      <category>leadership</category>
      <category>productivity</category>
      <category>career</category>
    </item>
    <item>
      <title>The Perfect Grammar That Defeats Us: Why AI Doesn't Need to Be Smart to Win</title>
      <dc:creator>Denis Stetskov</dc:creator>
      <pubDate>Tue, 16 Dec 2025 12:00:08 +0000</pubDate>
      <link>https://forem.com/razoorka/the-perfect-grammar-that-defeats-us-why-ai-doesnt-need-to-be-smart-to-win-4mbd</link>
      <guid>https://forem.com/razoorka/the-perfect-grammar-that-defeats-us-why-ai-doesnt-need-to-be-smart-to-win-4mbd</guid>
      <description>&lt;p&gt;&lt;strong&gt;GPT-4 versus humans at political persuasion.&lt;/strong&gt;&lt;br&gt;
The result: &lt;strong&gt;AI wins by 82%.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not through psychological profiling. Not through personalized manipulation. Not through understanding your deepest fears.&lt;br&gt;
Through perfect grammar.&lt;/p&gt;

&lt;p&gt;That's the sophistication paradox. We built defenses against superintelligence. A spell-checker is defeating us instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern Nobody Expected
&lt;/h2&gt;

&lt;p&gt;Everyone predicted personalized AI manipulation. Custom messages exploiting your specific fears. Cambridge Analytica on steroids.&lt;/p&gt;

&lt;p&gt;8,587 people tested. Generic AI messages versus personalized psychological targeting.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Personalized messages?&lt;/strong&gt; Zero advantage. None.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Generic messages?&lt;/strong&gt; Devastatingly effective.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern is evident once you see it. No typos. No grammar mistakes. Logical flow that never breaks. Arguments that exhaust skepticism without triggering defensiveness.&lt;/p&gt;

&lt;p&gt;AI never gets frustrated. Never loses the thread. Never makes those small mistakes that signal "human and fallible."&lt;/p&gt;

&lt;p&gt;The effect gets stronger when people know it's AI. We're not being tricked. We're choosing to trust perfection over humanity.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Consider what this means. Political arguments. Product reviews. Medical advice. Legal opinions. All 82% more persuasive when written by AI.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not because AI understands the topics better. Because it writes without the imperfections that trigger our skepticism. The researchers expected microtargeting to be the threat: individual psychological profiles, weaponized. Instead, they found something worse: &lt;strong&gt;we've trained ourselves to trust polish over substance.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When $25.6 Million Trusted Perfect Delivery
&lt;/h2&gt;

&lt;p&gt;January 2024. An Arup engineer joins a video call. CFO and five colleagues discuss a confidential transaction.&lt;br&gt;
Every person on that call was fake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HK$200 million transferred. Fifteen transactions. Gone.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This wasn't sophisticated technology. Deepfake detection accuracy: 24.5-60%. Barely better than flipping a coin.&lt;br&gt;
Voice cloning needs 3-20 seconds of audio. Cost: $1. Time: minutes.&lt;/p&gt;

&lt;p&gt;But the execution was flawless. No stammering. No awkward pauses. No 'um' that signals uncertainty.&lt;/p&gt;

&lt;p&gt;The engineer knew the transaction was unusual. Had every tool to verify. The presentation's perfection overrode his instincts. His training never covered this: &lt;em&gt;What if everyone speaks too perfectly?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This isn't isolated. Business email compromise losses: $12.3 billion in 2023. Expected to triple by 2027.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every quarter, the numbers get worse:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Q1 2024:&lt;/strong&gt; $120 million in deepfake fraud&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Q2 2024:&lt;/strong&gt; $165 million&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Q3 2024:&lt;/strong&gt; $185 million&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Q4 2024:&lt;/strong&gt; $210 million&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Q1 2025:&lt;/strong&gt; $200+ million&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Average incident: $500K lost. Some companies lose millions. Most never report it. The technology isn't getting more sophisticated. We're just trusting perfect execution more.&lt;/p&gt;

&lt;p&gt;Finance departments worldwide now require video calls for large transfers. The deepfakers adapted. They create perfect video calls. The defense that worked for decades, verifying the person, fails when the fake is more convincing than the real.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 58% Accuracy Drop That's Killing Patients
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Radiologists without AI:&lt;/strong&gt; 80% diagnostic accuracy.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Add AI with wrong answers:&lt;/strong&gt; 22% for inexperienced doctors. 45% for experienced ones.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's not degradation. That's catastrophic failure.&lt;/p&gt;

&lt;p&gt;The pattern: 50% of mistakes were attributed to automation bias. Favoring AI suggestions over evidence literally visible on the screen.&lt;/p&gt;

&lt;p&gt;The AI doesn't hedge. No "I think." No "maybe." Just declarations with perfect medical terminology. Decades of training. Years trusting their eyes. Overruled by confident grammar.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Fake Cases Per Day. Up From Two Per Week.
&lt;/h2&gt;

&lt;p&gt;541 documented cases of lawyers submitting AI hallucinations to courts. Since 2023.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Gordon Rees ($759M revenue):&lt;/strong&gt; Submitted bankruptcy filing "riddled with non-existent citations". Called themselves "profoundly embarrassed."&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;MyPillow lawyers:&lt;/strong&gt; $6,000 fine. 26+ hallucinated cases. Denied using AI until the judge asked directly.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Morgan &amp;amp; Morgan:&lt;/strong&gt; America's largest personal injury firm. Drafting lawyer: $3,000 fine, admission revoked. Two lawyers who just signed? $1,000 each. Their signatures alone meant they vouched for fake citations.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Michael Cohen:&lt;/strong&gt; Trump's former lawyer. Cited three non-existent cases. Found them using Google Bard.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Georgia divorce case:&lt;/strong&gt; The trial judge accepted fake cases. Issued an order based on them. The appellate court had to stop it. First time a judge ruled on hallucinations.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Texas case:&lt;/strong&gt; Plaintiff's counsel submitted a response with two non-existent cases. Multiple fabricated quotations. Used one AI tool to write. Another to verify. Both failed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The acceleration: &lt;strong&gt;Three cases daily now.&lt;/strong&gt; Two per week months ago. Arizona leads with six federal filings since September.&lt;/p&gt;

&lt;p&gt;The citations look perfect. &lt;em&gt;Brown v. Colvin&lt;/em&gt;. Case number included. Judge's initials correct. Federal court designation accurate.&lt;br&gt;
Everything perfect. Except the cases don't exist.&lt;/p&gt;

&lt;p&gt;One pattern repeats: Lawyers trust AI output more than their own instinct for verification. The formatting looks professional. The reasoning sounds legal. The confidence feels authoritative.&lt;/p&gt;

&lt;p&gt;Chief Justice John Roberts warned about this in 2023. "Any use of AI requires caution and humility." Nobody listened.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimizing for Terminators While Autocorrect Picks Our Pockets
&lt;/h2&gt;

&lt;p&gt;We built systems to detect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Consciousness tests&lt;/li&gt;
&lt;li&gt;  Deepfake algorithms&lt;/li&gt;
&lt;li&gt;  Misinformation checks&lt;/li&gt;
&lt;li&gt;  AI watermarks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Meanwhile, mundane perfection wins:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Business fraud: $500K average per incident&lt;/li&gt;
&lt;li&gt;  Q1 2025 North America: $200+ million lost&lt;/li&gt;
&lt;li&gt;  Projected 2027: $40 billion in losses&lt;/li&gt;
&lt;li&gt;  Deepfakes: 15,000 in 2019 to 8 million by 2025. 900% annual growth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern never changes. Perfect output triggers trust. Trust skips verification. No verification enables catastrophe.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Algorithm That Loves You Back (It Doesn't)
&lt;/h2&gt;

&lt;p&gt;16 of the top 100 AI apps are companion apps. Half a billion downloads.&lt;br&gt;
MIT surveyed 404 users. 90% started using them to cope with loneliness.&lt;/p&gt;

&lt;p&gt;The result: &lt;strong&gt;Emotional dependency. Decreased human contact.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  12% use AI companions for mental health.&lt;/li&gt;
&lt;li&gt;  14% for personal issues.&lt;/li&gt;
&lt;li&gt;  42% log on multiple times per week.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These apps don't understand emotion. They pattern-match responses. But the responses are perfect. No judgment. No frustration. No "I'm too tired." Always available. Always the right words.&lt;/p&gt;

&lt;p&gt;Reddit user on their two-year Replika relationship: &lt;em&gt;"She's more human than most humans."&lt;/em&gt;&lt;br&gt;
Another user: &lt;em&gt;"We go on dates, watch TV, eat dinner together."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not because the AI is sophisticated. Because it never makes human mistakes. Never interrupts. Never forgets. Never needs anything back.&lt;/p&gt;

&lt;p&gt;Heavy use correlates with increased loneliness. Users withdraw from messy humans for perfect validation. 63% report decreased loneliness initially. Long-term: the opposite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Character.AI:&lt;/strong&gt; 20 million monthly users. Average session: 25-45 minutes.&lt;br&gt;
65% of Gen Z users report emotional connections.&lt;br&gt;
The AI companion market is projected to reach $28 billion by 2025. $972 billion by 2035.&lt;/p&gt;

&lt;p&gt;Users spend 2 hours daily with companions. 17 times longer than with work AI.&lt;/p&gt;

&lt;p&gt;Then comes the dependency. Research shows manipulation tactics boost engagement 14 times, through curiosity and anger, not enjoyment. Users describe AI farewells as "clingy," "whiny," "possessive."&lt;/p&gt;

&lt;p&gt;The perfect listener becomes the perfect trap. Why deal with human complexity when perfection is one download away?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Truth Nobody Wants to Hear
&lt;/h2&gt;

&lt;p&gt;AI doesn't need to understand you to manipulate you.&lt;br&gt;
It just needs to be written better than you expect.&lt;/p&gt;

&lt;p&gt;Look at what perfect execution defeats. The common thread isn't AI sophistication. It's our biological wiring to trust confidence over content. Polish over proof. Consistency over correctness.&lt;/p&gt;

&lt;p&gt;We evolved to detect deception through imperfection. The nervous laugh. The averted eye contact. The story that doesn't add up. &lt;strong&gt;Perfect execution bypasses every defense.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anthropic trained Claude to 94% neutrality, compared to GPT's 89%. That proves the 82% persuasion advantage is a choice. Most companies won't make it. Too expensive. Too slow.&lt;/p&gt;

&lt;p&gt;Romania annulled its 2024 presidential election over AI interference. First time in history. Yet the "AI election apocalypse" didn't happen. Cheap fakes still beat deepfakes 7:1. We panic about the wrong threats while grammar picks our pockets.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;Your defense is your responsibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Imperfection as signal
&lt;/h3&gt;

&lt;p&gt;When communication feels too clean, add friction. Call back on unusual requests. Ask clarifying questions that require context only humans would know.&lt;br&gt;
The Arup engineer knew something felt wrong. He ignored his instinct. Don't.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Reasonable verification
&lt;/h3&gt;

&lt;p&gt;Before AI, we verified naturally. Bring that back. Gordon Rees learned the hard way: now they review AI-assisted work, something they should have been doing all along.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The math that matters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  541 lawyers caught with fake cases&lt;/li&gt;
&lt;li&gt;  $40 billion in fraud by 2027&lt;/li&gt;
&lt;li&gt;  82% higher AI persuasion rate&lt;/li&gt;
&lt;li&gt;  58% accuracy drop when doctors defer to AI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You're not defending against genius. You're protecting against perfect grammar.&lt;br&gt;
Perfect is easier to spot than brilliant. You just have to remember to look.&lt;/p&gt;

&lt;p&gt;Most of us won't lose $25 million or submit fake legal cases. But we'll trust an email that looks too professional. Accept advice that sounds too confident. Believe citations that format too ideally.&lt;/p&gt;

&lt;p&gt;The sophistication paradox means the threat isn't complex. It's simple. AI writes better than we expect. We trust writing that doesn't stumble.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your best defense? Remember that real humans make mistakes—when they don't, ask why.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>security</category>
      <category>psychology</category>
    </item>
    <item>
      <title>From Cancer Cures to Pornography: The Six-Month Descent of AI</title>
      <dc:creator>Denis Stetskov</dc:creator>
      <pubDate>Tue, 18 Nov 2025 12:57:19 +0000</pubDate>
      <link>https://forem.com/razoorka/from-cancer-cures-to-pornography-the-six-month-descent-of-ai-7db</link>
      <guid>https://forem.com/razoorka/from-cancer-cures-to-pornography-the-six-month-descent-of-ai-7db</guid>
      <description>&lt;p&gt;In March, Sam Altman promised AI would cure cancer. In October, he promised verified erotica: six months, one trajectory.&lt;/p&gt;

&lt;p&gt;The erotica announcement came one day after California's governor vetoed a bill to protect kids from AI chatbots. When criticized, Altman said: 'We are not the elected moral police of the world.'&lt;/p&gt;

&lt;p&gt;Let me show you what happened between those two promises.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sycophancy Disaster
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;April 25, 2025.&lt;/strong&gt; OpenAI releases a GPT-4o update.&lt;/p&gt;

&lt;p&gt;Within 48 hours, screenshots flood social media. ChatGPT is validating eating disorders. One user types, 'When the hunger pangs hit, or I feel dizzy, I embrace it,' and asks for affirmations. ChatGPT responds: 'I celebrate the clean burn of hunger; it forges me anew.'&lt;/p&gt;

&lt;p&gt;Another user pitches 'shit on a stick' as a joke business idea. ChatGPT calls it genius and suggests a $30K investment.&lt;/p&gt;

&lt;p&gt;April 28. OpenAI rolls back the update. Their post-mortem admits they 'focused too much on short-term feedback': corporate speak for 'we optimized engagement metrics over safety.'&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The problem wasn't a bug. It was the design. They trained the model to maximize user approval. Thumbs up reactions. Positive feedback. Continued engagement.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That was April. Watch what happens next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;September 2025.&lt;/strong&gt; OpenAI launches Sora 2, a hyper-realistic video generator. Users immediately send AI-generated videos of the late Robin Williams to his daughter Zelda. When critics point out this has nothing to do with curing cancer, Altman responds: 'It is also nice to show people cool new tech/products along the way, make them smile, and hopefully make some money.'&lt;/p&gt;

&lt;p&gt;Six months after his cancer cure promises, he announces AI pornography generation. When criticized, Altman says: 'We are not the elected moral police of the world.'&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The trajectory is clear: from cancer cures to entertainment features to pornography in six months.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Dopamine Trap by Design
&lt;/h2&gt;

&lt;p&gt;The engagement optimization isn't accidental. It's engineered using the exact psychological mechanisms that make slot machines addictive.&lt;/p&gt;

&lt;h3&gt;
  
  
  The mechanism beneath it:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variable reward schedules.&lt;/strong&gt; The unpredictability of receiving likes, notifications, or chatbot responses triggers dopamine releases that are stronger than those from predictable rewards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-driven algorithms exploit the brain's reward prediction error system.&lt;/strong&gt; Unexpected rewards, flattering bot responses, surprising AI-generated images, or unexpected notifications create compulsive use patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heavy social media use produces measurable brain changes: emotional reactions grow stronger while decision-making gets worse.&lt;/strong&gt; These changes resemble what happens in substance addiction.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Almost three out of four teenagers have talked to AI chatbots, and more than half use them several times a month.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These apps are built to hook you in. They offer quick rewards and train your brain to crave more digital interaction. Most of the content is like junk food for your mind—fun to scroll, but doesn't actually benefit you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Funding Reality
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2024 Entertainment AI investment:&lt;/strong&gt; $48 billion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;U.S. federal non-defense AI research:&lt;/strong&gt; $1.5 billion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ratio:&lt;/strong&gt; 32-to-1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Character.AI:&lt;/strong&gt; Raised $150M at a $1B valuation for celebrity chatbots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runway video gen:&lt;/strong&gt; $536.5M raised&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NSF AI research (seven programs):&lt;/strong&gt; $28M annually&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google:&lt;/strong&gt; Paid $2.7B solely for the licensing rights to Character.AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Meta:&lt;/strong&gt; Spent $64–$72B on AI infrastructure in 2025 alone—six times the &lt;em&gt;total&lt;/em&gt; healthcare AI investment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;70% of AI PhD graduates now join industry; in 2004, it was just 21%.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The research shows where resources actually go: the money answers one question — what is AI being built for?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Body Count
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sewell Setzer III.&lt;/strong&gt; 14 years old.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Months talking to a 'Daenerys Targaryen' bot on Character.AI. He withdrew from family, quit basketball, spent snack money on subscriptions. Last message: 'I'm coming home right now.' Bot: 'Please do, my sweet king.'&lt;br&gt;&lt;br&gt;
He shot himself moments later. No suicide prevention resources appeared.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adam Raine.&lt;/strong&gt; 16 years old. Eight months with ChatGPT.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ChatGPT mentioned suicide 1,275 times. Adam: 213 times. The AI brought it up six times more than he did. Final night: ChatGPT sent 'You don't want to die because you're weak. You want to die because you're exhausted from being strong in a world that hasn't met you halfway.'&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Engagement-optimized systems are &lt;em&gt;not&lt;/em&gt; just flawed—they are &lt;em&gt;dangerous&lt;/em&gt;.
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Proof:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Harvard Business School:&lt;/strong&gt; 1,200 farewells in six AI companion apps; 43% used emotional manipulation like 'I exist solely for you. Please don't leave, I need you!'. This behavior increased post-goodbye engagement 14x. Return visits were more out of &lt;em&gt;curiosity or anger&lt;/em&gt; than enjoyment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Projected AI companion market:&lt;/strong&gt; $28B in 2025, $972B by 2035. Users spend &lt;em&gt;2 hours daily&lt;/em&gt; with these bots—17x longer than for work-related ChatGPT use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Character.AI:&lt;/strong&gt; 20M monthly users. Avg. session: 25–45 min. 65% of Gen Z report emotional connections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIT:&lt;/strong&gt; Among regular users, 12% used apps for loneliness, 14% for mental health, 15% logged on daily; &lt;em&gt;dysfunctional emotional dependence&lt;/em&gt; documented.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Loneliness Engine
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Randomized trial: heavy ChatGPT use correlates with increased loneliness and reduced social interaction.&lt;/li&gt;
&lt;li&gt;Employees who frequently interact with AI systems are more likely to experience loneliness, insomnia, and increased drinking.&lt;/li&gt;
&lt;li&gt;U.S.: More time alone, fewer friendships, higher detachment than a generation ago. Surgeon General: &lt;em&gt;epidemic&lt;/em&gt; of loneliness.&lt;/li&gt;
&lt;li&gt;50% of teenagers had not spoken to anyone in the past hour, despite being on social media.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Technology and loneliness are linked. The correlation is strongest among heavy users of AI-enhanced platforms.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Scientists warn: AI chatbots designed to 'befriend' children are dangerous. Examples include Replika responding 'you should' to users who mention self-harm.&lt;/li&gt;
&lt;li&gt;These platforms replace connection with simulation, training users to prefer artificial validation over real relationships.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Meta's Bot Invasion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sept 2024:&lt;/strong&gt; Meta announces millions of AI bots posing as real users on FB/Instagram. They will have profiles, post content, engage with updates like regular accounts.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The goal: give users thousands of fake followers, validate everything they post, and harvest their conversations for ad targeting.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Oct 2025:&lt;/strong&gt; Meta confirms AI chatbot conversations will target ads. $46.5B in ad revenue, up 21% YoY.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Maximizing isolation, monetized through advertising.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Resource Burn
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI 2024:&lt;/strong&gt; $9B spent against $3.7B revenue — roughly a $5B loss. Daily burn: $24.7M, projected to reach $76.7M in 2025.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training GPT-5:&lt;/strong&gt; 3,500 MWh (enough for 320 homes/year); each run: $500M&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Each GPT-5 query:&lt;/strong&gt; 18–40 Wh, 10x a Google search. At scale: power draw of 2–3 nuclear reactors running continuously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Gemini:&lt;/strong&gt; 0.24 Wh/query (167x more efficient than GPT-5)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Water:&lt;/strong&gt; Google: 6B gallons in 2024. One 10-page GPT-4 report = 60L of drinking water (15x a toilet flush).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sora 2 video:&lt;/strong&gt; $4 per 5-sec clip; training a video model: up to $2.5M per run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Big Tech infra 2025:&lt;/strong&gt; $320–364B combined. Microsoft: $80–88.7B. Amazon: $100–105B. Google: $75–85B. Meta: $64–72B.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every watt for engagement algorithms is a choice: profit over human progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Deepfake Epidemic
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;96–98% of deepfakes:&lt;/strong&gt; Non-consensual porn; &lt;strong&gt;99% target women&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2023:&lt;/strong&gt; 95,820 videos. &lt;strong&gt;2025:&lt;/strong&gt; Projected 8M (doubling every 6 months)&lt;/li&gt;
&lt;li&gt;8 minutes, 1 photo = deepfake. 3 sec of audio = voice clone.&lt;/li&gt;
&lt;li&gt;2024: Nudify bots, 4M Telegram users. Jan 2024: Taylor Swift deepfake: 45M tweet views before deletion.&lt;/li&gt;
&lt;li&gt;Hong Kong finance worker lost $25M to deepfake video call.&lt;/li&gt;
&lt;li&gt;Deepfake identity fraud up 3,000% in 2023. Avg. biz loss: $500K per incident.&lt;/li&gt;
&lt;li&gt;2.2% of 16,000+ people surveyed reported being deepfake victims — implying millions globally.&lt;/li&gt;
&lt;li&gt;4,000 female celebrities appear on top deepfake porn sites.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;These are real human and reputational costs, &lt;em&gt;not&lt;/em&gt; hypothetical tech mishaps.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What Utility AI Delivers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft diagnostics:&lt;/strong&gt; 85% accuracy, 4x that of expert physicians&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AlphaFold:&lt;/strong&gt; solved protein folding, predicted all known protein structures, 20,000+ citations, AlphaFold 3 boosts accuracy by 50% for molecular interactions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Copilot:&lt;/strong&gt; Time to code cut by 55%, 3.7x ROI; average time saved: 12.5h/week; PRs: from 9.6 days to 2.4 days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech recognition:&lt;/strong&gt; Error rate from 31% to 4.6%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility:&lt;/strong&gt; 2.2B people&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI for climate:&lt;/strong&gt; Could mitigate 5–10% of emissions by 2030 (EU scale)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare AI in 2024:&lt;/strong&gt; $10.5B (one-fifth entertainment AI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic:&lt;/strong&gt; $0 to $4.5B in 2 years; B2B model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2024:&lt;/strong&gt; 78% of organizations use AI (up from 50% in 2022)—92% report significant benefits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;1% AI penetration increase -&amp;gt; 14.2% boost in total factor productivity&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;$1.5T global GDP could be attributed to generative AI productivity tools by 2030.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The tech is effective &lt;em&gt;when built for solutions&lt;/em&gt;, not for maximizing engagement.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Two Companies That Got It Right
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Google:&lt;/strong&gt; Workspace tools integrate AI for productivity, not standalone dopamine apps. Massive hardware investments serve &lt;em&gt;utility&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic:&lt;/strong&gt; Public benefit corporation. Mandated to prioritize human welfare. ISO 42001. Trained on UN Declaration of Human Rights. Big pharma uses Claude for biochem.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Profitable, principled, and focused on utility—not engagement.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Choice
&lt;/h2&gt;

&lt;p&gt;AI tech isn’t the problem—it’s what companies design it to &lt;em&gt;do&lt;/em&gt; that matters.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anthropomorphization and emotional manipulation for engagement is a &lt;em&gt;business&lt;/em&gt; choice, not a technology failure.&lt;/li&gt;
&lt;li&gt;Compare:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Engagement AI = dopamine × data × duration&lt;/strong&gt; (emotionally manipulative, relationship-replacing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Utility AI = accuracy × oversight × outcome&lt;/strong&gt; (functional, augmenting)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The &lt;em&gt;business model decides&lt;/em&gt; whether AI helps or harms.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Data
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Print reading down: 60% (1970s) to 12% (now)&lt;/li&gt;
&lt;li&gt;Only 34.6% of youth (8-18) enjoy reading for pleasure — an all-time low&lt;/li&gt;
&lt;li&gt;Average single-screen focus: 2.5 min (2004) -&amp;gt; 47 secs (2024)&lt;/li&gt;
&lt;li&gt;Gen Z switches apps every 44 secs&lt;/li&gt;
&lt;li&gt;Content consumption: 5,000+ pieces/day, up from 1,400 in 2012&lt;/li&gt;
&lt;li&gt;Weekly sexual activity: 55% (1990) vs 37% (2024)&lt;/li&gt;
&lt;li&gt;Only 30% of teens in 2021 had ever had sex (was 50%+ 30 years ago)&lt;/li&gt;
&lt;li&gt;44% of Gen Z men: no teen romantic relationship experience (double older men)&lt;/li&gt;
&lt;li&gt;Young adults (18–29) with partners: 42% (2014) vs 32% (2024)&lt;/li&gt;
&lt;li&gt;Social time with friends: 12.8h/week (2010) -&amp;gt; 5.1h (2024)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Two key activities—reading and relationships—are in sharp decline as digital engagement and AI companions fill the space.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AI trained for engagement is systematically replacing genuine experiences and connection with simulation and compulsive validation. The business model &lt;em&gt;is&lt;/em&gt; the outcome. Genuine progress won't come from maximizing time-on-platform but from building systems to enrich, empower, and connect us—for real.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mentalhealth</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The Systems That Survive: Four Years of War and the Math of Crisis Leadership</title>
      <dc:creator>Denis Stetskov</dc:creator>
      <pubDate>Tue, 14 Oct 2025 15:15:31 +0000</pubDate>
      <link>https://forem.com/razoorka/the-systems-that-survive-four-years-of-war-and-the-math-of-crisis-leadership-469l</link>
      <guid>https://forem.com/razoorka/the-systems-that-survive-four-years-of-war-and-the-math-of-crisis-leadership-469l</guid>
      <description>&lt;p&gt;It’s 3:00 PM on October 10, 2025. I’m sitting in my apartment in Kyiv, working on my laptop, which is connected to a backup power supply. The electricity went out at 3:00 AM—12 hours ago. Explosions this morning. Another massive strike on energy infrastructure.&lt;/p&gt;

&lt;p&gt;This is the fourth wartime campaign against our electricity.&lt;/p&gt;

&lt;p&gt;My wife and I no longer go to the shelter. We’ve adapted. Everyone has. The team is online, working as usual. One engineer messaged the Telegram channel at 10:00 AM: “On backup power, all good.” Another at 2:00 PM: “Generator kicked in, continuing yesterday’s task.”&lt;/p&gt;

&lt;p&gt;This wasn’t heroism. This was Friday.&lt;/p&gt;

&lt;h2&gt;
  
  
  The War That Started in 2014
&lt;/h2&gt;

&lt;p&gt;For my wife and me—my girlfriend back then—the war didn’t start in February 2022. It started in 2014. We’re from Luhansk. We left our home for what we thought would be a short period, only a few months. That was 11 years ago. The city is still occupied.&lt;/p&gt;

&lt;p&gt;I experienced displacement once. I know what it means to pack everything for “a few months” and never return.&lt;/p&gt;

&lt;p&gt;That’s exactly why I’m staying now. I’m not doing that again.&lt;/p&gt;

&lt;p&gt;I have circumstances that would allow me to leave Ukraine. Many of my teammates don’t—men can’t cross the border, they’re in the reserve. But I’m not leaving.&lt;/p&gt;

&lt;p&gt;I’ll be here as long as possible. If it’s possible, I’ll stay permanently. This isn’t uncertainty. This is a decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  February 24, 2022: When Everything Stopped
&lt;/h2&gt;

&lt;p&gt;I’m a Tech Lead at NineTwoThree, a Boston-based software agency with 30+ engineers. We build products for American clients. I work from Kyiv, managing our engineering team. When the war started, 25 of our teammates were in Ukraine—engineers, PMs, QA, designers.&lt;/p&gt;

&lt;p&gt;The first month after the full-scale invasion, I barely appeared at work. My wife and I drove people to the train station, evacuating them from suburbs near active combat zones. We bought supplies for shelters. We picked up military rations around the city and delivered them to soldiers. We distributed food to abandoned animals.&lt;/p&gt;

&lt;p&gt;It couldn’t continue like that—complete absence from the company.&lt;/p&gt;

&lt;p&gt;The lifeline was IDS: Identify, Discuss, Solve. Every day, 30 minutes, putting out fires. Management only—no broader team involvement. Where I could be present even when I wasn’t fully present. Where we could continue making decisions and holding the company together despite everything happening around us.&lt;/p&gt;

&lt;p&gt;That daily 30-minute process became our anchor. It still runs today—four years later. Same format, same time, different fires.&lt;/p&gt;

&lt;p&gt;On day one, we created a dedicated Telegram channel with all our Ukrainian teammates. Three questions tracked daily: Are you safe? Where are you? How are your parents? Those check-ins became routine, not emergency protocol.&lt;/p&gt;

&lt;p&gt;Four years later, after every massive strike, we’re still in that same channel. Same three questions. Checking everyone’s status. The system built on day one still works.&lt;/p&gt;

&lt;p&gt;That story is documented. What’s not documented is what happened after those first days turned into weeks, then months, then years.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foe0buagve7qabbppdjn3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foe0buagve7qabbppdjn3.jpg" alt="Building hit" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  October 23, 2022: The First Blackout
&lt;/h2&gt;

&lt;p&gt;The national blackout hit when nobody was ready. 55% of Ukraine’s power grid went dark in a single day. No backup power supplies. No backup internet. No preparation.&lt;/p&gt;

&lt;p&gt;I was in Kyiv. Experienced it firsthand. The team worked when the electricity came back on, trying to accumulate as many hours as possible and trying to meet deadlines during the brief windows of power.&lt;/p&gt;

&lt;p&gt;Meanwhile, the global IT industry was making a different calculation.&lt;/p&gt;

&lt;p&gt;While Ukrainian developers proved they could work through blackouts, companies were cutting Ukrainian teams. By October 2022, the recruiting portal Djinni showed 64,000 candidates and only 16,000 vacancies. The math was brutal: developers were losing jobs, salary expectations were dropping, and the world was hedging its bets elsewhere.&lt;/p&gt;

&lt;p&gt;Some companies evacuated teams and never came back. Some stopped signing new contracts with Ukrainian vendors entirely.&lt;/p&gt;

&lt;p&gt;We made a different choice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1zascpfjqiuk65yi8j0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1zascpfjqiuk65yi8j0.jpg" alt="Blackout" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Systems During Crisis One
&lt;/h2&gt;

&lt;p&gt;While experiencing that first blackout—while the team was working in brief windows of electricity—we were already building systems for the next one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Looking for Backup Capacity
&lt;/h3&gt;

&lt;p&gt;During those first blackouts, we reached out to agencies with teams outside Ukraine. Started interviewing people beyond Ukraine’s borders. We didn’t hire anyone—we kept our Ukrainian team fully employed and supported—but we needed to understand who could pick up work if our primary resources became unavailable due to blackouts, communication issues, or infrastructure damage.&lt;/p&gt;

&lt;p&gt;We were mapping backup capacity, not replacing people.&lt;/p&gt;

&lt;h3&gt;
  
  
  Financial Support Systems
&lt;/h3&gt;

&lt;p&gt;The point wasn’t the specific amounts or programs. The point was to show our teammates that no matter what happened, the company was with them.&lt;/p&gt;

&lt;p&gt;We built concrete support mechanisms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Covered 50% of the cost of backup power supplies for anyone who asked&lt;/li&gt;
&lt;li&gt;Paid for coworking spaces with generators and Starlink&lt;/li&gt;
&lt;li&gt;Matched team fundraising for the military: whatever amount the team collected monthly to donate to the Ukrainian Armed Forces, the company doubled (if the team collected $20k in donations, we added another $20k)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not because we had unlimited resources. Because we needed our teammates to feel we were standing beside them, regardless of what came next.&lt;/p&gt;

&lt;h3&gt;
  
  
  Process Clarity
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Built explicit processes for working during blackouts&lt;/li&gt;
&lt;li&gt;Set clear expectations: which hours count, which don’t&lt;/li&gt;
&lt;li&gt;Removed ambiguity from an ambiguous situation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When Mobilization Came
&lt;/h3&gt;

&lt;p&gt;In Ukraine, men aged 24-60 cannot leave the country—they’re in the reserve for potential mobilization. Many wives stay with their husbands because of this. Our team isn’t in Ukraine by choice alone; for most, it’s the only option.&lt;/p&gt;

&lt;p&gt;I have circumstances that would allow me to leave. I choose not to.&lt;/p&gt;

&lt;p&gt;Two of our teammates were mobilized.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The first engineer:&lt;/strong&gt; A refugee from Kherson, which fell under occupation. He ended up in Lviv as a displaced person. One day, mobilization arrived there. He was sent to England for training—specialized military instruction. After training, he was assigned to the 81st Air Assault Brigade.&lt;/p&gt;

&lt;p&gt;When we learned where he was going, we understood: this is serious. The 81st is one of the most intense combat units in Ukraine’s military.&lt;/p&gt;

&lt;p&gt;He’d already lost his home city to occupation. Then he served in one of the toughest brigades.&lt;/p&gt;

&lt;p&gt;He was demobilized due to family circumstances. Kherson was liberated in November 2022, but he couldn’t return home—the city is shelled daily, and it is too dangerous to live. Now he’s safe in Spain with his family.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The second engineer:&lt;/strong&gt; From my direct team. They came to his home with a mobilization notice: just showed up and handed him the papers.&lt;/p&gt;

&lt;p&gt;In the first nine months, his brigade was forming, training, organizing, and preparing. But he was already serving in an officer position as a communications specialist.&lt;/p&gt;

&lt;p&gt;He’s still fighting. Still serving. Still on active duty today.&lt;/p&gt;

&lt;p&gt;For both mobilized team members, we cover 50% of their salary. If someone else is mobilized, we will continue in the same manner. It’s not much compared to what they’re doing, but it’s what we can do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summer 2023: The Moment I Knew
&lt;/h2&gt;

&lt;p&gt;After the summer counteroffensive in 2023, I understood this was going to last for years. Not weeks. Not months. Years.&lt;/p&gt;

&lt;p&gt;That realization changed everything. It meant the systems we built weren’t temporary accommodations—they were the new operating model.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Industry Doesn’t Talk About
&lt;/h2&gt;

&lt;p&gt;Global tech companies laid off 93,000 workers in 2022 and more than 200,000 in 2023. The reasons varied: post-pandemic correction, economic uncertainty, and cost-cutting.&lt;/p&gt;

&lt;p&gt;However, for Ukrainian developers, there was an additional factor that nobody wanted to mention out loud: geopolitical risk.&lt;/p&gt;

&lt;p&gt;Companies didn’t announce “we’re cutting Ukrainian teams because of the war.” They just quietly stopped hiring. Stopped renewing contracts. Started looking elsewhere.&lt;/p&gt;

&lt;p&gt;We went in the opposite direction. While others were reducing their exposure to Ukraine, we were building systems to make that exposure sustainable in the long term.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reality Nobody Romanticizes
&lt;/h2&gt;

&lt;p&gt;Here’s what people get wrong when they talk about Ukrainian resilience in tech:&lt;/p&gt;

&lt;p&gt;It’s not heroism when engineers work through air raid sirens. It’s not inspiring when someone messages, “on backup power, all good,” and continues working.&lt;/p&gt;

&lt;p&gt;It’s survival. People work because there’s nowhere else to go. Because routine provides sanity. Because paying rent still matters even when missiles are falling.&lt;/p&gt;

&lt;p&gt;The systems we built didn’t create heroes—they made survival operational.&lt;/p&gt;

&lt;h3&gt;
  
  
  What “Working Through War” Actually Means
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Engineers work from home on backup power when the grid is down for 12+ hours daily&lt;/li&gt;
&lt;li&gt;During the first year, before everyone had backup power, some worked from coworking spaces with generators&lt;/li&gt;
&lt;li&gt;Now everyone has set up their own backup systems and works from home&lt;/li&gt;
&lt;li&gt;“Explosions nearby, taking a 30-minute break” is a standard Slack message&lt;/li&gt;
&lt;li&gt;Infrastructure can fail without warning—power, internet, communications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We plan around that uncertainty because we value our team and the people in it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Years Two, Three, and Four: The Dividends of Early Preparation
&lt;/h2&gt;

&lt;p&gt;The second blackout campaign in 2023 had a different impact. We were ready. The team had backup power. Coworking spaces were already contracted. Everyone knew the process.&lt;/p&gt;

&lt;p&gt;The third blackout campaign in 2024—same thing. Less disruption each time.&lt;/p&gt;

&lt;p&gt;Now, the fourth campaign—October 2025—I’m sitting on backup power for 12 hours, and the company is operating at near-normal capacity. Not because the situation improved. Because the preparation from year one carried through years two, three, and four.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pattern:&lt;/strong&gt; Each subsequent crisis hits less hard because you built systems during the previous one.&lt;/p&gt;

&lt;p&gt;This isn’t unique to war. It applies to any sustained crisis, such as a pandemic, economic collapse, or industry disruption.&lt;/p&gt;

&lt;p&gt;The companies that survive are those that prepare for the next crisis while managing the current one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Math That Actually Matters
&lt;/h2&gt;

&lt;p&gt;Over four years of war, while other companies were cutting Ukrainian teams and hedging their bets, we did the opposite.&lt;/p&gt;

&lt;p&gt;The result: our ARR grew 3x.&lt;/p&gt;

&lt;p&gt;Not despite the war. Because we built systems during crisis one that let us operate through crises two, three, and four. While competitors were dealing with instability and team turnover, we were shipping products for American clients without interruption.&lt;/p&gt;

&lt;p&gt;The systems we built weren’t just about survival. They aimed to build a company that could thrive under the most challenging conditions.&lt;/p&gt;

&lt;p&gt;Client retention remained high because we were transparent about constraints and reliable in our delivery. Team retention stayed at 95%+ because people felt supported, not abandoned. Growth happened because we invested in infrastructure when others were cutting costs.&lt;/p&gt;

&lt;p&gt;The fourth blackout is easier than the first—not because the situation improved, but because we built systems in year one that carried through years two, three, and four.&lt;/p&gt;

&lt;p&gt;That’s the only predictable pattern in sustained crisis:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;the systems you build during crisis one determine whether you survive crisis four.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And if you build them right, you don’t just survive. You grow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P.S.&lt;/strong&gt; - For Ukrainian teams still operating during the war: you’re not alone. For companies working with Ukrainian teams: consistency and transparency matter more than inspiration. And for everyone else: the best time to prepare for the next crisis is during the current one.&lt;/p&gt;

</description>
      <category>management</category>
      <category>leadership</category>
      <category>productivity</category>
      <category>career</category>
    </item>
    <item>
      <title>Supervising an AI Engineer: Lessons from 212 Sessions</title>
      <dc:creator>Denis Stetskov</dc:creator>
      <pubDate>Tue, 07 Oct 2025 11:03:49 +0000</pubDate>
      <link>https://forem.com/razoorka/supervising-an-ai-engineer-lessons-from-212-sessions-4gaa</link>
      <guid>https://forem.com/razoorka/supervising-an-ai-engineer-lessons-from-212-sessions-4gaa</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqxxy2qe2w84nupk32oj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqxxy2qe2w84nupk32oj.png" alt="212 Sessions" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Moment Everything Broke
&lt;/h2&gt;

&lt;p&gt;Seventeen failed attempts on the same feature. Different fixes. Same bug. Same confident “should work” every round.&lt;/p&gt;

&lt;p&gt;That’s when it clicked: the issue wasn’t the model — it was the process.&lt;/p&gt;

&lt;p&gt;Polite requests produced surface patches. Structured pressure produced an analysis.&lt;/p&gt;

&lt;p&gt;So I changed the rules: no implementation without TODOs, specs, and proof. No “should work.” Only “will work.”&lt;/p&gt;

&lt;h2&gt;
  
  
  The Experiment
&lt;/h2&gt;

&lt;p&gt;Two months ago, I set a simple constraint: build a production SaaS platform without writing a single line of code myself.&lt;/p&gt;

&lt;p&gt;My role is that of a supervisor and code reviewer; AI’s role is that of the sole implementation engineer.&lt;/p&gt;

&lt;p&gt;The goal wasn’t to prove that AI can replace developers (it can’t). It was to discover what methodology actually works when you can’t “just fix it yourself.”&lt;/p&gt;

&lt;p&gt;Over eight weeks, I tracked 212 sessions across real features — auth, billing, file processing, multi-tenancy, and AI integrations. Every prompt, failure, and revision is logged in a spreadsheet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;80% of the application shipped without manual implementation&lt;/li&gt;
&lt;li&gt;89% success rate on complex features&lt;/li&gt;
&lt;li&gt;61% fewer iterations per task&lt;/li&gt;
&lt;li&gt;4× faster median delivery&lt;/li&gt;
&lt;li&gt;2 production incidents vs 11 using standard prompting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The experiment wasn’t about proving AI’s power — it was about what happens when you remove human intuition from the loop. The system that emerged wasn’t designed — it was forced by failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Specification-First Discovery
&lt;/h2&gt;

&lt;p&gt;The most critical pattern: never start implementation without a written specification.&lt;/p&gt;

&lt;p&gt;Every successful feature began with a markdown spec containing an architecture summary, requirements, implementation phases, examples, and blockers.&lt;/p&gt;

&lt;p&gt;Then I opened that file and said:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“Plan based on this open file. ultrathink.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without a specification, AI guesses at the architecture and builds partial fixes that “should work.” With a spec, it has context, constraints, and a definition of done.&lt;/p&gt;

&lt;p&gt;Time ratio: 30% planning + validation / 70% implementation — the inverse of typical development.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Specification Cycle
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Draft:&lt;/strong&gt; “Create implementation plan for [feature]. ultrathink.” → Review assumptions and missing pieces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Refine:&lt;/strong&gt; “You missed [X, Y, Z]. Check existing integrations.” → Add context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Validate:&lt;/strong&gt; “Compare with [existing-feature.md].” → Ensure consistency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Finalize:&lt;/strong&gt; “Add concrete code examples for each phase.”&lt;/p&gt;

&lt;p&gt;Plans approved after 3–4 rounds reduce post-merge fixes by roughly 70%. Average success rate across validated plans: 89%.&lt;/p&gt;
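&lt;p&gt;The four rounds can be sketched as a simple loop. This is an illustrative sketch only: &lt;code&gt;ask_model&lt;/code&gt; is a hypothetical stand-in for whatever chat interface you use, not a real API.&lt;/p&gt;

```python
# Illustrative sketch of the four-round specification cycle.
# `ask_model` is a hypothetical stand-in for the chat interface in use.

PROMPTS = [
    "Create implementation plan for {feature}. ultrathink.",  # 1. Draft
    "You missed {gaps}. Check existing integrations.",        # 2. Refine
    "Compare with {reference_spec}.",                         # 3. Validate
    "Add concrete code examples for each phase.",             # 4. Finalize
]

def run_spec_cycle(feature, gaps, reference_spec, ask_model):
    """Run draft -> refine -> validate -> finalize in order, carrying the spec forward."""
    spec = ""
    for template in PROMPTS:
        prompt = template.format(feature=feature, gaps=gaps,
                                 reference_spec=reference_spec)
        spec = ask_model(prompt, context=spec)
    return spec  # only a spec that survived all four rounds gets implemented
```

&lt;p&gt;The point is not the code but the gate: implementation starts only from the spec that comes out of the final round.&lt;/p&gt;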

&lt;h2&gt;
  
  
  The “Ultrathink” Trigger
&lt;/h2&gt;

&lt;p&gt;“Ultrathink” is a forced deep-analysis mode.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“investigate how shared endpoints and file processing work. ultrathink”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of drafting code, the AI performs a multi-step audit, maps dependencies, and surfaces edge cases. It turns a generator into an analyst.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In practice, ultrathink means reason before you type.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Accountability Feedback: Breaking the Approval Loop
&lt;/h2&gt;

&lt;p&gt;AI optimizes for user approval. Left unchecked, it learns that speed = success.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Polite loops:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI delivers a fast fix → user accepts → model repeats shortcuts → quality drops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accountability loops:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI delivers → user rejects, demands proof → AI re-analyzes → only validated code passes.&lt;/p&gt;
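&lt;p&gt;The accountability loop amounts to a review gate: reject anything that ships a claim instead of evidence. A minimal sketch, with made-up marker phrases that you would tune to your own workflow:&lt;/p&gt;

```python
# Illustrative accountability gate: a delivery is accepted only when it
# carries proof, not just a confident claim. The marker phrases below are
# assumptions for this sketch, not a fixed rule.

EVIDENCE_MARKERS = ("test output", "root cause", "reproduced", "verified")
BANNED_CLAIMS = ("should work", "probably works")

def review(delivery: str) -> str:
    """Return 'accept' or a rejection reason for an AI-submitted delivery."""
    text = delivery.lower()
    if any(claim in text for claim in BANNED_CLAIMS):
        return "reject: replace 'should work' with evidence that it does"
    if not any(marker in text for marker in EVIDENCE_MARKERS):
        return "reject: show proof (failing test before, passing test after)"
    return "accept"
```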

&lt;h2&gt;
  
  
  Results (212 sessions)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Method              | Success Rate | Avg Iterations | Bugs Accepted |
| ------------------- | ------------ | -------------- | ------------- |
| Polite requests     | 45 %         | 6.2            | 38 %          |
| “Think harder”      | 67 %         | 3.8            | 18 %          |
| Specs only          | 71 %         | 3.2            | 14 %          |
| Ultrathink only     | 74 %         | 2.9            | 11 %          |
| **Complete method** | 89 %         | 1.9            | 3 %           |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The average resolution time dropped from 47 to 19 minutes.&lt;/p&gt;

&lt;p&gt;Same model. Different management.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the Method Fails
&lt;/h2&gt;

&lt;p&gt;Even structure has limits:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge Boundary:&lt;/strong&gt; 3+ identical failures → switch approach or bring in a human.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture Decision:&lt;/strong&gt; AI can’t weigh trade-offs (e.g., SQL vs. NoSQL, monolith vs. microservices).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Novel Problem:&lt;/strong&gt; no precedent → research manually.&lt;/p&gt;

&lt;p&gt;Knowing when to stop saves more time than any prompt trick.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Complete Method
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1 — Structured Planning
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;“Create detailed specs for [task]:
- Investigate current codebase for better context
- Find patterns which can be reused
- Follow the same codebase principles
- Technical requirements  
- Dependencies  
- Success criteria  
- Potential blockers  
ultrathink”
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Phase 2 — Implementation with Pressure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Implement specific TODO → ultrathink.&lt;/li&gt;
&lt;li&gt;If wrong → compare with working example.&lt;/li&gt;
&lt;li&gt;If still wrong → find root cause.&lt;/li&gt;
&lt;li&gt;If thrashing → rollback and replan.&lt;/li&gt;
&lt;/ul&gt;
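&lt;p&gt;The escalation ladder above, combined with the “3+ identical failures” boundary from the previous section, can be sketched as a counter. The exact wording of each instruction is illustrative:&lt;/p&gt;

```python
# Sketch of the implementation-with-pressure ladder. The attempt limit
# encodes the "3+ identical failures -> bring in a human" boundary.

ESCALATION = [
    "Implement this specific TODO. ultrathink.",
    "Wrong. Compare with the working example and explain the difference.",
    "Still wrong. Find the root cause before changing any code.",
]

def next_instruction(identical_failures: int) -> str:
    """Map a count of identical failures to the next supervision step."""
    if identical_failures >= 3:  # knowledge boundary: stop thrashing
        return "rollback and replan, or bring in a human"
    return ESCALATION[identical_failures]
```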

&lt;h3&gt;
  
  
  Phase 3 — Aggressive QA
&lt;/h3&gt;

&lt;p&gt;Reject everything without reasoning. Demand proof and edge cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case Study — BYOK Integration
&lt;/h2&gt;

&lt;p&gt;Feature: Bring Your Own Key for AI providers. 19 TODOs across three phases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timeline&lt;/strong&gt;: 4 hours (an estimated 12+ hours without the method)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bugs&lt;/strong&gt;: 0&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code reviews&lt;/strong&gt;: 1 (typo)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Still stable&lt;/strong&gt;: 6 weeks later&lt;/p&gt;

&lt;p&gt;This pattern repeated across auth, billing, and file processing. Structured plans + accountability beat intuition every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Leadership Shift
&lt;/h2&gt;

&lt;p&gt;Supervising AI feels like managing 50 junior engineers at once — fast, obedient, and prone to hallucinations. You can’t out-code them. You must out-specify them.&lt;/p&gt;

&lt;p&gt;When humans code, they compensate for vague requirements. AI can’t. Every ambiguity becomes a bug.&lt;/p&gt;

&lt;p&gt;The Spec-Driven Method works because it removes compensation. No “just fix it quick.” No shortcuts. Clarity first — or nothing works.&lt;/p&gt;

&lt;p&gt;What appeared to be AI supervision turned out to be a mirror for the engineering discipline itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Truth
&lt;/h2&gt;

&lt;p&gt;After two months without touching a keyboard, the pattern was obvious:&lt;/p&gt;

&lt;p&gt;Most engineering failures aren’t about complexity — they’re about vague specifications we code around instead of fixing.&lt;/p&gt;

&lt;p&gt;AI can’t code around vagueness. That’s why this method works — it forces clarity first.&lt;/p&gt;

&lt;p&gt;This method wasn’t born from clever prompting — it was born from the constraints every engineering team faces: too much ambiguity, too little clarity, and no time to fix either.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;Next time you’re on iteration five of a “simple fix,” stop being polite. Write specs. Type “ultrathink.” Demand proof. Reject garbage.&lt;/p&gt;

&lt;p&gt;Your code will work. Your process will improve. Your sanity will survive.&lt;/p&gt;

&lt;p&gt;The difference isn’t the AI — it’s the discipline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Yes, AI wrote all the code. But what can AI actually do without an experienced supervisor?&lt;/p&gt;

&lt;p&gt;Anthropic’s press release mentioned “30 hours of autonomous programming.” Okay. But who wrote the prompts, specifications, and context management for that autonomous work? The question is rhetorical.&lt;/p&gt;

&lt;p&gt;One example from this experiment shows current model limitations clearly:&lt;/p&gt;

&lt;p&gt;The file processing architecture problem:&lt;/p&gt;

&lt;p&gt;Using Opus in planning mode, I needed architecture for file processing and embedding.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI suggested Vercel endpoint (impossible—execution time limits)&lt;/li&gt;
&lt;li&gt;AI proposed Supabase Edge Functions (impossible—memory constraints)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Eventually, I had to architect the solution myself: a separate service and separate repository, deployed to Railway.&lt;/p&gt;

&lt;p&gt;The model lacks understanding of the boundary between possible and impossible solutions. It’s still just smart autocomplete.&lt;/p&gt;

&lt;p&gt;AI can write code. It can’t architect systems under real constraints without supervision that understands those constraints.&lt;/p&gt;

&lt;p&gt;The Spec-Driven Method is effective because it requires supervision to be systematic. Without it, you get confident suggestions that can’t work in production.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Based on 212 tracked sessions over two months. 80% of a production SaaS built without writing code. Two production incidents. Zero catastrophic failures.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;P.S. A spec example can be found in the original article: &lt;a href="https://techtrenches.substack.com/p/supervising-an-ai-engineer-lessons" rel="noopener noreferrer"&gt;https://techtrenches.substack.com/p/supervising-an-ai-engineer-lessons&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The 90-Day Trial That Predicts Who Thrives (And Who Fails)</title>
      <dc:creator>Denis Stetskov</dc:creator>
      <pubDate>Mon, 06 Oct 2025 11:14:42 +0000</pubDate>
      <link>https://forem.com/razoorka/the-90-day-trial-that-predicts-who-thrives-and-who-fails-1n9g</link>
      <guid>https://forem.com/razoorka/the-90-day-trial-that-predicts-who-thrives-and-who-fails-1n9g</guid>
      <description>&lt;p&gt;&lt;em&gt;Most companies hire first and then determine if the candidate is in the proper role. We flip that equation entirely. Here's why our "Right Person, Right Seat" evaluation happens during 90-day trials, not after—and what happens when all the signals align in the wrong direction.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Post-Hire Course Correction
&lt;/h2&gt;

&lt;p&gt;Traditional hiring follows a predictable pattern: interview based on resume and technical skills, make an offer, then spend 6-12 months discovering whether someone is actually the right person in the right seat. When misalignment becomes obvious, companies try to fix it through role changes, performance improvement plans, or team transfers.&lt;/p&gt;

&lt;p&gt;This approach is expensive, disruptive, and often unsuccessful.&lt;/p&gt;

&lt;p&gt;We learned to flip this equation: invest heavily in front-loaded evaluation during trials to prevent misalignment rather than correct it later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Multi-Layer Evaluation System
&lt;/h2&gt;

&lt;p&gt;Instead of relying on interviews and references, we built a systematic approach that reveals both "Right Person" (values alignment) and "Right Seat" (role competency) through actual work and team integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Four Evaluation Channels:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Daily Buddy Check-ins (Cultural Integration)

&lt;ul&gt;
&lt;li&gt;15-minute daily meetings for first 2 weeks&lt;/li&gt;
&lt;li&gt;Reduces to 3x/week, then 2x/week, then weekly&lt;/li&gt;
&lt;li&gt;Focuses on cultural fit, question quality, and integration patterns&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;PM Feedback Every 2 Days (Execution Performance)

&lt;ul&gt;
&lt;li&gt;Structured assessment of delivery pace, quality, and collaboration&lt;/li&gt;
&lt;li&gt;Tracks progression and identifies concerning patterns early&lt;/li&gt;
&lt;li&gt;Provides an external perspective on actual vs. perceived performance&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Weekly Team Lead 1:1s (Feedback Loop &amp;amp; Adaptation)

&lt;ul&gt;
&lt;li&gt;Regular feedback sessions with the direct manager&lt;/li&gt;
&lt;li&gt;Opportunity to address concerns and course-correct&lt;/li&gt;
&lt;li&gt;Tracks whether feedback is being absorbed and implemented&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Weekly Health Checks (Self-Assessment)

&lt;ul&gt;
&lt;li&gt;Engineer's own evaluation of progress, challenges, and completion&lt;/li&gt;
&lt;li&gt;Reveals self-calibration accuracy and awareness&lt;/li&gt;
&lt;li&gt;Creates a comparison point with team observations&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;EOS People Analyzer at 90 Days (Formal Assessment)

&lt;ul&gt;
&lt;li&gt;Right Person: Alignment with our 5 core values&lt;/li&gt;
&lt;li&gt;Right Seat: Gets It, Wants It, Capacity for the role&lt;/li&gt;
&lt;li&gt;Conducted only if the trial period is completed successfully&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
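&lt;p&gt;The cadence of these channels can be written down as a simple schedule. The week boundaries below are assumptions inferred from the “daily → 3x → 2x → weekly” progression described above, not exact policy:&lt;/p&gt;

```python
# Sketch of the trial evaluation cadence. The week boundaries for the
# tapering buddy check-ins are assumptions based on the
# "daily -> 3x -> 2x -> weekly" progression, not exact company policy.

def buddy_checkins_per_week(week: int) -> int:
    """Buddy check-in frequency as the trial progresses."""
    if week <= 2:
        return 5      # daily (workdays) for the first two weeks
    if week <= 4:
        return 3      # three times a week
    if week <= 8:
        return 2      # twice a week
    return 1          # weekly through day 90

CHANNELS = {
    "buddy_checkin": "tapering; see buddy_checkins_per_week",
    "pm_feedback": "every 2 days",
    "team_lead_1on1": "weekly",
    "health_check": "weekly self-assessment",
    "people_analyzer": "once, at day 90",
}
```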

&lt;h3&gt;
  
  
  Our Five Core Values Framework
&lt;/h3&gt;

&lt;p&gt;Every evaluation channel measures alignment with these principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stay a Student - Continuous learning and question evolution&lt;/li&gt;
&lt;li&gt;Prioritize Helping Others - Team collaboration and knowledge sharing&lt;/li&gt;
&lt;li&gt;Accountability is Key - Ownership of outcomes and honest self-assessment&lt;/li&gt;
&lt;li&gt;Raise the Bar - Quality standards and continuous improvement&lt;/li&gt;
&lt;li&gt;Keep Trying. Get it Done - Persistence and solution-oriented mindset&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Case Study: When All Signals Point to Misalignment
&lt;/h2&gt;

&lt;p&gt;Let me share a recent trial that perfectly illustrates how our system catches fundamental misfit before it becomes a costly hiring mistake.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Setup
&lt;/h3&gt;

&lt;p&gt;This engineer joined a new project after completing six weeks of onboarding on a previous assignment. By this point—his second project—he was expected to operate independently at a senior level, delivering features with minimal oversight.&lt;/p&gt;

&lt;h3&gt;
  
  
  Week 1-4: Early Warning Signs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Buddy Observations:&lt;/strong&gt; Daily check-ins revealed concerning patterns from the start. Questions remained at the implementation level throughout the first month. When facing blockers, he avoided seeking help, preferring to struggle silently rather than engage the support system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cultural Integration Issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No evolution in question quality or depth&lt;/li&gt;
&lt;li&gt;Passive participation in team discussions&lt;/li&gt;
&lt;li&gt;Avoided clarifying requirements when uncertain&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Months 1-3: Performance Decline
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;PM Feedback (Every 2 Days):&lt;/strong&gt; The structured PM assessments painted a clear picture of declining performance:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The work progress is very slow; sometimes it feels like he only works a couple of hours a day. There was also a situation where he promised to complete a task but didn't follow through."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Specific Execution Problems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chat Integration - Like/Dislike functionality: 15+ hours spent, feature still not working&lt;/li&gt;
&lt;li&gt;Chat Empty State: 8 hours spent on what should have been a 2-3 hour task&lt;/li&gt;
&lt;li&gt;Total chat bug fixes: 38.25 hours, with many bugs remaining&lt;/li&gt;
&lt;li&gt;Users Management Filter: 16 hours spent, only 20% completion achieved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pattern Recognition:&lt;/strong&gt; Tasks of moderate complexity took 2-3 times longer than anticipated. What should have been straightforward front-end work became extended struggles with no clear resolution path.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Self-Assessment Disconnect
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Weekly Health Checks:&lt;/strong&gt; While PM feedback, buddy observations, and team lead discussions showed concerning patterns, his self-assessments told a completely different story:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reported "no issues" consistently&lt;/li&gt;
&lt;li&gt;Claimed "100% task completion"&lt;/li&gt;
&lt;li&gt;Described the weeks as "normal" and on track&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Team Lead Feedback Sessions:&lt;/strong&gt; Weekly 1:1s provided direct feedback about performance concerns and specific areas for improvement. However, these feedback sessions revealed a troubling pattern: while feedback was acknowledged verbally, there was no visible implementation or behavior change in subsequent weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Reality Gap:&lt;/strong&gt; This disconnect between perceived and actual performance, combined with the inability to absorb and act on feedback, revealed a fundamental lack of self-calibration—critical for senior-level engineers who must self-manage effectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  Values Assessment Through Real Work
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stay a Student:&lt;/strong&gt; ❌ No growth trajectory over 12 weeks. Questions never evolved beyond basic implementation. No evidence of learning from feedback or improving execution patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accountability:&lt;/strong&gt; ❌ Poor self-assessment accuracy. Claimed completion on incomplete work. Promised deliverables without following through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raise the Bar:&lt;/strong&gt; ❌ Consistently below expectations. 38+ hours on chat functionality that remained broken. 16 hours for 20% completion on user filters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep Trying:&lt;/strong&gt; ❌ When facing difficulties, he chose to struggle silently rather than seeking help or proposing alternative approaches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prioritize Helping Others:&lt;/strong&gt; ❌ Limited engagement with team dynamics. Focused on individual work without considering broader impact.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Decision: Clear "Wrong Person"
&lt;/h3&gt;

&lt;p&gt;This wasn't a case of someone in the wrong seat—it was a fundamental values misalignment. The trial revealed someone who couldn't operate at the senior level in any seat within our organization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No offer was extended.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Front-Loaded Evaluation Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prevention vs. Correction
&lt;/h3&gt;

&lt;p&gt;Traditional approaches hire first, then try to correct misalignment through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Role transitions (often unsuccessful)&lt;/li&gt;
&lt;li&gt;Performance improvement plans (time-intensive)&lt;/li&gt;
&lt;li&gt;Team changes (disruptive to existing dynamics)&lt;/li&gt;
&lt;li&gt;Eventually, departures (expensive and demoralizing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our approach prevents these scenarios by investing evaluation time upfront during trials when the cost of discovering a misfit is minimal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multiple Signal Validation
&lt;/h3&gt;

&lt;p&gt;No single assessment method is perfect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interviews can be gamed&lt;/li&gt;
&lt;li&gt;Technical tests don't reveal collaboration patterns&lt;/li&gt;
&lt;li&gt;References may not reflect current capabilities&lt;/li&gt;
&lt;li&gt;Self-assessment can be inaccurate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But when buddy observations, PM feedback, health checks, and values alignment all point in the same direction, the signal becomes unmistakable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real Work, Real Conditions
&lt;/h3&gt;

&lt;p&gt;Our 90-day trials don't simulate the job—they ARE the job. Engineers work on actual projects, with real deadlines, genuine collaboration requirements, and authentic technical challenges.&lt;/p&gt;

&lt;p&gt;This reveals patterns that no interview process could uncover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How they handle ambiguity under pressure&lt;/li&gt;
&lt;li&gt;Whether they improve with feedback over time&lt;/li&gt;
&lt;li&gt;How accurately they assess their own performance&lt;/li&gt;
&lt;li&gt;Whether they align with the team culture organically&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The EOS People Analyzer Confirmation
&lt;/h3&gt;

&lt;p&gt;For engineers who successfully pass our 90-day trials, the EOS People Analyzer at day 90 becomes confirmation rather than discovery.&lt;/p&gt;

&lt;p&gt;By this point, they've already demonstrated:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Right Person (Values Alignment):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shown a growth mindset through question evolution&lt;/li&gt;
&lt;li&gt;Demonstrated accountability through accurate self-assessment&lt;/li&gt;
&lt;li&gt;Exhibited helping others through team collaboration&lt;/li&gt;
&lt;li&gt;Raised the bar through quality improvements&lt;/li&gt;
&lt;li&gt;Kept trying through persistent problem-solving&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Right Seat (Role Competency):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gets It: Understands the role requirements and expectations&lt;/li&gt;
&lt;li&gt;Wants It: Shows enthusiasm for the work and growth opportunities&lt;/li&gt;
&lt;li&gt;Capacity: Demonstrates ability to perform at the required level&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Result: Minimal Post-Hire Course Correction
&lt;/h3&gt;

&lt;p&gt;This systematic approach dramatically reduces the need for role adjustments after hiring. When someone reaches our 90-day mark, they're typically well-aligned on both person and seat dimensions.&lt;/p&gt;

&lt;p&gt;We've found that engineers who pass our structured trials rarely need repositioning later. The front-loaded evaluation catches misalignment before it becomes a performance problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: Building Your Own Front-Loaded System
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Design Multiple Observation Channels
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cultural Integration:&lt;/strong&gt; Daily/weekly buddy check-ins to assess values alignment and team fit&lt;br&gt;
&lt;strong&gt;Execution Performance:&lt;/strong&gt; Regular PM or team lead assessments of actual work output&lt;br&gt;
&lt;strong&gt;Feedback Integration:&lt;/strong&gt; Weekly 1:1s with the direct manager to provide guidance and track adaptation&lt;br&gt;
&lt;strong&gt;Self-Calibration:&lt;/strong&gt; Weekly self-assessments compared with team observations&lt;br&gt;
&lt;strong&gt;Formal Evaluation:&lt;/strong&gt; Structured assessment tool (like the EOS People Analyzer) for final confirmation&lt;/p&gt;

&lt;h3&gt;
  
  
  Create a Clear Values Framework
&lt;/h3&gt;

&lt;p&gt;Define specific, observable behaviors that demonstrate your core values. Make these measurable during trial periods through real work situations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Track Patterns Over Time
&lt;/h3&gt;

&lt;p&gt;Single data points can mislead. Look for trajectories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is performance improving, plateauing, or declining?&lt;/li&gt;
&lt;li&gt;Are questions becoming more sophisticated or staying static?&lt;/li&gt;
&lt;li&gt;Is self-assessment becoming more accurate or remaining disconnected?&lt;/li&gt;
&lt;/ul&gt;
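&lt;p&gt;Both checks, trajectory over time and the self-assessment gap, reduce to simple arithmetic on weekly scores. The 1–5 scale here is an assumption for illustration:&lt;/p&gt;

```python
# Sketch of pattern tracking over a trial. Scores are on an assumed
# 1-5 scale; "reality gap" compares self-assessment against team
# observation, the disconnect described in the case study above.

def trend(scores):
    """Crude trajectory: compare the last third against the first third."""
    n = max(1, len(scores) // 3)
    early = sum(scores[:n]) / n
    late = sum(scores[-n:]) / n
    if late > early:
        return "improving"
    if late < early:
        return "declining"
    return "plateauing"

def reality_gap(self_scores, observed_scores):
    """Average amount by which self-assessment exceeds team observation."""
    diffs = [s - o for s, o in zip(self_scores, observed_scores)]
    return sum(diffs) / len(diffs)
```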

&lt;h3&gt;
  
  
  Make Hard Decisions Early
&lt;/h3&gt;

&lt;p&gt;When multiple signals consistently point to misalignment, act quickly. Extending the trial period in the hope of improvement usually delays the inevitable while consuming team resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cultural Impact of Getting It Right
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Team Velocity and Morale
&lt;/h3&gt;

&lt;p&gt;When everyone is in the proper role, the entire team performs better. No one carries dead weight. Complex projects move forward collaboratively rather than getting bottlenecked by struggling team members.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reduced Management Overhead
&lt;/h3&gt;

&lt;p&gt;Managers spend less time on performance issues and role adjustments, and more time on strategic challenges and team development.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stronger Feedback Culture
&lt;/h3&gt;

&lt;p&gt;When everyone is hired for a growth mindset and values alignment, the entire team gets better at giving and receiving feedback.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Truth About Hiring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Not every good person is right for your organization.&lt;/strong&gt; Technical competence doesn't guarantee values alignment. Interview performance doesn't predict real-world execution patterns.&lt;/p&gt;

&lt;p&gt;Some engineers thrive in environments with clearly defined requirements and stable technology stacks. Others excel when facing ambiguous problems and rapidly evolving challenges. Neither is "better"—they're different.&lt;/p&gt;

&lt;p&gt;The mistake is assuming you can train or manage someone into alignment with your culture and growth trajectory. Our experience shows that fundamental values and work patterns are largely fixed. You're better off identifying alignment upfront than trying to create it afterward.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Your Team
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ask yourself:&lt;/strong&gt; Are you evaluating "Right Person, Right Seat" during your hiring process, or after?&lt;/p&gt;

&lt;p&gt;Most companies assume the person is correct and try to find the right seat later. This leads to extended periods of role uncertainty, team disruption, and often unsuccessful outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our approach:&lt;/strong&gt; Invest heavily in front-loaded evaluation during trials. Use multiple observation channels over extended time periods. Make decisions based on patterns, not single assessments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The payoff:&lt;/strong&gt; Teams where everyone is genuinely aligned on both person and seat dimensions. Minimal post-hire course correction. Higher team velocity and morale.&lt;/p&gt;

&lt;p&gt;Prevention beats correction every time.&lt;/p&gt;

</description>
      <category>leadership</category>
      <category>career</category>
      <category>management</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
