<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jaskaran Singh</title>
    <description>The latest articles on Forem by Jaskaran Singh (@jaskaran_singh).</description>
    <link>https://forem.com/jaskaran_singh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891457%2Fcf2bff88-3ae7-4d38-a2ed-d62c86263079.jpg</url>
      <title>Forem: Jaskaran Singh</title>
      <link>https://forem.com/jaskaran_singh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jaskaran_singh"/>
    <language>en</language>
    <item>
      <title>Google Cloud NEXT '26 Was a Plumbing Conference. That's Why It Matters.</title>
      <dc:creator>Jaskaran Singh</dc:creator>
      <pubDate>Thu, 23 Apr 2026 03:31:40 +0000</pubDate>
      <link>https://forem.com/jaskaran_singh/google-cloud-next-26-was-a-plumbing-conference-thats-why-it-matters-b1f</link>
      <guid>https://forem.com/jaskaran_singh/google-cloud-next-26-was-a-plumbing-conference-thats-why-it-matters-b1f</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-cloud-next-2026-04-22"&gt;Google Cloud NEXT Writing Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Read enough keynote recaps and the shape of them becomes familiar. Model names, benchmark numbers, a CEO quote about whatever this year's era is called. You close the tab, write a Jira ticket, and wonder whether any of it was about your job.&lt;/p&gt;

&lt;p&gt;Today's Google Cloud NEXT '26 opening keynote had all of that. Thomas Kurian in Las Vegas, Sundar Pichai on video, Apple's logo unexpectedly behind the Google CEO's head. TPU generation eight. "The agentic cloud." The Gemini Enterprise Agent Platform — which is mostly what Vertex AI used to be, renamed and consolidated.&lt;/p&gt;

&lt;p&gt;Here's what I kept coming back to, though: the announcements that will actually affect what developers ship this year weren't the ones with applause breaks.&lt;/p&gt;

&lt;p&gt;They were the boring ones. The plumbing.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Mean By Plumbing
&lt;/h2&gt;

&lt;p&gt;The boring infrastructure almost always decides whether a technology ships at scale. HTTP wasn't exciting. TCP/IP wasn't a keynote moment. Nobody clapped for DNS. But that's the layer where things either work reliably or don't work at all.&lt;/p&gt;

&lt;p&gt;AI agents are at exactly that point right now. Everyone roughly agrees on what they want agents to do. The part that has quietly killed a hundred enterprise AI projects is different: getting agents to talk to each other across systems, hold context between sessions, and do it without becoming a security nightmare your team has to clean up later.&lt;/p&gt;

&lt;p&gt;That's most of what Google actually shipped today. Dressed up in model demos and stage lighting, but the substance is infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  The A2A Protocol Reaching v1.0
&lt;/h2&gt;

&lt;p&gt;The Agent2Agent (A2A) protocol reached v1.0 today and got handed to the Linux Foundation's Agentic AI Foundation for governance. It got maybe thirty seconds on stage.&lt;/p&gt;

&lt;p&gt;A2A answers a question that has made multi-agent architectures genuinely painful: how does an agent on Platform A discover, trust, and delegate to an agent on Platform B, when neither platform knows anything about the other's internals?&lt;/p&gt;

&lt;p&gt;The answer is Agent Cards. Each agent publishes a signed card — cryptographically verified via domain signatures — declaring what it can do, what inputs it accepts, and how to reach it. Another agent fetches that card, checks the signature, and delegates with some real basis for trusting that the capability is what it claims.&lt;/p&gt;
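&lt;p&gt;The v1.0 spec defines the real card schema; as a rough sketch of the consumption pattern (the field names and the signature check below are illustrative placeholders, not the normative A2A format), the flow looks something like this:&lt;/p&gt;

```python
import json

# Illustrative sketch of consuming an Agent Card. Field names and the
# signature check are placeholders, not the normative A2A v1.0 schema.
def parse_agent_card(raw: str) -> dict:
    """Parse a published card and pull out the declared capabilities."""
    card = json.loads(raw)
    # A real consumer verifies the card's signature against the
    # publishing domain before trusting anything in it.
    if "signature" not in card:
        raise ValueError("unsigned card: refuse to delegate")
    return {
        "name": card["name"],
        "endpoint": card["endpoint"],
        "skills": [skill["id"] for skill in card.get("skills", [])],
    }

def can_delegate(card: dict, skill_id: str) -> bool:
    """Delegate only if the remote agent declares the skill we need."""
    return skill_id in card["skills"]
```

The point of the pattern is that trust is established from the card, not from shared internals: the delegating agent never needs to know how the remote agent implements a skill, only that it signed a claim to have it.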

&lt;p&gt;Before this, "multi-agent" usually meant custom glue code, bespoke APIs, or just hoping two SDKs from the same vendor happened to compose without breaking. The production signal is worth noting: 150 organizations are running A2A in production right now, not in pilots — routing real workloads between agents built on different vendors' stacks. It launched roughly a year ago with 50 partner organizations.&lt;/p&gt;

&lt;p&gt;Native A2A support now ships in ADK, LangGraph, CrewAI, LlamaIndex, Semantic Kernel, and AutoGen. That's not a Google-curated list of close partners. That's where developers are actually building agent systems.&lt;/p&gt;

&lt;p&gt;The Linux Foundation move deserves more credit than it's getting. When a protocol lives in one company's GitHub, every potential adopter carries a quiet question in the back of their head: &lt;em&gt;what happens when Google gets bored with this?&lt;/em&gt; That friction is real, and it's killed protocols before. Handing it to neutral governance before mass adoption removes the question. It's the right call — and it wasn't the obviously self-interested one, since Google could have used the protocol as a lock-in mechanism instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  ADK v1.0 and What "Stable" Actually Buys You
&lt;/h2&gt;

&lt;p&gt;The Agent Development Kit hit stable v1.0 releases today across Python, Go, and Java, with TypeScript available as well. The announcement was brief. The implications are less so.&lt;/p&gt;

&lt;p&gt;The 0.x releases were experimentally useful — people shipped real things with them. But "production-ready" means something specific when your agents are taking autonomous actions: stable APIs you can actually depend on, predictable versioning, and a security model you can explain to someone who isn't you.&lt;/p&gt;

&lt;p&gt;v1.0 ships with Model Armor, which defends against indirect prompt injection. This is the attack vector most agent systems ignore until it becomes a real problem — a malicious payload hidden in retrieved content that hijacks agent behavior mid-task. It also puts zero-trust architecture at the protocol level, with access managed through Cloud IAM and full audit logging. When an agent does something unexpected at 2am, you can find out what it did and why, rather than guessing.&lt;/p&gt;
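&lt;p&gt;A crude filter makes the threat concrete. To be clear, this is a naive illustration of the attack surface, not how Model Armor actually works; the real defense is a managed service, and pattern matching alone is easy to evade:&lt;/p&gt;

```python
import re

# Naive illustration of indirect prompt injection: retrieved content is
# supposed to be data, but a model can mistake it for instructions.
# This is NOT how Model Armor works -- just a sketch of the problem.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"disregard your system prompt",
    r"you are now",
]

def flag_suspicious_content(retrieved_text: str) -> bool:
    """Flag retrieved documents that contain instruction-shaped payloads."""
    lowered = retrieved_text.lower()
    return any(re.search(pattern, lowered) for pattern in INJECTION_PATTERNS)
```

A payload like this hiding in one of a thousand ingredient PDFs is exactly the scenario the keynote demo glossed over.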

&lt;p&gt;If you've been waiting for ADK to stabilize before committing: the spec is frozen, the security model exists, and the governance is neutral. That's what stable means.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Line from the Keynote Worth Sitting With
&lt;/h2&gt;

&lt;p&gt;Thomas Kurian said this during his talk: &lt;em&gt;"You have moved beyond the pilot. The experimental phase is behind us."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I've been thinking about that framing. It's not a description of where enterprises actually are. It's a description of where Google needs them to be.&lt;/p&gt;

&lt;p&gt;Most enterprise AI projects are still in pilot. The gap between a working demo and something that runs reliably across production data, security policies, and organizational complexity has ended more AI initiatives than bad models ever did. That gap is exactly what makes today's less-glamorous announcements worth attention.&lt;/p&gt;

&lt;p&gt;Knowledge Catalog grounds agents in actual business context across an entire data estate. Memory Bank gives agents persistent state across sessions, so they don't start from scratch on every interaction. Agent Identity manages agent credentials through the same IAM system that manages human credentials — which means your security team can audit them the same way.&lt;/p&gt;

&lt;p&gt;None of this demos well. "Agent credentials managed through IAM with audit logging" doesn't generate applause. But it's what makes an agent your CISO will let near production data, rather than one that stays permanently in sandbox.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where I'm Skeptical
&lt;/h2&gt;

&lt;p&gt;The word "open" appears a lot today. A2A is an open protocol. ADK is open source. The Model Garden includes 200+ models from multiple vendors, including Anthropic Claude.&lt;/p&gt;

&lt;p&gt;All true. And also: the smoothest path through every one of these tools runs directly through Google Cloud. Agent Engine for managed hosting. Apigee as the API-to-agent gateway. Vertex AI as the deployment target.&lt;/p&gt;

&lt;p&gt;The protocol is portable. The operational infrastructure is not.&lt;/p&gt;

&lt;p&gt;This isn't necessarily a problem — someone has to build the runtime, and Google's is genuinely good. But developers should be clear with themselves about what "open" covers here. The code you write on ADK travels with you. The observability tooling, the managed hosting, the audit trail — those are Google Cloud products. That's a real dependency. Know what you're choosing.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Actually Do with This
&lt;/h2&gt;

&lt;p&gt;If you're building agents right now: read the A2A spec before the SDK docs. Understanding Agent Cards — what goes into them, how signing works, what a well-defined skill description looks like — shapes how you design agents from the start. Adding discoverability to a system you built as closed is miserable. The &lt;a href="https://google.github.io/adk-docs/a2a/" rel="noopener noreferrer"&gt;official ADK A2A docs&lt;/a&gt; are genuinely readable and worth an hour.&lt;/p&gt;

&lt;p&gt;If you're choosing a multi-agent framework: A2A v1.0 in production at 150 organizations, across every major framework, is a meaningful signal about where multi-agent interoperability is actually converging. MCP is worth understanding too — the two solve different layers of the same problem. But A2A is where cross-platform agent composition is happening in production, not in demos.&lt;/p&gt;

&lt;p&gt;If you're speccing an enterprise AI project: look at Memory Bank and Agent Identity before you finalize the architecture. Persistent agent state and proper credential management are the two things that most demo architectures quietly skip. If yours skips them too, you'll add them later, under pressure, and it won't go cleanly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Part That's Easy to Miss
&lt;/h2&gt;

&lt;p&gt;The keynote demo that got the biggest reaction showed a Gemini agent pulling data from thousands of ingredient PDFs, catching a soy allergen buried in one of them, then calling research agents to build a full market projection — autonomously, while the presenter talked.&lt;/p&gt;

&lt;p&gt;That's a real capability and it's impressive. But it works because of things that weren't in the demo: agents that can find each other by capability, verify each other's identity, maintain context between calls, and do it inside an auditable security boundary.&lt;/p&gt;

&lt;p&gt;The conference was loud today. TPU naming conventions, Apple on a Google slide, Sundar Pichai explaining that 75% of Google's new code is now AI-generated. That's all interesting. The part that matters for what developers actually ship is quieter: a protocol standard under neutral governance, running in production, with a security story you can defend.&lt;/p&gt;

&lt;p&gt;Infrastructure doesn't announce itself. It just works, until the day you need it and it's not there.&lt;/p&gt;

&lt;p&gt;The developer keynote is tomorrow at 10:30 AM PT on the DEV homepage. Worth catching for how the ADK and A2A story gets told to a technical audience rather than an executive one.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>cloudnextchallenge</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>The Mental Model I Use to Write Prompts That Don't Produce Garbage Code</title>
      <dc:creator>Jaskaran Singh</dc:creator>
      <pubDate>Wed, 22 Apr 2026 20:37:08 +0000</pubDate>
      <link>https://forem.com/jaskaran_singh/the-mental-model-i-use-to-write-prompts-that-dont-produce-garbage-code-1kgh</link>
      <guid>https://forem.com/jaskaran_singh/the-mental-model-i-use-to-write-prompts-that-dont-produce-garbage-code-1kgh</guid>
      <description>&lt;p&gt;Most prompt engineering advice focuses on syntax. Add "think step by step." Specify the language. Say "you are an expert." Some of that helps. None of it addresses the actual reason prompts produce bad code.&lt;/p&gt;

&lt;p&gt;The actual reason: the model can only solve the problem you described. Not the problem you have.&lt;/p&gt;

&lt;p&gt;Those are different more often than people realize. I know because my job is to find where they diverge. I've spent the last year evaluating AI-generated code professionally: writing rubrics, running adversarial tests, doing multi-turn reviews. The failure modes I see aren't random. They trace back, almost every time, to something in how the task was specified.&lt;/p&gt;

&lt;p&gt;There's a reason Andrej Karpathy &lt;a href="https://x.com/karpathy/status/1937902205765607626" rel="noopener noreferrer"&gt;argued in 2025&lt;/a&gt; we should call this "context engineering" rather than prompt engineering. "In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information." The framing shift matters. A request is something you fire off. A specification is something you construct.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prompts Are Specifications, Not Requests
&lt;/h2&gt;

&lt;p&gt;When you ask a colleague to write a function, they bring context you never stated. They know the codebase. They know what "production-ready" means at your company. They know the last time something similar broke and why. They fill gaps with judgment.&lt;/p&gt;

&lt;p&gt;A model has none of that. It has your words and its training data. When you leave gaps, it fills them with statistically likely defaults. Those defaults are often fine. When they're not, the code looks correct and isn't.&lt;/p&gt;

&lt;p&gt;The shift that changed how I prompt: stop thinking of a prompt as a request and start thinking of it as a specification. A specification answers questions the implementer will have whether or not you thought to ask them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Things Every Code Prompt Is Missing
&lt;/h2&gt;

&lt;p&gt;After a few hundred evaluations, four categories of missing information account for most of the failures I see.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Environment
&lt;/h3&gt;

&lt;p&gt;The model doesn't know where this code lives. The same function needs a completely different implementation depending on whether it runs in a coroutine context, a background thread, a serverless function, or a single-threaded script.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instead of:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Write&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;that&lt;/span&gt; &lt;span class="nx"&gt;fetches&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;caches&lt;/span&gt; &lt;span class="nx"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Try:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nc"&gt;Write&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="nc"&gt;Kotlin&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;fetches&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="n"&gt;and&lt;/span&gt; &lt;span class="n"&gt;caches&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="nc"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt; &lt;span class="n"&gt;inside&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="nc"&gt;ViewModel&lt;/span&gt; &lt;span class="n"&gt;using&lt;/span&gt; &lt;span class="n"&gt;viewModelScope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="nc"&gt;The&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="nc"&gt;Android&lt;/span&gt; &lt;span class="nc"&gt;API&lt;/span&gt; &lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;+.&lt;/span&gt; &lt;span class="nc"&gt;We&lt;/span&gt; &lt;span class="n"&gt;use&lt;/span&gt; &lt;span class="nc"&gt;Coroutines&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;not&lt;/span&gt; &lt;span class="nc"&gt;RxJava&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first prompt produces code. The second produces code that fits where it has to live.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Failure Cases
&lt;/h3&gt;

&lt;p&gt;The model optimizes for the happy path unless you tell it not to. Network calls succeed. Inputs are valid. Caches hit. This isn't laziness. Your prompt described a world where those things are true.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instead of:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Write&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;parse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;this&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;JSON&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;into&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;User&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;object.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Try:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nc"&gt;Write&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;parse&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt; &lt;span class="nc"&gt;JSON&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="n"&gt;into&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt; &lt;span class="k"&gt;object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="nc"&gt;Handle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;malformed&lt;/span&gt; &lt;span class="nc"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;
&lt;span class="n"&gt;optional&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;and&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="nc"&gt;Return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="nc"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;so&lt;/span&gt;
&lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;caller&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="n"&gt;failure&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="n"&gt;explicitly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You're not asking the model to be more careful. You're describing a more complete problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The Constraints Nobody Mentions
&lt;/h3&gt;

&lt;p&gt;Performance requirements. Size limits. Thread safety. Backward compatibility. These feel obvious because you carry them in your head. The model doesn't have your head.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instead of:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Write&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;processes&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Try:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nc"&gt;Write&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;processes&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="nc"&gt;Constraints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt; &lt;span class="n"&gt;contain&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;000&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;this&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="n"&gt;so&lt;/span&gt; &lt;span class="n"&gt;blocking&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="n"&gt;not&lt;/span&gt; &lt;span class="n"&gt;acceptable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;and&lt;/span&gt; &lt;span class="n"&gt;we&lt;/span&gt; &lt;span class="n"&gt;need&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;support&lt;/span&gt; &lt;span class="nc"&gt;API&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;+.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. What "Done" Looks Like
&lt;/h3&gt;

&lt;p&gt;If you don't define success criteria, the model defines them for you. Usually that means "compiles and handles the obvious case." That's a low bar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instead of:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Write&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt; &lt;span class="n"&gt;repository&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Try:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Write&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt; &lt;span class="n"&gt;repository&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;
&lt;span class="nc"&gt;I&lt;/span&gt; &lt;span class="n"&gt;want&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt; &lt;span class="n"&gt;failure&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="n"&gt;miss&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;and&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt; &lt;span class="n"&gt;access&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="nc"&gt;Use&lt;/span&gt; &lt;span class="nc"&gt;JUnit4&lt;/span&gt; &lt;span class="n"&gt;and&lt;/span&gt; &lt;span class="nc"&gt;Mockito&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="nc"&gt;Each&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="n"&gt;should&lt;/span&gt; &lt;span class="n"&gt;have&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="n"&gt;single&lt;/span&gt; &lt;span class="n"&gt;assertion&lt;/span&gt; &lt;span class="n"&gt;and&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;descriptive&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Pre-Prompt Checklist
&lt;/h2&gt;

&lt;p&gt;Before I send a code prompt, I run through four questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where does this run?&lt;/strong&gt; Language, runtime, framework, threading model, platform constraints. If I can't answer this in one sentence, I don't know my own context well enough yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What goes wrong?&lt;/strong&gt; Every function has a set of inputs that break it. State them. If the function touches the network, a database, or user input, those are automatic candidates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What can't I trade away?&lt;/strong&gt; Performance floor, security requirements, API compatibility, dependency restrictions. Anything that would make a technically correct solution still unshippable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How will I know it worked?&lt;/strong&gt; If I can't describe a test that would fail on a wrong implementation and pass on a correct one, my spec is incomplete.&lt;/p&gt;

&lt;p&gt;This takes maybe 90 seconds. It saves much more than that in review time.&lt;/p&gt;
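&lt;p&gt;The checklist translates naturally into a template. Here's a minimal sketch in Python; the class and field names are my own choices, not any standard, but each checklist question becomes a required input rather than something you remember to mention:&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class CodeSpec:
    """The four pre-prompt questions, assembled into one specification.
    Field names are illustrative, not a standard schema."""
    task: str
    environment: str                                   # where does this run?
    failure_cases: list = field(default_factory=list)  # what goes wrong?
    constraints: list = field(default_factory=list)    # what can't I trade away?
    success_criteria: str = ""                         # how will I know it worked?

    def to_prompt(self) -> str:
        parts = [self.task, f"Context: {self.environment}"]
        if self.failure_cases:
            parts.append("Handle: " + ", ".join(self.failure_cases))
        if self.constraints:
            parts.append("Constraints: " + "; ".join(self.constraints))
        if self.success_criteria:
            parts.append(f"Done means: {self.success_criteria}")
        return "\n".join(parts)
```

The win isn't the code itself; it's that an empty `failure_cases` list is visible in a way that a forgotten sentence never is.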




&lt;h2&gt;
  
  
  The Adversarial Test
&lt;/h2&gt;

&lt;p&gt;After I write a prompt, I read it as if I'm trying to satisfy it with the worst code that technically meets the stated requirements.&lt;/p&gt;

&lt;p&gt;If the worst technically-compliant implementation is still unshippable, my prompt is missing something.&lt;/p&gt;

&lt;p&gt;Example. Prompt: "Write a function that returns the user's name from the database."&lt;/p&gt;

&lt;p&gt;Worst technically-compliant implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT name FROM users WHERE id = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns a name from the database. It's also a SQL injection vulnerability — &lt;a href="https://owasp.org/www-project-top-10-for-large-language-model-applications/" rel="noopener noreferrer"&gt;still the most common LLM-generated security flaw according to OWASP's Top 10 for LLM Applications&lt;/a&gt; — and will throw an unhandled exception if the user doesn't exist.&lt;/p&gt;

&lt;p&gt;Both problems are visible if you read the prompt adversarially. Neither shows up if you read it straight.&lt;/p&gt;

&lt;p&gt;Better prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Write&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;returns&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s name from the database.
Use parameterized queries. Return None if the user doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="n"&gt;exist&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;Raise&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="nc"&gt;DatabaseError &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;generic&lt;/span&gt; &lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="n"&gt;fails&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the worst technically-compliant implementation is actually safe.&lt;/p&gt;
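&lt;p&gt;For comparison, a compliant implementation of the better prompt might look like this; sqlite3 and the table layout are assumptions for the sketch, not part of the original prompt:&lt;/p&gt;

```python
import sqlite3

class DatabaseError(Exception):
    """Raised when the query itself fails, per the prompt's contract."""

def get_user_name(conn: sqlite3.Connection, user_id: int):
    """Return the user's name, or None if the user doesn't exist.
    Parameterized query closes the injection hole from the naive version."""
    try:
        row = conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
    except sqlite3.Error as exc:
        raise DatabaseError(str(exc)) from exc
    return row[0] if row else None
```

Every line that differs from the one-liner maps directly to a sentence in the improved prompt, which is the whole argument in miniature.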




&lt;h2&gt;
  
  
  What This Doesn't Fix
&lt;/h2&gt;

&lt;p&gt;Two things this framework won't help with: tasks that require understanding your system's history, and tasks where the right answer depends on a judgment call you haven't made yet. For the first, no amount of prompt engineering substitutes for the model actually knowing your codebase — that's where RAG and &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview" rel="noopener noreferrer"&gt;long-context strategies&lt;/a&gt; come in. For the second, the prompt can't be finished until you've made the decision.&lt;/p&gt;

&lt;p&gt;Both are worth recognizing because they tell you when to stop trying to prompt your way out of a problem. Some work needs to stay with you.&lt;/p&gt;




&lt;h2&gt;
  
  
  The One-Line Version
&lt;/h2&gt;

&lt;p&gt;Describe the problem you have, not the output you want.&lt;/p&gt;

&lt;p&gt;The output is a function. The problem is a function that handles these inputs, runs in this context, fails gracefully in these ways, and satisfies these constraints. The model is better at solving the second one than you might think. It just can't infer it from the first.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://linkedin.com/in/jaskaranchana" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; · &lt;a href="https://jaskaranchana.netlify.app/" rel="noopener noreferrer"&gt;Portfolio&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>todayilearned</category>
    </item>
    <item>
      <title>AI Agents Are Shipping Features Without You. Now What?</title>
      <dc:creator>Jaskaran Singh</dc:creator>
      <pubDate>Wed, 22 Apr 2026 00:12:59 +0000</pubDate>
      <link>https://forem.com/jaskaran_singh/ai-agents-are-shipping-features-without-you-now-what-4eo0</link>
      <guid>https://forem.com/jaskaran_singh/ai-agents-are-shipping-features-without-you-now-what-4eo0</guid>
      <description>&lt;p&gt;&lt;em&gt;Jaskaran Singh — Senior Software Engineer, AI Trainer&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;A few weeks ago I watched an agent open a GitHub issue, write the fix, run the tests, and open a pull request. No human typed a line of code. The PR passed review.&lt;/p&gt;

&lt;p&gt;I didn't find this inspiring. I found it genuinely disorienting. I say that as someone who trains AI models for a living and is currently building an agent of my own.&lt;/p&gt;

&lt;p&gt;If you're a software engineer in 2026 and you haven't had that moment yet, you will. Agentic AI is being called the third seismic shift in software engineering this century, after open source and DevOps. That framing might be overblown. It might not be. Either way, something real is happening and it's worth thinking clearly about instead of panicking or dismissing it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Numbers Stopped Being Theoretical
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1551288049-bebda4e38f71%3Fw%3D1000%26q%3D80" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1551288049-bebda4e38f71%3Fw%3D1000%26q%3D80" alt="A dashboard of data and analytics representing AI adoption statistics" width="1000" height="667"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://unsplash.com/photos/MacBook-Pro-on-table-beside-white-iMac-and-Magic-Mouse-Im7lZjxeLhg" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt; — Luke Chesser&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://newsletter.pragmaticengineer.com/p/ai-tooling-2026" rel="noopener noreferrer"&gt;survey of nearly 1,000 engineers published in early 2026&lt;/a&gt; found that 95% use AI tools at least weekly, 75% use AI for half or more of their engineering work, and 55% regularly use AI agents. That last number is the one that matters. Copilots have been mainstream for two years. Agents are different.&lt;/p&gt;

&lt;p&gt;A copilot suggests. An agent acts. It reads your codebase, decides what to do, does it, checks whether it worked, and tries again if it didn't. The feedback loop is closed without you in it.&lt;/p&gt;

&lt;p&gt;In 2025, coding agents moved from experimental tools to production systems shipping real features to real customers. In 2026, single agents are becoming coordinated teams of agents.&lt;/p&gt;

&lt;p&gt;I've been watching this from an unusual angle. My job involves evaluating AI-generated code for quality: finding the failure modes, writing the rubrics, doing the multi-turn reviews. At the same time I'm building a Python agent that monitors the OINP immigration portal and pushes Telegram alerts whenever a new Masters Graduate stream draw drops. Two different relationships with the same technology, and both have given me a clearer picture than I'd have from either side alone.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Agents Are Actually Good At
&lt;/h2&gt;

&lt;p&gt;Agents handle implementation tasks well when the problem is well-scoped and verifiable. "Add pagination to this endpoint." "Write tests for this module." "Refactor this class to use dependency injection." Tasks with clear success criteria: the code runs, the tests pass, the interface contract is unchanged. The agent can verify its own work.&lt;/p&gt;

&lt;p&gt;Quality still varies. My evaluation work confirms what engineers describe: intuitions for delegation develop over time. People hand off tasks that are easily verifiable or low-stakes. That intuition is real and it matters. Knowing what to delegate is itself a skill now.&lt;/p&gt;

&lt;p&gt;Where agents fall apart is anything requiring judgment about what the right problem even is. An agent given an ambiguous brief will confidently solve the wrong version of it. I've seen this pattern repeatedly, not as an occasional edge case but as a consistent failure mode when the task specification has gaps. The agent doesn't ask for clarification. It infers, fills in, and proceeds. Sometimes the inference is right. When it's wrong, it's wrong in ways that are coherent and hard to catch. That's the part that should make you nervous.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Shift That's Actually Happening to Engineering Teams
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-10-20-gartner-identifies-the-top-strategic-technology-trends-for-2026" rel="noopener noreferrer"&gt;Gartner predicts&lt;/a&gt; 80% of organizations will evolve large software engineering teams into smaller, AI-augmented teams by 2030. The trajectory is already visible. Teams that used to need eight engineers to maintain a product are running it with four. Not because the other four got fired, but because agent-assisted output per engineer went up enough that the headcount math changed.&lt;/p&gt;

&lt;p&gt;The pattern emerging in 2026: software development is moving toward human expertise focused on defining problems worth solving while AI handles the tactical implementation work.&lt;/p&gt;

&lt;p&gt;That framing is mostly right but it undersells something. "Defining problems worth solving" sounds clean and strategic. In practice it means writing a spec precise enough that an agent doesn't go off the rails, reviewing agent output at a level that catches subtle correctness issues, and making architecture decisions that hold up when the agent starts filling in implementations you didn't anticipate.&lt;/p&gt;

&lt;p&gt;Those are all hard skills. They're also different from the skills that got most of us into engineering. We learned by writing the implementation ourselves. The feedback loop of "I wrote this, it broke, I understand why" is how you build the mental models that make good judgment possible. Whether that judgment transfers cleanly to directing agents at tasks you've never done yourself is an open question. I don't think anyone knows yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means If You're Mid-Career
&lt;/h2&gt;

&lt;p&gt;I'm five years in. I've shipped production Android apps, done fintech work, and I'm now working at the AI training layer. The people who seem least threatened by this shift share one thing: they understand systems, not just syntax.&lt;/p&gt;

&lt;p&gt;A developer who knows Kotlin and can write Jetpack Compose components is in a different position than one who understands why coroutine cancellation works the way it does, when a &lt;code&gt;ViewModel&lt;/code&gt; scope is the wrong choice, and what the architectural consequences of a particular state management approach are three features down the road. The first kind of knowledge is increasingly delegatable. The second is what you need to review what the agent produces.&lt;/p&gt;
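&lt;p&gt;To make that concrete, here's a minimal sketch (my own illustration, not from any particular codebase) of the kind of detail the second engineer carries around. Coroutine cancellation is cooperative: a busy loop that never suspends or checks &lt;code&gt;isActive&lt;/code&gt; will simply ignore &lt;code&gt;cancel()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;import kotlinx.coroutines.*

fun main() = runBlocking {
    val job = launch(Dispatchers.Default) {
        var i = 0L
        // Cooperative cancellation: without the isActive check (or a
        // suspension point such as yield()), this loop outlives cancel().
        while (isActive) {
            i++
        }
    }
    delay(100)
    job.cancelAndJoin() // returns promptly only because the loop cooperates
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An agent will happily hand you either version of that loop. Knowing which one you got is the part that doesn't delegate.&lt;/p&gt;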

&lt;p&gt;This is not a comfortable message. It basically says the work that builds deep knowledge is being automated before you've had a chance to accumulate it through repetition. That's a real problem for junior developers and I don't have a clean answer to it. Engineers who actively seek out the "why" behind every pattern they use, even when an agent handed them that pattern, will pull ahead of those who treat agent output as a black box. That's my best guess.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Security Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1555066931-4365d14bab8c%3Fw%3D1000%26q%3D80" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1555066931-4365d14bab8c%3Fw%3D1000%26q%3D80" alt="A padlock on a keyboard representing code security" width="1000" height="667"&gt;&lt;/a&gt;&lt;em&gt;Source: &lt;a href="https://unsplash.com/photos/turned-on-gray-laptop-computer-4hbJ-eymZ1o" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt; — Lewis Kang'ethe Ngugi&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Agentic coding is changing security in two directions. As models get more capable, building security into products gets easier. The same capabilities that help defenders help attackers too.&lt;/p&gt;

&lt;p&gt;There's a third direction worth adding from my evaluation work: agents introduce security risks through confident implementation of insecure patterns. An agent writing a data pipeline reaches for the most direct path to working code. Input sanitization, parameterized queries, credential management, error handling that doesn't leak internals: these require deliberate thought. Agents do them inconsistently.&lt;/p&gt;

&lt;p&gt;The more autonomous the coding pipeline, the more critical it is to have security review that isn't the same agent that wrote the code. I've flagged SQL injection vulnerabilities in agent-generated Python and credential handling issues in agent-generated Kotlin. The code was functionally correct. It would have passed a cursory review. It shouldn't have shipped.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I'm Still Building Agents
&lt;/h2&gt;

&lt;p&gt;None of this made me stop building the OINP monitoring bot. It made me more deliberate about it.&lt;/p&gt;

&lt;p&gt;The thing I'm building isn't trying to do something clever. It checks a government webpage on a schedule, parses the draw results, compares against the last known state, and fires a Telegram message if something changed. The agent part is the parsing logic: handling inconsistencies in how the page is structured, dealing with cases where the data format shifts slightly. That's a good fit for what these tools are actually good at.&lt;/p&gt;

&lt;p&gt;The immigration system in Canada is opaque in ways that are genuinely stressful for people on it. If a monitoring tool reduces that stress even slightly, it's worth the weekend. The judgment about what's worth building and why is still entirely mine.&lt;/p&gt;

&lt;p&gt;That's probably the honest answer to "now what." The judgment work is still yours. The implementation is increasingly negotiable.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Jaskaran Singh is a Senior Software Engineer working in AI training and evaluation, with production experience in Android development using Kotlin and Flutter. Currently building a Python-based OINP immigration monitoring agent.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://linkedin.com/in/jaskaranchana" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; · &lt;a href="https://jaskaranchana.github.io/Portfolio/" rel="noopener noreferrer"&gt;Portfolio&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>career</category>
      <category>security</category>
    </item>
    <item>
      <title>I Grade AI Code for a Living. Here's What Nobody Talks About.</title>
      <dc:creator>Jaskaran Singh</dc:creator>
      <pubDate>Tue, 21 Apr 2026 23:56:05 +0000</pubDate>
      <link>https://forem.com/jaskaran_singh/i-grade-ai-code-for-a-living-heres-what-nobody-talks-about-4do3</link>
      <guid>https://forem.com/jaskaran_singh/i-grade-ai-code-for-a-living-heres-what-nobody-talks-about-4do3</guid>
      <description>&lt;p&gt;&lt;em&gt;Jaskaran Singh — Senior Software Engineer, AI Trainer&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I've spent the last year doing something most engineers haven't: reading AI-generated code all day and deciding whether it's actually good.&lt;/p&gt;

&lt;p&gt;Not "does it compile." Not "did the tests pass." Good as in, would I be comfortable shipping this to production at 2am on a Friday if something went wrong.&lt;/p&gt;

&lt;p&gt;The answer, more often than people want to admit, is no.&lt;/p&gt;

&lt;p&gt;I use LLMs myself. But after evaluating enough AI-generated code across Python, Java, Kotlin, and C/C++, I know the failure modes aren't random. They follow patterns. And once you see them, you can't unsee them in AI code or your own.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Job Nobody Has a Good Title For
&lt;/h2&gt;

&lt;p&gt;My official role is AI Trainer. What that actually means: I'm a human in the RLHF loop.&lt;/p&gt;

&lt;p&gt;Reinforcement Learning from Human Feedback works by having engineers like me evaluate model outputs against structured rubrics, then rank and rewrite them so the model learns what "better" looks like. I write adversarial prompts to expose failure modes. I do multi-turn code reviews, meaning I follow an entire back-and-forth between a user and a model across five or ten turns, and assess whether the reasoning held up or quietly drifted off the rails somewhere in the middle.&lt;/p&gt;

&lt;p&gt;Less "AI whisperer." More "very opinionated senior reviewer who never runs out of things to flag."&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pattern That Bothers Me Most
&lt;/h2&gt;

&lt;p&gt;There's a category of bug I call "confident and wrong." The code compiles. It's readable. The variable names are sensible. It even has a comment explaining what it does. And it's still wrong. Not obviously wrong, but wrong in the way that only shows up under load, or with a specific input type, or after three other things happen first.&lt;/p&gt;

&lt;p&gt;Here's a real example. Prompt was something like: &lt;em&gt;"Write a function to fetch user details and cache the result."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The model produced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;object&lt;/span&gt; &lt;span class="nc"&gt;UserCache&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;cache&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HashMap&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;()&lt;/span&gt;

    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;getUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fetchFn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrPut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nf"&gt;fetchFn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clean. Concise. Totally broken in a concurrent environment.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;HashMap&lt;/code&gt; isn't thread-safe. Two coroutines calling &lt;code&gt;getOrPut&lt;/code&gt; simultaneously on the same key can corrupt the map. The model didn't add a mutex, didn't suggest &lt;code&gt;ConcurrentHashMap&lt;/code&gt;, didn't even mention the assumption that this runs single-threaded. It just wrote code that works in the demo and fails in production.&lt;/p&gt;

&lt;p&gt;The correct version uses &lt;code&gt;ConcurrentHashMap&lt;/code&gt; or wraps access with a &lt;code&gt;Mutex&lt;/code&gt; if you need atomic get-or-fetch semantics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;object&lt;/span&gt; &lt;span class="nc"&gt;UserCache&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;cache&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConcurrentHashMap&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;()&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;mutex&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Mutex&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;suspend&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;getUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fetchFn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;suspend&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;let&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mutex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withLock&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// double-checked after acquiring lock&lt;/span&gt;
            &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrPut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nf"&gt;fetchFn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model's version would pass code review at most places. That's what worries me.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Edge Case Problem Is Structural, Not Random
&lt;/h2&gt;

&lt;p&gt;After a few hundred evaluations, I stopped thinking of missed edge cases as oversights. They're structural. LLMs optimize for the problem as stated. If the prompt doesn't mention null inputs, concurrent access, or network timeouts, the model won't think about them either.&lt;/p&gt;

&lt;p&gt;Good engineers treat those as implied. You don't wait to be asked "what if this list is empty." You just handle it.&lt;/p&gt;

&lt;p&gt;Here are the categories where models fail most consistently:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrency.&lt;/strong&gt; Single-threaded assumptions that explode under real-world load. The &lt;code&gt;HashMap&lt;/code&gt; example above is the most common flavor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure state propagation.&lt;/strong&gt; Functions that catch exceptions and return &lt;code&gt;null&lt;/code&gt; or &lt;code&gt;false&lt;/code&gt;, then callers that don't check the return value, and the whole chain silently fails. The model gets each function right in isolation. It gets the composition wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource cleanup.&lt;/strong&gt; Network connections, file handles, database cursors left open because the happy path worked and nobody wrote the &lt;code&gt;finally&lt;/code&gt; block or used the right scoping construct.&lt;/p&gt;
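&lt;p&gt;Of those, resource cleanup is the easiest to show. A minimal sketch (the function is my own, invented for illustration): Kotlin's &lt;code&gt;use&lt;/code&gt; is the scoping construct that closes the handle on both the happy path and the exception path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;import java.io.File

// `use` closes the reader whether readLine() succeeds or throws,
// which is exactly the cleanup the happy-path-only version forgets.
fun firstLine(path: String): String? =
    File(path).bufferedReader().use { reader -&amp;gt;
        reader.readLine()
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;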

&lt;p&gt;&lt;strong&gt;Behavioral drift across turns.&lt;/strong&gt; In turn 1, the model sets up a class a certain way. By turn 4, after a few "can you refactor this" prompts, it has made changes that contradict the original design without acknowledging it. The code still runs. The architecture is now inconsistent in ways that will cause problems in six months.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Actually Look For in a Code Review
&lt;/h2&gt;

&lt;p&gt;My rubric has eight criteria. The ones that surface the most issues:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correctness under adversarial input.&lt;/strong&gt; Not "does it work with the example." Does it work when the input is empty, null, malformed, enormous, or concurrent? I'll trace through a model's code in my head with the worst inputs I can think of before scoring it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explicitness of assumptions.&lt;/strong&gt; Code that works is not the same as code that communicates its constraints. If a function assumes its input is sorted, that needs to be in a comment, a precondition check, or the function name. The model almost never does this unprompted.&lt;/p&gt;
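&lt;p&gt;A small sketch of what "explicit" looks like in practice (the function is invented):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// The sorted-input assumption lives in the name, the parameter, and a check.
// Note the require() scan is O(n); in a hot path you might demote it to a
// debug-only assertion and rely on the name and comment alone.
fun sortedListContains(sorted: List&amp;lt;Int&amp;gt;, target: Int): Boolean {
    require(sorted.zipWithNext().all { (a, b) -&amp;gt; a &amp;lt;= b }) { "input must be sorted" }
    return sorted.binarySearch(target) &amp;gt;= 0
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;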

&lt;p&gt;&lt;strong&gt;Error handling that means something.&lt;/strong&gt; There's a specific anti-pattern I call "error theater":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="c1"&gt;// This is not error handling. This is error cosplay.&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;result&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;riskyOperation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;e&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"TAG"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Something went wrong"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It looks like error handling. It isn't. The caller has no information. The system has no way to recover. The log message gets ignored. Good error handling changes what the caller can do. It doesn't just muffle the crash.&lt;/p&gt;
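&lt;p&gt;For contrast, here's a sketch of the same call rewritten so the caller can actually act. It reuses &lt;code&gt;riskyOperation()&lt;/code&gt; from the snippet above, and the two-bucket error taxonomy is my own invention for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;import java.io.IOException

sealed interface FetchResult {
    data class Success(val value: String) : FetchResult
    data class Failure(val cause: Exception, val retryable: Boolean) : FetchResult
}

// riskyOperation() is the same call as in the snippet above.
// The caller now knows what failed and whether retrying can help,
// instead of receiving a null that means nothing.
fun fetchSafely(): FetchResult =
    try {
        FetchResult.Success(riskyOperation())
    } catch (e: IOException) {
        FetchResult.Failure(e, retryable = true)   // transient failure
    } catch (e: IllegalStateException) {
        FetchResult.Failure(e, retryable = false)  // programming error
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;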

&lt;p&gt;&lt;strong&gt;Security surface.&lt;/strong&gt; SQL construction via string interpolation, credentials in code comments, user input passed to shell commands without sanitization. These come up. Not constantly, but often enough that I check every time.&lt;/p&gt;
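&lt;p&gt;The SQL case is the most mechanical of those to fix, which is why it's frustrating to keep flagging. A hedged sketch in plain JDBC (table and column names invented):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;import java.sql.Connection

// String interpolation ("... WHERE name = '$name'") lets user input rewrite
// the query. A parameterized statement keeps input as data, never as SQL.
fun findUserId(conn: Connection, name: String): Long? =
    conn.prepareStatement("SELECT id FROM users WHERE name = ?").use { stmt -&amp;gt;
        stmt.setString(1, name)
        stmt.executeQuery().use { rs -&amp;gt; if (rs.next()) rs.getLong("id") else null }
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;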




&lt;h2&gt;
  
  
  The Skill That Transferred Back
&lt;/h2&gt;

&lt;p&gt;I didn't expect this job to change how I write code. It did.&lt;/p&gt;

&lt;p&gt;Spending eight hours a day articulating &lt;em&gt;why&lt;/em&gt; something is wrong, not just flagging it but writing a clear explanation that a model can actually learn from, builds a habit of internal interrogation that's hard to turn off.&lt;/p&gt;

&lt;p&gt;Now, before I submit a PR, I run my own rubric. Is this thread-safe? What happens on retry? Who owns cleanup? Does this function do what its name says, or has it quietly acquired a second responsibility?&lt;/p&gt;

&lt;p&gt;That last one is underrated. Functions that do two things are where bugs live. The AI writes them constantly because function names get generated from the prompt context, and prompts often have two goals. "Fetch and validate" is two functions pretending to be one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where AI Code Actually Shines
&lt;/h2&gt;

&lt;p&gt;I've been critical, so let me be fair.&lt;/p&gt;

&lt;p&gt;AI-generated code is genuinely good at boilerplate. Serialization logic, configuration parsing, test scaffolding, adapters between interfaces that differ only in naming. Tedious work that models handle well. If I ask for a Room database entity with a DAO and a repository, the output is usually solid and saves thirty minutes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="c1"&gt;// This kind of scaffolding? Models nail it.&lt;/span&gt;
&lt;span class="nd"&gt;@Entity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tableName&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"users"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;data class&lt;/span&gt; &lt;span class="nc"&gt;UserEntity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nd"&gt;@PrimaryKey&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;currentTimeMillis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@Dao&lt;/span&gt;
&lt;span class="kd"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;UserDao&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT * FROM users WHERE id = :userId"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;suspend&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;getUserById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;UserEntity&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;

    &lt;span class="nd"&gt;@Insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;onConflict&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OnConflictStrategy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;REPLACE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;suspend&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;insertUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;UserEntity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Models are also good at surfacing options I'd forgotten about. Not because they know my codebase, but because they've seen enough code to suggest a &lt;code&gt;StateFlow&lt;/code&gt; where I was reaching for &lt;code&gt;LiveData&lt;/code&gt;, or a &lt;code&gt;runCatching&lt;/code&gt; in a context where it genuinely fits.&lt;/p&gt;

&lt;p&gt;The mistake is treating it as something that reasons about your system. It doesn't know your system. It knows patterns. Those overlap most of the time and fail in ways that aren't obvious the other times.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Wrote This
&lt;/h2&gt;

&lt;p&gt;A few months ago I started noticing that engineers I respect were shipping AI-generated code without reviewing it seriously. Not because they're lazy. Because the code looked fine. That's the problem. It's calibrated to look fine.&lt;/p&gt;

&lt;p&gt;The engineers who work well with AI tooling treat it the way experienced engineers treat a junior developer: capable, useful, not fully trusted without review, and prone to specific failure patterns you learn over time.&lt;/p&gt;

&lt;p&gt;That framing changed how I work with it. I think it'll change how you work with it, too.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Jaskaran Singh is a Senior Software Engineer working in AI training and evaluation. Previously built Android fintech apps at Comviva Technologies and Talentica Software. Currently building a Python-based OINP immigration monitoring bot on the side, because immigration status shouldn't require manually refreshing government websites.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Find me on &lt;a href="https://linkedin.com/in/jaskaranchana" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or at my &lt;a href="https://jaskaranchana.github.io/Portfolio/" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>career</category>
      <category>android</category>
      <category>kotlin</category>
    </item>
  </channel>
</rss>
