<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Olaleye Aanuoluwapo Kayode</title>
    <description>The latest articles on Forem by Olaleye Aanuoluwapo Kayode (@d_great_oak).</description>
    <link>https://forem.com/d_great_oak</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1231996%2F3698c46b-199f-4bdd-9999-3c433b9baae7.jpg</url>
      <title>Forem: Olaleye Aanuoluwapo Kayode</title>
      <link>https://forem.com/d_great_oak</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/d_great_oak"/>
    <language>en</language>
    <item>
      <title>From Chatbots to Agents: 5 Architecture Shifts Breaking the "Stochastic Parrot"</title>
      <dc:creator>Olaleye Aanuoluwapo Kayode</dc:creator>
      <pubDate>Tue, 10 Mar 2026 10:45:43 +0000</pubDate>
      <link>https://forem.com/d_great_oak/from-chatbots-to-agents-5-architecture-shifts-breaking-the-stochastic-parrot-f2p</link>
      <guid>https://forem.com/d_great_oak/from-chatbots-to-agents-5-architecture-shifts-breaking-the-stochastic-parrot-f2p</guid>
      <description>&lt;p&gt;Read the original research paper &lt;a href="https://arxiv.org/abs/2510.04871" rel="noopener noreferrer"&gt;here&lt;br&gt;
&lt;/a&gt;&lt;br&gt;
For the last couple of years, the machine learning community has been playing the exact same game: pour more data and compute into a transformer, and watch it get smarter. It’s the classic "Scaling Law" playbook, and honestly, it worked incredibly well.&lt;/p&gt;

&lt;p&gt;Until it didn’t.&lt;/p&gt;

&lt;p&gt;As we try to push LLMs into open-ended, messy, real-world environments, we are hitting a hard ceiling. We’re officially watching the end of the "stochastic parrot" era: the phase where models just give us passive, one-shot predictions based on whatever static prompt we hand them.&lt;/p&gt;

&lt;p&gt;The frontier has moved. We aren't just scaling model size anymore; we are scaling test-time interaction. AI is moving away from being a reactive text generator and turning into an autonomous agent that can actually think, verify, and act.&lt;/p&gt;

&lt;p&gt;As a systems engineer looking at how we build and deploy these things, this transition completely rewrites our infrastructure playbook. I just finished digging into the research on agentic reasoning, and here are my five biggest takeaways on what this means for the systems we’re building next.&lt;/p&gt;

&lt;p&gt;1. Moving from "Guessing" to the "Think-Act" Loop&lt;br&gt;
Traditional models are basically autocomplete on steroids. You ask a question, and the model blurts out the most statistically probable next words in a single pass.&lt;/p&gt;

&lt;p&gt;Agentic systems completely flip this on its head by separating the internal "thinking" from the external "doing."&lt;/p&gt;

&lt;p&gt;Think of traditional AI like someone shouting out the first answer that pops into their head. Agentic AI is more like a professional drafting an email, reading it over, spotting a mistake, fixing it, and then finally hitting send.&lt;/p&gt;

&lt;p&gt;Instead of just spitting out an answer, the model uses an internal scratchpad (a latent reasoning space). It plans ahead, catches its own potential failures, and verifies its logic before making a move. Reasoning isn't just a side effect of generating text anymore; it’s the core engine of the system.&lt;/p&gt;
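&lt;p&gt;To make that loop concrete, here is a minimal sketch. Everything in it is an illustrative stand-in (the generate and verify functions fake a model call and a self-check, and "draft-2" is just the toy's passing answer); it only shows the control flow: draft, verify, revise, and only then act.&lt;/p&gt;

```python
def generate(prompt, attempt):
    """Stand-in for a model call: returns the next candidate answer."""
    return f"draft-{attempt}"

def verify(answer):
    """Stand-in for self-verification (a critique pass, a unit test, etc.).
    In this toy, only the third draft passes."""
    return answer == "draft-2"

def think_act_loop(prompt, max_steps=5):
    # Draft, check, revise: reasoning happens *before* anything is emitted.
    for attempt in range(max_steps):
        candidate = generate(prompt, attempt)
        if verify(candidate):
            return candidate   # only a verified answer leaves the loop
    return None                # give up rather than emit a bad answer

answer = think_act_loop("some question")
```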

&lt;p&gt;2. AI That Codes Its Own Tools (The LATM Framework)&lt;br&gt;
Right now, if an LLM doesn't have a specific API for a task, it just apologizes and fails. It is stuck in a closed loop.&lt;/p&gt;

&lt;p&gt;Agentic reasoning breaks us out of that through "Self-evolving Tool-use." When a powerful agent hits a bottleneck it can't solve, it doesn't just give up. It autonomously writes a new Python script to solve the problem, packages it up as a function, and hands it off to a smaller, cheaper model to actually run.&lt;/p&gt;

&lt;p&gt;Instead of handing the AI a pre-written API, we are essentially teaching it how to build its own tools on the fly.&lt;/p&gt;

&lt;p&gt;The Engineering Reality: From a backend perspective, this is both amazing and a complete nightmare. If an AI is writing and executing its own scripts in real-time to solve edge cases, how do we handle CI/CD? We are going to have to engineer entirely new, ultra-secure dynamic sandboxes just to let these agents experiment without taking down production.&lt;/p&gt;
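&lt;p&gt;Here is a toy illustration of the execution side of that idea. The generated tool source and the builtins whitelist are both invented for this sketch, and a whitelist is emphatically not a real sandbox; production systems need full process or VM isolation. It only shows the shape of the handoff: source in, restricted namespace, function out.&lt;/p&gt;

```python
# Hypothetical tool source, as if emitted by a "tool-maker" agent.
GENERATED_TOOL = '''
def solve(numbers, target):
    """Find two numbers in `numbers` that sum to `target`."""
    seen = set()
    for n in numbers:
        if target - n in seen:
            return (target - n, n)
        seen.add(n)
    return None
'''

# Crude stand-in for sandboxing: expose only whitelisted builtins.
ALLOWED_BUILTINS = {"set": set, "len": len, "range": range}

def run_generated_tool(source, func_name, *args):
    # Execute the agent-written source in a namespace with restricted
    # builtins, then call the freshly defined function.
    namespace = {"__builtins__": ALLOWED_BUILTINS}
    exec(source, namespace)
    return namespace[func_name](*args)

result = run_generated_tool(GENERATED_TOOL, "solve", [3, 9, 4, 7], 10)
```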

&lt;p&gt;3. We Need to Stop Hoarding Data (Optimized Forgetting)&lt;br&gt;
If you work in AI right now, you know everyone is obsessed with RAG (Retrieval-Augmented Generation). We just keep shoving more and more embeddings into massive vector databases. But this paper suggests we’re looking at memory all wrong.&lt;/p&gt;

&lt;p&gt;In an agentic system, memory isn't just a passive storage bucket. The model actively learns to manipulate it.&lt;/p&gt;

&lt;p&gt;Using frameworks like Memory-R1, agents use reinforcement learning to manage their own cognitive load. A "Memory Manager" figures out what to keep, update, or, crucially, delete to reduce noise, while an "Answer Agent" uses what's left to actually solve the problem.&lt;/p&gt;

&lt;p&gt;The Engineering Reality: This proves that "optimized forgetting" is the actual future. We need to stop building bottomless storage buckets and start engineering intelligent memory filters.&lt;/p&gt;
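&lt;p&gt;Memory-R1 learns its add/update/delete policy with RL; the sketch below is only a hand-scored heuristic stand-in, meant to show the interface of a bounded store that evicts low-value entries instead of hoarding everything.&lt;/p&gt;

```python
class ToyMemoryManager:
    """Heuristic stand-in for a learned Memory Manager: a bounded store
    that deletes the least useful entries. (Memory-R1 learns this policy
    with RL; the usefulness scores here are set by hand.)"""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}  # key -> (text, usefulness score)

    def write(self, key, text, score):
        self.store[key] = (text, score)  # ADD, or UPDATE if key exists
        # DELETE: "optimized forgetting" evicts the lowest-scoring memory
        while len(self.store) > self.capacity:
            worst = min(self.store, key=lambda k: self.store[k][1])
            del self.store[worst]

mem = ToyMemoryManager(capacity=2)
mem.write("a", "user prefers JSON output", 0.9)
mem.write("b", "smalltalk about the weather", 0.1)
mem.write("c", "deployment target is Lambda", 0.8)
# "b" has been forgotten; only the two high-value memories remain.
```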

&lt;p&gt;4. Agents Talking to Agents&lt;br&gt;
We are moving from isolated chatbots to collaborative ecosystems (Multi-Agent Systems). In this setup, one agent’s output isn't just text for a human to read; it’s a prompt that triggers the internal thought process of another agent.&lt;/p&gt;

&lt;p&gt;You end up with specialized roles: Coordinators breaking down tasks, Executors writing the code, and Evaluators auditing the work for logic flaws.&lt;/p&gt;
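&lt;p&gt;Those three roles can be sketched as a tiny pipeline. The "agents" here are plain stand-in functions; in a real system each would be a model call, but the wiring is the point: plan, act, audit.&lt;/p&gt;

```python
def coordinator(task):
    """Plan: break the task into subtasks."""
    return [f"{task} step {i}" for i in (1, 2)]

def executor(subtask):
    """Act: produce work for a subtask (here, a trivial transform)."""
    return subtask.upper()

def evaluator(output):
    """Audit: accept only outputs that pass a check."""
    return "STEP" in output

def run_team(task):
    results = []
    for sub in coordinator(task):   # one agent's output becomes...
        work = executor(sub)        # ...another agent's prompt,
        if evaluator(work):         # and a third audits the result
            results.append(work)
    return results

outputs = run_team("deploy")
```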

&lt;p&gt;The Engineering Reality: This completely breaks how we handle AI safety right now. Today, we mostly just filter "bad text" right before it reaches the user. But if agents are planning long-term goals and secretly communicating with each other in the background, those text filters are useless. We have to figure out how to audit their reasoning loops, not just their final outputs.&lt;/p&gt;

&lt;p&gt;5. Thinking Harder, Not Just Training Longer&lt;br&gt;
For me, the biggest takeaway is the shift toward scaling test-time compute.&lt;/p&gt;

&lt;p&gt;Instead of trying to teach the model absolutely everything before it's deployed (offline pre-training), we are building systems that spend more compute during inference. It's the difference between studying for a year but only having 5 minutes to take a test, versus studying for a month but having 5 hours to carefully work through every question.&lt;/p&gt;

&lt;p&gt;The industry is moving toward GRPO (Group Relative Policy Optimization). Instead of needing a massive, separate "judge" model to grade the AI's homework, GRPO lets the model learn complex reasoning just by comparing a bunch of its own generated answers to see which path works best.&lt;/p&gt;
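&lt;p&gt;The core of that trick fits in a few lines. This is my simplified reading of the group-relative scoring step only, not the full GRPO algorithm: each sampled answer is scored against its own group's mean and standard deviation, so no separate judge model is needed.&lt;/p&gt;

```python
def group_relative_advantages(rewards):
    """GRPO's central idea, stripped down: normalize each sampled
    answer's reward by the group's own mean and std, instead of
    training a separate value/judge model."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        return [0.0] * n  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one prompt: two failed (0), two passed (1).
adv = group_relative_advantages([0, 1, 0, 1])
```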

&lt;p&gt;The Architecture Shift: A Quick Reference&lt;br&gt;
To visualize how drastically our infrastructure demands are changing, here is how the static Chatbot era compares to the new Agentic reality:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;The Chatbot Era&lt;/th&gt;&lt;th&gt;The Agent Era&lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;How it computes&lt;/td&gt;&lt;td&gt;Single forward pass&lt;/td&gt;&lt;td&gt;Multi-step search &amp;amp; reasoning&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;How it learns&lt;/td&gt;&lt;td&gt;Offline pre-training&lt;/td&gt;&lt;td&gt;Continual &amp;amp; self-evolving&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Memory&lt;/td&gt;&lt;td&gt;Short-term context window&lt;/td&gt;&lt;td&gt;State tracking &amp;amp; memory editing&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Goal&lt;/td&gt;&lt;td&gt;Reactive (you prompt, it answers)&lt;/td&gt;&lt;td&gt;Explicit, long-term planning&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The Bottom Line for Engineers&lt;br&gt;
Agentic reasoning is taking us from building "text interfaces" to building actual "functional partners."&lt;/p&gt;

&lt;p&gt;But the technical hurdles are massive. We have to figure out how to keep a model coherent over weeks of execution without its memory collapsing. And more importantly, we have to ask if our current governance and audit protocols are ready for systems that plan, learn, and collaborate without us in the loop.&lt;/p&gt;

&lt;p&gt;The era of the stochastic parrot is wrapping up. It’s time to start building the agents.&lt;/p&gt;

&lt;p&gt;Originally published on &lt;a href="https://medium.com/@olaleyeaanuoluwapokay/why-we-need-to-stop-building-chatbots-4ca947db1239" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;. If you are building agentic systems or scaling AI infrastructure, let's connect via my &lt;a href="https://linktr.ee/oak.eth" rel="noopener noreferrer"&gt;Linktree&lt;/a&gt; to keep the conversation going!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>machinelearning</category>
      <category>generativeai</category>
    </item>
    <item>
      <title>Why I Moved My ML Model from Flask to AWS Lambda (A Student’s Guide to $0 Hosting)</title>
      <dc:creator>Olaleye Aanuoluwapo Kayode</dc:creator>
      <pubDate>Thu, 22 Jan 2026 04:41:35 +0000</pubDate>
      <link>https://forem.com/d_great_oak/why-i-moved-my-ml-model-from-flask-to-aws-lambda-a-students-guide-to-0-hosting-3mmf</link>
      <guid>https://forem.com/d_great_oak/why-i-moved-my-ml-model-from-flask-to-aws-lambda-a-students-guide-to-0-hosting-3mmf</guid>
      <description>&lt;p&gt;Some months ago, I built a machine learning model to predict diabetes risk using the Pima Indians Diabetes dataset. It was a standard student project: a Jupyter notebook, some Scikit-learn code, and a Random Forest Classifier. It worked perfectly on my laptop.&lt;/p&gt;

&lt;p&gt;But when I wanted to deploy it so a friend could actually use it, I hit a wall.&lt;/p&gt;

&lt;p&gt;The standard advice on the internet is usually "Just Dockerize it and run it on AWS Fargate" or "Spin up an EC2 instance and run a Flask server."&lt;/p&gt;

&lt;p&gt;For a student developer in Nigeria, that advice is dangerous.&lt;/p&gt;

&lt;p&gt;The "Always-On" Problem&lt;br&gt;
An EC2 instance is like a generator that you leave running 24/7 just in case someone wants to turn on a light switch once a week. You pay for it every second it runs.&lt;/p&gt;

&lt;p&gt;If I deployed my Diabetes Predictor on an EC2 t3.medium, it would cost me money even at 3:00 AM when nobody is using it. When you're operating on a student budget with unpredictable exchange rates, "low cost" isn't just a preference; it's a requirement.&lt;/p&gt;

&lt;p&gt;I needed an architecture that was "lazy": one that only wakes up when it has work to do.&lt;/p&gt;

&lt;p&gt;Step 1: The "Easy" Part (Training the Model)&lt;br&gt;
First, I trained the model using a Random Forest Classifier. This part was straightforward. I used the standard dataset from the UCI Machine Learning Repository.&lt;/p&gt;

&lt;p&gt;Here is the core of the training logic I used:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib

# Load dataset and preprocess
data_url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(data_url, header=None, names=columns)

X = df.drop('Outcome', axis=1)
y = df['Outcome']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Model training
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Sanity-check accuracy on the held-out split
print(accuracy_score(y_test, model.predict(X_test)))

# The most important part for deployment: persist the trained model
joblib.dump(model, "diabetes_predictor.pkl")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Most tutorials stop here. They tell you to save the model and... good luck. But a .pkl file on my hard drive doesn't help anyone.&lt;/p&gt;

&lt;p&gt;Step 2: The Pivot to Serverless&lt;br&gt;
I decided to rip out the idea of a dedicated server. I didn't need an OS; I just needed a place to run model.predict().&lt;/p&gt;

&lt;p&gt;Here is the architecture I designed to keep costs at effectively $0.00:&lt;/p&gt;

&lt;p&gt;The Storage: I uploaded my diabetes_predictor.pkl file to an Amazon S3 bucket. (Cost: Pennies).&lt;/p&gt;

&lt;p&gt;The Compute: I wrote a Python function using AWS Lambda.&lt;/p&gt;

&lt;p&gt;The Front Door: I connected it to Amazon API Gateway to give it a public URL.&lt;/p&gt;

&lt;p&gt;Step 3: The Deployment Code&lt;br&gt;
This was the tricky part. You can't just copy-paste your Jupyter Notebook into AWS Lambda. You have to write a "Handler" that knows how to talk to S3.&lt;/p&gt;

&lt;p&gt;I had to write a script that pulls the model from S3 into the Lambda's temporary storage (/tmp) before it can make a prediction.&lt;/p&gt;

&lt;p&gt;Here is the actual code running in my Lambda function (which is also in my GitHub repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import boto3
import joblib
import os
import numpy as np

# Initialize the S3 client once, outside the handler
s3 = boto3.client('s3')
BUCKET_NAME = os.environ.get('BUCKET_NAME', 'my-diabetes-model-bucket')
MODEL_FILE_NAME = os.environ.get('MODEL_FILE_NAME', 'diabetes_predictor.pkl')

def load_model_from_s3():
    """Downloads the model from S3 to the /tmp directory"""
    download_path = f'/tmp/{MODEL_FILE_NAME}'
    if not os.path.exists(download_path):
        s3.download_file(BUCKET_NAME, MODEL_FILE_NAME, download_path)
    return joblib.load(download_path)

model = None

def lambda_handler(event, context):
    global model

    # 1. Load the model if it's not ready (stays cached on warm starts)
    if model is None:
        model = load_model_from_s3()

    try:
        # 2. Parse the incoming JSON body
        body = json.loads(event['body'])

        # Extract features matching the training columns
        # (NOTE: if you scaled features at training time, persist that
        # scaler too and apply it here before predicting)
        features = np.array([[
            body['Pregnancies'], body['Glucose'], body['BloodPressure'],
            body['SkinThickness'], body['Insulin'], body['BMI'],
            body['DiabetesPedigreeFunction'], body['Age']
        ]])

        # 3. Predict
        prediction = model.predict(features)
        result = int(prediction[0])

        return {
            'statusCode': 200,
            'body': json.dumps({'prediction': result})
        }

    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps({'error': str(e)})
        }
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
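&lt;p&gt;Before paying for API Gateway round-trips, you can smoke-test the handler locally by faking the event it will receive. The payload values below are made-up sample inputs, not real patient data; the field names match the training columns.&lt;/p&gt;

```python
import json

# Build the same JSON body API Gateway would deliver to lambda_handler.
payload = {
    "Pregnancies": 2, "Glucose": 120, "BloodPressure": 70,
    "SkinThickness": 20, "Insulin": 79, "BMI": 25.4,
    "DiabetesPedigreeFunction": 0.52, "Age": 31,
}
event = {"body": json.dumps(payload)}

# Locally you would now run: response = lambda_handler(event, None)
# and inspect json.loads(response["body"])["prediction"] (0 or 1).
```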

&lt;p&gt;The "Ibadan" Constraint: Battling Latency&lt;br&gt;
The hardest part wasn't the code; it was the latency.&lt;/p&gt;

&lt;p&gt;When you use Serverless, the function "goes cold" if nobody uses it for a while. The next time someone clicks "Predict," AWS has to spin up the environment, download Python libraries, and load the model.&lt;/p&gt;

&lt;p&gt;On my first test, it took 4 seconds to get a result. On a slow 4G network, that felt like an eternity.&lt;/p&gt;

&lt;p&gt;The Fix: I learned a counter-intuitive trick from the AWS docs: Increase the memory to save time.&lt;/p&gt;

&lt;p&gt;I bumped the Lambda memory from 128MB to 512MB. I wasn't just getting more RAM; I was getting more CPU power. The function started loading in under 1.5 seconds.&lt;/p&gt;

&lt;p&gt;Because AWS bills by the millisecond, the faster function finished so much sooner that it actually cost me slightly less than the slower, weaker one.&lt;/p&gt;
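&lt;p&gt;As a back-of-the-envelope helper, Lambda's compute bill is roughly memory (in GB) times billed duration times a per-GB-second rate. The rate constant below is the published x86 price at the time of writing; always check current AWS pricing. The rule of thumb: a bigger allocation only pays for itself on the bill when the speedup outpaces the memory multiplier.&lt;/p&gt;

```python
# Rough Lambda cost model: GB allocated x billed seconds x unit price.
# Verify the rate against current AWS pricing before relying on it.
PRICE_PER_GB_SECOND = 0.0000166667

def invocation_cost(memory_mb, duration_s):
    """Approximate USD cost of a single invocation."""
    return (memory_mb / 1024) * duration_s * PRICE_PER_GB_SECOND

# Doubling memory doubles the per-second rate, so it only saves money
# if the function also runs more than twice as fast.
cost_a = invocation_cost(128, 4.0)   # small allocation, slow run
cost_b = invocation_cost(512, 1.5)   # big allocation, fast run
```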

&lt;p&gt;Conclusion&lt;br&gt;
This project taught me that Cloud Engineering isn't just about making things work. It's about making them viable for your specific constraints.&lt;/p&gt;

&lt;p&gt;By moving to Serverless, I built an app that can scale to thousands of users but costs me nothing when it’s idle. For developers in emerging markets, mastering these "frugal" architectures is a superpower.&lt;/p&gt;

&lt;p&gt;Check out the full code on my GitHub: &lt;a href="https://github.com/OAKVISUALZ/Prediction-of-Diabetes/" rel="noopener noreferrer"&gt;https://github.com/OAKVISUALZ/Prediction-of-Diabetes/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>python</category>
      <category>serverless</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
