<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nicanor Korir</title>
    <description>The latest articles on Forem by Nicanor Korir (@nicanor_korir).</description>
    <link>https://forem.com/nicanor_korir</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F80291%2F84a281c4-7b95-4d50-841a-fb51cb7e9610.jpg</url>
      <title>Forem: Nicanor Korir</title>
      <link>https://forem.com/nicanor_korir</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nicanor_korir"/>
    <language>en</language>
    <item>
      <title>VibeCheck - Community Help for AI Builders</title>
      <dc:creator>Nicanor Korir</dc:creator>
      <pubDate>Sun, 01 Mar 2026 16:47:41 +0000</pubDate>
      <link>https://forem.com/nicanor_korir/vibecheck-community-help-for-ai-builders-27e2</link>
      <guid>https://forem.com/nicanor_korir/vibecheck-community-help-for-ai-builders-27e2</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/weekend-2026-02-28"&gt;DEV Weekend Challenge: Community&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Community
&lt;/h2&gt;

&lt;p&gt;VibeCheck serves &lt;strong&gt;"vibe coders"&lt;/strong&gt; - the growing wave of non-traditional builders using AI tools like Cursor, Bolt, Lovable, and Replit to create real products. They're not "learning to code" - they're building businesses, MVPs, and passion projects with AI assistance.&lt;/p&gt;

&lt;p&gt;These builders inevitably hit the "70% wall" where AI can't finish the job. When things break:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vendor solutions assume they know how to navigate complex codebases&lt;/li&gt;
&lt;li&gt;AI tools keep repeating the same broken solutions&lt;/li&gt;
&lt;li&gt;No existing community understands their unique experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They deserve a community that meets them where they are.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;VibeCheck is a community platform that helps vibe coders get unstuck and learn from each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI Triage Coach&lt;/strong&gt; - Describe your problem in simple words. Get a "rescue prompt" optimized for your AI tool - not code fixes, but better questions to ask your AI.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Screenshot Analysis&lt;/strong&gt; - Paste a screenshot of your error. AI vision analyzes it and helps you describe the problem clearly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prompt Quality Feedback&lt;/strong&gt; - Real-time scoring helps you write better problem descriptions before asking for help.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Community Help&lt;/strong&gt; - Post requests, get suggestions from other users, earn points for helping and interacting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prompt Library&lt;/strong&gt; - Share "what didn't work → what worked" prompt swaps. Learn from others' breakthroughs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Video Demo:&lt;/strong&gt; &lt;a href="https://www.loom.com/share/298b6c55fb7b4bde9cee9bdff6c9c95e" rel="noopener noreferrer"&gt;Check out this Loom VibeCheck demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live App:&lt;/strong&gt; &lt;a href="https://vibecheck-community.vercel.app/" rel="noopener noreferrer"&gt;https://vibecheck-community.vercel.app/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/nicanor-korir" rel="noopener noreferrer"&gt;
        nicanor-korir
      &lt;/a&gt; / &lt;a href="https://github.com/nicanor-korir/vibecheck-community" rel="noopener noreferrer"&gt;
        vibecheck-community
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      VibeCheck - Community Help for AI Builders
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;VibeCheck&lt;/h1&gt;
&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Got 70% done with AI? We'll help you finish the last 30%.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A community platform for vibe coders using Cursor, Bolt, Lovable, Replit, and more. Built for the &lt;a href="https://dev.to/challenges/weekend-2026-02-28" rel="nofollow"&gt;DEV Weekend Challenge&lt;/a&gt;.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;The Problem&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;"Vibe coding" has exploded — the global market for vibe coding platforms is now &lt;a href="https://mktclarity.com/blogs/news/vibe-coding-market" rel="nofollow noopener noreferrer"&gt;$4.7 billion&lt;/a&gt;, with &lt;a href="https://www.secondtalent.com/resources/vibe-coding-statistics/" rel="nofollow noopener noreferrer"&gt;63% of users being non-developers&lt;/a&gt;. But there's a critical gap:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"Non-engineers can get 70% of the way there surprisingly quickly, but that final 30% becomes an exercise in diminishing returns."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When something breaks, non-developers are stuck:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://addyo.substack.com/p/the-70-problem-hard-truths-about" rel="nofollow noopener noreferrer"&gt;66% of developers&lt;/a&gt; say AI solutions are "almost right, but not quite" — leading to time-consuming debugging&lt;/li&gt;
&lt;li&gt;Stuck in loops: copy error, paste, get new error, repeat ("The fix breaks something else. You ask AI to fix that. It creates two more problems.")&lt;/li&gt;
&lt;li&gt;Stack Overflow feels intimidating, YouTube tutorials don't answer specific questions&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;The Solution&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;We don't replace your AI&lt;/strong&gt;…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/nicanor-korir/vibecheck-community" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




&lt;h2&gt;
  
  
  How I Built It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Next.js&lt;/strong&gt; - React framework with App Router&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supabase&lt;/strong&gt; - Authentication, PostgreSQL database, Row Level Security&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Gemini&lt;/strong&gt; - AI triage coaching and vision-based screenshot analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TypeScript&lt;/strong&gt; - Type-safe development&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tailwind CSS&lt;/strong&gt; - Styling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vercel&lt;/strong&gt; - Deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Implementation Details:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI Triage generates "rescue prompts" tailored to specific AI tools (Cursor, Bolt, Lovable, etc.)&lt;/li&gt;
&lt;li&gt;Vision API analyzes error screenshots and translates them into clear problem descriptions&lt;/li&gt;
&lt;li&gt;Real-time prompt quality scoring uses pattern matching to help users write better requests (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Community features include points, rewards for helping, and gamification&lt;/li&gt;
&lt;/ul&gt;
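
&lt;p&gt;The scoring code itself isn't in this post, but a minimal sketch of the pattern-matching idea could look like the following (the checks, weights, and hints here are illustrative, not the production rules):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative sketch of pattern-based prompt scoring (not the production rules).
interface PromptScore {
  score: number;        // 0-100
  suggestions: string[];
}

const CHECKS = [
  { pattern: /error|exception|fails?/i, points: 25, hint: "Describe the exact error you see." },
  { pattern: /cursor|bolt|lovable|replit/i, points: 25, hint: "Mention which AI tool you are using." },
  { pattern: /expected|should|instead/i, points: 25, hint: "Say what you expected to happen." },
  { pattern: /tried|already|attempted/i, points: 25, hint: "List what you have already tried." },
];

function scorePrompt(text: string): PromptScore {
  let score = 0;
  const suggestions: string[] = [];
  for (const check of CHECKS) {
    if (check.pattern.test(text)) {
      score += check.points;
    } else {
      suggestions.push(check.hint);
    }
  }
  return { score, suggestions };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;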

</description>
      <category>devchallenge</category>
      <category>weekendchallenge</category>
      <category>showdev</category>
      <category>vibecoding</category>
    </item>
    <item>
      <title>Building an interactive robotics portfolio</title>
      <dc:creator>Nicanor Korir</dc:creator>
      <pubDate>Thu, 29 Jan 2026 20:03:03 +0000</pubDate>
      <link>https://forem.com/nicanor_korir/from-kenya-to-germany-building-an-interactive-robotics-portfolio-4gdn</link>
      <guid>https://forem.com/nicanor_korir/from-kenya-to-germany-building-an-interactive-robotics-portfolio-4gdn</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/new-year-new-you-google-ai-2025-12-31"&gt;New Year, New You Portfolio Challenge Presented by Google AI&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  About Me
&lt;/h2&gt;

&lt;p&gt;I'm Nicanor Korir, a techie who grew up in the highland regions of Kenya, fell in love with computers, and became a software engineer. Looking back, I would say &lt;a href="https://www.brainyquote.com/quotes/steve_jobs_416875" rel="noopener noreferrer"&gt;Steve Jobs was right&lt;/a&gt;: you can only connect the dots looking backwards.&lt;/p&gt;

&lt;p&gt;My tech journey is an interesting one. I first interacted with computers at the age of 18, after high school. During the break after high school, I decided to learn basic computer skills, simply to get familiar with using a computer. It turned out to be so absorbing that I spent hours figuring out different things. That's when I decided to pursue something computer-related for my degree, which led me to computer science.&lt;/p&gt;

&lt;p&gt;Studying computer science in my first and second years was hard since everything except maths and physics was new, but I enjoyed it. In my third year I focused more on software, fell in love with software engineering, and the rest is history.&lt;/p&gt;

&lt;p&gt;Fast forward to today: I've worked with clients in different industries, solving human problems through technology. Currently, I am doing my master's in AI and robotics, and with the current wave of tech trends there has been a lot for me to learn about intelligent, interactive systems for the future.&lt;/p&gt;

&lt;p&gt;Alongside my part-time studies, I am working as CTO at Alma, a startup backed by AI Nation in Berlin, Germany. This is where I've applied my skills as a leader, helping build a solution that supports GBV survivors as they navigate their daily lives.&lt;/p&gt;

&lt;p&gt;This portfolio is a way for me to share my tech journey in a form that says more about who I am and lets visitors relate to me.&lt;/p&gt;

&lt;p&gt;My portfolios have been changing with time. This Dev Challenge came at the right time, as I've been thinking of setting myself up for my next transition from a student to a professional.&lt;/p&gt;

&lt;h2&gt;
  
  
  Portfolio
&lt;/h2&gt;


&lt;div class="ltag__cloud-run"&gt;
  &lt;iframe height="600px" src="https://nicanor-170395639051.europe-west3.run.app/"&gt;
  &lt;/iframe&gt;
&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If the embed doesn't load, visit &lt;a href="https://nicanor-170395639051.europe-west3.run.app/" rel="noopener noreferrer"&gt;nicanor-170395639051.europe-west3.run.app&lt;/a&gt; directly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How I Built It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Tech Stack
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Framework&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Next.js 16 (App Router)&lt;/td&gt;
&lt;td&gt;Server components, streaming, edge-ready&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini 2.0 Flash&lt;/td&gt;
&lt;td&gt;Fast, accurate, cost-effective&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3D&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;React Three Fiber&lt;/td&gt;
&lt;td&gt;Cyberpunk vision scanner aesthetic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Animation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Framer Motion&lt;/td&gt;
&lt;td&gt;Smooth, physics-based transitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zustand&lt;/td&gt;
&lt;td&gt;Lightweight, persists user preferences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Styling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;TailwindCSS 4&lt;/td&gt;
&lt;td&gt;Design tokens, responsive by default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Cloud Run&lt;/td&gt;
&lt;td&gt;Great DX&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Deployment
&lt;/h3&gt;

&lt;p&gt;I played around with Google Cloud Run; it had been a while since I'd explored the Google Cloud environment.&lt;/p&gt;

&lt;p&gt;First, I checked out all the services available with the free $300 credits. Then I dove into the Cloud Run console and CLI. I always love the Google Cloud docs; there are guides for both the &lt;code&gt;gcloud&lt;/code&gt; CLI and the &lt;code&gt;console&lt;/code&gt;. My first deployment through the gcloud CLI landed in &lt;code&gt;us-central1&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;First, authenticate and set the project ID in Google Cloud:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Authenticate with Google Cloud&lt;/span&gt;
gcloud auth login

&lt;span class="c"&gt;# Set your project&lt;/span&gt;
gcloud config &lt;span class="nb"&gt;set &lt;/span&gt;project YOUR_PROJECT_ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then deploy from the portfolio folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy nicanor-portfolio &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-central1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-env-vars&lt;/span&gt; &lt;span class="s2"&gt;"GEMINI_API_KEY=api_key,NODE_ENV=production,NEXT_PUBLIC_GA_MEASUREMENT_ID=G-XXX"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--labels&lt;/span&gt; dev-tutorial&lt;span class="o"&gt;=&lt;/span&gt;devnewyear2026 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt; 512Mi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But then I thought, why am I deploying to the US when I'm sitting in Berlin? So I switched to the closest region, Frankfurt (&lt;code&gt;europe-west3&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy nicanor-portfolio &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; europe-west3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-env-vars&lt;/span&gt; &lt;span class="s2"&gt;"GEMINI_API_KEY=,NODE_ENV=production,NEXT_PUBLIC_GA_MEASUREMENT_ID="&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;
  &lt;span class="nt"&gt;--labels&lt;/span&gt; dev-tutorial&lt;span class="o"&gt;=&lt;/span&gt;devnewyear2026
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The latency difference was noticeable.&lt;/p&gt;

&lt;p&gt;I got excited and tried to set up a custom subdomain (&lt;code&gt;nicanor.mydomain.com&lt;/code&gt; - this is just an example, I can't reveal the unfinished configuration yet). Unfortunately, Frankfurt doesn't support Cloud Run domain mapping 🙃&lt;/p&gt;

&lt;p&gt;So I had two options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Switch back to a different region that supports domain mapping&lt;/li&gt;
&lt;li&gt;Update my DNS configurations manually&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Eventually, I decided to roll with the Cloud Run URL directly. I'll configure the subdomain later.&lt;/p&gt;

&lt;h3&gt;
  
  
  How I Used Gemini
&lt;/h3&gt;

&lt;p&gt;Backstory: my initial coding agent was &lt;code&gt;Claude&lt;/code&gt;, especially for the UI and the first building blocks. I ran into issues integrating &lt;code&gt;gemini3&lt;/code&gt;, since &lt;code&gt;claude&lt;/code&gt; doesn't have much context on gemini3 (only gemini2.0-flash), so I switched to using &lt;code&gt;gemini&lt;/code&gt; as my coding agent in the terminal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv7ucv4w3hu3mj2e6n6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv7ucv4w3hu3mj2e6n6h.png" alt=" " width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of dumping information and hoping visitors find what they need, I decided to go for an interactive and surprise format. The first thing you see is a question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"What brings you here?"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check my work → I'll show you metrics, experience, and achievements&lt;/li&gt;
&lt;li&gt;Building something? → Let me show you my technical leadership&lt;/li&gt;
&lt;li&gt;Fellow engineer? → Let's dive into the architecture&lt;/li&gt;
&lt;li&gt;Just curious? → Welcome! Here's my story from Kenya to Berlin&lt;/li&gt;
&lt;li&gt;Something else? → Tell me, and Gemini will figure out what's relevant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once a user chooses a path, Gemini generates the nodes and a summary of my journey so that the user gets an overview of me. I also compiled a list of common pre-generated questions, which Gemini uses for the nodes (the summary road map) and for the intelligent chatbot.&lt;/p&gt;

&lt;h3&gt;
  
  
  The AI Architecture
&lt;/h3&gt;

&lt;p&gt;For my LLM (Gemini) usage, I didn't want the traditional approach, the plain server-client flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User asks a question → Call Gemini → Return response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, but expensive. Every visitor burns API credits.&lt;/p&gt;

&lt;p&gt;I built a &lt;strong&gt;hybrid chat system&lt;/strong&gt; that gives the user a better response. When a user opens the chatbot, they are greeted with a message from Gemini, and can then pick one of the pre-generated questions or type their own. The response is fetched either from the pre-generated JSON data or from Gemini. This is the user flow for the chatbot interaction:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqmcjzr8axafpu9d50jy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqmcjzr8axafpu9d50jy.png" alt=" " width="800" height="1203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I initially started with &lt;code&gt;gemini-2.0-flash&lt;/code&gt;, which worked very well, and later switched to &lt;code&gt;gemini-3.0-flash-preview&lt;/code&gt; to try out gemini3, especially its improved intelligence. It worked well, and I am happy with my progress.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faf8es33kkga7baeh0pl8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faf8es33kkga7baeh0pl8.png" alt=" " width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Intent Analysis
&lt;/h3&gt;

&lt;p&gt;When someone types "something else" and enters their own context, like "I'm researching trauma-informed AI" or "need a keynote speaker", Gemini does something clever like this in the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Actual code from my intent analyzer&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`Analyze this visitor's intent and:
1. Classify their primary interest
2. Suggest which portfolio sections are relevant
3. Recommend a visual theme (robotics/research/business)
4. Generate a personalized welcome message`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, someone researching AI ethics gets routed to my Alma project (trauma-informed AI for GBV survivors), and someone looking for a speaker sees my events calendar and media kit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same portfolio. Personalized experience for different users.&lt;/strong&gt;&lt;/p&gt;
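
&lt;p&gt;The routing that follows the classification is essentially a lookup from intent to sections. A simplified sketch (the intent labels and section names below are examples, not the exact values in the repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Simplified sketch: map a classified intent to the portfolio sections to show.
// Intent labels and section names are illustrative examples.
const SECTIONS_BY_INTENT: { [intent: string]: string[] } = {
  "ai-ethics-research": ["alma-project", "research", "contact"],
  "speaking-inquiry": ["events-calendar", "media-kit", "contact"],
  "recruiter": ["experience", "achievements", "metrics"],
  "engineer": ["architecture", "open-source", "tech-stack"],
};

function sectionsFor(intent: string): string[] {
  return SECTIONS_BY_INTENT[intent] ?? ["story", "contact"]; // default path
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;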

&lt;h3&gt;
  
  
  The Vision System: Because Why Not?
&lt;/h3&gt;

&lt;p&gt;I'll be honest, the 3D robot-eye scanner thing that greets you? It started as a joke. I was working on face detection and tracking for a specific robotics solution, and I got hooked on how the image analysis happens. That's when I decided to experiment with a similar idea in a website UI, and my portfolio was the perfect place.&lt;/p&gt;

&lt;p&gt;I built a cyberpunk-inspired "vision system" that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scans my current image (not really, but it looks like it does)&lt;/li&gt;
&lt;li&gt;Shows "HUMAN DETECTED" with confidence scores&lt;/li&gt;
&lt;li&gt;Adapts its detection labels based on your chosen path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Is it necessary? No.&lt;br&gt;
Is it memorable? I hope so.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hallucination Prevention
&lt;/h3&gt;

&lt;p&gt;AI portfolios have a problem: they lie. Ask any LLM about a random developer, and it might confidently describe projects that don't exist.&lt;/p&gt;

&lt;p&gt;I implemented multiple safeguards:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Grounded System Prompt&lt;/strong&gt;: 170+ lines of verified facts about my career, with explicit "NEVER claim" instructions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pre-flight Checks&lt;/strong&gt;: Before returning any response, I scan for the following (see the sketch after this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Placeholder text (&lt;code&gt;[insert link]&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Email addresses (privacy)&lt;/li&gt;
&lt;li&gt;False company claims ("worked at Google/Meta/etc")&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exact Link Enforcement&lt;/strong&gt;: Every URL in the system prompt is real. Gemini is instructed to use them verbatim or not at all.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
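
&lt;p&gt;A minimal sketch of what those pre-flight checks can look like (the patterns below are illustrative; the real list is longer):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative pre-flight check: scan a Gemini response for placeholder text,
// email addresses, and false employer claims before returning it to the user.
const BLOCKED_PATTERNS: { name: string; pattern: RegExp }[] = [
  { name: "placeholder text", pattern: /\[insert [^\]]*\]/i },
  { name: "email address", pattern: /[\w.+-]+@[\w-]+\.[\w.]+/ },
  { name: "false employer claim", pattern: /worked at (google|meta)/i },
];

function preflightCheck(response: string): { ok: boolean; violations: string[] } {
  const violations: string[] = [];
  for (const rule of BLOCKED_PATTERNS) {
    if (rule.pattern.test(response)) {
      violations.push(rule.name);
    }
  }
  return { ok: violations.length === 0, violations };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;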

&lt;h2&gt;
  
  
  What I'm Most Proud Of
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Near-Perfect Lighthouse Scores
&lt;/h3&gt;

&lt;p&gt;I'm genuinely proud of the performance optimization. Here are the actual Lighthouse results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Desktop Score&lt;/th&gt;
&lt;th&gt;Mobile Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;99&lt;/td&gt;
&lt;td&gt;93&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Accessibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best Practices&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SEO&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And the Core Web Vitals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First Contentful Paint:&lt;/strong&gt; 0.3s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Largest Contentful Paint:&lt;/strong&gt; 0.6s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Blocking Time:&lt;/strong&gt; 10ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cumulative Layout Shift:&lt;/strong&gt; 0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed Index:&lt;/strong&gt; 1.0s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focsio6p4425nzznpkkvl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focsio6p4425nzznpkkvl.png" alt=" " width="800" height="740"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Personalization Actually Works
&lt;/h3&gt;

&lt;p&gt;With the pre-generated JSON data, I restructured the prompt around my story and aligned the UI/UX for storytelling. I tested the personalization with a friend, and she was pleased with how well it was tailored to tell more about me. I am happy with the Gemini integration and how intelligently it serves personalised information to users.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The API Efficiency
&lt;/h3&gt;

&lt;p&gt;Running Gemini for every interaction would cost a fortune, so I simulated a temporary AI memory through JSON data. My hybrid approach means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Common questions: 0ms latency, 0 API cost&lt;/li&gt;
&lt;li&gt;Complex questions: Full Gemini intelligence&lt;/li&gt;
&lt;li&gt;Estimated: 0.2 API calls per visitor (based on automated and manual testing)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. The Story It Tells
&lt;/h3&gt;

&lt;p&gt;The goal of this portfolio is to tell the story of my career and my background. The theme reflects what I am currently working on, and visitors can relate to that. From the moment a user enters the site to the end, it tells my story.&lt;/p&gt;

&lt;p&gt;It's like the latest chapter in a book I could be writing that might end in suspense.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. It's Actually Useful
&lt;/h3&gt;

&lt;p&gt;It started as a challenge, but now I am going to use it fully as my portfolio, to tell my stories and connect with my audience.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. UI/UX
&lt;/h3&gt;

&lt;p&gt;I am happy with the robotics (computer vision) theme in general: the mouse effects, the vision scanner, the 3D hero section, and the map nodes that tell different story paths for those who want to explore different routes of my journey.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The code is open source. Feel free to fork it and build your own intent-aware portfolio (you might want to remove most of my data; I'll structure it better for open source):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/nicanor-korir/portfolio" rel="noopener noreferrer"&gt;github.com/nicanor-korir/portfolio&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live Demo:&lt;/strong&gt; &lt;a href="https://nicanor-170395639051.europe-west3.run.app/" rel="noopener noreferrer"&gt;nicanor-170395639051.europe-west3.run.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key files to explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;src/lib/gemini.ts&lt;/code&gt; - The hybrid AI architecture&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/app/api/analyze-intent/route.ts&lt;/code&gt; - Intent classification&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/components/3d/HeroVisionSystem.tsx&lt;/code&gt; - The vision scanner&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/components/sections/Hero.tsx&lt;/code&gt; - Personalized path selection&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Thanks for reading!&lt;/strong&gt; If you made it this far, you're exactly the kind of curious person this portfolio was built for.&lt;/p&gt;

&lt;p&gt;Come say hi: &lt;a href="https://nicanor-170395639051.europe-west3.run.app/#contact" rel="noopener noreferrer"&gt;nicanor-170395639051.europe-west3.run.app/#contact&lt;/a&gt;&lt;/p&gt;




</description>
      <category>devchallenge</category>
      <category>googleaichallenge</category>
      <category>portfolio</category>
      <category>gemini</category>
    </item>
    <item>
      <title>From Shaky Farm Videos to Sharp Diagnoses - Building a Client-Side Media Pipeline</title>
      <dc:creator>Nicanor Korir</dc:creator>
      <pubDate>Sat, 24 Jan 2026 23:53:33 +0000</pubDate>
      <link>https://forem.com/nicanor_korir/from-shaky-farm-videos-to-sharp-diagnoses-building-a-client-side-media-pipeline-2iok</link>
      <guid>https://forem.com/nicanor_korir/from-shaky-farm-videos-to-sharp-diagnoses-building-a-client-side-media-pipeline-2iok</guid>
      <description>&lt;p&gt;A farmer stands in her field, phone in hand, recording a quick video of a sick tomato plant. The camera shakes. The sun creates harsh shadows. Her thumb accidentally covers the corner of the frame for three seconds.&lt;/p&gt;

&lt;p&gt;That video contains maybe 900 frames. Maybe ten of them are actually usable for plant disease diagnosis. The other 890 are blurry, redundant, or partially obscured.&lt;/p&gt;

&lt;p&gt;The question that drove weeks of development: how do you automatically find those ten good frames without uploading 900 to a server?&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Client-Side Processing
&lt;/h2&gt;

&lt;p&gt;The obvious architecture: upload the raw video, process it on the server with Python or FFmpeg, and send back extracted frames.&lt;/p&gt;

&lt;p&gt;For my users, that architecture fails:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftfidk48fsrkhnxm0v5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftfidk48fsrkhnxm0v5m.png" alt=" " width="800" height="1312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bandwidth costs real money.&lt;/strong&gt; On metered data plans common in rural Africa, uploading a 30MB video might cost more than the farmer earns that day. Extracting frames locally and uploading only the useful ones—maybe 500KB total—changes the economics entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Networks are unreliable.&lt;/strong&gt; A large upload over a 2G connection with frequent drops means failed transfers, wasted data, and frustrated users. Smaller uploads succeed more often.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency compounds.&lt;/strong&gt; Upload time plus server processing time plus download time. On slow networks, this becomes intolerable. Processing locally eliminates the round-trip.&lt;/p&gt;

&lt;p&gt;So everything happens in the browser. Image compression, video frame extraction, blur detection, frame selection—all client-side JavaScript using Canvas APIs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Image Compression: The Foundation
&lt;/h2&gt;

&lt;p&gt;Every image, regardless of source, goes through compression before upload. The target: 1024 pixels on the longest dimension, JPEG at 80% quality.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zzecup33yl6f6pu5106.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zzecup33yl6f6pu5106.png" alt=" " width="800" height="1446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why these specific numbers?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1024 pixels&lt;/strong&gt; is the sweet spot for Claude Vision. Larger images don't improve diagnostic accuracy—Claude doesn't need to count individual pixels to identify disease symptoms. Smaller images lose the detail needed to spot early infections. I tested this extensively: 1024px captures everything diagnostically relevant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;80% JPEG quality&lt;/strong&gt; is where compression artifacts become invisible to humans but file size drops dramatically. At 90% quality, files are 50% larger with no visible benefit. At 70%, subtle disease symptoms might be obscured by compression artifacts. 80% hits the sweet spot.&lt;/p&gt;

&lt;p&gt;The result: a 4000x3000 pixel PNG (roughly 15MB) becomes a 1024x768 JPEG (roughly 100KB). That's a 150x reduction in data transferred. On a slow network, that's the difference between a 30-second upload and a 2-second upload.&lt;/p&gt;
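
&lt;p&gt;Here's a minimal browser-side sketch of that compression step, assuming a canvas-based approach (the function name and structure are illustrative, not the exact code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal sketch of the compression step: resize so the longest side is
// 1024px, then re-encode as JPEG at 80% quality using a canvas.
const MAX_DIMENSION = 1024;
const JPEG_QUALITY = 0.8;

async function compressImage(file: File) {
  const bitmap = await createImageBitmap(file);

  // Scale down only; never upscale small images.
  const scale = Math.min(1, MAX_DIMENSION / Math.max(bitmap.width, bitmap.height));
  const width = Math.round(bitmap.width * scale);
  const height = Math.round(bitmap.height * scale);

  const canvas = document.createElement("canvas");
  canvas.width = width;
  canvas.height = height;
  canvas.getContext("2d")!.drawImage(bitmap, 0, 0, width, height);
  bitmap.close(); // release the decoded image promptly

  // toBlob is callback-based, so wrap it in a Promise.
  return new Promise(function (resolve) {
    canvas.toBlob(resolve, "image/jpeg", JPEG_QUALITY);
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;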

&lt;h2&gt;
  
  
  Video Frame Extraction: The Hard Part
&lt;/h2&gt;

&lt;p&gt;Videos are information-dense but mostly redundant. A 10-second clip at 30fps contains 300 frames, but probably only 5-10 are worth analyzing.&lt;/p&gt;

&lt;p&gt;The naive approach—grab every Nth frame—fails in practice:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7dfsp1oo3zix98ory9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7dfsp1oo3zix98ory9h.png" alt=" " width="800" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blur clustering.&lt;/strong&gt; If the camera moves at the 3-second mark, you get a blurry frame. The sharp frames at 2.8 and 3.2 seconds are skipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redundancy.&lt;/strong&gt; If the camera holds steady for 5 seconds, you might extract 2 nearly identical frames while missing coverage of different angles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temporal bias.&lt;/strong&gt; Fixed intervals ignore content. The user might have shown three different angles at 5, 15, and 25 seconds. Fixed extraction at 0, 10, 20, 30 might miss all three.&lt;/p&gt;

&lt;p&gt;The solution requires two innovations: blur detection to find sharp frames, and temporal filtering to ensure diversity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Blur Detection: The Laplacian Trick
&lt;/h2&gt;

&lt;p&gt;To find sharp frames, you need to measure sharpness. The computer vision community solved this decades ago with the Laplacian operator.&lt;/p&gt;

&lt;p&gt;The intuition: sharp images have many edges—sudden transitions between light and dark. Blurry images have few edges because transitions are smoothed out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9ljseh4k2mh3vpqpxvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9ljseh4k2mh3vpqpxvc.png" alt=" " width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Laplacian operator computes, for each pixel, how different it is from its immediate neighbors. In a flat region (like a solid color), Laplacian values are near zero. At an edge (like a leaf vein against leaf tissue), values are high.&lt;/p&gt;

&lt;p&gt;For each pixel, the calculation is: &lt;code&gt;4 * center - top - bottom - left - right&lt;/code&gt;. If the center pixel is similar to its neighbors, this equals zero. If the center pixel is dramatically different (an edge), this equals a large positive or negative number.&lt;/p&gt;

&lt;p&gt;By computing the variance of Laplacian values across the entire image, you get a single "sharpness score." High variance means many edges, which means sharp. Low variance means few edges, which means blurry.&lt;/p&gt;

&lt;p&gt;In testing, I found these thresholds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variance above 500&lt;/strong&gt;: reliably sharp, excellent for analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variance 100-500&lt;/strong&gt;: acceptable, usable if nothing better available&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variance below 100&lt;/strong&gt;: too blurry, likely unusable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The calculation happens on downscaled frames (1024px max dimension) to keep processing fast. Full-resolution Laplacian computation would be too slow for real-time frame scoring on mobile devices.&lt;/p&gt;
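
&lt;p&gt;For reference, a compact sketch of that sharpness score over canvas ImageData, simplified from the real implementation (grayscale conversion, 4-neighbour Laplacian, then variance):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of the blur score: variance of a 4-neighbour Laplacian over a
// grayscale version of the frame. Higher variance = more edges = sharper.
function sharpnessScore(image: ImageData): number {
  const { width, height, data } = image;

  // Convert RGBA pixels to a single luminance channel.
  const gray = new Float32Array(width * height);
  for (let i = 0; i &amp;lt; width * height; i++) {
    const o = i * 4;
    gray[i] = 0.299 * data[o] + 0.587 * data[o + 1] + 0.114 * data[o + 2];
  }

  // Laplacian per pixel: 4 * center - top - bottom - left - right.
  const values: number[] = [];
  for (let y = 1; y &amp;lt; height - 1; y++) {
    for (let x = 1; x &amp;lt; width - 1; x++) {
      const i = y * width + x;
      values.push(4 * gray[i] - gray[i - width] - gray[i + width] - gray[i - 1] - gray[i + 1]);
    }
  }

  // Variance of the Laplacian values is the sharpness score.
  let sum = 0;
  for (const v of values) sum += v;
  const mean = sum / values.length;
  let squared = 0;
  for (const v of values) squared += (v - mean) * (v - mean);
  return squared / values.length;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;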

&lt;h2&gt;
  
  
  Temporal Diversity: Not Just The Sharpest
&lt;/h2&gt;

&lt;p&gt;Selecting the 10 sharpest frames sounds right, but fails in practice. They often cluster in one time window when the camera happens to be stable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspuc02t6au094yg28qq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspuc02t6au094yg28qq7.png" alt=" " width="800" height="1300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For disease diagnosis, temporal diversity matters. The user naturally shifts perspective while recording—showing the top of the leaf, then the underside, then the stem. A diverse frame selection captures this variation.&lt;/p&gt;

&lt;p&gt;The algorithm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Score all candidate frames by sharpness (Laplacian variance)&lt;/li&gt;
&lt;li&gt;Sort by sharpness, highest first&lt;/li&gt;
&lt;li&gt;Select the sharpest frame&lt;/li&gt;
&lt;li&gt;Calculate minimum time gap: video duration divided by (desired frames × 2)&lt;/li&gt;
&lt;li&gt;Skip any frame too close in time to already-selected frames&lt;/li&gt;
&lt;li&gt;Select the next-sharpest that passes the temporal filter&lt;/li&gt;
&lt;li&gt;Repeat until you have enough frames&lt;/li&gt;
&lt;li&gt;Sort selected frames chronologically for display&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For a 30-second video, selecting 5 frames enforces at least 3-second gaps between selections. The result: frames are both sharp AND temporally diverse, capturing different moments and angles from the recording.&lt;/p&gt;
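
&lt;p&gt;In code, the selection loop is roughly the following, assuming each candidate frame carries its timestamp and sharpness score (the data shape here is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of the temporal-diversity filter: greedily take the sharpest frames,
// but skip any frame closer than minGap seconds to an already-selected one.
interface ScoredFrame {
  time: number;       // seconds into the video
  sharpness: number;  // Laplacian variance
}

function selectFrames(candidates: ScoredFrame[], duration: number, wanted: number): ScoredFrame[] {
  const minGap = duration / (wanted * 2); // e.g. 30s / (5 * 2) = 3s
  const bySharpness = candidates.slice().sort(function (a, b) {
    return b.sharpness - a.sharpness; // sharpest first
  });

  const selected: ScoredFrame[] = [];
  for (const frame of bySharpness) {
    if (selected.length === wanted) break;
    let tooClose = false;
    for (const s of selected) {
      if (Math.abs(s.time - frame.time) &amp;lt; minGap) tooClose = true;
    }
    if (!tooClose) selected.push(frame);
  }

  // Sort chronologically for display.
  return selected.sort(function (a, b) {
    return a.time - b.time;
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;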

&lt;h2&gt;
  
  
  The Complete Video Pipeline
&lt;/h2&gt;

&lt;p&gt;When a user uploads a video:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k5mmbaj3cotx92z3hi1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k5mmbaj3cotx92z3hi1.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Load metadata&lt;/strong&gt;: Get duration and dimensions without loading the full video into memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calculate sample points&lt;/strong&gt;: Divide duration by max frames to get interval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seek and capture&lt;/strong&gt;: For each sample point, seek the video element and draw the current frame to the canvas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score sharpness&lt;/strong&gt;: Compute Laplacian variance for each captured frame&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select best frames&lt;/strong&gt;: Apply the temporal diversity filter to choose final frames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compress frames&lt;/strong&gt;: Each selected frame goes through image compression (1024px, 80% JPEG)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return for analysis&lt;/strong&gt;: Final frames ready for Claude Vision&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The whole process typically takes 3-5 seconds for a 30-second video on a mid-range phone. Users see a progress indicator showing frames being extracted, then their selected frames are displayed for confirmation before analysis.&lt;/p&gt;
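
&lt;p&gt;Step 3, "seek and capture", is the only genuinely fiddly browser part. A trimmed sketch, assuming the clip is already loaded into a hidden video element:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of "seek and capture": jump the video element to a timestamp and
// draw the current frame onto a canvas so it can be scored and compressed.
function captureFrameAt(video: HTMLVideoElement, canvas: HTMLCanvasElement, time: number) {
  return new Promise(function (resolve, reject) {
    function onSeeked() {
      video.removeEventListener("seeked", onSeeked);
      canvas.width = video.videoWidth;
      canvas.height = video.videoHeight;
      const ctx = canvas.getContext("2d");
      if (!ctx) return reject(new Error("Canvas 2D context unavailable"));
      ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
      resolve(ctx.getImageData(0, 0, canvas.width, canvas.height));
    }
    video.addEventListener("seeked", onSeeked);
    video.currentTime = time; // fires "seeked" once the frame is decoded
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;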

&lt;h2&gt;
  
  
  Handling Multiple Media Items
&lt;/h2&gt;

&lt;p&gt;Users can upload multiple images or videos in a single session. The system needs to know: same plant from different angles, or different plants entirely?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7flnywdn8ehsqkkqhndz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7flnywdn8ehsqkkqhndz.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For images, the UI asks directly. "Different Plants" produces N separate diagnoses. "Same Plant" produces one comprehensive diagnosis considering all evidence.&lt;/p&gt;

&lt;p&gt;For video input, extracted frames are treated as "the same plant" by default. The user was recording continuously, so frames presumably show the same subject. This makes video the fastest path to a comprehensive diagnosis—point and record, get analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Error Handling: Fail Visibly
&lt;/h2&gt;

&lt;p&gt;Real-world media processing fails constantly. Files are corrupted. Formats are unsupported. Memory runs out on old phones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkoh3ytav2ozcq4pipnnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkoh3ytav2ozcq4pipnnl.png" alt=" " width="800" height="970"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unsupported format&lt;/strong&gt;: Check MIME type before processing. Provide clear error listing supported formats (JPEG, PNG, GIF, WebP for images; MP4, WebM, QuickTime for video).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Corrupt file&lt;/strong&gt;: Wrap processing in try-catch. If the image fails to load or the video fails to seek, provide specific feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Processing failure&lt;/strong&gt;: If compression fails, retry with lower quality settings. If that fails, skip the problematic file but continue with others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory pressure&lt;/strong&gt;: Process one item at a time rather than parallelizing. Release canvas references after each operation. Revoke blob URLs immediately after use. This is slower but prevents crashes on memory-constrained devices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unusable output&lt;/strong&gt;: If blur detection determines all extracted frames are below the usability threshold, warn the user and suggest re-recording with a steadier hand or better lighting.&lt;/p&gt;

&lt;p&gt;The principle: fail visibly, never silently. A user who sees "Video too blurry, try recording again in better light" can fix the problem. A user whose analysis silently uses garbage frames loses trust in the whole system.&lt;/p&gt;
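
&lt;p&gt;As a small illustration of that principle, the format check is the simplest case: validate up front and return a message the user can act on (the helper below is a sketch, not the exact code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustration of "fail visibly": check the MIME type before processing and
// surface a specific, actionable error instead of failing silently.
const SUPPORTED_IMAGE_TYPES = ["image/jpeg", "image/png", "image/gif", "image/webp"];
const SUPPORTED_VIDEO_TYPES = ["video/mp4", "video/webm", "video/quicktime"];

function validateMediaFile(file: File): string | null {
  const supported = SUPPORTED_IMAGE_TYPES.concat(SUPPORTED_VIDEO_TYPES);
  if (!supported.includes(file.type)) {
    return "Unsupported file type '" + file.type + "'. Please upload JPEG, PNG, GIF, or WebP images, or MP4, WebM, or QuickTime videos.";
  }
  return null; // null means the file looks usable
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;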

&lt;h2&gt;
  
  
  Performance: Making It Work on Old Phones
&lt;/h2&gt;

&lt;p&gt;The target device isn't the latest iPhone. It's a three-year-old Android phone with 2GB of RAM running Chrome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Canvas reuse&lt;/strong&gt;: Creating a new canvas element for each operation is expensive. The pipeline reuses canvas elements across operations, clearing and resizing as needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Downscale first&lt;/strong&gt;: Blur detection doesn't need full resolution. A 4000x3000 image downscaled to 1024x768 gives equally valid sharpness scores at 1/12th the processing cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Progressive loading&lt;/strong&gt;: For videos, metadata loads first (duration, dimensions), then frames extract one at a time with progress feedback. Users see activity immediately rather than waiting for complete processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explicit cleanup&lt;/strong&gt;: Large media can exhaust mobile browser memory. The pipeline explicitly nulls references after use. Video object URLs are revoked immediately after frame extraction. This prevents memory from accumulating across multiple uploads.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Test with garbage input.&lt;/strong&gt; Development used well-lit, centered photos taken by someone who knows what they're doing. Production receives shaky videos recorded while walking, photos with fingers partially covering the lens, and screenshots of screenshots. Robustness only emerged after I deliberately tried to break the system with the worst possible input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Client-side processing is more feasible than expected.&lt;/strong&gt; My initial assumption was that "real" image processing needed server resources. Modern browsers have Canvas APIs that handle common operations efficiently. Even blur detection—fundamentally a computer vision algorithm—runs fast enough on phone CPUs when you're smart about resolution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Users prefer control over automation.&lt;/strong&gt; Early versions automatically selected "best" frames without showing users what was selected. Users didn't trust it. Showing extracted frames and letting users confirm or re-record built confidence in the system. The extra step is worth the trust it creates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fail visibly.&lt;/strong&gt; Silent failures are the worst. A processing error that shows "Something went wrong, try again" is infinitely better than one that silently produces garbage output. Users can recover from visible failures; invisible ones erode trust permanently.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Invisibility Goal
&lt;/h2&gt;

&lt;p&gt;The media pipeline is invisible when it works. Users upload a shaky video, see some frames appear, confirm their selection, and get a diagnosis. They don't think about Laplacian variance or temporal diversity or JPEG compression ratios.&lt;/p&gt;

&lt;p&gt;That invisibility is the goal. The complexity exists so that users don't have to understand it. They just need to point their phone at a sick plant and get help.&lt;/p&gt;

&lt;p&gt;Every technical decision—client-side processing, blur scoring, temporal filtering, compression ratios—serves that goal. Not because the techniques are elegant, but because they make the experience work for farmers with old phones, slow networks, and unsteady hands.&lt;/p&gt;

&lt;p&gt;Technical complexity that serves users is engineering. Technical complexity that serves itself is self-indulgence. The media pipeline exists to turn chaos into clarity, invisibly.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;This completes the Shamba-MedCare technical series.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full system represents months of iteration across architecture, prompt engineering, accessibility, and media processing. The code is open source. The problems are documented. The opportunity—bringing agricultural AI to farmers who need it most—is massive.&lt;/p&gt;

&lt;p&gt;If you're working on similar problems, I'd love to hear from you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/nicanor/shamba-dawa" rel="noopener noreferrer"&gt;Source code on GitHub&lt;/a&gt;&lt;/p&gt;




</description>
      <category>javascript</category>
      <category>webdev</category>
      <category>imageprocessing</category>
      <category>canvas</category>
    </item>
    <item>
      <title>The Architecture Behind a Stateless AI Application</title>
      <dc:creator>Nicanor Korir</dc:creator>
      <pubDate>Mon, 01 Dec 2025 23:39:47 +0000</pubDate>
      <link>https://forem.com/nicanor_korir/the-architecture-behind-a-stateless-ai-application-pk3</link>
      <guid>https://forem.com/nicanor_korir/the-architecture-behind-a-stateless-ai-application-pk3</guid>
      <description>&lt;p&gt;This project has really been awesome to work on. I made an architectural decision early in Shamba-MedCare that felt risky at the time: no backend database. There was no need for user data at this moment, and getting the user response was the most important.&lt;/p&gt;

&lt;p&gt;Every tutorial, every architecture guide, every "best practice" document assumes you'll store user data on a server: user accounts, session management, and data persistence all living in PostgreSQL, MongoDB, or DynamoDB.&lt;/p&gt;

&lt;p&gt;But I kept asking myself: why? What user data does this application actually need to persist across devices? The answer was... nothing. And that realization shaped everything that followed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Layer Split
&lt;/h2&gt;

&lt;p&gt;Here's how the system actually works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fio86mn2caiod0j3y46k0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fio86mn2caiod0j3y46k0.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt; handles all user interaction, data persistence, and UI state. It compresses images before upload, manages the multi-step wizard flow, stores history locally, and renders results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backend&lt;/strong&gt; does exactly one thing: transform image data into diagnosis data. It receives a request, builds a prompt, calls the LLM, parses the response, and returns structured JSON. No state. No sessions. No database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Layer&lt;/strong&gt; is Claude Vision. It receives images with carefully crafted prompts and returns detailed diagnostic information.&lt;/p&gt;

&lt;p&gt;Each layer has one job. Mixing responsibilities, like having the backend store history or the frontend call the LLM directly, would create complexity without benefit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Interaction
&lt;/h2&gt;

&lt;p&gt;Simplicity is the goal; here is how it works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4iwjtkfohyo84i6l2fr8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4iwjtkfohyo84i6l2fr8.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;History and settings never leave the user's device. The API key passes through my server, but is never stored.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Multi-Step Wizard: Why State Machines
&lt;/h2&gt;

&lt;p&gt;The scan flow has five potential steps: plant part selection, crop type selection, media upload, analysis mode (for multiple images), and context entry. Implementing this as a traditional form with step numbers would be a nightmare.&lt;/p&gt;

&lt;p&gt;Here's the problem with step numbers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7u43q84mexyqqdivwmvg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7u43q84mexyqqdivwmvg.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the user uploads a single image, step 4 is context entry. If they upload multiple images, step 4 is mode selection and step 5 is context entry. The step numbers become meaningless because they depend on runtime conditions.&lt;/p&gt;

&lt;p&gt;The solution is a state machine. Each state has a meaningful name: "part", "crop", "media", "mode", "context", "analyzing". The UI doesn't care about step numbers. It renders whatever state it's in.&lt;/p&gt;

&lt;p&gt;The progress indicator ("Step 3 of 5" vs "Step 3 of 4") is computed dynamically based on whether mode selection will appear. Users see accurate progress without the code caring about arbitrary step numbers.&lt;/p&gt;
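
&lt;p&gt;A stripped-down sketch of that state machine, using the state names above (the transition table is simplified):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Stripped-down sketch of the wizard state machine. States have meaningful
// names; the next state depends on runtime data (how many images were added),
// not on a fixed step number.
type WizardState = "part" | "crop" | "media" | "mode" | "context" | "analyzing";

function nextState(current: WizardState, imageCount: number): WizardState {
  switch (current) {
    case "part":    return "crop";
    case "crop":    return "media";
    // Mode selection only exists when more than one image was uploaded.
    case "media":   return imageCount &amp;gt; 1 ? "mode" : "context";
    case "mode":    return "context";
    case "context": return "analyzing";
    default:        return current;
  }
}

// The progress label is computed, not hard-coded: "Step 3 of 5" vs "Step 3 of 4".
function totalSteps(imageCount: number): number {
  return imageCount &amp;gt; 1 ? 5 : 4;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;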

&lt;h2&gt;
  
  
  Storage Architecture: Three Tiers
&lt;/h2&gt;

&lt;p&gt;Different data have different lifetimes. I implemented three distinct tiers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgb9vqj3ygkpsq3dyukh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgb9vqj3ygkpsq3dyukh.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session storage&lt;/strong&gt; holds consent flags. When you close the browser, consent expires. Next session, you make an active choice again. For health-related applications, I think users should consciously opt in each time rather than relying on consent from months ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local storage&lt;/strong&gt; holds everything that should persist: scan history, accessibility settings (font size, voice preferences), and the API key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedded deep cache&lt;/strong&gt; is a design choice that trips people up. Each history item doesn't just store a reference to results, it stores the complete diagnosis. All 25+ fields. Treatments, prevention tips, the full thing.&lt;/p&gt;

&lt;p&gt;This bloats storage but enables true offline access. A farmer can reference last week's treatment recommendations without any network connection. That's critical for rural users.&lt;/p&gt;

&lt;p&gt;The math works out: each scan is about 20-30KB with a thumbnail. At 50 scans maximum, that's roughly 1.5MB, well under the 5MB browser quota. Older scans rotate out automatically.&lt;/p&gt;
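
&lt;p&gt;A small sketch of the history tier, using the numbers above (the storage key and field names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of the local-storage history tier: each entry embeds the complete
// diagnosis for offline access, and the list is capped at 50 entries
// (about 20-30KB each, well under the 5MB browser quota).
const HISTORY_KEY = "scan_history"; // illustrative key name
const MAX_HISTORY_ITEMS = 50;

interface HistoryItem {
  id: string;
  timestamp: number;
  thumbnail: string; // small data URL
  diagnosis: object; // the full 25+ field diagnosis, embedded
}

function saveScan(item: HistoryItem): void {
  const raw = localStorage.getItem(HISTORY_KEY);
  const history: HistoryItem[] = raw ? JSON.parse(raw) : [];
  history.unshift(item);             // newest first
  history.splice(MAX_HISTORY_ITEMS); // older scans rotate out automatically
  localStorage.setItem(HISTORY_KEY, JSON.stringify(history));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;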

&lt;h2&gt;
  
  
  The Single Endpoint Philosophy
&lt;/h2&gt;

&lt;p&gt;The backend has one API endpoint: POST to &lt;code&gt;/api/v1/analyze&lt;/code&gt;. That's it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwreaiufhnoheb49kafxh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwreaiufhnoheb49kafxh.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why not multiple endpoints? I considered having &lt;code&gt;/analyze/single&lt;/code&gt; for one plant, &lt;code&gt;/analyze/batch&lt;/code&gt; for multiple plants, and &lt;code&gt;/analyze/video&lt;/code&gt; for video input. But here's the thing: they all do the same underlying operation. They all send images to LLM and return structured results.&lt;/p&gt;

&lt;p&gt;The only difference is in the prompt construction and response handling. A &lt;code&gt;mode&lt;/code&gt; parameter handles that cleanly. Multiple endpoints would mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Client-side routing logic&lt;/li&gt;
&lt;li&gt;Duplicate validation code&lt;/li&gt;
&lt;li&gt;Versioning complexity when the format changes&lt;/li&gt;
&lt;li&gt;Documentation for three endpoints instead of one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One endpoint with a mode parameter is simpler to understand, test, and maintain.&lt;/p&gt;
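
&lt;p&gt;As a sketch, the entire client integration collapses into one helper. The field names below are assumptions for illustration, not the exact request schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async function analyze(
  images: string[],                                   // base64-encoded images
  mode: "single" | "same_plant" | "different_plants", // the one switch the backend needs
  apiKey: string
) {
  const res = await fetch("/api/v1/analyze", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ images, mode, api_key: apiKey }),
  });
  if (!res.ok) throw new Error(`Analyze failed with status ${res.status}`);
  return res.json();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;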

&lt;h2&gt;
  
  
  User-Provided API Key
&lt;/h2&gt;

&lt;p&gt;This is, of course, temporary. While the app is still in testing and development, asking users to bring their own key prevents overuse of our API credits.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkan9heo46az26be490c2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkan9heo46az26be490c2.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost transparency&lt;/strong&gt;: Users see exactly what they're paying. No hidden markup. No surprise bills.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No key management&lt;/strong&gt;: I don't need a database to store keys, rotation logic, or access controls. This reduces operational complexity considerably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: No shared rate limits. Each user has their own Anthropic quota.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trust&lt;/strong&gt;: Users control their own credentials. I literally cannot run up their bill unexpectedly.&lt;/p&gt;

&lt;p&gt;The downside is friction. Users must create an Anthropic account and generate an API key before using the app. For sophisticated users (the current audience), this is fine. For mass-market adoption, I'd need to revisit this decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Batch vs Single Mode: The Same Plant Problem
&lt;/h2&gt;

&lt;p&gt;When users upload multiple images, the system needs to know: are these different plants (analyze separately) or the same plant from different angles (analyze together)?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsricafnn08tksvdmtx46.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsricafnn08tksvdmtx46.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This isn't something AI can reliably infer. A tomato leaf photographed from above and below might look completely different. Are they the same plant? Only the user knows.&lt;/p&gt;

&lt;p&gt;So the UI asks directly. "Same plant, different angles" sends all images in one request; LLM sees everything together and produces one diagnosis that synthesizes evidence across views. "Different plants" sends each image separately: three images means three independent diagnoses.&lt;/p&gt;

&lt;p&gt;For video input, extracted frames default to "same plant" mode. The user was recording continuously, so frames presumably show the same subject.&lt;/p&gt;
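
&lt;p&gt;In client terms, reusing the analyze helper sketched above (mode names still illustrative), the difference is just how requests are fanned out:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async function runAnalysis(images: string[], samePlant: boolean, apiKey: string) {
  if (samePlant) {
    // One request, one synthesized diagnosis across all views.
    return [await analyze(images, "same_plant", apiKey)];
  }
  // One request per image: three images means three independent diagnoses.
  return Promise.all(images.map((img) =&gt; analyze([img], "single", apiKey)));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;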

&lt;h2&gt;
  
  
  Response Parsing: Handling Imperfection
&lt;/h2&gt;

&lt;p&gt;Here's something I learned the hard way: LLM sometimes wraps JSON responses in markdown code blocks. Even when the prompt explicitly requests raw JSON.&lt;/p&gt;

&lt;p&gt;The prompt says: "Return ONLY valid JSON, no markdown formatting."&lt;/p&gt;

&lt;p&gt;LLM occasionally returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```json
{"health_score": 45, "disease": "Early Blight"...}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend strips markdown code block markers if present. It handles both raw JSON and markdown-wrapped JSON identically.&lt;/p&gt;

&lt;p&gt;The broader principle: &lt;strong&gt;prompts aim for perfection; parsing assumes imperfection&lt;/strong&gt;. Every field has a fallback. Missing health score defaults to 0. Missing severity defaults to "moderate". A partial response is better than a crashed request.&lt;/p&gt;
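
&lt;p&gt;Here's a minimal sketch of that defensive parsing in TypeScript (the real backend may differ; the defaults mirror the ones described above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function parseDiagnosis(raw: string) {
  // Strip ```json ... ``` fences if the model added them despite instructions.
  const cleaned = raw.replace(/```(?:json)?/g, "").trim();

  let data: any = {};
  try {
    data = JSON.parse(cleaned);
  } catch {
    // Fall through: a partial result is better than a crashed request.
  }

  return {
    health_score: typeof data.health_score === "number" ? data.health_score : 0,
    severity: data.severity ?? "moderate",
    disease: data.disease ?? "Unknown",
    treatments: Array.isArray(data.treatments) ? data.treatments : [],
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;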

&lt;h2&gt;
  
  
  The Trade-off Table
&lt;/h2&gt;

&lt;p&gt;I believe architecture is really just trade-off documentation. Here's the honest accounting:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What We Gave Up&lt;/th&gt;
&lt;th&gt;What We Gained&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No backend database&lt;/td&gt;
&lt;td&gt;Cross-device sync, unlimited history&lt;/td&gt;
&lt;td&gt;Privacy by design, simpler operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User-provided API keys&lt;/td&gt;
&lt;td&gt;Frictionless onboarding&lt;/td&gt;
&lt;td&gt;Cost transparency, no key management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optional plant/crop selection&lt;/td&gt;
&lt;td&gt;Guaranteed input accuracy&lt;/td&gt;
&lt;td&gt;Accessibility, faster expert workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full results cached in history&lt;/td&gt;
&lt;td&gt;Smaller storage footprint&lt;/td&gt;
&lt;td&gt;True offline access to treatments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single API endpoint&lt;/td&gt;
&lt;td&gt;Clear operation separation&lt;/td&gt;
&lt;td&gt;Simpler integration, less client logic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of these decisions is universally correct. They're correct for this application's specific constraints: privacy-conscious users, offline usage requirements, cost sensitivity, and a small development team.&lt;/p&gt;

&lt;p&gt;Different constraints would lead to different choices. A hospital app would need a database. An enterprise tool would need centralized key management. A children's app would need user accounts.&lt;/p&gt;

&lt;p&gt;Architecture isn't about finding the "right" pattern. It's about understanding your constraints and making explicit choices that fit them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;The system that emerged is simple in a specific way. Not simple as in easy to build, but simple in that each piece does one thing.&lt;/p&gt;

&lt;p&gt;Frontend handles users. Backend handles transformation. LLM handles intelligence. Storage tiers handle different lifetimes. The scan wizard handles a multi-step flow. Each piece is testable in isolation and replaceable without affecting others.&lt;/p&gt;

&lt;p&gt;That's the goal of architecture: not clever abstractions, but clear separations. Not perfect patterns, but honest trade-offs.&lt;/p&gt;

&lt;p&gt;When I look at this system, I can explain every decision. That's what makes it maintainable, not the code, but the clarity of intent behind it.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/nicanor/shamba-dawa" rel="noopener noreferrer"&gt;Source code on GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>opensource</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>Building Shamba-MedCare AI app for Real Users</title>
      <dc:creator>Nicanor Korir</dc:creator>
      <pubDate>Mon, 01 Dec 2025 22:57:52 +0000</pubDate>
      <link>https://forem.com/nicanor_korir/building-shamba-medcare-ai-app-for-real-users-dak</link>
      <guid>https://forem.com/nicanor_korir/building-shamba-medcare-ai-app-for-real-users-dak</guid>
      <description>&lt;p&gt;From the research, farmers are interested in the results.&lt;/p&gt;

&lt;p&gt;I spent about two days perfecting the health score animation. A smooth circular progress bar that fills with color, green for healthy, red for critical.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Users I'm Actually Building For
&lt;/h2&gt;

&lt;p&gt;Here's the uncomfortable truth about my target users:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpzrnfxb273jrl29yw8e3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpzrnfxb273jrl29yw8e3.png" alt=" " width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This isn't edge-case accessibility. This IS the use case. A 55-year-old farmer with reading glasses she can't find, soil under her fingernails, standing in bright sunlight with one bar of signal, that's who needs this app most.&lt;/p&gt;

&lt;p&gt;So I built the accessibility system around her, not around developers reviewing my code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Voice Button That Changed Everything
&lt;/h2&gt;

&lt;p&gt;The single most impactful feature I built was embarrassingly simple: a button that reads the diagnosis/results aloud.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnynky6rbzwo1vtti9if.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnynky6rbzwo1vtti9if.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I used the Web Speech API, which is built into every modern browser. The implementation took maybe an hour. But here's what I learned: the voice doesn't just help users who can't read. It helps everyone.&lt;/p&gt;

&lt;p&gt;My mom tested it. She CAN read perfectly well. But she said hearing "Your tomato has early blight. This is a fungal disease. Severity is moderate" felt more trustworthy than reading the same words. Like getting advice from a person instead of a screen.&lt;/p&gt;

&lt;p&gt;The voice script matters too. I don't just dump the JSON response into speech. I wrote it conversationally:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Your tomato leaf has a health score of 45 out of 100. I detected Early Blight, which is a fungal disease. The severity is moderate. For treatment, you can use neem oil spray, mix two tablespoons with one liter of water, and spray every seven days."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Pauses between sections. Specific measurements. No jargon. The AI generates this because I asked it to in the prompt, "provide practical, actionable treatment steps that farmers can follow."&lt;/p&gt;
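
&lt;p&gt;A minimal sketch of that button's handler using the Web Speech API; the script string is the conversational text built from the diagnosis, not the raw JSON:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function speakDiagnosis(script: string): void {
  const utterance = new SpeechSynthesisUtterance(script);
  utterance.rate = 0.9;             // slightly slower than default for clarity
  window.speechSynthesis.cancel();  // stop anything that's already playing
  window.speechSynthesis.speak(utterance);
}

speakDiagnosis(
  "Your tomato leaf has a health score of 45 out of 100. " +
  "I detected Early Blight, which is a fungal disease. The severity is moderate."
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;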

&lt;h2&gt;
  
  
  Font Scaling Without Breaking Everything
&lt;/h2&gt;

&lt;p&gt;The accessibility settings panel lets users choose font sizes: Normal, Large, or Extra Large. Sounds trivial. The implementation taught me something about CSS architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwqy361bnm7t7ygznpi3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwqy361bnm7t7ygznpi3.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I changed the font size on the document root element. Every component using &lt;code&gt;rem&lt;/code&gt; units automatically scales. No prop drilling. No context providers. One line of JavaScript, and the entire app responds.&lt;/p&gt;

&lt;p&gt;The trick is building components with &lt;code&gt;rem&lt;/code&gt; from day one. If you've hardcoded pixel values everywhere, retrofitting accessibility becomes a rewrite. I got lucky, Tailwind's default classes use &lt;code&gt;rem&lt;/code&gt;, so most of my UI scaled correctly without changes.&lt;/p&gt;

&lt;p&gt;The settings persist in localStorage. A farmer who needs large fonts shouldn't re-enable them every session. That would be the kind of "technically accessible" that's practically useless.&lt;/p&gt;
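
&lt;p&gt;The whole mechanism fits in a few lines; the specific pixel values here are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const FONT_SIZES = { normal: "16px", large: "20px", xlarge: "24px" } as const;

function applyFontSize(size: keyof typeof FONT_SIZES): void {
  // The one line that scales the app: every rem-based component follows the root.
  document.documentElement.style.fontSize = FONT_SIZES[size];
  // Persist so the farmer never has to re-enable it.
  localStorage.setItem("fontSize", size);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;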

&lt;h2&gt;
  
  
  Touch Targets for Rough Hands
&lt;/h2&gt;

&lt;p&gt;Apple says 44px minimum for touch targets. I went bigger.&lt;/p&gt;

&lt;p&gt;Here's why: I watched my uncle try to use a banking app. His fingers are thick from decades of farm work. He kept hitting the wrong buttons. Not because he's clumsy, but because the app was designed by someone with smooth developer hands typing on a MacBook.&lt;/p&gt;

&lt;p&gt;My button sizes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard buttons: 48px minimum height&lt;/li&gt;
&lt;li&gt;Primary actions (like "Analyze"): 56px minimum&lt;/li&gt;
&lt;li&gt;Bottom navigation: 64px minimum&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bottom navigation placement matters too. Thumbs naturally rest at the bottom of the phone. Putting the main navigation there means one-handed use actually works. Web convention says nav goes at the top. Mobile ergonomics says that's wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Color That Works Without Reading
&lt;/h2&gt;

&lt;p&gt;The health score uses a five-color severity system:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnf9h020l9yu5gwt6gyl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnf9h020l9yu5gwt6gyl.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A user can glance at the color and know how worried to be. No reading required. The traffic light metaphor is universal, green means go, red means stop.&lt;/p&gt;

&lt;p&gt;I chose these specific colors for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Colorblind-safe progression&lt;/strong&gt;: The luminance (brightness) decreases from green to red, so even without color perception, the severity reads correctly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sunlight visibility&lt;/strong&gt;: High saturation colors remain distinguishable in bright outdoor light. Pastels wash out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cultural familiarity&lt;/strong&gt;: Traffic lights exist everywhere. The metaphor translates.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Progressive Disclosure
&lt;/h2&gt;

&lt;p&gt;A full diagnosis from Claude contains 25+ fields. Disease name, scientific name, confidence score, symptoms, causes, spread risk, urgency, treatments with ingredients and preparation steps, prevention tips, regional availability notes...&lt;/p&gt;

&lt;p&gt;Showing all of that at once would overwhelm anyone, let alone someone who struggles with text-heavy interfaces.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2g3b1uuafvh29u41560.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2g3b1uuafvh29u41560.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Settings That Actually Exist
&lt;/h2&gt;

&lt;p&gt;I built an accessibility settings panel with five toggles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Font Size&lt;/strong&gt;: Normal / Large / Extra Large&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Contrast&lt;/strong&gt;: On / Off&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Motion&lt;/strong&gt;: On / Off&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice Mode&lt;/strong&gt;: Off / On Request / Always On&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple Mode&lt;/strong&gt;: On / Off&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These settings persist in localStorage and apply instantly. The high contrast and reduced motion modes add CSS classes to the document root, the same pattern as font scaling.&lt;/p&gt;

&lt;p&gt;"Simple Mode" hides secondary information and removes visual flourishes. For users who find options themselves overwhelming, fewer choices mean clearer choices.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Built vs. the future
&lt;/h2&gt;

&lt;p&gt;I call it "the future" since some of these features can still be added.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's working:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Voice read-aloud for results (Web Speech API)&lt;/li&gt;
&lt;li&gt;Font size scaling (root-level CSS)&lt;/li&gt;
&lt;li&gt;Large touch targets (Tailwind minimums)&lt;/li&gt;
&lt;li&gt;Color-coded severity (5-level system)&lt;/li&gt;
&lt;li&gt;Progressive disclosure (collapsible sections)&lt;/li&gt;
&lt;li&gt;Settings persistence (localStorage)&lt;/li&gt;
&lt;li&gt;Bottom navigation (thumb-friendly placement)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What's partially done:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High contrast mode (setting exists, CSS needs work)&lt;/li&gt;
&lt;li&gt;Reduced motion (setting exists, animations still run)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What's not built yet:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Voice INPUT (speech-to-text for context entry)&lt;/li&gt;
&lt;li&gt;Multi-language support (Swahili, German, etc.)&lt;/li&gt;
&lt;li&gt;Offline mode for the actual analysis&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;a href="https://github.com/nicanor/shamba-dawa" rel="noopener noreferrer"&gt;Source code on GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>opensource</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>Shamba-MedCare Prompt Engineering</title>
      <dc:creator>Nicanor Korir</dc:creator>
      <pubDate>Mon, 01 Dec 2025 21:30:44 +0000</pubDate>
      <link>https://forem.com/nicanor_korir/shamba-medcare-prompt-engineering-5f9n</link>
      <guid>https://forem.com/nicanor_korir/shamba-medcare-prompt-engineering-5f9n</guid>
      <description>&lt;p&gt;Some background context:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I am building a simple plant disease diagnosis solution using AI, inspired by my farming background and advancements in intelligent technological tools&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can check out the &lt;a href="http://shamba-medcare.vercel.app/" rel="noopener noreferrer"&gt;Shamba-MedCare App here&lt;/a&gt;. Sorry, while it's in testing you'll have to use your own API keys until the public launch is available. The keys are stored in the browser's local storage, so they stay private.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdn4gszzoyu0qn7jngevv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdn4gszzoyu0qn7jngevv.png" alt=" " width="800" height="1104"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For context here, whenever you read LLM (Large Language Model), I mostly mean Claude. I like to use LLM since it's generic, and this solution can be fitted to any LLM.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I played around with several prompts in order to nail the best results. Here's how my prompt engineering journey with Shamba-MedCare evolved:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jao6jelvpcoku09ueqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jao6jelvpcoku09ueqr.png" alt=" " width="800" height="624"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My first prompt to LLM Vision was embarrassingly naive:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"What disease does this plant have?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The response was a 2,000-word essay about plant pathology in general. Helpful for a textbook. Useless for a farmer with a dying tomato plant. Getting AI to return &lt;strong&gt;structured, actionable, budget-aware diagnoses&lt;/strong&gt; took iteration. Here's what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4snwsw6mlb9e49xjw60g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4snwsw6mlb9e49xjw60g.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two prompts matter: the &lt;strong&gt;system prompt&lt;/strong&gt; (who the LLM, e.g. Claude, pretends to be) and the &lt;strong&gt;analysis prompt&lt;/strong&gt; (what to do with this specific image).&lt;/p&gt;

&lt;h2&gt;
  
  
  System Prompt: Creating "Shamba"
&lt;/h2&gt;

&lt;p&gt;Prompts work better with a persona. I created the &lt;strong&gt;Shamba persona&lt;/strong&gt;, an agricultural pathologist who:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are Shamba, an expert agricultural pathologist. You analyze
plant images to identify diseases, pests, and nutrient deficiencies.

Your expertise includes:
- 50+ crop types worldwide
- Fungal, bacterial, viral, and physiological disorders
- Traditional and modern treatment methods
- Practical advice for resource-limited farmers

Guidelines:
1. Always include at least one FREE/traditional treatment
2. Describe WHERE symptoms appear (for visual mapping)
3. Be honest about uncertainty—use confidence scores
4. Recommend professional help for severe cases
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key line: &lt;strong&gt;"Always include at least one FREE/traditional treatment."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without that explicit instruction, the LLM defaulted to commercial products. Helpful for a suburban gardener. Useless for a farmer who can't afford a $15 fungicide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure #1: The JSON Nightmare
&lt;/h2&gt;

&lt;p&gt;My first structured attempt asked LLM to return JSON, which it did pretty well. Wrapped in markdown code fences. With helpful commentary before and after.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Here's my analysis:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
json&lt;br&gt;
{ "disease": "Early Blight" }&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
This is a common fungal disease...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My parser choked. The fix was explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Return ONLY a valid JSON object. No markdown, no commentary,
no text before or after. Start with { and end with }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Still failed 10% of the time. So I added backend parsing that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Strips markdown fences if present&lt;/li&gt;
&lt;li&gt;Extracts JSON from surrounding text&lt;/li&gt;
&lt;li&gt;Validates against the expected schema&lt;/li&gt;
&lt;/ol&gt;
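
&lt;p&gt;Sketched in TypeScript for illustration (the actual backend may look different), those three steps are roughly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function extractDiagnosis(raw: string) {
  // 1. Strip markdown fences if present.
  const text = raw.replace(/```(?:json)?/g, "").trim();

  // 2. Extract the JSON object from any surrounding commentary.
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end === -1) throw new Error("No JSON object found in response");
  const data = JSON.parse(text.slice(start, end + 1));

  // 3. Validate against the expected schema (a few required fields shown here).
  for (const field of ["disease", "health_score", "treatments"]) {
    if (!(field in data)) throw new Error(`Missing required field: ${field}`);
  }
  return data;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;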

&lt;h2&gt;
  
  
  Failure #2: Location Descriptions
&lt;/h2&gt;

&lt;p&gt;For the visual heatmap feature, I needed LLM to describe WHERE damage appeared. My prompt asked for "affected regions."&lt;/p&gt;

&lt;p&gt;LLM returned: "The affected area is significant." That was not helpful; I needed something closer to exact coordinates, so I tried out several variations. This one was close to perfect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Describe affected regions with:
- Location(helpful for heatmaps): top-left, center, lower-right, edges, margins
- Coverage: percentage of area affected (e.g., "35%")
- Spread direction: "Moving from lower leaves upward."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now LLM returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"affected_regions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lower-left"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"severe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Dark brown lesions with concentric rings"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"coverage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"center"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"moderate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"coverage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's enough to generate a heatmap overlay for now.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqc89g3pyhnmm7bs75cj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqc89g3pyhnmm7bs75cj.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure #3: Treatment Cost Blindness
&lt;/h2&gt;

&lt;p&gt;Early on, treatments came out randomly ordered. Sometimes the $50 systemic fungicide appeared first. Sometimes, the free wood ash remedy.&lt;/p&gt;

&lt;p&gt;The problem: LLM has no inherent understanding of budget constraints. I had to structure it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Provide treatments in EXACTLY this order:
1. FREE TIER: Traditional/home remedies ($0)
2. LOW COST: Basic solutions ($1-5)
3. MEDIUM COST: Commercial organic ($5-20)
4. HIGH COST: Synthetic/professional ($20+)

Each tier must have at least one option if applicable.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response schema enforced this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"treatments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Wood ash paste"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"cost_tier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"free"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"estimated_cost"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ingredients"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Wood ash"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Water"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"application"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Apply directly to affected areas"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"availability"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Common from cooking fires"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Neem oil spray"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"cost_tier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"estimated_cost"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$1-3"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Plant Part-Specific Prompt Strategy
&lt;/h2&gt;

&lt;p&gt;Different plant parts reveal different problems, and I needed the right prompt to surface the right diagnosis with the best remedies. My prompt adapts:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys5hvnecnd073rv1j4t4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys5hvnecnd073rv1j4t4.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For leaves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Examine: color patterns, spot shapes, curling, holes, coating
Common issues: fungal spots, viral mosaic, nutrient chlorosis, pest damage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For roots:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Examine: color (white=healthy, brown/black=rot), texture, galls, structure
Common issues: root rot, nematode damage, waterlogging
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This focus improves accuracy dramatically. Asking LLM to look for "anything wrong" produces vague results. Asking it to specifically check for concentric ring patterns in leaf spots? Now we're diagnosing Early Blight.&lt;/p&gt;
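
&lt;p&gt;Assembling the prompt from part-specific guidance can be as simple as this TypeScript sketch (the stem entry and the helper name are made up for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const PART_GUIDANCE: { [part: string]: string } = {
  leaf: "Examine: color patterns, spot shapes, curling, holes, coating.",
  root: "Examine: color (white=healthy, brown/black=rot), texture, galls, structure.",
  stem: "Examine: lesions, cankers, bore holes, discoloration.",
};

function buildAnalysisPrompt(plantPart: string, cropType: string, context: string): string {
  return [
    `Analyze this ${plantPart} image from a ${cropType} plant.`,
    PART_GUIDANCE[plantPart] ?? "Examine the visible plant part for abnormalities.",
    `User's context: ${context}`,
    "Return ONLY a valid JSON object following the response schema.",
  ].join("\n");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;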

&lt;h2&gt;
  
  
  The Final Prompt Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SYSTEM PROMPT]
You are Shamba, an agricultural pathologist...

[ANALYSIS PROMPT]
Analyze this {plant_part} image from a {crop_type} plant.

User's context: {additional_context}

Provide:
1. Image validation (correct plant part? good quality?)
2. Health score (0-100)
3. Disease identification with confidence (0.0-1.0)
4. Affected region locations for visual mapping
5. Treatments by cost tier (FREE mandatory)
6. Prevention tips

Return as JSON following this schema:
{response_schema}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the output format first.&lt;/strong&gt; I designed prompts around what I wanted LLM to do. I should have designed around what the farmer needed to see.&lt;/p&gt;

&lt;p&gt;The heatmap feature was an afterthought. If I'd planned for it from day one, the location description format would have been baked in, not retrofitted. It's actually a useful feature for farmers: picture an affected plant, and the heatmap highlights the heavily affected areas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test with bad photos early.&lt;/strong&gt; My development photos were well-lit, centered, single-issue plants. Real farmer photos are blurry, shadowy, and show three problems at once. The robustness I needed only emerged after testing with garbage inputs.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/nicanor/shamba-dawa" rel="noopener noreferrer"&gt;Source code on GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>promptengineering</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Why I Built Shamba-MedCare (And What I Learned About Solving Real Problems)</title>
      <dc:creator>Nicanor Korir</dc:creator>
      <pubDate>Mon, 01 Dec 2025 20:50:41 +0000</pubDate>
      <link>https://forem.com/nicanor_korir/why-i-built-shamba-medcare-and-what-i-learned-about-solving-real-problems-425g</link>
      <guid>https://forem.com/nicanor_korir/why-i-built-shamba-medcare-and-what-i-learned-about-solving-real-problems-425g</guid>
      <description>&lt;p&gt;I grew up around farms in the Kenya highlands region. Of course, I am a farm boy 😂, and I watched farmers lose entire harvests because they couldn't identify a disease until it was too late. By the time they reached an expert, the damage was done.&lt;/p&gt;

&lt;p&gt;Most plant disease apps scan leaves and miss the essential parts of the plant, e.g., the branch, roots, and entire leaf area. This is what I think of the current solutions (with the limited research I've done, of course):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe31y59ibf2ddys5qrx07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe31y59ibf2ddys5qrx07.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Root rot starts underground, stem borers tunnel through stalks, and bark cankers spread silently. By the time symptoms reach the leaves, the farmer is already losing the war.&lt;/p&gt;

&lt;p&gt;So I built &lt;strong&gt;Shamba-MedCare&lt;/strong&gt;, from "Shamba" (farm) + "Dawa" (medicine) in Swahili, a simple solution focused on helping farmers, scientists, and others. Check it out here: &lt;a href="http://shamba-medcare.vercel.app/" rel="noopener noreferrer"&gt;Shamba-MedCare App&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfsa0ldm7gvz0pqe3ud7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfsa0ldm7gvz0pqe3ud7.png" alt="Shmaba MedCare" width="800" height="1063"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Approaches I Considered
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Option 1: Train a Custom CNN
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.kaggle.com/datasets/emmarex/plantdisease" rel="noopener noreferrer"&gt;PlantVillage dataset&lt;/a&gt; has 50,000+ labeled images, and MobileNetV3-small can hit 99.5% accuracy at just 1MB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The catch?&lt;/strong&gt; The images must have perfect lighting and clean backgrounds. My accuracy tanked the moment I tested with real field photos—muddy roots, partial shadows, multiple issues on one plant.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo3cruzvkn7ue2zeou7l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo3cruzvkn7ue2zeou7l.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: Use a Pre-built API (Plantix, PlantVillage Nuru)
&lt;/h3&gt;

&lt;p&gt;These are some of the existing solutions, and they work for different use cases. They give a classification with a confidence score, but classification alone doesn't save crops.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 3: Multimodal LLM (Claude Vision)
&lt;/h3&gt;

&lt;p&gt;This is where things got interesting.&lt;/p&gt;

&lt;p&gt;Claude doesn't just classify, it &lt;strong&gt;reasons&lt;/strong&gt;. I can ask it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Analyze this tomato leaf. The farmer says spots appeared 2 weeks ago after heavy rain. They can only afford traditional remedies. What's wrong and what should they do?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And it actually incorporates that context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkt2fdjxokli2yr52seqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkt2fdjxokli2yr52seqo.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trade-off I Made
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Custom CNN&lt;/th&gt;
&lt;th&gt;Claude Vision&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Works Offline&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contextual Explanations&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Novel Disease Handling&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-Request Cost&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;~$0.01-0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I chose Claude, not because it's perfect, but because &lt;strong&gt;a detailed explanation that is helpful to the farmer and saves a crop is worth more than a fast classification that misses context&lt;/strong&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;The core flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl01i77hxvoyx4m0j4nu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl01i77hxvoyx4m0j4nu.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every diagnosis includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Health score (0-100)&lt;/li&gt;
&lt;li&gt;Disease identification with confidence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual heatmap&lt;/strong&gt; showing WHERE damage is&lt;/li&gt;
&lt;li&gt;Treatment tiers: FREE → Low → Medium → High cost&lt;/li&gt;
&lt;li&gt;Traditional remedies that farmers already trust&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Inclusion
&lt;/h2&gt;

&lt;p&gt;I almost built this generically, for all farmers. Then I remembered: a 55-year-old farmer in rural Kenya with basic literacy is not the same as a 25-year-old agronomist with a smartphone addiction.&lt;/p&gt;

&lt;p&gt;So I added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Voice mode&lt;/strong&gt;: Results read aloud in clear speech while the user is scrolling through&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Huge touch targets&lt;/strong&gt;: 44px minimum, because field work means rough hands and busy farmers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bottom navigation&lt;/strong&gt;: One-handed operation while holding a plant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Icon-first design&lt;/strong&gt;: Pictures over text, and pictures with text&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tech that farmers can't use is just tech for my portfolio.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Next in this series&lt;/strong&gt;: How I structured the prompts to get consistent, budget-aware diagnoses (and damn, the 3 times it completely failed)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/nicque-cpu/shamba-dawa" rel="noopener noreferrer"&gt;Source code on GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>opensource</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>Optical Flow: How Robots (and maybe your Phone) See Motion</title>
      <dc:creator>Nicanor Korir</dc:creator>
      <pubDate>Fri, 14 Nov 2025 14:39:13 +0000</pubDate>
      <link>https://forem.com/nicanor_korir/optical-flow-how-robots-and-maybe-your-phone-see-motion-19ca</link>
      <guid>https://forem.com/nicanor_korir/optical-flow-how-robots-and-maybe-your-phone-see-motion-19ca</guid>
      <description>&lt;p&gt;Okay, so here's a weird question: how do you know something is moving?&lt;/p&gt;

&lt;p&gt;Like, right now, if I threw a ball at you, you'd catch it or try to. Not because you're doing complex calculations. You just &lt;em&gt;see&lt;/em&gt; it moving. Your brain processes the motion instantly, and your hands know where to be.&lt;/p&gt;

&lt;p&gt;But how? What's actually happening when you perceive motion?&lt;/p&gt;

&lt;p&gt;That's &lt;strong&gt;optical flow&lt;/strong&gt;. And honestly? Understanding optical flow changed how I think about vision in general. Let me explain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Coffee Cup Experiment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine you're sitting at a table with a coffee cup in front of you. The cup isn't moving. You're not moving. Everything is still.&lt;/p&gt;

&lt;p&gt;Now, I walk past you. As I walk, from your perspective, the background behind me seems to shift. The wall behind me appears to move in the opposite direction I'm walking. The floor seems to slide past.&lt;/p&gt;

&lt;p&gt;But here's the thing, nothing is actually moving except me. The wall isn't really moving. The floor isn't sliding. Your brain knows this because it's processing &lt;em&gt;relative motion&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;What you're actually seeing is, the &lt;em&gt;pixels in your visual field are changing position over time&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer (you, a camera, a robot, etc) and the scene.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optical flow is basically asking, "Which pixels are moving, and in which direction?"&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is Optical Flow So Important?
&lt;/h3&gt;

&lt;p&gt;Here's where it gets practical. Imagine a robot navigating through a hallway. How does it know it's moving forward? &lt;/p&gt;

&lt;p&gt;One way: it has odometers on its wheels, or it uses GPS, or it has a motion sensor. But what if those sensors break? Or what if it's in an environment where GPS doesn't work?&lt;/p&gt;

&lt;p&gt;Another way: the robot looks at what it sees and figures out, "Hey, everything in my visual field is moving away from the center. That means I'm moving forward." This is optical flow.&lt;/p&gt;

&lt;p&gt;If a robot is trying to catch a moving object, it needs to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the object moving, or am I moving?&lt;/li&gt;
&lt;li&gt;In which direction is it moving?&lt;/li&gt;
&lt;li&gt;How fast?&lt;/li&gt;
&lt;li&gt;Will it hit something?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this can be extracted from optical flow.&lt;/p&gt;

&lt;p&gt;Similarly, when your phone stabilizes video, it's using optical flow to detect camera shake and compensate for it. When a drone hovers in place without GPS, it's using optical flow to stay put.&lt;/p&gt;

&lt;h3&gt;
  
  
  But: What's Actually Happening?
&lt;/h3&gt;

&lt;p&gt;Let's go back to basics, you're looking at a video, let's say it's a video of a person walking across a room.&lt;/p&gt;

&lt;p&gt;Frame 1: You see the person at position X.&lt;br&gt;
Frame 2: You see the person at position X+5 pixels to the right.&lt;br&gt;
Frame 3: You see the person at position X+10 pixels to the right.&lt;/p&gt;

&lt;p&gt;Optical flow is literally: "The person moved 5 pixels to the right between frame 1 and 2, and another 5 pixels between frame 2 and 3."&lt;/p&gt;

&lt;p&gt;But it's not just about &lt;em&gt;where&lt;/em&gt; things moved. It's about the &lt;em&gt;pattern&lt;/em&gt; of motion across the entire image.&lt;/p&gt;

&lt;p&gt;Think of it like this: imagine you're looking at a piece of paper with arrows drawn on it. Each arrow points in a direction, and its length shows how far something moved. &lt;/p&gt;

&lt;p&gt;In a video of a person walking toward the camera:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The edges of the image show motion outward (things moving away)&lt;/li&gt;
&lt;li&gt;The center shows less motion&lt;/li&gt;
&lt;li&gt;The person's limbs show rapid motion (arms swinging)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you visualize all these arrows together, you get a &lt;em&gt;motion field&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Okay, But How Do You Actually Calculate It?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Basic Principle:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A pixel's brightness (or color) doesn't change much between consecutive frames, unless something moves.&lt;/p&gt;

&lt;p&gt;So if you see a pixel that was bright white in frame 1, and it's also bright white in frame 2, but a few pixels to the right, you can infer: "That pixel content moved to the right."&lt;/p&gt;

&lt;p&gt;This is called the &lt;strong&gt;brightness constancy assumption&lt;/strong&gt;: the intensity of a pixel remains constant as it moves.&lt;/p&gt;

&lt;p&gt;In math terms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I(x, y, t) = I(x + dx, y + dy, t + dt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This just means: "The brightness at position (x, y) at time t equals the brightness at the new position after movement."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Lucas-Kanade Method (One Popular Approach)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are many ways to calculate optical flow, one of the most famous is Lucas-Kanade. Here is how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Look at a small window of pixels (like a 3x3 or 5x5 grid)&lt;/li&gt;
&lt;li&gt;Find the best &lt;em&gt;motion vector&lt;/em&gt; (how far it moved, in which direction) that explains the change between frames&lt;/li&gt;
&lt;li&gt;Repeat for every pixel in the image&lt;/li&gt;
&lt;li&gt;You get a motion field, every pixel has an associated motion vector&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's like saying: "For this window, the best explanation for the change I see is that everything shifted 3 pixels to the right and 1 pixel down."&lt;/p&gt;
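
&lt;p&gt;To make the "best motion vector for a window" idea concrete, here's a toy TypeScript sketch using brute-force block matching. It's deliberately simpler than Lucas-Kanade proper (which solves for the vector from image gradients), but the spirit is the same. It assumes grayscale frames stored as flat arrays and a window that stays inside the image.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// prev and next are grayscale frames as flat arrays (row-major), width pixels wide.
// Returns the shift (dx, dy) that best explains the change around pixel (cx, cy).
function matchWindow(
  prev: Float32Array, next: Float32Array, width: number,
  cx: number, cy: number, win = 5, search = 7
): { dx: number; dy: number } {
  const half = Math.floor(win / 2);
  let best = { dx: 0, dy: 0 };
  let bestCost = Infinity;
  for (let dy = -search; dy &lt;= search; dy++) {
    for (let dx = -search; dx &lt;= search; dx++) {
      let cost = 0;
      for (let y = -half; y &lt;= half; y++) {
        for (let x = -half; x &lt;= half; x++) {
          const a = prev[(cy + y) * width + (cx + x)];
          const b = next[(cy + y + dy) * width + (cx + x + dx)];
          cost += (a - b) * (a - b); // sum of squared differences
        }
      }
      if (cost &lt; bestCost) { bestCost = cost; best = { dx, dy }; }
    }
  }
  return best; // "everything in this window shifted by (dx, dy)"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;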

&lt;p&gt;&lt;strong&gt;The Dense vs. Sparse Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sparse Optical Flow&lt;/strong&gt;: Track only a few key points (like corners or features). You end up with arrows pointing from frame 1 to frame 2 for a few hundred points.&lt;/p&gt;

&lt;p&gt;Advantage: Fast, works even with significant motion.&lt;br&gt;
Disadvantage: Doesn't tell you about the entire scene, just key points.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dense Optical Flow&lt;/strong&gt;: Calculate motion for &lt;em&gt;every&lt;/em&gt; pixel, every single pixel gets a motion vector.&lt;/p&gt;

&lt;p&gt;Advantage: Complete picture of motion.&lt;br&gt;
Disadvantage: Computationally expensive, can fail with large motion or occlusions.&lt;/p&gt;

&lt;p&gt;For a robot navigating a hallway? Sparse is usually enough. You just need to know the general motion pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Real Example: Following a Ball&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's say you're building a robot that needs to track a tennis ball.&lt;/p&gt;

&lt;p&gt;Frame 1: The ball is at position (100, 150) in the image.&lt;br&gt;
Frame 2: The ball is at position (115, 148) in the image.&lt;/p&gt;

&lt;p&gt;Optical flow detected: The ball moved 15 pixels right, 2 pixels up.&lt;/p&gt;

&lt;p&gt;Frame 3: The ball is at position (130, 145).&lt;/p&gt;

&lt;p&gt;Optical flow detected: The ball moved 15 pixels right, 3 pixels up.&lt;/p&gt;

&lt;p&gt;Now the robot can predict: "The ball is moving consistently to the right and slightly upward. At this rate, in the next frame it will be around (145, 142)."&lt;/p&gt;

&lt;p&gt;Extrapolate further, and the robot can predict where the ball will be and position itself to catch it. Optical flow is the vision equivalent of prediction.&lt;/p&gt;
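
&lt;p&gt;The prediction step itself is just constant-velocity extrapolation. A tiny sketch (image coordinates, so "up" means y decreases):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function predictNext(pos: { x: number; y: number }, flow: { dx: number; dy: number }) {
  return { x: pos.x + flow.dx, y: pos.y + flow.dy };
}

// From the frames above: at (130, 145), moving 15 px right and 3 px up per frame.
predictNext({ x: 130, y: 145 }, { dx: 15, dy: -3 }); // { x: 145, y: 142 }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;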

&lt;p&gt;&lt;strong&gt;The Challenges: When Optical Flow Fails&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 1: Occlusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine someone walks behind a tree. From the camera's perspective, the person disappears. Optical flow can't track what it can't see. The motion vectors suddenly stop.&lt;/p&gt;

&lt;p&gt;Robots have to be smart about this: "The person disappeared, but based on the last known motion vector, I predict they'll emerge here."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 2: Lighting Changes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Remember the brightness constancy assumption? It breaks if the lighting changes. If a cloud passes overhead and the entire scene gets darker, optical flow gets confused.&lt;/p&gt;

&lt;p&gt;It might think things moved when really just the lighting changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 3: Large Motion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If something moves really fast between frames, optical flow struggles. It expects motion to be small and smooth, think of fast action footage. Optical flow can't always keep up with rapid motion. This is why video codecs that use optical flow sometimes struggle with fast cuts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 4: Texture-less Regions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're looking at a blank wall, there are no features to track. Optical flow can't tell if the wall moved or not because there's nothing distinctive to latch onto.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 5: Reflections and Transparency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mirrors, windows, water, these break optical flow because the brightness doesn't correlate with actual motion.&lt;/p&gt;




&lt;h3&gt;
  
  
  Uses of Optical Flow
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Autonomous Vehicles&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Self-driving cars use optical flow to understand their motion relative to the scene. "The lane markings are flowing backward, which means I'm moving forward." It's also used to detect obstacles e.g. "That region isn't flowing like the background—something is there."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Video Compression&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When Netflix streams a video to you, it doesn't send every pixel every frame. It uses optical flow to predict motion: "In the next frame, these pixels will probably be here based on the motion I detected." Then it only sends the changes.&lt;/p&gt;

&lt;p&gt;This saves massive amounts of bandwidth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Video Stabilization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your phone camera detects motion between frames using optical flow. If it detects motion that seems like camera shake (small, jittery motion), it digitally shifts the image to compensate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Robotics Navigation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mobile robots use optical flow to navigate when other sensors fail. "I can see the environment is flowing past me, so I know I'm moving forward. If the flow pattern changes, something is blocking me."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Action Recognition&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're building a system that understands "What is happening in this video?", optical flow helps. Running looks different from walking looks different from falling, and these differences show up in the motion patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Frame Interpolation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ever seen a slow-motion video created from a regular video? Sometimes it uses optical flow to predict intermediate frames. "Between frame 1 and frame 3, based on the motion I see, frame 2 probably looked like this."&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;OpenCV documentation on optical flow: &lt;a href="https://docs.opencv.org/master/d4/dee/tutorial_optical_flow.html" rel="noopener noreferrer"&gt;https://docs.opencv.org/master/d4/dee/tutorial_optical_flow.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Textbook (accessible): "Digital Image Processing" by Gonzalez &amp;amp; Woods covers the image-processing fundamentals behind optical flow&lt;/li&gt;
&lt;li&gt;YouTube channel &lt;a href="https://www.youtube.com/user/sentdex" rel="noopener noreferrer"&gt;Sentdex&lt;/a&gt; has OpenCV tutorials including optical flow&lt;/li&gt;
&lt;li&gt;RAFT paper (modern deep learning approach): &lt;a href="https://arxiv.org/abs/2003.12039" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2003.12039&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;The reason why optical flow matters in robotics is that it's one of the fundamental ways a robot can understand the world without relying on explicit sensors. A robot with just a camera can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Know it's moving&lt;/li&gt;
&lt;li&gt;Detect obstacles&lt;/li&gt;
&lt;li&gt;Track objects&lt;/li&gt;
&lt;li&gt;Understand its environment&lt;/li&gt;
&lt;/ul&gt;
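
&lt;p&gt;If you want to see those flow vectors for yourself, here's a minimal Python sketch using OpenCV's dense (Farneback) optical flow, the same family of algorithms covered in the OpenCV tutorial linked above. The two image filenames are placeholders for consecutive video frames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import cv2

# Two consecutive frames, loaded in grayscale (placeholder filenames)
prev_frame = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
curr_frame = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Dense optical flow: one (dx, dy) motion vector per pixel
flow = cv2.calcOpticalFlowFarneback(
    prev_frame, curr_frame, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

# Convert the vectors to magnitude (how fast) and angle (which direction)
mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])

# A crude "am I moving?" check: the average motion across the whole frame
print("average motion per pixel:", mag.mean())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;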

</description>
      <category>ai</category>
      <category>robotics</category>
      <category>computervision</category>
      <category>nicanorkorir</category>
    </item>
    <item>
      <title>CNNs: from a beginner's point of view</title>
      <dc:creator>Nicanor Korir</dc:creator>
      <pubDate>Wed, 12 Nov 2025 18:09:12 +0000</pubDate>
      <link>https://forem.com/nicanor_korir/cnns-from-a-beginners-point-of-view-7ek</link>
      <guid>https://forem.com/nicanor_korir/cnns-from-a-beginners-point-of-view-7ek</guid>
<description>&lt;p&gt;I've learnt this topic about 20 times now; some explanations are a bit confusing, and of course, I know some of the core ideas. In this article, I am going to break down CNNs to make the basics, and maybe some of the advanced parts, easy to understand.&lt;/p&gt;




&lt;p&gt;Okay, from your perspective, how do you recognize your friend's face in a crowded room?&lt;/p&gt;

&lt;p&gt;Like, genuinely, what's happening in your brain? You're not calculating pixel values or comparing feature vectors. You just &lt;em&gt;see&lt;/em&gt; them and instantly think, "Oh, that's Sarah."&lt;/p&gt;

&lt;p&gt;Your brain is doing something incredibly sophisticated without you realizing it. And that's exactly what CNNs (&lt;strong&gt;Convolutional Neural Networks&lt;/strong&gt;) are trying to do. They're trying to teach computers to see and understand images the way your brain does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the Problem We're Trying to Solve with CNNs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before CNNs, people tried to use regular neural networks (fully connected networks) to process images. Here's how it worked: take an image, flatten it into a long list of numbers (every pixel becomes a number), and feed that into a neural network.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sorry, this will rush things, but stay with me here&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An image that's 224x224 pixels has about 50,000 pixels. If you have an RGB image (3 color channels), that's roughly 150,000 numbers. If your first hidden layer has 1000 neurons, you now have about 150 million weights to learn just in the first layer.&lt;/p&gt;

&lt;p&gt;That's massive. Your network becomes incredibly expensive to train, slow to run, and prone to overfitting (memorizing instead of learning).&lt;/p&gt;

&lt;p&gt;But here's the thing, and this is important: &lt;strong&gt;images have structure&lt;/strong&gt;. Pixels next to each other are related. An eye is an eye, whether it's in the top-left or bottom-right of your image. Your brain doesn't relearn what an eye looks like every time it's in a different position.&lt;/p&gt;

&lt;p&gt;So the question becomes: &lt;strong&gt;How do we build a neural network that understands this spatial structure and reuses knowledge across the image?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's where convolutions come in.&lt;/p&gt;

&lt;p&gt;In mathematics, &lt;a href="https://mathworld.wolfram.com/Convolution.html" rel="noopener noreferrer"&gt;&lt;strong&gt;convolution&lt;/strong&gt;&lt;/a&gt; is an operation that combines two functions to produce a third function, showing how one modifies or overlaps with the other as it shifts across it. In &lt;strong&gt;CNNs&lt;/strong&gt;, this idea is used to slide a filter across an image to detect features such as edges and patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Core Idea&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine you're looking at a painting. Instead of analyzing every millimeter of it at once, you use a small window to look at it piece by piece. You slide that window across the painting, examining each region.&lt;/p&gt;

&lt;p&gt;You might notice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In this region, there are strong diagonal lines (could be an arm)&lt;/li&gt;
&lt;li&gt;In that region, there's a curved edge (could be a face)&lt;/li&gt;
&lt;li&gt;Over there, there's a specific color pattern (could be hair)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, imagine you're looking for specific patterns, edges, corners, colors, and shapes. As you slide your window across the image, you're asking: "Does this pattern appear here? How strongly?"&lt;/p&gt;

&lt;p&gt;That's convolution.&lt;/p&gt;

&lt;p&gt;In math terms, you have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;An image&lt;/strong&gt; (the painting)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A filter/kernel&lt;/strong&gt; (your small window, usually 3x3 or 5x5)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A convolution operation&lt;/strong&gt; (sliding the filter across the image and computing a value for each position)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;strong&gt;filter&lt;/strong&gt; is like a feature detector; different filters detect different features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One filter might detect horizontal edges&lt;/li&gt;
&lt;li&gt;Another detects vertical edges&lt;/li&gt;
&lt;li&gt;Another detects corners&lt;/li&gt;
&lt;li&gt;Another detects specific textures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the magic: &lt;strong&gt;the network learns what these filters should be.&lt;/strong&gt; You don't hard-code "detect an edge." The network figures out, "To recognize images well, I should learn these specific filter patterns."&lt;/p&gt;

&lt;p&gt;Let's say you have a 5x5 image (tiny, for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a 3x3 filter (kernel):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1  0 -1
2  0 -2
1  0 -1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(This is actually a real filter, &lt;a href="https://de.wikipedia.org/wiki/Sobel-Operator" rel="noopener noreferrer"&gt;&lt;strong&gt;the Sobel filter&lt;/strong&gt;&lt;/a&gt;, that detects vertical edges)&lt;/p&gt;

&lt;p&gt;Convolution works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Place the filter on the top-left of the image&lt;/li&gt;
&lt;li&gt;Multiply each element of the filter by the corresponding image element&lt;/li&gt;
&lt;li&gt;Sum all those products&lt;/li&gt;
&lt;li&gt;That sum is the output for that position&lt;/li&gt;
&lt;li&gt;Slide the filter one position to the right, repeat&lt;/li&gt;
&lt;li&gt;When you reach the end of a row, move down and start from the left&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After sliding through the entire image, you get a new, slightly smaller image. That new image highlights where the filter's pattern appears strongly in the original image.&lt;/p&gt;

&lt;p&gt;Do this with multiple filters, and you get numerous feature maps. Each one shows where different patterns appear in the image.&lt;/p&gt;
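
&lt;p&gt;To make that concrete, here's a small Python/NumPy sketch that slides the 3x3 filter from the example above across the 5x5 example image, exactly following the steps listed. (Strictly speaking, deep learning frameworks do this slide-multiply-sum without flipping the kernel, which mathematicians call cross-correlation; the distinction doesn't matter here because the filters are learned anyway.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# The 5x5 example image and the 3x3 vertical-edge filter from above
image = np.array([[1, 2, 3, 4, 5],
                  [2, 3, 4, 5, 6],
                  [3, 4, 5, 6, 7],
                  [4, 5, 6, 7, 8],
                  [5, 6, 7, 8, 9]], dtype=float)

kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]], dtype=float)

# Slide the filter over every 3x3 patch: multiply, sum, store
out_h = image.shape[0] - kernel.shape[0] + 1   # 3
out_w = image.shape[1] - kernel.shape[1] + 1   # 3
output = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 3, j:j + 3]
        output[i, j] = np.sum(patch * kernel)

print(output)   # every value is -8.0: the example image brightens uniformly left to right
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;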

&lt;p&gt;&lt;strong&gt;Why This Is So Powerful&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because you're sliding the same filter across the image, you're using the same weights everywhere. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fewer parameters&lt;/strong&gt;: Instead of 150 million weights, maybe you have 9 (for a 3x3 filter) × the number of filters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weight sharing&lt;/strong&gt;: The network learns that certain patterns are important, and it looks for them everywhere&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Translation invariance&lt;/strong&gt;: An edge detector works whether the edge is in the top-left or bottom-right&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your network becomes smaller, faster, and smarter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's where it gets interesting. You don't just do one convolution, you stack them.&lt;/p&gt;

&lt;p&gt;After the first convolution, you get feature maps that detect simple patterns (edges, corners). Then you apply another convolution to those feature maps. Now you're detecting patterns &lt;em&gt;of patterns&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Maybe the second layer detects "edges arranged in a circular pattern" (detecting circles). The third layer might detect "circles with specific textures" (detecting eyes or wheels).&lt;/p&gt;

&lt;p&gt;By the time you're 10 layers deep, you're detecting high-level features: "This looks like a face," "This looks like a car," "This looks like a dog."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrtgre6ng9e7ew70155t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrtgre6ng9e7ew70155t.png" alt=" " width="800" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the hierarchy of features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1: Edges and corners
Layer 2: Simple shapes (circles, lines arranged together)
Layer 3: Textures and patterns
Layer 4: Parts of objects (wheels, fur, eyes)
Layer 5+: Whole objects (cars, animals, faces)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mirrors how your brain works. You see edges first, then recognize that those edges form a nose, then recognize that a nose is part of a face.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0gfe9uxolfbgq8sgmjr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0gfe9uxolfbgq8sgmjr.png" alt=" " width="285" height="177"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pooling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Pooling_layer" rel="noopener noreferrer"&gt;Pooling &lt;/a&gt; in CNNs reduces the size of feature maps by summarizing small regions (like taking the maximum value in a 2×2 area), so the network keeps the most important information while becoming more efficient. It helps make feature detection more stable, even if an object shifts slightly in the image. The most common method is &lt;strong&gt;max pooling&lt;/strong&gt;, where you take the maximum value in that region.&lt;/p&gt;

&lt;p&gt;Why? Because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It reduces the spatial size (fewer numbers to process)&lt;/li&gt;
&lt;li&gt;It makes the network more robust to small shifts (if a feature moves slightly, max pooling will still find it)&lt;/li&gt;
&lt;li&gt;It emphasizes the strongest features (the maximum value is usually the most important)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtynwvgi3bzygfnrgxir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtynwvgi3bzygfnrgxir.png" alt=" " width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;
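
&lt;p&gt;Here's a tiny Python/NumPy sketch of 2x2 max pooling on a made-up 4x4 feature map, just to show how little is going on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# A made-up 4x4 feature map
feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 2],
                        [7, 2, 9, 5],
                        [3, 1, 4, 8]], dtype=float)

# 2x2 max pooling with stride 2: keep the largest value in each 2x2 block
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(pooled)
# [[6. 2.]
#  [7. 9.]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;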

&lt;p&gt;&lt;strong&gt;An Actual CNN Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let me show you what a simple CNN looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input Image (224x224x3)
    ↓
Convolution (32 filters, 3x3) → Output: 224x224x32
ReLU activation
Max Pooling (2x2) → Output: 112x112x32
    ↓
Convolution (64 filters, 3x3) → Output: 112x112x64
ReLU activation
Max Pooling (2x2) → Output: 56x56x64
    ↓
Convolution (128 filters, 3x3) → Output: 56x56x128
ReLU activation
Max Pooling (2x2) → Output: 28x28x128
    ↓
Flatten → 28*28*128 = 100,352 values
    ↓
Fully Connected Layer (256 neurons)
ReLU activation
    ↓
Fully Connected Layer (10 neurons) → Output: probabilities for 10 classes
    ↓
Softmax → Final prediction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is making the data smaller but richer. By the end, instead of 224x224 pixels, you have 10 numbers representing "how confident am I that this is a [cat/dog/bird/etc]?"&lt;/p&gt;
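
&lt;p&gt;If it helps to see that stack as code, here's a minimal sketch in PyTorch (my choice of framework; the article isn't tied to any). The layer shapes follow the diagram above, and the final softmax is left to the loss function, as is common in PyTorch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),    # 224x224x32
            nn.ReLU(),
            nn.MaxPool2d(2),                               # 112x112x32
            nn.Conv2d(32, 64, kernel_size=3, padding=1),   # 112x112x64
            nn.ReLU(),
            nn.MaxPool2d(2),                               # 56x56x64
            nn.Conv2d(64, 128, kernel_size=3, padding=1),  # 56x56x128
            nn.ReLU(),
            nn.MaxPool2d(2),                               # 28x28x128
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                  # 28*28*128 = 100,352 values
            nn.Linear(28 * 28 * 128, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),                   # raw scores for 10 classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SimpleCNN()
dummy = torch.randn(1, 3, 224, 224)   # one fake RGB image
print(model(dummy).shape)             # torch.Size([1, 10])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;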

&lt;p&gt;&lt;strong&gt;Okay, But How Do You Actually Train This?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The process is similar to regular neural networks, but the convolutions make it special:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Forward pass&lt;/strong&gt;: Image goes through the layers, producing a prediction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loss calculation&lt;/strong&gt;: Compare prediction to ground truth. "I said dog, it was actually a cat. That's wrong."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpropagation&lt;/strong&gt;: Calculate gradients through all the layers, including the convolutional layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update filters&lt;/strong&gt;: Adjust the filter weights so they become better at detecting useful features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeat&lt;/strong&gt;: Do this thousands of times until the network gets better&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The network automatically learns what filters to use. You don't tell it "detect edges." It figures it out because detecting edges helps it recognize objects better.&lt;/p&gt;
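
&lt;p&gt;In code, that training loop is only a handful of lines. Here's a hedged PyTorch sketch that reuses the SimpleCNN from above; the tiny random dataset is a stand-in for real labelled images:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn as nn

# A tiny fake dataset so the loop runs end to end (replace with real images/labels)
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))
train_loader = [(images, labels)]

model = SimpleCNN()                        # the model sketched above
criterion = nn.CrossEntropyLoss()          # how wrong was the prediction?
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):                     # repeat: in practice, thousands of steps
    for batch_images, batch_labels in train_loader:
        outputs = model(batch_images)              # 1. forward pass
        loss = criterion(outputs, batch_labels)    # 2. loss calculation
        optimizer.zero_grad()
        loss.backward()                            # 3. backpropagation through every layer
        optimizer.step()                           # 4. update the filter weights
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;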




&lt;p&gt;&lt;strong&gt;Real-World Applications&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Medical Imaging&lt;/strong&gt;: Detecting tumors in X-rays, CT scans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous Vehicles&lt;/strong&gt;: Detecting pedestrians, traffic signs, and lane markings. CNNs can process camera feeds in real-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Social Media&lt;/strong&gt;: Instagram uses CNNs for content recommendation, Facebook for face detection, and TikTok for understanding video content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Satellite Imagery&lt;/strong&gt;: Detecting changes in landscapes, tracking deforestation, and counting crops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Control&lt;/strong&gt;: Manufacturing plants use CNNs to detect defects in products at superhuman speeds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E-commerce&lt;/strong&gt;: Product recognition, visual search (take a photo of something, find similar items online)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;The Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I don't want to oversell this, CNNs have real limitations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. They Need Lots of Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unlike humans, who can learn from a few examples, CNNs need thousands. Transfer learning helps, but it's still data-hungry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. They're Brittle&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A CNN trained to recognize a dog might be completely fooled by a tiny, carefully crafted perturbation of the image. Humans see it as obviously still a dog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. They Don't Understand Context&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A CNN might recognize all the objects in an image perfectly, but miss the relationship between them. It sees "cat," "couch," but doesn't understand "cat sitting on couch."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. They're Black Boxes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can visualize what they learned, but explaining &lt;em&gt;why&lt;/em&gt; a specific prediction was made is hard. This matters for medical or legal applications where you need explainability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. They're Computationally Expensive&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Running inference requires significant resources, especially for complex models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What next?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For me, I'll create a practical example. I am also doing product recognition and categorization for a warehouse using different tools and technologies. As for you, tell me in the comments or on social media what you're working on, and we can chat about it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>robotics</category>
      <category>machinelearning</category>
      <category>softwareengineering</category>
    </item>
    <item>
<title>Robot Imitation: A Gentle Intro</title>
      <dc:creator>Nicanor Korir</dc:creator>
      <pubDate>Wed, 12 Nov 2025 06:47:20 +0000</pubDate>
      <link>https://forem.com/nicanor_korir/robot-immitation-a-gentle-intro-30hc</link>
      <guid>https://forem.com/nicanor_korir/robot-immitation-a-gentle-intro-30hc</guid>
      <description>&lt;p&gt;You know that feeling when you're trying to learn something new, and the best way is just to watch someone do it first? That's kind of what robot imitation is about.&lt;/p&gt;

&lt;p&gt;Robot imitation, or "learning from demonstration," is when a robot watches a human (or another robot, or even a video) perform a task, and then tries to reproduce that same task. The robot is basically saying, "I saw what you did, now I'm going to do it too."&lt;/p&gt;

&lt;p&gt;But here's where it gets interesting, the robot isn't just recording your movements like a video playback. It's learning the &lt;em&gt;underlying pattern&lt;/em&gt; of what you're doing. It's figuring out, "Oh, I see, when the human's hand moves here, the gripper opens. When it moves there, the gripper closes. When there's resistance, the force increases."&lt;/p&gt;

&lt;p&gt;The robot is extracting the &lt;em&gt;meaning&lt;/em&gt; behind your actions, not just copying pixel-for-pixel movements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Does This Matter? Why Not Just Program Everything?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you wanted a robot to pick up a coffee mug, you could:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Option A: Program it manually&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calculate the exact coordinates where the mug is&lt;/li&gt;
&lt;li&gt;Program the exact angle to approach it&lt;/li&gt;
&lt;li&gt;Set the exact force to grip it without breaking it&lt;/li&gt;
&lt;li&gt;Account for different mug sizes, weights, and handle positions&lt;/li&gt;
&lt;li&gt;Do this for every single object the robot might encounter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This takes forever, and the moment something changes, a slightly different mug, a mug in a slightly different position, the whole thing breaks.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Option B: Show the robot how to do it&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grab a mug for yourself a few times&lt;/li&gt;
&lt;li&gt;Let the robot watch and learn&lt;/li&gt;
&lt;li&gt;The robot figures out the pattern&lt;/li&gt;
&lt;li&gt;Now it can grab mugs it's never seen before&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Option B is way more efficient, right? This is why robot imitation is becoming so important.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Basic Idea&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you show a robot how to do something, you're teaching it several layers of information. Let me break this down:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Perception: What does the robot see?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The robot needs to understand what's in front of it. This usually involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Computer vision (cameras looking at the scene)&lt;/li&gt;
&lt;li&gt;Identifying objects ("That's a mug")&lt;/li&gt;
&lt;li&gt;Understanding spatial relationships ("The mug is to the left of the plate")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is actually one of the hardest parts. Humans do this instantly. Robots? They need to be trained to recognize what they're looking at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Action: What does the robot do?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once it understands the scene, what moves does it make?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hand/gripper positioning&lt;/li&gt;
&lt;li&gt;Force applied&lt;/li&gt;
&lt;li&gt;Speed of movement&lt;/li&gt;
&lt;li&gt;Timing of actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The robot records: "When I see object X at position Y, I move my arm like this, with this amount of force."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The Logic: Why does the robot do it that way?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the tricky part. The robot needs to understand not just &lt;em&gt;what&lt;/em&gt; you did, but &lt;em&gt;why&lt;/em&gt; you did it that way.&lt;/p&gt;

&lt;p&gt;For example, if you're picking up a mug:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You grab from the handle (not the hot part of the mug)&lt;/li&gt;
&lt;li&gt;You move slowly (not jerky movements)&lt;/li&gt;
&lt;li&gt;You apply enough force to hold it, but not crush it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good imitation learning system figures out these principles and applies them to new situations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Does the Robot Actually Learn This?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Okay, so the robot is watching you. But how does it translate what it sees into something it can do?&lt;/p&gt;

&lt;p&gt;There are a few main approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 1: Behavioural Cloning (The Simplest Way)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is basically supervised learning. Here's how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A human demonstrates a task multiple times (let's say, picking up different objects)&lt;/li&gt;
&lt;li&gt;The robot records: what it sees (camera input) and what the human does (hand movements, gripper position, force)&lt;/li&gt;
&lt;li&gt;This becomes training data: "When you see this image, the action is this"&lt;/li&gt;
&lt;li&gt;We train a neural network: "Learn the pattern between images and actions"&lt;/li&gt;
&lt;li&gt;Now the robot can predict: "I see this, so I should do that"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's like learning to drive by watching tons of videos of good drivers. You see what they do in different situations, and your brain learns the pattern.&lt;/p&gt;

&lt;p&gt;The limitation? The robot learns to copy &lt;em&gt;exactly&lt;/em&gt; what it saw. If something is slightly different, a different angle, a different object, it might fail.&lt;/p&gt;
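
&lt;p&gt;To show how literal the "supervised learning" framing is, here's a minimal behavioural cloning sketch in Python with PyTorch (my choice of framework, not something any particular robot prescribes). The observations and actions are random placeholders standing in for recorded camera features and gripper commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn as nn

# Placeholder demonstration data: 500 recorded (observation, action) pairs.
# In a real system, observations come from cameras and joint encoders, and
# actions are the human's recorded arm/gripper commands.
observations = torch.randn(500, 32)   # 32-dim observation features (made up)
actions = torch.randn(500, 7)         # 7-dim action, e.g. a 7-joint arm command (made up)

# A small network mapping "what the robot sees" to "what it should do"
policy = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 7),
)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()   # how far is my action from the demonstrated action?

for step in range(1000):
    predicted = policy(observations)      # what would I do in these situations?
    loss = loss_fn(predicted, actions)    # compare against what the human did
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At execution time: feed in a new observation, get a predicted action
new_observation = torch.randn(1, 32)
print(policy(new_observation))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;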

&lt;p&gt;&lt;strong&gt;Approach 2: Learning the Underlying Policy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of just copying, the robot tries to learn the &lt;em&gt;rules&lt;/em&gt; of what's happening.&lt;/p&gt;

&lt;p&gt;Think of it like learning a recipe, not just watching someone cook once. You're trying to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's the goal?&lt;/li&gt;
&lt;li&gt;What are the important steps?&lt;/li&gt;
&lt;li&gt;What can vary, and what can't?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The robot learns to generalize. It doesn't just copy, it adapts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 3: Inverse Reinforcement Learning (The Sneaky Approach)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This one's wild. Instead of the robot learning "do this," it learns "what is the human trying to optimize for?"&lt;/p&gt;

&lt;p&gt;Here's the idea: when a human does a task, they're implicitly optimizing for something. When you pick up a mug carefully, you're optimizing for "don't break the mug and don't spill coffee." The robot tries to figure out what you're optimizing for, then uses that as a reward signal.&lt;/p&gt;

&lt;p&gt;The robot is essentially asking: "What's the hidden objective here?"&lt;/p&gt;

&lt;p&gt;This is more advanced, but it's powerful because the robot learns the &lt;em&gt;intent&lt;/em&gt;, not just the movements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example: Teaching a Robot to Cook&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let me make this concrete with something you might actually do.&lt;/p&gt;

&lt;p&gt;Imagine we want to teach a robot to make scrambled eggs. Here's how imitation learning would work:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Demonstration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A chef (or you) makes scrambled eggs in front of the robot. The robot's cameras record:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where the ingredients are&lt;/li&gt;
&lt;li&gt;How the chef moves&lt;/li&gt;
&lt;li&gt;What the chef is looking at&lt;/li&gt;
&lt;li&gt;The timing of actions (when to stir, when to stop)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The robot also records data from sensors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Heat level of the pan&lt;/li&gt;
&lt;li&gt;How long things cook&lt;/li&gt;
&lt;li&gt;The texture of the eggs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Feature Extraction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system figures out what matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Okay, the chef stirred when the edges started to solidify"&lt;/li&gt;
&lt;li&gt;"The chef removed it from the heat when it looked creamy"&lt;/li&gt;
&lt;li&gt;"The chef tasted it to check doneness"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the meaningful patterns for the robot, of course.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Learning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The robot creates a model: "When I see eggs with these characteristics, I should stir. When they look like this, I should stop."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Execution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The robot makes scrambled eggs on its own. It might not be &lt;em&gt;exactly&lt;/em&gt; like the chef made it (maybe slightly different timing), but it captures the essence of what makes good scrambled eggs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Adaptation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the first batch isn't perfect, the system can learn from the mistake. "Oh, I stirred too late, next time I'll stir earlier." This is where imitation learning becomes even more powerful, it's not just one-shot learning, it improves over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Challenges: Why Robot Imitation Is Still Hard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm not going to pretend this is easy. Some of the problems include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1: The Distribution Shift&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The robot learns from demonstrations, but the real world is messier. What if the mug is in a slightly different position? What if the lighting is different? What if the object is a different size?&lt;/p&gt;

&lt;p&gt;When the robot encounters something &lt;em&gt;different&lt;/em&gt; from what it trained on, it often fails. This is called "distribution shift": the robot is good at things that look like the training data, but bad at things that don't.&lt;/p&gt;

&lt;p&gt;This is a huge research problem right now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2: The Human-Robot Gap&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Humans have bodies that are very different from robot bodies. A human has 206 bones and hundreds of muscles and joints that work together with incredible skill. Most robots have maybe 6-7 degrees of freedom (ways to move).&lt;/p&gt;

&lt;p&gt;When a human shows you how to pick something up, they're using their whole body: balance, finger flexibility, and tactile feedback. Translating that to a robot is non-trivial; this is the biggest challenge right now, although there have been a few breakthroughs.&lt;/p&gt;

&lt;p&gt;One way researchers handle this is by mapping human movements to robot movements: "When the human's hand moves like this, the robot's gripper moves like this." But it's imperfect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3: The Reward Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How do you know if the robot did the task "correctly"? For some tasks, it's obvious (did the egg get cooked?). For others, it's fuzzy (did you fold the laundry neatly enough?).&lt;/p&gt;

&lt;p&gt;Defining what success looks like is harder than it sounds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4: Data Quality&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Garbage in, garbage out is the norm in robotics. If your demonstrations are bad, your robot will learn bad behavior. If you show the robot ten different ways to do something without explaining why you did it differently, it gets confused.&lt;/p&gt;

&lt;p&gt;Getting good demonstration data is actually a real bottleneck in robot imitation learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Is Robot Imitation Being Used Right Now?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Industrial Robots&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Companies are using imitation learning to train robots for assembly tasks. Instead of programming every detail, they show the robot the task, and it learns. This dramatically cuts down setup time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Robotic Manipulation (Grasping and Picking)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There's active research on robots that can pick objects they've never seen before by learning from human demonstrations. This is used in warehouses and manufacturing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Robotic Surgery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Surgeons perform procedures, and the system records their movements. This data helps train surgical robots to assist or even automate certain tasks. Obviously, this requires extreme precision and validation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Autonomous Vehicles&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Self-driving cars learn by watching human drivers. The car observes: "In this situation, the human turned the wheel like this, at this speed." Over millions of miles of data, the car learns driving patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Robot Learning from Videos&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Researchers are now training robots using YouTube videos and internet-scale data. The robot is learning from millions of human demonstrations. This is cutting-edge stuff, but it's happening.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources to Learn More&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want to dive deeper (and you should):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Papers&lt;/strong&gt;: &lt;a href="https://arxiv.org/abs/2401.08381" rel="noopener noreferrer"&gt;Robot Immitation from Human Action&lt;/a&gt;, &lt;a href="https://www.researchgate.net/publication/385858967_Imitation_Learning_for_Robotics_Progress_Challenges_and_Applications_in_Manipulation_and_Teleoperation" rel="noopener noreferrer"&gt;Imitation Learning for Robotics: Progress, Challenges, and Applications in Manipulation and Teleoperation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Books&lt;/strong&gt;: "Robotics, Vision and Control" by Peter Corke, &lt;strong&gt;Imitation Learning for Robots: Building a Strong Foundation&lt;/strong&gt; by Von Jacob&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>robotics</category>
      <category>cv</category>
      <category>ai</category>
      <category>nicanorkorir</category>
    </item>
    <item>
      <title>Getting started with Robotics</title>
      <dc:creator>Nicanor Korir</dc:creator>
      <pubDate>Tue, 11 Nov 2025 06:24:31 +0000</pubDate>
      <link>https://forem.com/nicanor_korir/getting-started-with-robotics-3f0g</link>
      <guid>https://forem.com/nicanor_korir/getting-started-with-robotics-3f0g</guid>
      <description>&lt;p&gt;You've probably seen a robot doing something cool. Maybe it's one of those automatic vacuum cleaners that somehow knows when you've spilt coffee on the floor and, boom, five minutes later, it's sparkling clean. Or maybe you've seen those robodogs on the internet and thought, "That's insane, I want to control that." Well, guess what? You can.&lt;/p&gt;

&lt;p&gt;Here's the truth, though: robotics looks intimidating from the outside, but it's actually both hard and easy at the same time. Hard if you try to understand &lt;em&gt;everything&lt;/em&gt; at once. Easy if you focus on individual pieces and build from there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtyku1bguso3ahfptyqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtyku1bguso3ahfptyqc.png" alt=" " width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to Robotics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a Robot?
&lt;/h3&gt;

&lt;p&gt;A robot is basically a machine that can sense its &lt;strong&gt;environment&lt;/strong&gt;, make &lt;strong&gt;decisions&lt;/strong&gt;, and take &lt;strong&gt;action&lt;/strong&gt;. That's it. Sounds simple, right? But the magic happens in how you combine these three things.&lt;/p&gt;

&lt;p&gt;Think about that vacuum cleaner again. It has sensors (cameras, bump sensors) that tell it "hey, there's dirt here" or "I hit a wall." It has a controller (the robot's brain) that processes this information and decides what to do. And it has actuators (motors) that actually do the work, spinning brushes, and moving wheels. All three working together make one smart robot.&lt;/p&gt;
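
&lt;p&gt;That sense-decide-act cycle is literally the skeleton of most robot control code. Here's a toy Python sketch, with made-up sensor and motor functions standing in for real hardware drivers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random
import time

def read_bump_sensor():
    # Stand-in for real hardware: pretend we bump into something 20% of the time
    return random.random() &amp;lt; 0.2

def drive_forward():
    print("driving forward")

def turn_left():
    print("bumped something, turning left")

# The sense, decide, act loop
for _ in range(10):
    bumped = read_bump_sensor()   # 1. sense the environment
    if bumped:                    # 2. decide what to do
        turn_left()               # 3. act
    else:
        drive_forward()
    time.sleep(0.1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;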

&lt;h3&gt;
  
  
  How Robots Impact the World Today
&lt;/h3&gt;

&lt;p&gt;Honestly? Robots are everywhere now. In factories, they're assembling everything from cars to phones. In hospitals, they're assisting with surgery. Amazon warehouses are packed with robots moving packages. And in your home, you've got that vacuum, maybe a robot lawn mower, perhaps a smart speaker that's technically a robot too.&lt;/p&gt;

&lt;p&gt;The impact isn't just about replacing human jobs (though that conversation is real). It's also about doing things humans can't do efficiently, repetitive tasks, dangerous environments, and precision work at scale. Robots are the reason manufacturing is faster, safer surgeries happen, and, honestly, why you can get packages delivered so quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Different Branches of Robotics
&lt;/h3&gt;

&lt;p&gt;Robotics isn't one thing. It's actually several fields working together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Industrial Robotics&lt;/strong&gt;: Factory robots, assembly lines, manufacturing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mobile Robotics&lt;/strong&gt;: Robots that move around (vacuum cleaners, delivery robots, drones)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manipulation&lt;/strong&gt;: Robotic arms and hands (think surgical robots or factory arms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Humanoids&lt;/strong&gt;: Robots that look and act like humans (still mostly experimental)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous Vehicles&lt;/strong&gt;: Self-driving cars and similar tech&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swarm Robotics&lt;/strong&gt;: Multiple robots coordinating together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't need to master all of these. Pick one that excites you and start there; the basics are the same across all of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Robotics Concepts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Sensors &amp;amp; Actuators: How Robots Sense and Move
&lt;/h3&gt;

&lt;p&gt;Every robot needs to know what's happening around it. That's where &lt;strong&gt;sensors&lt;/strong&gt; come in. They're like the robot's eyes, ears, and skin.&lt;/p&gt;

&lt;p&gt;Common sensors include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cameras&lt;/strong&gt;: Let the robot "see"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiDAR&lt;/strong&gt;: Measures distance using light (great for navigation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IMU (Inertial Measurement Unit)&lt;/strong&gt;: Detects motion and orientation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ultrasonic Sensors&lt;/strong&gt;: Measure distance using sound&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bump Sensors&lt;/strong&gt;: Simple "did I hit something?" switches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the robot knows what's happening, it needs to &lt;em&gt;do&lt;/em&gt; something about it, and voila, &lt;strong&gt;actuators&lt;/strong&gt;. These are the motors and mechanisms that make the robot move or manipulate things.&lt;/p&gt;

&lt;p&gt;Types of actuators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DC Motors&lt;/strong&gt;: Simple, common, good for wheels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Servo Motors&lt;/strong&gt;: Precise positioning, great for robotic arms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stepper Motors&lt;/strong&gt;: Very precise, often used in 3D printers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linear Actuators&lt;/strong&gt;: Push or pull in a straight line&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Sensors tell the robot what's happening, actuators make it happen.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Controllers: The Robot's Brain
&lt;/h3&gt;

&lt;p&gt;The controller is the &lt;strong&gt;decision-maker&lt;/strong&gt;. It's the microcontroller or computer that reads sensor data and decides what the actuators should do.&lt;/p&gt;

&lt;p&gt;Common controllers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Arduino&lt;/strong&gt;: Great for beginners, affordable, tons of tutorials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raspberry Pi&lt;/strong&gt;: More powerful, can run full operating systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time Controllers&lt;/strong&gt;: For complex industrial robots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialized Chips&lt;/strong&gt;: NVIDIA Jetson for AI-heavy tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've been playing around with Arduino, and I'll create a few articles on Arduino and Raspberry Pi in the future.&lt;/p&gt;

&lt;h3&gt;
  
  
  Power Systems
&lt;/h3&gt;

&lt;p&gt;Robots need power; without power, nothing much will happen.&lt;/p&gt;

&lt;p&gt;A few basics (always follow the manufacturer's instructions that come with your kit):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Voltage and Current&lt;/strong&gt;: Don't mix them up. Voltage is pressure, current is flow. Too much of either can fry your robot or hurt you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batteries&lt;/strong&gt;: Usually 9V, 12V, or LiPo batteries. Match the voltage to your robot's needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fuses&lt;/strong&gt;: These are your safety net. They blow if something goes wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heat Dissipation&lt;/strong&gt;: Motors and controllers generate heat. Ventilation matters.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Hardware vs. Software in Robotics
&lt;/h3&gt;

&lt;p&gt;Here's the thing: robotics is 50/50 hardware and software. You can have amazing code, but if your motors are wired wrong, nothing happens. You can have perfect hardware, but without good control software, your robot is just a paperweight.&lt;/p&gt;

&lt;p&gt;Both matter equally. This is why hands-on learning is crucial: you can't just read about robotics, you have to build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Robotics Skills for Beginners
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Programming Basics
&lt;/h3&gt;

&lt;p&gt;You'll need to code, and Python is the best language to start with: it's readable, forgiving, and widely used in robotics.&lt;/p&gt;

&lt;p&gt;You don't need to be a master programmer. Basic concepts are enough:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Variables and data types&lt;/li&gt;
&lt;li&gt;Loops and conditionals&lt;/li&gt;
&lt;li&gt;Functions&lt;/li&gt;
&lt;li&gt;Working with libraries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spend a week or two learning Python fundamentals. Then move to robotics-specific libraries. Below, I'll introduce the robotics software that will give you the momentum to kickstart your robotics journey.&lt;/p&gt;

&lt;h3&gt;
  
  
  Electronics Fundamentals (Voltage, Current, Motors)
&lt;/h3&gt;

&lt;p&gt;You don't need a degree in electrical engineering. Just understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Voltage&lt;/strong&gt;: Think of it as electrical pressure (measured in volts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Current&lt;/strong&gt;: How much electricity flows (measured in amps)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resistance&lt;/strong&gt;: Opposition to flow (measured in ohms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ohm's Law&lt;/strong&gt;: V = I × R (this is important)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Practical skills:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read a circuit diagram&lt;/li&gt;
&lt;li&gt;Use a multimeter to check voltage&lt;/li&gt;
&lt;li&gt;Solder wires together&lt;/li&gt;
&lt;li&gt;Connect motors to controllers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;YouTube has tons of beginner electronics tutorials. Watch a few before you touch anything.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Mechanics (Gears, Joints, Motion)
&lt;/h3&gt;

&lt;p&gt;Physics matters: understanding how gears work, how joints move, and how force transfers makes you a better roboticist.&lt;/p&gt;

&lt;p&gt;Basic concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gears&lt;/strong&gt;: Transfer power and change speed/torque&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Joints&lt;/strong&gt;: Allow movement in specific directions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Friction&lt;/strong&gt;: Affects movement and efficiency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Torque&lt;/strong&gt;: Rotational force (important for motors)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction to AI in Robotics
&lt;/h2&gt;

&lt;p&gt;AI is becoming central to robotics. Your robot needs to make decisions based on sensor input. That's where AI comes in.&lt;/p&gt;

&lt;p&gt;For beginners:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with simple logic (if this, then that)&lt;/li&gt;
&lt;li&gt;Move to basic machine learning (object detection with pre-trained models)&lt;/li&gt;
&lt;li&gt;Eventually explore reinforcement learning (robot learning through trial and error)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't need to implement cutting-edge AI. Use existing libraries like TensorFlow or PyTorch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simulation &amp;amp; Virtual Robotics
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why Use Simulators?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Simulations are cheap, fast, and forgiving. You can crash a simulated robot a thousand times without spending a dime. You can test algorithms in minutes instead of hours. And you can focus on the software without worrying about hardware limitations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Popular Simulators:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gazebo&lt;/strong&gt;: Open-source, free, industry-standard for ROS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Webots&lt;/strong&gt;: Beginner-friendly, good documentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyBullet&lt;/strong&gt;: Physics engine, great for reinforcement learning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA Isaac Sim&lt;/strong&gt;: Cutting-edge, free, powerful&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unity ML-Agents&lt;/strong&gt;: Game engine + AI training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MujoCo&lt;/strong&gt;: Physics-based, research-oriented&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CoppeliaSim&lt;/strong&gt;: Versatile, good for learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For beginners, I'd recommend &lt;strong&gt;PyBullet&lt;/strong&gt;, &lt;strong&gt;Webots&lt;/strong&gt;, or &lt;strong&gt;Gazebo + ROS&lt;/strong&gt;. They have gentle learning curves and tons of tutorials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build Your First Virtual Robot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pick a simulator and follow its beginner tutorial. I don't want to add a complete guide here; I'll link to a step-by-step article on creating your first virtual robot later. You'll learn the basic workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a robot model&lt;/li&gt;
&lt;li&gt;Add sensors and actuators&lt;/li&gt;
&lt;li&gt;Write control code&lt;/li&gt;
&lt;li&gt;Run and observe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It'll feel like the real thing, but without the crashes. I've done these a ton of times with different platforms; they are always fun to play with.&lt;/p&gt;
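
&lt;p&gt;Just to show how little code it takes to get moving, here's a minimal PyBullet sketch (assuming pip install pybullet). It loads a ground plane plus one of the sample robots that ships with the library, then steps the physics for about two seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
import pybullet as p
import pybullet_data

p.connect(p.GUI)   # opens the simulator window (use p.DIRECT for headless runs)
p.setAdditionalSearchPath(pybullet_data.getDataPath())   # where the sample URDFs live
p.setGravity(0, 0, -9.81)

plane = p.loadURDF("plane.urdf")                # the ground
robot = p.loadURDF("r2d2.urdf", [0, 0, 0.5])    # a sample robot, dropped half a metre up

for _ in range(480):          # default step is 1/240 s, so 480 steps is ~2 seconds
    p.stepSimulation()
    time.sleep(1.0 / 240.0)   # slow down to roughly real time for the GUI

p.disconnect()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;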

&lt;h3&gt;
  
  
  Robotics Software
&lt;/h3&gt;

&lt;h4&gt;
  
  
  What is ROS/ROS2?
&lt;/h4&gt;

&lt;p&gt;ROS (Robot Operating System) is like the "operating system" for robots. It's a framework that makes it easier to write robot software.&lt;/p&gt;

&lt;p&gt;ROS handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Communication between different robot components&lt;/li&gt;
&lt;li&gt;Managing sensors and actuators&lt;/li&gt;
&lt;li&gt;Running multiple programs simultaneously&lt;/li&gt;
&lt;li&gt;Lots of pre-built tools and libraries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Is ROS necessary? Not for your first robot. But it's industry-standard, and learning it early pays off.&lt;/p&gt;

&lt;p&gt;ROS2 is the newer version, cleaner, and more modern. If you're starting fresh, go with ROS2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Working with URDF Models (Robot Representation)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;URDF (Unified Robot Description Format) is basically XML that describes your robot. It tells the system: "Here's my robot, it has these joints, these links, these sensors."&lt;/p&gt;

&lt;p&gt;You write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;robot&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"my_robot"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"body"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"wheel_left"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;joint&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"left_wheel_joint"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"revolute"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="c"&gt;&amp;lt;!-- joint details --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/joint&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/robot&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This might look intimidating, but it's just describing geometry and connections. Tools can visualize it for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Robot Control Program&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A simple obstacle-avoidance example in Python using ROS 1's rospy API (ROS2's rclpy follows the same publish/subscribe pattern):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rospy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sensor_msgs.msg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LaserScan&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;geometry_msgs.msg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Twist&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# msg contains laser scan data
&lt;/span&gt;    &lt;span class="c1"&gt;# If something is close, stop; otherwise, move forward
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ranges&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;move_forward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;rospy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;obstacle_avoider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rospy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Subscriber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/scan&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LaserScan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rospy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;spin&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This listens to a laser scanner and avoids obstacles. Simple, right?&lt;/p&gt;

&lt;h2&gt;
  
  
  Robotics in the Real World
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Robotics in Healthcare, Space, Manufacturing, Entertainment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare&lt;/strong&gt;: Surgical robots (Da Vinci), rehabilitation robots, delivery bots in hospitals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Space&lt;/strong&gt;: Mars rovers, satellite deployment robots, exploration drones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manufacturing&lt;/strong&gt;: Assembly lines, welding robots, material handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entertainment&lt;/strong&gt;: Robodogs, humanoid entertainers, theme park attractions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each domain has unique challenges. Healthcare robots need to be incredibly precise and safe. Space robots need to operate autonomously with limited communication. Manufacturing robots need to work 24/7 without breaking down.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ethical and Safety Considerations
&lt;/h2&gt;

&lt;p&gt;As robots become more powerful and autonomous, we need to think about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Safety&lt;/strong&gt;: What happens if a robot malfunctions?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bias&lt;/strong&gt;: If a robot uses AI, does that AI have biases?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomy&lt;/strong&gt;: How much decision-making should we give to robots?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Displacement&lt;/strong&gt;: What about workers whose jobs are replaced?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;br&gt;
I am not going to recommend any courses right now, since I am using various tools, lectures, and books; maybe in the future. There are tons of materials out there if you need them.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Kits:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LEGO Mindstorms: Great for learning&lt;/li&gt;
&lt;li&gt;Arduino Starter Kits: Affordable, beginner-friendly&lt;/li&gt;
&lt;li&gt;ROSbot: Pre-built mobile robot, good for ROS learning&lt;/li&gt;
&lt;li&gt;Donkey Car: Open-source autonomous car project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Communities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ROS Discourse: Official ROS community&lt;/li&gt;
&lt;li&gt;Reddit: r/robotics is helpful&lt;/li&gt;
&lt;li&gt;GitHub: Browse robotics projects, contribute&lt;/li&gt;
&lt;li&gt;Local meetups: Find robotics groups in your city&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Competitions and Open-Source Projects&lt;/strong&gt;&lt;br&gt;
Some of these are really interesting to follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RoboCup&lt;/strong&gt;: International robotics competition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FIRST Robotics&lt;/strong&gt;: High school competition (they have adult divisions too)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sparkfun AVC&lt;/strong&gt;: Autonomous vehicle competition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source&lt;/strong&gt;: Contribute to projects like Donkey Car, OpenDog, etc.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Robotics is a wide field, and we can't cover it all in one article. But this guide should get you started. Pick a project, grab some components, and build something. That's how you really learn.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>robotics</category>
      <category>programming</category>
    </item>
    <item>
      <title>Learning AI in the world of fast-moving AI</title>
      <dc:creator>Nicanor Korir</dc:creator>
      <pubDate>Mon, 10 Nov 2025 04:35:57 +0000</pubDate>
      <link>https://forem.com/nicanor_korir/learning-ai-in-the-world-of-fast-moving-ai-2gpm</link>
      <guid>https://forem.com/nicanor_korir/learning-ai-in-the-world-of-fast-moving-ai-2gpm</guid>
<description>&lt;p&gt;This is gonna be a tough one. I mean, writing without using ChatGPT, Claude, etc., to generate this blog post for me. I'll create an article on how I am using AI in my studies and work.&lt;/p&gt;

&lt;p&gt;I'm currently pursuing a Master's in Artificial Intelligence with a Robotics specialisation, which I genuinely love, but let me tell you: learning in the age of fast-moving AI is both exhilarating and disorienting at the same time.&lt;/p&gt;

&lt;h4&gt;
  
  
  What does "fast-moving AI" even mean?
&lt;/h4&gt;

&lt;p&gt;Now?&lt;br&gt;
We have tools&lt;br&gt;
We have automation&lt;br&gt;
We have AI helping us build AI&lt;/p&gt;

&lt;p&gt;Here's the thing: AI has felt practically real for about the last two years now, and it's getting crazier every single day. AI isn't some distant-future thing anymore; it's embedded in almost every industry. Discoveries drop constantly. Everything is accelerating. It's like trying to read a book while someone keeps flipping the pages faster and faster.&lt;/p&gt;

&lt;p&gt;For me, I'm studying both the present and the past simultaneously, and they're moving in different directions. The past stuff is the foundational theory, which is crucial because it's what gives you actual understanding. You need to know why things work, not just that they work. But the current stuff? That's where it gets wild. You're learning cutting-edge applications while also reverse-engineering them: What's the origin? How did we get here? What does this mean for what comes next?&lt;/p&gt;

&lt;h4&gt;
  
  
  The generational shift in AI education
&lt;/h4&gt;

&lt;p&gt;If I compare my experience to someone who did a Master's in AI ten years ago, the difference is huge. Back then, you had to build almost everything from scratch. You did the research, prepared your own datasets, and wrote your own implementations. Speed wasn't even part of the equation; research took time.&lt;/p&gt;

&lt;p&gt;Now? We have foundational models, pre-trained systems, automated pipelines, and off-the-shelf tools that would've taken years to develop a decade ago. But here's the double-edged sword: we still need to understand how to build these things manually, so I still do the old-school model building, just with today's pre-built tools, which feels like a luxury. The difference is that we also have the option to leverage automation. So the game has shifted: it's not about reinventing the wheel anymore; it's about knowing when to build wheels, when to use existing ones, and how to integrate them intelligently.&lt;/p&gt;

&lt;p&gt;I'm planning to write deeper dives into exactly what I'm learning and how it's different from the "old way".&lt;/p&gt;

&lt;h3&gt;
  
  
  Robotics is its own beast
&lt;/h3&gt;

&lt;p&gt;Now, robotics is a different animal entirely. Unlike NLP, where pre-trained models and massive datasets have democratised the field, robotics hasn't had that same tooling revolution. You can't just download a pre-trained robot. You have to do the practicals: actually build, test, and iterate. And robotics isn't just one field; it contains multiple fields working together. I'll link to a Getting Started with Robotics article here.&lt;/p&gt;

&lt;p&gt;The foundation matters even more in robotics because you're combining computer vision, control systems, mechanics, and AI reasoning all at once. The good thing right now is that there are excellent simulation environments that help you understand the practical side.&lt;/p&gt;

&lt;h3&gt;
  
  
  Keeping up without burning out
&lt;/h3&gt;

&lt;p&gt;There's this underlying anxiety in AI right now: if you stop learning for a week, you've missed something important. But here's what I've learned: keeping up is actually manageable if, and this is crucial, you have a strong foundation. Every breakthrough, every new model, every new technique: they're all built on the same fundamentals of computer science, mathematics, and physics. The basics haven't changed. The applications have exploded, but the principles are solid.&lt;/p&gt;

&lt;p&gt;These are just high-level thoughts on how I personally learn AI while immersed in the AI rush. An article breaking down how I am using AI for learning and work into low-level details would be a good follow-up.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>mlhgrad</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
