Forem: Kumaravelu Saraboji Mahalingam

🚀How I Built an AI Data Chat Tool in My Portfolio App Using Gemma 4 Open Weight Model

Kumaravelu Saraboji Mahalingam — Fri, 22 May 2026 19:22:16 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

What If You Could Just... Ask Your Data a Question? 🤔

Most people who need insights from a data file are blocked by one simple thing: they don't know SQL. Even technically strong users often don't want to stop, inspect schema manually, write queries, debug syntax, and format results just to answer a quick question like "Which category has the highest revenue?" or "Show me null rates by column."

This project removes that friction entirely. Upload your file, type your question in plain English, and let a Gemma 4-powered agentic backend inspect the schema, generate DuckDB SQL, execute it, and return the results — right inside a clean chat interface. 🎯

🛠️ The Full Stack at a Glance

Before diving deep, here's the architecture that powers this tool:

Layer	Technology
Frontend	Next.js (Portfolio App)
Edge Middleware	Supabase Edge Functions
Backend API	FastAPI on Google Cloud Run
AI Agents	CrewAI
SQL Engine	DuckDB
LLM	Gemma 4 via Hugging Face Inference API
File Formats	CSV, XLSX, Parquet, JSON, Arrow IPC

Step 1 — The Next.js Frontend

The portfolio app at databro.dev/backend/ai-data-chat/ provides a split-panel chat UI:

Left panel: File upload zone (drag & drop), attached file display with name and size, LLM Settings panel with Provider + Model selectors
Right panel: Chat interface with starter prompt buttons, conversation history with SQL panels and output tables

The user picks a file, selects a Gemma 4 model via Hugging Face, and types a natural-language question. The frontend packages everything into FormData (file + user intent + provider + model) and sends it to a Supabase Edge Function.

Step 2 — Supabase Edge Function

A lean TypeScript Edge Function validates the multipart request and proxies it to the Google Cloud Run backend. This keeps the frontend thin, avoids exposing Cloud Run endpoints directly to the browser, and centralizes auth concerns at the edge.

Step 3 — FastAPI on Google Cloud Run

The Cloud Run backend receives the uploaded file and user intent. It:

Stores the file in a temporary directory
Detects file type (CSV, Parquet, Arrow, JSON, XLSX)
Loads the data into DuckDB as a data table
Runs DESCRIBE data to extract the full schema
Counts total rows
Passes schema context + row count + user intent to the CrewAI agent pipeline

Step 4 — CrewAI Agent Pipeline

The agentic pipeline runs two coordinated tasks:

Task 1 — Schema Analysis
An agent uses a DuckDB tool to introspect the uploaded file and returns column names, types, and a row count. This grounding step is critical — the LLM sees real column names before generating SQL, which dramatically reduces hallucination.

Task 2 — SQL Generation + Execution
The schema and user intent are passed to the Gemma 4 model via Hugging Face. The LLM returns a valid DuckDB SQL query. The agent then executes that SQL against the uploaded file and serializes the results.

Step 5 — Response Back to the UI

The backend returns a structured JSON response containing:

model — the resolved model used
sql — the exact DuckDB SQL that was generated and run
schema — column definitions
total_rows — row count of the uploaded file
result — the query output rows

The frontend renders the SQL in a dark code panel and the results in a scrollable table. Transparency is built in — you always see exactly what query was run.

How the data workflow works 🔄

This project uses DuckDB as the execution engine over uploaded files. The backend detects file type, loads the data into a temporary data table, runs DESCRIBE data to extract schema, counts rows, and then uses the resulting schema context as grounding for SQL generation.

That design is important because it reduces hallucination risk. The LLM is not inventing column names from imagination; it sees the actual schema first, then turns the user’s natural-language intent into DuckDB SQL that is executed against the uploaded file.

At a high level, the flow looks like this:

This is the kind of architecture that makes AI feel grounded in engineering reality. Every step is inspectable, and every answer is backed by an actual query. 🛠️

Demo

👉 Try the AI Data Chat Tool 🚀

Code

⭐ View Source Code on GitHub

How I Used Gemma 4

I used Gemma 4 26B A4B (google/gemma-4-26B-A4B-it) as the core reasoning engine for this project, accessed via the Hugging Face Inference API.

🧠 What the Model Does in This Stack

Gemma 4 sits at the heart of Task 2 in the CrewAI agent pipeline — the SQL generation step. After the Schema Analysis agent introspects the uploaded file and extracts real column names and types, that schema context is passed to Gemma 4 along with the user's natural language question. The model's job: turn intent into a valid, executable DuckDB SQL query.

No templates. No hard-coded patterns. Just the schema, the question, and Gemma 4 reasoning about what SQL to write. 🧩

🎯 Why Gemma 4 26B A4B — Not E4B, Not 31B

Consideration	Why 26B A4B Wins
⚡ Speed	MoE activates only ~3.8B params per token — much faster than 31B Dense on the same hardware
🧩 SQL reasoning quality	Consistently produces correct `HAVING`, `date_trunc`, `COUNT(DISTINCT)`, multi-column aggregations from plain English
📄 Context window	256K tokens — plenty of room for schema + row samples + user intent without truncation
💰 Inference cost	MoE efficiency = more queries served per dollar vs 31B Dense
🔁 Flexibility	Same Hugging Face endpoint pattern — users can switch to 31B mid-session via the UI

The E2B and E4B models are excellent for on-device/edge use cases with audio support, but for structured SQL reasoning in a server-side agentic pipeline, the 26B A4B's larger parameter pool and 256K context window made it the clear choice.

The 31B Dense is available in the UI as an option for complex queries — but for the vast majority of natural-language-to-SQL tasks in this tool, 26B A4B delivers the same quality at a fraction of the inference cost. 🎯

🔌 Integration Pattern

# CrewAI LLM config pointing to Gemma 4 26B A4B via Hugging Face
llm = LLM(
    model="huggingface/google/gemma-4-26B-A4B-it",
    api_key=os.environ["HF_TOKEN"]
)

The model is initialized once per request and passed to the SQL generation agent. The system prompt frames Gemma 4 as a senior data engineer with deep DuckDB knowledge — grounding it before any user input is processed. 🚀

I Built an AI Data Chat Tool in My Portfolio App Using Gemma 4, CrewAI, DuckDB, Supabase Edge Functions & Google Cloud Run 🚀

Kumaravelu Saraboji Mahalingam — Fri, 22 May 2026 00:03:33 +0000

What If You Could Just... Ask Your Data a Question? 🤔

👉 Try the AI Data Chat Live Demo 🚀

🛠️ The Full Stack at a Glance

Before diving deep, here's the architecture that powers this tool:

Layer	Technology
Frontend	Next.js (Portfolio App)
Edge Middleware	Supabase Edge Functions
Backend API	FastAPI on Google Cloud Run
AI Agents	CrewAI
SQL Engine	DuckDB
LLM	Gemma 4 via Hugging Face Inference API
File Formats	CSV, XLSX, Parquet, JSON, Arrow IPC

🌟 Meet the Gemma 4 Model Family

Gemma 4 is Google DeepMind's latest family of open models — and they are genuinely exciting for agentic builders. Unlike earlier iterations, Gemma 4 is built with logic-heavy, reasoning-oriented workflows in mind, which makes it a natural fit for tasks like natural-language-to-SQL generation.

Here's a builder-friendly overview of the lineup:

Model	Active Params	Context Window	Input Modalities	Ideal For
Gemma 4 E2B 🎙️	2.3B (5.1B total)	128K tokens	📝 Text + 🖼️ Image + 🔊 Audio (Speech)	Ultra-mobile, edge, browser & IoT deployment — e.g. Pixel, Chrome
Gemma 4 E4B 🎙️	4.5B (8B total)	128K tokens	📝 Text + 🖼️ Image + 🔊 Audio (Speech)	On-device agents with richer capability — real-time audio/visual processing
Gemma 4 26B A4B 🎬	3.8B (25.2B total)	256K tokens	📝 Text + 🖼️ Image + 🎬 Video (up to 60s)	High-throughput MoE reasoning — advanced agentic & long-context workflows
Gemma 4 31B 🎬	30.7B	256K tokens	📝 Text + 🖼️ Image + 🎬 Video (up to 60s)	Maximum open-model capability — frontier-grade reasoning on local hardware

🔊 Audio support (E2B & E4B): These models include a built-in USM-style conformer encoder that processes up to 30 seconds of audio at 16kHz. They handle automatic speech recognition (ASR), multilingual speech-to-text translation, and audio Q&A natively on-device — no cloud API call required. Each second of audio costs 25 tokens.

🎬 Video support (26B & 31B): The two larger models can analyze video sequences up to 60 seconds at 1 fps, covering scene understanding, temporal reasoning, and chart reading across frames.

🖼️ Image support (all models): Variable aspect ratio and resolution, configurable token budgets (70–1,120 tokens per image), OCR, document parsing, chart comprehension, and UI/screen understanding.

The AI Data Chat tool in this project exposes google/gemma-4-31B-it and google/gemma-4-26B-A4B-it via Hugging Face endpoints — the two most capable text+reasoning models in the family, ideal for the complex DuckDB SQL generation tasks you'll see below. 🔥

🤖 A Brief Foundation on CrewAI

CrewAI is an open-source framework for orchestrating autonomous AI agents in structured multi-agent workflows. Instead of a single giant prompt doing everything, CrewAI lets you break a problem into specialized, coordinated responsibilities.

Three building blocks to understand:

🧑‍💼 Agents

Specialized workers, each with a defined role. Think of them as employees on your AI team — one might be a "Schema Inspector," another a "SQL Writer," another a "Result Formatter."

📋 Tasks

Units of work assigned to agents. A task has a clear description, expected output, and the agent responsible for it. Examples:

"Analyze the uploaded file and return its schema"
"Given this schema and user intent, write a valid DuckDB SQL query"
"Execute the SQL and format the result for the user"

🔧 Tools

Capabilities agents can invoke to act on the world — file inspection utilities, DuckDB query executors, schema extractors, etc.

This model is powerful because it creates transparent, inspectable pipelines instead of black-box AI magic. Every step has a purpose, and every output is traceable.

🏗️ How the Tool Is Built — End to End

Step 1 — The Next.js Frontend

The portfolio app at databro.dev/backend/ai-data-chat/ provides a split-panel chat UI:

Left panel: File upload zone (drag & drop), attached file display with name and size, LLM Settings panel with Provider + Model selectors
Right panel: Chat interface with starter prompt buttons, conversation history with SQL panels and output tables

Step 2 — Supabase Edge Function

Step 3 — FastAPI on Google Cloud Run

The Cloud Run backend receives the uploaded file and user intent. It:

Stores the file in a temporary directory
Detects file type (CSV, Parquet, Arrow, JSON, XLSX)
Loads the data into DuckDB as a data table
Runs DESCRIBE data to extract the full schema
Counts total rows
Passes schema context + row count + user intent to the CrewAI agent pipeline

Step 4 — CrewAI Agent Pipeline

The agentic pipeline runs two coordinated tasks:

Step 5 — Response Back to the UI

The backend returns a structured JSON response containing:

model — the resolved model used
sql — the exact DuckDB SQL that was generated and run
schema — column definitions
total_rows — row count of the uploaded file
result — the query output rows

The frontend renders the SQL in a dark code panel and the results in a scrollable table. Transparency is built in — you always see exactly what query was run.

Even when the implementation keeps the flow compact, the mental model is powerful. Readers instantly understand that the system is not “AI guessing at data,” but a coordinated pipeline where each step has a purpose. ✅

How the data workflow works 🔄

At a high level, the flow looks like this:

This is the kind of architecture that makes AI feel grounded in engineering reality. Every step is inspectable, and every answer is backed by an actual query. 🛠️

🎬 Live Demo — Real Queries, Real Results

I tested the tool with a real e-commerce sales CSV (49 rows, 18 columns covering orders, customers, products, categories, regions, sales, profit, discounts, and returns). Here's what Gemma 4 + CrewAI + DuckDB produced for 6 progressively complex natural language questions.

✅ Test Setup — File Uploaded, Gemma 4 31B Selected

The file sample_test_file.csv is attached (8.7 KB). Provider set to Hugging Face, model set to Gemma 4 31B Instruct. The chat interface is ready with starter prompts visible.

💡 The LLM Settings panel (visible by scrolling right on the chat panel) lets you switch between Gemma 4 31B Instruct and Gemma 4 26B A4B Instruct mid-session.

🟢 Query 1 — "Show me the top 10 rows"

What Gemma 4 generated:

SELECT * FROM data LIMIT 10

Result: Processed 49 CSV rows, returned 10 of 10 matching rows. Clean tabular output showing order_id, customer_id, customer_name, customer_email, customer_segment and more.

🟢 Query 2 — "What are the key columns and null rates?"

This is where things get interesting. Instead of a simple SELECT, Gemma 4 understood intent — it generated a full null-rate audit query covering every single column:

What Gemma 4 generated:

SELECT
    count(*) AS total_rows,
    (count(*) - count(order_id)) * 100.0 / count(*) AS order_id_null_rate,
    (count(*) - count(customer_id)) * 100.0 / count(*) AS customer_id_null_rate,
    (count(*) - count(customer_name)) * 100.0 / count(*) AS customer_name_null_rate,
    (count(*) - count(customer_email)) * 100.0 / count(*) AS customer_email_null_rate,
    -- ... (every column covered)
    (count(*) - count(returned)) * 100.0 / count(*) AS returned_null_rate
FROM data

Result: 49 total rows, 0% null rate across all 18 columns — a complete data quality snapshot in one shot, without the user writing a single line of SQL.

🤯 This is the magic moment. A non-technical user typed a plain English question, and the system produced a production-quality data quality audit query autonomously.

🟡 Query 3 — "Show top 5 categories ranked by total sales and profit margin percentage"

What Gemma 4 generated:

SELECT
    category,
    SUM(sales) AS total_sales,
    (SUM(profit) / SUM(sales)) * 100 AS profit_margin_percentage
FROM data
GROUP BY category
ORDER BY total_sales DESC
LIMIT 5

Result: 3 categories returned with calculated profit margins:

category	total_sales	profit_margin_percentage
Furniture	4438.45	-4.08% ⚠️
Technology	4228.69	+20.16% ✅
Office Supplies	918.77	+37.42% 🏆

Insight unlocked: Furniture is the highest-revenue category but is running at a loss. Technology and Office Supplies are both profitable — Office Supplies has the best margin.

🟡 Query 4 — "Show me monthly revenue trend with total orders count and average order value grouped by month"

What Gemma 4 generated:

SELECT
    date_trunc('month', order_date) AS month,
    SUM(sales) AS total_revenue,
    COUNT(order_id) AS total_orders,
    AVG(sales) AS avg_order_value
FROM data
GROUP BY date_trunc('month', order_date)
ORDER BY month

Result:

month	total_revenue	total_orders	avg_order_value
2024-01-01	6471.26	30	215.71
2024-02-01	3114.65	19	163.93

Gemma 4 correctly used DuckDB's native date_trunc('month', ...) function — not a workaround, not a SUBSTR hack — real DuckDB date intelligence applied from natural language.

🔴 Query 5 — "Which customers have placed more than one order? Show their name, order count and total spend ranked by highest spend"

What Gemma 4 generated:

SELECT
    customer_name,
    COUNT(DISTINCT order_id) AS order_count,
    SUM(sales) AS total_spend
FROM data
GROUP BY customer_name
HAVING COUNT(DISTINCT order_id) > 1
ORDER BY total_spend DESC

Result:

customer_name	order_count	total_spend
Alice Johnson	3	163.85

Gemma 4 correctly identified the need for HAVING COUNT(DISTINCT ...) > 1 — a classic SQL pattern that trips up many developers, generated correctly from plain English.

🔴 Query 6 — "Find all returned orders, show total profit loss by category and ship mode, sorted by biggest loss first"

What Gemma 4 generated:

SELECT
    category,
    ship_mode,
    SUM(profit) AS total_profit_loss
FROM data
WHERE returned = 'Yes'
GROUP BY category, ship_mode
ORDER BY total_profit_loss ASC

Result:

category	ship_mode	total_profit_loss
Furniture	Second Class	-200 🚨
Furniture	Standard Class	-155 ⚠️
Furniture	First Class	-20

Every single return came from Furniture. Second Class shipping returns caused the most financial damage. This is exactly the kind of insight a business analyst would spend hours finding — delivered in under 30 seconds via plain English.

💡 Why This Is Genuinely Useful — No SQL Expertise Needed

Look at what just happened across those 6 queries:

What the user typed	What DuckDB actually executed
"Show me the top 10 rows"	`SELECT * FROM data LIMIT 10`
"What are the key columns and null rates?"	18-column null rate audit with `(count() - count(col)) 100.0 / count(*)`
"Top categories by sales and profit margin"	`SUM` + division for margin `%`, `GROUP BY`, `ORDER BY`
"Monthly revenue trend"	`date_trunc('month', order_date)` + `AVG` + `COUNT`
"Customers with multiple orders"	`COUNT(DISTINCT ...)` + `HAVING` clause
"Returned orders profit loss by category"	`WHERE returned = 'Yes'` + `GROUP BY` + `ORDER BY ASC`

Why this is useful to end users 📊

The biggest value is accessibility. A user does not need to know SQL, DuckDB syntax, schema introspection, or data-loading APIs to get insights from a file. They can simply ask questions like “Show me the top 10 rows,” “What are the null rates?”, “Find duplicates,” or “Give me a distribution summary,” which the UI even suggests as starter prompts.

That changes the audience dramatically. Analysts can move faster, product managers can self-serve, recruiters or business users can inspect exported CSVs, and students can explore datasets without first learning query languages.

It also improves transparency compared with many “AI chat with your data” demos. The app returns the generated SQL alongside the tabular output, so users can verify what was run and build trust in the response rather than blindly accepting a paragraph from a model. 🔍

Why this makes readers curious about Gemma 4 🌟

There is something deeply compelling about watching an open model turn vague human language into a working analytic query. Once readers see Gemma 4 participating in a real stack — frontend, edge function, backend, SQL engine, and agent workflow — the model stops being an abstract release and starts feeling like a tool they can ship with.

That is where curiosity grows. If Gemma 4 can help power natural-language analytics over local files today, what else could it do tomorrow — internal BI copilots, document intelligence, edge-side assistants, debugging copilots, data quality triage, or multimodal workflow agents? 🤯

And because Gemma 4 is available in multiple sizes and distributed through platforms like Hugging Face, it invites experimentation instead of gatekeeping it. For builders, that is a powerful combination: capability, openness, and a very tangible path from prototype to product.

Final thoughts ✍️

This project started as a portfolio build, but it became a strong example of where agentic AI gets genuinely useful: reducing the distance between a person’s question and the data-backed answer they need.

Gemma 4 gives that experience an extra spark. It feels like the kind of model family that makes developers want to try one more workflow, one more agent, one more product idea — and that is exactly the energy a great open model should create. 🚀

Query CSV, Excel, Parquet, and Arrow files in the Browser with DuckDB-Wasm + Next.js 🦆✨

Kumaravelu Saraboji Mahalingam — Fri, 15 May 2026 12:45:54 +0000

I recently built a browser-based SQL tool in my portfolio site: DataBro SQL Query Tool 🚀

It lets you load local files, run SQL directly in the browser, and explore data without sending the raw file to a backend first. That idea felt too interesting not to write about—because once you see DuckDB-Wasm in action, the browser starts looking a lot less like a UI shell and a lot more like a tiny analytics engine 🦆✨

What if a browser tab could become a tiny analytics engine? 🤯 DuckDB-Wasm brings DuckDB into the browser through WebAssembly, which means a web app can run analytical SQL locally on the user's machine instead of sending every file to a backend first.

That one idea changes a lot. In the tool above, a Next.js SPA can let users upload CSV, Excel, Parquet, and Arrow and run SQL instantly while keeping the raw data private in the browser session 🔒

This post walks through that pattern using my portfolio tool as the example—a Next.js app that behaves like a lightweight local data workbench, powered by DuckDB-Wasm and browser-side file processing.

What is DuckDB-Wasm? 🦆

DuckDB-Wasm is the browser-friendly version of DuckDB, compiled to WebAssembly so it can run analytical SQL inside a web app.

Instead of treating the browser as a thin UI layer, it turns the browser into a place where real data work can happen.

That is the part that sparks curiosity. Most people expect SQL engines to live on servers, behind APIs, databases, and cloud infrastructure. DuckDB-Wasm flips that assumption by letting a user open a page, load a file, and query it locally in the same environment where the UI is rendered.

In practical terms, this means a browser app can feel surprisingly close to a desktop analytics tool. It can scan files, preview structured data, run aggregation queries, and return results quickly without forcing users through an upload-first workflow.

Why this is exciting for web apps ⚡

Most web apps that handle tabular data follow the same pattern: upload the file, store it temporarily, process it on the server, then return a preview.

That works, but it introduces latency, backend complexity, and privacy concerns.

With DuckDB-Wasm, a lot of that flow can move directly into the browser. That means less infrastructure for exploratory data tasks and a much better experience for one-off analysis, internal tools, and privacy-sensitive workflows.

It also creates a more delightful product experience. In my tool, you can load a file and ask it questions with SQL immediately—that feels more like magic than middleware ✨

The privacy angle is the real story 🔐

Speed is nice, but privacy is the bigger win.

When DuckDB-Wasm runs in the browser, the SQL engine runs locally alongside the UI. In my app, uploaded files are handled inside the client-side page, registered locally for querying, and processed without requiring a traditional server-side query path.

That makes this pattern especially attractive for sensitive exports, spreadsheets with internal business data, quick inspections of customer-provided files, and any workflow where users do not want to upload raw datasets to a remote service just to answer a few SQL questions.

A good line to emphasize here:

Your browser is not just rendering the UI—it is running the SQL privately on your machine 🧠🔒

That idea is what makes readers pause.

How different file types become queryable 📂

One of the most interesting engineering details in the tool is that not every file format takes the same route into SQL.

CSV takes the simplest path: register the file and query it with DuckDB using automatic CSV detection. Excel is parsed in the browser first and transformed into a query-friendly structure before being registered for SQL access. Parquet can be queried more directly, while Arrow and Avro can go through browser-side transformations before querying.

This creates a good teaching moment: the UI can feel simple and unified even when the ingestion logic is format-specific under the hood. That contrast is exactly the kind of detail that makes technical readers keep scrolling 😄

Why a Next.js SPA is a great fit 🚀

This project is also a nice example of how capable a client-side Next.js page can be.

The same page can handle file upload, SQL editing, query execution, result rendering, and browser-side file processing without needing a heavy backend for the core experience.

For developers building internal tools, this is the exciting part. A SPA can start to feel like desktop-grade software when paired with browser-native processing and DuckDB-Wasm, opening the door to private data inspectors, validation tools, and self-serve analytics experiences.

Conclusion 🎯

DuckDB-Wasm makes a Next.js SPA feel surprisingly powerful. A local file can become a queryable dataset in seconds, and the most interesting part is not just the SQL—it is the fact that the work can stay private inside the browser.

That makes this pattern more than a fun demo. It is a practical blueprint for building modern data tools that feel fast, useful, and much easier to trust.

If you want to see the idea in action, here is the tool again: DataBro SQL Query Tool

TOON File Format Anatomy: Schema-Once, Data-Many for LLM Pipelines 🎯📄

Kumaravelu Saraboji Mahalingam — Sat, 02 May 2026 15:32:52 +0000

If you work with RAG pipelines, agent tools, or LLM APIs, you’ve probably noticed something frustrating: sometimes the biggest cost in a prompt is not the data itself — it’s the repeated JSON structure wrapped around it.

That is exactly the problem TOON tries to solve.

TOON (Token-Oriented Object Notation) is a compact, human-readable encoding of the JSON data model designed for LLM prompts. It keeps the same logical structure as JSON, but reduces token overhead by declaring structure once and streaming the data in a denser format.

In this post, we’ll break down the anatomy of the TOON format, explain where it fits in modern AI pipelines, and compare it with JSON, Arrow, and Parquet so you know when it is a smart choice — and when it is not.

Why TOON matters ⚡

In many LLM workflows, especially RAG, the bottleneck is not storage size on disk. It is prompt size, token cost, and how much useful context you can fit into the model window.

JSON is great for APIs and interoperability, but it becomes repetitive fast when you are passing arrays of objects. If every retrieved chunk repeats keys like id, title, source, score, and text, the model spends tokens reading syntax that carries very little new information.

TOON tackles that by using a simple idea: declare structure once, stream values many times.

Start with the big picture 🗂️

The easiest way to understand TOON is to think of it as a hybrid of:

JSON’s data model.
YAML-style readability and indentation.
CSV-style compact rows for uniform arrays.

That combination gives TOON a very specific sweet spot: uniform arrays of objects with primitive-valued fields.

So instead of writing something like this in JSON:

[
  {"id": 1, "name": "Alice", "role": "admin"},
  {"id": 2, "name": "Bob", "role": "user"}
]

TOON can express the same structure much more compactly like this:

users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user

That is the core TOON mental model right there: length + fields + rows.

The core anatomy 🧱

At a high level, a TOON tabular section is made of three important parts:

Array length using [N].
Field declaration using {field1,field2,...}.
Data rows that follow the declared field order.

Here is a simplified Mermaid view:

This is one of the most important design ideas in TOON. Instead of repeating object keys on every row, the schema is declared once and every subsequent line becomes mostly pure data.

Array length: small detail, big impact 🔢

The [N] part is more useful than it first appears. TOON documentation explicitly notes that the array length helps models answer dataset-size questions and detect truncation or malformed output.

That makes TOON interesting not just for compactness, but also for LLM guardrails. If a model was supposed to emit 50 rows and only returns 32, the mismatch becomes immediately visible.

This is a subtle but powerful improvement over plain CSV snippets in prompts, because CSV usually has no built-in count or schema declaration at the array boundary.

Field declaration: schema once 🏷️

The {fields} header is where TOON behaves a little like a lightweight schema language. It defines the expected columns and the order in which row values must appear.

That matters for both humans and models. Humans can scan the header once and understand the shape of the data; models can use that header as a structural constraint when interpreting each row.

For uniform, tabular payloads, this gives TOON a “column header + rows” feel that is much denser than JSON without losing meaning.

Rows: where the token savings come from 🚀

The real token savings show up in the rows. Once the field names are declared, every additional object no longer needs repeated key names, braces, quotes, and punctuation-heavy JSON structure.

This is why TOON often reports savings in the 30% to 60% range compared with JSON for suitable payloads, especially arrays of similarly structured objects used in RAG or tool outputs.

It is not magic. It is just removing repeated syntax and shifting the payload closer to “schema once, values many.”

Why TOON fits RAG especially well 🧠

RAG systems often retrieve multiple chunks with repeated metadata fields like chunk id, document id, title, source, section, score, and text. That is exactly the kind of repeated-object structure where JSON becomes noisy and expensive.

A practical pattern is this:

Store source data in a durable format such as Parquet or a database table.
Retrieve relevant rows or chunks for a query.
Convert the final prompt payload from JSON-like objects into TOON before sending it to the LLM.

That means TOON is usually not your storage layer. It is your LLM-facing delivery layer.

A practical RAG example 🧪

Imagine your retriever returns five chunks like this:

[
  {"chunk_id": 101, "doc": "policy.pdf", "section": "refunds", "score": 0.93, "text": "Customers can request refunds within 30 days..."},
  {"chunk_id": 205, "doc": "policy.pdf", "section": "cancellations", "score": 0.90, "text": "Cancellation fees apply after processing..."}
]

The same payload in TOON could look like this:

chunks[2]{chunk_id,doc,section,score,text}:
  101,policy.pdf,refunds,0.93,"Customers can request refunds within 30 days..."
  205,policy.pdf,cancellations,0.90,"Cancellation fees apply after processing..."

Same information, less repeated scaffolding. That usually means you can fit more retrieved chunks inside the same context window, which is one of the most practical reasons TOON is interesting for RAG.

TOON is not Parquet or Arrow 🚫

This is the most important framing if you want to write about TOON alongside Parquet and Arrow.

TOON is not a binary analytical file format. It is not trying to replace Parquet for storage or Arrow for in-memory interchange. It is a prompt-optimized text representation for structured data.

That means TOON belongs closer to the LLM boundary, while Parquet and Arrow belong deeper in the data platform stack.

A simple mental model is:

Parquet stores analytical data efficiently on disk.
Arrow moves typed columnar data efficiently between systems and memory spaces.
TOON presents structured data efficiently to language models.

A useful pipeline mental model 🔄

For a data engineer, the most realistic production story is not “TOON everywhere.” It is something more like this:

This architecture lets each format do what it is best at. Parquet stays the durable analytical format, Arrow can still be the fast in-memory interchange layer inside your engine, and TOON becomes the compact final-mile representation sent to the model.

Comparison with JSON, Arrow, and Parquet ⚖️

Here is the practical difference between the formats:

Format	Primary goal	Best for	Strength	Limitation
JSON	General-purpose structured interchange	APIs, config, documents	Ubiquitous and flexible	Repeats keys heavily in prompt payloads.
TOON	Token-efficient structured prompt representation	RAG context, tool outputs, LLM inputs	Compact, human-readable, schema-once row encoding.	Best on uniform arrays; less compelling for irregular nested data.
Arrow	In-memory columnar interchange	Dataframes, engines, cross-language analytics	Typed, fast, buffer-oriented interchange.	Not human-readable; not meant as prompt text format.
Parquet	Compressed analytical storage	Data lake and warehouse storage	Efficient on-disk analytics and selective reads	Not prompt-friendly and not human-readable in raw form.

If you are explaining this to readers in one sentence, the short version is: JSON is universal, TOON is LLM-friendly, Arrow is execution-friendly, and Parquet is storage-friendly.

Where TOON shines most ✨

TOON shines when your payload is dominated by repeated records that share the same shape. That is common in retrieval results, catalog-like datasets, logs, evaluation samples, classification inputs, and agent tool outputs.

It is especially attractive when every token matters — either because of context window limits, API cost, or the need to fit more relevant examples into one prompt.

In other words, TOON is most compelling when the structure is repetitive and the consumer is an LLM.

Where TOON is weaker ⚠️

TOON is not a universal replacement for JSON. Its strongest form is the tabular encoding for uniform arrays, and that means its benefits are smaller for deeply nested, irregular, or highly heterogeneous payloads.

It is also still early in ecosystem maturity compared with JSON, Arrow, or Parquet. That means you should think of it as a targeted optimization layer rather than a default foundation for every application format.

Final mental model 🧠

If you only remember one thing, remember this:

JSON repeats structure with every object.
TOON declares structure once and streams rows compactly.
Arrow optimizes typed in-memory interchange.
Parquet optimizes durable analytical storage.

That is why TOON is interesting. It is not competing with Parquet at the storage layer or Arrow at the execution layer. It is optimizing the final stretch where structured data becomes prompt context for a model.

If your stack already stores data in Parquet and processes it with Arrow-backed tools, TOON can be a neat final-mile format for presenting retrieved rows to an LLM with less token overhead and clearer structure.

Want to convert between JSON & TOON file formats? Check out my tool here - https://databro.dev/tools/toon-json-converter/

Apache Arrow File Anatomy: Buffers, Record Batches, Schemas, and IPC Metadata Explained 🏹📦

Kumaravelu Saraboji Mahalingam — Thu, 23 Apr 2026 12:01:37 +0000

If you work with Pandas, PyArrow, DuckDB, Spark, Polars, or data APIs, you’ve probably heard that Apache Arrow is fast because it is in-memory and columnar. That’s true, but just like Parquet, the real value starts to click when you understand how Arrow is physically organized.

Under the hood, an Arrow file is not just “serialized table data.” It is a structured binary format built around schemas, record batches, arrays, buffers, and IPC metadata. That structure is what makes Arrow efficient for zero-copy reads, fast interchange between systems, and high-performance analytical processing.

In this post, we’ll break down the anatomy of the Arrow IPC file format from the file boundary down to buffers in memory, and then connect those pieces back to the performance behavior you see in modern engines and libraries.

Why Arrow matters ⚡

A lot of modern data work is less about long-term storage and more about moving tabular data efficiently between systems. Apache Arrow was designed for exactly that: a language-independent, columnar memory format that different tools can share without expensive conversion steps.

That matters because data conversion is often the hidden tax in analytics. When one system has to transform rows into columns, or Python objects into JVM objects, or one internal memory layout into another, performance drops fast. Arrow reduces that overhead by standardizing a format that many systems can read directly.

Start with the big picture 🗂️

The easiest way to understand an Arrow file is to think of it as a hierarchy:

An Arrow IPC file contains a schema and one or more record batches.
Each record batch contains one array per column.
Each array is backed by one or more buffers that store validity bits, offsets, and values.
The file also contains IPC metadata and a footer so readers can locate batches and support random access.

That may sound abstract at first, so here is the mental model I use: an Arrow file is like a neatly packed parts crate 📦. The schema is the packing list, record batches are the grouped shipments, arrays are the per-column components, and buffers are the raw binary materials those components are made from.

The two IPC variants 🔀

Apache Arrow defines two IPC representations: the streaming format and the file format. The streaming format is meant for an arbitrary sequence of record batches processed from beginning to end, while the file format is meant for a fixed number of batches and supports random access.

That distinction matters a lot. If you are sending Arrow data over a socket or pipe, the stream format is a natural fit. If you are storing Arrow data on disk and want to jump to specific batches efficiently, the file format is the better mental model.

For this article, the focus is the Arrow IPC file format — the one typically associated with .arrow files.

The physical file layout 🧩

At a high level, an Arrow IPC file stores the schema and record batch messages in a structured binary layout and includes a footer so a reader can discover the batches and access them efficiently later.

Here is a simplified view of the file layout:

In the file format, the footer keeps the locations of dictionary and record batch blocks, which is what enables random access instead of purely sequential reading.

That footer is one of the big differences between the Arrow file format and the Arrow stream format. A file can support random access because it records where the batches live, while a stream is generally meant to be consumed sequentially from start to end.

So the file-level mental picture looks like this:

[Magic][Schema][Record Batch 1][Record Batch 2]...[Footer][Magic]

The exact binary details are format-specific, but the important idea is simple: the file contains enough metadata for a reader to understand the schema and jump to batches without replaying the whole stream from the beginning.

Schema: the contract first 📘

An Arrow file begins by describing the schema — the column names, data types, and optional metadata that define how to interpret the data that follows.

This is important because Arrow is strongly typed. A column is not just “some values”; it is an int64, string, timestamp, list, struct, or another explicit Arrow type. Readers need that contract before they can interpret the underlying buffers correctly.

Schema metadata can also carry key-value annotations. In practice, that means producers can attach extra information while still keeping the core columnar structure intact.

Record batches: the first major building block 🧱

A record batch is one of the central units in Arrow IPC. It is a tabular chunk with a fixed schema where all columns have the same row count.

If a dataset has one million rows, it may be written as multiple record batches instead of one giant monolithic block. That improves manageability and lets readers process data batch by batch rather than loading everything at once.

You can think of it like this:

Arrow File ├── Record Batch 1 -> rows 1 to 250,000 ├── Record Batch 2 -> rows 250,001 to 500,000 ├── Record Batch 3 -> rows 500,001 to 750,000 └── Record Batch 4 -> rows 750,001 to 1,000,000

This is one of the key differences from Parquet. In Arrow IPC, record batches are the main repeatable unit of serialized tabular data, whereas in Parquet the equivalent discussion starts with row groups.

Arrays: one column at a time 🧵

Inside a record batch, each column is represented as an Arrow array. So if your schema has id, country, and amount, the batch contains one array for id, one for country, and one for amount.

This is where Arrow’s columnar nature becomes concrete. Instead of storing rows as full records one after another, Arrow stores values in per-column structures that are easier for vectorized processing and cross-language interoperability.

That design is a big reason Arrow works so well as an interchange layer between systems like Python, C++, R, and database engines. The representation is already columnar and typed before any query engine starts doing extra work.

Buffers: where the actual bytes live 🧠

The most important low-level concept in Arrow is the buffer. Arrays are logical data structures, but the actual bytes usually live in one or more buffers, such as a validity bitmap buffer, an offsets buffer, and a values buffer depending on the data type.

Here is a simplified zoom-in of one record batch block:

This is the key idea behind Arrow’s internals: a record batch is a table-shaped chunk, each column is stored as an array, and each array maps to one or more physical buffers depending on the data type.

For example, a fixed-width numeric column may mainly need a validity bitmap and a contiguous values buffer. A variable-length string column usually needs validity bits, offsets showing where each value begins and ends, and a data buffer containing the concatenated string bytes.

This design is what makes Arrow feel so fast in practice. The memory layout is compact, explicit, and predictable, which helps CPUs and libraries process columns efficiently without reconstructing every row as a heavyweight object.

A simple string-column mental model 🔤

Imagine a country column with values like US, IN, CA, and one null. Arrow does not need to store those as four separate language-level string objects. Instead, it can represent the column using compact buffers that describe validity, positions, and raw bytes.

A simplified picture looks like this:

Validity bitmap: 1 1 0 1
Offsets: 0 2 4 4 6
Values buffer: USINCA

That means Arrow can represent variable-length data while still keeping the storage contiguous and efficient. It is one of the clearest examples of why Arrow is more than “just a binary table dump.”

Why zero-copy matters 🚀

Arrow reading is often described as zero-copy, and that phrase is important. Apache Arrow documentation notes that reading Arrow IPC data is inherently zero-copy when the source allows it, such as in-memory buffers or memory-mapped files, except in cases where transformations like decompression are required.

In plain language, zero-copy means a reader can often point directly at existing bytes instead of allocating new memory and rewriting the data into another layout. That reduces CPU overhead, memory churn, and latency.

This is why Arrow is so valuable in data interchange scenarios. The format is optimized not just for storage, but for sharing data structures efficiently between processes, runtimes, and libraries.

Why file format and memory mapping fit together 🧭

The Arrow file format supports random access, and Arrow documentation explicitly highlights that this is useful with memory maps.

That combination is powerful: a process can memory-map a .arrow file, inspect the footer, locate the record batches, and access the underlying buffers with minimal copying. This is very different from text-based formats that typically require parsing and conversion before the data becomes analytically useful.

So when people say Arrow is fast, a big part of the answer is not just “columnar.” It is columnar plus typed plus buffer-oriented plus random-access-friendly.

Arrow file vs stream format 📂

Here is the practical difference between the two IPC variants:

Format	Best for	Access pattern	Core behavior
Arrow IPC Stream	Pipes, sockets, sequential transfer	Sequential	Processed from start to end, no random access support.
Arrow IPC File	Disk persistence, `.arrow` files, memory mapping	Random access	Stores a fixed number of batches with footer-based access.

They are related, but not interchangeable. Apache Arrow and IANA documentation both emphasize that applications should know which format they are processing.

A concrete example 🧪

Let’s say you have a simple table with three columns: id, country, and amount. In an Arrow IPC file, that data would be written using the schema plus one or more record batches, and each batch would hold one array per column backed by buffers.

That means the amount values are already stored in a contiguous typed column representation, while country might be represented using offsets and values buffers. A consumer reading the file does not need to guess the schema or re-tokenize text rows the way it would with CSV.

This is why Arrow is so useful as a transport layer between systems. The producer writes structured columnar data once, and downstream consumers can often reuse that structure directly.

Where Arrow shines most ✨

Arrow is especially strong when the goal is fast interchange and in-memory analytics, not necessarily long-term compressed storage. The format preserves Arrow’s in-memory representation and helps avoid conversion overhead when moving data between systems.

That is why Arrow shows up so often in Python data libraries, query engines, database integrations, and dataframe systems. It is the connective tissue that lets many of those tools exchange columnar data efficiently.

If Parquet is often the answer for durable analytical storage on disk, Arrow IPC is often the answer for moving already-columnar data around with as little friction as possible.

Arrow vs Parquet: what’s actually different? ⚖️

At a glance, Arrow and Parquet can look similar because both are columnar and both show up constantly in analytics stacks. But they are optimized for different jobs, and that difference explains a lot of the behavior you see in real systems.

Parquet is primarily a storage format optimized for compressed analytical reads on disk, with structures like row groups, column chunks, pages, and footer statistics that support column pruning and predicate pushdown. Arrow IPC is primarily a serialization and interchange format built around Arrow’s in-memory columnar representation, using schemas, record batches, arrays, buffers, and file/stream metadata for efficient data sharing between systems.

Here is the practical mental model:

Aspect	Parquet	Arrow IPC
Primary goal	Efficient analytical storage on disk.	Fast in-memory interchange and serialization.
Main structural unit	Row groups, then column chunks, then pages.	Record batches, then arrays, then buffers.
Metadata for skipping	Footer statistics help with row-group pruning and predicate pushdown.	No Parquet-style row-group statistics model; usually not the main pruning layer.]
Random access story	Readers inspect footer and plan selective reads.	File footer enables random access to batches; stream format is sequential.
Best fit	Data lake storage, warehouse files, long-term analytics.	Data exchange between engines, memory-mapped analytics, dataframe interoperability.
Parallelism shape	Engines often parallelize across row groups and files.	Engines often process multiple record batches and fragments in parallel.

This is why the two formats often appear together instead of competing directly. Parquet is frequently the durable storage layer, while Arrow is the fast in-memory representation used to move data between readers, execution engines, APIs, and dataframes.

If you want a simple rule of thumb: Parquet helps you store analytical data efficiently; Arrow helps you move and process analytical data efficiently.

A common misconception 🚫

A common misconception is that Arrow file format is basically just a faster Parquet. That is not quite right. Parquet is a columnar storage format optimized for compressed analytical persistence, while Arrow IPC is a serialization and interchange format built around Arrow’s in-memory columnar representation.

They are related because both are columnar and both are useful in analytics, but they are optimized for different jobs. Arrow emphasizes interoperability and low-overhead access to typed buffers, while Parquet emphasizes compact, statistics-aware storage for scan-heavy workloads.

Final mental model 🧠

If you only remember one thing, remember this:

Schema defines the column names, types, and metadata.
Record batches break a dataset into tabular chunks with a shared schema.
Arrays represent one column at a time inside each batch.
Buffers hold the real bytes for values, offsets, and null tracking.
IPC file metadata and footer help readers locate batches and support random access.

Once that clicks, Arrow becomes much easier to reason about. Zero-copy reads, memory mapping, fast interchange, and vectorized execution all trace back to this physical structure.

If you work in PyArrow, DuckDB, Polars, Spark, or data APIs, understanding Arrow internals is one of those low-level concepts that pays off repeatedly. The format is doing much more than simply storing bytes — it is shaping how modern systems share and process tabular data.

👉 Want to inspect this visually? Try it here: https://databro.dev/tools/arrow-inspector-plus/

🚀 I Built a Browser-Local AI Assistant in Next.js with WebLLM, WASM, ONNX Runtime, Web Workers, and RAG

Kumaravelu Saraboji Mahalingam — Tue, 14 Apr 2026 20:28:14 +0000

Most AI chat widgets are just frontends for a remote API.

This one is different.

My assistant runs its core retrieval and generation pipeline inside the browser using WebLLM, Web Workers, WASM, ONNX Runtime Web, and a local RAG architecture built in Next.js.

You can try it here:

👉 https://databro.dev/?chat=open

What makes this fun is not just that it works locally.

It is that the browser is doing real AI work:

loading model artifacts
reusing browser cache
embedding queries
retrieving relevant chunks
reranking candidates
generating grounded answers
returning data back to the UI

That turns the browser from a thin client into an actual inference runtime.

🎯 Why I built it this way

Most website assistants follow the same pattern:

User enters a prompt.
Frontend sends it to a backend.
Backend calls an LLM API.
Response comes back.

That works. But it also means:

extra round trips
recurring inference cost
more infrastructure
more privacy tradeoffs
less control over local behavior

I wanted a different architecture: a browser-local AI assistant that could answer from a curated knowledge base without depending on server-side inference for the main path.

That is where WebLLM, Web Workers, WASM, ONNX Runtime Web, and RAG start fitting together really well.

🧠 The core idea

This is not “an LLM in a webpage.”

It is a layered browser-native AI system where each part has a very specific role:

Next.js widget → chat UI and state

Web Worker → orchestration and background execution

WebLLM → local generation runtime

WASM → efficient low-level browser execution

ONNX Runtime Web → browser inference for embedding and reranking tasks

RAG pipeline → grounding answers against the knowledge base

Caching → making repeat sessions practical

Once I started thinking about the architecture this way, the implementation became much cleaner.

🔥 What WebLLM actually is

A lot of people hear “WebLLM” and assume it is the model.

It is not.

WebLLM is the browser-side runtime used to load and execute supported language models locally.

That means the model and the runtime are two different things:

WebLLM = execution engine
Llama / Phi / Gemma / Mistral = model loaded into that engine

This distinction matters because it changes how you think about browser inference.

You are not calling a model API.

You are creating a local runtime, loading a model into it, and then sending prompt messages into that runtime.

That framing made a huge difference for me.

📦 Does WebLLM need to be downloaded?

Yes — and this is one of the most important practical details in browser-local AI.

On first use, the browser usually needs to download:

the selected model artifacts
runtime support assets
related files required to initialize inference

That means browser-local AI comes with a real first-run cost.

But after that, things get better fast.

Once those assets are cached, later sessions are much faster. This is one of the biggest UX wins in local inference: the browser starts behaving more like an application runtime than a stateless page.

⚙️ How the WebLLM runtime gets created

At a high level, the runtime lifecycle looks like this:

Create the WebLLM engine.
Select a supported model.
Download artifacts if they are not already cached.
Load the model into the engine.
Send structured prompt messages for generation.
Return the generated output.

So the runtime is not just a helper utility.

It is the execution container for the model.

That is why I think of WebLLM as a browser-native inference runtime rather than a simple wrapper library.

🧱 Why WASM matters

WASM (WebAssembly) is one of the hidden pillars of browser-local AI.

A lot of browser AI articles mention it in passing, but it deserves more attention than that.

WASM gives the browser a compact, efficient way to execute compute-heavy logic closer to native speed than ordinary JavaScript. That matters because local inference is not light work.

Tasks like these become much more realistic with performant browser execution paths:

model runtime support
tensor-heavy execution
embedding pipelines
reranking workloads
token generation infrastructure

Without efficient lower-level execution, the entire local inference stack becomes much harder to make practical.

🧠 WebLLM vs WASM vs ONNX Runtime Web

These are related, but they are not the same thing.

A simple way to separate them:

Layer	Responsibility
WebLLM	Local runtime for browser-based LLM generation
WASM	Efficient low-level execution layer in the browser
ONNX Runtime Web	Browser inference runtime for ONNX-backed model workloads
Web Worker	Background execution boundary that protects UI responsiveness

So WASM is not competing with WebLLM.

It is one of the technologies helping make browser-native inference feasible.

🧮 What ONNX Runtime Web is doing here

One of the easiest mistakes in local AI architecture is treating every model task as if it belongs to the same runtime.

It does not.

Generation is one kind of workload.

Embedding and reranking are different workloads.

That is why I like this split:

WebLLM for generation
ONNX Runtime Web for retrieval-side transformer execution

This is a strong design because retrieval-side tasks often need a different execution path than token-by-token LLM generation.

In practice, browser-local RAG is rarely “one model doing everything.”

It is a pipeline of specialized responsibilities.

🧵 Why Web Workers are non-negotiable

If you run browser-local AI on the main thread, the UI will eventually remind you that this was a bad idea.

Maybe not immediately.

Maybe not on your development machine.

But once model loading, chunk scoring, reranking, and generation pile up, the experience starts to degrade fast.

That is why Web Workers are essential.

A worker gives you a separate execution context for heavy tasks so the main thread can stay focused on:

rendering
input handling
scrolling and interaction
animation
state updates

For AI-heavy browser apps, that separation is not a nice-to-have.

It is architecture.

⚛️ Creating a Web Worker in Next.js

Workers are browser APIs, so they should only be created on the client side.

That means your widget component should be client-rendered and the worker should be created lazily when the chat experience actually begins.

This pattern works especially well in Next.js because it lets you keep rendering concerns in the UI layer while moving heavy orchestration into a background execution boundary.

I also prefer lazy worker creation because it avoids paying the initialization cost for users who never open the assistant.

📨 Widget-to-worker messaging

Once the worker exists, the widget should communicate with it using structured messages rather than trying to share runtime state directly.

That message boundary matters a lot.

The UI sends things like:

prompt text
serialized KB context
request identifier

The worker sends back:

final answer
citation metadata
failure state if something breaks

That separation keeps the frontend simpler and makes the worker a clean orchestration boundary instead of an implementation detail leaking into the UI.

🧠 Worker orchestration is where the real engineering happens

The worker is not just a background script.

It is the orchestration layer of the entire local AI lifecycle.

This is where things become much more than “I loaded a model in the browser.”

The worker is responsible for coordinating:

model/runtime initialization
cache-aware model reuse
KB artifact loading
embedding availability
retrieval and score fusion
reranking
confidence checks
prompt assembly
answer generation
packaging result metadata back to the widget

That orchestration layer is what transforms separate browser AI technologies into an actual product.

This, honestly, is where most of the engineering value lives.

🏗️ What the worker is really managing

The worker owns the lifecycle of the expensive parts of the system.

That typically includes:

the generation engine
embedding model access
reranker access
parsed or cached KB context
warm in-memory session state

This is important because the main thread should not be responsible for managing heavy AI runtime objects.

The UI should care about:

input
loading states
response rendering
citations
interaction flow

The worker should care about:

initialization
orchestration
reuse
inference sequencing

That separation is one of the biggest reasons the app feels stable instead of fragile.

🗂️ Overall architecture

🔄 Prompt lifecycle

Let’s walk the whole journey.

1) The user opens the widget

The Next.js app renders the chat interface and waits for interaction.

2) The worker is created lazily

Only when the user opens or uses the assistant does the app create the worker.

3) The worker warms up the AI stack

It checks whether engines, pipelines, and context state already exist.

4) Browser cache is consulted

If model assets are already cached, startup is faster.

If not, the first-run downloads happen here.

5) KB vectors are loaded

The worker loads precomputed vector artifacts or rebuilds what it needs.

6) The user enters a prompt

The widget sends the prompt and context payload to the worker.

7) The embedding model encodes the query

The prompt is turned into a dense vector representation.

8) Retrieval begins

Dense retrieval and sparse retrieval identify candidate KB chunks.

9) Hybrid scoring narrows the pool

The system fuses semantic and lexical signals.

10) Reranking refines the candidates

The best chunks are rescored for prompt-specific usefulness.

11) Confidence gating runs

If the candidates are weak, the system can fall back safely.

12) Grounded context is assembled

The final chunk set is turned into the context window for generation.

13) WebLLM executes generation

The worker sends system rules, prompt, and grounded context into the local runtime.

14) The worker packages the result

The answer text and citation metadata are returned to the widget.

15) The UI renders the final response

The user receives a grounded answer without needing the main inference path to leave the browser.

That full lifecycle is what turns a browser-local model into a practical assistant.

🛠️ What I learned building this

Here is the short version of what mattered most:

Treat WebLLM as a runtime, not as “the model.”
Expect first-run downloads and design for cache reuse.
Keep heavy work off the main thread.
Use workers as orchestration boundaries, not just compute bins.
Separate generation from retrieval-side inference.
Precompute KB vectors whenever possible.
Use reranking if grounded quality matters.
Add confidence gates before you need them.
Design for warm sessions, not just cold starts.

These are the choices that made the assistant feel like a product instead of a demo.

🎉 Final thought

The most exciting part of this project is not that one library made it possible.

It is that several browser-native technologies now fit together well enough to build a real local AI product.

That stack, for me, looks like this:

WebLLM for generation
WASM for efficient browser execution
ONNX Runtime Web for embedding and reranking paths
Web Workers for orchestration
RAG for grounding

Put together, they turn the browser into something much more powerful than a UI shell.

They turn it into the runtime.

🔗 Try it

https://databro.dev/?chat=open

Apache Parquet File Anatomy: Row Groups, Column Chunks, Pages, and Metadata Explained 🧱📦

Kumaravelu Saraboji Mahalingam — Fri, 10 Apr 2026 17:41:10 +0000

If you use Spark, Athena, Iceberg, Snowflake, DuckDB, or Pandas, you’ve probably worked with Parquet hundreds of times. But most of us first learn Parquet as a simple rule of thumb: it’s columnar, compressed, and great for analytics. That’s true, but it leaves out the most interesting part — why Parquet performs so well in the first place.

Under the hood, a Parquet file is not just a blob of compressed data. It has a deliberate internal structure made of row groups, column chunks, pages, and footer metadata, and that structure is exactly what enables column pruning, predicate pushdown, and efficient scans in modern query engines.

In this post, we’ll break down the anatomy of a Parquet file from the file boundary all the way down to individual pages, and then connect those pieces back to the real-world performance behavior you see in Spark, Iceberg, and Athena.

Why Parquet matters ⚡

Most analytical queries do not read every column and every row. They usually select a subset of columns, filter by a few predicates, and aggregate over large volumes of data. Parquet is designed specifically for that style of access, which is why it outperforms row-oriented formats like CSV for analytics-heavy workloads.

Instead of storing each record end-to-end, Parquet stores data column by column, while still grouping rows into larger units for efficient processing. That combination improves compression, reduces unnecessary I/O, and allows engines to skip chunks of data using metadata rather than brute-force scanning.

Start with the big picture 🗂️

The easiest way to understand Parquet is to think of it as a hierarchy:

A file contains one or more row groups.
Each row group contains one column chunk per column.
Each column chunk contains one or more pages.
The file ends with a footer that stores schema and metadata about those structures.

That may sound abstract at first, so here is the mental model I use: a Parquet file is like a mini warehouse 🏭, where rows are divided into sections, each section stores columns separately, and the catalog for the whole warehouse sits at the very end of the file.

The physical file layout 🧩

At the physical level, a Parquet file starts with a magic marker, stores row-group data in the body, and ends with footer metadata, the footer length, and another magic marker. Apache Parquet documents this structure explicitly with PAR1 at both the beginning and the end of the file.

Here is the high-level layout:

[PAR1][Row Group Data ...][File Metadata][Metadata Length][PAR1]

That footer-at-the-end design is more important than it looks. A reader can jump to the end of the file, inspect the metadata, understand the schema and row groups, and plan an efficient read before touching most of the actual data blocks.

A file-level diagram 🏗️

This is the skeleton of every Parquet file: data first, metadata last.

Row groups: the first major building block 🧱

A row group is a horizontal partition of rows inside a single Parquet file. If a file contains one million rows, those rows may be split across multiple row groups, and each row group becomes a self-contained unit for reading and processing.

This matters because row groups are a natural unit for parallelism. Distributed engines can assign different row groups to different tasks, and metadata associated with each row group can help decide whether that row group needs to be read at all.

You can think of it like this:

Parquet File
├── Row Group 1 -> rows 1 to 250,000
├── Row Group 2 -> rows 250,001 to 500,000
├── Row Group 3 -> rows 500,001 to 750,000
└── Row Group 4 -> rows 750,001 to 1,000,000

The important nuance is that a row group is not stored row-by-row internally. It is still columnar inside, which is where column chunks come in.

Column chunks: where columnar storage shows up 🧵

Inside each row group, every column gets its own column chunk. That means for a row group containing id, country, and amount, Parquet stores one chunk for id, one for country, and one for amount.

This is the mechanism behind column pruning. If your query only needs country and amount, the engine can skip the id chunks entirely, which reduces both I/O and deserialization work.

Here is a simple view:

At this point, you can already see why Parquet is so effective for analytics. Analytical queries rarely need every field in every row, and Parquet’s internal structure mirrors that reality.

Pages: the smallest units inside a chunk 📄

Each column chunk is further divided into pages, which are the smallest units used to store encoded data. Pages hold the actual values, and depending on the encoding being used, they may also be preceded by a dictionary page.

That means a column chunk is not one monolithic blob. It is a sequence of smaller blocks that can be encoded and compressed efficiently, while still fitting the overall columnar structure.

A useful diagram looks like this:

In practice, this page-level organization helps Parquet balance storage efficiency with read efficiency. The format can encode and compress data in manageable units instead of treating each column chunk as a single continuous stream.

Dictionary pages and encoding 📚

One of the most common Parquet optimizations is dictionary encoding. Instead of writing repeated string values over and over, Parquet can write a dictionary of unique values once and then store compact references in the data pages.

For a column like country, the dictionary might contain US, IN, and CA, and the data pages would store something closer to 0, 0, 1, 2 than full repeated strings. That reduces storage size and often improves downstream compression too.

This is one reason categorical columns often compress especially well in Parquet. Repeated patterns are easier to encode when similar values are physically grouped together in the same column chunk.

The footer: the real brain of the file 🧠

The most important part of a Parquet file is arguably not the data body but the footer. That footer stores file metadata such as the schema, row-group descriptions, and column-level information needed by readers to interpret the file efficiently.

Because the footer is written at the end of the file, readers can retrieve it first, inspect the contents, and decide what to read and what to skip. That is a huge part of why Parquet feels smart rather than brute-force.

At a high level, the footer can tell a reader:

What the schema is.
How many row groups exist.
Where each column chunk lives in the file.
What encodings and compression settings were used.
What statistics are available for pruning.

Metadata is what powers skipping 🚦

Parquet’s metadata is not just descriptive. It is actionable. The row-group and column metadata often includes statistics such as minimum value, maximum value, and null count, which allows query engines to avoid reading irrelevant data.

For example, if a row group’s event_date has a minimum of 2026-01-01 and a maximum of 2026-01-31, then a query filtering for March 2026 can skip that row group entirely. The engine does not need to inspect every row to know there is no match.

That optimization is the foundation of predicate pushdown and row-group pruning. Instead of reading first and filtering later, engines can use metadata to avoid unnecessary reads in the first place.

Predicate pushdown diagram 🎯

This is one of the most important performance ideas in Parquet. The file is designed so engines can make good decisions before scanning the full payload.

A concrete example 🧪

Let’s say you have this table:

id	country	amount
1	US	100
2	US	120
3	IN	900
4	CA	80

In a row-based file, those values are stored as complete records one after another. In Parquet, the values are stored by column inside each row group, so the country values sit together and the amount values sit together rather than being interleaved row-by-row.

Now imagine this query:

SELECT country
FROM sales
WHERE amount > 500

A Parquet-aware engine can use metadata to identify which row groups might contain amount > 500, read the relevant amount column chunks for filtering, and then read only the country column for matching records. It does not need to read every column for every row the way a plain text row format typically would.

Why compression works so well 🗜️

Parquet’s storage efficiency comes from a combination of columnar layout, encoding, and compression. Similar values tend to sit next to each other within a column, which usually makes them more compressible than mixed-value row-based storage.

For example, a status column containing repeated values like SUCCESS, SUCCESS, FAILED, SUCCESS is far easier to encode compactly when those values are grouped together than when they are scattered across full records containing timestamps, IDs, and free-form text.

That is why Parquet often ends up dramatically smaller than CSV while also being faster to scan for analytical use cases. Its internal organization works with compression instead of fighting it.

Why row group size is a tuning lever 🎛️

Row groups are not just a format detail. They are also a performance tuning lever. Larger row groups often improve compression and reduce metadata overhead, but they can reduce pruning granularity. Smaller row groups allow finer skipping and often more parallelism, but they introduce more metadata and may hurt compression efficiency.

This is one of the reasons output file design matters so much in distributed data systems. A well-formed Parquet file is not just about “using Parquet” — it is also about choosing file sizes and row-group sizing that match your workload.

What this means in Spark 🔥

In Spark, Parquet’s layout maps naturally to common optimizations like column pruning and predicate pushdown. When Spark can use Parquet statistics effectively, it avoids reading unnecessary row groups and often avoids materializing columns that are not selected by the query.

That means your file layout choices affect real job behavior. If your data is written into too many small files or poorly sized row groups, you may lose many of the benefits that Parquet is structurally capable of delivering.

What this means in Iceberg 🧊

Iceberg relies heavily on Parquet because Parquet already provides efficient columnar storage and file-level metadata patterns that work well for analytical reads. Iceberg adds another planning layer on top, but the scan efficiency still depends a lot on the properties of the underlying Parquet files.

In other words, Iceberg gives you table-level intelligence, but Parquet still does much of the physical storage work. Understanding row groups and statistics helps explain why good file compaction and sort strategy can matter so much in Iceberg-backed tables.

What this means in Athena 🏛️

Athena benefits from Parquet for the same core reasons: fewer bytes scanned, better compression, and the ability to skip irrelevant data using metadata and layout-aware reads. Since Athena pricing and performance are tightly tied to scanned data volume, Parquet’s structure can directly reduce both runtime and cost.

That is why converting CSV-based data lakes into partitioned and well-written Parquet often delivers an immediate practical benefit. The file format itself changes how much work the engine has to do.

A common misconception 🚫

A common misconception is that Parquet is just “a binary CSV with compression.” That is not really what it is. Parquet is a structured columnar storage format with typed schema metadata, row groups, column chunks, pages, and statistics-aware footers that analytical engines can exploit directly.

CSV is a simple row-based serialization format. Parquet is a storage format engineered for selective analytical access. Those are fundamentally different design goals.

Final mental model 🧠

If you only remember one thing, remember this:

Row groups partition rows into larger processing units.
Column chunks store one column’s data inside each row group.
Pages break column chunks into smaller encoded blocks.
Footer metadata tells engines what exists, where it lives, and what can be skipped.

Once that clicks, a lot of data engineering advice becomes easier to reason about. File sizing, pruning, partitioning, compaction, and scan performance all tie back to this physical layout.

If you work in Spark, Iceberg, Athena, or any modern analytical stack, understanding Parquet internals is one of those low-level concepts that pays off repeatedly. The format is doing much more than simply storing data — it is shaping how your engine thinks about reading it.

👉 Want to inspect this visually? Try it here: https://databro.dev/tools/parquet-inspector-plus/

Data Engineering Meets DuckDB

Kumaravelu Saraboji Mahalingam — Sun, 01 Mar 2026 16:54:19 +0000

Introduction to Data Engineering and DuckDB

Data engineering is a crucial aspect of the data science ecosystem, focusing on the design, construction, and maintenance of data pipelines and architectures. As data engineers, we strive to create efficient, scalable, and reliable systems that can handle the ever-increasing volumes of data. In this article, we will explore the concept of data engineering and introduce DuckDB, an innovative database management system that is revolutionizing the way we work with data.

What is Data Engineering?

Data engineering is a field that combines software engineering and data science to design, build, and maintain large-scale data systems. Data engineers are responsible for:

Designing and implementing data pipelines
Developing and maintaining data architectures
Ensuring data quality and integrity
Optimizing data storage and retrieval

Data engineering involves a range of activities, from data ingestion and processing to data storage and analysis. It requires a deep understanding of data formats, data structures, and data processing algorithms, as well as expertise in programming languages such as Python, Java, and Scala.

Challenges in Data Engineering

Data engineers face numerous challenges, including:

Scalability: Handling large volumes of data and ensuring that systems can scale to meet increasing demands
Performance: Optimizing data processing and retrieval to minimize latency and maximize throughput
Data Quality: Ensuring that data is accurate, complete, and consistent
Security: Protecting sensitive data from unauthorized access and ensuring compliance with regulatory requirements

Introducing DuckDB

DuckDB is an open-source, columnar database management system that is designed to address the challenges of data engineering. It is a relational database that allows for efficient storage and querying of large datasets. DuckDB is written in C++ and provides a SQL interface for interacting with data.

Key Features of DuckDB

Some of the key features of DuckDB include:

Columnar Storage: DuckDB stores data in a columnar format, which allows for efficient compression and querying of data
In-Memory Processing: DuckDB can process data in-memory, which reduces the need for disk I/O and improves performance
SQL Interface: DuckDB provides a SQL interface for interacting with data, making it easy to integrate with existing data pipelines and tools
Support for Advanced Data Types: DuckDB supports advanced data types such as arrays, structs, and geospatial data

Benefits of Using DuckDB

The benefits of using DuckDB include:

Improved Performance: DuckDB's columnar storage and in-memory processing capabilities make it ideal for real-time analytics and data science applications
Simplified Data Engineering: DuckDB's SQL interface and support for advanced data types make it easy to integrate with existing data pipelines and tools
Cost-Effective: DuckDB is open-source and can run on commodity hardware, making it a cost-effective alternative to traditional database management systems

Use Cases for DuckDB

DuckDB is suitable for a range of use cases, including:

Real-Time Analytics: DuckDB's in-memory processing capabilities make it ideal for real-time analytics and data science applications
Data Warehousing: DuckDB's columnar storage and SQL interface make it suitable for data warehousing and business intelligence applications
IoT Data Processing: DuckDB's support for advanced data types and in-memory processing capabilities make it suitable for IoT data processing and analytics

Conclusion

In conclusion, data engineering is a critical aspect of the data science ecosystem, and DuckDB is an innovative database management system that can help address the challenges of data engineering. With its columnar storage, in-memory processing, and SQL interface, DuckDB is an ideal solution for real-time analytics, data warehousing, and IoT data processing. As data engineers, we should consider DuckDB as a key component of our data architectures and explore its capabilities to improve the efficiency, scalability, and reliability of our data pipelines.

RAG?

Kumaravelu Saraboji Mahalingam — Sun, 01 Mar 2026 14:07:57 +0000

Introduction to RAG in GenAI

As Data Engineers, we're constantly exploring innovative technologies to improve our workflows and models. One such concept that has gained significant attention in the realm of General Artificial Intelligence (GenAI) is Retrieval-Augmented Generation (RAG). In this article, we'll delve into the world of RAG, its components, and how it's revolutionizing the field of GenAI.

What is Retrieval-Augmented Generation (RAG)?

RAG is a paradigm that combines the strengths of retrieval-based and generation-based approaches to produce more accurate, informative, and context-specific outputs. It's particularly useful in applications where the model needs to generate human-like text based on a given prompt or input.

The RAG framework consists of three primary components:

Retriever: This module is responsible for retrieving relevant information from a vast knowledge base or database. The retriever uses the input prompt to search for related documents, passages, or data points that can aid in the generation process.
Generator: Once the retriever has fetched the relevant information, the generator takes over. This module uses the retrieved data to generate the final output, which can be text, images, or any other form of media.
Ranker: The ranker is an optional component that evaluates the generated outputs and ranks them based on their relevance, accuracy, and overall quality.

How RAG Works

The RAG pipeline can be broken down into the following steps:

Input: The user provides a prompt or input that serves as the basis for the generation process.
Retrieval: The retriever searches the knowledge base to gather relevant information related to the input prompt.
Generation: The generator uses the retrieved information to produce one or more candidate outputs.
Ranking: If a ranker is present, it evaluates the generated outputs and assigns a score to each one.
Output: The final output is selected based on the ranking scores or other evaluation metrics.

Benefits of RAG

The RAG framework offers several advantages over traditional generation-based approaches:

Improved accuracy: By leveraging the retriever to fetch relevant information, RAG models can produce more accurate and informative outputs.
Increased contextuality: RAG allows models to consider a broader context when generating outputs, leading to more coherent and relevant responses.
Reduced hallucination: The retriever's ability to fetch real-world data helps reduce the likelihood of hallucination, where models generate outputs that are not grounded in reality.

Real-World Applications of RAG

RAG has numerous applications in areas such as:

Chatbots and conversational AI: RAG can be used to generate more informative and context-specific responses to user queries.
Text summarization: RAG models can summarize long documents or articles by retrieving relevant information and generating concise summaries.
Question answering: RAG can be applied to question answering tasks, where the retriever fetches relevant information and the generator produces the final answer.

Conclusion

Retrieval-Augmented Generation (RAG) is a powerful paradigm that has the potential to revolutionize the field of GenAI. By combining the strengths of retrieval-based and generation-based approaches, RAG models can produce more accurate, informative, and context-specific outputs. As Data Engineers, it's essential to stay up-to-date with the latest advancements in RAG and explore its applications in various domains. Whether you're working on chatbots, text summarization, or question answering, RAG is definitely worth considering as a valuable tool in your toolkit.

Agentic AI Explained

Kumaravelu Saraboji Mahalingam — Fri, 27 Feb 2026 01:03:51 +0000

Introduction to Agentic AI

Agentic AI refers to a subset of artificial intelligence (AI) that focuses on creating autonomous agents capable of making decisions and taking actions based on their environment, goals, and constraints. These agents can be used in various applications, including robotics, smart homes, and decision support systems. As a data engineer, understanding the concept of agentic AI and its components is crucial for designing and implementing effective AI solutions.

Key Components of Agentic AI

Agentic AI consists of several key components that work together to enable autonomous decision-making and action-taking. These components include:

Sensors: These are the inputs that provide the agent with information about its environment. Sensors can be physical, such as cameras or microphones, or virtual, such as data streams or APIs.
Reasoning and Decision-Making: This component is responsible for analyzing the data from the sensors and making decisions based on the agent's goals and constraints. Reasoning and decision-making can be achieved using various techniques, including rule-based systems, machine learning, or optimization algorithms.
Actuators: These are the outputs that enable the agent to take actions in its environment. Actuators can be physical, such as motors or speakers, or virtual, such as sending notifications or making API calls.
Goals and Constraints: These define the objectives and limitations of the agent. Goals can be specified using various techniques, such as reward functions or objective functions, while constraints can be defined using rules or optimization constraints.

Types of Agentic AI

There are several types of agentic AI, each with its strengths and weaknesses. Some of the most common types include:

Reactive Agents: These agents respond to their environment without maintaining any internal state or memory. Reactive agents are simple and efficient but can be limited in their ability to make complex decisions.
Proactive Agents: These agents maintain an internal state and can anticipate and plan for future events. Proactive agents are more complex and powerful than reactive agents but require more computational resources and data.
Hybrid Agents: These agents combine the benefits of reactive and proactive agents by using a combination of reactive and proactive techniques.

Applications of Agentic AI

Agentic AI has a wide range of applications across various industries, including:

Robotics: Agentic AI can be used to control robots and enable them to navigate and interact with their environment.
Smart Homes: Agentic AI can be used to control and automate smart home devices, such as thermostats and lights.
Decision Support Systems: Agentic AI can be used to provide decision support for complex tasks, such as financial planning or medical diagnosis.

Challenges and Limitations

While agentic AI has the potential to revolutionize various industries, it also poses several challenges and limitations, including:

Data Quality and Availability: Agentic AI requires high-quality and relevant data to make effective decisions.
Explainability and Transparency: Agentic AI can be complex and difficult to interpret, making it challenging to understand the decision-making process.
Security and Safety: Agentic AI can pose security and safety risks if not designed and implemented properly.

Conclusion

Agentic AI is a powerful and versatile technology that has the potential to transform various industries. As a data engineer, understanding the key components, types, and applications of agentic AI is crucial for designing and implementing effective AI solutions. However, agentic AI also poses several challenges and limitations that need to be addressed to ensure its safe and effective deployment. By continuing to advance and improve agentic AI, we can unlock its full potential and create more autonomous, efficient, and effective systems.

Future Directions

As agentic AI continues to evolve, we can expect to see significant advancements in areas such as:

Edge AI: The integration of agentic AI with edge computing to enable real-time processing and decision-making.
Explainable AI: The development of techniques and tools to improve the explainability and transparency of agentic AI decision-making.
Human-AI Collaboration: The design of systems that enable effective collaboration between humans and agentic AI agents.

By exploring these future directions and addressing the challenges and limitations of agentic AI, we can create more sophisticated and effective AI systems that transform industries and improve our lives.