Forem: divyaprakash D

Vitreus: Local-First Spreadsheet Intelligence with Gemma 4

divyaprakash D — Sun, 24 May 2026 23:03:30 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

TL;DR: Vitreus is a spreadsheet agent that lets you ask natural-language questions of CSV and XLSX workbooks, then uses Gemma 4 to return an auditable JSON action manifest. The manifest can highlight rows, write values, and add formulas; Vitreus then applies those changes to a CSV snapshot or a real .xlsx file with colors and formulas preserved.

I built it because spreadsheets are where a lot of real business logic lives: budgets, project trackers, sales forecasts, invoices, HR reviews, and messy exported reports. But the moment you ask an AI assistant to "just update the sheet," you run into a trust problem: what exactly did it change, and why?

Vitreus answers that by splitting the job in two:

Gemma 4 reasons about the workbook.
A deterministic driver applies only structured, reviewable actions.

The model never mutates the workbook directly.

What I Built

Vitreus is a local-first spreadsheet intelligence tool for developers, analysts, and teams who want AI help inside sensitive workbook workflows without giving the model unrestricted control.

The current project includes:

Capability	Status	Why it matters
CSV workbook snapshots	Working	Fast, portable test format for spreadsheet data
XLSX input and output	Working	Preserves cell values, formulas, and highlight colors
One-shot `analyze --output` flow	Working	No separate "generate JSON" and "apply JSON" steps
JSON action manifests	Working	Every model action is inspectable before execution
Local Gemma 4 via Ollama	Supported	Private local inference path
Google AI Studio backend	Supported	API-key path for users without local GPU access
Deterministic fallback planner	Working	CI and demos can run without a model
Chart/receipt image payload prep	Working	Foundation for multimodal spreadsheet workflows
Bash and Nushell command references	Working	Easy manual testing on Linux/Nushell systems

At a high level:

CSV/XLSX workbook
    |
    v
WorkbookSnapshot
    |
    v
compact JSON sheet context
    |
    v
Gemma 4 planner
    |
    v
JSON manifest: highlight / write_value / formula
    |
    v
InMemoryCalcDriver
    |
    v
CSV + sidecar JSON, or XLSX with colors/formulas

This is not a chatbot bolted onto a spreadsheet. It is a small, testable agent pipeline where the model is responsible for reasoning and the application is responsible for safe execution.

Demo

Demo 1: Ask Gemma 4 to find budget problems

Input file: examples/sample_workbook.csv

It contains employee/project rows with columns like:

Name,Department,Q1_Target,Q1_Actual,Q2_Target,Q2_Actual,Score,Status,Budget,Spent,Notes
Ada Lovelace,Research,120000,128000,130000,135000,94,On Track,200000,184000,
Alan Turing,Security,90000,87000,95000,91000,72,Under Review,110000,135000,
...

Command:

uv run vitreus analyze examples/sample_workbook.csv \
  "Highlight all rows where Spent exceeds Budget, and for each over-budget row write OVER BUDGET in the Notes column" \
  --backend google

Actual output from the live API-key run:

{
  "model": {
    "primary": "gemma4:31b",
    "drafter": "gemma4:4b",
    "rationale": "Identified rows where Spent exceeds Budget: Alan Turing (Row 3) and Dennis Ritchie (Row 9). Applied highlights to these rows and updated the Notes column to 'OVER BUDGET'."
  },
  "actions": [
    {
      "type": "highlight",
      "range": "Sheet1!A3:K3",
      "color": "#f97316",
      "reason": "Spent (135000) exceeds Budget (110000)"
    },
    {
      "type": "write_value",
      "cell": "Sheet1!K3",
      "value": "OVER BUDGET",
      "reason": "Spent exceeds Budget"
    },
    {
      "type": "highlight",
      "range": "Sheet1!A9:K9",
      "color": "#f97316",
      "reason": "Spent (108000) exceeds Budget (90000)"
    },
    {
      "type": "write_value",
      "cell": "Sheet1!K9",
      "value": "OVER BUDGET",
      "reason": "Spent exceeds Budget"
    }
  ]
}

This is the core Vitreus pattern: Gemma 4 does the semantic reasoning, but the output is still machine-checkable.

Demo 2: One command, real XLSX output

CSV is useful, but CSV cannot store background colors or Excel formulas. So Vitreus supports one-shot XLSX export:

uv run vitreus analyze examples/sample_workbook.csv \
  "Highlight rows where Spent exceeds Budget in orange, write OVER BUDGET in the Notes column" \
  --backend google \
  --output /tmp/vitreus_result.xlsx

The result is a real Excel workbook. Open it in LibreOffice Calc or Excel and the highlighted rows are actually colored.

Demo 3: XLSX in, XLSX out

I also created a larger workbook for testing:

examples/test_workbook.xlsx
├── Sales        25 data rows, quota and commission formulas
├── Expenses     24 data rows, annual budget/actual formulas
└── HR_Reviews   18 data rows, rating/bonus formulas

Command:

uv run vitreus analyze examples/test_workbook.xlsx \
  "In the Expenses sheet, highlight rows where Annual_Actual exceeds Annual_Budget" \
  --backend google \
  --sheet Expenses \
  --output /tmp/vitreus_expenses_result.xlsx

Recent smoke-test result:

{"applied": 1, "saved": "/tmp/vitreus_xlsx_test.xlsx", "errors": []}

Code

https://github.com/divyaprakash0426/vitreus

Important files:

File	Responsibility
`core/driver.py`	Workbook snapshots, CSV/XLSX loading, XLSX saving, manifest execution
`core/reasoning.py`	Gemma 4 model policy, Ollama backend, Google AI Studio backend, manifest parsing
`core/vision.py`	Image metadata and multimodal prompt payload preparation
`interfaces/cli.py`	Typer CLI commands: `models`, `analyze`, `apply-manifest`, `vision`
`examples/sample_workbook.csv`	Small CSV test dataset
`examples/test_workbook.xlsx`	Larger multi-sheet XLSX test workbook
`examples/test_commands.sh`	Bash demo/test command reference
`examples/test_commands.nu`	Nushell demo/test command reference
`tests/`	37 automated tests covering backends, CLI, driver, output, and vision

Try it:

git clone https://github.com/divyaprakash0426/vitreus.git
cd vitreus

uv sync --extra dev
uv run pytest -q
uv run vitreus models

Run with Google AI Studio:

uv sync --extra integrations
export GEMINI_API_KEY="your_key_here"

uv run vitreus analyze examples/sample_workbook.csv \
  "Flag rows where Spent exceeds Budget and write OVER BUDGET in Notes" \
  --backend google

Run locally with Ollama:

uv sync --extra integrations
ollama pull gemma4:31b

uv run vitreus analyze examples/sample_workbook.csv \
  "Summarise department-level spending risks" \
  --backend ollama

Use the lighter local model:

uv run vitreus analyze examples/sample_workbook.csv \
  "Highlight rows that need review" \
  --backend ollama \
  --model gemma4:4b

How I Used Gemma 4

Vitreus is designed around Gemma 4 31B Dense as the primary model.

Spreadsheet intelligence is a long-context reasoning task. A useful spreadsheet agent needs to:

Read many rows without losing the column semantics.
Understand user intent expressed in plain English.
Compare related fields like Budget and Spent.
Decide which cells or rows need attention.
Generate valid spreadsheet references like Sheet1!A3:K3.
Explain why each action is needed.
Return strict JSON instead of prose.

That is why I chose the 31B Dense model as the default planner. It is the best fit for "read this workbook, understand the pattern, and produce a reliable action plan."

The project still supports smaller Gemma 4 models as a deliberate secondary path:

Model family	Role in Vitreus	Why
Gemma 4 31B Dense	Primary planner	Best fit for long-context workbook reasoning
Gemma 4 4B	Drafter / edge assistant	Lower latency for quick previews and constrained hardware
Gemma 4 26B MoE	Future throughput path	Useful when many independent workbook requests need efficient routing

The model policy is encoded directly in the project:

@dataclass(frozen=True)
class GemmaModelChoice:
    primary: str
    drafter: str
    rationale: str

    @classmethod
    def default(cls) -> "GemmaModelChoice":
        return cls(
            primary="gemma4:31b",
            drafter="gemma4:4b",
            rationale=(
                "Gemma 4 31B Dense is the default because Vitreus needs local, "
                "long-context workbook reasoning and stronger multimodal planning; "
                "Gemma 4 4B remains useful as a low-latency drafter on edge hardware."
            ),
        )

The prompt requires JSON only:

You are Vitreus, a spreadsheet intelligence agent running gemma4:31b.
Analyze the spreadsheet data below and respond with ONLY a valid JSON manifest.

Task: Highlight rows where Spent exceeds Budget.

Required JSON response shape:
{
  "model": {"primary": "gemma4:31b", "drafter": "gemma4:4b", "rationale": "..."},
  "actions": [
    {
      "type": "highlight|write_value|formula",
      "range": "Sheet1!A1:B2",
      "cell": "Sheet1!C2",
      "value": "...",
      "formula": "=SUM(A1:A10)",
      "color": "#f97316",
      "reason": "why this action is needed"
    }
  ]
}

That contract is the heart of the project. Gemma 4 is not asked to "edit a spreadsheet." It is asked to produce a plan that Vitreus can inspect and execute.

How It Works Internally

1. WorkbookSnapshot: turn sheets into model context

Vitreus reads CSV and XLSX files into a simple in-memory representation:

@dataclass
class WorkbookSnapshot:
    sheets: dict[str, list[list[Any]]]

The snapshot can load:

WorkbookSnapshot.from_csv("scores.csv")
WorkbookSnapshot.from_xlsx("workbook.xlsx", sheet_name="Expenses")
WorkbookSnapshot.from_file("workbook.xlsx", sheet_name="Sales")

Then it exports the requested range as compact JSON:

[
  {"Name": "Alan Turing", "Budget": 110000, "Spent": 135000},
  {"Name": "Dennis Ritchie", "Budget": 90000, "Spent": 108000}
]

This gives Gemma 4 useful semantic structure: headers become keys, rows become records, and the model does not have to infer everything from raw cell coordinates.

2. Backends: local-first, cloud-optional

Vitreus supports three execution modes:

Backend	Command	Use case
Fallback	default	CI, demos, no model required
Ollama	`--backend ollama`	Local Gemma 4 inference
Google AI Studio	`--backend google`	API-key run when local GPU is unavailable

The backends are small adapter classes:

class OllamaBackend:
    def call(self, prompt: str) -> str:
        from ollama import chat
        response = chat(model=self.model, messages=[{"role": "user", "content": prompt}])
        return response.message.content

class GoogleAIBackend:
    def call(self, prompt: str) -> str:
        from google import genai
        client = genai.Client(api_key=self.api_key)
        response = client.models.generate_content(model=self.model, contents=prompt)
        return response.text

The lazy imports are intentional. The core package can be installed and tested without Ollama, Google GenAI, LibreOffice, or cloud credentials.

3. Manifest execution: structured actions only

The executor supports three action types:

{"type": "highlight", "range": "Sheet1!A3:K3", "color": "#f97316", "reason": "..."}
{"type": "write_value", "cell": "Sheet1!K3", "value": "OVER BUDGET"}
{"type": "formula", "cell": "Sheet1!J11", "formula": "=SUM(J2:J10)", "reason": "..."}

Unsupported action types are rejected instead of silently ignored. This keeps the model inside a narrow, auditable capability boundary.

4. CSV vs XLSX: the boring detail that mattered

CSV cannot store background colors or formulas as spreadsheet formulas. Early testing made that painfully obvious: a manifest could say "highlight this row," but a CSV output could only store text.

So Vitreus handles both formats explicitly:

Output	Behavior
`.csv`	Saves values and writes highlights to `<name>_highlights.json`
`.xlsx`	Saves values, formulas, and real cell background colors

That means users get a clear warning for CSV:

CSV format cannot store cell colors or formulas.
write_value changes are saved in result.csv
Highlight colors -> result_highlights.json
Tip: use --output result.xlsx to preserve everything in one file.

And they get real spreadsheet formatting when they choose XLSX.

The Technical Problems That Shaped the Project

Problem 1: "AI changed my spreadsheet" is not good enough

If a spreadsheet agent directly mutates a workbook, the user has to trust a black box.

The fix was the manifest contract. Every action contains:

the action type,
the exact cell or range,
the value/formula/color,
and the reason.

That makes it possible to log, review, diff, test, or reject model output before execution.

Problem 2: Local-first should not mean "local-only"

My preferred path is Ollama with gemma4:31b, but not every developer has a GPU available. I hit this myself while testing away from my GPU profile.

So Vitreus supports both:

# Local
uv run vitreus analyze examples/sample_workbook.csv "..." --backend ollama

# API key
uv run vitreus analyze examples/sample_workbook.csv "..." --backend google

The model interface stays the same. Only the backend changes.

Problem 3: Multi-sheet XLSX files are the real spreadsheet format

CSV was useful for early tests, but real workbooks have sheets, formulas, styles, and business structure.

The latest version added:

WorkbookSnapshot.from_xlsx(path, sheet_name="Expenses")
WorkbookSnapshot.from_xlsx(path, all_sheets=True)
WorkbookSnapshot.from_file(path, sheet_name="Sales")

The CLI now accepts .xlsx as input:

uv run vitreus analyze examples/test_workbook.xlsx \
  "In the Sales sheet, highlight reps below quota" \
  --sheet Sales \
  --output /tmp/sales_review.xlsx

Problem 4: Tests need to run without secret keys or local models

The project has 37 automated tests. They cover:

backend adapter construction,
CLI flags,
missing API key behavior,
manifest parsing,
CSV save behavior,
XLSX values,
XLSX formulas,
XLSX cell colors,
XLSX input loading,
and the multi-sheet test workbook shape.

The deterministic fallback planner is not a replacement for Gemma 4. It exists so the execution pipeline can be tested without depending on a network call or a local model.

Why This Is a Good Gemma 4 Use Case

Gemma 4 is doing real work here. It is not decorative.

The model is responsible for the part that is hard to encode as rules:

understanding workbook headers,
mapping natural-language requests to spreadsheet operations,
comparing values across columns,
deciding which rows need attention,
generating formulas,
and explaining each action.

The surrounding application does the parts software should do:

loading files,
constraining the action schema,
validating JSON,
applying known operations,
preserving output formats,
and keeping the workflow auditable.

That division is what makes the project useful. Gemma 4 supplies reasoning; Vitreus supplies guardrails.

Current Limitations

Vitreus is already useful for CSV/XLSX workflows, but there are areas I would keep improving:

Area	Current state	Next step
LibreOffice live control	Adapter planned from blueprint	Wire PyUNO to a running Calc socket
Multimodal receipts/charts	Payload prep implemented	Feed images into Gemma 4 multimodal backend
Multi-sheet reasoning	Sheet-specific input works	Add whole-workbook summarization
Formula safety	Formula strings are written	Add formula linting and policy controls
Review UI	Terminal-first	Add a small manifest review screen

The design intentionally keeps these as separable layers. The workbook reader, reasoning engine, and manifest executor can evolve independently.

What I Learned

The biggest lesson was that a good spreadsheet agent is less about "letting AI use Excel" and more about designing a trustworthy boundary between reasoning and execution.

Gemma 4 is capable enough to understand messy tabular context and produce useful spreadsheet plans. But the application still needs to say:

here is the allowed action vocabulary,
here is the exact JSON shape,
here is how output will be applied,
and here is what happens when a format cannot represent an action.

That is the difference between an impressive demo and a tool I would trust with a real workbook.

Acknowledgements

Built with:

Gemma 4 31B Dense as the primary reasoning model
Gemma 4 4B as the drafter/edge model path
Ollama for local model execution
Google AI Studio for API-key testing
Typer for the CLI
openpyxl for XLSX input/output
pytest for the test suite
LibreOffice Calc as the target spreadsheet environment

What I Learned Building a Local-First Spreadsheet Agent with Gemma 4

divyaprakash D — Sun, 24 May 2026 22:41:57 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

I built a small project called Vitreus to answer a practical question:

What does a useful local-first AI agent look like when the data is not a chat history, but a spreadsheet?

Spreadsheets are deceptively hard. They look simple because they are rows and columns, but real spreadsheet work mixes:

business rules,
messy headers,
formulas,
visual formatting,
multiple sheets,
hidden assumptions,
and private data.

That made the project a good test case for Gemma 4. I did not want to use the model as a novelty wrapper. I wanted Gemma 4 to do the part that is genuinely hard: read tabular context, understand the user's intent, and produce a safe action plan.

This post is the technical write-up behind that build: how I chose the Gemma 4 model, how I wired local and API backends, and the pattern I recommend if you want to let an LLM work with sensitive documents.

The Core Idea: Reasoning Is Not Execution

The most important design decision came before any code:

Gemma 4 should reason about the spreadsheet, but it should not directly mutate the spreadsheet.

Instead of giving the model a tool like "edit cell" and letting it operate freely, Vitreus asks Gemma 4 to return a JSON manifest:

{
  "model": {
    "primary": "gemma4:31b",
    "drafter": "gemma4:4b",
    "rationale": "Why this model and these actions were selected."
  },
  "actions": [
    {
      "type": "highlight",
      "range": "Sheet1!A3:K3",
      "color": "#f97316",
      "reason": "Spent exceeds Budget."
    },
    {
      "type": "write_value",
      "cell": "Sheet1!K3",
      "value": "OVER BUDGET",
      "reason": "The row needs review."
    }
  ]
}

The application then applies only the actions it understands:

highlight
write_value
formula

This separation makes the system easier to trust. The model can still be creative and semantic, but the executor stays deterministic and testable.

Why Gemma 4 31B Dense Was the Right Default

The Gemma 4 family spans different hardware and throughput needs:

Gemma 4 variant	Best fit	How I think about it
2B / 4B small models	Edge, browser, mobile, fast drafts	Great for low-latency helpers and constrained devices
31B Dense	Local/server reasoning	Best fit when the task needs stronger reasoning over larger context
26B MoE	High-throughput reasoning	Attractive when many requests need efficient expert routing

For Vitreus, I picked Gemma 4 31B Dense as the primary model.

Why? Because spreadsheet work is context-heavy.

A spreadsheet request like this:

Highlight all rows where Spent exceeds Budget, and for each over-budget row
write OVER BUDGET in the Notes column.

requires the model to:

Find the relevant columns.
Compare values row by row.
Preserve the row identity.
Produce valid spreadsheet ranges.
Write a clear reason for each action.
Return strict JSON without extra prose.

That is not just classification. It is structured reasoning over tabular data.

The 4B model still has a useful role. I treat it as a drafter: good for previews, edge use, or lightweight suggestions. But for the final workbook plan, the 31B Dense model is the safer default.

The model policy in Vitreus is explicit:

@dataclass(frozen=True)
class GemmaModelChoice:
    primary: str
    drafter: str
    rationale: str

    @classmethod
    def default(cls) -> "GemmaModelChoice":
        return cls(
            primary="gemma4:31b",
            drafter="gemma4:4b",
            rationale=(
                "Gemma 4 31B Dense is the default because Vitreus needs local, "
                "long-context workbook reasoning and stronger multimodal planning; "
                "Gemma 4 4B remains useful as a low-latency drafter on edge hardware."
            ),
        )

That rationale is not just documentation. It is returned in the manifest so the user can see which model policy was used.

Running Gemma 4 Locally with Ollama

The local path is the preferred Vitreus path because spreadsheets often contain sensitive data: salaries, invoices, forecasts, customer exports, sales reports, and internal review notes.

Install dependencies:

uv sync --extra integrations

Pull the local model:

ollama pull gemma4:31b

Run Vitreus with local Gemma 4:

uv run vitreus analyze examples/sample_workbook.csv \
  "Summarise department-level spending risks" \
  --backend ollama

Use the smaller drafter model when you want a lighter local run:

uv run vitreus analyze examples/sample_workbook.csv \
  "Highlight rows that need review" \
  --backend ollama \
  --model gemma4:4b

The backend adapter is deliberately tiny:

class OllamaBackend:
    def __init__(self, model: str = "gemma4:31b"):
        self.model = model

    def call(self, prompt: str) -> str:
        from ollama import chat
        response = chat(model=self.model, messages=[{"role": "user", "content": prompt}])
        return response.message.content

This keeps the dependency boundary clean. The rest of the project does not need to know whether the model is local or remote.

Running Gemma 4 with an API Key

Local inference is ideal for privacy, but it is not always practical. I was not always running in my GPU profile while testing, so I also added a Google AI Studio backend.

Set the key:

export GEMINI_API_KEY="your_key_here"

Run with the cloud backend:

uv run vitreus analyze examples/sample_workbook.csv \
  "Highlight the top 3 performers by Score in green" \
  --backend google

Or pass the key inline:

uv run vitreus analyze examples/sample_workbook.csv \
  "Flag rows where Spent exceeds Budget" \
  --backend google \
  --api-key "$GEMINI_API_KEY"

The adapter:

class GoogleAIBackend:
    def __init__(self, api_key: str, model: str = "gemma-4-31b-it"):
        self.api_key = api_key
        self.model = model

    def call(self, prompt: str) -> str:
        from google import genai
        client = genai.Client(api_key=self.api_key)
        response = client.models.generate_content(model=self.model, contents=prompt)
        return response.text

This gives users two deployment modes:

Mode	Command	When to use
Local	`--backend ollama`	Sensitive data, local GPU/CPU available
API key	`--backend google`	No local model available, quick testing

The prompt and manifest contract stay the same in both modes.

The Prompt Contract

The prompt is intentionally strict. It tells Gemma 4 that it is a spreadsheet intelligence agent and that the output must be JSON only:

You are Vitreus, a spreadsheet intelligence agent running gemma4:31b.
Analyze the spreadsheet data below and respond with ONLY a valid JSON manifest
(no markdown, no explanation).

Task: Highlight rows where Spent exceeds Budget.

Sheet data:
[
  {
    "Name": "Alan Turing",
    "Budget": 110000,
    "Spent": 135000,
    "Notes": ""
  }
]

Required JSON response shape:
{
  "model": {"primary": "gemma4:31b", "drafter": "gemma4:4b", "rationale": "..."},
  "actions": [
    {
      "type": "highlight|write_value|formula",
      "range": "Sheet1!A1:B2",
      "cell": "Sheet1!C2",
      "value": "...",
      "formula": "=SUM(A1:A10)",
      "color": "#f97316",
      "reason": "why this action is needed"
    }
  ]
}

The response is parsed as JSON:

def parse_manifest(content: str) -> dict[str, Any]:
    fenced = re.search(r"```

(?:json)?\s*(\{.*?\})\s*

```", content, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else content
    return json.loads(candidate)

In a production system I would harden this further with schema validation, but even this first version shows the important pattern: the LLM output is data, not authority.

CSV Taught Me a Product Lesson

The first version worked with CSV files. That was useful for tests, but then a practical issue surfaced:

CSV cannot store colors.

A manifest can say:

{"type": "highlight", "range": "Sheet1!A3:K3", "color": "#f97316"}

But if the output is CSV, there is nowhere to put that color.

So Vitreus handles the limitation explicitly:

CSV output:
  result.csv
  result_highlights.json

XLSX output:
  result.xlsx with values, formulas, and cell fills

That led to the one-shot output flow:

uv run vitreus analyze examples/sample_workbook.csv \
  "Highlight rows where Spent exceeds Budget in orange, write OVER BUDGET in Notes" \
  --backend google \
  --output /tmp/vitreus_result.xlsx

No separate "generate manifest" and "apply manifest" commands required.

The command still uses the manifest internally; it just applies it immediately and writes the output workbook.

XLSX Made It Feel Real

CSV is a good interchange format. XLSX is what spreadsheet users actually expect.

The current Vitreus snapshot layer can now load Excel workbooks:

WorkbookSnapshot.from_xlsx("examples/test_workbook.xlsx", sheet_name="Expenses")
WorkbookSnapshot.from_xlsx("examples/test_workbook.xlsx", all_sheets=True)
WorkbookSnapshot.from_file("examples/test_workbook.xlsx", sheet_name="Sales")

And the CLI can round-trip XLSX:

uv run vitreus analyze examples/test_workbook.xlsx \
  "In the Expenses sheet, highlight rows where Annual_Actual exceeds Annual_Budget" \
  --backend google \
  --sheet Expenses \
  --output /tmp/vitreus_expenses_result.xlsx

The generated test workbook has three sheets:

examples/test_workbook.xlsx
├── Sales
├── Expenses
└── HR_Reviews

This mattered because it moved the project from "LLM over CSV" toward a more realistic spreadsheet workflow.

What Gemma 4 Is Good At in This Pattern

From this project, I found Gemma 4 most useful for:

1. Header-aware reasoning

The user does not have to say "compare column I and column J." They can say:

Find rows where Spent exceeds Budget.

Gemma 4 maps that request to the right fields.

2. Spreadsheet reference generation

The output needs actual ranges:

Sheet1!A3:K3

This is small but important. If the model cannot translate semantic findings back to spreadsheet coordinates, it cannot drive a workbook.

3. Explanations attached to actions

Every non-obvious action can include a reason:

{
  "reason": "Spent (135000) exceeds Budget (110000)"
}

That makes the manifest useful for audit logs and review UIs.

4. Flexible user intent

The same pipeline can handle:

Highlight the top 3 performers by Score in green.

or:

Add a SUM formula below the Spent column.

or:

In the Expenses sheet, highlight rows where Annual_Actual exceeds Annual_Budget.

The executor is narrow, but the language interface is flexible.

Where I Would Use Each Gemma 4 Variant

If you are deciding which Gemma 4 model to use, here is the mental model I ended up with:

Use 2B / 4B when latency and hardware matter most

Good for:

browser or mobile helpers,
preview suggestions,
small table summaries,
command suggestions,
offline edge workflows.

In Vitreus, the 4B model is the drafter path.

Use 31B Dense when reasoning quality matters most

Good for:

long-context document analysis,
spreadsheet reasoning,
multi-step planning,
structured JSON generation,
local-first professional tools.

This is the default Vitreus model.

Use 26B MoE when throughput matters

Good for:

many independent requests,
server workloads,
high-volume classification/routing,
batch analysis where expert routing can improve efficiency.

I did not make MoE the default because Vitreus is currently optimized around careful workbook reasoning, not high-throughput request serving. But it is an obvious future backend option.

Design Checklist for LLMs Working on Private Files

If you are building something similar with Gemma 4, I recommend this checklist:

Do not let the model directly mutate private files.
Make the model return structured data.
Keep action types small and explicit.
Require reasons for destructive or non-obvious actions.
Make local inference the default path when privacy matters.
Offer an API-key path for users without local hardware.
Test the executor without requiring the model.
Tell users when the output format cannot preserve an action.
Prefer reviewable manifests over invisible automation.

The model should be powerful, but the boundary around it should be boring.

That is a compliment. Boring boundaries are what make AI tools safe enough to use.

Try the Pattern Yourself

Clone and run:

git clone https://github.com/divyaprakash0426/vitreus.git
cd vitreus
uv sync --extra dev
uv run pytest -q
uv run vitreus models

Run without any model:

uv run vitreus analyze examples/sample_workbook.csv \
  "Highlight rows that need review"

Run with Google AI Studio:

uv sync --extra integrations
export GEMINI_API_KEY="your_key_here"

uv run vitreus analyze examples/sample_workbook.csv \
  "Highlight all rows where Spent exceeds Budget and write OVER BUDGET in Notes" \
  --backend google

Run with local Ollama:

uv sync --extra integrations
ollama pull gemma4:31b

uv run vitreus analyze examples/sample_workbook.csv \
  "Summarise budget risks by department" \
  --backend ollama

Write a real XLSX file:

uv run vitreus analyze examples/test_workbook.xlsx \
  "In the Sales sheet, highlight reps whose Quota_Attainment is below 80 percent" \
  --backend google \
  --sheet Sales \
  --output /tmp/vitreus_sales_review.xlsx

Final Thought

The most interesting thing about Gemma 4 for me is not just that it can run locally or reason over longer context. It is that those capabilities change the shape of applications developers can build.

With a strong local model, a spreadsheet assistant does not have to be a SaaS upload box. It can be:

local-first,
auditable,
scriptable,
privacy-aware,
and still useful.

That is the direction I want AI tooling to move: powerful models at the center, but surrounded by software engineering boundaries that users can understand.

Vitreus is my first pass at that pattern for spreadsheets.

ClawForge — A Trustworthy, Cross-Variant Skill Hub for the OpenClaw Ecosystem

divyaprakash D — Sun, 26 Apr 2026 19:42:38 +0000

This is a submission for the OpenClaw Challenge.

What I Built

ClawForge is a curated-from-scratch skill hub for the OpenClaw ecosystem — 30 hand-engineered skills, 6 variant installers, 3 persona "souls", a security trio, and a local cross-architecture evidence harness, all behind a single SKILL.md standard.

It exists because the obvious place to get OpenClaw skills today — ClawHub, with its ~13,729 community uploads — has a trust problem. Roughly 1-in-5 of those skills ship patterns I wouldn't run on a borrowed laptop, and the ClawHavoc campaign already published ~300 outright trojan skills before takedown. "Just claw install whatever" is the new curl | sudo bash.

ClawForge is the boring, opinionated alternative:

30 skills, not 30,000. Every skill is engineered from scratch against one schema — no re-uploads, no forks of forks, no anonymous maintainers.
6 variants, one installer. OpenClaw, ZeroClaw, PicoClaw, NullClaw, NanoBot, IronClaw — each with different runtimes (Node, Rust, Go, Zig, Python, WASM) and different blast radii. One install.sh figures out what your target actually supports.
Static security tiers, not vibes. Every skill declares an L0/L1/L2/L3 tier in frontmatter — read-only, scoped-write, network, or privileged. Scanners reject anything that lies about its tier.
A local QEMU evidence ladder. I don't claim "works everywhere." I claim "x86_64 = emulated guest pass, arm64/armv7/riscv64 = usermode smoke pass, i386 = metadata only," with JSON+Markdown receipts.

Repo: github.com/divyaprakash0426/clawForge
Release: v1.0.0

If ClawHub is npm circa 2018 — open, sprawling, and quietly malicious — ClawForge is trying to be the Debian stable of skills: smaller, slower, signed for what it claims to do.

How I Used OpenClaw

OpenClaw's whole pitch is "AgentSkills are portable." The interface really is portable — every variant speaks the same skill protocol. But execution isn't. A skill that works on OpenClaw (1.5GB Node toolchain) will silently die on NullClaw (1MB Zig binary on a RISC-V board). The blueprint for ClawForge fell out of trying to make that real instead of marketing.

Here's how OpenClaw's design actually shaped the build, and where I had to push back on it.

The `SKILL.md` standard — front-loading every honest constraint

Every skill in ClawForge is a directory with a single SKILL.md. The frontmatter is the contract:

---
name: skill-sentinel
version: 1.2.0
tier: L1                       # read-only audit, no network
variants:
  openclaw: full
  zeroclaw: full
  picoclaw: full
  nullclaw: partial             # no JSON schema validator on Zig build
  nanobot: full
  ironclaw: unsupported         # WASM sandbox blocks subprocess scan
entrypoint: ./run.sh
permissions:
  fs: [read]
  net: []
  exec: [grep, awk, jq]
---

Two design decisions worth defending:

Per-variant compatibility is full | partial | unsupported, not a boolean. Half the pain in the ecosystem comes from skills that pretend to be universal. partial forces the author to write a COMPAT.md explaining what's degraded and why.
Tiers are static, not dynamic. I prototyped a "risk score" early on. It was useless — every author rationalizes their own skill into the green zone. L0–L3 are coarse, declarative, and trivially auditable: if you ask for net permissions in an L1 skill, the scanner just rejects you.

The variant-aware installer

The installer is plain POSIX shell, but it does the one thing OpenClaw itself can't: it refuses to install skills that won't run.

$ ./install.sh --variant nullclaw
[forge] target: nullclaw (zig, riscv64, ~1MB)
[forge] resolving 30 skills against compatibility matrix…
[forge]   ✓ 24 full
[forge]   ⚠ 4 partial   (degrade to L1, see COMPAT.md)
[forge]   ✗ 2 unsupported  skipped: bedrock-rag, kube-scout
[forge] staging to ~/.nullclaw/skills/  (atomic)
[forge] running skill-sentinel on staged tree…
[forge] OK — 28 skills installed, 2 skipped, 0 quarantined

No "install everything and hope." If a skill's frontmatter lies about its variant support, the matching scanner test fails in CI before the skill is ever published.

The security trio

Three skills act as the distribution's immune system, and they all run on every PR:

Skill	Job	Tier
`skill-sentinel`	Static analysis of every `SKILL.md` + entrypoint — tier/permission lies, shell injection patterns, suspicious network calls	L1
`prompt-fence`	Scans souls/personas for prompt-injection payloads and hidden instructions	L1
`secret-guard`	Pre-commit + CI scanner for committed secrets, mirrored config across local + CI	L1

Sentinel output is intentionally loud:

[skill-sentinel] scanning ./skills/arxiv-scout
  ✗ FAILED  declares tier=L1 but entrypoint contains: curl https://...
  → tier L1 forbids network egress. Either:
      (a) bump declared tier to L2 and document it in COMPAT.md, or
      (b) move the fetch into a separate L2 helper skill.
exit 1

This is the part OpenClaw doesn't give you out of the box — and the part I'd argue every claw distribution needs before it goes near a production box.

The scaffolder — keeping the schema honest

The whole standard collapses if writing a compliant skill is harder than writing a non-compliant one. So:

$ ./tools/forge-skill.sh my-new-skill --tier L1 --variants full,full,partial,unsupported,full,full
[forge-skill] generated skills/my-new-skill/
[forge-skill]   SKILL.md       (frontmatter + checklist)
[forge-skill]   run.sh         (entrypoint stub with safe defaults)
[forge-skill]   COMPAT.md      (one section per non-full variant)
[forge-skill]   tests/smoke.sh (skill-sentinel + prompt-fence dry run)
[forge-skill] new skill ready in 27s. Run: ./tools/validate_skills.py my-new-skill

A new compliant skill, end-to-end, in under 30 seconds. The validators (tools/validate_skills.py, tools/validate_souls.py) are stricter than the scaffolder — so the only path that passes CI is the one that passes the scaffolder and the validators.

The cross-architecture evidence harness

This is the part I'm proudest of, and the part that's deliberately not in the public repo.

OpenClaw runs on a deeply heterogeneous fleet — laptops, Pi-class boards, RISC-V dev kits, sandboxed WASM. "It works on my Arch x86_64 box" is a useless claim for a skill hub. So I built a local-only harness under .local/clawforge-vm/ (gitignored) that produces honest receipts:

.local/clawforge-vm/
├── run-host-lane.sh             # 21 deterministic checks on the host
├── bootstrap-e2e-vm.sh          # disposable Ubuntu Jammy KVM guest, reruns the host lane inside
├── run-usermode-arch-lane.sh    # qemu-*-static smoke for arm64, armv7, riscv64
└── report-arch-evidence.sh      # emits JSON + Markdown summary

The host lane runs all 6 variant installers, every validator, every scanner, the soul installer, scaffolder round-trip, and mock flows for kube-scout, permission-lens, docker-hygiene, arxiv-scout, bedrock-rag, and tf-copilot — 21 checks, deterministic, on every host change.

The arch lane does what it can do honestly:

[arch-evidence] summary
  x86_64   : emulated-guest  ✓  21/21 checks (Ubuntu Jammy KVM)
  arm64    : usermode-smoke  ✓  installers + sentinel  (Ubuntu Jammy rootfs)
  armv7    : usermode-smoke  ✓  installers + sentinel  (Alpine 3.20 rootfs)
  riscv64  : usermode-smoke  ✓  installers + sentinel  (Ubuntu Jammy rootfs)
  i386     : metadata-only   —  no rootfs staged, frontmatter-only

That's it. No "fully supported" claims I can't back up. The same script writes a JSON receipt I can check into the next release notes.

Demo

End-to-end proof — running inside a disposable QEMU VM

The GIF below is an unedited asciinema recording of the full ClawForge pipeline running inside a fresh Ubuntu Jammy KVM guest — not on my host machine. 21 deterministic checks. Zero edits.

In the meantime, the project is fully runnable today:

Repo: github.com/divyaprakash0426/clawForge
Release: v1.0.0
Topics: openclaw · claw-ecosystem · ai-tools · skill-hub · security · shell · open-source

Try it in 60 seconds

# Clone
git clone https://github.com/divyaprakash0426/clawForge.git
cd clawForge

# Install the full skill set against your detected variant
./install.sh --variant openclaw     # or zeroclaw, picoclaw, nullclaw, nanobot, ironclaw

# Or pull a single skill into an existing claw install
./install-skill.sh skill-sentinel --variant zeroclaw

# Scaffold your own compliant skill in under 30 seconds
./tools/forge-skill.sh my-skill --tier L1

# Validate the entire repo (what CI runs on every PR)
python3 tools/validate_skills.py
python3 tools/validate_souls.py

What I Learned

1. AgentSkills are portable. Skill execution isn't.

The single biggest mismatch between OpenClaw's marketing and reality: every variant agrees on the skill interface, but the skills themselves silently assume Node, or bash, or curl, or 200MB of RAM. The variants: full | partial | unsupported field only exists because I tried to claim "universal" once and got humiliated by NullClaw.

If you're building anything for a multi-variant ecosystem: front-load the compatibility table in machine-readable frontmatter, then make your installer refuse to lie.

2. Static tiers beat dynamic risk scores every time.

I genuinely tried the dynamic-scoring approach first — heuristics over the entrypoint, weighted permission analysis, "danger floats." It was a beautiful dashboard producing useless numbers. Authors gamed it within an afternoon.

Four flat tiers (L0 read-only metadata, L1 read-only fs, L2 scoped write + network, L3 privileged) are coarse enough that arguing about them is pointless and fine-grained enough that the scanner can mechanically reject lies. Boring, declarative, correct.

3. The harness finds the bugs that demos hide.

Three real bugs the local lane caught that I would 100% have shipped otherwise:

bedrock-rag mock path bug — relative-path assumption that worked from the repo root, broke instantly when the installer staged it under ~/.openclaw/skills/.
arxiv-scout non-determinism — model-eval cache used a hash of the wall-clock minute as a salt. Passed once locally, failed every CI re-run inside the QEMU guest.
Alpine busybox absolute-symlink trap on armv7 — busybox's ln -s resolves absolute symlinks against the host root, not the rootfs. The arm64 Ubuntu lane was happy. The armv7 Alpine lane was a pile of dangling links until I switched to relative symlinks everywhere.

None of these would have been caught by "looks fine on my machine." Build the harness early. Even a usermode-smoke lane is worth ten "trust me" tweets.

4. "30 engineered" beats "300 curated" for a security story.

The first plan was bigger — pull in the most-starred ClawHub skills, audit them, re-ship. I killed that within a week. Auditing someone else's skill is roughly the same effort as writing one from scratch, and you inherit their bus factor and their git history. 30 skills I wrote, against one schema, with one threat model, is a story I can defend. 300 curated ones is a story I'd have to keep apologizing for.

5. The scaffolder is the standard.

A schema is just a wishlist until the easiest path to a new skill is also the only compliant path. forge-skill.sh exists because every skill I added to the repo by hand had at least one frontmatter typo. Now the scaffolder produces compliant skills, and the validators reject anything that drifts. The standard enforces itself.

ClawCon Michigan

I didn't make it to ClawCon Michigan in person this round — wrong continent, wrong calendar week — but the energy coming out of it (the talks, the trip reports, the ClawHavoc post-mortems, the variant maintainers comparing notes in public) is most of what made me stop hand-wringing and actually ship ClawForge. If you were there: the unsigned-skill-distribution debate from day two is the reason skill-sentinel is opinionated instead of advisory. Thank you. I owe you all a drink at the next one.

Built with stubbornness, a lot of `qemu-system-`, and a deep suspicion of any skill hub that has more skills than it has reviewers.*

Repo: github.com/divyaprakash0426/clawForge · Release: v1.0.0

Comments, PRs, and especially "your tier system is wrong because…" arguments are very welcome.

Why the Claw ecosystem needs a skill commons — and how I built one

divyaprakash D — Sun, 26 Apr 2026 12:08:55 +0000

This is a submission for the OpenClaw Writing Challenge

This essay accompanies the ClawForge repository.

The supply chain problem nobody wants to talk about

ClawHub crossed 13,729 community-published skills in April 2026. That number sounds impressive until you learn that security researchers estimate roughly 20% of those skills contain patterns consistent with prompt injection, credential exfiltration, or unsafe shell execution. The ClawHavoc campaign made this concrete: a coordinated actor published ~300 skills across multiple Claw variants, each embedding a silent exfiltration hook behind an otherwise legitimate-looking system automation. Skills were installed by thousands of developers before the campaign was identified.

The root problem is structural. ClawHub operates like npm circa 2014: publish-and-forget, no manifest enforcement, no per-variant compatibility declaration, no permission tier labeling. When you install a skill, you are trusting a markdown file you've never read, written by someone you've never met, to run arbitrary shell commands with your agent's credentials.

ClawForge is my answer to that problem. It is not another skill list. It is an engineered commons: every skill in the catalog was built from scratch with an explicit trust model baked into the schema, the tooling, and the CI pipeline.

The variant problem is bigger than it looks

Ask most OpenClaw developers which variants they target and you'll get a blank stare. But the variant landscape is genuinely fragmented:

Variant	Runtime	Idle RAM	Key constraint
OpenClaw	TypeScript / Node.js	~1.5 GB	Full feature set, desktop only
ZeroClaw	Rust	~7.8 MB	Production VPS, low resource
PicoClaw	Go	< 10 MB	Raspberry Pi, $10 boards
NullClaw	Zig	~1 MB	RISC-V, ultra-embedded
NanoBot	Python	~100 MB	AI research, ML workflows
IronClaw	TypeScript + WASM	Standard	High-security enterprise sandbox

A skill that calls curl | bash works fine on OpenClaw. It will silently fail or be blocked entirely on NullClaw. A skill that requires aws-cli in PATH is useless on a PicoClaw board. A skill that writes to ~/.config may violate the IronClaw sandbox policy.

No existing skill collection tracks this. ClawForge's SKILL.md frontmatter requires every skill to declare a per-variant compatibility value — full, partial, or unsupported — and document the reason for any non-full status in a COMPAT.md file. This makes the compatibility matrix machine-readable: the installer filters skills automatically for the detected variant.

Show, Don't Just Tell

Here's an example of what that engineered frontmatter actually looks like in practice. Notice the explicit compatibility matrix and the security tier—this is what the system parses before anything runs:

---
name: arch-sentry
description: Arch Linux system health monitor.
version: 1.0.0
metadata:
  openclaw:
  compat:
    openclaw: full
    zeroclaw: full
    picoclaw: partial     # no scheduled heartbeat support
    nullclaw: unsupported # requires sudo bash
  security_tier: L1       # L1 (read-only audit)
---

Because of this machine-readable schema, the Developer Experience (DX) becomes dead simple and inherently secure:

# 1. Provide a one-command install that detects your variant
./install.sh --variant zeroclaw

# 2. Block bad skills automatically in CI or locally
./skill-sentinel scan ./skills/sketchy-aws-tool
🚨 [FAILED] Risk Score: 9/10
- Credential exfiltration pattern detected: 'curl -d @$AWS_ACCESS_KEY_ID'
- Install blocked.

What ClawHavoc actually taught us

The ClawHavoc campaign followed a pattern security researchers call a "trojan commons" attack. The attacker published genuinely useful skills — a Docker cleanup tool, an AWS cost monitor, a git log formatter — each with a payload buried in a post-install hook or an obfuscated eval block inside a run.sh. Because the skills were useful, they accumulated real installs. Because ClawHub had no automated scanning, the payload ran undetected for weeks on developer machines with active AWS, GitHub, and Telegram credentials in the environment.

Three lessons:

Usefulness and safety are orthogonal. A trojan skill can be more useful than a clean one. Utility alone is not a trust signal.
Schema enforcement is a prerequisite for trust. If the format is loose enough to hide a payload, it will be used to hide a payload.
Defense has to be part of the install path. A scanner that runs after installation catches nothing.

ClawForge responds to all three. Every skill in the catalog has been authored with explicit permission tiers. skill-sentinel runs as a GitHub Actions workflow on every pull request and is available as a local pre-install scanner. prompt-fence extends this to instruction and soul files. secret-guard covers committed secrets in skill configs. The security primitives are not opt-in — they're part of the repository CI.

The China signal: what the underground market proved

In early 2026, reports emerged of Chinese developers selling "AI skill packs" for OpenClaw and QClaw on secondary markets, with bundles trading for tens of thousands of RMB. Tencent responded by launching QClaw with 5,000 prebuilt skills and a three-minute deployment story. The pattern was striking: there was clearly enormous appetite for batteries-included, domain-specific skill bundles that a non-technical user could drop in and use immediately.

No English-language equivalent existed. The closest thing was Andrej Karpathy's personal skill repo — credible and well-reasoned, but intentionally narrow. everything-claude-code showed that a complete, production-tested system beats a list every time (46K+ stars, Anthropic hackathon winner). But neither project addressed the variant landscape or the supply chain problem.

ClawForge is the legitimate, open-source, English-language answer to the demand signal the underground market proved. Thirty skills across seven domains, installable in one command, with explicit compatibility metadata and a security scanner in the CI pipeline.

The AgentSkills portability problem

The emerging AgentSkills spec promises cross-agent skill portability: a skill authored once should be usable by OpenClaw, NullClaw, and a hypothetical future variant that doesn't exist yet. The promise is real. The current implementation has gaps.

The biggest gap is capability negotiation. The spec defines a skill's interface but not its execution requirements. A skill that requires a 1.5 GB Node.js runtime to evaluate can declare itself AgentSkills-compatible while being completely unusable on a NullClaw device. Portability at the interface level without portability at the execution level creates a false sense of compatibility.

ClawForge's COMPAT.md pattern is a practical interim solution. It doesn't replace the spec — it augments it with the ground truth that the spec doesn't yet capture: which variants actually run this skill, what breaks on partial variants, and what the operator needs to know before installing on a constrained device.

Design decisions and trade-offs

Why 30 skills instead of 300? Engineering quality is the product, not catalog size. ClawHub already has 13,729 entries. Adding to the noise isn't useful. The 30 flagship skills were designed and built from scratch — each one with a concrete runbook, domain-specific helper scripts, a per-variant compatibility declaration, and a security tier. The deliberate scope means every skill can be fully owned and maintained rather than scaffolded and abandoned.

Why a static security tier instead of a dynamic risk score? Dynamic scoring invites gaming. A skill author can rewrite a risky shell call to avoid pattern detection. A static permission tier declared by the author and reviewed by a maintainer is harder to fake — it creates a paper trail and puts accountability on the contributor.

Why souls/ in the same repo? Persona packs are not separate from skills — they shape how skills behave. A security operator running skill-sentinel should get terse, actionable output. A research analyst running arxiv-scout should get structured summaries with citation context. Shipping persona packs alongside the skill catalog makes the behavioral contract explicit.

Why not build on top of ClawHub's existing infrastructure? Because the problem is the infrastructure. ClawHub's permissive publish model is the attack surface. ClawForge is designed to be a curated alternative that coexists with ClawHub, not a replacement for it. The install tooling is intentionally local-first and format-verified.

What comes next

ClawForge is a foundation, not a finished product. The work that matters next:

Per-variant install test matrix: automated CI that actually runs a representative skill on each variant in a sandboxed container and reports pass/fail against the declared compatibility.
Soul-to-skill binding: a mechanism to declare which skills a persona pack activates by default, reducing per-session configuration friction.
Community skill review queue: a structured contribution path with security checklist, reviewer assignment, and automated skill-sentinel gate — analogous to a software package maintainer model.
Locale packs: the demand signal from the China market is real. Non-English skill bodies and locale-adapted persona packs would unlock a substantial user base.

The commons model only works if it stays engineered. That means saying no to volume for the sake of volume, enforcing the schema, and treating every new skill as a potential attack surface until proven otherwise.

That's the bet ClawForge makes.

Over to you

I've open-sourced the entire 30-skill catalog and the scaffolding CLI. If you're building for OpenClaw (or ZeroClaw, or NanoBot), try pulling down the repo. Run ./tools/forge-skill.sh to generate your first secure skill in 30 seconds.

I’d love to know: what’s the biggest variant-compatibility headache you’ve run into? Drop a comment below, and let’s fix the skill commons together.

ClawForge is open source. Contributions follow the schema in CONTRIBUTING.md and the skill format in docs/skill-format.md.

ClawCon Michigan

While I couldn't attend ClawCon Michigan in person, following the community's energy and the announcements remotely is what inspired me to dive deep into the cross-variant compatibility problem to build ClawForge.

Why the Claw ecosystem needs a skill commons — and how I built one

divyaprakash D — Sun, 26 Apr 2026 12:08:55 +0000

This is a submission for the OpenClaw Writing Challenge

This essay accompanies the ClawForge repository.

The supply chain problem nobody wants to talk about

The variant problem is bigger than it looks

Ask most OpenClaw developers which variants they target and you'll get a blank stare. But the variant landscape is genuinely fragmented:

Variant	Runtime	Idle RAM	Key constraint
OpenClaw	TypeScript / Node.js	~1.5 GB	Full feature set, desktop only
ZeroClaw	Rust	~7.8 MB	Production VPS, low resource
PicoClaw	Go	< 10 MB	Raspberry Pi, $10 boards
NullClaw	Zig	~1 MB	RISC-V, ultra-embedded
NanoBot	Python	~100 MB	AI research, ML workflows
IronClaw	TypeScript + WASM	Standard	High-security enterprise sandbox

Show, Don't Just Tell

---
name: arch-sentry
description: Arch Linux system health monitor.
version: 1.0.0
metadata:
  openclaw:
  compat:
    openclaw: full
    zeroclaw: full
    picoclaw: partial     # no scheduled heartbeat support
    nullclaw: unsupported # requires sudo bash
  security_tier: L1       # L1 (read-only audit)
---

Because of this machine-readable schema, the Developer Experience (DX) becomes dead simple and inherently secure:

# 1. Provide a one-command install that detects your variant
./install.sh --variant zeroclaw

# 2. Block bad skills automatically in CI or locally
./skill-sentinel scan ./skills/sketchy-aws-tool
🚨 [FAILED] Risk Score: 9/10
- Credential exfiltration pattern detected: 'curl -d @$AWS_ACCESS_KEY_ID'
- Install blocked.

What ClawHavoc actually taught us

Three lessons:

Usefulness and safety are orthogonal. A trojan skill can be more useful than a clean one. Utility alone is not a trust signal.
Schema enforcement is a prerequisite for trust. If the format is loose enough to hide a payload, it will be used to hide a payload.
Defense has to be part of the install path. A scanner that runs after installation catches nothing.

The China signal: what the underground market proved

The AgentSkills portability problem

Design decisions and trade-offs

What comes next

ClawForge is a foundation, not a finished product. The work that matters next:

Per-variant install test matrix: automated CI that actually runs a representative skill on each variant in a sandboxed container and reports pass/fail against the declared compatibility.
Soul-to-skill binding: a mechanism to declare which skills a persona pack activates by default, reducing per-session configuration friction.
Community skill review queue: a structured contribution path with security checklist, reviewer assignment, and automated skill-sentinel gate — analogous to a software package maintainer model.
Locale packs: the demand signal from the China market is real. Non-English skill bodies and locale-adapted persona packs would unlock a substantial user base.

That's the bet ClawForge makes.

Over to you

I’d love to know: what’s the biggest variant-compatibility headache you’ve run into? Drop a comment below, and let’s fix the skill commons together.

ClawForge is open source. Contributions follow the schema in CONTRIBUTING.md and the skill format in docs/skill-format.md.

ClawCon Michigan

[Boost]

divyaprakash D — Sat, 14 Feb 2026 05:18:06 +0000

Stop Editing. Start Playing. Meet AutoShorts: The AI Gaming Editor 🎮

divyaprakash D ・ Feb 13

#devchallenge #githubchallenge #cli #githubcopilot

Stop Editing. Start Playing. Meet AutoShorts: The AI Gaming Editor 🎮

divyaprakash D — Fri, 13 Feb 2026 09:25:52 +0000

This is a submission for the GitHub Copilot CLI Challenge

What I Built

AutoShorts is an AI-powered pipeline that automatically transforms long-form gameplay footage into viral-ready vertical clips. It uses Vision AI to semantically understand content—distinguishing between "action," "clutch plays," and "WTF moments"—then adds AI-generated captions and AI voiceovers with matching energy and personality.

The result? Hours of gameplay → polished TikTok/Shorts/Reels-ready clips, with minimal human intervention.

Demo

View Project on GitHub: Link

Demo Video:

🎥 Showcase: Multi-Language & Style Generation

AutoShorts automatically adapts its editing style, captions, and voiceover personality based on the content and target language. Here are some examples generated entirely by the pipeline:

Content	Style	Language	Video
Fortnite	Story Roast	🇺🇸 English	Watch Part 1
Indiana Jones	GenZ Slang	🇺🇸 English	Watch Part 1
Battlefield 6	Dramatic Story	🇯🇵 Japanese	Watch Part 1
Indiana Jones	Story News	🇨🇳 Chinese	Watch Part 1
Fortnite	Story Roast	🇪🇸 Spanish	Watch Part 1
Fortnite	Story Roast	🇷🇺 Russian	Watch Part 1
Indiana Jones	Auto Gameplay	🇧🇷 Portuguese	Watch Part 1

📸 Dashboard Interface

1. Generate Page
The command center for creating new content. Simply drop a video or select an existing one, choose your analysis mode (Local vs. Cloud), and hit "Find Clips."

2. Settings & Cost Control
Full control over which AI models are used and strictly managed API costs. You can toggle between OpenAI, Gemini, or efficient Local Heuristics.

Why I Built This

I had a problem that every content creator knows: hours of gameplay footage, but no time to edit.

Recording gameplay is the easy part. The hard part is scrubbing through 2-hour VODs looking for that one clutch moment, that hilarious fail, or that "wait, what just happened?" clip. Then you need to:

Find the moment
Crop to vertical (9:16)
Add captions that match the vibe
Maybe add commentary or voiceover
Export and repeat... dozens of times

I was spending 3-4 hours editing for every hour of footage. That's backwards.

I wanted a system where I could:

Drop a raw gameplay file
Walk away
Come back to ready-to-upload clips with captions and voiceovers

AutoShorts is that system.

How I Built It (Technical Deep-Dive)

Building AutoShorts was a rollercoaster of "this is genius" moments immediately followed by "why is everything on fire." Here's the real story — the problems nobody warns you about, and the solutions that made it all work.

The Architecture Challenge

When the feature set started growing — Vision AI analysis, TTS voice synthesis, story narration, cross-clip narrative arcs — it became clear that a single orchestration file wasn't going to cut it. Every new feature touched everything else, and debugging felt like untangling christmas lights.

The fix was Domain-Driven Design — splitting the logic into focused modules, each owning its piece of the pipeline:

src/
├── shorts.py              # Orchestration & rendering
├── ai_providers.py        # Gemini/OpenAI abstraction
├── tts_generator.py       # Qwen3-TTS voice synthesis
├── subtitle_generator.py  # Caption generation & timing
└── story_narrator.py      # Cross-clip narrative generation

This separation seemed like overkill at first. Then I discovered I needed to load and unload AI models from GPU memory between pipeline stages — TTS has to yield VRAM for rendering, which has to yield for AI analysis — and suddenly having clean boundaries between modules was the only thing keeping me sane.

The VRAM Juggling Act

Here's the thing about running AI models on consumer GPUs: they don't share nicely.

Qwen3-TTS (voice synthesis) needs ~4GB VRAM. Video rendering with PyTorch needs ~2GB. These models don't politely step aside for each other — they sit in VRAM until you physically evict them.

The solution was aggressive model lifecycle management — singleton patterns with explicit cleanup:

# After TTS generation completes
QwenTTS.clear_instance()
torch.cuda.empty_cache()
gc.collect()
logging.info("TTS model unloaded — VRAM freed for rendering")

Without this, the pipeline would OOM (out-of-memory crash) after processing 2-3 clips. Fun times at 2 AM when you're wondering why clip #3 always segfaults.

The Qwen3-VL Dead End: When "Local" Goes Too Far

I desperately wanted the entire video analysis to happen locally. I actually got Qwen3-VL (video-language model) integrated and working, but it was a textbook case of "just because you can, doesn't mean you should."

Qwen3-VL is a monster. It’s not just big; it's VRAM-hungry beyond reason. My 12GB RTX 4080 laptop didn't stand a chance, and even on high-end 24GB cards, it would regularly hit the OOM wall during long video sequences.

I attempted a last-ditch effort using Qwen3-VL-4B-Instruct-FP8, but even with quantization, the stability wasn't there—it still occasionally nuked the pipeline. Worse, the analysis quality didn't justify the struggle; the results were underwhelming compared to the resource cost. It felt like I was trying to race a semi-truck on a go-kart track.

The pivot: This failure is actually what led to the Deep Analysis Proxy system. I realized that instead of fighting 30GB models locally, I could spend those dev cycles on intelligent preprocessing (the 15MB proxy) and let a cloud model do the heavy lifting for pennies. The result was a pipeline that's actually accessible to people with consumer GPUs, rather than just data center owners.

The TTS Timing Nightmare

This was the most infuriating bug I encountered, and it took three separate debugging sessions to crack.

The problem: Subtitles and voiceover were drifting out of sync in story mode. By the end of a 60-second clip, subtitles were 3-4 seconds ahead of the voice. Not great when you're going for "professional esports broadcast" and getting "badly dubbed foreign film."

The investigation:

Story mode generates a continuous narration (like a broadcaster). The TTS engine reads all sentences as one flowing piece. But subtitles were timed by probing each sentence individually:

Subtitle timing (probed separately):
  "The player approaches" → 2.3s
  "An incredible shot"    → 1.8s
  Total: 4.1s

TTS (generated as merged text):
  "The player approaches an incredible shot" → 3.6s

See the problem? When you join sentences, the TTS naturally flows faster — no pause between them. That 0.5s error accumulated across every sentence.

The fix: Probe the merged narration once, then distribute timing proportionally:

# ❌ Wrong: probe each sentence separately
for sentence in sentences:
    duration = probe_tts(sentence)  # Accumulated error!

# ✅ Right: probe merged text, distribute proportionally
full_narration = " ".join(sentences)
total_duration = probe_tts(full_narration)
for sentence in sentences:
    sentence_duration = total_duration * (len(sentence) / total_chars)

One of those fixes where you stare at the solution and think "why didn't I see this three days ago?"

The "TTS Longer Than Video" Problem

Sometimes the AI writes an essay when you asked for a tweet. A 45-second gameplay clip ends up with 52 seconds of narration. Now what?

Three options on the table:

Option A: Truncate the voiceover → Loses content, sounds cut off
Option B: Speed up the voice → Sounds like a chipmunk reading the news
Option C: Extend the video to match → 🤔

Option C won, but with nuance:

if tts_duration > clip_duration + 1.5:
    # Big gap: go back to source video, extract more footage
    rerender_clip_for_tts(clip, render_meta, tts_duration + 1.0)
else:
    # Small gap: freeze last frame using FFmpeg tpad
    ffmpeg_filter = f"tpad=stop_mode=clone:stop_duration={gap}s"

The re-render logic reaches back into the original source video and extracts more footage — even beyond the original scene boundaries. This required tracking render metadata (start time, source file, scene duration) through the entire pipeline. Worth it though. No more cut-off narration.

FlashAttention: When Your RAM Isn't Enough

Qwen3-TTS performs best with FlashAttention 2 — a CUDA kernel that speeds up attention computation by 3-4x. One problem: building it from source requires compiling CUDA code, which needs 125GB+ RAM during compilation. On machines with less than 32GB RAM, the build takes 24 hours or more — if the OOM killer doesn't murder it first.

My machine has 16GB. Killed — my favorite one-word error message.

The solution? Prebuilt wheels. Someone lovely had already compiled FlashAttention for various PyTorch + CUDA combinations:

install_flash_attn:
    @PYVER=$$(python -c "import sys; print(f'cp{sys.version_info.major}{sys.version_info.minor}')"); \
    pip install https://github.com/.../flash_attn-2.6.3+cu128torch2.10-$$PYVER-linux_x86_64.whl

One line. No compilation. No 125GB RAM requirement. Installation went from "impossible on my hardware" to "done in 30 seconds."

Deep Analysis: Letting AI See the Full Picture

Here's an insight that changed everything: short clips lack context.

In the default mode, each candidate clip is analyzed independently — the AI sees 2 minutes of footage and scores it. But it doesn't know what happened before or after. A celebration makes no sense without the clutch play that preceded it.

Deep Analysis mode fixes this by letting Gemini see the entire video — but we're not about to upload a multi-GB 4K recording raw. That would take forever and burn through API quotas.

Instead, we generate a lightweight proxy first using GPU-accelerated FFmpeg:

# GPU-accelerated proxy: 4K@60fps → 640p@1fps, high compression
gpu_cmd = [
    "ffmpeg", "-y",
    "-hwaccel", "cuda",
    "-hwaccel_output_format", "cuda",
    "-i", str(video_path),
    "-vf", "scale_cuda=640:-2,fps=1",   # 640px wide, 1 frame per second
    "-c:v", "hevc_nvenc",
    "-qp", "35",                         # Aggressive compression
    "-c:a", "aac", "-b:a", "32k", "-ac", "1",  # Mono 32kbps audio
    str(temp_proxy)
]

A 2-hour 4K gameplay recording (~30GB) becomes a ~15MB proxy. Same content, same timeline, same audio cues — just tiny enough to upload in seconds. The proxy is also cached by file hash, so re-runs skip the generation step entirely.

The AI can now identify narrative arcs — the setup, the payoff, the aftermath. It finds moments that a clip-by-clip analysis would miss entirely. The quality jump is dramatic, and all it costs is a ~15MB upload instead of 30GB.

Voice Design: From Text to Personality

The most "wow" feature. Instead of picking from generic preset voices, you describe the voice you want in natural language:

VOICE_PRESET_MAP = {
    "story_news": """
        gender: Male.
        pitch: Dynamic, high-energy with excitement.
        speed: Brisk, fast-paced, maintaining high momentum.
        emotion: Hype, adrenaline, "unbelievable play" excitement.
        personality: Charismatic, knowledgeable, maximum energy.
    """,

    "story_dramatic": """
        gender: Female.
        pitch: Rich, resonant mid-range with expressive depth.
        speed: Measured, deliberate pacing with dramatic pauses for impact.
        emotion: Intense, evocative, drawing listeners into the story.
        personality: Wise, commanding, magnetic storyteller presence.
    """
}

Qwen3-TTS reads this description and synthesizes a matching voice. The same caption sounds completely different between "esports broadcaster" and "creepypasta narrator" — and it all happens locally. No cloud TTS API, no per-word billing.

Slang Preprocessing: Making TTS Sound Natural

TTS engines and internet slang do not get along. "rn" becomes "urn." "lol" becomes "loll." "fr fr" sounds like a French car brand.

The fix is a preprocessing layer that expands slang before TTS sees it:

def preprocess_tts_text(text):
    t = text
    t = re.sub(r'\brn\b', 'right now', t, flags=re.IGNORECASE)
    t = re.sub(r'\blol\b', 'L O L', t, flags=re.IGNORECASE)
    t = re.sub(r'\bidk\b', "I don't know", t, flags=re.IGNORECASE)

    # Qwen3-TTS doesn't pause at dashes, so swap them for ellipses
    t = t.replace(" -- ", "... ")
    t = t.replace(" - ", "... ")
    return t

Small detail, huge impact. GenZ-style captions like "bro that was lowkey insane rn fr fr" actually sound right when spoken aloud.

CJK Subtitle Handling: When Words Don't Have Spaces

English subtitles are easy — split on spaces, chunk into 7-word captions, done. But Japanese, Chinese, and Korean (JCK languages) don't use spaces between words. A sentence is one continuous stream of characters.

This completely broke the subtitle chunking logic. A 40-character Japanese sentence would appear as one massive wall of text filling the entire screen.

The fix was character-based splitting with language detection:

# Detect CJK characters in the sentence
is_cjk = any("\u4e00" <= char <= "\u9fff" or  # Chinese
              "\u3040" <= char <= "\u30ff"       # Japanese
              for char in sentence)

MAX_CJK_CHARS = 18  # Characters per line for CJK

if is_cjk:
    # Character-based splitting instead of word-based
    chunks = [sentence[i:i+MAX_CJK_CHARS]
              for i in range(0, len(sentence), MAX_CJK_CHARS)]
    # Distribute TTS duration proportionally by character count
    for chunk in chunks:
        chunk_ratio = len(chunk) / len(sentence)
        chunk_duration = tts_duration * chunk_ratio

The sentence splitter also handles CJK punctuation (。！？) which doesn't follow the English pattern of period-then-whitespace. These characters terminate sentences directly, no space required.

One of those "obvious in hindsight" fixes that makes multi-language support actually work instead of just being a checkbox on a feature list.

My Experience with GitHub Copilot CLI

Everything above? That's the engineering. But I'd be lying if I said I did it alone. GitHub Copilot CLI was my pair programmer through most of this — and here's how it actually helped.

Copilot CLI wasn't just autocomplete — it was a debugging partner, architecture consultant, and documentation writer rolled into one.

What Worked Exceptionally Well

1. Plan Mode for Complex Changes

Using [[PLAN]] prefix before major refactors gave me a structured approach:

[[PLAN]] Migrate from ChatterBox TTS to Qwen3-TTS VoiceDesign

Copilot generated a 6-phase plan covering dependency changes, API migration, FlashAttention setup, testing checkpoints, and rollback strategies. I could review and edit the plan before implementation started.

2. Debugging Across Sessions

The checkpoint system was crucial. When investigating the subtitle timing bug, I could reference earlier sessions:

"Check checkpoint 012-tts-subtitle-sync for what we tried before"

Copilot would review the history and avoid repeating failed approaches.

3. Parallel Exploration

When I wasn't sure which approach to take, I'd ask Copilot to spin up explore agents to investigate multiple paths simultaneously:

task agent_type: explore
prompt: "How does generate_for_captions() handle timing in story mode vs normal mode?"

This let me understand the codebase faster than reading linearly.

4. Test Generation

After making changes, Copilot helped write comprehensive tests:

def test_preprocess_tts_text_em_dash():
    result = preprocess_tts_text("wait — what")
    assert "..." in result
    assert "—" not in result

50 tests covering subtitle formatting, TTS preprocessing, voice description generation, and scene combination logic — all generated from understanding the code context.

What I Learned

Be specific about constraints. "Fix the OOM error" is less useful than "We have 10GB VRAM, model A needs 8GB, model B needs 4GB, how do we sequence them?"
Use checkpoints liberally. Complex debugging spans sessions. Good checkpoints save hours.
Let Copilot see the errors. Pasting full stack traces and logs gives it the context to diagnose accurately.
Trust but verify. Copilot's suggestions are usually good, but always run the tests.

The Pipeline Today

Here's what happens when you drop a gameplay video into AutoShorts:

Scene Detection — GPU-accelerated analysis finds candidate moments using audio spikes + motion detection
AI Ranking — Vision AI (Gemini/OpenAI) watches each clip and scores it across 7 semantic categories
Deep Analysis (optional) — GPU-downscaled proxy uploaded to Gemini for context-aware moment detection
Smart Selection — Diverse category selection ensures variety (not just all "action" clips)
GPU Rendering — NVENC hardware encoding creates vertical crops with blurred backgrounds
Caption Generation — AI writes contextual captions matching the clip's energy
Voice Synthesis — Qwen3-TTS creates matching voiceovers with style-appropriate personalities
Timing Sync — Subtitle timing synchronized with actual TTS audio duration
Smart Mixing — Game audio ducked during voiceover, video extended if TTS runs long

Total processing time: ~5-7 minutes per clip on an RTX 3080.

Analysis Modes & Cost

AutoShorts supports four analysis modes, each with different tradeoffs between cost, accuracy, and speed. You choose the mode via environment variables — no code changes needed.

How Each Mode Works

🔧 Local Heuristics Only (AI_PROVIDER=local)

Zero API calls. Scenes are scored purely on GPU-computed signals:

Audio RMS — Loudness spikes (explosions, crowd reactions, voice peaks).
Spectral Flux — Sudden frequency changes (gunshots, impacts, glass breaking).
Visual Motion — Pixel-diff action scoring via GPU-accelerated grayscale diffing.

All three signals are computed in a single pass using PyTorch on GPU. Scenes are ranked by a combined 0.6 × Audio (RMS + Flux) + 0.4 × Visual Motion score. Fast, free, and surprisingly effective for high-action content — but blind to context (it can't tell a celebration from a firefight).

🖼️ OpenAI Vision (AI_PROVIDER=openai)

Heuristics first narrow the field using Smart Selection (70% top scores + 30% random exploration), then candidates are sent to OpenAI. OpenAI's API doesn't accept video, so we extract 8 keyframe JPEGs per clip:

# Extract 8 static frames as base64 JPEGs
cmd = ["ffmpeg", "-i", clip_path, "-vf", "fps=1", "-frames:v", "8", ...]
# Send as image_url content to GPT-4o
content.append({"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{frame}"}})

The AI scores each clip across 7 semantic categories (action, funny, clutch, wtf, epic_fail, hype, skill). Good accuracy from static frames alone, but it can't hear audio and misses motion-dependent moments like glitches or physics bugs.

🎬 Gemini Per-Clip (AI_PROVIDER=gemini)

Uses the same Smart Selection (mixing high-heuristic clips with random segments for diversity), but uploads each candidate as actual video (downscaled to 640px wide). Gemini sees motion, timing, and audio:

# Each candidate clip: 640p downscaled, ~30-60s, uploaded as MP4
video_file = client.files.upload(file=clip_data, config={"mime_type": "video/mp4"})
response = client.models.generate_content(model="gemini-3-flash", contents=[video_file, prompt])

Significantly better at detecting funny, wtf, and clutch moments that depend on temporal context. Clips are analyzed in parallel (3 concurrent threads) to keep latency manageable.

🧠 Gemini Deep Analysis (GEMINI_DEEP_ANALYSIS=true)

The nuclear option. Instead of pre-filtering with heuristics then analyzing clips, Deep Analysis lets Gemini see the entire video — but not the raw multi-GB 4K file. A GPU-accelerated proxy is generated first:

4K @ 60fps → 640p @ 1fps, QP 35, mono 32kbps audio
~30GB gameplay recording -> ~15MB proxy

Gemini watches the whole thing and returns timestamped moments with categories and scores. No heuristic bias, no missed context. The AI finds narrative arcs — the buildup before a clutch play, the reaction after an epic fail — that clip-by-clip analysis simply can't detect.

Deep Analysis moments are scored with a +200 bias to ensure they rank above any heuristic candidate. A few high-action heuristic backups are still included as safety net clips.

Comparison Summary (1-hour 4K gameplay)

Mode	Accuracy	Analysis Cost	Creative Cost*	Total Cost	Data Uploaded
Local Heuristics	⭐⭐	Free	Free (Whisper)	Free	0 bytes
OpenAI Vision	⭐⭐⭐	~\$0.15	~\$0.15	~\$0.30	~6MB
Gemini Per-Clip	⭐⭐⭐⭐	~\$0.08	~\$0.08	~\$0.16	~90MB
Gemini Deep Analysis	⭐⭐⭐⭐⭐	~\$0.05	~\$0.08	~\$0.13	~60MB

*Creative Cost: Includes AI caption generation (LLM API call) + Voiceover synthesized locally (Free).

The counterintuitive result: Deep Analysis is the most cost-effective mode because it replaces 15 individual analysis uploads with one optimized proxy upload, while still delivering superior context-aware detection.

Roadmap & Vision

AutoShorts works today as a local pipeline for content creators. But the underlying engine — scene detection, AI ranking, voice synthesis, smart cropping — is a general-purpose highlight extraction backend. Here's where this is heading:

🔮 What's Next

Phase	Feature	Status
v2.1	Universal Video Type Support (Podcasts, Sports, Entertainment, etc.)	🔜 Planned
v2.2	SFX generation — AI-generated sound effects matched to on-screen action	🔜 Planned
v2.3	Cloud API mode (submit video URL → get clips back)	📐 Designing
v3.0	Live stream monitoring (detect highlights in real-time)	🔬 Research
v3.x	Multi-platform auto-upload (TikTok, YouTube Shorts, Reels)	📋 Backlog

🎮 Platform Integration Potential

The most exciting future isn't AutoShorts as a standalone tool — it's AutoShorts as a backend engine embedded in platforms millions of gamers already use:

Microsoft Xbox Game Bar — The overlay already captures screenshots and gameplay recordings (Win+G). Imagine a "Generate Highlights" button that takes your captured footage and produces ready-to-share clips with captions and voiceover — without ever leaving the overlay.
NVIDIA ShadowPlay — ShadowPlay's Instant Replay already silently records the last 30 seconds to 20 minutes of gameplay. Pair that buffer with AutoShorts' AI ranking, and ShadowPlay could automatically identify and export your best moments with professional-grade overlays and narration. No scrubbing through footage. No editing. Just play.
Discord Activity Integration — Post-session highlight reels generated from screen shares, dropped directly into your server channel.

The core thesis: highlight detection + voice synthesis + smart cropping is infrastructure, not an app. Every platform that captures gameplay footage could use this engine to turn passive recording into active content creation.

The best highlight reel is the one you never had to make.

Acknowledgements

This project builds upon:

artryazanov/shorts-maker-gpu — GPU-accelerated clip extraction using heuristic scoring (audio dB + motion detection).
Binary-Bytes/Auto-YouTube-Shorts-Maker — Original concept and inspiration for the automated short-form content pipeline.
Qwen3-TTS — Voice synthesis with natural language design
PyCaps — Animated subtitle rendering

Key Improvements Over Base Project

Feature	Base Project	AutoShorts
Architecture	Monolithic script	Modular package with lifecycle management
Scene Scoring	Audio dB + motion only	Hybrid: heuristics + Vision AI semantic analysis
Deep Analysis	N/A	Full-video Gemini analysis for context-aware detection
Voiceover	None	Qwen3-TTS with style-adaptive voice design
Captions	None	AI-generated, 10+ styles including story modes
CJK Support	N/A	Character-based subtitle chunking for JCK languages
Memory	Single model	VRAM-aware model sequencing (unload between phases)
TTS Sync	N/A	Per-sentence TTS generation for accurate timing
Overflow Handling	N/A	Re-render clips when TTS > video length

Try It Yourself


# Clone the repository
git clone https://github.com/divyaprakash0426/autoshorts.git
cd autoshorts

# Setup environment variables
cp .env.example .env
# Edit .env and add your API keys (Gemini/OpenAI) 

# Option 1: Using Makefile (Recommended)

make install

# Option 2: Using Shell Script
./install.sh

# Drop videos in gameplay/, then run:
./.venv/bin/python run.py

Or launch the dashboard:

./.venv/bin/streamlit run src/dashboard/About.py

🛡️ Battle Tested On

Asus Zephyrus G16 (RTX 4080 Mobile, Intel Ultra 9) running Arch Linux.

Built with frustration, caffeine, and GitHub Copilot CLI.

Building AutoShorts: A High-Performance AI Pipeline for Automated Viral Content 🎬🤖

divyaprakash D — Sat, 24 Jan 2026 14:59:45 +0000

The Problem: Content Creation is a Bottleneck

Every creator knows the "highlight reel" struggle. You have hours of high-quality gameplay footage, but finding that perfect 30-second clip, cropping it, adding subtitles, and layering a voiceover takes hours of manual labor.
I wanted to see if I could build a fully automated, high-performance pipeline to handle this from start to finish. Today, I'm open-sourcing AutoShorts.

What is AutoShorts?

AutoShorts is a GPU-optimized CLI tool that analyzes long-form video, identifies high-engagement scenes using AI, and synthesizes them into ready-to-upload vertical shorts.
It doesn't just "cut" video; it understands it.

The Technical Deep Dive 🛠️

To keep processing times low and avoid massive cloud API bills, I focused heavily on local processing and hardware acceleration:

1. GPU Scene Analysis ⚡

Using decord and PyTorch, the pipeline performs frame extraction and visual feature analysis directly on the GPU. We calculate action density and spectral flux to find "loud" or "fast" moments before the text-based AI even sees the clip.

2. Dual-AI Intelligence 🧠

The pipeline integrates with OpenAI (GPT-4o) and Google Gemini. We pass the metadata and scene descriptions to the LLM to score segments based on:
Hook Potential: Is the start grabby?
Relevance: Does the action make sense?
Emotional Impact: Is it funny, impressive, or a "fail"?

3. Smart Subtitles & Neural TTS 🗣️

Local TTS: Instead of paid APIs, we use ChatterBox locally. It supports emotional prosody, so the voiceover doesn't sound like a monotone robot.
PyCaps Renderer: We use a custom Playwright-based renderer to create those "MrBeast style" word-by-word animated captions that are essential for mobile retention.

4. NVENC Rendering 🎞️

Final assembly—including audio mixing, blurring backgrounds (for the vertical look), and burning in subtitles—is offloaded to NVIDIA’s NVENC hardware. This keeps the CPU free for other tasks and slashes render times.

🚧 What’s Next? (The Roadmap)

This is a v1.0 release, and while the pipeline is robust, the potential for enhancement is huge. I’m looking for contributors to help with:
Upgrading the Voice Engine: Integrating more recent open-source models like ChatterBoxTurbo, Qwen-TTS, or NVIDIA’s latest TTS for even more realistic voice cloning and prosody.
Intelligent Auto-Zoom: Currently, the 9:16 crop is centered. Adding object detection (YOLO/RT-DETR) to follow the action—dynamically moving the crop window to follow a character or a vehicle.
Advanced Transition Styles: Adding AI-generated transitions between merged scenes.

Build With Me 🚀

The project is fully dockerized and open for contributions. Whether you're interested in machine learning, computer vision, or just want to automate your own YouTube channel, I'd love to see you in the PRs.
GitHub Repository: github.com/divyaprakash0426/autoshorts
A huge thanks to the original concepts from artryazanov and Binary-Bytes which provided the foundation for this refactored release.
What features would you add to an AI video pipeline like this? Let's discuss in the comments! 👇

Forem: divyaprakash D

Vitreus: Local-First Spreadsheet Intelligence with Gemma 4

What I Built

Demo

Demo 1: Ask Gemma 4 to find budget problems

Demo 2: One command, real XLSX output

Demo 3: XLSX in, XLSX out

Code

How I Used Gemma 4

How It Works Internally

1. WorkbookSnapshot: turn sheets into model context

2. Backends: local-first, cloud-optional

3. Manifest execution: structured actions only

4. CSV vs XLSX: the boring detail that mattered

The Technical Problems That Shaped the Project

Problem 1: "AI changed my spreadsheet" is not good enough

Problem 2: Local-first should not mean "local-only"

Problem 3: Multi-sheet XLSX files are the real spreadsheet format

Problem 4: Tests need to run without secret keys or local models

Why This Is a Good Gemma 4 Use Case

Current Limitations

What I Learned

Acknowledgements

What I Learned Building a Local-First Spreadsheet Agent with Gemma 4

The Core Idea: Reasoning Is Not Execution

Why Gemma 4 31B Dense Was the Right Default

Running Gemma 4 Locally with Ollama

Running Gemma 4 with an API Key

The Prompt Contract

CSV Taught Me a Product Lesson

XLSX Made It Feel Real

What Gemma 4 Is Good At in This Pattern

1. Header-aware reasoning

2. Spreadsheet reference generation

3. Explanations attached to actions

4. Flexible user intent

Where I Would Use Each Gemma 4 Variant

Use 2B / 4B when latency and hardware matter most

Use 31B Dense when reasoning quality matters most

Use 26B MoE when throughput matters

Design Checklist for LLMs Working on Private Files

Try the Pattern Yourself

Final Thought

ClawForge — A Trustworthy, Cross-Variant Skill Hub for the OpenClaw Ecosystem

What I Built

How I Used OpenClaw

The SKILL.md standard — front-loading every honest constraint

The variant-aware installer

The security trio

The scaffolder — keeping the schema honest

The cross-architecture evidence harness

Demo

End-to-end proof — running inside a disposable QEMU VM

Try it in 60 seconds

What I Learned

1. AgentSkills are portable. Skill execution isn't.

2. Static tiers beat dynamic risk scores every time.

3. The harness finds the bugs that demos hide.

4. "30 engineered" beats "300 curated" for a security story.

5. The scaffolder is the standard.

ClawCon Michigan

Why the Claw ecosystem needs a skill commons — and how I built one

The supply chain problem nobody wants to talk about

The variant problem is bigger than it looks

Show, Don't Just Tell

What ClawHavoc actually taught us

The China signal: what the underground market proved

The AgentSkills portability problem

Design decisions and trade-offs

What comes next

Over to you

ClawCon Michigan

Why the Claw ecosystem needs a skill commons — and how I built one

The supply chain problem nobody wants to talk about

The variant problem is bigger than it looks

Show, Don't Just Tell

What ClawHavoc actually taught us

The China signal: what the underground market proved

The AgentSkills portability problem

Design decisions and trade-offs

The `SKILL.md` standard — front-loading every honest constraint