Forem: Luke Hinds

nono: the ultimate coding agent security tool - sandbox and supercharge Claude Code in just two simple steps!

Luke Hinds — Tue, 10 Mar 2026 17:03:08 +0000

Claude Code is one of the most capable coding agents available today. It can navigate your codebase, edit files across multiple directories, run shell commands, install packages, and execute tests — all autonomously.

It also runs with your full user permissions. Every file on your system is readable. Every credential in your environment is accessible. Every command it runs executes as you.

We've all seen the stories. Claude deleted my database. Claude deleted all my code. Claude wiped my home directory. These aren't hypothetical risks — they're things that happen when an agent operates with no isolation between what it wants to do and what it can do.

Permission prompts don't scale

The default response to this problem has been permission dialogs. The agent wants to run a command — do you approve? The agent wants to write a file — do you approve?

The issue is predictable. After the third prompt, you're clicking "allow" reflexively. Approval fatigue turns a security feature into a rubber stamp. You end up with interrupted flow and reduced security.

Claude Code now ships with a built-in sandbox mode, which is a meaningful step forward. It uses OS-level primitives to enforce filesystem and network boundaries, reducing the need for constant approval prompts.

But it includes an escape hatch. When a command fails due to sandbox restrictions, Claude can retry it outside the sandbox with the dangerouslyDisableSandbox parameter. The name is honest — it is dangerous. One retry outside the sandbox and the agent has full access again.

Kernel-level enforcement with no escape

nono takes a different approach. It uses Landlock (Linux) and Seatbelt (macOS) to enforce restrictions at the kernel level. Once applied, these restrictions are irreversible for the process. There is no API to widen them. There is no escape hatch. Not even nono itself can remove the restrictions after they're set.

This is what we mean by "structurally impossible." The kernel denies unauthorised operations directly. There's no interception layer, no policy filter, no mechanism to bypass. If the agent tries to read your SSH keys or delete files outside its granted path, the operation fails at the syscall level.
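
The one-way property can be pictured with a small toy model. This is plain Python for illustration only, not how Landlock or Seatbelt are implemented, and the `Sandbox` class and its method names are hypothetical. The point is that the only operation on offer is narrowing: each call intersects the current grants with the requested ones, so no code path widens access.

```python
from pathlib import PurePosixPath


class Sandbox:
    """Toy model of a restrict-only permission set (hypothetical, not nono's API)."""

    def __init__(self, allowed_roots):
        self._roots = frozenset(PurePosixPath(p) for p in allowed_roots)

    def restrict(self, allowed_roots):
        """Return a sandbox that allows only paths permitted by BOTH sets."""
        kept = set()
        for new in (PurePosixPath(p) for p in allowed_roots):
            for old in self._roots:
                # Keep the narrower of two nested roots; drop unrelated ones.
                if new == old or old in new.parents:
                    kept.add(new)
                elif new in old.parents:
                    kept.add(old)
        return Sandbox(kept)

    def allows(self, path):
        target = PurePosixPath(path)
        return any(target == r or r in target.parents for r in self._roots)


box = Sandbox(["/home/you/project"])
widened = box.restrict(["/home/you"])            # asks for more than it has
print(widened.allows("/home/you/.ssh/id_rsa"))   # False: the grant did not grow
```

The toy ignores symlinks and `..` segments, which a kernel-level mechanism resolves for itself.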

Getting started

Install nono:

brew install nono

Launch Claude Code inside a sandbox using the built-in profile:

nono run --allow-cwd --profile claude-code -- claude

That's it. Claude Code now runs with default-deny filesystem access. Only the current directory is writable. SSH keys, AWS credentials, system configs — all blocked at the kernel level.

If you want to run silently without the nono banner:

nono run --allow-cwd --silent --profile claude-code -- claude

For more granular control:

# Read-only access to reference docs, write access to project
nono run --read ~/docs --write ./src --profile claude-code -- claude

# Check why a specific path would be blocked
nono why --path ~/.ssh/id_rsa --op read
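
If you launch agents from scripts, the same invocations can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of nono. It uses only the flags shown in this post and assumes their relative order does not matter to the CLI.

```python
import shlex


def build_nono_command(agent_cmd, profile=None, read=(), write=(),
                       allow_cwd=False, silent=False):
    """Assemble a `nono run` argument list from the flags shown in this post."""
    args = ["nono", "run"]
    if allow_cwd:
        args.append("--allow-cwd")
    if silent:
        args.append("--silent")
    for path in read:
        args += ["--read", path]
    for path in write:
        args += ["--write", path]
    if profile:
        args += ["--profile", profile]
    # Everything after `--` is the command to run inside the sandbox.
    return args + ["--", *agent_cmd]


cmd = build_nono_command(["claude"], profile="claude-code",
                         read=["./docs"], write=["./src"])
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)`. Note that `~` is not expanded when no shell is involved, so expand such paths with `os.path.expanduser` first.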

What's next

nono is free, open source, and available today. We're actively developing fine-grained network controls, atomic rollbacks, and library bindings for Python, TypeScript, and Rust so other tools can integrate nono's security primitives directly.

Docs: nono.sh/docs
GitHub: github.com/always-further/nono
Discord: Join the community

Don't yolo when you can nono.

Introducing nono: A Secure Sandbox for AI Agents

Luke Hinds — Mon, 02 Feb 2026 23:52:43 +0000

AI coding agents like Claude Code, OpenCode, and others are incredibly powerful: they can write code, refactor entire codebases, and automate tedious tasks. But there's a catch: they run with your permissions. That means they can read your SSH keys, access your AWS credentials, and potentially exfiltrate data to remote servers. We have seen this play out over the past few days with the attacks against OpenClaw servers.

Current solutions often rely on the agent to police itself. But bugs happen. Security vulnerabilities are common. What if there were a way to make unauthorized operations structurally impossible?

Enter nono.

What is nono?

nono is a capability-based security shell that leverages kernel-level primitives to sandbox AI agents and untrusted processes. It uses Landlock on Linux and Seatbelt on macOS to create an environment where once restrictions are applied, there's no API to escape them — not even for nono itself.

The name says it all: "no" to unauthorized filesystem access, "no" to secret exfiltration, "no" to destructive commands.

Why Not Just Use Docker?

Great question. Containers are fantastic for many use cases, but they come with overhead:

Heavyweight: Containers require images, layers, and a runtime
Complexity: Managing volumes, networking, and permissions adds friction
Overkill: Sometimes you just want to run a single command safely

nono is lightweight by design. It applies kernel-enforced restrictions directly to a process without spinning up containers, VMs, or additional infrastructure. You get security without the weight.

Quick Start

Install via Homebrew (macOS):

brew tap lukehinds/nono
brew install nono

Or build from source:

git clone https://github.com/lukehinds/nono.git
cd nono
cargo build --release

Run your first sandboxed command:

# Give Claude Code read/write access only to the current directory
nono run --allow . -- claude

# Separate read and write permissions
nono run --read ./project-dir --write ./output -- claude

# Block network access entirely
nono run --allow . --net-block -- my-agent

# Preview what would happen (dry run)
nono run --allow . --dry-run -- my-command

Key Features

No Escape Hatch

Once nono applies the sandbox, there is no mechanism to bypass it. The kernel denies unauthorized operations — this isn't policy-based filtering that can be circumvented; it's structural enforcement.

Agent Agnostic

nono works with any CLI tool or AI agent: Claude Code, GPT wrappers, custom scripts, build tools — if it runs in a terminal, nono can sandbox it.

Defense in Depth

nono provides multiple layers of protection:

Layer	Protection	Can Be Overridden?
Command blocklist	Blocks dangerous binaries (rm, dd, chmod, sudo)	Yes, with `--allow-command`
Kernel (delete)	Blocks unlink/rmdir syscalls	No
Kernel (truncate)	Prevents zeroing out files	No
Filesystem sandbox	Restricts path access	Only via explicit `--allow`
Network sandbox	Blocks network access	Only by removing `--net-block`

Sensitive Paths Protected by Default

SSH keys, AWS credentials, shell configs, and other sensitive paths are blocked by default. Your secrets stay secret.

Process Inheritance

When a sandboxed process spawns children, those children inherit all restrictions. There's no privilege escalation through subprocess tricks.

How It Works

┌─────────────────────────────────────────────────┐
│  Terminal                                       │
│                                                 │
│  $ nono run --allow ./project -- claude         │
│                                                 │
│  ┌───────────────────────────────────────────┐  │
│  │  nono (applies sandbox, then exec)        │  │
│  │                                           │  │
│  │  ┌─────────────────────────────────────┐  │  │
│  │  │  Claude Code (sandboxed)            │  │  │
│  │  │  ✓ Can read/write ./project         │  │  │
│  │  │  ✗ Cannot access ~/.ssh, ~/.aws     │  │  │
│  │  │  ✗ Cannot delete files              │  │  │
│  │  └─────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────┘

Command validation — Dangerous commands are blocked before execution
Sandbox applied — OS-level restrictions are set (irreversibly)
Kernel enforcement — Even allowed paths can't have files deleted
Process execution — Your command runs with only granted capabilities
Child inheritance — All subprocesses run under the same restrictions

Platform Support

Platform	Mechanism	Kernel Requirement	Status
Linux	Landlock LSM	5.13+	✅ Filesystem
Linux	Landlock LSM	6.7+	✅ Filesystem + Network
macOS	Seatbelt	10.5+	✅ Filesystem + Network
Windows	—	—	🚧 Not yet supported

Debugging: The `why` Command

Ever wonder why a path is blocked? nono can tell you:

$ nono why ~/.ssh/id_rsa
Path: /Users/you/.ssh/id_rsa
Status: BLOCKED
Reason: Sensitive path - SSH private keys are protected by default

This makes it easy to understand and debug sandbox behavior without trial and error.
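
To make the decision logic concrete, here is a rough sketch of how a default-deny check with an explanation might be structured. It is not nono's implementation. The `SENSITIVE` table, the `explain` function, and the reason strings are all illustrative assumptions.

```python
from pathlib import PurePosixPath

# Illustrative deny list; the real defaults are not spelled out in this post.
SENSITIVE = {
    "/Users/you/.ssh": "SSH keys",
    "/Users/you/.aws": "AWS credentials",
}


def explain(path, allowed_roots):
    """Return (status, reason) for a path under a default-deny policy."""
    target = PurePosixPath(path)

    def inside(root):
        root = PurePosixPath(root)
        return target == root or root in target.parents

    # Sensitive paths win even if a broader grant would cover them.
    for root, label in SENSITIVE.items():
        if inside(root):
            return "BLOCKED", f"Sensitive path: {label} are protected by default"
    for root in allowed_roots:
        if inside(root):
            return "ALLOWED", f"Inside granted path {root}"
    return "BLOCKED", "Not covered by any granted path (default deny)"


print(explain("/Users/you/.ssh/id_rsa", ["/Users/you/project"]))
```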

Real-World Use Cases

Running AI coding agents safely:

nono run --allow ./my-project -- claude

Building code with restricted write access:

nono run --read ./src --write ./target -- cargo build

Running tests without network access:

nono run --allow . --net-block -- npm test

Processing files without risk of deletion:

nono run --read ./input --write ./output -- python process.py

Get Involved

nono is open source under the Apache 2.0 license. The project is still early (security auditing is ongoing), but it's ready for experimentation and feedback.

GitHub: github.com/lukehinds/nono
Documentation: docs.nono.sh

Conclusion

As AI agents become more powerful and autonomous, the security model of "trust the agent to behave" becomes increasingly untenable. nono offers a different approach: make bad behavior impossible at the kernel level.

Give it a try, break things (safely), and let us know what you think.

nono is created by Luke Hinds. Star the repo if you find it useful!

🚀🔥 Train a 4B Model to Beat Claude Sonnet 4.5 🤖 and Gemini Pro 2.5 ✨ at Tool Calling 🛠️ - and for all Free! 💸🎉

Luke Hinds — Thu, 25 Dec 2025 20:07:07 +0000

How we used synthetic data to fine-tune Qwen3-4B into a specialist that outperforms frontier models — and how you can do it too (for free)

Introduction

There's a common assumption in the AI community that bigger is always better. That if you want state-of-the-art performance, you need to use the largest, most expensive models from frontier labs like Anthropic, OpenAI, or Google.

We're here to challenge that assumption.

Using DeepFabric, an open-source tool, we trained Qwen3-4B — a model with just 4 billion parameters — to outperform Claude Sonnet 4.5 and Google Gemini Pro 2.5 at tool calling tasks. And we're sharing exactly how we did it so you can replicate our results.

The secret? Specialization beats generalization. Frontier models are generalists — they're designed to handle everything from poetry to protein folding. But if you need a model that excels at one specific task, a small, focused model trained on high-quality synthetic data can beat the giants.

The Problem with Generalist Models

Frontier models like Claude Sonnet 4.5 and Gemini Pro 2.5 are remarkable achievements. They can write code, analyze documents, engage in nuanced conversations, and much more. But this versatility comes at a cost.

When you're building an agentic system, one where the model needs to interact with tools, APIs, or MCP servers, you don't need a model that can do everything. You need a model that can do one thing exceptionally well: understand when to call a tool, which tool to call, and what parameters to pass within the tool's schema.

This is where the generalist approach shows its weaknesses:

Inconsistent tool calling: Large models sometimes hallucinate tool names or parameters that don't exist
Schema violations: They might return responses that don't match the expected JSON schema
Reasoning gaps: When faced with complex multi-step tool interactions, they can lose track of state
Cost and latency: API calls to frontier models are expensive
Data privacy concerns: Sending sensitive data to third-party APIs can be risky

The DeepFabric Approach

Instead of trying to create one model that handles everything, we help you create specialist models that are experts at specific tool-calling tasks.

The process works in three stages:

1. Topic Graph Generation

One of the biggest challenges in synthetic data generation is ensuring diversity. If you just ask an LLM to generate training samples, you'll quickly get repetitive, homogeneous data that leads to overfitting.

DeepFabric solves this with a unique topic graph generation algorithm. Given a domain (like "Blender 3D modeling" or "REST API development"), it builds a comprehensive graph of subtopics, ensuring that generated samples cover the full breadth of the domain without redundancy.
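
As a rough illustration of the idea (not DeepFabric's actual algorithm), the sketch below expands a root topic breadth-first to a fixed depth and degree and links repeated topics instead of expanding them twice. `build_topic_graph` and `fake_expand` are hypothetical names, and `fake_expand` stands in for what would be an LLM call.

```python
from collections import deque


def build_topic_graph(root, expand, depth, degree):
    """Breadth-first expansion of a root topic into subtopics.

    `expand(topic, degree)` stands in for the LLM call and must return a
    list of subtopic strings. Topics already seen are linked, not re-expanded.
    """
    edges = {root: []}
    queue = deque([(root, 0)])
    while queue:
        topic, level = queue.popleft()
        if level == depth:
            continue  # leaves are not expanded further
        for child in expand(topic, degree)[:degree]:
            edges[topic].append(child)
            if child not in edges:  # new node: schedule it for expansion
                edges[child] = []
                queue.append((child, level + 1))
    return edges


def fake_expand(topic, degree):
    # Placeholder for a model call: derives deterministic child names.
    return [f"{topic}/{i}" for i in range(degree)]


graph = build_topic_graph("REST API development", fake_expand, depth=2, degree=3)
leaves = [t for t, children in graph.items() if not children]
print(len(graph), len(leaves))  # 13 nodes, 9 leaves
```

Each leaf then becomes a prompt for sample generation, so coverage grows with `degree ** depth` rather than with how often the generator happens to repeat itself.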

2. Synthetic Sample Generation

For each topic in the graph, DeepFabric generates training samples that combine:

Chain-of-thought reasoning: The model learns to think through problems step by step
Tool calling patterns: Proper syntax and parameter usage for each tool
Real execution traces: Using Spin's WebAssembly sandbox, tools actually execute, producing authentic results

That last point is critical. Traditional synthetic data generators simulate tool outputs, which means the training data often contains hallucinated results. DeepFabric actually executes the tools in isolated sandboxes, so the model learns from real cause-and-effect relationships.
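
The difference between simulated and executed outputs can be shown in a few lines. This sketch calls a plain Python function directly and records what came back. It does no sandboxing and does not use Spin. The `TOOLS` registry and `run_tool_call` are hypothetical names.

```python
def add(a, b):
    return a + b


TOOLS = {"add": add}  # hypothetical registry of callable tools


def run_tool_call(call):
    """Execute one tool call and record what actually happened."""
    try:
        result = TOOLS[call["name"]](**call["arguments"])
        return {"call": call, "result": result, "error": None}
    except Exception as exc:  # a real failure is also useful training signal
        return {"call": call, "result": None,
                "error": f"{type(exc).__name__}: {exc}"}


trace = [
    run_tool_call({"name": "add", "arguments": {"a": 2, "b": 3}}),
    run_tool_call({"name": "subtract", "arguments": {"a": 2, "b": 3}}),
]
print(trace)
```

Because the recorded result comes from running the tool, the training sample cannot contain an output the tool would never produce.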

3. Fine-Tuning and Evaluation

The generated dataset is automatically formatted for popular training frameworks (TRL, Unsloth) and can be uploaded directly to Hugging Face. After training, DeepFabric's built-in evaluation engine measures performance on held-out test samples.

The Results

We put this approach to the test using the Blender MCP server — a tool interface for 3D modeling operations in Blender. Here's how the models performed:

Model	Tool Calling Accuracy
DeepFabric Fine-Tuned (Qwen3-4B)	93.50%
Claude Sonnet 4.5	80.50%
Google Gemini Pro 2.5	47.00%

A 4 billion parameter model — small enough to run on consumer hardware — outperforming models that are orders of magnitude larger. How is this possible?

Why This Works

The key insight is that tool calling is a narrow, well-defined task. Unlike open-ended conversation or creative writing, successful tool calling requires:

Recognizing when a tool is needed
Selecting the correct tool from available options
Formatting parameters according to a strict schema
Interpreting results and deciding next steps

These are learnable patterns. A small model with high-quality, domain-specific training data can master these patterns more effectively than a large generalist model that has to balance thousands of different capabilities.
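
The strictness of the task is easy to see in code. Below is a minimal checker for a single tool call. The `create_cube` tool and its schema layout are made up for illustration and are not taken from the Blender MCP server.

```python
TOOL_SCHEMAS = {
    # Hypothetical tool definition, loosely in the spirit of JSON Schema.
    "create_cube": {
        "required": {"size": float, "name": str},
        "optional": {"location": list},
    }
}


def validate_tool_call(call, schemas=TOOL_SCHEMAS):
    """Return a list of problems; an empty list means the call is well formed."""
    schema = schemas.get(call.get("name"))
    if schema is None:
        return [f"unknown tool: {call.get('name')!r}"]
    args = call.get("arguments", {})
    known = {**schema["required"], **schema["optional"]}
    problems = [f"missing parameter: {p}" for p in schema["required"] if p not in args]
    for key, value in args.items():
        if key not in known:
            problems.append(f"unexpected parameter: {key}")
        elif not isinstance(value, known[key]):
            problems.append(f"wrong type for {key}: expected {known[key].__name__}")
    return problems


print(validate_tool_call({"name": "create_cube",
                          "arguments": {"size": "big", "colour": "red"}}))
```

A hallucinated tool name, a missing parameter, or a wrong type each fail deterministically, which is what makes the task measurable and trainable.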

Think of it like hiring for a job. Would you rather have a brilliant generalist who's pretty good at everything, or a specialist who has spent years mastering exactly the skill you need? For tool calling, specialization wins.

How to Replicate Our Results

We've made this entire process available for free. Here's how to get started:

Option 1: Use Our Google Colab (Easiest)

We've created a Google Colab notebook that uses the awesome Unsloth.ai for optimized fine-tuning. It walks you through the entire process, from generating your dataset to training and evaluating your model. No local setup required, and it runs on free GPU instances.

👉 [Link to Colab Notebook]

Option 2: Run Locally

If you prefer to run things on your own hardware:

# Install DeepFabric
pip install deepfabric

# Set your API key (for dataset generation)
export OPENAI_API_KEY="your-api-key"

# Generate a dataset
deepfabric generate \
  --topic-prompt "Your domain here" \
  --mode graph \
  --depth 4 \
  --degree 4 \
  --num-samples 100 \
  --output-save-as dataset.jsonl

Training Your Model

Once you have your dataset, training is straightforward:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig

# Load your dataset
dataset = load_dataset("json", data_files="dataset.jsonl", split="train")

# Split into train/eval
splits = dataset.train_test_split(test_size=0.1, seed=42)

# Load the base model and tokenizer (Qwen3-4B, the model used in this post)
model_name = "Qwen/Qwen3-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Format and train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,  # newer TRL releases name this argument `processing_class`
    train_dataset=splits["train"],
    args=SFTConfig(output_dir="./output", num_train_epochs=3),
)
trainer.train()

Evaluating Your Model

from deepfabric.evaluation import Evaluator, EvaluatorConfig, InferenceConfig

config = EvaluatorConfig(
    inference_config=InferenceConfig(
        model_path="./output/checkpoint-final",
        backend="transformers",
    ),
)

evaluator = Evaluator(config)
results = evaluator.evaluate(dataset=splits["test"])

print(f"Tool Selection Accuracy: {results.metrics.tool_selection_accuracy:.2%}")
print(f"Parameter Accuracy: {results.metrics.parameter_accuracy:.2%}")
print(f"Overall Score: {results.metrics.overall_score:.2%}")
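
For intuition about what such metrics measure, here is one plausible way to compute them by hand. The definitions are assumptions made for illustration and may differ from how DeepFabric's evaluator computes its scores. `score_tool_calls` is a hypothetical helper.

```python
def score_tool_calls(predictions, references):
    """Compare predicted tool calls with reference calls, pair by pair."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need equally sized, non-empty lists")
    pairs = list(zip(predictions, references))
    same_tool = [p["name"] == r["name"] for p, r in pairs]
    # Parameters only count when the right tool was chosen in the first place.
    same_args = [ok and p["arguments"] == r["arguments"]
                 for ok, (p, r) in zip(same_tool, pairs)]
    return {
        "tool_selection_accuracy": sum(same_tool) / len(pairs),
        "parameter_accuracy": sum(same_args) / len(pairs),
    }


refs = [{"name": "add", "arguments": {"a": 1}},
        {"name": "add", "arguments": {"a": 2}},
        {"name": "move", "arguments": {"x": 0}},
        {"name": "move", "arguments": {"x": 5}}]
preds = [{"name": "add", "arguments": {"a": 1}},
         {"name": "add", "arguments": {"a": 9}},
         {"name": "move", "arguments": {"x": 0}},
         {"name": "scale", "arguments": {"x": 5}}]
print(score_tool_calls(preds, refs))
```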

What Makes DeepFabric Different

There are other tools for generating synthetic datasets. Here's what sets DeepFabric apart:

Topic Graph Diversity

Most dataset generators produce samples that cluster around common patterns. DeepFabric's graph-based approach ensures comprehensive coverage of your domain, hitting edge cases and uncommon scenarios that make your model robust.

Real Tool Execution

Simulated tool outputs create unrealistic training data. DeepFabric uses Spin's WebAssembly sandbox to actually execute tools, producing authentic cause-and-effect relationships in your training data.

MCP Integration

DeepFabric natively supports the Model Context Protocol (MCP) schema, making it easy to import tool definitions from any MCP server. If you already have an MCP server, you can start generating training data immediately.

End-to-End Pipeline

From dataset generation to training to evaluation, DeepFabric handles the entire workflow. No need to stitch together multiple tools or write custom integration code.

Use Cases

This approach isn't limited to Blender. Any domain where you need reliable tool calling is a candidate:

Developer tools: Code execution, file manipulation, git operations
Data pipelines: Database queries, API calls, data transformation
Business automation: CRM updates, email composition, calendar management
Creative software: Image editing, audio processing, video production
IoT and hardware: Device control, sensor reading, automation scripts

If you can define it as a tool with a schema, you can train a specialist model for it.

The Bigger Picture

We believe this represents a shift in how AI systems will be built. Rather than relying on ever-larger generalist models, we'll see architectures that combine:

A reasoning backbone (which could be a frontier model) for complex decision-making
Specialist tool-calling models for reliable, schema-compliant interactions with external systems

This hybrid approach gives you the best of both worlds: the reasoning capability of large models with the reliability and efficiency of focused specialists.

Getting Started

Ready to train your own specialist model? Here are your next steps:

Try the Colab notebook to see the full pipeline in action
Star the GitHub repo to stay updated on new features
Join our Discord to connect with other builders
Share your results — we'd love to see what specialist models you create

Fine-Tuning with GRPO Datasets: A Developer's Guide to DeepFabric's GRPO Formatter

Luke Hinds — Tue, 21 Oct 2025 09:34:02 +0000

Introduction

When training language models for mathematical reasoning, one of the key challenges is getting the model to not just produce correct answers, but to show its work in a structured, verifiable way. This is where GRPO (Group Relative Policy Optimization) comes in.

DeepFabric's GRPO formatter transforms your datasets into the precise format needed for GRPO training pipelines, wrapping reasoning traces and solutions in configurable tags that enable reward-based optimization. In this post, we'll dive deep into how to use it effectively.

What is GRPO?

GRPO is a reinforcement learning technique that optimizes language models using reward signals. For mathematical reasoning tasks, this typically means:

Structured reasoning: The model's thought process is wrapped in specific tags
Extractable solutions: Final answers are clearly delineated for verification
Reward computation: Numerical answers can be automatically validated against ground truth

The format enables training systems to parse the model's output, extract the answer, compute rewards based on correctness, and update the model accordingly.
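
In GRPO as it is usually described, several completions are sampled for the same prompt and each reward is judged relative to the rest of that group. The sketch below shows that normalization step only. Using the population standard deviation and an `eps` of `1e-6` are choices made here for simplicity; implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every completion got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers end up with positive advantages and wrong ones with negative advantages. A group where every answer scores the same yields zeros, so it contributes no update.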

The GRPO Format Structure

A properly formatted GRPO sample looks like this:

{
  "messages": [
    {
      "role": "system",
      "content": "You are given a problem. Think about the problem and provide your working out. Place it between <start_working_out> and <end_working_out>. Then, provide your solution between <SOLUTION> and </SOLUTION>."
    },
    {
      "role": "user",
      "content": "What is 15% of 240?"
    },
    {
      "role": "assistant",
      "content": "<start_working_out>To find 15% of 240, I need to multiply 240 by 0.15. 240 × 0.15 = 36<end_working_out><SOLUTION>36</SOLUTION>"
    }
  ]
}

Notice the clear separation:

Reasoning section: Wrapped in <start_working_out>...<end_working_out>
Solution section: Wrapped in <SOLUTION>...</SOLUTION>

This structure allows the training pipeline to:

Extract the numerical answer (36)
Compare it against ground truth
Compute reward signals
Update the model so that reasoning which leads to higher rewards becomes more likely
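
The first three steps can be sketched as a small reward function. This is an illustration under simple assumptions (a binary reward and a numeric tolerance), not the reward used by any particular training pipeline. `correctness_reward` is a hypothetical name.

```python
import re

SOLUTION_RE = re.compile(r"<SOLUTION>(.*?)</SOLUTION>", re.DOTALL)


def correctness_reward(completion, ground_truth, tol=1e-6):
    """Return 1.0 if the tagged answer matches ground truth numerically, else 0.0."""
    match = SOLUTION_RE.search(completion)
    if match is None:
        return 0.0  # no parsable answer, no reward
    try:
        answer = float(match.group(1).strip().replace(",", ""))
    except ValueError:
        return 0.0  # the tags were there but the content was not a number
    return 1.0 if abs(answer - float(ground_truth)) <= tol else 0.0


completion = ("<start_working_out>240 × 0.15 = 36<end_working_out>"
              "<SOLUTION>36</SOLUTION>")
print(correctness_reward(completion, "36"))  # 1.0
```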

Using DeepFabric's GRPO Formatter

Basic Usage

The GRPO formatter is built into DeepFabric and supports multiple input formats out of the box:

from deepfabric.formatters.builtin.grpo import GrpoFormatter

# Initialize with default configuration
formatter = GrpoFormatter()

# Your raw dataset sample
sample = {
    "question": "If a train travels 120 km in 2 hours, what is its average speed?",
    "final_answer": "60",
    "chain_of_thought": "Speed = Distance / Time. Speed = 120 km / 2 hours = 60 km/h"
}

# Format for GRPO training
formatted = formatter.format_dataset([sample])

Configuration Options

The formatter is highly configurable to match your training pipeline's requirements:

config = {
    "reasoning_start_tag": "<think>",
    "reasoning_end_tag": "</think>",
    "solution_start_tag": "<answer>",
    "solution_end_tag": "</answer>",
    "system_prompt": "Solve the following problem step by step.",
    "validate_numerical": True  # Enforce numerical answer extraction
}

formatter = GrpoFormatter(config=config)

This flexibility means you can adapt to different GRPO implementations (like Qwen, DeepSeek, or custom pipelines) that might use different tag conventions.

Supported Input Formats

One of the formatter's strengths is its ability to handle diverse dataset structures. Let's explore each:

1. Question-Answer Format

The simplest format - just questions and answers:

sample = {
    "question": "What is 25² ?",
    "final_answer": "625"
}

The formatter automatically generates a basic reasoning wrapper if no reasoning trace is provided.

2. Chain-of-Thought Format

Includes explicit reasoning steps:

sample = {
    "question": "Solve: 3x + 5 = 20",
    "chain_of_thought": "Subtract 5 from both sides: 3x = 15. Divide both sides by 3: x = 5.",
    "final_answer": "5"
}

3. Structured Chain-of-Thought

The most detailed format with message structure and reasoning traces:

sample = {
    "messages": [
        {"role": "user", "content": "Calculate 15! / 13!"}
    ],
    "reasoning_trace": [
        {"thought": "Using factorial properties, 15! / 13! = 15 × 14 × 13! / 13!"},
        {"thought": "The 13! cancels out, leaving 15 × 14"},
        {"action": "Calculate: 15 × 14 = 210"}
    ],
    "final_answer": "210"
}

4. Conversation Format

Already has messages but needs GRPO formatting:

sample = {
    "messages": [
        {"role": "user", "content": "What is 2³ + 3²?"},
        {"role": "assistant", "content": "Let me calculate: 2³ = 8 and 3² = 9. Therefore 8 + 9 = 17"}
    ]
}

The formatter intelligently extracts reasoning and answer, then wraps them in GRPO tags.
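
How the formatter does this internally is not shown here. One simple heuristic, given purely as an assumption, is to treat the last number in the assistant reply as the answer and the whole reply as the reasoning. `split_reasoning_and_answer` and `to_grpo` are hypothetical helpers.

```python
import re

NUMBER_RE = re.compile(r"[+-]?[0-9]+(?:\.[0-9]+)?")


def split_reasoning_and_answer(assistant_text):
    """Treat the last number in the reply as the answer, the full text as reasoning."""
    numbers = NUMBER_RE.findall(assistant_text)
    if not numbers:
        return None
    return {"reasoning": assistant_text.strip(), "answer": numbers[-1]}


def to_grpo(assistant_text, start="<start_working_out>", end="<end_working_out>"):
    """Wrap the extracted parts in GRPO-style tags, or return None if no answer is found."""
    parts = split_reasoning_and_answer(assistant_text)
    if parts is None:
        return None
    return f"{start}{parts['reasoning']}{end}<SOLUTION>{parts['answer']}</SOLUTION>"


reply = "Let me calculate: 2³ = 8 and 3² = 9. Therefore 8 + 9 = 17"
print(to_grpo(reply))
```

A heuristic like this fails when the reply ends with an unrelated number, which is one reason to prefer samples that carry an explicit `final_answer` field.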

5. Generic Format

For datasets with non-standard field names:

sample = {
    "problem": "Find the area of a circle with radius 5",
    "solution": "78.54",
    "reasoning": "Area = πr². With r=5: Area = π × 5² = π × 25 ≈ 78.54"
}

The formatter searches for common field name patterns like problem, prompt, input for questions and solution, output, response for answers.
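
A lookup of this kind can be sketched in a few lines. The candidate lists below combine the names mentioned above with the `question` and `final_answer` fields used in the earlier examples. Their order, and the extra `answer` key, are assumptions rather than the formatter's actual priority.

```python
QUESTION_KEYS = ("question", "problem", "prompt", "input")
ANSWER_KEYS = ("final_answer", "answer", "solution", "output", "response")


def first_present(sample, candidates):
    """Return the first non-empty value among the candidate keys, else None."""
    for key in candidates:
        value = sample.get(key)
        if value not in (None, ""):
            return value
    return None


sample = {
    "problem": "Find the area of a circle with radius 5",
    "solution": "78.54",
    "reasoning": "Area = πr². With r=5: Area = π × 5² = π × 25 ≈ 78.54",
}
print(first_present(sample, QUESTION_KEYS), "->", first_present(sample, ANSWER_KEYS))
```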

Validation and Quality Control

The formatter includes robust validation at two levels:

Input Validation

Before formatting, each sample is validated:

# This returns False for invalid samples
is_valid = formatter.validate(sample)

Validation checks:

Required fields are present
Data types are correct
Format can be detected and handled

Output Validation

After formatting, samples are validated against GRPO requirements:

formatted_sample = formatter._format_single_sample(sample)
is_grpo_compliant = formatter.validate_output(formatted_sample)

This ensures:

All required roles (system, user, assistant) are present
GRPO formatting tags are correctly applied
Numerical answers are extractable (if validation enabled)

Numerical Answer Extraction

When validate_numerical: True, the formatter uses regex patterns to ensure answers can be extracted:

import re

# The formatter compiles these patterns
format_regex = re.compile(
    r"<end_working_out>.*?<SOLUTION>(.+?)</SOLUTION>\s*$",
    flags=re.MULTILINE | re.DOTALL
)

number_regex = re.compile(
    r"<SOLUTION>.*?\s*([+-]?[\d\.,]+)",
    flags=re.MULTILINE | re.DOTALL
)

This ensures the training pipeline can reliably extract answers for reward computation.
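
To see what the two patterns accept, they can be run against a few strings. The `extract_number` helper is a hypothetical wrapper written for this example. Only the two regular expressions come from the text above.

```python
import re

format_regex = re.compile(
    r"<end_working_out>.*?<SOLUTION>(.+?)</SOLUTION>\s*$",
    flags=re.MULTILINE | re.DOTALL,
)
number_regex = re.compile(
    r"<SOLUTION>.*?\s*([+-]?[\d\.,]+)",
    flags=re.MULTILINE | re.DOTALL,
)


def extract_number(completion):
    """Return the numeric answer as a string, or None if format or number is missing."""
    if format_regex.search(completion) is None:
        return None
    match = number_regex.search(completion)
    return match.group(1) if match else None


good = "<start_working_out>240 × 0.15 = 36<end_working_out><SOLUTION>36</SOLUTION>"
bad = ("<start_working_out>six times seven<end_working_out>"
       "<SOLUTION>The answer is forty-two</SOLUTION>")
print(extract_number(good), extract_number(bad))  # 36 None
```

One caveat: as written, the character class `[\d\.,]` also accepts a lone comma or period, so a solution such as "Yes, it is" would yield "," as its "number". A stricter pattern may be worth considering if your answers contain prose.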

Real-World Example: Formatting a Math Dataset

Let's walk through a complete example using a mathematical reasoning dataset:

from deepfabric.formatters.builtin.grpo import GrpoFormatter
import json

# Configuration matching your training pipeline
config = {
    "reasoning_start_tag": "<start_working_out>",
    "reasoning_end_tag": "<end_working_out>",
    "solution_start_tag": "<SOLUTION>",
    "solution_end_tag": "</SOLUTION>",
    "validate_numerical": True
}

formatter = GrpoFormatter(config=config)

# Your raw dataset
raw_samples = [
    {
        "question": "A rectangle has length 8cm and width 5cm. What is its perimeter?",
        "chain_of_thought": "Perimeter of rectangle = 2(length + width). P = 2(8 + 5) = 2(13) = 26cm",
        "final_answer": "26"
    },
    {
        "question": "Simplify: (x² - 9) / (x - 3)",
        "chain_of_thought": "Factor numerator: (x+3)(x-3) / (x-3). Cancel (x-3): x + 3",
        "final_answer": "x + 3"
    }
]

# Format the entire dataset
formatted_dataset = formatter.format_dataset(raw_samples)

# Save for training
with open("grpo_training_data.jsonl", "w") as f:
    for sample in formatted_dataset:
        f.write(json.dumps(sample) + "\n")

print(f"Formatted {len(formatted_dataset)} samples for GRPO training")

The output JSONL file is ready to feed into your GRPO training pipeline.

Integration with DeepFabric Dataset Generation

DeepFabric can generate synthetic datasets and format them for GRPO in one pipeline:

# config.yaml
dataset_system_prompt: |
  Generate mathematical reasoning problems suitable for GRPO training.
  Include step-by-step reasoning and numerical answers.

topic_tree:
  args:
    model: "gpt-4"
    depth: 2
    branching_factor: 3
    root_prompt: "Mathematical reasoning: algebra, geometry, arithmetic"
  save_as: "topics.jsonl"

data_engine:
  args:
    model: "gpt-4"
    samples_per_topic: 10
    temperature: 0.7

dataset:
  creation:
    formatter: "grpo"
    formatter_config:
      reasoning_start_tag: "<start_working_out>"
      reasoning_end_tag: "<end_working_out>"
      solution_start_tag: "<SOLUTION>"
      solution_end_tag: "</SOLUTION>"
      validate_numerical: true
  save_as: "grpo_dataset.jsonl"

Run the pipeline:

deepfabric start config.yaml

This generates a complete GRPO-formatted dataset from scratch.

Best Practices

1. Match Your Training Pipeline

Different GRPO implementations use different tags. Always configure the formatter to match your training code:

# For Qwen-style GRPO
qwen_config = {
    "reasoning_start_tag": "<start_working_out>",
    "reasoning_end_tag": "<end_working_out>",
    "solution_start_tag": "<SOLUTION>",
    "solution_end_tag": "</SOLUTION>"
}

# For custom pipeline
custom_config = {
    "reasoning_start_tag": "[REASONING]",
    "reasoning_end_tag": "[/REASONING]",
    "solution_start_tag": "[ANSWER]",
    "solution_end_tag": "[/ANSWER]"
}

2. Enable Validation for Math Tasks

For mathematical reasoning, always enable numerical validation:

config = {"validate_numerical": True}

This ensures your reward function can extract answers reliably.

3. Provide Quality Reasoning Traces

The better your input reasoning, the better your GRPO training:

# Good: Detailed step-by-step
sample = {
    "question": "What is 15% of 80?",
    "chain_of_thought": "Convert percentage to decimal: 15% = 0.15. Multiply: 80 × 0.15 = 12",
    "final_answer": "12"
}

# Less ideal: Minimal reasoning
sample = {
    "question": "What is 15% of 80?",
    "final_answer": "12"
}

4. Customize System Prompts

The system prompt guides model behavior during training:

config = {
    "system_prompt": """Solve mathematical problems by:
1. Breaking down the problem
2. Showing all calculation steps
3. Providing the final numerical answer"""
}

5. Validate Your Output

Always check a few formatted samples before training:

formatted = formatter.format_dataset(raw_samples)

# Inspect first sample
print(json.dumps(formatted[0], indent=2))

# Validate all samples
valid_count = sum(1 for s in formatted if formatter.validate_output(s))
print(f"{valid_count}/{len(formatted)} samples are GRPO-compliant")

Troubleshooting Common Issues

Issue: Samples Being Filtered Out

Problem: Some samples don't appear in the formatted output.

Solution: Check validation errors:

for sample in raw_samples:
    if not formatter.validate(sample):
        print(f"Invalid sample: {sample}")

Common causes:

Missing required fields
Empty answers
Incompatible format

Issue: Answer Extraction Fails

Problem: validate_numerical rejects valid samples.

Solution: Check answer format:

# The regex expects numbers in the solution tags
# This works:
"<SOLUTION>42</SOLUTION>"

# This might fail:
"<SOLUTION>The answer is forty-two</SOLUTION>"

For non-numerical answers, disable validation:

config = {"validate_numerical": False}

Issue: Reasoning Not Preserved

Problem: Original reasoning is lost during formatting.

Solution: Ensure reasoning is in a recognized field:

# Recognized fields for reasoning:
- "chain_of_thought"
- "reasoning"
- "reasoning_trace"

# Not recognized:
- "explanation"
- "steps"
- "working"

Advanced: Custom Format Handling

If you have a unique dataset structure, you can extend the formatter:

from deepfabric.formatters.builtin.grpo import GrpoFormatter

class CustomGrpoFormatter(GrpoFormatter):
    def _format_custom_format(self, sample: dict) -> dict:
        """Handle custom dataset structure."""
        question = sample["problem_statement"]
        steps = sample["solution_steps"]  # List of steps
        answer = sample["correct_answer"]

        # Combine steps into reasoning
        reasoning = " ".join(steps)

        assistant_content = (
            f"{self.reasoning_start_tag}{reasoning}{self.reasoning_end_tag}"
            f"{self.solution_start_tag}{answer}{self.solution_end_tag}"
        )

        return {
            "messages": [
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": question},
                {"role": "assistant", "content": assistant_content}
            ]
        }

    def _format_single_sample(self, sample: dict) -> dict | None:
        # Try custom format first
        if "problem_statement" in sample and "solution_steps" in sample:
            return self._format_custom_format(sample)

        # Fall back to standard formats
        return super()._format_single_sample(sample)

Conclusion

DeepFabric's GRPO formatter provides a robust, flexible way to prepare datasets for reward-based optimization training. Key takeaways:

Multiple format support: Works with Q&A, CoT, conversations, and generic formats
Configurable tags: Adapt to any GRPO training pipeline
Built-in validation: Ensures quality and compliance
Numerical extraction: Enables reliable reward computation
Integration ready: Works seamlessly with DeepFabric's generation pipeline

Whether you're formatting existing datasets or generating new synthetic data, the GRPO formatter handles the complex transformations needed for successful mathematical reasoning training.

One Dataset, Many Formats: DeepFabric's Approach to Training Format Flexibility

Luke Hinds — Mon, 20 Oct 2025 06:44:21 +0000

The Format Problem in ML Training

Synthetic dataset generation for fine-tuning language models often leads to a format lock-in problem. After investing significant time generating high-quality synthetic data, teams discover their dataset is bound to a single training format. Experimenting with TRL's SFTTrainer requires one specific structure. Switching to GRPO for mathematical reasoning demands another. Testing OpenAI's Harmony format for reasoning-focused models needs yet another transformation.

Teams typically respond by regenerating datasets for each framework, writing custom conversion scripts that are brittle and hard to maintain, limiting themselves to a single framework, or maintaining multiple versions of the same dataset with associated storage and versioning complexity.

The DeepFabric Approach

DeepFabric addresses this through a format-agnostic workflow built on three principles. First, all datasets are generated in a universal storage format using the OpenAI messages standard in JSONL. Second, HuggingFace Hub integration enables sharing datasets once in their generic format. Third, on-demand formatting allows pulling and reformatting datasets for any training framework at runtime.

This architecture means datasets are generated once and uploaded to HuggingFace in a generic format where they can be shared with teams or the broader community. When needed, the dataset can be formatted for any training pipeline without regeneration, enabling experimentation with different frameworks using the same source data.
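
As a small illustration of how little the generic format asks of a consumer, the sketch below reads JSONL text and keeps records that carry a `messages` list. The function name `parse_messages_jsonl` is an assumption made for this example and is not part of DeepFabric.

```python
import json


def parse_messages_jsonl(text: str) -> list[dict]:
    """Parse JSONL text, keeping only records that carry a 'messages' list."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if isinstance(record, dict) and isinstance(record.get("messages"), list):
            records.append(record)
    return records
```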

The Complete Workflow

Step 1: Generate Your Dataset with Chain-of-Thought Reasoning and Custom Tools

Chain-of-thought (CoT) reasoning in tool-calling datasets teaches models to think through problems step-by-step before invoking functions. This approach significantly improves structured output quality in tool and MCP (Model Context Protocol) calling scenarios. When models explicitly reason about which tool to use, what parameters to provide, and why that tool is appropriate, they produce more accurate and contextually appropriate function calls.

DeepFabric supports custom tool definitions, allowing you to specify domain-specific functions that your model should learn to use. Let's create a configuration file that defines custom financial analysis tools and generates a dataset with chain-of-thought reasoning:

# financial_cot_config.yaml
dataset_system_prompt: "You are an expert financial analyst. When responding, first explain your reasoning step-by-step, then call the appropriate financial analysis tools with correct parameters."

topic_tree:
  topic_prompt: "Financial analysis and portfolio management with quantitative tools"
  topic_system_prompt: "You are an expert financial analyst creating comprehensive financial analysis scenarios."
  provider: "openai"
  model: "gpt-4o"
  degree: 4
  depth: 3
  temperature: 0.7
  save_as: "financial_topics.jsonl"

data_engine:
  instructions: "Generate realistic financial analysis scenarios with step-by-step reasoning before tool calls"
  generation_system_prompt: "You are an expert financial analyst. When responding, first explain your reasoning step-by-step, then call the appropriate financial analysis tools with correct parameters."
  provider: "openai"
  model: "gpt-4o"
  temperature: 0.8
  max_retries: 3
  conversation_template: "cot_structured"

  # Define custom financial analysis tools
  custom_tools:
    - name: "get_portfolio_performance"
      description: "Retrieves historical performance data for a portfolio including returns, holdings, and time series data"
      parameters:
        type: "object"
        properties:
          portfolio_id:
            type: "string"
            description: "Unique identifier for the portfolio"
          start_date:
            type: "string"
            description: "Start date in YYYY-MM-DD format"
          end_date:
            type: "string"
            description: "End date in YYYY-MM-DD format"
        required: ["portfolio_id", "start_date", "end_date"]

    - name: "calculate_sharpe_ratio"
      description: "Calculates the Sharpe ratio for risk-adjusted return analysis"
      parameters:
        type: "object"
        properties:
          returns:
            type: "array"
            items:
              type: "number"
            description: "Array of periodic returns"
          risk_free_rate:
            type: "number"
            description: "Annual risk-free rate as a decimal"
        required: ["returns", "risk_free_rate"]

    - name: "analyze_sector_exposure"
      description: "Analyzes portfolio exposure across different market sectors"
      parameters:
        type: "object"
        properties:
          portfolio_id:
            type: "string"
            description: "Unique identifier for the portfolio"
          date:
            type: "string"
            description: "Analysis date in YYYY-MM-DD format"
        required: ["portfolio_id", "date"]

dataset:
  creation:
    num_steps: 200
    batch_size: 4
    sys_msg: true
  save_as: "financial_cot_dataset.jsonl"

# Automatically upload to HuggingFace Hub
huggingface:
  repository: "your-org/financial-reasoning-cot"
  tags:
    - "finance"
    - "reasoning"
    - "chain-of-thought"
    - "tool-calling"
    - "synthetic"

Generate the dataset and upload it to HuggingFace in one step:

export HF_TOKEN="your-hf-token"
deepfabric generate financial_cot_config.yaml

The generated dataset contains examples where the model reasons through financial problems before making tool calls. This generic format includes the reasoning chain, tool definitions, and the conversation structure:

{
  "messages": [
    {
      "role": "user",
      "content": "I need to analyze the risk-adjusted return of my portfolio over the last year"
    },
    {
      "role": "assistant",
      "content": "Let me work through this systematically. To calculate risk-adjusted returns, I need three key metrics: the portfolio's actual return, the risk-free rate for comparison, and the portfolio's volatility (standard deviation). The Sharpe ratio is the standard metric for this analysis, calculated as (portfolio return - risk-free rate) / portfolio standard deviation.\n\nFirst, I should retrieve your portfolio's historical performance data for the past year. Then I'll need to fetch the current risk-free rate, typically using the 1-year Treasury yield. Finally, I'll calculate the standard deviation of returns and compute the Sharpe ratio.\n\nLet me start by getting your portfolio performance data for the specified period."
    }
  ],
  "available_tools": [
    {
      "name": "get_portfolio_performance",
      "description": "Retrieves historical performance data for a portfolio including returns, holdings, and time series data",
      "parameters": {
        "type": "object",
        "properties": {
          "portfolio_id": {"type": "string", "description": "Unique identifier for the portfolio"},
          "start_date": {"type": "string", "description": "Start date in YYYY-MM-DD format"},
          "end_date": {"type": "string", "description": "End date in YYYY-MM-DD format"}
        },
        "required": ["portfolio_id", "start_date", "end_date"]
      }
    },
    {
      "name": "calculate_sharpe_ratio",
      "description": "Calculates the Sharpe ratio for risk-adjusted return analysis",
      "parameters": {
        "type": "object",
        "properties": {
          "returns": {"type": "array", "items": {"type": "number"}, "description": "Array of periodic returns"},
          "risk_free_rate": {"type": "number", "description": "Annual risk-free rate as a decimal"}
        },
        "required": ["returns", "risk_free_rate"]
      }
    }
  ]
}

This format is framework-agnostic. It contains all necessary information including the chain-of-thought reasoning, tool schemas, and conversation flow, but isn't locked to any specific training library. The reasoning component teaches the model to think through domain-specific problems, improving both the quality of tool selection and parameter accuracy in production use.
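
For reference, the Sharpe ratio formula quoted in the sample response can be written in a few lines. This is a sketch and not the `calculate_sharpe_ratio` tool itself: it assumes the risk-free rate is expressed for the same period as the returns, whereas the tool schema above takes an annual rate.

```python
import statistics


def sharpe_ratio(returns: list[float], risk_free_rate: float) -> float:
    """(mean return - risk-free rate) / sample standard deviation of the returns."""
    if len(returns) < 2:
        raise ValueError("need at least two returns to compute a standard deviation")
    volatility = statistics.stdev(returns)
    if volatility == 0:
        raise ValueError("returns have zero volatility")
    return (statistics.mean(returns) - risk_free_rate) / volatility
```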

DeepFabric automatically handles the HuggingFace upload, repository creation, dataset card generation with appropriate metadata, and tag application for discoverability. The key insight is that you're uploading the generic format, not a training-specific format. This single upload serves all downstream use cases across different training frameworks.

Step 2: Pull and Format for Your Training Pipeline

Anyone with access to the dataset can now pull it and format it for their specific training framework. The same source data can be reformatted multiple times for different purposes without regeneration.

Training with TRL SFTTrainer for Tool Calling

TRL's SFTTrainer expects tools in an OpenAI-compatible schema with explicit function definitions. The chain-of-thought reasoning is preserved in the assistant's response, teaching the model to think through the problem before tool invocation:

import subprocess
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Format the dataset for TRL
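# check=True raises if the CLI exits with an error, so training never starts on a missing file
subprocess.run([
    "deepfabric", "format",
    "--repo", "your-org/financial-reasoning-cot",
    "--formatter", "trl",
    "-o", "training_trl.jsonl"
], check=True)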

# Load the formatted dataset
dataset = load_dataset("json", data_files="training_trl.jsonl")

# Train with TRL - the reasoning helps the model learn better tool selection
config = SFTConfig(output_dir="./output")
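trainer = SFTTrainer(
    # Any causal LM works here; this model ID matches the full example later in the post
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    args=config,
    train_dataset=dataset["train"],
)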
trainer.train()

This converts the generic dataset to TRL's specific structure:

{
  "messages": [
    {
      "role": "user",
      "content": "I need to analyze the risk-adjusted return of my portfolio over the last year"
    },
    {
      "role": "assistant",
      "content": "Let me work through this systematically. To calculate risk-adjusted returns, I need three key metrics: the portfolio's actual return..."
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_portfolio_performance",
        "description": "Retrieves historical performance data for a portfolio including returns, holdings, and time series data",
        "parameters": {
          "type": "object",
          "properties": {
            "portfolio_id": {"type": "string", "description": "Unique identifier for the portfolio"},
            "start_date": {"type": "string", "description": "Start date in YYYY-MM-DD format"},
            "end_date": {"type": "string", "description": "End date in YYYY-MM-DD format"}
          },
          "required": ["portfolio_id", "start_date", "end_date"]
        }
      }
    }
  ]
}
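
Comparing the two JSON samples, the visible change is that each entry in `available_tools` gets wrapped in a `{"type": "function", "function": ...}` envelope under a `tools` key. A minimal sketch of that transformation follows. The function name `to_trl_tools` is an assumption, and the real formatter may do more than this:

```python
def to_trl_tools(sample: dict) -> dict:
    """Wrap each generic tool definition in the OpenAI-style function envelope."""
    # Copy everything except the generic tool list, leaving the input untouched.
    converted = {key: value for key, value in sample.items() if key != "available_tools"}
    converted["tools"] = [
        {"type": "function", "function": tool}
        for tool in sample.get("available_tools", [])
    ]
    return converted
```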

Training with GRPO for Mathematical Reasoning

GRPO (Group Relative Policy Optimization) requires explicit reasoning tags to separate the working-out from the final solution. This format works particularly well with the chain-of-thought data:

deepfabric format \
  --repo your-org/financial-reasoning-cot \
  --formatter grpo \
  -o training_grpo.jsonl

The formatter wraps reasoning in tags that GRPO uses for reward modeling:

{
  "messages": [
    {"role": "system", "content": "You are a financial analysis assistant that shows your reasoning before providing solutions."},
    {"role": "user", "content": "I need to analyze the risk-adjusted return of my portfolio over the last year"},
    {
      "role": "assistant",
      "content": "<start_working_out>To calculate risk-adjusted returns, I need three key metrics: the portfolio's actual return, the risk-free rate for comparison, and the portfolio's volatility. The Sharpe ratio is calculated as (portfolio return - risk-free rate) / portfolio standard deviation. I should retrieve the portfolio's historical performance data, fetch the current risk-free rate using the 1-year Treasury yield, calculate the standard deviation of returns, then compute the Sharpe ratio.<end_working_out><SOLUTION>Retrieve portfolio performance data using get_portfolio_performance, then calculate Sharpe ratio with the returns and current risk-free rate.</SOLUTION>"
    }
  ]
}

The explicit separation of reasoning from solution helps GRPO training optimize for both correct thinking processes and accurate final answers.
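
To show why the tags matter for reward computation, here is a small sketch that builds the tagged content and splits it back apart. The tag strings come from the example above. The helper names `wrap_grpo` and `split_grpo` are hypothetical, and a real reward function would go on to score the extracted solution.

```python
import re
from typing import Optional, Tuple

REASONING_START, REASONING_END = "<start_working_out>", "<end_working_out>"
SOLUTION_START, SOLUTION_END = "<SOLUTION>", "</SOLUTION>"

FORMAT_RE = re.compile(
    re.escape(REASONING_START) + r"(.+?)" + re.escape(REASONING_END)
    + re.escape(SOLUTION_START) + r"(.+?)" + re.escape(SOLUTION_END),
    re.DOTALL,
)


def wrap_grpo(reasoning: str, solution: str) -> str:
    """Build assistant content with the reasoning and solution in separate tags."""
    return f"{REASONING_START}{reasoning}{REASONING_END}{SOLUTION_START}{solution}{SOLUTION_END}"


def split_grpo(content: str) -> Optional[Tuple[str, str]]:
    """Return (reasoning, solution) if the content follows the tag layout, else None."""
    match = FORMAT_RE.fullmatch(content.strip())
    return (match.group(1), match.group(2)) if match else None
```

A format reward can then be as simple as checking whether `split_grpo` returns something other than `None`.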

Training with Harmony for OpenAI Reasoning Models

OpenAI's Harmony format (used in gpt-oss and reasoning-focused models) uses specific XML-style tags for reasoning chains:

deepfabric format \
  --repo your-org/financial-reasoning-cot \
  --formatter harmony \
  -o training_harmony.jsonl

Output structure:

{
  "messages": [
    {"role": "user", "content": "I need to analyze the risk-adjusted return of my portfolio over the last year"},
    {
      "role": "assistant",
      "content": "<reasoning>\nTo calculate risk-adjusted returns, I need three key metrics: portfolio return, risk-free rate, and volatility. The Sharpe ratio is the standard metric: (return - risk_free_rate) / std_dev.\n\nI need to:\n1. Retrieve portfolio historical data\n2. Get current risk-free rate\n3. Calculate standard deviation\n4. Compute Sharpe ratio\n</reasoning>\n<output>I'll retrieve your portfolio performance data for the past year and calculate the Sharpe ratio for risk-adjusted return analysis.</output>"
    }
  ]
}

Training with ChatML for Reasoning-Capable Chat Models

For models that expect ChatML delimiters with preserved reasoning:

deepfabric format \
  --repo your-org/financial-reasoning-cot \
  --formatter im_format \
  -o training_chatml.jsonl

The formatter wraps everything in ChatML tags while maintaining the reasoning flow:

{
  "text": "<|im_start|>user\nI need to analyze the risk-adjusted return of my portfolio over the last year<|im_end|>\n<|im_start|>assistant\nLet me work through this systematically. To calculate risk-adjusted returns, I need three key metrics: the portfolio's actual return, the risk-free rate for comparison, and the portfolio's volatility...<|im_end|>"
}
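
The layout of that `text` field is simple enough to sketch by hand. The function below is an illustration under the assumption that each message becomes one delimited block joined by newlines, as in the sample above. `to_chatml` is a hypothetical name, not the formatter's API.

```python
def to_chatml(messages: list[dict]) -> dict:
    """Join messages into a single ChatML-delimited string under a 'text' key."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>"
        for message in messages
    ]
    return {"text": "\n".join(parts)}
```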

Why Chain-of-Thought Improves Tool Calling with Custom Tools

The chain-of-thought approach combined with custom tool definitions provides several concrete benefits for model training. Models trained with explicit reasoning learn to validate their tool selection before making calls, reducing errors where inappropriate functions are invoked. The reasoning chain provides context for parameter selection, leading to more accurate argument values. When models explain their approach, they're more likely to catch edge cases and error conditions before execution.

Custom tools allow you to define domain-specific functions that match your production environment. For example, the financial analysis tools defined in the configuration above (get_portfolio_performance, calculate_sharpe_ratio, analyze_sector_exposure) teach the model to work with your specific API schema. When fine-tuning models like SmolLM2-1.7B-Instruct with PEFT/LoRA on this custom tool dataset, the model learns both the reasoning patterns and the exact tool signatures it will encounter in production.

Here's a practical training example using the formatted dataset:

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
import subprocess

# Format the dataset for TRL
# check=True raises if the CLI exits with an error
subprocess.run([
    "deepfabric", "format",
    "--repo", "your-org/financial-reasoning-cot",
    "--formatter", "trl",
    "-o", "trl_sft_tools.jsonl"
], check=True)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

# Configure LoRA for efficient fine-tuning
peft_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=['down_proj', 'o_proj', 'k_proj', 'q_proj', 'gate_proj', 'up_proj', 'v_proj'],
    use_dora=True,
    init_lora_weights="gaussian"
)

# Apply PEFT model adaptation
peft_model = get_peft_model(model, peft_config)

# Load dataset
dataset = load_dataset("json", data_files="./trl_sft_tools.jsonl")

# Configure training
training_args = SFTConfig(
    output_dir="financial-reasoning-model",
    optim="adamw_torch_fused",
    bf16=True,
    push_to_hub=True,
    report_to="none"
)

# Train
trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,
)

trainer.train()

The resulting model learns to reason through financial problems using the exact custom tools you defined, so its tool calls match the signatures your production environment expects.

Multi-Format Conversion Script

Here's a practical script demonstrating conversion of one chain-of-thought dataset to multiple formats:

#!/bin/bash
# multi-format-reasoning.sh
set -euo pipefail  # stop at the first failed conversion

REPO="your-org/financial-reasoning-cot"
BASE_NAME="financial_reasoning"

echo "Converting reasoning dataset to multiple formats..."

# Format for TRL SFTTrainer with tool calling
echo "Formatting for TRL SFTTrainer..."
deepfabric format --repo "$REPO" --formatter trl -o "${BASE_NAME}_trl.jsonl"

# Format for GRPO reasoning training
echo "Formatting for GRPO..."
deepfabric format --repo "$REPO" --formatter grpo -o "${BASE_NAME}_grpo.jsonl"

# Format for Harmony (OpenAI reasoning models)
echo "Formatting for Harmony..."
deepfabric format --repo "$REPO" --formatter harmony -o "${BASE_NAME}_harmony.jsonl"

# Format for ChatML reasoning chat models
echo "Formatting for ChatML..."
deepfabric format --repo "$REPO" --formatter im_format -o "${BASE_NAME}_chatml.jsonl"

# Format for XLAM v2 multi-turn reasoning
echo "Formatting for XLAM v2..."
deepfabric format --repo "$REPO" --formatter xlam_v2 -o "${BASE_NAME}_xlam.jsonl"

echo "Conversion complete. Created 5 training-ready formats from a single dataset."

Running this script produces TRL-ready format with OpenAI-compatible tool schemas, GRPO format with explicit reasoning/solution separation, Harmony format with reasoning tags for OpenAI models, ChatML format with delimiter-wrapped reasoning, and XLAM v2 format for Salesforce's multi-turn tool calling framework. All formats preserve the chain-of-thought reasoning that improves tool calling accuracy.
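
If you would rather drive the conversions from Python, the same five commands can be built as argument lists. This sketch uses only the flags shown in the script (`--repo`, `--formatter`, `-o`). The helper name `build_format_commands` and the `FORMATTERS` mapping are assumptions for illustration.

```python
# Maps each formatter flag to the filename suffix used in the shell script.
FORMATTERS = {
    "trl": "trl",
    "grpo": "grpo",
    "harmony": "harmony",
    "im_format": "chatml",
    "xlam_v2": "xlam",
}


def build_format_commands(repo: str, base_name: str, formatters: dict[str, str]) -> list[list[str]]:
    """Build one `deepfabric format` argument list per formatter."""
    return [
        ["deepfabric", "format", "--repo", repo, "--formatter", formatter,
         "-o", f"{base_name}_{suffix}.jsonl"]
        for formatter, suffix in formatters.items()
    ]


commands = build_format_commands("your-org/financial-reasoning-cot", "financial_reasoning", FORMATTERS)
```

Each entry in `commands` can then be passed to `subprocess.run(command, check=True)`.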

Supported Formatters

DeepFabric currently supports formatters for major training frameworks:

Formatter	Command Flag	Use Case	Framework
TRL SFT Tools	`trl`	Tool/function calling	HuggingFace TRL
GRPO	`grpo`	Mathematical/logical reasoning	GRPO training
Harmony	`harmony`	Reasoning with XML tags	OpenAI gpt-oss
ChatML	`chatml`	Chat models with structure	ChatML-compatible
Im Format	`im_format`	Chat with delimiters	ChatML variants
Alpaca	`alpaca`	Instruction following	Stanford Alpaca
XLAM v2	`xlam_v2`	Multi-turn tool calling	Salesforce xLAM
Tool Calling	`tool_calling`	Generic tool calling	Agent training
Single Tool	`single_tool_call`	Individual function calls	Function execution

Custom formatters can be created for specific needs using DeepFabric's formatter API.

Advanced Configuration

Fine-grained control over formatting is available through YAML configuration files:

# formatter_config.yaml
dataset:
  formatters:
    - name: "trl_custom"
      template: "builtin://trl_sft_tools"
      output: "custom_trl.jsonl"
      config:
        include_system_prompt: true
        system_prompt_override: |
          You are a function calling AI model. You are provided with function
          signatures within <tools></tools> XML tags. Think through your
          reasoning before invoking tools.
        validate_tool_schemas: true
        remove_available_tools_field: true

Apply with:

deepfabric format --repo your-org/dataset -c formatter_config.yaml
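
The `validate_tool_schemas` option suggests some structural checking of tool definitions. What DeepFabric checks is not spelled out here, so the sketch below is only a guess at the kind of validation that is useful before training. The function `tool_schema_errors` is hypothetical.

```python
def tool_schema_errors(tool: dict) -> list[str]:
    """Collect structural problems in a tool definition; an empty list means none were found."""
    errors = []
    for field in ("name", "description", "parameters"):
        if field not in tool:
            errors.append(f"missing field: {field}")
    params = tool.get("parameters", {})
    if params.get("type") != "object":
        errors.append("parameters.type must be 'object'")
    properties = params.get("properties", {})
    # Every required parameter should also be described under properties.
    for name in params.get("required", []):
        if name not in properties:
            errors.append(f"required parameter not defined: {name}")
    return errors
```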

Benefits of This Approach

Storage efficiency improves because teams store one canonical dataset and generate formats on-demand rather than maintaining multiple versions. Collaboration becomes simpler when datasets are shared once in a universal format but consumed in framework-specific formats, eliminating coordination overhead. Experimentation costs drop dramatically since trying different frameworks no longer requires expensive regeneration. Reproducibility improves with clear lineage from source to formatted data and version control at the dataset level. Cost savings accumulate because expensive synthetic data generation happens once while cheap reformatting happens as needed.

The approach is future-proof. When new training frameworks emerge, adding a formatter is straightforward. Datasets remain relevant as frameworks and tools evolve, with no regeneration required.

Community Impact

This architecture enables new collaboration patterns. Dataset creators can focus on data quality rather than format compatibility, reach wider audiences with a single upload, and receive feedback more easily. Dataset users find more usable datasets, can experiment with different frameworks without risk, and can contribute formatters back to the community. Research teams can compare approaches fairly using identical source data, reproduce results more reliably, and build on each other's work more effectively.

Getting Started

Install DeepFabric:

pip install deepfabric

Create a configuration file with your custom tools and chain-of-thought settings:

# my_cot_config.yaml
dataset_system_prompt: "You are an expert software engineer. Explain your reasoning step-by-step before using debugging tools."

topic_tree:
  topic_prompt: "Software debugging with developer tools"
  provider: "openai"
  model: "gpt-4o"
  degree: 3
  depth: 3
  temperature: 0.7
  save_as: "debugging_topics.jsonl"

data_engine:
  instructions: "Generate realistic debugging scenarios with step-by-step reasoning"
  generation_system_prompt: "You are an expert software engineer. Explain your reasoning step-by-step before using debugging tools."
  provider: "openai"
  model: "gpt-4o"
  temperature: 0.8
  conversation_template: "cot_structured"

  custom_tools:
    - name: "run_debugger"
      description: "Executes a debugger command and returns the output"
      parameters:
        type: "object"
        properties:
          command:
            type: "string"
            description: "The debugger command to execute"
          breakpoint:
            type: "integer"
            description: "Line number for breakpoint"
        required: ["command"]

dataset:
  creation:
    num_steps: 50
    batch_size: 2
  save_as: "my_dataset.jsonl"

huggingface:
  repository: "your-username/debugging-reasoning"
  tags:
    - "software-engineering"
    - "reasoning"
    - "chain-of-thought"
    - "tools"

Generate and upload in one step:

export HF_TOKEN="your-token"
deepfabric generate my_cot_config.yaml

Format for your training pipeline:

deepfabric format \
  --repo your-username/debugging-reasoning \
  --formatter trl \
  -o training_data.jsonl

Then train using your preferred framework as shown in the training example above.

Working with Existing Datasets

You can format existing community datasets without generating new ones:

# Format a community dataset for TRL
deepfabric format \
  --repo lukehinds/smol-test-sample \
  --formatter trl \
  -o trl_training.jsonl

# Or for GRPO reasoning training
deepfabric format \
  --repo lukehinds/smol-test-sample \
  --formatter grpo \
  -o grpo_training.jsonl

Conclusion

DeepFabric's format-agnostic architecture decouples data generation from training format requirements. Datasets are generated once in a universal format, shared through HuggingFace Hub, and reformatted on-demand for specific training frameworks. This approach reduces storage requirements, eliminates coordination overhead, enables cost-effective experimentation, and ensures datasets remain useful as the ecosystem evolves.

The integration of chain-of-thought reasoning with tool calling further improves this workflow by teaching models to reason through problems before invoking functions, resulting in more accurate tool selection and parameter generation in production environments.

Resources

Documentation: https://lukehinds.github.io/deepfabric/

GitHub: https://github.com/lukehinds/deepfabric

Discord Community: https://discord.gg/pPcjYzGvbS

Examples: https://github.com/lukehinds/deepfabric/tree/main/examples

DeepFabric is open source and welcomes contributions. To add support for a new training format, see the custom formatter guide.

Build High Quality Reasoning Datasets in Minutes Using Natural Prompts

Luke Hinds — Wed, 24 Sep 2025 21:03:45 +0000

As developers, we've all been there. You have a brilliant idea for a reasoning model you want to fine-tune, but then reality hits: you need training data. Quality training data. A lot of quality training data.

The traditional path? Weeks of manual data curation and complex prompt engineering, producing output that looks good at first but turns out to be full of duplicates with very little diversity. You have tried a few of the open source tools, but none of them cut it. Most of us end up abandoning the project or settling for subpar datasets that produce mediocre models.

What if I told you there's a tool that can generate professional-grade CoT datasets in minutes using natural language prompts?

Enter DeepFabric - and it's about to change how you think about dataset creation.

The Problem: Dataset Creation is Broken

Before DeepFabric, creating CoT datasets meant:

📝 Manual curation: Spending days writing examples by hand
🔧 Complex prompt engineering: Wrestling with intricate templates
💸 Expensive services: Paying premium rates for quality data
🎯 Limited diversity: Struggling to create varied, non-repetitive examples
⚖️ Quality vs. quantity: Choosing between good data or enough data

Most developers either gave up or shipped models trained on insufficient data.

The Solution: DeepFabric's Triple Threat

DeepFabric doesn't just solve the dataset problem - it obliterates it with three different CoT formats that cover every use case:

1. 🔥 Free-text CoT (GSM8K Style)

Perfect for mathematical reasoning and step-by-step problem solving.

deepfabric generate \
  --mode tree \
  --provider openai \
  --model gpt-4o-mini \
  --depth 2 \
  --degree 2 \
  --num-steps 4 \
  --topic-prompt "Mathematical word problems and logical reasoning" \
  --generation-system-prompt "You are a math tutor creating educational problems" \
  --conversation-type cot_freetext \
  --dataset-save-as math_reasoning.jsonl

Output format:

{
  "question": "Sarah has 24 apples. She gives away 1/3 to her neighbors and keeps 1/4 for herself. How many apples are left?",
  "chain_of_thought": "First, I need to find 1/3 of 24 apples. 24 ÷ 3 = 8 apples given to neighbors. Next, I need to find 1/4 of 24 apples. 24 ÷ 4 = 6 apples kept for herself. Total apples used: 8 + 6 = 14 apples. Apples left: 24 - 14 = 10 apples.",
  "final_answer": "10 apples"
}
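
Because free-text CoT samples carry their arithmetic in prose, it is worth spot-checking a few. The sketch below reproduces the apple example with exact fractions. The helper `remaining_after_shares` is made up for this check.

```python
from fractions import Fraction


def remaining_after_shares(total: int, shares: list[Fraction]) -> Fraction:
    """Subtract each fractional share of the original total from the total."""
    return total - sum(total * share for share in shares)


given_away = 24 * Fraction(1, 3)  # apples given to neighbors
kept = 24 * Fraction(1, 4)        # apples kept for herself
left = remaining_after_shares(24, [Fraction(1, 3), Fraction(1, 4)])
```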

2. 🏗️ Structured CoT (Conversation Based)

Ideal for educational dialogues and systematic problem-solving.

deepfabric generate \
  --mode graph \
  --provider ollama \
  --model qwen3:32b \
  --topic-prompt "Computer science algorithms and data structures" \
  --conversation-type cot_structured \
  --reasoning-style logical \
  --dataset-save-as cs_reasoning.jsonl

Output format:

{
  "messages": [
    {"role": "user", "content": "How would you implement a binary search algorithm?"},
    {"role": "assistant", "content": "I'll walk you through implementing binary search step by step..."}
  ],
  "reasoning_trace": [
    {"step": 1, "reasoning": "Define the search space with left and right pointers"},
    {"step": 2, "reasoning": "Calculate middle index to divide the array"},
    {"step": 3, "reasoning": "Compare target with middle element"}
  ],
  "final_answer": "Binary search works by repeatedly dividing the search interval in half. Start with two pointers: `left = 0` and `right = len(arr) - 1`. While `left <= right`, compute `mid = (left + right) // 2`. If `arr[mid] == target`, return `mid`. If `arr[mid] < target`, move `left = mid + 1`; otherwise move `right = mid - 1`. If the loop ends without finding the target, return -1."
}
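
The `final_answer` in this sample describes the algorithm closely enough to transcribe directly. Here is that description as runnable Python, included as a sanity check on the generated text and not as something DeepFabric outputs:

```python
def binary_search(arr: list[int], target: int) -> int:
    """Return the index of target in the sorted list arr, or -1 if it is absent."""
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            left = mid + 1   # target can only be in the right half
        else:
            right = mid - 1  # target can only be in the left half
    return -1
```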

3. 🚀 Hybrid CoT (Best of Both Worlds)

Combines natural reasoning with structured steps - perfect for complex domains.

deepfabric generate \
  --provider gemini \
  --model gemini-2.5-flash \
  --topic-prompt "Infrastructure outages and their resolution" \
  --conversation-type cot_hybrid \
  --num-steps 8 \
  --dataset-save-as science_hybrid.jsonl

Output format:

{
  "question": "A critical production service, 'Order Processing,' is experiencing intermittent failures. Users report orders are occasionally failing to process, leading to lost revenue. The service runs on Kubernetes, utilizing a deployment of 10 pods. Initial monitoring shows CPU and memory usage are within acceptable limits, but error logs indicate sporadic database connection timeouts. You suspect a resource contention issue or a misconfiguration somewhere in the infrastructure. Describe your systematic approach to diagnose and resolve this issue, detailing the tools and techniques you'd employ at each step.",
  "chain_of_thought": "To systematically troubleshoot this intermittent failure, I will follow a structured approach. First, I will gather more information about the failures, including frequency, error messages, and affected users. Then, I will examine the Kubernetes deployment, focusing on resource limits, liveness/readiness probes, and networking. Next, I will investigate the database connection pool and its configuration. I will also analyze the application code for potential resource leaks or inefficient database queries. Finally, I will implement monitoring and alerting to proactively detect and prevent future issues.",
  "reasoning_trace": [
    {
      "step_number": 1,
      "thought": "Gather detailed information about the failures. This includes the exact error messages, timestamps, affected users, and any patterns in the failures (e.g., time of day, specific order types).",
      "action": "Analyze application logs, Kubernetes events, and user reports to collect failure data."
    },
    {
      "step_number": 2,
      "thought": "Examine the Kubernetes deployment configuration. Verify resource limits (CPU, memory) are appropriately set for the pods. Check liveness and readiness probes to ensure pods are healthy and responsive. Inspect networking configuration to rule out connectivity issues.",
      "action": "Use `kubectl describe deployment order-processing` and `kubectl get pods -o wide` to inspect the deployment and pod configurations. Check network policies and service definitions."
    },
    {
      "step_number": 3,
      "thought": "Investigate the database connection pool configuration. Verify the maximum number of connections is sufficient for the workload. Check for connection leaks or inefficient connection management in the application code.",
      "action": "Examine the application's database connection pool settings. Use database monitoring tools to track connection usage and identify potential leaks. Analyze application code for database query performance and connection handling."
    },
    {
      "step_number": 4,
      "thought": "Analyze application code for potential resource leaks or inefficient database queries. Use profiling tools to identify performance bottlenecks and memory leaks. Optimize database queries to reduce load on the database.",
      "action": "Use profiling tools like `pprof` or Java profilers to analyze application performance. Examine database query logs and use `EXPLAIN` to optimize slow queries."
    },
    {
      "step_number": 5,
      "thought": "Implement enhanced monitoring and alerting. Set up alerts for database connection errors, high latency, and resource utilization. Use dashboards to visualize key metrics and identify trends.",
      "action": "Configure Prometheus and Grafana to monitor database connections, latency, and resource usage. Set up alerts in Alertmanager to notify on-call engineers of critical issues."
    },
    {
      "step_number": 6,
      "thought": "Consider implementing a circuit breaker pattern to prevent cascading failures. If the database becomes unavailable, the circuit breaker will prevent the application from overwhelming the database with requests.",
      "action": "Implement a circuit breaker using a library like Hystrix or Resilience4j in the application code."
    },
    {
      "step_number": 7,
      "thought": "Implement connection retry logic with exponential backoff. This will allow the application to gracefully handle temporary database connection issues.",
      "action": "Implement retry logic using a library like Guava Retry or Spring Retry in the application code."
    },
    {
      "step_number": 8,
      "thought": "Conduct load testing to simulate production traffic and identify potential bottlenecks. Use load testing tools like JMeter or Gatling to generate realistic workloads.",
      "action": "Run load tests with increasing traffic to identify the point at which the database connection timeouts occur."
    },
    {
      "step_number": 9,
      "thought": "If the database is the bottleneck, consider scaling the database or optimizing its configuration. This may involve adding more resources to the database server, tuning database parameters, or implementing database sharding.",
      "action": "Scale the database vertically (add more CPU, memory) or horizontally (add more database replicas). Tune database parameters like `max_connections` and `shared_buffers`."
    },
    {
      "step_number": 10,
      "thought": "Document the troubleshooting steps and the resolution. Create a runbook for future incidents. Share the knowledge with the team to prevent similar issues in the future.",
      "action": "Create a detailed runbook documenting the troubleshooting steps, the root cause, and the resolution. Share the runbook with the team and conduct a post-mortem analysis."
    }
  ],
  "final_answer": "The intermittent failures are likely due to database connection timeouts caused by resource contention or misconfiguration. The systematic approach involves gathering detailed information, examining the Kubernetes deployment, investigating the database connection pool, analyzing application code, implementing enhanced monitoring and alerting, and considering circuit breakers and retry logic. Load testing and database scaling may also be necessary. Finally, documenting the troubleshooting steps and sharing the knowledge with the team is crucial."
}

Why Developers Are Going Crazy for DeepFabric

🧠 Smart Topic Generation

DeepFabric doesn't just generate random examples. It first builds a hierarchical topic tree (or topic graph), so your dataset covers diverse subtopics without redundancy:

Mathematical Reasoning
├── Algebra Problems
│   ├── Linear Equations
│   └── Quadratic Functions
└── Geometry Problems
    ├── Area Calculations
    └── Volume Problems
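
To make the tree-first idea concrete, here is a minimal Python sketch of a topic tree and a walk over its leaf paths. The `TopicNode` class and the `leaf_paths` helper are hypothetical names for illustration, not DeepFabric's API, and in practice the subtopics would be proposed by an LLM rather than hard-coded.

```python
from dataclasses import dataclass, field


@dataclass
class TopicNode:
    """One node in a topic tree (hypothetical structure, not DeepFabric's)."""
    name: str
    children: list["TopicNode"] = field(default_factory=list)


def leaf_paths(node, prefix=()):
    """Yield the full root-to-leaf path for every leaf topic."""
    path = prefix + (node.name,)
    if not node.children:
        yield path
        return
    for child in node.children:
        yield from leaf_paths(child, path)


tree = TopicNode("Mathematical Reasoning", [
    TopicNode("Algebra Problems", [
        TopicNode("Linear Equations"),
        TopicNode("Quadratic Functions"),
    ]),
    TopicNode("Geometry Problems", [
        TopicNode("Area Calculations"),
        TopicNode("Volume Problems"),
    ]),
])

for path in leaf_paths(tree):
    print(" > ".join(path))
```

Each leaf path could then seed one generation prompt, which is one plausible way to keep samples spread across subtopics instead of clustering around the obvious ones.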

🔧 YAML Configuration = Zero Complexity

No more complex prompt engineering. Just describe what you want:

# cot_config.yaml
dataset_system_prompt: "You are a helpful AI that solves problems step-by-step"

topic_tree:
  topic_prompt: "Programming challenges and algorithms"
  provider: "ollama"
  model: "qwen3:32b"
  depth: 3
  degree: 3

data_engine:
  conversation_type: "cot_hybrid"
  reasoning_style: "logical"
  instructions: "Create coding problems that require systematic thinking"

dataset:
  creation:
    num_steps: 50
    batch_size: 5

Then run: deepfabric generate cot_config.yaml

🌐 Multi-Provider Freedom

Switch between providers based on your needs:

OpenAI GPT-4 for complex reasoning
Ollama for local, private generation
Gemini for fast bulk creation
Anthropic Claude for nuanced problems

📤 Instant HuggingFace Integration

deepfabric generate config.yaml --hf-repo username/my-cot-dataset

Your dataset is automatically uploaded with a generated dataset card. No manual uploads, no fuss.

Real-World Impact: What Developers Are Building

🎓 Educational AI: Teachers creating personalized math tutoring datasets
🤖 Agent Training: Developers building reasoning agents for complex tasks
📊 Research: ML researchers generating evaluation benchmarks
💼 Enterprise: Companies creating domain-specific reasoning models

The Numbers Don't Lie

⏱️ 95% faster than manual dataset creation
📈 10x more diverse examples per domain
💰 80% cost reduction compared to data labeling services
🎯 Zero prompt engineering required

Ready to Transform Your ML Pipeline?

Getting started takes literally 30 seconds:

# Install
pip install deepfabric

# Generate your first CoT dataset
deepfabric generate \
  --topic-prompt "Your domain here" \
  --conversation-type cot_freetext \
  --num-steps 10 \
  --provider openai \
  --model gpt-4o-mini

# Watch the magic happen ✨

What's Next?

The ML community is moving fast, and quality training data is the bottleneck. DeepFabric removes that bottleneck entirely.

Whether you're building the next breakthrough in reasoning AI or just need better training data for your side project, DeepFabric gives you superpowers.

Stop spending weeks on dataset creation. Start building better models today.

Try DeepFabric Now:

📚 GitHub: https://github.com/lukehinds/deepfabric
📖 Documentation: https://lukehinds.github.io/DeepFabric/
💬 Discord: Join the community for support and sharing datasets

What kind of CoT dataset will you build first? Drop a comment and let's discuss! 🚀

Tags: #MachineLearning #AI #Datasets #ChainOfThought #Python #OpenSource #MLOps #DataScience #DeepLearning #ArtificialIntelligence

🚀 Build a ☸️ Kubernetes ☸️ SRE Agent in Just Ten Minutes! ⏱️⚡

Luke Hinds — Tue, 02 Sep 2025 09:42:55 +0000

Hey dev.to community! Have you ever wished you could automate more of your Kubernetes operational tasks, freeing up valuable SRE time? I recently made a video, "Building an AgentUp Kubernetes Agent in only Ten Minutes," that demonstrates how to do just that with incredible speed and efficiency using AgentUp. This isn't just about scripting; it's about creating an autonomous, goal-driven agent that can understand, plan, and execute complex Kubernetes operations.

Let's dive into how the new and exciting AgentUp framework makes this possible and what makes this demonstration so compelling for developers.

Getting Started: A New Agent in a Flash

The video kicks off by showing just how easy it is to initialize a new agent using the agentup init command. The key choices made set the stage for a powerful automation tool:

Iterative Agent: This is the core of its intelligence – a goal-driven autonomous agent capable of planning, decomposing tasks, and iterating through them until a goal is achieved. The demo generously allocates 50 iterations, showing its capacity for complex problem-solving.
Essential Components: The initialization includes authentication, state management, an AI provider (OpenAI in this case), MCP (Model Context Protocol), and deployment options. For simplicity in the demo, a simple API key is used for authentication, with state persisted to a file.

The Brain of AgentUp: `agentup.yaml`

Once initialized, the agentup.yaml file becomes the central configuration point – effectively, the "brain of AgentUp". What's particularly appealing for developers here is its declarative runtime, allowing you to set everything in a single point of truth. This configuration is designed to be portable and easily shared or stored in a repository.

A critical part of the configuration involves the MCP (Model Context Protocol), where we use the Kubernetes MCP Server from the containers team. In AgentUp, MCP tools are exposed as A2A (Agent-to-Agent) skills, making them discoverable capabilities. This means one agent can query another to find out if it possesses the necessary skills, for example, to list pods or check cluster state.

Robust Security with Scopes

One of the most impressive aspects for a production environment is AgentUp's approach to security. Each tool available within the Kubernetes MCP server is mapped to a scope, which acts as a security permission.

The demo uses a wildcard scope for simplicity, granting admin access.
However, the system supports a top-down scope hierarchy, allowing for highly granular control over what the MCP server presents to the model.
Crucially, the model is not even aware of tools it has no scope for, and any attempt to use them would be denied. This ensures that your agent operates only within its designated permissions. While a fixed API key is used for the demo, in a real-world scenario, these scopes would typically be within a signed token for robust validation.
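
The scope model described above can be sketched in a few lines of Python. This is an illustration only: the tool names, the scope strings, and the `SCOPE_HIERARCHY` table are hypothetical and do not reflect AgentUp's real configuration keys or implementation.

```python
# Hypothetical scope hierarchy: a parent scope implies its children.
SCOPE_HIERARCHY = {
    "k8s:admin": {"k8s:write", "k8s:read"},
    "k8s:write": {"k8s:read"},
    "k8s:read": set(),
}

# Hypothetical mapping of MCP tools to the scope each one requires.
TOOL_SCOPES = {
    "namespaces_list": "k8s:read",
    "pods_list_in_namespace": "k8s:read",
    "pods_delete": "k8s:write",
}


def expand_scopes(granted):
    """Return the granted scopes plus everything they imply, top-down."""
    expanded = set()
    stack = list(granted)
    while stack:
        scope = stack.pop()
        if scope not in expanded:
            expanded.add(scope)
            stack.extend(SCOPE_HIERARCHY.get(scope, ()))
    return expanded


def visible_tools(granted):
    """Tools the model is allowed to see; everything else stays hidden."""
    allowed = expand_scopes(granted)
    return sorted(t for t, s in TOOL_SCOPES.items() if s in allowed)


def call_tool(tool, granted):
    """Deny any call to a tool outside the caller's scopes."""
    if tool not in visible_tools(granted):
        raise PermissionError(f"scope missing for tool: {tool}")
    return f"called {tool}"


print(visible_tools({"k8s:read"}))
print(call_tool("pods_delete", {"k8s:admin"}))
```

With only the read scope, the delete tool never appears in the list handed to the model, and a direct call to it is rejected as well. That is the "hide it, then deny it anyway" behaviour the post describes.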

Putting the Agent to Work: Live Kubernetes Operations

After a quick virtual environment setup and running the agent, the demonstration shifts to a front-end client that communicates with the agent via A2A JSON RPC methods. This client immediately pulls and displays the available tools/skills from the agent.

The demo beautifully illustrates the iterative agent's capabilities with a series of chained commands:

Discovering Namespaces and Pods: The agent is asked, "How many namespaces do I have in my cluster and which namespaces contain the most pods?".
- The agent authenticates, identifies available tools, and selects namespaces list.
- It then proceeds to pods list in namespace.
- The agent successfully breaks this down into three tasks: listing namespaces, counting pods per namespace, and ranking namespaces by pod count. It correctly identifies the namespace with the most pods, showcasing its goal-driven task decomposition.
Checking Pod Running Times: A follow-up query, "How long have the pods been running within the agent up cluster?", demonstrates the agent's ability to extract specific information, again calling the MCP pods list in namespace tool.
Performing a Deletion (and Verification!): In a bold move, the demonstration requests, "Please remove the completed completed pod".
- The agent processes this, calling pods delete, and successfully removes the specified pod. This highlights its write capabilities and the impact of its assigned scopes.
- To complete the loop, a verification step, "verify the pod has actually been removed," is performed, with the agent confirming the removal and providing a 100% confidence score.

What This Means for Developers

This AgentUp demo highlights several powerful takeaways:

Speed of Development: You can configure an agent incredibly quickly, essentially needing to "set a persona for the agent, give it some tools, and then send it to work".
Autonomous Operations: The iterative, goal-driven nature of these agents means they can handle complex tasks with minimal human intervention.
Secure & Discoverable Capabilities: The MCP system with granular scopes ensures secure operations, while A2A skills make agent capabilities discoverable, paving the way for advanced agent collaboration.

The video hints at exciting future developments like agent collaboration, which promises to unlock even more complex automation scenarios.

If you're looking to bring intelligent automation to your Kubernetes clusters and streamline SRE workflows, AgentUp looks like a game-changer. Check out the video and consider exploring AgentUp yourself – there's even an upcoming agent workshop mentioned where you can build your own agents! This is definitely one to watch for anyone interested in the future of autonomous operations.

The agentup.yaml file from the demo is available in the gist linked below, to get you going in minutes!

What other Agents could you build, using your favorite MCP servers?

https://gist.github.com/lukehinds/1013cfe94a5417715a71ae4c5e9b6c8c

vibe coders welcome

Luke Hinds — Sat, 30 Aug 2025 12:40:53 +0000

🤖 Build Your First Fully Autonomous AI Agent - Free Workshop! 🚀

#ai #agentaichallenge #aiops #opensource

🤖 Build Your First Fully Autonomous AI Agent - Free Workshop! 🚀

Luke Hinds — Sat, 30 Aug 2025 12:27:43 +0000

Ever wanted to build your own multi-turn autonomous AI Agent?

A red teamer 👾, coding assistant 🧑🏽‍💻, CI bug fixer 💽, AI Dog Washer 🐶?

Got a burning business idea you'd love to prototype with AI agents, but don't want to spend 2-3 days wrestling with spaghetti-like, bloated frameworks?

Well, come along to the "Build an Agent" AgentUp Workshop and take home your own working agent at the end! 🤖

What you need:
Ability to use a terminal ✅
Can edit config files ✅
Some imagination ✅
✅ No coding skills required, vibe coders are welcome 🚀

Bring whatever MCP servers you want, plug them into AgentUp, and Bob's your uncle - you've got a working agent!

AgentUp is the Open Source Agent Framework to learn for the future. It is built by engineers who've created open-source solutions powering mission-critical systems at Google, GitHub, Nvidia, Red Hat, Shopify and more. We understand what it takes to build stable, secure, scalable software - and we're applying those same principles to make AI agents truly production-ready, secure and reliable.

Friday 09/05 at 7PM C.E.T / 10:00 am PDT on the RDR discord:

https://discord.gg/gqSg8YMm?event=1410340261778952263

The Agent Paradox: Why AI Agents Shine in Demos But Stumble in Production

Luke Hinds — Tue, 26 Aug 2025 12:23:06 +0000

It worked so well before, what went wrong?

A recurring pattern has emerged across the AI Agent world: Agents that demonstrate remarkable capabilities in controlled environments often struggle when deployed to real-world scenarios. During demonstrations, running on strict guardrails with highly specific system prompts, these systems execute complex tasks with seemingly magical precision, impressing audiences and convincing stakeholders. However, production deployment tells a different story.

The same agent that wowed in the demo begins to stumble in the wild. It enters infinite loops, produces inconsistent outputs, or worse, fails catastrophically with real financial and reputational consequences. This is the Agent Paradox.

The paradox isn't just an unfortunate coincidence; it's a fundamental challenge that exposes the mismatch between our traditional engineering practices and the probabilistic nature of AI systems.

The Heart of the Problem

The core issue lies in the probabilistic nature of the Large Language Models (LLMs) that power these agents. Unlike traditional software, LLMs are not guaranteed to produce the same output for identical inputs, because each response is sampled from a probability distribution. This non-determinism creates a perfect storm of engineering challenges:

Unpredictable behavior under identical conditions
Difficulty in comprehensive testing of all potential outcomes
Complex failure modes that emerge only in production environments
"Coming aground" scenarios where agents get stuck in loops or halt unexpectedly

When agents encounter novel user interactions, live data streams, or unexpected API responses, their brittleness becomes painfully apparent. This reality demands a fundamental rethinking of how we develop, test, and manage agentic software.

The Fundamental Engineering Conflict

Traditional Software Engineering: The Deterministic World

For decades, our engineering practices have been built on solid deterministic foundations:

Predictable Logic

def calculate_tax(income):
    if income <= 50000:
        return income * 0.10
    else:
        return income * 0.20

Given input x, the function always produces output y. Simple. Reliable. Testable.

Repeatable Testing

def test_tax_calculation():
    assert calculate_tax(30000) == 3000
    assert calculate_tax(60000) == 12000

Run this test a million times—same result every time. CI passes, we ship!

Transparent Debugging

When bugs occur, we set breakpoints, inspect variables, trace the call stack. The application's state is frozen in time, allowing systematic analysis of exactly what went wrong.

This is engineering in a deterministic world.

AI Agents: Welcome to the Probabilistic Universe

AI agents operate in a fundamentally different paradigm that shatters our deterministic world:

Probabilistic Nature
Instead of computing a single "correct" output, agents calculate probability distributions over vast possibility spaces:

P(Output | Input) = complex probability distribution

The agent samples from this distribution, and the emergent behaviors are often beyond our complete understanding.
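
A toy illustration of that sampling step: the sketch below turns a handful of made-up scores into a probability distribution with a softmax and then samples from it. The candidate actions and their scores are invented for the example, and a real model does this over tokens rather than whole actions.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(options, scores, temperature=1.0, rng=random):
    """Draw one option according to P(option | input)."""
    probs = softmax(scores, temperature)
    return rng.choices(options, weights=probs, k=1)[0]


# Invented candidates and scores for a single, identical input.
options = ["list_pods", "delete_pod", "ask_user"]
scores = [2.0, 1.5, 0.5]

# Same input, five draws: the chosen action can differ between runs.
print([sample(options, scores) for _ in range(5)])
```

Lowering the temperature pushes almost all of the probability onto the top-scoring option, which makes runs more repeatable but does not turn the system into a deterministic function.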

The Black Box Problem

There's no breakpoint you can set inside a neural network during inference. The agent's "state" exists as:

High-dimensional vectors that are opaque to human inspection
Complex weight matrices with billions of parameters
Emergent patterns that resist traditional analysis

When an agent fails, debugging becomes an exercise in educated guesswork rather than systematic analysis.

Real-World Implications

Demo Environment vs. Production Reality

Demo Environment:

Controlled inputs and scenarios
Curated test cases that showcase strengths
Limited edge cases
Forgiving error handling

Production Environment:

Unpredictable user behavior
Malformed or unexpected data
Integration with flaky external APIs
Real stakes for failures

The Brittle Agent Phenomenon

Consider a customer service agent that works perfectly in demos but fails in production when:

Users provide ambiguous requests
External systems return unexpected error codes
The conversation context becomes too complex
Edge cases emerge that weren't anticipated during development

Moving Forward: New Engineering Paradigms

The Agent Paradox isn't just a technical challenge—it's a call for new engineering methodologies adapted to probabilistic systems:

1. Probabilistic Testing Strategies

Instead of asserting exact outputs, we need to test distributions and ranges of acceptable behaviors.
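
As a rough sketch of what that could look like, the test below runs a stand-in agent many times and asserts on the pass rate rather than on one exact output. `flaky_agent`, the 200-trial count, and the 0.8 threshold are all assumptions for illustration; you would swap in your real agent call and a threshold that fits your use case.

```python
import random


def flaky_agent(prompt, rng):
    """Stand-in for a real agent call (hypothetical): right about 90% of the time."""
    return "42" if rng.random() < 0.9 else "I am not sure"


def pass_rate(agent, prompt, is_acceptable, trials, rng):
    """Run the agent repeatedly and return the fraction of acceptable outputs."""
    passed = sum(is_acceptable(agent(prompt, rng)) for _ in range(trials))
    return passed / trials


def test_agent_meets_threshold():
    rng = random.Random(1234)  # seeded so the test itself is repeatable
    rate = pass_rate(
        flaky_agent,
        "What is 6 * 7?",
        is_acceptable=lambda out: "42" in out,
        trials=200,
        rng=rng,
    )
    # Assert on the distribution, not on any single output.
    assert rate >= 0.8, f"pass rate too low: {rate:.2f}"


test_agent_meets_threshold()
```

Compare this with `test_tax_calculation` earlier: the assertion moves from "this exact value" to "acceptable often enough", and the acceptance check itself becomes part of the test design.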

2. Observability Over Debugging

Since we can't debug AI agents traditionally, we need robust monitoring, logging, and behavioral analysis systems.
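
One simple starting point is to record every agent step as a structured event that can be searched later. The `AgentTrace` helper below is a hypothetical sketch using only the standard library, and the tool names and outputs are made up; a production setup would more likely ship these events to an existing tracing or logging backend.

```python
import json
import time
import uuid


class AgentTrace:
    """Collects one structured event per agent step (hypothetical helper)."""

    def __init__(self):
        self.trace_id = str(uuid.uuid4())
        self.events = []

    def record(self, step, tool, tool_input, output, started_at):
        """Store what the agent did, with what input, and how long it took."""
        self.events.append({
            "trace_id": self.trace_id,
            "step": step,
            "tool": tool,
            "input": tool_input,
            "output": output,
            "duration_ms": round((time.monotonic() - started_at) * 1000, 2),
        })

    def dump(self):
        """Emit one JSON object per line, ready for a log pipeline."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)


trace = AgentTrace()
start = time.monotonic()
trace.record(1, "namespaces_list", {}, ["default", "kube-system"], start)
start = time.monotonic()
trace.record(2, "pods_list_in_namespace", {"namespace": "default"}, ["web-1"], start)
print(trace.dump())
```

You still can't set a breakpoint inside the model, but with a shared `trace_id` you can at least replay the sequence of decisions that led to a failure.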

3. Graceful Degradation by Design

Systems must be architected to handle probabilistic failures elegantly, with multiple fallback strategies.
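
A fallback chain is one way to express this. The sketch below tries each strategy in order and escalates to a human if none produces a valid answer. The three strategies are hypothetical stand-ins that simulate a timeout, an empty output, and a canned reply.

```python
def run_with_fallbacks(request, strategies, is_valid):
    """Try each strategy in order; return the first valid answer and its source."""
    errors = []
    for name, strategy in strategies:
        try:
            answer = strategy(request)
        except Exception as exc:  # a failing strategy must not take the system down
            errors.append(f"{name}: {exc}")
            continue
        if is_valid(answer):
            return name, answer
        errors.append(f"{name}: invalid output")
    # Last resort: hand off to a human instead of guessing.
    return "escalate", f"Could not answer automatically ({'; '.join(errors)})"


# Hypothetical strategies, from most capable to most conservative.
def primary_agent(request):
    raise TimeoutError("model call timed out")


def smaller_model(request):
    return ""  # empty output counts as invalid


def canned_response(request):
    return "We received your request and will follow up shortly."


strategies = [
    ("primary", primary_agent),
    ("smaller_model", smaller_model),
    ("canned", canned_response),
]

print(run_with_fallbacks("refund order 1001", strategies, is_valid=bool))
```

Returning the name of the strategy that answered is a deliberate choice here: it lets you measure how often the system is running in a degraded mode.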

4. Continuous Behavioral Validation

Ongoing monitoring of agent behavior in production, with automated detection of drift or degradation.
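
A minimal version of that detection is a rolling success rate compared against a baseline. The `DriftMonitor` class, the baseline of 0.95, the tolerance of 0.05, and the window of 20 are assumptions for illustration; how you decide whether a single interaction counts as a success is the hard part and is left out here.

```python
from collections import deque


class DriftMonitor:
    """Tracks a rolling success rate and flags drops below a baseline."""

    def __init__(self, baseline, tolerance, window):
        self.baseline = baseline      # success rate measured before release
        self.tolerance = tolerance    # how far below baseline is acceptable
        self.window = window
        self.results = deque(maxlen=window)

    def observe(self, success):
        """Record the outcome of one production interaction."""
        self.results.append(bool(success))

    def rate(self):
        return sum(self.results) / len(self.results) if self.results else None

    def drifted(self):
        """Only judge once the window is full, to avoid noisy early alerts."""
        if len(self.results) < self.window:
            return False
        return self.rate() < self.baseline - self.tolerance


monitor = DriftMonitor(baseline=0.95, tolerance=0.05, window=20)
for _ in range(20):
    monitor.observe(True)
print(monitor.drifted())

# Five failures push the oldest successes out of the window.
for _ in range(5):
    monitor.observe(False)
print(monitor.rate(), monitor.drifted())
```

In practice the `drifted()` signal would feed an alert, so a slow decline in quality gets noticed before users report it.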

Conclusion

The Agent Paradox represents more than a technical hurdle—it's a paradigm shift that challenges decades of software engineering practices. As we continue to integrate AI agents into critical systems, we must evolve our methodologies to embrace probabilistic behavior while maintaining the reliability our users expect.

The magic of AI agents in demos is real, but harnessing that magic reliably in production requires us to fundamentally rethink how we build, test, and maintain software systems.

The future belongs to engineers who can bridge the gap between deterministic expectations and probabilistic realities.

We are trying to do our small part towards making that happen with AgentUp: enterprise-grade agents, built on good fundamental principles of software engineering.

What strategies have you found effective for managing probabilistic AI systems in production? Share your experiences in the comments below.

Tags: #ai #agents #llm #softwaredevelopment #engineering #testing #debugging #machinelearning

What is the Best Programming Language for AI Engineering!

Luke Hinds — Thu, 07 Aug 2025 09:16:33 +0000

What Is The Best Language 🏆for AI Engineering❓

#ai #python #rust #go

What Is The Best Language 🏆for AI Engineering❓

Luke Hinds — Thu, 07 Aug 2025 09:00:47 +0000

I have had a few folks ask me about language choices in AI Engineering. Now before I get into this, there is never a one-size-fits-all answer. So much of it depends on nuances, a key one being how fluent the individual or team is in the chosen language.

We have also seen AI become commoditized to the point where you don't need to understand complex math or neural networks; instead, it's about slinging requests around, handling concurrency, and other common web-like programming paradigms.

Having said that, here are my views, and why, as someone who has historically been more of a static typing enthusiast, Python is coming out on top. It's why I am building AgentUp in Python.

No. 1 Python

Massive AI/ML ecosystem (transformers, pytorch, OpenAI libs, etc.)
Performance when it matters - heavy computation happens in optimized C/CUDA libraries under the hood, so you get Python's simplicity for App logic + C's speed for AI operations
Decent async support for concurrent agents (I love FastAPI)
Yes, it's dynamically typed, but Pydantic helps a lot with data validation (and v2's Rust core helps from a performance standpoint)
Lots of strong frameworks: LangChain, AutoGen, LangGraph, Pydantic AI and more.
Extensive libraries for common agent needs (web scraping, APIs, databases)
It's been the go-to data science language for years, so it has a strong research-to-product pipeline, e.g. Jupyter -> prototype -> product.
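
To show what the async point looks like in practice, here is a small sketch that runs three agent calls concurrently with `asyncio.gather`. The `run_agent` coroutine is a hypothetical stand-in that sleeps instead of calling a model API, and the agent names and delays are invented.

```python
import asyncio


async def run_agent(name, delay):
    """Stand-in for an I/O-bound agent call (hypothetical); sleeps instead of calling an API."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def main():
    # All three agents wait on I/O at the same time rather than one after another.
    return await asyncio.gather(
        run_agent("researcher", 0.03),
        run_agent("coder", 0.02),
        run_agent("reviewer", 0.01),
    )


results = asyncio.run(main())
print(results)
```

`gather` returns results in the order the coroutines were passed in, not the order they finished, which keeps downstream code simple when each agent has a fixed role.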

No. 2 Go

Go would be my second choice, especially if performance and deployment simplicity are priorities:

Excellent concurrency model with goroutines (ideal for multi-agent systems)
Single binary deployments with no dependency management headaches
Strong performance for I/O-heavy operations (common in agents)
Growing ecosystem for AI integrations
Excellent native networking (net/http is phenomenal, and the gRPC ecosystem is a real strength for microservices architectures)
On the downside, no native CUDA bindings (I am not a fan of cgo).

No. 3 Rust

Rust ranks third. It's another strong statically typed language, with great tooling support (cargo ftw!)

Exceptional performance and memory safety
Excellent for systems-level work
Smaller AI ecosystem and steeper learning curve, although Hugging Face is building out more libraries there.
Development velocity would likely be slower
Better suited if you're building performance-critical infrastructure

Worthy Mention: TypeScript!

TypeScript/Node.js/Deno deserves a mention for teams already in that ecosystem! Vercel's AI SDK is a decent framework!

Of course there are many more nuances. Each language can be applied where it is strong. If you're building Agents that deal with web APIs all day, you're not too concerned about GPU inference etc.

As with any x vs y post, I am sure I have opened a can of worms! My next topic will be vim vs emacs, just kidding
