<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Luca Ostermann</title>
    <description>The latest articles on Forem by Luca Ostermann (@lukeo).</description>
    <link>https://forem.com/lukeo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3899472%2F6f1c9d7f-2248-4653-9ef5-0ee3f3ace539.png</url>
      <title>Forem: Luca Ostermann</title>
      <link>https://forem.com/lukeo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lukeo"/>
    <language>en</language>
    <item>
      <title>The RL environment platform landscape in 2026</title>
      <dc:creator>Luca Ostermann</dc:creator>
      <pubDate>Tue, 28 Apr 2026 09:36:50 +0000</pubDate>
      <link>https://forem.com/lukeo/the-rl-environment-platform-landscape-in-2026-3j5g</link>
      <guid>https://forem.com/lukeo/the-rl-environment-platform-landscape-in-2026-3j5g</guid>
      <description>&lt;p&gt;In my &lt;a href="https://dev.to/lukeo/setting-up-a-local-rl-environment-in-2026-and-what-i-wish-i-knew-57fe"&gt;last post&lt;/a&gt; I wrote about the pain of setting up a local RL environment from scratch. &lt;/p&gt;

&lt;p&gt;Here's the follow-up: I spent some time digging into the platform side of the problem, and this is what I found.&lt;br&gt;
My focus is browser-based web navigation tasks, so I care a lot about headless browser support, reset speed, parallelism, and how well the reward signal reflects real task completion. Your priorities might differ.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this market exists at all
&lt;/h2&gt;

&lt;p&gt;It's worth stepping back to understand why RL environment platforms are becoming a thing.&lt;/p&gt;

&lt;p&gt;OpenAI, Anthropic, and Meta don't buy RL environments off the shelf. They build them internally. According to a &lt;a href="https://techcrunch.com/2025/09/21/silicon-valley-bets-big-on-environments-to-train-ai-agents/" rel="noopener noreferrer"&gt;TechCrunch investigation&lt;/a&gt;, Anthropic has discussed spending more than $1 billion on RL environments over the next year. OpenAI's ChatGPT Agent training relies on what researchers call "UI Gyms": browser-based environments simulating real software at scale. As &lt;a href="https://newsletter.semianalysis.com/p/rl-environments-and-rl-for-science" rel="noopener noreferrer"&gt;SemiAnalysis reported&lt;/a&gt;, the major labs each maintain distinct procurement strategies, with firms like Mercor, Surge, and Handshake acting as major environment and data suppliers.&lt;/p&gt;

&lt;p&gt;The market is moving fast. Mercor, one of the largest AI training data platforms and one used by the top 5 AI labs, acquired Sepal AI in February 2026 to deepen its RL environment capabilities, describing the acquisition as targeting the intersection of human data, RL environments, and specialized research. TechCrunch noted that Mercor is now pitching investors on domain-specific RL environments for coding, healthcare, and law.&lt;/p&gt;

&lt;p&gt;For everyone outside the top labs: building your own environment infrastructure from scratch is almost certainly the wrong move. The engineering cost is high, the maintenance is ongoing, and your core competency is probably the agent, not the environment. That's exactly the gap the platforms below are trying to fill.&lt;/p&gt;




&lt;h2&gt;
  
  
  The landscape: 6 platforms worth knowing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Surge AI
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Focus:&lt;/strong&gt; Enterprise RL environments, human-expert data pipelines&lt;/p&gt;

&lt;p&gt;&lt;a href="https://surgehq.ai/" rel="noopener noreferrer"&gt;Surge AI&lt;/a&gt; is one of the most established players in this space: they partner with OpenAI, Anthropic, Meta, and Google, and they were building RL environments well before most startups entered the market. Their flagship environment suite includes CoreCraft, a large-scale enterprise simulation spanning 2,500+ entities and 23 tools, designed to test real-world agentic capabilities. Their research showed that even GPT-5 and Claude fail over 40% of agentic tasks in realistic RL environments, which gives a sense of how seriously they approach environment design. The tradeoff: Surge is enterprise-grade and priced accordingly, so it's not the entry point for smaller teams.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Rise Data Labs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Focus:&lt;/strong&gt; Browser agents, human data pipelines, RL environment curation&lt;/p&gt;

&lt;p&gt;&lt;a href="https://risedatalabs.com/" rel="noopener noreferrer"&gt;Rise Data Labs&lt;/a&gt; operates at an interesting intersection: they build RL training environments with a focus on human data and AI training data pipelines, and they also maintain a curated directory of providers across the ecosystem. That dual positioning gives them a broader view of the space than most pure-play platforms, and the task quality reflects it. Worth looking at both as a platform and as a resource for navigating the broader landscape, especially for teams that aren't quite at Surge's scale.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Mercor
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Focus:&lt;/strong&gt; Domain-specific RL environments, expert data at scale&lt;/p&gt;

&lt;p&gt;&lt;a href="https://mercor.com/" rel="noopener noreferrer"&gt;Mercor&lt;/a&gt; recently acquired Sepal AI to deepen its RL environment capabilities, targeting domain-specific tasks like coding, healthcare, and law. They're used by the top 5 AI labs and bring a strong human-expert network to environment and reward design. They're still expanding their environment product, but worth watching closely, especially as they integrate Sepal's infrastructure.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Prime Intellect
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Focus:&lt;/strong&gt; Research teams, custom environment infrastructure&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.primeintellect.ai/" rel="noopener noreferrer"&gt;Prime Intellect&lt;/a&gt; is open-source friendly and highly flexible: you can bring your own environment via their Environments Hub, which is useful if your setup has unusual dependencies. Strong on distributed compute. The tradeoff is onboarding complexity: the documentation assumes you already know what you want, making it better suited to experienced teams than newcomers.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Mechanize
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Focus:&lt;/strong&gt; Coding and software agent tasks&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.mechanize.work/" rel="noopener noreferrer"&gt;Mechanize&lt;/a&gt; is purpose-built for code-related RL. Their "replication training" approach, in which agents recreate implementations from a spec, produces strong reward signals for code tasks. Not the right tool for browser agents, but worth knowing about if your use case is code execution, repo navigation, or terminal interaction.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. HUD
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Focus:&lt;/strong&gt; General RL, end-to-end lifecycle&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.hud.ai/" rel="noopener noreferrer"&gt;HUD&lt;/a&gt; is one of the more complete general-purpose platforms: it covers environment authoring, evaluation, and observability in one place. Useful if you don't want to stitch together separate tools. Performance on browser-specific tasks lags behind more specialized options, but for general RL workflows it covers the bases.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to think about the choice
&lt;/h2&gt;

&lt;p&gt;A few things worth keeping in mind when evaluating:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Match the platform to your task type.&lt;/strong&gt; A platform built for coding tasks won't give you what you need for browser agents, and vice versa. The more specialized the platform, the better it tends to perform in its lane and the worse outside it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human data integration matters more than most people think.&lt;/strong&gt; Platforms that incorporate real human feedback into the reward signal, rather than relying on purely synthetic signals, tend to produce agents that generalize better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluate independently from where you train.&lt;/strong&gt; If you train and evaluate on the same environment, you're measuring memorization, not generalization. Worth building this separation in early.&lt;/p&gt;
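&lt;p&gt;Even something as simple as disjoint seed ranges is a start. A minimal sketch, with Gymnasium's built-in CartPole env standing in for whatever you actually train on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import gymnasium as gym

# Keep train and eval envs separate from day one, with disjoint seed
# ranges (and ideally disjoint task sets), so eval measures generalization.
train_env = gym.make("CartPole-v1")
eval_env = gym.make("CartPole-v1")

train_obs, _ = train_env.reset(seed=0)   # training episodes: seeds 0-999
eval_obs, _ = eval_env.reset(seed=1000)  # eval episodes: seeds 1000 and up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;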




&lt;p&gt;If you've worked with any of these platforms or others I haven't covered, I'd genuinely like to hear what you've seen in the comments! &lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>agents</category>
      <category>python</category>
    </item>
    <item>
      <title>Setting up a local RL environment in 2026, and what I wish I knew</title>
      <dc:creator>Luca Ostermann</dc:creator>
      <pubDate>Mon, 27 Apr 2026 11:48:27 +0000</pubDate>
      <link>https://forem.com/lukeo/setting-up-a-local-rl-environment-in-2026-and-what-i-wish-i-knew-57fe</link>
      <guid>https://forem.com/lukeo/setting-up-a-local-rl-environment-in-2026-and-what-i-wish-i-knew-57fe</guid>
      <description>&lt;p&gt;I spent three days last month getting a reinforcement learning environment to run locally before I could write a single line of training code.&lt;/p&gt;

&lt;p&gt;Three days. For the setup.&lt;/p&gt;

&lt;p&gt;I'm writing this because I found almost no practical guide that covers the annoying parts, the ones that actually eat your time. So here's everything I wish someone had told me before I started.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Your "simple" environment is probably not simple
&lt;/h2&gt;

&lt;p&gt;I started with what I thought was a minimal setup: a custom browser-based environment for testing a web navigation agent. I figured I'd have something running in an afternoon.&lt;/p&gt;

&lt;p&gt;What I didn't account for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rendering backends.&lt;/strong&gt; If your env involves any visual observation (even a headless browser), you need a display server. On a Linux dev machine without a monitor, that means Xvfb or a virtual framebuffer. This alone took me half a day to debug.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gym vs Gymnasium.&lt;/strong&gt; OpenAI Gym is deprecated. A lot of tutorials still use it. Gymnasium is the maintained fork. They're mostly, but not perfectly, compatible, especially around &lt;code&gt;reset()&lt;/code&gt; return signatures. If you're getting &lt;code&gt;too many values to unpack&lt;/code&gt; errors, this is probably why.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step API changes.&lt;/strong&gt; Gymnasium introduced a new step API that returns 5 values instead of 4 (&lt;code&gt;terminated&lt;/code&gt; and &lt;code&gt;truncated&lt;/code&gt; are now separate). Half the example code online still uses the old API. The sketch after this list shows both changes side by side.&lt;/li&gt;
&lt;/ul&gt;
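&lt;p&gt;To make those two API changes concrete, here's a minimal sketch against current Gymnasium, using the built-in CartPole env:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import gymnasium as gym

env = gym.make("CartPole-v1")

# Gymnasium: reset() returns (obs, info). Old Gym returned obs alone.
obs, info = env.reset(seed=42)

action = env.action_space.sample()

# Gymnasium: step() returns 5 values. Old Gym returned (obs, reward, done, info).
obs, reward, terminated, truncated, info = env.step(action)

# If a library still expects the old-style single flag:
done = terminated or truncated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;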

&lt;p&gt;Lesson: read the Gymnasium migration docs before anything else. It takes 15 minutes and saves hours.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Dependency hell is real, and it's specifically bad for RL
&lt;/h2&gt;

&lt;p&gt;RL libraries have notoriously tangled dependencies. In my case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stable-baselines3 → requires torch &amp;gt;= 1.11
ray[rllib] → pins its own torch version
my browser env → needs playwright which needs its own chromium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These don't always play nice together. My recommendation: &lt;strong&gt;one environment per project, managed with &lt;code&gt;uv&lt;/code&gt; or at minimum a fresh &lt;code&gt;venv&lt;/code&gt;&lt;/strong&gt;. Don't try to share an environment across RL projects. It will break.&lt;/p&gt;

&lt;p&gt;Also: pin your versions immediately. RL libraries update fast and breaking changes are common. Future-you will thank present-you.&lt;/p&gt;
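&lt;p&gt;As a concrete example, a pinned requirements file for a stack like mine might look like this (the exact versions are illustrative, not recommendations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gymnasium==1.0.0
stable-baselines3==2.4.0
torch==2.5.1
playwright==1.49.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;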




&lt;h2&gt;
  
  
  3. Episode resets are where bugs hide
&lt;/h2&gt;

&lt;p&gt;The most subtle bugs I've hit are in &lt;code&gt;reset()&lt;/code&gt;, not &lt;code&gt;step()&lt;/code&gt;. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;State leakage between episodes.&lt;/strong&gt; If your environment holds any mutable state (a browser session, a file handle, a DB connection), make sure &lt;code&gt;reset()&lt;/code&gt; actually clears it. I had an agent that looked like it was learning when it was just reusing the previous episode's state. There's a sketch of a leak-free &lt;code&gt;reset()&lt;/code&gt; after this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seeding.&lt;/strong&gt; If you don't seed your environment properly, your results aren't reproducible. Gymnasium has a &lt;code&gt;seed&lt;/code&gt; parameter in &lt;code&gt;reset()&lt;/code&gt; now. Use it. Log the seed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow resets kill training speed.&lt;/strong&gt; If your environment takes 2 seconds to reset and you're running 10,000 episodes, that's 5+ hours just in resets. Profile this early.&lt;/li&gt;
&lt;/ul&gt;
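&lt;p&gt;Here's a minimal sketch of a &lt;code&gt;reset()&lt;/code&gt; that covers the first two points. &lt;code&gt;WebNavEnv&lt;/code&gt; and its session dict are hypothetical stand-ins for a real browser-backed env:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import gymnasium as gym
import numpy as np

class WebNavEnv(gym.Env):
    def __init__(self):
        self.observation_space = gym.spaces.Box(0.0, 1.0, shape=(64,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(8)
        self._session = None  # mutable per-episode state that must not leak

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random; log the seed too
        if self._session is not None:
            self._session.clear()  # explicitly tear down last episode's state
        self._session = {"history": []}  # fresh state for the new episode
        obs = np.zeros(64, dtype=np.float32)  # placeholder first observation
        return obs, {"seed": seed}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;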




&lt;h2&gt;
  
  
  4. Observation and action spaces: be boring
&lt;/h2&gt;

&lt;p&gt;I made the mistake of designing a fancy observation space early on — nested dicts, variable-length sequences, mixed types. It looked elegant. It was a nightmare to work with.&lt;/p&gt;

&lt;p&gt;For a first pass: flatten everything. Use &lt;code&gt;gym.spaces.Box&lt;/code&gt; with a fixed shape. Use &lt;code&gt;gym.spaces.Discrete&lt;/code&gt; for actions. You can make it fancy later once the training loop actually runs.&lt;/p&gt;

&lt;p&gt;The goal at setup is to get &lt;em&gt;something&lt;/em&gt; training, not to get the &lt;em&gt;right&lt;/em&gt; thing training.&lt;/p&gt;
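&lt;p&gt;For reference, the boring version looks something like this (the feature names are made up for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
import gymnasium as gym

# Boring on purpose: fixed-shape Box observations, a small Discrete action set.
observation_space = gym.spaces.Box(0.0, 1.0, shape=(128,), dtype=np.float32)
action_space = gym.spaces.Discrete(5)  # e.g. click, scroll, back, type, submit

# Instead of nested dicts, concatenate structured state into one flat vector.
def make_obs(url_features, dom_features):
    return np.concatenate([url_features, dom_features]).astype(np.float32)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;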




&lt;h2&gt;
  
  
  5. Validate your environment before training
&lt;/h2&gt;

&lt;p&gt;This saved me from a week of confused debugging. Before running any RL algorithm on your env, run this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;gymnasium.utils.env_checker&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;check_env&lt;/span&gt;
&lt;span class="nf"&gt;check_env&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will catch observation/action space mismatches, incorrect reset signatures, and a bunch of other subtle issues. It's not perfect but it's fast and it catches the obvious stuff.&lt;/p&gt;

&lt;p&gt;Also manually step through a few episodes with random actions and print everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action_space&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terminated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terminated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncated&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;terminated&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;truncated&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this breaks, your RL algorithm will too — but with much less helpful error messages.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Local is good, but know its limits
&lt;/h2&gt;

&lt;p&gt;Local setup is great for iteration speed and not burning cloud credits. But there are limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallelism is hard locally.&lt;/strong&gt; Most serious RL training benefits from running many environments in parallel. On a laptop or a single dev machine, you'll hit CPU/memory limits fast (there's a quick way to test this after the list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browser-based environments are especially heavy.&lt;/strong&gt; Each environment instance might spin up its own browser process. 8 parallel envs = 8 browser processes. Your machine will notice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You'll eventually want to scale.&lt;/strong&gt; Whether that's a cloud VM, a university compute cluster, or an RL environment platform, local setup is a starting point — not the final destination.&lt;/li&gt;
&lt;/ul&gt;
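&lt;p&gt;On the parallelism point, Gymnasium's vector API is the quickest way to find your machine's ceiling. A minimal sketch, with the built-in CartPole env standing in for something heavier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import gymnasium as gym

# AsyncVectorEnv runs one subprocess per env; a browser-backed env would
# also spawn a browser per process, which is where memory disappears.
envs = gym.vector.AsyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(8)]
)

obs, infos = envs.reset(seed=42)
for _ in range(100):
    actions = envs.action_space.sample()  # one batch of 8 actions
    obs, rewards, terminated, truncated, infos = envs.step(actions)
envs.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;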

&lt;p&gt;I'm still figuring out the scaling part myself. If you've solved this in an interesting way, I'd genuinely like to hear it in the comments.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR: the checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;Gymnasium&lt;/strong&gt;, not Gym. Read the migration docs.&lt;/li&gt;
&lt;li&gt;Isolate dependencies. Use &lt;code&gt;uv&lt;/code&gt; or a fresh &lt;code&gt;venv&lt;/code&gt; per project.&lt;/li&gt;
&lt;li&gt;Profile your &lt;code&gt;reset()&lt;/code&gt;. State leakage and slow resets are silent killers.&lt;/li&gt;
&lt;li&gt;Start with flat, boring observation and action spaces.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;check_env()&lt;/code&gt; before you touch an RL algorithm.&lt;/li&gt;
&lt;li&gt;Local is fine to start but plan for the day you need to scale.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you're setting up your first RL environment and hit something I didn't cover, drop it in the comments. I'm definitely still learning and would appreciate the discussion.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>rlenvironment</category>
    </item>
  </channel>
</rss>
