<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jack Wang</title>
    <description>The latest articles on Forem by Jack Wang (@jack_wang_d47b1f7f781c64f).</description>
    <link>https://forem.com/jack_wang_d47b1f7f781c64f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3660680%2F2a1509c8-1b17-4258-a895-aea792f5cfda.png</url>
      <title>Forem: Jack Wang</title>
      <link>https://forem.com/jack_wang_d47b1f7f781c64f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jack_wang_d47b1f7f781c64f"/>
    <language>en</language>
    <item>
      <title>Just added SAM3 video object tracking to X-AnyLabeling!</title>
      <dc:creator>Jack Wang</dc:creator>
      <pubDate>Sat, 03 Jan 2026 17:10:03 +0000</pubDate>
      <link>https://forem.com/jack_wang_d47b1f7f781c64f/just-added-sam3-video-object-tracking-to-x-anylabeling-33md</link>
      <guid>https://forem.com/jack_wang_d47b1f7f781c64f/just-added-sam3-video-object-tracking-to-x-anylabeling-33md</guid>
      <description>&lt;p&gt;Hey everyone!&lt;/p&gt;

&lt;p&gt;Just wanted to share that we've integrated SAM3's video object tracking into X-AnyLabeling. If you're doing video annotation work, this might save you some time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track objects across video frames automatically&lt;/li&gt;
&lt;li&gt;Works with text prompts (just type "person", "car", etc.) or visual prompts (click a few points)&lt;/li&gt;
&lt;li&gt;Non-overwrite mode so it won't mess with your existing annotations&lt;/li&gt;
&lt;li&gt;You can start tracking from any frame in the video&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compared to the original SAM3 implementation, ours is optimized for more stable memory usage and faster inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cool part:&lt;/strong&gt; Unlike SAM2, SAM3 can segment all instances of an open-vocabulary concept. So if you type "bicycle", it'll find and track every bike in the video, not just one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;br&gt;
For text prompting, you just enter the object name and hit send. For visual prompting, you click a few points (positive/negative) to mark what you want to track, then it propagates forward through the video.&lt;/p&gt;

&lt;p&gt;We've also got Label Manager and Group ID Manager tools if you need to batch-edit track_ids or labels afterward.&lt;/p&gt;

&lt;p&gt;It's part of the latest release (v3.3.4). You'll need X-AnyLabeling-Server v0.0.4+ running. Model weights are available on ModelScope (for users in China) or you can grab them from GitHub releases.&lt;/p&gt;

&lt;p&gt;Setup guide: &lt;a href="https://github.com/CVHub520/X-AnyLabeling/blob/main/examples/interactive_video_object_segmentation/sam3/README.md" rel="noopener noreferrer"&gt;https://github.com/CVHub520/X-AnyLabeling/blob/main/examples/interactive_video_object_segmentation/sam3/README.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anyone else working on video annotation? Would love to hear what workflows you're using or if you've tried SAM3 for this kind of thing.&lt;/p&gt;

</description>
      <category>segmentanything3</category>
      <category>xanylabeling</category>
      <category>computervision</category>
      <category>ai</category>
    </item>
    <item>
      <title>Meet X-AnyLabeling: The Python-native, AI-powered Annotation Tool for Modern CV 🚀</title>
      <dc:creator>Jack Wang</dc:creator>
      <pubDate>Sun, 14 Dec 2025 02:32:39 +0000</pubDate>
      <link>https://forem.com/jack_wang_d47b1f7f781c64f/meet-x-anylabeling-the-python-native-ai-powered-annotation-tool-for-modern-cv-507b</link>
      <guid>https://forem.com/jack_wang_d47b1f7f781c64f/meet-x-anylabeling-the-python-native-ai-powered-annotation-tool-for-modern-cv-507b</guid>
      <description>&lt;h2&gt;
  
  
  The "Data Nightmare" 😱
&lt;/h2&gt;

&lt;p&gt;Let’s be honest for a second.&lt;/p&gt;

&lt;p&gt;As AI engineers, we love tweaking hyperparameters, designing architectures, and watching loss curves go down. But there is one part of the job that universally sucks: &lt;strong&gt;Data Labeling.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s the unglamorous bottleneck of every project. If you've ever spent a weekend manually drawing 2,000 bounding boxes on a dataset, you know the pain.&lt;/p&gt;

&lt;p&gt;I realized the tooling landscape was broken:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Commercial SaaS:&lt;/strong&gt; Great features, but expensive and I hate uploading sensitive data to the cloud.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Old-school OSS (LabelImg/Labelme):&lt;/strong&gt; Simple, but "dumb." No AI assistance means 100% manual labor.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Heavy Web Suites (CVAT):&lt;/strong&gt; Powerful, but requires a complex Docker deployment just to label a folder of images.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted something different. I wanted a tool that felt like a lightweight desktop app but had the brain of a modern AI model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqti6jwvv9iwjnx9waiyb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqti6jwvv9iwjnx9waiyb.png" alt="X-AnyLabeling’s Vision" width="720" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, I built &lt;strong&gt;X-AnyLabeling&lt;/strong&gt;. And today, we are releasing &lt;strong&gt;Version 3.0&lt;/strong&gt;. 🎉&lt;/p&gt;

&lt;h2&gt;
  
  
  What is X-AnyLabeling? 🤖
&lt;/h2&gt;

&lt;p&gt;X-AnyLabeling is a desktop-based data annotation tool built with Python and Qt. But unlike traditional tools, it’s designed to be &lt;strong&gt;"AI-First."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The philosophy is simple: &lt;strong&gt;Never label from scratch if a model can do a draft for you.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Whether you are doing Object Detection, Segmentation, Pose Estimation, or even Multimodal VQA, X-AnyLabeling lets you run a model (like YOLO, SAM, or Qwen-VL) to pre-label the data. You just verify and correct.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26b9geenj8er1ls26x73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26b9geenj8er1ls26x73.png" alt="X-AnyLabeling Ecosystem" width="720" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is what’s new in v3.0 and why it matters for developers.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Finally, a PyPI Package 📦
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flunki1tnyg3b3gcqxlra.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flunki1tnyg3b3gcqxlra.png" alt="X-AnyLabeling Pypi" width="720" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the past, you had to clone the repo and pray the dependencies didn't break. We fixed that. You can now install the whole suite with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install with GPU support (CUDA 12.x)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;x-anylabeling-cvhub[cuda12]

&lt;span class="c"&gt;# Or just the CPU version&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;x-anylabeling-cvhub[cpu]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also added a &lt;strong&gt;CLI tool&lt;/strong&gt; for those who love the terminal. Need to convert a YOLO dataset to X-AnyLabeling's label format? Don't write a script; just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xanylabeling convert &lt;span class="nt"&gt;--task&lt;/span&gt; yolo2xlabel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. The "Remote Server" Architecture ☁️ -&amp;gt; 🖥️
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwq75uuna5v06c18mtk5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwq75uuna5v06c18mtk5.png" alt="X-AnyLabeling-Server" width="720" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a big one for teams. Running a heavy model (like SAM3 or a large VLM) on an annotator's laptop is slow, if it runs at all.&lt;/p&gt;

&lt;p&gt;We introduced &lt;strong&gt;X-AnyLabeling-Server&lt;/strong&gt;, a lightweight FastAPI backend.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Server:&lt;/strong&gt; You deploy the heavy models on a GPU machine.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Client:&lt;/strong&gt; The annotator uses the lightweight UI on their laptop.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Result:&lt;/strong&gt; Fast inference via REST API without local hardware constraints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It supports custom models, Ollama, and Hugging Face Transformers out of the box.&lt;/p&gt;
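
&lt;p&gt;To make the client/server split concrete, here's a rough sketch of what a client-side call could look like. Note that the endpoint path, port, and payload keys below are illustrative assumptions, not the server's actual API; check the X-AnyLabeling-Server docs for the real contract.&lt;/p&gt;

```python
import base64

# NOTE: the route, port, and payload schema here are hypothetical,
# shown only to illustrate the "thin client -> GPU server" idea.
SERVER_URL = "http://gpu-box:8000/predict"  # assumed endpoint

def build_payload(image_path: str, prompt: str = "person") -> dict:
    """Encode an image and a text prompt into a JSON-serializable payload."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {"image": image_b64, "prompt": prompt}

def request_predictions(image_path: str, prompt: str = "person") -> dict:
    """POST the payload to the GPU server and return its JSON response."""
    import requests  # kept local so the sketch imports without the dependency
    payload = build_payload(image_path, prompt)
    return requests.post(SERVER_URL, json=payload, timeout=30).json()
```

&lt;p&gt;The point is the shape of the workflow: the laptop only serializes inputs and draws results, while all the model weights stay on the server.&lt;/p&gt;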

&lt;h3&gt;
  
  
  3. The "Label-Train-Loop" with Ultralytics 🔄
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6f25evmgr8fkbnu9n0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6f25evmgr8fkbnu9n0p.png" alt="Auto Training in X-AnyLabeling" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We integrated the &lt;a href="https://github.com/ultralytics/ultralytics" rel="noopener noreferrer"&gt;Ultralytics&lt;/a&gt; framework directly into the GUI.&lt;/p&gt;

&lt;p&gt;You can now:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Label a batch of images.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Click "Train" inside the app.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; Wait for the YOLO model to finish training.&lt;/li&gt;
&lt;li&gt; Load that new model back into the app to auto-label the &lt;em&gt;next&lt;/em&gt; batch of images.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This creates a positive feedback loop that drastically speeds up dataset creation.&lt;/p&gt;
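
&lt;p&gt;If you're curious what one turn of the loop looks like in code, here's a minimal sketch using the Ultralytics Python API directly. The dataset path, base weights, and epoch count are illustrative defaults, not X-AnyLabeling's internals.&lt;/p&gt;

```python
# One iteration of the label-train-loop, sketched with the Ultralytics
# Python API (pip install ultralytics). Paths and hyperparameters are
# placeholders for your own project.

def train_on_labeled_batch(data_yaml: str,
                           base_weights: str = "yolov8n.pt",
                           epochs: int = 50):
    """Train a YOLO model on the annotations exported so far."""
    from ultralytics import YOLO  # heavy import kept local to the function
    model = YOLO(base_weights)
    model.train(data=data_yaml, epochs=epochs, imgsz=640)
    return model

def prelabel_next_batch(model, image_dir: str):
    """Run the freshly trained model over the next, still-unlabeled batch."""
    return model.predict(source=image_dir, save=False)
```

&lt;p&gt;Inside the app you never touch this code; you click "Train", but under the hood it's the same idea: train on batch N, pre-label batch N+1, correct, repeat.&lt;/p&gt;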

&lt;h3&gt;
  
  
  4. Multimodal &amp;amp; Chatbot Capabilities 💬
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg57e374vc5ts3bmah2z7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg57e374vc5ts3bmah2z7.png" alt="Chatbot" width="720" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Computer Vision isn't just boxes anymore. We added features for the &lt;strong&gt;LLM/VLM era&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;VQA Mode:&lt;/strong&gt; Structured annotation for document parsing or visual Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Chatbot:&lt;/strong&gt; Connect to GPT-4, Gemini, or local models to "chat" with your images and auto-generate captions.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Export:&lt;/strong&gt; One-click export to &lt;code&gt;ShareGPT&lt;/code&gt; format for fine-tuning with LLaMA-Factory.&lt;/li&gt;
&lt;/ul&gt;
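
&lt;p&gt;For a sense of what the ShareGPT export produces, here's a sketch of a single record in the ShareGPT-style layout that LLaMA-Factory's multimodal datasets commonly use. The keys follow the widely used convention; verify against the exporter's actual output before fine-tuning.&lt;/p&gt;

```python
import json

# A ShareGPT-style record: a conversation plus the image(s) it refers to.
# The file paths and Q&A text are made up for illustration.
record = {
    "conversations": [
        {"from": "human", "value": "<image>What is in this picture?"},
        {"from": "gpt", "value": "A cyclist crossing an intersection."},
    ],
    "images": ["images/000001.jpg"],
}

print(json.dumps(record, indent=2))
```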




&lt;h2&gt;
  
  
  Supported Models (The "Batteries Included" List) 🔋
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvml9gk9lopo7tlndz974.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvml9gk9lopo7tlndz974.png" alt="X-AnyLabeling's model zoo" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We support &lt;strong&gt;100+ models&lt;/strong&gt; out of the box. You don't need to write inference code; just select them from the dropdown.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Segmentation:&lt;/strong&gt; SAM 1/2/3, MobileSAM, EdgeSAM.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Detection:&lt;/strong&gt; YOLOv5/8/10/11, RT-DETR, Gold-YOLO.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;OCR:&lt;/strong&gt; PP-OCRv5 (Great for multilingual text).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Multimodal:&lt;/strong&gt; Qwen-VL, ChatGLM, GroundingDINO.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it out! 🛠️
&lt;/h2&gt;

&lt;p&gt;This project is 100% Open Source.&lt;/p&gt;

&lt;p&gt;We've hit &lt;strong&gt;7.5k stars&lt;/strong&gt; on GitHub, and we're just getting started. If you are tired of manual labeling or struggling with complex web-based annotation tools, give X-AnyLabeling a spin.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;GitHub Repo:&lt;/strong&gt; &lt;a href="https://github.com/CVHub520/X-AnyLabeling" rel="noopener noreferrer"&gt;https://github.com/CVHub520/X-AnyLabeling&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://github.com/CVHub520/X-AnyLabeling/tree/main/docs" rel="noopener noreferrer"&gt;Full Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’d love to hear your feedback in the comments! What features are you missing in your current data pipeline? 👇&lt;/p&gt;

</description>
      <category>ai</category>
      <category>computervision</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
