Jesse Williams for Jozu

KitOps: Bringing DevOps Discipline to Machine Learning Artifacts

Yesterday, KitOps project lead and Jozu CTO Gorkem Ercan joined Docker Captain Bret Fisher to discuss how KitOps is addressing one of the most overlooked challenges in ML operations: artifact management. The conversation highlighted how the growing intersection between traditional DevOps and ML engineering is creating new challenges that require specialized tooling.

The Problem: ML's "Wild West" of Artifact Management

As Bret pointed out during the conversation, many platform engineers and traditional DevOps professionals are increasingly being asked to support ML workloads without necessarily having ML expertise. Meanwhile, data scientists who excel at creating models often lack infrastructure automation experience.

"Everything's all over the place," Bret noted while discussing the current state of ML artifact management. "Instead of someone reading 50 pages of documentation to try to figure out what things they need to download, what assets they need to get started, [KitOps] almost feels like the Docker Compose file of MLOps."

Gorkem confirmed this observation, explaining that before KitOps, ML artifacts were stored "wherever": on Windows file servers, network shares, Google Drive, and S3 buckets. This created significant challenges for traceability, security, and deployment.

What Makes KitOps Different

KitOps consists of three main components:

  1. The Model Kit Specification - A standardized format for ML artifacts based on OCI standards
  2. The CLI Tool - For managing these artifacts with familiar commands
  3. OCI Registry Integration - Leveraging existing container registry infrastructure
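At the center of the ModelKit specification is the Kitfile, a YAML manifest that declares the model, datasets, and code that make up a package. The sketch below is illustrative only: the file paths and package names are invented, and the exact schema should be checked against the KitOps documentation.

```yaml
# Kitfile: declares everything that belongs in a ModelKit.
# All names and paths here are hypothetical examples.
manifestVersion: "1.0"
package:
  name: sentiment-classifier
  version: 1.0.0
  description: Example packaging of a model with its data and code
model:
  name: sentiment-model
  path: ./model.onnx
datasets:
  - name: training-data
    path: ./data/train.csv
code:
  - path: ./src
```

Because everything is declared in one manifest, the resulting ModelKit can be versioned, pushed, and audited as a single unit rather than as files scattered across drives and buckets.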

"We wanted to create a bit of separation between roles," Gorkem explained. "Data scientists are really proficient in certain parts of AI and ML, but they are not the ones going to apply platform engineering DevOps techniques to take things into production."

Shivay, a KitOps contributor and Jozu advocate, added that while tools like Kubeflow try to help bridge the gap between DevOps and ML teams, KitOps specifically focuses on "how you package your machine learning systems in a more efficient and secure manner."

Practical Applications and Commands

During the demonstration, Gorkem showed how KitOps works with straightforward commands that will feel familiar to Docker users:

  • kit pack - Package ML artifacts (similar to Docker's build)
  • kit pull - Pull artifacts to a local cache
  • kit unpack - Extract artifacts without caching
  • kit list - View available artifacts
  • kit info - Get detailed information about artifacts
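Put together, a typical flow mirrors the familiar Docker build-push-pull cycle. This is a hedged sketch: the registry URL, repository name, and tag are placeholders, and flags should be verified against `kit --help` for your installed version.

```shell
# Package the artifacts described by the Kitfile in the current directory
kit pack . -t registry.example.com/my-org/sentiment-classifier:v1

# Push the ModelKit to any OCI-compliant registry
kit push registry.example.com/my-org/sentiment-classifier:v1

# On another machine: pull into the local cache...
kit pull registry.example.com/my-org/sentiment-classifier:v1

# ...or extract straight to a working directory without caching
kit unpack registry.example.com/my-org/sentiment-classifier:v1 -d ./workspace

# Inspect what is available
kit list
kit info registry.example.com/my-org/sentiment-classifier:v1
```

The commands are deliberately close to their Docker equivalents, which is a large part of why the workflow feels immediately familiar to platform teams.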

The CLI also includes utilities like local caching to prevent duplicating large files, support for multiple model formats (including scikit-learn, PyTorch, TensorFlow, and ONNX), and even a development command for testing LLMs.

Security and Enterprise Integration

One of the most compelling aspects discussed was how KitOps leverages existing OCI registries that organizations already use, complete with enterprise features like RBAC, auditing, and signing capabilities.

"The good thing about Docker registries is they already exist in your organization," Gorkem explained. "Someone already has RBAC built on top of it, someone has auditing built on top of it."

Shivay emphasized how this helps with security: "We use a tool like cosign to sign our models as well... you are actually able to attach attestations and signing to your models and datasets."
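Because ModelKits are plain OCI artifacts, standard supply-chain tooling applies to them directly. Below is a hedged sketch of the keyless cosign flow Shivay describes; the image reference, digest, and identity values are placeholders, and your organization's signing setup may differ.

```shell
# Sign the ModelKit in the registry by digest (keyless, via OIDC)
cosign sign registry.example.com/my-org/sentiment-classifier@sha256:<digest>

# Verify the signature before deploying
cosign verify \
  --certificate-identity user@example.com \
  --certificate-oidc-issuer https://accounts.google.com \
  registry.example.com/my-org/sentiment-classifier@sha256:<digest>
```

Signing by digest rather than tag ensures the attestation is bound to the exact bytes of the model and datasets, not to a mutable label.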

The Future: Standardizing ML Artifacts

The conversation concluded with a discussion about the future of ML artifact standardization. Gorkem mentioned ongoing work with companies like Red Hat and potentially Docker on a new model packaging specification, aimed at creating greater interoperability between different tools in the ecosystem.

When Bret asked about compatibility with Docker's new Model Runner, Gorkem acknowledged the current fragmentation in the space but expressed hope for convergence: "If they are interoperable, then I don't think there is a lot of reason for competing on the media type..."

As organizations continue to integrate ML workloads into their infrastructure, tools like KitOps that bring the rigor and discipline of DevOps to the ML world will become increasingly important, helping both data scientists and platform engineers collaborate more effectively.


This blog post summarizes a conversation with Gorkem Ercan (CTO of Jozu and co-creator of KitOps) and Shivay (a KitOps contributor) on Bret Fisher's livestream, discussing how KitOps is standardizing the management of ML artifacts in cloud-native environments.


Top comments (2)

Nevo David •

nice, love seeing devops tools moving into the ml world - do you think trying to force standardization always helps, or can it slow down real-world work sometimes?

Brad Micklea •

I'm obviously biased as a maintainer of KitOps, but this was something we talked about early on - we didn't want to create a new standard (see the xkcd cartoon), and we knew you can't "force" standardization. Standardization is either something needed by the community or it isn't.

In this case we saw two things: 1/ data science teams had no big issues getting models running locally or in small shared cloud environments; 2/ devops teams did have problems taking those running models and turning them into something production ready and appropriate for more scalable and secure environments. KitOps is really about use case 2, and the standard in production environments is already established: OCI.

When we started we constrained ourselves to "how do we take the existing production runtime standard of OCI and make it work seamlessly and simply for AI/ML workloads?" KitOps was born from that. We're seeing consistent and strong adoption of KitOps by those "DevOps" teams who love that it just works with their existing tools and processes. Platform engineering will then make it a fairly invisible part of the development cycle so data science teams don't have to change what's been working for them, but SREs and production teams have something that is more secure and production-focused. Our users tell us it's a best-of-both-worlds solution.

