<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sematic</title>
    <description>The latest articles on Forem by Sematic (@sematic).</description>
    <link>https://forem.com/sematic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F6539%2F36756d0f-8768-45ad-9a36-29e547ec7f6a.png</url>
      <title>Forem: Sematic</title>
      <link>https://forem.com/sematic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sematic"/>
    <language>en</language>
    <item>
      <title>Sematic + Ray: The Best of Orchestration and Distributed Compute at your Fingertips</title>
      <dc:creator>Josh Bauer</dc:creator>
      <pubDate>Mon, 17 Apr 2023 19:56:08 +0000</pubDate>
      <link>https://forem.com/sematic/sematic-ray-the-best-of-orchestration-and-distributed-compute-at-your-fingertips-328m</link>
      <guid>https://forem.com/sematic/sematic-ray-the-best-of-orchestration-and-distributed-compute-at-your-fingertips-328m</guid>
      <description>&lt;h2&gt;Finding Dynamic Combos&lt;/h2&gt;

&lt;p&gt;Getting Machine Learning (ML) infrastructure right is hard. One of the challenges for any ML project getting off the ground is finding the right tools for the job, and the number of tools targeting different parts of the ML lifecycle can easily feel overwhelming.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KloZ6xaD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7ygrfb593b5trui8yfoa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KloZ6xaD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7ygrfb593b5trui8yfoa.png" alt="A sampling of tools that help with ML development, created by the Linux Foundation." width="512" height="300"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;em&gt;A sampling of tools that help with ML development, created by the &lt;a href="https://landscape.lfai.foundation/"&gt;Linux Foundation&lt;/a&gt;.&lt;/em&gt;



&lt;p&gt;Sometimes, two tools seem to “just fit” together, and you forget that you’re even working with multiple tools as the lines blur into a coherent experience. One example that every ML Engineer or Data Scientist is familiar with is &lt;a href="https://numpy.org/"&gt;numpy&lt;/a&gt; and &lt;a href="https://pandas.pydata.org/"&gt;pandas&lt;/a&gt;. Numpy enables fast and powerful mathematical computations with arrays/matrices in Python. Pandas provides higher-level data structures for manipulating tabular data. While you can of course use one without (explicitly) using the other, they complement each other so well that they are often used together. Pandas works as a usability layer, while numpy supercharges it with compute efficiency.&lt;/p&gt;
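To make this complementarity concrete, here is a minimal sketch (assuming any recent versions of numpy and pandas): pandas columns are numpy arrays underneath, so numpy functions apply to them directly.

```python
import numpy as np
import pandas as pd

# A pandas DataFrame is a usability layer over numpy storage.
df = pd.DataFrame({"height_cm": [150.0, 162.0, 180.0]})

# The column's underlying storage is a numpy array...
assert isinstance(df["height_cm"].to_numpy(), np.ndarray)

# ...so numpy ufuncs operate on pandas objects directly,
# returning a pandas Series with the index preserved.
heights_m = np.round(df["height_cm"] / 100.0, 2)
print(heights_m.tolist())  # [1.5, 1.62, 1.8]
```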

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SarkDuHf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qgdtc8q36skbu05d28zk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SarkDuHf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qgdtc8q36skbu05d28zk.png" alt="Pandas and numpy working together seamlessly." width="512" height="133"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;em&gt;Pandas and numpy working together seamlessly.&lt;/em&gt;



&lt;p&gt;At Sematic, we care &lt;strong&gt;a lot&lt;/strong&gt; about usability. We aim to make your ML workflows as simple and intuitive as possible, while providing you with best-in-class lineage tracking, reproducibility guarantees, local/cloud parity, and more. You can chain together the different parts of your ML pipelines using Sematic, &lt;a href="https://docs.sematic.dev/public-api-reference/api#resource-requirements"&gt;and specify what kind of resources&lt;/a&gt; you need in the cloud. But many parts of the modern ML lifecycle require more than one computing node: you need a cluster. For example, training the &lt;a href="https://arxiv.org/abs/1709.05011"&gt;original ResNet-50 on a single GPU takes 14 days&lt;/a&gt;. Leveraging cluster computing can cut this time drastically. Sematic needed a tool to supercharge it with cluster computing resources, ideally in a way that “just fits” with another tool.&lt;/p&gt;

&lt;h2&gt;Ray&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.ray.io/"&gt;Ray&lt;/a&gt; pitches itself as “an open-source unified compute framework that makes it easy to scale AI and Python workloads.” Ray can be broken down into three major pieces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ray Core&lt;/strong&gt;: primitives for distributed communication, for defining the workloads and logic to be executed by the distributed compute layer, and for initializing computing resources to interact with the system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ray-native domain libraries&lt;/strong&gt;: libraries provided “out of the box” with Ray for various parts of ML development, such as &lt;a href="https://docs.ray.io/en/latest/tune/index.html"&gt;hyperparameter tuning&lt;/a&gt;, &lt;a href="https://docs.ray.io/en/latest/data/dataset.html"&gt;data processing&lt;/a&gt;, and &lt;a href="https://docs.ray.io/en/latest/train/train.html"&gt;training&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem of integrations&lt;/strong&gt;: Ray integrates with many popular tools and frameworks within the broader ML landscape, such as Hugging Face, Spark, PyTorch, and many more.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With these pieces, Ray easily stands as a powerhouse for distributed computing within ML.&lt;/p&gt;

&lt;h2&gt;Sematic + Ray&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wxlfIXmD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2jh947qdh5on3lbx8gfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wxlfIXmD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2jh947qdh5on3lbx8gfp.png" alt="How Ray and Sematic complement each other." width="512" height="375"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;em&gt;How Ray and Sematic complement each other.&lt;/em&gt;



&lt;p&gt;Sematic was designed to let you create end-to-end ML pipelines with minimal development overhead, while adding visualization, lineage tracking, reproducibility, and more. In the language of Sematic, your pipeline steps are Sematic Functions: perhaps one for data processing, one for training, one for evaluation, and so on. Then, within these Sematic Functions, you can use Ray to efficiently scale data processing beyond a single compute node.&lt;/p&gt;

&lt;p&gt;That’s great as a conceptual model, but how does Sematic integrate with Ray in practice?&lt;/p&gt;

&lt;p&gt;When you’re authoring a pipeline, using Ray within a Sematic Function is as easy as using the &lt;code&gt;RayCluster&lt;/code&gt; context manager inside the function. This will spin up a Ray cluster on demand and enter the &lt;code&gt;with&lt;/code&gt; context only once the cluster is ready for use. Your code can then use Ray just like it would in any other situation. When your code is done executing (whether successfully or not), the Ray cluster will be cleaned up for you. The Ray cluster uses the same container image as your pipeline, so the same code and dependencies are guaranteed to be present on every node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N3hYZGT---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nbju0qmij22yek5xguou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N3hYZGT---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nbju0qmij22yek5xguou.png" alt="Using Ray within Sematic." width="512" height="214"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;em&gt;Using Ray within Sematic.&lt;/em&gt;
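The lifecycle described above (provision on demand, enter the with block only once the cluster is ready, and clean up whether the body succeeds or fails) is the standard Python context-manager pattern. Below is a minimal, self-contained sketch of that pattern, with a plain dictionary standing in for a cluster; the names are illustrative placeholders, not Sematic's actual RayCluster implementation.

```python
from contextlib import contextmanager

@contextmanager
def on_demand_cluster(n_workers: int):
    """Illustrative stand-in for an on-demand cluster lifecycle."""
    cluster = {"workers": n_workers, "ready": False, "torn_down": False}
    cluster["ready"] = True          # stand-in for waiting until the cluster is ready
    try:
        yield cluster                # the 'with' body runs only once ready
    finally:
        cluster["torn_down"] = True  # cleanup runs on success *and* failure

with on_demand_cluster(n_workers=4) as cluster:
    assert cluster["ready"]          # inside the block, the cluster is usable
```

The `try`/`finally` around the `yield` is what guarantees teardown even when the body raises, mirroring the "successfully or unsuccessfully" cleanup behavior described above.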



&lt;p&gt;If you’re familiar with Ray or Sematic, you likely know that both can be used locally as well as in the cloud. Sematic’s Ray integration is no exception! When you execute the code above locally, a local, process-based Ray cluster is created instead of one running on Kubernetes. This enables rapid local development, where you can use all of your favorite debuggers and other tools until you’re ready to move execution to the cloud.&lt;/p&gt;

&lt;h2&gt;Unlocking New Use Cases&lt;/h2&gt;

&lt;p&gt;This combination of Sematic + Ray can jumpstart your journey to a world-class ML platform. Using these tools together, your Sematic Functions can now do things such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run quick and efficient distributed training of a &lt;a href="https://pytorch.org/"&gt;PyTorch&lt;/a&gt; image classifier using &lt;a href="https://lightning.ai/docs/pytorch/stable/"&gt;PyTorch Lightning&lt;/a&gt; and Ray.&lt;/li&gt;
&lt;li&gt;Perform distributed hyperparameter tuning of a &lt;a href="https://www.tensorflow.org/"&gt;TensorFlow&lt;/a&gt; natural language model using &lt;a href="https://docs.ray.io/en/latest/tune/index.html"&gt;Ray Tune&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Do distributed data processing and ingestion with &lt;a href="https://docs.ray.io/en/latest/data/dataset.html"&gt;Ray Datasets&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And you can do all these things while taking advantage of Sematic’s lineage tracking, visualization, and orchestration capabilities.&lt;/p&gt;

&lt;h2&gt;A Look Behind the Scenes&lt;/h2&gt;

&lt;p&gt;When you use &lt;code&gt;RayCluster&lt;/code&gt; as above, there’s a lot going on behind the scenes. Sematic uses &lt;a href="https://docs.ray.io/en/latest/cluster/kubernetes/index.html"&gt;KubeRay&lt;/a&gt;, a tool developed by the maintainers of Ray to manage Ray clusters within Kubernetes. Your code execution will result in calls to the Sematic server, which will in turn publish information about the required cluster to KubeRay. KubeRay will then create and manage the new Ray cluster for you.&lt;/p&gt;

&lt;p&gt;Since Sematic knows all about your code and what container image it’s using, it can ensure that KubeRay uses that same image for the Ray head and workers. This means that you don’t have to worry about any new dependency management when using Ray from Sematic – any code that can be used from your Sematic Functions can be used from Ray, even without using Ray’s &lt;a href="https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments"&gt;Runtime environments&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sJpoUyMo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wjx9olrh8yllkuwqf7zu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sJpoUyMo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wjx9olrh8yllkuwqf7zu.png" alt="Architecture of Ray + Sematic" width="512" height="419"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;em&gt;Architecture of Ray + Sematic&lt;/em&gt;



&lt;h2&gt;Learning More&lt;/h2&gt;

&lt;p&gt;If you want to know more about Sematic’s Ray integration, you can check out &lt;a href="https://docs.sematic.dev/integrations/ray"&gt;our docs&lt;/a&gt;. If you’re looking for something more hands-on, check out one of our examples doing distributed training and evaluation using Ray from Sematic. One uses PyTorch Lightning to &lt;a href="https://github.com/sematic-ai/sematic/blob/main/sematic/examples/lightning_resnet/README.md"&gt;do distributed training of a ResNet model&lt;/a&gt;, and another uses Ray’s AIR APIs (including Ray Datasets) to &lt;a href="https://github.com/sematic-ai/sematic/blob/main/sematic/examples/cifar_classifier/README.md"&gt;do distributed training of a simple image classifier&lt;/a&gt; on the &lt;a href="https://www.cs.toronto.edu/~kriz/cifar.html"&gt;CIFAR10 dataset&lt;/a&gt;. You can also join our Discord if you’d like to ask some questions. We’re always happy to help!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sematic’s Ray integration is part of Sematic’s paid “Enterprise Edition”. Get in touch if you’d like to use it! Rather play around with Sematic for free first? Most of it is &lt;a href="https://github.com/sematic-ai/sematic"&gt;free and open-source!&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>datascience</category>
      <category>devops</category>
    </item>
    <item>
      <title>Implementing Deep Links in React with Atoms</title>
      <dc:creator>Chance An</dc:creator>
      <pubDate>Fri, 03 Feb 2023 22:38:36 +0000</pubDate>
      <link>https://forem.com/sematic/implementing-deep-links-in-react-with-atoms-289d</link>
      <guid>https://forem.com/sematic/implementing-deep-links-in-react-with-atoms-289d</guid>
      <description>&lt;h2&gt;What are deep links and why are they useful?&lt;/h2&gt;

&lt;p&gt;Broadly speaking, deep links are a mobile platform (Android/iOS) concept that lets users reach a specific in-app location by following a hyperlink. When this idea is generalized back to single-page applications (SPAs) on the web, it refers to using a URL enriched with additional information to guide a user to a specific state of the SPA.&lt;/p&gt;

&lt;p&gt;It allows users to return to the exact spot they were at, even after closing the app. Users can also share deep links with others to reproduce what the original user sees or refers to.&lt;/p&gt;

&lt;p&gt;So we need to persist the UI state in the URL. But which state should be persisted? There is a limit on how many characters you can put in a URL, and different browsers and server software enforce different limits. To be on the safe side, one should not design a URL that exceeds 2048 bytes in length.&lt;/p&gt;

&lt;p&gt;We cannot encode every piece of state a SPA uses into the URL. However, the user interactions applied to the app are bounded in size, so we can encode interaction-related information into the URL and recover the application state as if the user had already performed the same interactions.&lt;/p&gt;

&lt;p&gt;Let's look at two examples:&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://sematic.dev" rel="noopener noreferrer"&gt;Sematic&lt;/a&gt;, when a user selects a nested run, we want to be able to share what the user sees with someone else through a deep link. When other users open the website using the deep link, they will see the same nested run selected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23tdf16q5y8abhsmbtte.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23tdf16q5y8abhsmbtte.png" alt="Selecting a nested run in the Sematic Dashboard." width="726" height="1084"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The deep link has the following form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://sematic.host/pipelines/pipeline_name/[root_run_id]#run=[nested_run_id]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Separately, when the user selects a view-specific tab on the run details page, we also hope the deep link reflects this selection. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6wj7iv1nye8nctzzrmb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6wj7iv1nye8nctzzrmb.png" alt="Run tabs in the Sematic Dashboard." width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this purpose, we add the &lt;code&gt;tab=tab_name&lt;/code&gt; hash segment into the URL with the following form.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://sematic.host/pipelines/pipeline_name/[root_run_id]#run=[nested_run_id]&amp;amp;tab=[tab]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://sematic.host/pipelines/pipeline_name/864ae3#run=33d7f&amp;amp;tab=output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
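The hash segments above are just query-string syntax inside the URL fragment, so they can be decoded with standard URL utilities. Here is a small Python illustration of the scheme (Sematic's front end itself is TypeScript; this sketch only demonstrates the encoding):

```python
from urllib.parse import urlsplit, parse_qs

url = "https://sematic.host/pipelines/pipeline_name/864ae3#run=33d7f&tab=output"

# The fragment (everything after '#') never leaves the browser.
fragment = urlsplit(url).fragment
print(fragment)                    # run=33d7f&tab=output

# It parses like an ordinary query string.
state = parse_qs(fragment)
print(state)                       # {'run': ['33d7f'], 'tab': ['output']}
```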



&lt;h2&gt;Why use hashes instead of paths or query strings?&lt;/h2&gt;

&lt;p&gt;The other alternatives are to encode the information using query strings or as a part of the URL path segments.&lt;/p&gt;

&lt;p&gt;URL paths are reserved for React Router. We want to keep deep linking and React Router operations separate to maintain a clean separation of concerns; this simplifies React Router's work and avoids potential re-routing due to path changes.&lt;/p&gt;

&lt;p&gt;We didn't choose query strings, based on a narrow interpretation of the difference between a query string and a hash:&lt;/p&gt;

&lt;p&gt;A hash works as an anchor within the webpage; conventionally, changing the URL hash should not trigger UI mutations, although a navigational shift, such as scrolling to a designated location, is typical for hash changes. Jumping to a specific view of a business entity fits these semantics, which is one reason to adopt the hash for deep linking.&lt;/p&gt;

&lt;p&gt;Another reason for choosing the hash is that the Jotai library we adopted directly supports an easy API for manipulating and syncing with the URL hash, which we will elaborate on next.&lt;/p&gt;

&lt;p&gt;Thirdly, hash fragments are not sent to the server, and browsers are usually more generous about URL length than server applications, so we have more room to store deep-link information in the hash fragment if we need to.&lt;/p&gt;

&lt;p&gt;Meanwhile, we also reserve the query strings to store feature flags, view settings, debug options, etc.&lt;/p&gt;

&lt;p&gt;It is worth noting that storing deep-link information in query parameters is also reasonable; the Recoil library, for example, supports both.&lt;/p&gt;

&lt;p&gt;To summarize, in Sematic, we assign the following distinct responsibilities to different URL components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Path&lt;/strong&gt;: reserved for routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query string parameters&lt;/strong&gt;: feature flags and view settings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hash&lt;/strong&gt;: deep links &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The Jotai and Recoil libraries&lt;/h2&gt;

&lt;p&gt;Before talking about Jotai, let us first introduce &lt;a href="https://recoiljs.org" rel="noopener noreferrer"&gt;Recoil&lt;/a&gt;, from which the Jotai library gets a lot of inspiration. &lt;/p&gt;

&lt;p&gt;An Atom from Recoil is a unit of state declared outside of the React component hierarchy. Different React components can subscribe to changes of such an atom and can mutate the atom's state. A mutation of the atom is propagated to each subscribing component and triggers its re-rendering. Using a Recoil atom is a way of sharing global state across multiple components regardless of their locations in the component tree.&lt;/p&gt;

&lt;p&gt;How does this compare to &lt;a href="https://reactjs.org/docs/context.html" rel="noopener noreferrer"&gt;React Context&lt;/a&gt;? &lt;/p&gt;

&lt;p&gt;The React Context API is more verbose, requiring context providers to be injected into the component ancestry, and it sometimes causes unnecessary re-renders. One way to avoid those is to split the state into smaller, dispersed contexts, but this can lead to a pyramid of doom of nested provider declarations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;context1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Provider&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;value1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;context2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Provider&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;value2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;context3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Provider&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;value3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;context4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Provider&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;value4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;context5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Provider&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;value5&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

        &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/context5.Provider&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/context4.Provider&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/context3.Provider&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/context2.Provider&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/context1.Provider&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An Atom is more lightweight. It can be declared individually. As its name suggests, it is atomic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;todoListState&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;atom&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;TodoList&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;TodoList&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;todoList&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useRecoilValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;todoListState&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* &amp;lt;TodoListStats /&amp;gt; */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="cm"&gt;/* &amp;lt;TodoListFilters /&amp;gt; */&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
     &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;TodoItemCreator&lt;/span&gt; &lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;

     &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;todoList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;todoItem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;TodoItem&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;todoItem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;todoItem&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;     &lt;span class="p"&gt;))}&lt;/span&gt;
   &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://jotai.org" rel="noopener noreferrer"&gt;Jotai&lt;/a&gt; is another library that shares a lot of features with Recoil. This article will focus on one particular hook &lt;a href="https://jotai.org/docs/integrations/location#atom-with-hash" rel="noopener noreferrer"&gt;&lt;code&gt;atomWithHash()&lt;/code&gt;&lt;/a&gt; in the jota-location integration, which allows us to manipulate URL hash segments and subscribe to their changes directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useAtom&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;jotai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;atomWithHash&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;jotai-location&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;countAtom&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;atomWithHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;count&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;Counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setCount&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useAtom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;countAtom&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
     &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;     &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt; &lt;span class="nx"&gt;onClick&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setCount&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/button&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;   &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;code&gt;useAtom&lt;/code&gt; hook returns the current value of a hash segment and a state updater function. The hook consumers will be notified of the updated value in a new rendering cycle whenever the hash segment changes. The state updater function updates the atom's value and the URL's corresponding hash fragment. &lt;/p&gt;

&lt;p&gt;In Sematic, we find the atom mechanism a good fit for a global state utility. Jotai is more streamlined, with updated, ergonomic APIs, and we have adopted Jotai in the front-end application of Sematic.&lt;/p&gt;

&lt;h2&gt;Implementation of deep links&lt;/h2&gt;

&lt;p&gt;With the help of &lt;code&gt;jotai&lt;/code&gt;, implementing deep links becomes easy. For instance, to generate a deep link for the selected nested run, we store the selected run ID in a Jotai hash atom. Whenever that atom changes, the corresponding hash segment linked to the atom is automatically updated in the URL, and the user can copy the updated URL from the browser address bar as the new deep link.&lt;/p&gt;

&lt;p&gt;When a user opens a deep link in a new browser tab, the atom's initial state is hydrated with the value from the URL hash segment. The consumer components thus retrieve the selected run ID encoded in the deep link's hash, which then drives subsequent rendering. &lt;/p&gt;
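&lt;p&gt;Conceptually, hydration boils down to parsing the hash segment on first load. A minimal standalone sketch (Jotai does this internally; the &lt;code&gt;run&lt;/code&gt; key and value here are hypothetical):&lt;/p&gt;

```javascript
// Sketch: hydrating initial state from a deep link's hash segment.
function hydrateFromHash(hash) {
  // Treat the hash as a query-like string, e.g. "#run=864ae3".
  const params = new URLSearchParams(hash.replace(/^#/, ""));
  return params.get("run");
}

// A deep link opened in a new tab (hypothetical key and value):
const selectedRunId = hydrateFromHash("#run=864ae3");
```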

&lt;p&gt;During the implementation, we also learned about the following caveats, which may interest readers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Serialization
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;atomWithHash()&lt;/code&gt; has config options for serializing the state value into the hash string and deserializing it back. By default, it uses JSON serialization and deserialization. If the atom's value is a string, it therefore appears as "value" (quotes included) in the hash string, because a JSON-serialized string retains its quote marks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr43u7y25ovt1rljknnvq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr43u7y25ovt1rljknnvq.png" alt="Jsonfying a Javascript string in the console of Chrome developer tools." width="414" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A straight string-to-string conversion will eliminate the quote marks. &lt;/p&gt;

&lt;p&gt;With default serialization, we have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://sematic.host/pipelines/pipeline_name/864ae3#tab="output"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After using custom serialization/deserialization functions, we have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://sematic.host/pipelines/pipeline_name/864ae3#tab=output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
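&lt;p&gt;The difference comes down to the serialization functions. A sketch of the two behaviors, shown as standalone functions (in practice they would be supplied via the &lt;code&gt;serialize&lt;/code&gt;/&lt;code&gt;deserialize&lt;/code&gt; options of &lt;code&gt;atomWithHash()&lt;/code&gt;):&lt;/p&gt;

```javascript
// Default behavior: JSON serialization keeps the quote marks on strings.
const jsonSerialize = (value) => JSON.stringify(value);

// Custom behavior for string-valued atoms: straight string-to-string,
// which eliminates the quote marks in the hash.
const stringSerialize = (value) => value;
const stringDeserialize = (hashValue) => hashValue;

jsonSerialize("output");   // '"output"' -> tab="output" in the hash
stringSerialize("output"); // 'output'   -> tab=output in the hash
```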



&lt;h2&gt;
  
  
  Hash change with unwanted intermediate states
&lt;/h2&gt;

&lt;p&gt;Once Jotai takes over certain hash segments through atoms, in most cases one should not touch the hash in the URL directly; instead, use the atom updater function to drive hash changes. There are, however, edge cases where you have to change a hash segment manually, namely when you need to update the URL path component at the same time. This can happen when you want to navigate to a new page with an alternative hash value. If you use the updater function to change the hash and then follow it with a URL path change, you push two records onto the browser history stack. The problem surfaces when a user clicks the browser's back button, which lands them on an invalid intermediate page state.&lt;/p&gt;

&lt;p&gt;Analyzing Jotai's &lt;a href="https://github.com/jotaijs/jotai-location/blob/9debe926481f27b21617c587a3a6442ff0bad58d/src/atomWithHash.ts#L88" rel="noopener noreferrer"&gt;source code&lt;/a&gt; leads us to a solution. Jotai uses &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/URLSearchParams" rel="noopener noreferrer"&gt;&lt;code&gt;URLSearchParams&lt;/code&gt;&lt;/a&gt; to manipulate the hash portion of the URL. This approach changes only a specific sub-component of the hash instead of rewriting the entire hash string. When navigating between pages, we usually only need to change one hash sub-component while keeping the other sub-components in the deep link untouched. With that principle in mind, whenever you have to combine a path change with a hash change, do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Grab the current hash component of the URL&lt;/li&gt;
&lt;li&gt;Utilize &lt;code&gt;URLSearchParams&lt;/code&gt; to replace the value of the specific hash sub-component within the full hash string.&lt;/li&gt;
&lt;li&gt;Read the revised hash component back via &lt;code&gt;URLSearchParams.toString()&lt;/code&gt;, combine it with the new path value, and update the URL in a single atomic operation.&lt;/li&gt;
&lt;/ol&gt;
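&lt;p&gt;The three steps above can be sketched as follows (the function and key names are illustrative):&lt;/p&gt;

```javascript
// Replace one hash sub-component, keep the rest, and combine the result
// with the new path so the URL can be updated in a single history entry
// (e.g. via history.pushState(null, "", url) in the browser).
function buildUrl(currentHash, key, value, newPath) {
  // 1. Grab the current hash component, without the leading "#".
  const params = new URLSearchParams(currentHash.replace(/^#/, ""));
  // 2. Replace the value of the targeted hash sub-component.
  params.set(key, value);
  // 3. Read the revised hash back and combine it with the new path.
  return newPath + "#" + params.toString();
}

buildUrl("#tab=output", "tab", "logs", "/pipelines/other_pipeline");
// -> "/pipelines/other_pipeline#tab=logs"
```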

&lt;p&gt;Jotai's &lt;code&gt;atomWithHash&lt;/code&gt; supports a &lt;code&gt;replaceState&lt;/code&gt; option, which controls whether hash updates create new browser history records. Unfortunately, it is an atom-level configuration; if it could be specified on a per-update basis, it would solve the issue above neatly.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you implement deep linking in your project?
&lt;/h3&gt;

&lt;p&gt;Reach out to us on &lt;a href="https://discord.gg/4KZJ6kYVax" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; to discuss deep links or other topics.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/notsidney/how-and-why-you-should-store-react-ui-state-in-the-url-34pi"&gt;https://dev.to/notsidney/how-and-why-you-should-store-react-ui-state-in-the-url-34pi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://recoiljs.org/docs/recoil-sync/url-persistence/#part-of-the-url" rel="noopener noreferrer"&gt;https://recoiljs.org/docs/recoil-sync/url-persistence/#part-of-the-url&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jotai.org/" rel="noopener noreferrer"&gt;https://jotai.org/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>career</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>What is “production” Machine Learning?</title>
      <dc:creator>Emmanuel Turlay</dc:creator>
      <pubDate>Wed, 18 Jan 2023 23:28:31 +0000</pubDate>
      <link>https://forem.com/sematic/what-is-production-machine-learning-22mk</link>
      <guid>https://forem.com/sematic/what-is-production-machine-learning-22mk</guid>
      <description>&lt;p&gt;In traditional software development, “production” typically refers to instances of an application that are used by actual users – human or machine.&lt;/p&gt;

&lt;p&gt;Whether it’s a web application, an application embedded in a device or machine, or a piece of infrastructure, production systems receive real-world traffic and are supposed to accomplish their mission without issues.&lt;/p&gt;

&lt;p&gt;Production systems usually come with these guarantees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Safety&lt;/strong&gt; – Before the application is deployed in production, it is thoroughly tested by an extensive suite of unit tests, integration tests, and sometimes manual tests. Scenarios covering happy-paths and corner cases are checked against expected results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traceability&lt;/strong&gt; – A record is kept of exactly what code was deployed, by whom, at what time, with what configurations, and to what infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; – Once the system is in production, it is possible to access logs, observe real-time resource usage, and get alerted when certain metrics veer outside of acceptable bounds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt; – The production system is able to withstand the expected incoming traffic and then some. If necessary, it is capable of scaling up or down based on demand and cost constraints.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is a production ML system?
&lt;/h2&gt;

&lt;p&gt;Some ML models are deployed and served by an endpoint (e.g. REST, gRPC), or directly embedded inside a larger application. They generate inferences on demand for each new set of features sent their way (i.e. real-time inferencing).&lt;/p&gt;

&lt;p&gt;Others are developed for the purpose of generating a set of inferences and persisting them in a database table as a one-time task (i.e. batch inferencing). For example, a model can predict a set of customers’ lifetime values and write them to a table for consumption by other systems (metrics dashboards, downstream models, production applications, etc.).&lt;/p&gt;

&lt;p&gt;Whether built for real-time or batch inferencing, a production ML system refers to the end-to-end training and inferencing pipeline: the entire chain of transformations that turns raw data sitting in a data warehouse into a trained model, which is then used to generate inferences.&lt;/p&gt;
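&lt;p&gt;As a toy sketch of that chain of transformations (the steps and names are hypothetical stand-ins, not a real training pipeline):&lt;/p&gt;

```python
# Toy sketch of an end-to-end pipeline: raw data -> features -> model
# -> inferences. Each function is a hypothetical stand-in for a real step.
def extract(warehouse_rows):
    # Pull usable records out of the raw warehouse data.
    return [row for row in warehouse_rows if row is not None]

def featurize(records):
    # Turn raw records into numeric features.
    return [float(r) for r in records]

def train(features):
    # Stand-in "model": just the mean of the features.
    return sum(features) / len(features)

def predict(model, feature):
    # The same trained model serves batch or real-time inferencing.
    return model * feature

def pipeline(warehouse_rows):
    features = featurize(extract(warehouse_rows))
    model = train(features)
    return [predict(model, f) for f in features]
```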

&lt;h2&gt;
  
  
  What guarantees for production ML systems?
&lt;/h2&gt;

&lt;p&gt;We saw at the top what guarantees are expected from a traditional software system. What similar guarantees should we expect from production-grade ML systems?&lt;/p&gt;

&lt;h3&gt;
  
  
  Safety
&lt;/h3&gt;

&lt;p&gt;In ML, safety means having a high level of certainty that inferences produced by a model fall within the bounds of expected and acceptable values, and do not endanger users – human or machine.&lt;/p&gt;

&lt;p&gt;For example, safety means that a self-driving car will not drive dangerously on the road, that a facial recognition model will not show biases, or that a chatbot will not generate abusive messages.&lt;/p&gt;

&lt;p&gt;Safety in ML systems can be guaranteed in the following ways.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unit testing&lt;/strong&gt; – Each piece of code used to generate a trained model should be unit-tested. Data transformation functions, data sampling strategies, evaluation methods, etc. should be confronted with both happy-path inputs and a reasonable set of corner cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model testing, simulation&lt;/strong&gt; – After the model is trained and evaluated, real production data should be sent to it to establish an estimate of how the model will behave once deployed. This can be achieved by e.g. sending a small fraction of live production traffic to a candidate model and monitoring inferences, or by subjecting the model to a set of must-pass scenarios (real or synthetic) to ensure that no regressions are introduced.&lt;/li&gt;
&lt;/ul&gt;
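&lt;p&gt;For instance, a unit test for a data transformation step might look like this (&lt;code&gt;normalize&lt;/code&gt; is a hypothetical example function):&lt;/p&gt;

```python
def normalize(values):
    """Scale values to the [0, 1] range; empty input stays empty."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def test_normalize():
    assert normalize([0, 5, 10]) == [0.0, 0.5, 1.0]  # happy path
    assert normalize([]) == []                       # corner case: empty input
    assert normalize([3, 3]) == [0.0, 0.0]           # corner case: constant input
```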

&lt;h3&gt;
  
  
  Traceability
&lt;/h3&gt;

&lt;p&gt;In ML, beyond simply tracking what model version was deployed, it is crucial to enable exhaustive so-called Lineage Tracking.&lt;/p&gt;

&lt;p&gt;End-to-end Lineage Tracking means a complete bookkeeping of ALL the assets and artifacts involved in the production of every single inference.&lt;/p&gt;

&lt;p&gt;This means tracking&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The raw dataset used as input&lt;/li&gt;
&lt;li&gt;All of the intermediate data transformation steps&lt;/li&gt;
&lt;li&gt;Configurations used at every step: featurization, training, evaluation, etc.&lt;/li&gt;
&lt;li&gt;The actual code used at every step&lt;/li&gt;
&lt;li&gt;What resources the end-to-end pipeline used (map/reduce clusters, GPU types, etc.)&lt;/li&gt;
&lt;li&gt;Inferences generated by the deployed model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;as well as the lineage relationships between those.&lt;/p&gt;
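&lt;p&gt;To make the idea concrete, here is a sketch of what a lineage record might capture for a single pipeline step (the field names are illustrative, not an actual schema):&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Illustrative lineage record for one pipeline step; real systems also
# track the lineage edges between such records.
@dataclass
class LineageRecord:
    step_name: str
    code_version: str            # e.g. a git SHA
    config: dict                 # featurization, training, evaluation settings
    input_artifact_ids: list     # datasets / models consumed by this step
    output_artifact_ids: list    # artifacts produced by this step
    resources: dict = field(default_factory=dict)  # cluster, GPU type, etc.

record = LineageRecord(
    step_name="train",
    code_version="abc1234",
    config={"learning_rate": 0.01},
    input_artifact_ids=["dataset-v3"],
    output_artifact_ids=["model-v7"],
)
```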

&lt;p&gt;Lineage Tracking enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auditability&lt;/strong&gt; – If a deployed model leads to undesired behaviors, an in-depth post-mortem investigation is made possible by using lineage data. This is especially important for models performing safety critical tasks (e.g. self-driving cars) where legal and compliance requirements demand auditability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging&lt;/strong&gt; – It is virtually impossible to debug a model without knowing what data, configuration, code, and resources were used to train it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproducibility&lt;/strong&gt; – See below.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reproducibility
&lt;/h3&gt;

&lt;p&gt;If a particular inference cannot be reproduced from scratch (within stochastic variations) starting from raw data, the corresponding model should arguably not be used in production. This would be like deploying an application when you had lost the source code with no way to retrieve it.&lt;/p&gt;

&lt;p&gt;Without the ability to reproduce a particular trained model, there is no way to explain the model and its inferences, and debugging production issues becomes impossible.&lt;/p&gt;

&lt;p&gt;Additionally, reproducibility enables rigorous experimentation. The entire end-to-end pipeline can be re-run while changing a single degree of freedom at a time (e.g. input data selection, sampling strategy, hyper-parameters, hardware resources, training code, etc.)&lt;/p&gt;
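&lt;p&gt;In practice, such an experiment is just a rerun of the pipeline with one configuration key changed (the keys here are illustrative):&lt;/p&gt;

```python
# With a reproducible pipeline, an experiment is a rerun that varies
# exactly one degree of freedom while everything else stays fixed.
base_config = {
    "sampling": "uniform",
    "learning_rate": 0.01,
    "gpu_type": "a100",
}

def with_override(config, key, value):
    """Copy the config and change a single degree of freedom."""
    new_config = dict(config)
    new_config[key] = value
    return new_config

experiment = with_override(base_config, "learning_rate", 0.1)
# base_config is untouched; only learning_rate differs in the experiment.
```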

&lt;h3&gt;
  
  
  Automation
&lt;/h3&gt;

&lt;p&gt;ML models are not one-and-done projects. Oftentimes, the characteristics of the training data change over time, and the model needs to be retrained recurrently to pick up on new trends.&lt;/p&gt;

&lt;p&gt;For example, when real-world conditions change (e.g. COVID, macro-economic changes, user trends, etc.), models can lose predictive power if they are not retrained frequently.&lt;/p&gt;

&lt;p&gt;This frequent retraining can only be achieved if an end-to-end pipeline exists. If engineers can simply run or schedule a pipeline pointing to new data, and all steps are automated (data processing, featurization, training, evaluation, testing, etc.) then refreshing a model is no harder than deploying a web app.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Sematic provides production-grade guarantees
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://sematic.dev" rel="noopener noreferrer"&gt;Sematic&lt;/a&gt; is an open-source framework to build and execute end-to-end ML pipelines of arbitrary complexity.&lt;/p&gt;

&lt;p&gt;Sematic helps build production-grade pipelines by ensuring the guarantees listed above in the following way, without requiring any additional work.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Safety&lt;/strong&gt; – In Sematic, all pipeline steps are just Python functions. Therefore, they can be unit-tested as part of a CI pipeline. Model testing/simulation is enabled by simply adding downstream steps after the model training and evaluation steps. Trained models can be subjected to real or simulated data to ensure the absence of regressions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traceability&lt;/strong&gt; – Sematic keeps an exhaustive lineage graph of ALL assets consumed and produced by all steps in your end-to-end pipelines: inputs and outputs of all steps, code, third-party dependencies, hardware resources, etc. All these are visualizable in the Sematic UI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproducibility&lt;/strong&gt; – By enabling complete traceability of all assets, Sematic lets you re-execute any past pipeline with the same or different inputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt; – Sematic enables users to build true end-to-end pipelines, from raw data to deployed model. These can then be scheduled to pick up new data automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check us out at &lt;a href="https://sematic.dev" rel="noopener noreferrer"&gt;sematic.dev&lt;/a&gt;, &lt;a href="https://github.com/sematic-ai/sematic" rel="noopener noreferrer"&gt;star us on Github&lt;/a&gt;, and &lt;a href="https://discord.gg/4KZJ6kYVax" rel="noopener noreferrer"&gt;join us on Discord&lt;/a&gt; to discuss production ML.&lt;/p&gt;

</description>
      <category>welcome</category>
      <category>community</category>
    </item>
  </channel>
</rss>
