Forem: Jens Gerke

Preloading Ollama Models

Jens Gerke — Tue, 26 Mar 2024 15:44:18 +0000

A few weeks ago, I started using Ollama to run large language models (LLMs), and I've been really enjoying it. After getting the hang of it, I thought it was about time to try it out on one of our real-world cases (I'll share more about this later).

At Direktiv we use Kubernetes for all our deployments, and when I tried to run Ollama as a pod, I faced a couple of issues.

The first issue was that Ollama downloads models on demand, which is logical given that it supports multiple models. On startup, the required model has to be fetched first, and model sizes range from 1.5GB to 40GB. This really extends the time it takes until the container is ready to serve requests.

To start the download, you either make an API call or use the CLI to fetch the model you need. In a Kubernetes setup, you can easily handle this with a postStart lifecycle hook. So, here's a simple example of an Ollama deployment I put together:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:0.1.29
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
        lifecycle:
          postStart:
            exec:
              command: [ "/bin/sh", "-c", "ollama pull gemma:2b" ]

That went okay, but the startup problem remains: the lifecycle hook took ages to run, and it won't work on Kubernetes nodes without internet access. At Direktiv we also use Knative a lot, and it does not support lifecycle hooks. So, my plan was to create a container using the Ollama image as a base with the model pre-downloaded.

The hiccup is that Ollama runs as an HTTP service with an API, and the pull command talks to that service. That makes it tricky to pull a model while building the container image, because no services are running during docker build, remember?

There have been a couple of GitHub issues pointing out this problem, and the usual workaround is to start an Ollama container, pull the model, and then copy the downloaded model files into a new container build. Personally, I don't find this process ideal for an automated build.

Got my developer gloves on and thought, "How hard can it be?" 🧤 Excited that all the download functions in the project were exported, but oh boy, the dependencies didn't play nice! Ended up having to copy and tweak the existing setup. Voila! Now we've got a neat little container for a multi-stage build. Check out the project here:

https://github.com/jensg-st/ollama-pull 💥

With this container, you can fetch the model in the first stage - in this scenario, it's gemma:2b. For the main container you can still use the default ollama/ollama image. The model simply needs to be copied from the downloader to the main container at /root/.ollama. You can even download multiple models in the first stage.

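# Stage 1: download the model with the ollama-pull helper image
FROM gerke74/ollama-model-loader AS downloader

RUN /ollama-pull gemma:2b

# Stage 2: the stock Ollama image, listening on all interfaces
FROM ollama/ollama

ENV OLLAMA_HOST="0.0.0.0"

# Copy the pre-downloaded model into Ollama's model directory
COPY --from=downloader /root/.ollama /root/.ollama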
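If you want to bake several models into one image, the Dockerfile can be generated instead of written by hand. This is only a minimal Python sketch: the function name render_dockerfile is my own, gemma:7b is just an illustrative second model, and it assumes that /ollama-pull can be run once per model in the first stage.

```python
def render_dockerfile(models):
    """Return Dockerfile text that pre-pulls every model in the first stage."""
    if not models:
        raise ValueError("at least one model is required")
    lines = ["FROM gerke74/ollama-model-loader AS downloader"]
    # Assumption: one RUN per model, so each download ends up in its own layer.
    lines += [f"RUN /ollama-pull {model}" for model in models]
    lines += [
        "FROM ollama/ollama",
        'ENV OLLAMA_HOST="0.0.0.0"',
        "COPY --from=downloader /root/.ollama /root/.ollama",
    ]
    return "\n".join(lines) + "\n"


# Illustrative model list; replace it with the models you actually need.
dockerfile_text = render_dockerfile(["gemma:2b", "gemma:7b"])
print(dockerfile_text, end="")
```

Writing dockerfile_text to a file named Dockerfile gives the same two-stage layout as above, just with one pull step per model.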

Let's build it and run it:

# Write the two-stage Dockerfile
cat << 'EOF' > Dockerfile
FROM gerke74/ollama-model-loader AS downloader
RUN /ollama-pull gemma:2b
FROM ollama/ollama
ENV OLLAMA_HOST="0.0.0.0"
COPY --from=downloader /root/.ollama /root/.ollama
EOF
# Build the image and expose Ollama's port 11434 on local port 11437
docker build -t gemma .
docker run -p 11437:11434 gemma
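Before sending a prompt, it can be useful to confirm that the model really is part of the image. Here is a minimal Python sketch, assuming the container started above is reachable on localhost:11437. It uses Ollama's /api/tags endpoint, which lists the locally available models; the function names are my own.

```python
import json
import urllib.request


def model_is_present(tags_payload, model):
    """Return True if `model` is listed in a decoded /api/tags response."""
    entries = tags_payload.get("models") or []
    return any(entry.get("name") == model for entry in entries)


def fetch_tags(base_url="http://localhost:11437"):
    """Fetch the list of local models from a running Ollama container."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


# Usage against the running container (needs the docker run from above):
#   print(model_is_present(fetch_tags(), "gemma:2b"))
```

If the check returns False, the COPY step in the Dockerfile probably did not pick up the model directory.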

The following curl command sends a question to the container. The model field has to match the model that was baked into the image, in this case gemma:2b.

curl http://localhost:11437/api/generate -d '{
  "model": "gemma:2b",
  "prompt": "Why is the sky blue?"
}'

The container responds with a stream of JSON objects, one per line, each carrying a small piece of the answer:

{"model":"gemma:2b","created_at":"2024-03-26T15:16:56.780177872Z","response":"The","done":false}
{"model":"gemma:2b","created_at":"2024-03-26T15:16:57.003156881Z","response":" sky","done":false}
{"model":"gemma:2b","created_at":"2024-03-26T15:16:57.223483082Z","response":" appears","done":false}
...
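Because the answer arrives in fragments, a client has to stitch the response fields back together. The following Python sketch shows one way to do that with the three lines from the output above; join_stream is a name I made up for this example.

```python
import json


def join_stream(lines):
    """Concatenate the `response` fragments of a streamed /api/generate reply."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between JSON objects
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object is marked with "done": true
    return "".join(parts)


sample = [
    '{"model":"gemma:2b","created_at":"2024-03-26T15:16:56.780177872Z","response":"The","done":false}',
    '{"model":"gemma:2b","created_at":"2024-03-26T15:16:57.003156881Z","response":" sky","done":false}',
    '{"model":"gemma:2b","created_at":"2024-03-26T15:16:57.223483082Z","response":" appears","done":false}',
]
print(join_stream(sample))  # prints: The sky appears
```

If you would rather get the whole answer as a single JSON object, the Ollama API also accepts "stream": false in the request body.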

Please feel free to comment if that was helpful or if something is not working. In the next few posts I will add some real-life functionality to this.

Knative Serverless in 2024

Jens Gerke — Wed, 20 Mar 2024 08:17:04 +0000

At Direktiv, we're big fans of Knative. It's not just for serverless – it's a fantastic deployment tool for Kubernetes too.

The project emphasizes its serverless nature, but in my opinion it also simplifies deployments in general: deploying, for example, an HTTP service comes down to one file that describes the whole service you want to provide to your applications or users.

I could provide a big overview of how Knative works, but in this little tutorial I want to show you the basic installation and configuration and how to deploy your first Knative service.

Installation
Configuration
Creating a Service

Knative Installation

Starting with Knative can be a bit daunting, especially when it comes to choosing the right installation method. There are two primary ways to install Knative: YAML-based installation and the Knative operator.

The YAML-based installation is straightforward, but it's somewhat limited in flexibility. If you need to modify configurations during installation, this method falls short. That's where the Knative operator comes in handy. The operator installs not only the serving component but also the eventing component if required, and it offers more flexibility and customization options.

To get started with the Knative operator, you can use the following command. Note that this manifest installs the operator into the default namespace:

kubectl apply -f https://github.com/knative/operator/releases/download/knative-v1.13.3/operator.yaml

Once you've executed the command to install the Knative operator, you should have the operator up and running with two pods in the default namespace.

To verify that the operator is running, run kubectl get pods. The output should list these two pods:

knative-operator-6d768fb7-xnjgs      
operator-webhook-7d6b54d78b-q66fh

Knative requires a network layer, and you have three options: Istio, Kourier, and Contour.

Istio: Istio is a powerful service mesh that provides advanced networking, security, and observability features.

Kourier: Kourier is purpose-built for Knative, providing a lightweight and efficient network layer specifically designed for serverless workloads.

Contour: Contour is a Kubernetes ingress controller that can also be used as the network layer for Knative. It provides basic routing and load balancing capabilities.

When deciding which option to choose, consider your specific environment, requirements, and preferences. At Direktiv, we typically opt for Contour due to its simplicity. However, your choice may vary depending on your use case and infrastructure setup.

In this tutorial we will use Contour as well:

kubectl apply --filename https://github.com/knative/net-contour/releases/download/knative-v1.13.0/contour.yaml

Contour installs an internal and external service in two namespaces. If external access to your Knative services isn't needed, you can optimize your setup by deleting the contour-external namespace. This eliminates the allocation of an unnecessary external IP within the cluster. Simply run the following command:

kubectl delete ns contour-external

Knative Configuration

Installing Knative with the operator and a single YAML file is convenient, but configuring it can be challenging, and I find the documentation a bit thin. So I will explain the basics here (although there is a lot more to it).

The basic YAML would look like the following snippet.

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving

Usually you want to configure the network layer, features and other settings in Knative. You can modify this file to change the settings during installation.

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: direktiv-knative
spec:
  ingress:
    contour:
      enabled: true
  deployments:
  - name: activator
    annotations:
      linkerd.io/inject: enabled
  config:
    features:
      multi-container: "enabled"  
      kubernetes.podspec-volumes-emptydir: "enabled"
      kubernetes.podspec-init-containers: "enabled"
    autoscaler:
      initial-scale: "0"
      allow-zero-initial-scale: "true"
      min-scale: "0"
    deployment:
      registries-skipping-tag-resolving: "kind.local,ko.local,dev.local,localhost:5000,localhost:31212"
    network:
      ingress-class: "contour.ingress.networking.knative.dev"

This YAML is a very simple installation file for Knative. Individual components can be addressed under deployments. These components can be activator, autoscaler, controller, webhook or autoscaler-hpa. In this YAML we are setting an annotation for the activator pod.

The config section holds the settings for Knative's ConfigMaps in Kubernetes. Each key under config corresponds to one of the following ConfigMaps, without the config- prefix:

config-autoscaler
config-defaults
config-deployment
config-domain
config-features
config-gc
config-leader-election
config-logging
config-network
config-observability
config-tracing
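To make the naming rule concrete, here is a tiny Python sketch that derives the ConfigMap names from the config keys used in the YAML above. The helper names are my own and the spec is reduced to the relevant keys; it only illustrates the naming convention and is not part of any Knative tooling.

```python
def configmap_name(config_key):
    """Map a key under spec.config to the ConfigMap it configures."""
    return f"config-{config_key}"


def configmaps_for(serving_spec):
    """List the ConfigMaps touched by the config section of a KnativeServing spec."""
    return sorted(configmap_name(key) for key in serving_spec.get("config", {}))


# Reduced version of the KnativeServing spec shown earlier.
spec = {
    "config": {
        "features": {"multi-container": "enabled"},
        "autoscaler": {"initial-scale": "0"},
        "deployment": {"registries-skipping-tag-resolving": "kind.local"},
        "network": {"ingress-class": "contour.ingress.networking.knative.dev"},
    }
}
print(configmaps_for(spec))
# ['config-autoscaler', 'config-deployment', 'config-features', 'config-network']
```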

After installation you can look up all the available settings in these ConfigMaps and tweak your Knative installation, for example by modifying timeouts and maximum connections.

The most important setting in this case is ingress-class: "contour.ingress.networking.knative.dev" under network. This has to be configured because we are using Contour as the network layer in this tutorial.

We recently had an installation where a proxy server was required. It took me a while to figure out how to set environment variables for the pods, so I'd like to share the snippet that does it:

...
  deployments:
  - name: controller
    env:
    - container: controller
      envVars:
      - name: HTTP_PROXY
        value: "http://myproxy:3128"
      - name: HTTPS_PROXY
        value: "http://myproxy:3128"
      - name: NO_PROXY
        value: ".svc,.default,.local,.cluster.local,localhost"
...

After applying the YAML with kubectl apply -f knative.yaml, the list of pods in the default namespace will look like this, and we are ready to install the first service.

knative-operator-6d768fb7-jthff                          1/1     Running   
operator-webhook-7d6b54d78b-75v46                        1/1     Running  
autoscaler-79d9fb98c-5mtnd                               1/1     Running  
controller-cdf856494-lv9qk                               1/1     Running  
webhook-dddf6fcff-jvdjc                                  1/1     Running    
autoscaler-hpa-7969f4f665-kdhv7                          1/1     Running     
activator-74cc7497c9-vqch9                               1/1     Running

Creating a Service

After setting up Knative it is time to run the first service. Applying the following file will create a Knative service.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    spec:
      containers:
      - image: direktiv/simple-hello
        env:
        - name: TARGET
          value: "Go Sample v1"

To check whether the service is up and running, execute kubectl get ksvc. The service will show up in the list of available Knative services together with its status.

NAMESPACE   NAME            URL                                              LATESTCREATED         LATESTREADY           READY   REASON
default     helloworld-go   http://helloworld-go.default.svc.cluster.local   helloworld-go-00001   helloworld-go-00001   True

By default Knative would also start one pod when the service is created, but because we have configured allow-zero-initial-scale: "true" and initial-scale: "0", the service is only prepared for consumption and no pod is started. A simple curl from inside the cluster will "activate" a pod though.

kubectl run -it --rm --restart=Never --image curlimages/curl curl-test -- curl http://helloworld-go.default

Maybe you have noticed the delay when calling the service. This happens when the service does a "cold start" with zero pods available.
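If you want to put a number on the cold start, you can time two requests in a row. This is a minimal Python sketch with helper names I made up. It has to run inside the cluster, because http://helloworld-go.default only resolves there, and the service must have scaled down to zero before the first call.

```python
import time
import urllib.request


def timed(fn):
    """Call fn() and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def call_service(url="http://helloworld-go.default"):
    """Send one GET request to the Knative service and return the HTTP status."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.status


# Usage inside the cluster:
#   _, cold = timed(call_service)   # first request, pod has to be started
#   _, warm = timed(call_service)   # second request, pod is already running
#   print(f"cold: {cold:.2f}s, warm: {warm:.2f}s")
```

The first measurement includes the time Knative needs to schedule and start a pod, so it should be noticeably larger than the second one.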

At the beginning I said that Knative is a great deployment tool even without the serverless component. We can configure the service to keep a minimum number of pods available at all times to avoid those cold starts. With that approach Knative can be used as a simplified Kubernetes deployment tool.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "1"
    spec:
      containers:
      - image: direktiv/simple-hello
        env:
        - name: TARGET
          value: "Go Sample v1"

The annotation autoscaling.knative.dev/min-scale sets the minimum number of pods to 1, meaning there is always at least one pod running at any given time.

I hope this quick introduction to Knative helps you get started. There is so much more to explore, for example traffic management and versioning, but I will write about that in a different post.

If you have any questions, just leave a comment!