Forem: Pejman Rezaei

Integrating MLflow with KubeFlow (Revised Edition)

Pejman Rezaei — Thu, 08 May 2025 16:39:46 +0000

MLflow is a robust open-source platform that simplifies the management of the machine learning lifecycle, including experimentation, reproducibility, and deployment. By integrating MLflow into Kubeflow, users can use MLflow’s intuitive UI and model registry to enhance their machine learning workflows.

In the modern enterprise landscape, the demand for streamlined and scalable Machine Learning Operations (MLOps) frameworks has never been greater. With increasing complexities in model development, tracking, deployment, and monitoring, organizations need tools that seamlessly integrate to ensure efficiency and reliability. MLflow and Kubeflow are two such tools that, when integrated, provide a robust end-to-end solution for managing machine learning workflows. MLflow excels in tracking experiments, managing model lifecycle, and maintaining a centralized model registry. On the other hand, Kubeflow offers scalable pipelines, distributed training capabilities, hyperparameter optimization, and production-grade model serving on Kubernetes. Together, these tools form a comprehensive framework for MLOps that supports continuous integration and deployment (CI/CD), enabling enterprises to automate workflows, improve collaboration between data science and engineering teams, and ensure models are delivered to production faster and with fewer errors. This tutorial will guide you through the detailed process of integrating MLflow and Kubeflow into an enterprise-level MLOps framework, focusing on scalability, reproducibility, and automation.

This framework ensures:

Scalability for high-demand ML workflows.
Automation of CI/CD pipelines.
Centralized tracking and monitoring.

Part 1

The first step is to set up a database: to use MLflow's tracking functionality with a relational database backend, you need a PostgreSQL (or another supported database) instance. Here’s a breakdown of why and how to set it up:

Why Use PostgreSQL with MLflow?

Experiment Tracking: MLflow uses a backend store to log experiments, runs, parameters, metrics, and artifacts. A relational database like PostgreSQL is a robust option for this purpose.
Scalability: Using a database allows you to efficiently manage and query large amounts of experiment data.
Persistence: A database ensures that your experiment data is stored persistently, even if the MLflow server is restarted.

Setting Up PostgreSQL for MLflow

Step 1: Deploy PostgreSQL in Your Kubernetes Cluster

You can deploy PostgreSQL using a Helm chart or a custom YAML configuration. Here’s a basic example using plain Kubernetes manifests:

Create MLflow namespace:

    kubectl create namespace mlflow

Encode the PostgreSQL password in base64:

echo -n 'MyPostgresPass.!QAZ' | base64
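The same encoding can be sketched in Python, which is handy when secrets are generated from a script. This is a minimal illustration, not part of the original setup: the helper names to_secret_value and from_secret_value are hypothetical. Like echo -n, it encodes the password without a trailing newline, which would otherwise become part of the stored password.

```python
import base64


def to_secret_value(plaintext: str) -> str:
    """Return the base64 text expected under `data:` in a Kubernetes Secret."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def from_secret_value(encoded: str) -> str:
    """Decode a value copied from a Secret manifest back to plain text."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    encoded = to_secret_value("MyPostgresPass.!QAZ")
    print(encoded)
    # Round trip: decoding must give back the original password.
    print(from_secret_value(encoded))
```

Decoding a value before applying a manifest is a quick way to confirm that the Secret really contains the password you intend to use.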

Create a YAML file named postgresql-deployment.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
  namespace: mlflow
data:
  postgresql-password: TXlQb3N0Z3Jlc1Bhc3MuIVFBWg==  # base64 of 'MyPostgresPass.!QAZ'
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
  namespace: mlflow
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-postgres
  namespace: mlflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow-postgres
  template:
    metadata:
      labels:
        app: mlflow-postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: postgresql-password
        - name: POSTGRES_DB
          value: mlflow
        ports:
        - containerPort: 5432
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        volumeMounts:
        - name: postgres-storage
          mountPath: /var/lib/postgresql/data
          subPath: pgdata
      volumes:
      - name: postgres-storage
        persistentVolumeClaim:
          claimName: postgres-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-postgres
  namespace: mlflow
spec:
  type: ClusterIP
  ports:
  - port: 5432
    targetPort: 5432
  selector:
    app: mlflow-postgres

Apply:

kubectl apply -f postgresql-deployment.yaml

Create user and db on postgres for mlflow:

To set up a PostgreSQL database for MLflow, you'll need to create a user, set a password, create a database, and grant the necessary permissions. Here’s how you can do it step by step in the PostgreSQL shell (psql):

Step-by-Step Commands

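Log into PostgreSQL:
First, open a psql session as a superuser (e.g., postgres). With the deployment above, you can run psql inside the PostgreSQL pod:
```
kubectl exec -it deploy/mlflow-postgres -n mlflow -- psql -U postgres
```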
Create a User:
Replace mlflow and your_password with your desired username and password.
```
CREATE USER mlflow WITH PASSWORD 'your_password';
```
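Create a Database:
Replace mlflow_db with your desired database name. Making the mlflow user the owner lets MLflow create its tables; on PostgreSQL 15 and later, a database-level GRANT alone no longer allows a non-owner to create tables in the public schema.
```
CREATE DATABASE mlflow_db OWNER mlflow;
```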
Grant Permissions:
Grant the necessary permissions to the user for the database:
```
GRANT ALL PRIVILEGES ON DATABASE mlflow_db TO mlflow;
```
Exit the PostgreSQL Shell:
After executing the commands, you can exit the psql shell:
```
\q
```

Summary of Commands

Putting it all together, here are the commands you would run in the PostgreSQL shell:

CREATE USER mlflow WITH PASSWORD 'your_password';
CREATE DATABASE mlflow_db OWNER mlflow;
GRANT ALL PRIVILEGES ON DATABASE mlflow_db TO mlflow;

Additional Considerations

Password Security: Make sure to use a strong password for your database user.
Database Connection: When configuring MLflow, use the following connection string format:
```
postgresql://mlflow:your_password@<host>:<port>/mlflow_db
```

Replace <host> and <port> with your PostgreSQL server's address and port (default is 5432).
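Passwords that contain characters such as @, /, : or ! need to be percent-encoded before they go into a connection URI. The sketch below shows one way to build the string; the function name backend_store_uri and its parameters are illustrative assumptions, not an MLflow API.

```python
from urllib.parse import quote_plus


def backend_store_uri(user, password, host, database, port=5432, driver=None):
    """Build a PostgreSQL connection URI with a percent-encoded user and password.

    `driver` is optional, e.g. "psycopg2" gives the postgresql+psycopg2:// form.
    """
    scheme = "postgresql" if driver is None else f"postgresql+{driver}"
    return (
        f"{scheme}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    # The "!" in the example password is encoded as %21.
    print(backend_store_uri("mlflow", "MyPostgresPass.!QAZ", "mlflow-postgres", "mlflow_db"))
```

If the password only contains letters, digits, and the characters _ . - ~, the encoded and plain forms are identical.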

With these steps, you should have a PostgreSQL user and database set up for MLflow, ready for use!

Storage backend

When considering security for your MLflow setup, both Ceph and MinIO can be configured to be secure, but they have different security features and considerations. Here’s a comparison to help you decide which might be more appropriate for your use case:

Using Ceph

Pros:

Robust Security Features: Ceph supports various security mechanisms, including:
- Authentication: Ceph can use CephX for authentication, ensuring that only authorized clients can access the storage.
- Encryption: Data can be encrypted both in transit (using TLS) and at rest.
- Access Control: You can set fine-grained access control policies to restrict who can access specific buckets or objects.
Scalability: Ceph is designed for scalability, making it suitable for large datasets and high availability.

Cons:

Complexity: Setting up and managing Ceph can be more complex compared to simpler object storage solutions.
Configuration Overhead: You may need to invest time in properly configuring security settings to ensure that your Ceph deployment is secure.

Using MinIO

Pros:

S3 Compatibility: MinIO is compatible with the S3 API, making it easy to integrate with applications designed for S3 storage.
Simplicity: MinIO is easier to set up and manage compared to Ceph, especially for smaller deployments.
Built-in Security Features: MinIO provides:
- Server-Side Encryption: You can enable server-side encryption for data at rest.
- TLS Support: MinIO supports TLS for secure data transmission.
- Access Policies: You can define bucket policies and user access controls.

Cons:

Less Feature-Rich: While MinIO is secure and robust, it may not have the same level of advanced features and scalability as Ceph for very large deployments.

Security Recommendations

For Ceph:

Enable CephX Authentication: Ensure that you are using CephX for authentication.
Use TLS: Configure TLS for secure data transmission.
Regular Audits: Regularly audit your Ceph configuration and access logs to detect any unauthorized access.

For MinIO:

Enable TLS: Always use TLS to encrypt data in transit.
Use Strong Access Keys: Generate strong access and secret keys for your MinIO instance.
Set Bucket Policies: Define strict bucket policies to control access to your data.

Conclusion

Both Ceph and MinIO can be configured to be secure, but your choice may depend on your specific needs:

Choose Ceph if you need a highly scalable, feature-rich solution and are willing to manage its complexity.
Choose MinIO if you prefer a simpler, S3-compatible solution that is easy to set up and manage while still providing solid security features.

For this configuration, we prefer MinIO over Ceph due to its simplicity and efficient resource allocation.


In this scenario, we will utilize MinIO as the storage backend for MLflow to manage and store artifacts. When considering MinIO, we have two options:

Using Standalone MinIO
Using MinIO which comes with Kubeflow installation

Using Standalone MinIO (skip this section if you want to use the MinIO that ships with Kubeflow)

Pros:

Isolation: Keeps MLflow and its storage independent, simplifying management.
Customization: Allows for tailored configurations specific to MLflow needs.
Version Control: Easier to manage updates and changes without affecting other components.

Cons:

Resource Duplication: Requires additional resources and management overhead.
Complexity: May complicate the deployment if not properly managed.

To install standalone MinIO, follow the steps below:

Step 1: Deploy MinIO in Your MLflow Namespace

Base64 Encode Your Keys:
The values for MINIO_ACCESS_KEY and MINIO_SECRET_KEY need to be base64 encoded. You can use the following command in your terminal:

echo -n 'myaccesskey' | base64
echo -n 'mysecretkey' | base64

Insert the Base64-encoded strings as the values of the MINIO_ACCESS_KEY and MINIO_SECRET_KEY entries in the Secret section of the minio-deploy.yaml file. This file includes a Deployment, Service, PersistentVolumeClaim (PVC), and Secret. The image is the latest MinIO release, which uses the MINIO_ROOT_USER and MINIO_ROOT_PASSWORD environment variables for the admin user and password of the MinIO installation. The file also configures a separate port for the console UI, so the dashboard is reachable independently of the API port.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  namespace: mlflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
      - name: minio
        image: minio/minio
        args:
          - server
          - /data
          - --console-address # set console ui a dedicated port
          - ":9001"
        ports:
        - containerPort: 9000
        - containerPort: 9001
        env:
        - name: MINIO_ROOT_USER
          valueFrom:
            secretKeyRef:
              name: minio-credentials
              key: MINIO_ACCESS_KEY
        - name: MINIO_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: minio-credentials
              key: MINIO_SECRET_KEY
        - name: MINIO_CONSOLE_PORT
          value: "9001"
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        volumeMounts:
        - name: minio-storage
          mountPath: /data # make minio storage persistent
      volumes:
      - name: minio-storage
        persistentVolumeClaim:
          claimName: minio-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: minio
  namespace: mlflow
spec:
  type: NodePort
  ports:
  - name: api
    port: 9000
    targetPort: 9000
  - name: ui
    port: 9001
    targetPort: 9001
  selector:
    app: minio
---
apiVersion: v1
kind: Secret
metadata:
  name: minio-credentials
  namespace: mlflow
type: Opaque
data:
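  MINIO_ACCESS_KEY: EyMzQuIV        # placeholder; replace with your own base64-encoded access key
  MINIO_SECRET_KEY: TUxQbGF0Zm9yb   # placeholder; replace with your own base64-encoded secret key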
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-pvc
  namespace: mlflow
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

Step 2: Access MinIO

Get the MinIO Service URL:

Inside the Kubernetes cluster, MinIO is reachable through its Service name. For local access to the console, you can use port-forwarding:

kubectl port-forward svc/minio -n mlflow 9001:9001

Step 3: Create a Bucket in MinIO

Using the MinIO Console:

After logging in to the console at http://localhost:9001 with the root credentials, you can create a bucket via the web interface.

Using mc (MinIO Client):

If you prefer the command line, install mc and port-forward the MinIO API port (9000, not the console port 9001):

```
kubectl port-forward svc/minio -n mlflow 9000:9000
```

Then create a bucket:

```
mc alias set mlflow-minio http://localhost:9000 <myaccesskey> <mysecretkey>
mc mb mlflow-minio/mlflow-bucket
```

Step 4: Create a User and Policy, Then Assign the Policy to the User

Create a policy with the following JSON document:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::mlflow-bucket"
    },
    {
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::mlflow-bucket/*"
    }
  ]
}

This policy enables basic read and write operations on the mlflow-bucket bucket and its contents.

The first statement allows the actions s3:GetBucketLocation and s3:ListBucket on the bucket itself, enabling the user to retrieve the bucket's location and list its contents.
The second statement permits the actions s3:PutObject, s3:GetObject, and s3:DeleteObject on all objects within the mlflow-bucket. This allows the user to upload, download, and delete objects stored in the bucket.
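If you manage more than one bucket, the policy document can be generated instead of edited by hand. The following is a minimal sketch that reproduces the two statements above for an arbitrary bucket name; the function names bucket_policy and write_policy are hypothetical.

```python
import json


def bucket_policy(bucket: str) -> dict:
    """Return a policy granting list access on the bucket and read/write on its objects."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                # Object-level actions apply to every key under the bucket.
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


def write_policy(bucket: str, path: str) -> None:
    """Write the policy as JSON so it can be uploaded through the console or mc."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(bucket_policy(bucket), handle, indent=2)


if __name__ == "__main__":
    print(json.dumps(bucket_policy("mlflow-bucket"), indent=2))
```

Keeping the bucket-level and object-level resources separate matters: s3:ListBucket applies to the bucket ARN, while the object actions apply to the ARN with the /* suffix.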

Step 5: Configure Istio

When using Istio in your Kubernetes cluster, you may need to consider Istio configurations for MinIO and MLflow to ensure proper traffic management, security, and observability.

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: minio-gateway
  namespace: mlflow
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 9000  # For MinIO API
      name: minio-api
      protocol: HTTP
    hosts:
    - "*"
  - port:
      number: 9001  # For MinIO Web UI
      name: minio-ui
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: minio
  namespace: mlflow
spec:
  hosts:
  - "*"
  gateways:
  - minio-gateway
  http:
  - match:
    - port: 9000  # Match for API requests
      uri:
        prefix: /
    route:
    - destination:
        host: minio
        port:
          number: 9000
  - match:
    - port: 9001  # Match for UI requests
      uri:
        prefix: /
    route:
    - destination:
        host: minio
        port:
          number: 9001

kubectl apply -f minio/minio-istio.yaml

Using MinIO which comes with Kubeflow installation

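Step 1: Configure the Network Policy

If you decide to use the MinIO instance that ships with Kubeflow as your MLflow storage backend, point MLflow at the minio-service Service in the kubeflow namespace (for example, set MLFLOW_S3_ENDPOINT_URL to http://minio-service.kubeflow.svc.cluster.local:9000).

Also, a NetworkPolicy in the kubeflow namespace allows traffic to MinIO only from the sources listed under From: in the output below: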

kubectl describe networkpolicy -n kubeflow minio

Name:         minio
Namespace:    kubeflow
Created on:   2025-04-28 14:20:07 +0330 +0330
Labels:       <none>
Annotations:  <none>
Spec:
  PodSelector:     app in (minio)
  Allowing ingress traffic:
    To Port: <any> (traffic allowed to all ports)
    From:
      NamespaceSelector: app.kubernetes.io/part-of in (kubeflow-profile)
    From:
      NamespaceSelector: kubernetes.io/metadata.name in (istio-system)
    From:
      PodSelector: <none>
  Not affecting egress traffic
  Policy Types: Ingress

Because we deploy MLflow in the mlflow namespace in this scenario, it does not match any of those From: sources, so its TCP connection to MinIO is dropped. We need to modify this NetworkPolicy to allow connections from MLflow to MinIO.
We can apply the change with a YAML manifest or with a patch:

Option 1:

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: minio
  namespace: kubeflow
spec:
  podSelector:
    matchLabels:
      app: minio
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              app.kubernetes.io/part-of: kubeflow-profile
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system
        - podSelector: {}         # keep the existing rule for pods inside the kubeflow namespace
        - namespaceSelector:      # NEW: allow mlflow namespace
            matchLabels:
              kubernetes.io/metadata.name: mlflow
      ports:
        - protocol: TCP
          port: 9000           # adjust if your MinIO listens on a different port
EOF

Option 2: Patch command

kubectl patch networkpolicy minio -n kubeflow --type='json' -p='[
  {
    "op": "add",
    "path": "/spec/ingress/0/from/-",
    "value": {
      "namespaceSelector": {
        "matchLabels": {
          "kubernetes.io/metadata.name": "mlflow"
        }
      }
    }
  }
]'
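After changing the policy, it helps to confirm from a pod in the mlflow namespace that MinIO is reachable. The sketch below is a plain TCP check; the function name can_connect is hypothetical, and the host in the usage example assumes the minio-service Service on port 9000. If the calling pod has an Istio sidecar, a successful connect may only show that the sidecar accepted the connection, so treat the result as a first hint rather than proof.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # Assumed Service name and port for the Kubeflow-provided MinIO.
    print(can_connect("minio-service.kubeflow.svc.cluster.local", 9000))
```

A dropped connection usually shows up as a timeout, so the call returns False only after the timeout has elapsed.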

Step 2: Create a Bucket in MinIO

Using mc (MinIO Client):

[Install](https://min.io/docs/minio/linux/reference/minio-mc.html#install-mc) mc and port-forward the MinIO service:

```
kubectl port-forward svc/minio-service -n kubeflow 9000:9000
```

Then create a bucket:

```
mc alias set minio-kf http://localhost:9000 <myaccesskey> <mysecretkey>
mc mb minio-kf/mlflow-bucket
```
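The bucket can also be created from Python with boto3, which MLflow itself uses for S3-compatible storage. This is a sketch under assumptions: the helper names ensure_bucket and minio_client are hypothetical, and the credentials must be allowed to list and create buckets (the MinIO root user is).

```python
def ensure_bucket(s3_client, bucket: str) -> bool:
    """Create `bucket` if it does not exist. Returns True when it was created."""
    existing = {item["Name"] for item in s3_client.list_buckets().get("Buckets", [])}
    if bucket in existing:
        return False
    s3_client.create_bucket(Bucket=bucket)
    return True


def minio_client(endpoint_url: str, access_key: str, secret_key: str):
    """Build a boto3 S3 client that talks to a MinIO endpoint."""
    import boto3  # imported lazily so ensure_bucket has no hard dependency on boto3

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
```

With the port-forward above running, calling ensure_bucket(minio_client("http://localhost:9000", "<myaccesskey>", "<mysecretkey>"), "mlflow-bucket") would create the bucket on the first call and do nothing on later calls.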

MLflow

Does MLflow Need MinIO?

MLflow does not strictly require MinIO; however, it does need a storage backend to store artifacts and models. Here are some options:

Local File Storage: You can use local paths to store artifacts, but this is not recommended for production environments due to scalability and persistence issues.
Object Storage:
- MinIO: If you prefer using an S3-compatible object storage service, MinIO is a popular choice for Kubernetes environments. It’s lightweight and easy to deploy.
- Amazon S3: If you have access to AWS, you can use S3 directly.
- Ceph Object Storage: If you have a Ceph cluster, you can use it as an object storage backend. Ceph provides an S3-compatible interface, allowing you to use it similarly to MinIO or AWS S3.
Database Storage: A relational database (e.g., PostgreSQL, MySQL) serves as the backend store for experiment metadata such as runs, parameters, and metrics; it does not store artifacts.

Setting Up MLflow

We will start by creating a Dockerfile. This step is essential because the default MLflow image lacks the boto3 and psycopg2-binary packages, which are necessary for connecting MLflow to MinIO and PostgreSQL:

FROM ghcr.io/mlflow/mlflow:latest

RUN pip install psycopg2-binary boto3

CMD ["mlflow", "server"]

Then build:

docker build -t prezaei/mlflow-custom:v1.0 .

And deploy MLflow on Kubernetes by creating your own deployment YAML files.

Note that Kubernetes expands $(VAR) references in value: fields only for variables defined earlier in the same env list; a reference it cannot resolve is left as the literal string. In the deployment shown later, POSTGRES_PASSWORD is defined after BACKEND_STORE_URI, so $(POSTGRES_PASSWORD) would stay as the literal text "$(POSTGRES_PASSWORD)" instead of the actual password. An env value like this therefore does not work there:


name: BACKEND_STORE_URI
value: "postgresql+psycopg2://mlflow:$(POSTGRES_PASSWORD)@mlflow-postgres:5432/mlflow_db"

To avoid depending on the order of the env entries, construct the full URI inside the container, where the shell expands ${POSTGRES_PASSWORD} at start-up.
Change command: and args: like this:

command: ["sh", "-c"]
args:
  - |
    mlflow server \
      --host=0.0.0.0 \
      --port=5000 \
      --backend-store-uri=postgresql+psycopg2://mlflow:${POSTGRES_PASSWORD}@mlflow-postgres:5432/mlflow_db \
      --default-artifact-root=s3://mlflow-bucket

Here’s a basic example using a deployment:

apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
  namespace: mlflow
spec:
  selector:
    app: mlflow
  ports:
    - protocol: TCP
      port: 5000
      targetPort: 5000
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mlflow-sa
  namespace: mlflow
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
  namespace: mlflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      serviceAccountName: mlflow-sa
      containers:
      - name: mlflow
        image: prezaei/mlflow-custom:v1.0
        ports:
          - containerPort: 5000
        env:
          - name: BACKEND_STORE_URI
            value: "postgresql+psycopg2://mlflow@mlflow-postgres:5432/mlflow_db"
          - name: POSTGRES_PASSWORD
            valueFrom:
              secretKeyRef:
                name: mlflow-secret
                key: POSTGRES_MLFLOW_PASS
          - name: MLFLOW_S3_ENDPOINT_URL
            value: "http://minio.mlflow.svc.cluster.local:9000"
          - name: AWS_S3_ADDRESSING_STYLE
            value: "path"
          - name: AWS_ACCESS_KEY_ID
            valueFrom:
              secretKeyRef:
                name: mlflow-secret
                key: AWS_ACCESS_KEY_ID
          - name: AWS_SECRET_ACCESS_KEY
            valueFrom:
              secretKeyRef:
                name: mlflow-secret
                key: AWS_SECRET_ACCESS_KEY
        command: ["sh", "-c"]
        args:
          - |
            mlflow server \
              --host=0.0.0.0 \
              --port=5000 \
              --backend-store-uri=postgresql+psycopg2://mlflow:${POSTGRES_PASSWORD}@mlflow-postgres:5432/mlflow_db \
              --default-artifact-root=s3://mlflow-bucket \
              --artifacts-destination s3://mlflow-bucket
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "4Gi"
            cpu: "2"

And a secret:

apiVersion: v1
kind: Secret
metadata:
  name: mlflow-secret
  namespace: mlflow
type: Opaque
data:
  AWS_ACCESS_KEY_ID: bWxmbG93
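  AWS_SECRET_ACCESS_KEY: VGsvUEFJa1I5fkxZbVp   # placeholder; replace with your own base64-encoded secret key
  POSTGRES_MLFLOW_PASS: QXliRmoxVFdhMW         # placeholder; replace with your own base64-encoded password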

Istio

As with MinIO, when Istio runs in your Kubernetes cluster you may need Istio configuration for MLflow to handle traffic management, security, and observability. Here’s a breakdown of what you might need:

Configure MLflow with Istio

If you are also exposing MLflow outside the cluster or want to manage traffic to it, you should similarly set up an Istio Virtual Service for MLflow.

Example Configuration for MLflow

Create a Virtual Service for MLflow:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mlflow
  namespace: mlflow
spec:
  gateways:
    - kubeflow/kubeflow-gateway
  hosts:
    - '*'
  http:
    - match:
        - uri:
            prefix: /mlflow/ # match any request with a URI that starts with /mlflow/
      rewrite:
        uri: / #requests matching /mlflow/ are rewritten to /, routing them to the root of the mlflow service
      route:
        - destination:
            host: mlflow-service.mlflow.svc.cluster.local
            port:
              number: 5000
    - match:
        - uri:
            prefix: /graphql
      rewrite:
        uri: /graphql
      route:
        - destination:
            host: mlflow-service.mlflow.svc.cluster.local
            port:
              number: 5000

With this configuration, the MLflow UI is available at kubeflow.mydomain.com/mlflow/. The second route is needed because the MLflow UI sends its backend GraphQL API requests to /graphql, outside the /mlflow/ prefix; without it, selecting run details in the MLflow UI returns an HTTP 404 error.

Apply the Configurations:

kubectl apply -f mlflow-virtualservice.yaml

Next, we need to integrate an MLflow tab into the central dashboard of Kubeflow. So we will modify the ConfigMap for Kubeflow's dashboard to make MLflow visible:

kubectl edit cm centraldashboard-config -n kubeflow

and add this entry to the menuLinks section:

            {
                "type": "item",
                "link": "/mlflow/",
                "text": "MLflow",
                "icon": "icons:cached"
            },

Restarting the central dashboard deployment will result in the tab being added.

kubectl rollout restart deploy centraldashboard -n kubeflow

Part 2

Nice work getting MLflow into Kubeflow! Now let’s walk through a detailed guide on how to test the integration. The goal is to verify that MLflow is working smoothly within the Kubeflow environment—logging experiments, models, parameters, and metrics. Here's how you can do it step by step:

✅ 1. Decide Where to Run the Code

To best test the integration, you should run the MLflow code inside Kubeflow Notebooks (e.g., a Jupyter Notebook in a Kubeflow workspace). This ensures that:

You're using the same Kubernetes network.
MLflow client talks directly to the MLflow tracking server you integrated.
Any paths (e.g., artifact store, model registry) resolve correctly within the cluster.

💡 Running from your laptop is okay only if you expose MLflow’s tracking server externally, which is not recommended for early testing due to security/config complexity.

✅ 2. Prepare the Kubeflow Notebook Environment

Launch a notebook server in Kubeflow:
- Go to the Kubeflow Dashboard → “Notebooks”.
- Create a new notebook server (choose a Python-based image that supports pip).
Install MLflow in the notebook:

pip install mlflow boto3 scikit-learn pandas

You may also install any dependencies your test script needs.

✅ 3. Configure MLflow Client in the Notebook

Set up the MLflow client to point to your MLflow Tracking Server. Usually, this is something like:

import mlflow
import os

# Point to your MLflow tracking server
mlflow.set_tracking_uri("http://mlflow-service.<namespace>.svc.cluster.local:5000")
print("Tracking URI:", mlflow.get_tracking_uri())

Replace <namespace> with the actual Kubernetes namespace where MLflow is deployed (mlflow in this guide).
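Before logging anything, a quick reachability check can save some debugging time. The sketch below assumes the tracking server exposes a /health endpoint that answers with HTTP 200, as recent MLflow versions do; the function name tracking_server_is_healthy is hypothetical. It uses only the standard library and bypasses any HTTP proxy configured in the notebook, since the server is a cluster-local address.

```python
import urllib.error
import urllib.request


def tracking_server_is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    # An empty ProxyHandler disables proxies picked up from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # HTTP errors, refused connections, DNS failures, and timeouts.
        return False
```

In the notebook, tracking_server_is_healthy(mlflow.get_tracking_uri()) should return True once the tracking URI is set correctly.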

✅ 4. Set MinIO Credentials

os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio.mlflow.svc.cluster.local:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "mlflow"
os.environ["AWS_SECRET_ACCESS_KEY"] = "***********"

✅ 5. Run a Simple MLflow Test Script

Here’s a minimal working example:

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Data
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y)

with mlflow.start_run(run_name="kubeflow-test-run") as run:
    model = RandomForestRegressor(n_estimators=100, max_depth=5)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("mse", mean_squared_error(y_test, predictions))

    # Log the model
    mlflow.sklearn.log_model(model, "model")

    # Use the run's own experiment ID instead of assuming experiment 0
    print("🏃 View run at:", f"{mlflow.get_tracking_uri()}/#/experiments/{run.info.experiment_id}/runs/{run.info.run_id}")

✅ 6. Verify Results in Kubeflow Dashboard

Navigate to your MLflow dashboard integrated into Kubeflow.
Check if the experiment, run, parameters, metrics, and model are logged.
Try registering a model and promoting it to a stage if the registry is enabled.
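The same checks can be scripted from the notebook. The sketch below reads a run back through a tracking client and collects what was logged; summarize_run is a hypothetical helper that expects an object with the get_run and list_artifacts methods of mlflow.tracking.MlflowClient. Depending on the MLflow version, a logged model may or may not appear among the run's artifacts.

```python
def summarize_run(client, run_id: str) -> dict:
    """Collect status, params, metrics, and top-level artifact paths of a run."""
    run = client.get_run(run_id)
    artifacts = [item.path for item in client.list_artifacts(run_id)]
    return {
        "status": run.info.status,
        "params": dict(run.data.params),
        "metrics": dict(run.data.metrics),
        "artifacts": artifacts,
    }
```

For the test run above, summarize_run(MlflowClient(), run.info.run_id) should report the n_estimators and max_depth parameters and the mse metric.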

In the Experiments section of MLflow, you can view a list of runs and access detailed information for each run by selecting it.

In MLflow, the run details provide a comprehensive overview of a specific experiment run. Here’s the kind of information you can typically find in the run details:

1. Basic Run Information

Run ID: A unique identifier for the run.
Experiment ID: The ID of the experiment to which the run belongs.
Start Time: The timestamp when the run started.
End Time: The timestamp when the run finished.
Duration: The total time taken for the run.

2. Parameters

Parameters: Key-value pairs representing the hyperparameters or configurations used during the run.

3. Metrics

Metrics: Key-value pairs of numerical values that represent the performance of the model (e.g., accuracy, loss) at various stages of the run.
Logging: Metrics can be logged at different intervals throughout the run.

4. Artifacts

Artifacts: Files or outputs generated during the run, such as:
- Model files
- Plots and figures
- Data files
- Logs

5. Tags

Tags: Key-value pairs used to categorize and add metadata to the run (e.g., version of the code, experiment type).

6. Source Information

Source: Information about the source of the run, including:
- The script or notebook used to run the experiment
- The entry point of the run (if applicable)

7. Status

Status: The current state of the run (e.g., RUNNING, FINISHED, FAILED, or KILLED).

8. User Information

User: Information about the user who initiated the run (if applicable).

Understanding MLflow UI Components

The MLflow UI is an integral part of the MLflow platform, providing a visual interface for monitoring and comparing machine learning experiments. Here's an in-depth look at its components:

Experiments and Runs

Experiments: Group related runs for easy comparison and analysis.
Runs: Individual executions of a machine learning model, each with its own set of parameters, metrics, and artifacts.

Detailed Run Information

Access detailed information for each run, including parameters, metrics, and artifacts.
View the history of a metric by selecting its name under the Metrics section.

Views and Comparisons

Table View: Lists runs with sortable columns for names, creation times, and other key data.
Chart View: Visualize and compare runs using various charts, such as parallel coordinates.

Artifacts

Store and retrieve output such as models and visualizations.

Metric History

Track the performance of metrics over time, such as Mean Average Precision.

Integration and Extensibility

MLflow UI can be extended to track runs from various sources, including local and remote servers.

In MLflow, logging and registering a model serve different purposes in the machine learning lifecycle. Here's a breakdown of the differences:

Logging a Model

Definition: Logging a model saves the model and its associated files as artifacts of a run during an experiment.
Purpose: It allows you to keep track of different versions of models and their performance metrics during experimentation.
Usage: Typically done during training or evaluation, using a flavor-specific function such as mlflow.sklearn.log_model().
Scope: Logged models are associated with a specific run in the MLflow tracking server.

Registering a Model

Definition: Registering a model involves adding a model to the MLflow Model Registry, which is a centralized repository for managing and versioning models.
Purpose: It allows you to organize, manage, and deploy models in a more structured way. You can also promote models through stages (e.g., Staging, Production).
Usage: Done after logging a model, using functions like mlflow.register_model().
Scope: Registered models can be accessed and used independently of specific runs, facilitating model sharing and deployment.

Now let’s take it further and go through the Model Registry, versioning, staging, and visualizations. Here's a full guide with examples that you can use in your workflows.

📘 MLflow Model Registry – End-to-End Example

✅ Prerequisites

MLflow Tracking Server with PostgreSQL backend and MinIO set up ✅
Models logged to tracking server ✅
MLflow client access from notebook ✅

1. 🔖 Registering a Model

Once a model is logged (as you've done with mlflow.sklearn.log_model(model, "model")), you can register it.

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register the model under a name
model_uri = f"runs:/{run.info.run_id}/model"
model_name = "DiabetesRandomForest"

result = mlflow.register_model(model_uri=model_uri, name=model_name)
print(f"🎯 Registered model version: {result.version}")

2. 📌 Add a Description

Adding helpful descriptions helps with collaboration.

client.update_registered_model(
    name=model_name,
    description="A RandomForestRegressor trained on the diabetes dataset."
)

client.update_model_version(
    name=model_name,
    version=result.version,
    description="Version 1: 100 estimators, max depth 5."
)

3. 🆕 Add a New Model Version

To add a new version, you can log a different model (e.g., new params or retrained model) and then register it under the same name.

# Train a new model with different parameters
model_v2 = RandomForestRegressor(n_estimators=200, max_depth=8)
model_v2.fit(X_train, y_train)

# Log it in its own run so it does not overwrite the first model's artifacts
# (this assumes the earlier run has already ended)
with mlflow.start_run() as run_v2:
    mlflow.sklearn.log_model(model_v2, "model")

# Register as new version
new_model_uri = f"runs:/{run_v2.info.run_id}/model"
registered_v2 = mlflow.register_model(model_uri=new_model_uri, name=model_name)

print(f"📦 Registered new model version: {registered_v2.version}")

4. 🏷️ Add Tags to a Model Version

Tags are useful for categorization or additional metadata like author, dataset, accuracy, etc.

client.set_model_version_tag(
    name=model_name,
    version=registered_v2.version,
    key="model_type",
    value="random_forest"
)

client.set_model_version_tag(
    name=model_name,
    version=registered_v2.version,
    key="dataset_version",
    value="v1.1"
)

5. 🔗 Use Aliases for Model Versions (MLflow ≥ 2.2)

Aliases allow you to define human-readable names for model versions, such as @champion or @staging. Avoid "latest" as an alias name, since MLflow reserves it.

# Point the "prod" alias at version 2
client.set_registered_model_alias(
    name=model_name,
    alias="prod",
    version=registered_v2.version
)

# You can now refer to this model like:
model = mlflow.sklearn.load_model(f"models:/{model_name}@prod")

You can also reassign or delete an alias:

# Move the alias to another version (setting an existing alias reassigns it)
client.set_registered_model_alias(name=model_name, alias="prod", version="1")

# Delete an alias
client.delete_registered_model_alias(name=model_name, alias="prod")

6. 🔄 List Model Versions

for mv in client.search_model_versions(f"name='{model_name}'"):
    print(f"🔢 Version {mv.version} - Stage: {mv.current_stage}")

7. 🚦 Transition Between Stages

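MLflow supports these stages: None, Staging, Production, Archived. Note that stages have been deprecated since MLflow 2.9 in favor of aliases and tags, so prefer aliases (step 5) in new projects.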

client.transition_model_version_stage(
    name=model_name,
    version=result.version,
    stage="Staging",  # or "Production", "Archived"
    archive_existing_versions=True
)

In MLflow, stages refer to the different phases that a model can be in within the Model Registry. These stages help manage the lifecycle of machine learning models, allowing teams to organize, promote, and deploy models systematically. Here are the main stages in MLflow:

1. Staging

Definition: A model in the Staging stage is considered to be ready for testing and validation.
Purpose: This stage allows users to evaluate the model in a controlled environment before it is promoted to production.
Usage: Typically used for model versions that have been recently registered and still need to be tested for performance.

2. Production

Definition: A model in the Production stage is actively being used in a live environment.
Purpose: This indicates that the model has passed all necessary tests and is deemed reliable for making predictions on real data.
Usage: Models in this stage are often monitored for performance and may be updated or replaced as new models are developed.

3. Archived

Definition: A model in the Archived stage is no longer in active use.
Purpose: This stage is used to keep the model in the registry for historical reference while indicating that it should not be used for new predictions.
Usage: Models may be archived for various reasons, such as being replaced by newer versions or being deemed obsolete.

Summary of Stages

Staging: For testing and validation.
Production: For live use and active predictions.
Archived: For historical reference, not in active use.

Transitioning Between Stages

Models can transition between these stages based on their performance, testing results, and the needs of the organization. This structured approach to model management helps ensure that only the best-performing models are deployed in production, while also maintaining a clear history of model versions and their statuses.
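To make the effect of archive_existing_versions concrete, here is a minimal in-memory sketch. ToyRegistry is a hypothetical stand-in written for this article, not MLflow's implementation; it only mimics the behavior where promoting a version archives whatever currently occupies the target stage.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRegistry:
    """Hypothetical in-memory registry; NOT MLflow's implementation."""
    stages: dict = field(default_factory=dict)  # version number -> stage name

    def register(self) -> int:
        # New versions start in the "None" stage, numbered from 1.
        version = len(self.stages) + 1
        self.stages[version] = "None"
        return version

    def transition(self, version, stage, archive_existing_versions=False):
        # Optionally archive every other version already in the target stage.
        if archive_existing_versions:
            for other, current in self.stages.items():
                if other != version and current == stage:
                    self.stages[other] = "Archived"
        self.stages[version] = stage


registry = ToyRegistry()
v1 = registry.register()
v2 = registry.register()

registry.transition(v1, "Production")
registry.transition(v2, "Production", archive_existing_versions=True)

print(registry.stages)  # {1: 'Archived', 2: 'Production'}
```

Without the flag, both versions would sit in Production at the same time, which is usually not what you want when loading a model by stage.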

8. 📊 View in UI

Visit: http://<your-mlflow-host>/#/models
Click on DiabetesRandomForest
You’ll see all versions, stages, parameters, metrics, and artifacts.

9. 🎯 Load Model by Stage (e.g., for serving)

model = mlflow.sklearn.load_model(model_uri=f"models:/{model_name}/Staging")
predictions = model.predict(X_test)
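The different ways of addressing a model in this guide all come down to a few URI shapes. The helpers below are a small sketch that only builds the strings; the function names are my own and are not part of MLflow, and "abc123" is a made-up run ID.

```python
def run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """URI of a model logged under a specific run."""
    return f"runs:/{run_id}/{artifact_path}"


def registry_model_uri(name: str, version_or_stage) -> str:
    """URI of a registered model, addressed by version number or stage name."""
    return f"models:/{name}/{version_or_stage}"


def alias_model_uri(name: str, alias: str) -> str:
    """URI of a registered model, addressed by alias."""
    return f"models:/{name}@{alias}"


print(run_model_uri("abc123"))                                # runs:/abc123/model
print(registry_model_uri("DiabetesRandomForest", 2))          # models:/DiabetesRandomForest/2
print(registry_model_uri("DiabetesRandomForest", "Staging"))  # models:/DiabetesRandomForest/Staging
print(alias_model_uri("DiabetesRandomForest", "prod"))        # models:/DiabetesRandomForest@prod
```

Any of these strings can be passed as model_uri to mlflow.sklearn.load_model.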

10. 📉 Visualizations: Auto-Generated in UI

In the MLflow UI (under the experiment or model), MLflow provides charts like:

Line chart of metrics per run
Parallel coordinates for comparing multiple runs
Run comparison and filtering

If you want to create custom charts:

import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x=predictions, y=y_test)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Prediction vs Actual")
plt.show()

✅ Summary

Step	Description
Log Model	`mlflow.sklearn.log_model(...)`
Register	`mlflow.register_model(...)`
Describe	`client.update_registered_model(...)`
Transition	`client.transition_model_version_stage(...)`
Load by Stage	`mlflow.sklearn.load_model("models:/ModelName/Stage")`


Introduction to Neural Networks

Pejman Rezaei — Sat, 15 Feb 2025 21:41:18 +0000

Neural networks are the backbone of modern Artificial Intelligence (AI) and Machine Learning (ML). They power everything from image recognition and natural language processing to self-driving cars and recommendation systems. But what exactly are neural networks, and how do they work? In this article, we’ll break down the basics of neural networks, explain key concepts like layers and activation functions, and walk through a simple example using TensorFlow.

What is a Neural Network?

A neural network is a computational model inspired by the structure and function of the human brain. It consists of interconnected nodes (called neurons) organized into layers. These networks are designed to recognize patterns in data and make predictions or decisions based on that data.

Neural networks are particularly powerful because they can learn complex relationships in data without being explicitly programmed. This makes them ideal for tasks like image classification, speech recognition, and more.

Key Components of a Neural Network

Let’s dive into the key components that make up a neural network:

1. Neurons

A neuron is the basic unit of a neural network. It takes one or more inputs, applies a mathematical operation to them, and produces an output. Each input is multiplied by a weight, which represents the importance of that input.

2. Layers

Neurons are organized into layers:

Input Layer: The first layer that receives the input data.
Hidden Layers: Intermediate layers that process the data. A network can have one or more hidden layers.
Output Layer: The final layer that produces the result (e.g., a classification or prediction).

3. Activation Functions

Activation functions introduce non-linearity into the network, allowing it to learn complex patterns. Some common activation functions include:

ReLU (Rectified Linear Unit): f(x) = max(0, x) – The most popular activation function for hidden layers.
Sigmoid: f(x) = 1 / (1 + e^(-x)) – Often used in the output layer for binary classification.
Softmax: Used in the output layer for multi-class classification.
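As a quick illustration, the three functions above can be written in a few lines of plain Python. This is a sketch for intuition only; in practice you would use the vectorized versions that ship with TensorFlow or NumPy.

```python
import math


def relu(x: float) -> float:
    # Passes positive values through and clips negatives to zero.
    return max(0.0, x)


def sigmoid(x: float) -> float:
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values: list) -> list:
    # Subtracting the maximum first keeps exp() from overflowing.
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.0))                # 0.0
print(sigmoid(0.0))              # 0.5
print(softmax([1.0, 2.0, 3.0]))  # three probabilities that sum to 1
```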

4. Weights and Biases

Weights: Parameters that determine the strength of the connection between neurons.
Biases: Additional parameters that allow the model to fit the data better.
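Putting neurons, weights, biases, and activation functions together, a single neuron can be sketched as below. The input values, weights, and bias are made-up numbers chosen for illustration.

```python
def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias...
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # ...followed by a ReLU activation.
    return max(0.0, z)


# 1.0*0.5 + 2.0*(-1.0) + 3.0*2.0 + 0.1 = 4.6
output = neuron([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], bias=0.1)
print(round(output, 2))  # 4.6
```

A Dense layer in Keras computes this same expression for many neurons at once using matrix multiplication.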

5. Loss Function

A loss function measures how well the model’s predictions match the actual data. The goal of training is to minimize this loss.

6. Optimizer

An optimizer adjusts the weights and biases to minimize the loss. Common optimizers include Stochastic Gradient Descent (SGD) and Adam.

How Neural Networks Learn

Neural networks learn by repeatedly adjusting their weights and biases with gradient descent, using a process called backpropagation to compute the gradients. Here’s how it works:

Forward Pass: The input data is passed through the network, and the output is computed.
Loss Calculation: The loss function compares the predicted output to the actual output.
Backward Pass: The gradients of the loss with respect to the weights and biases are calculated.
Weight Update: The optimizer updates the weights and biases to reduce the loss.

This process is repeated over many passes through the training data (epochs) until the model performs well.
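The four steps can be sketched by hand for the smallest possible "network": a single linear neuron fitted to the line y = 2x + 1. The data, learning rate, and epoch count below are arbitrary choices for illustration; real frameworks compute the gradients automatically.

```python
# Toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0        # start with arbitrary parameters
learning_rate = 0.05
n = len(xs)

for epoch in range(2000):
    # Forward pass: compute predictions with the current parameters
    preds = [w * x + b for x in xs]

    # Loss calculation: mean squared error
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n

    # Backward pass: gradients of the loss with respect to w and b
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n

    # Weight update: step against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}, loss = {loss:.6f}")
```

After training, w and b end up very close to 2 and 1, the values used to generate the data.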

A Simple Neural Network Example Using TensorFlow

Let’s build a simple neural network to classify handwritten digits using the MNIST dataset. This dataset contains 28x28 pixel images of digits (0-9) and their corresponding labels.

Step 1: Install TensorFlow

If you don’t have TensorFlow installed, you can install it using pip:

pip install tensorflow

Step 2: Load and Preprocess the Data

TensorFlow provides the MNIST dataset as part of its datasets module.

import tensorflow as tf
from tensorflow.keras import layers, models
import matplotlib.pyplot as plt

# Load the MNIST dataset
mnist = tf.keras.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Normalize the pixel values to the range [0, 1]
X_train, X_test = X_train / 255.0, X_test / 255.0

Step 3: Build the Neural Network

We’ll create a simple feedforward neural network with one hidden layer.

# Define the model
model = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),  # Flatten the 28x28 images into a 784-dimensional vector
    layers.Dense(128, activation='relu'),  # Hidden layer with 128 neurons and ReLU activation
    layers.Dropout(0.2),                   # Dropout layer to prevent overfitting
    layers.Dense(10, activation='softmax') # Output layer with 10 neurons (one for each digit) and softmax activation
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

Step 4: Train the Model

Train the model on the training data.

# Train the model
history = model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test))

Step 5: Evaluate the Model

Evaluate the model’s performance on the test data.

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=2)
print(f"Test Accuracy: {test_acc:.4f}")

Step 6: Make Predictions

Use the trained model to make predictions on new data.

# Make predictions
predictions = model.predict(X_test)

# Display the first prediction
print(f"Predicted Label: {tf.argmax(predictions[0]).numpy()}")
print(f"Actual Label: {y_test[0]}")

# Visualize the first test image
plt.imshow(X_test[0], cmap='gray')
plt.show()

Real-World Applications of Neural Networks

Neural networks are used in a wide range of applications, including:

Image Recognition: Identifying objects, faces, or scenes in images.
Natural Language Processing (NLP): Powering chatbots, translation systems, and sentiment analysis.
Autonomous Vehicles: Enabling self-driving cars to perceive and navigate their environment.
Healthcare: Diagnosing diseases from medical images or predicting patient outcomes.

Integrating MLflow into Kubeflow: A Solution for Model Management

Pejman Rezaei — Wed, 12 Feb 2025 12:35:07 +0000

In the rapidly evolving field of machine learning, the ability to efficiently manage and track models is essential for success. Kubeflow has emerged as a powerful platform designed to streamline the deployment and orchestration of machine learning workflows. However, one notable limitation of Kubeflow is its lack of a user-friendly interface for model registration and management. This gap can hinder data scientists and machine learning engineers in effectively tracking their experiments and models.

While a user-friendly UI installation is scheduled for future releases of Kubeflow (as noted in this GitHub issue), we will explore a practical solution to address this challenge in the meantime.

Enter MLflow—a robust open-source platform that simplifies the management of the machine learning lifecycle, including experimentation, reproducibility, and deployment. By integrating MLflow into Kubeflow, users can leverage MLflow’s intuitive UI and comprehensive model registry capabilities to enhance their machine learning workflows.

Our initial objective is to deploy MLflow on Kubernetes, a process that follows the same principles as any standard deployment. We will start by creating a Dockerfile. This step is essential because the default MLflow image lacks the boto3 and psycopg2-binary packages, which are necessary for connecting MLflow to MinIO and PostgreSQL:

FROM ghcr.io/mlflow/mlflow:latest

RUN pip install psycopg2-binary boto3

CMD ["mlflow", "server"]

Generate a deployment.yaml file:

Make sure to provide the environment variables using a config map or another appropriate method. We will also include the credentials needed to connect MLflow to MinIO and PostgreSQL in this file:

apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
  namespace: model-registry
spec:
  selector:
    app: mlflow
  ports:
    - protocol: TCP
      port: 5000
      targetPort: 5000
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mlflow-sa
  namespace: model-registry
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
  namespace: model-registry
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      serviceAccountName: mlflow-sa
      containers:
      - name: mlflow
        image: harbor.partdp.ir/devops/custom-mlflow:v2.17.2
        ports:
          - containerPort: 5000
        env:
          - name: MLFLOW_TRACKING_URI
            value: "postgresql+psycopg2://mlflow:<PASSWORD>@192.168.34.85:5000/mlflow_db"
          - name: MLFLOW_S3_ENDPOINT_URL
            value: "https://minio.partdp.ir:9000"
          - name: AWS_ACCESS_KEY_ID
            valueFrom:
              secretKeyRef:
                name: mlflow-secret
                key: AWS_ACCESS_KEY_ID
          - name: AWS_SECRET_ACCESS_KEY
            valueFrom:
              secretKeyRef:
                name: mlflow-secret
                key: AWS_SECRET_ACCESS_KEY
        command: ["mlflow", "server"]
        args:
          - "--host=0.0.0.0"
          - "--port=5000"
          - "--backend-store-uri=$(MLFLOW_TRACKING_URI)"
          - "--default-artifact-root=s3://pejman"
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "4Gi"
            cpu: "2"

and a secret:

apiVersion: v1
kind: Secret
metadata:
  name: mlflow-secret
  namespace: model-registry
type: Opaque
data:
  AWS_ACCESS_KEY_ID: <BASE64 encoded>
  AWS_SECRET_ACCESS_KEY: <BASE64 encoded>
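The values under data in a Kubernetes Secret must be base64-encoded. A small sketch of producing them in Python follows; the credential string is a placeholder I made up, so substitute your real MinIO keys.

```python
import base64


def to_secret_value(plaintext: str) -> str:
    """Encode a string the way Kubernetes expects under a Secret's `data` field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


access_key = "minio-access-key"  # placeholder, not a real credential
encoded = to_secret_value(access_key)
print(encoded)
```

Keep in mind that base64 is an encoding, not encryption, so treat the Secret manifest itself as sensitive.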

Next, we need to integrate an MLflow tab into the central dashboard of Kubeflow. To accomplish this, we need to set up a virtual service that will expose the MLflow service through Istio Ingress.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mlflow
  namespace: model-registry
spec:
  gateways:
  - kubeflow/kubeflow-gateway
  hosts:
  - '*'
  http:
  - match:
    - uri:
        prefix: /mlflow/
    rewrite:
      uri: /
    route:
    - destination:
        host: mlflow-service.model-registry.svc.cluster.local
        port:
          number: 5000
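To see what the match and rewrite rules above do to a request path, here is a toy model of the prefix rewrite. It is only an illustration of the rule as configured here, not Istio's actual routing code, and the function name is my own.

```python
from typing import Optional


def rewrite_path(path: str, prefix: str = "/mlflow/", replacement: str = "/") -> Optional[str]:
    """Return the path forwarded to MLflow, or None if the rule does not match."""
    if not path.startswith(prefix):
        return None
    # The matched prefix is swapped for the replacement; the rest is kept.
    return replacement + path[len(prefix):]


print(rewrite_path("/mlflow/"))                    # /
print(rewrite_path("/mlflow/static-files/x.js"))   # /static-files/x.js
print(rewrite_path("/pipeline/"))                  # None
```

This is why the dashboard link is /mlflow/ with a trailing slash: the rule only matches paths that start with that exact prefix.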

Next, we will modify the ConfigMap for Kubeflow's dashboard to make MLflow visible:

kubectl edit cm centraldashboard-config -n kubeflow

and add this entry to the menuLinks section:

            { 
                "type": "item",
                "link": "/mlflow/",
                "text": "MlFlow",
                "icon": "icons:cached"
            },

Now just restart the Dashboard:

kubectl rollout restart deploy centraldashboard -n kubeflow

And we have MLflow there:

Predicting House Prices as Your First ML Project

Pejman Rezaei — Sat, 01 Feb 2025 07:07:32 +0000

Machine Learning (ML) can seem intimidating at first, but the best way to learn is by doing. In this article, we’ll walk through a beginner-friendly ML project: predicting house prices using the Boston Housing dataset. By the end of this guide, you’ll have built your first ML model using Python and Scikit-learn. Let’s get started!

What is the Boston Housing Dataset?

The Boston Housing dataset is a classic dataset used for regression problems. It contains information about housing prices in the Boston area, along with features that might influence those prices, such as:

CRIM: Per capita crime rate by town.
RM: Average number of rooms per dwelling.
AGE: Proportion of owner-occupied units built before 1940.
DIS: Weighted distances to five Boston employment centers.
LSTAT: Percentage of lower status of the population.
MEDV: Median value of owner-occupied homes in $1000s (the target variable we want to predict).

Our goal is to build a model that predicts the median house price (MEDV) based on these features.

Step 1: Set Up Your Environment

Before we start, make sure you have the necessary libraries installed. You can install them using pip:

pip install numpy pandas scikit-learn matplotlib

Step 2: Load the Dataset

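Scikit-learn used to ship the Boston Housing dataset, but load_boston was deprecated in version 1.0 and removed in version 1.2. We’ll load the same data from its original source instead and explore it.

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the raw file; each record is split across two lines
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
features = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

# Convert it to a Pandas DataFrame for easier manipulation
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
data = pd.DataFrame(features, columns=feature_names)
data['MEDV'] = target  # Add the target variable to the DataFrame

# Display the first few rows
print(data.head())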

Step 3: Explore the Data

Before building a model, it’s important to understand the data. Let’s perform some basic exploratory data analysis (EDA).

# Check for missing values
print(data.isnull().sum())

# Get basic statistics
print(data.describe())

# Visualize the relationship between features and the target variable
plt.scatter(data['RM'], data['MEDV'])
plt.xlabel('Average Number of Rooms (RM)')
plt.ylabel('Median House Price (MEDV)')
plt.title('Rooms vs. Price')
plt.show()

From the scatter plot, you can see that houses with more rooms tend to have higher prices. This is a good sign that our features are relevant to the target variable.

Step 4: Prepare the Data

Next, we’ll split the data into features (X) and labels (y), and then split it into training and testing sets.

from sklearn.model_selection import train_test_split

# Features (X) and labels (y)
X = data.drop('MEDV', axis=1)  # All columns except 'MEDV'
y = data['MEDV']  # Only the 'MEDV' column

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 5: Build and Train the Model

We’ll use a Linear Regression model, which is a simple and effective algorithm for regression problems.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

Step 6: Evaluate the Model

To see how well our model performs, we’ll calculate two common metrics: Mean Squared Error (MSE) and R-squared (R²).

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Calculate R-squared (R²)
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.2f}")

MSE: Measures the average squared difference between the predicted and actual values. Lower is better.
R²: Represents the proportion of variance in the target variable that’s explained by the model. Closer to 1 is better.
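To see what these two metrics actually compute, here is a hand-written sketch on four made-up values. The helper names are my own; scikit-learn's mean_squared_error and r2_score give the same numbers for this input.

```python
def mse_manual(y_true, y_pred):
    # Average of the squared differences between actual and predicted values.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    return 1 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]

print(f"MSE: {mse_manual(actual, predicted):.3f}")  # MSE: 0.125
print(f"R²: {r2_manual(actual, predicted):.3f}")    # R²: 0.975
```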

Step 7: Interpret the Results

Let’s interpret the results:

A low MSE indicates that the model’s predictions are close to the actual values.
An R² value close to 1 suggests that the model explains a large portion of the variance in house prices.

For example, if your R² is 0.75, it means that 75% of the variability in house prices can be explained by the features in the dataset.

Step 8: Make Predictions

Now that the model is trained, you can use it to make predictions on new data. For example, let’s predict the price of a house with the following features:

# Example input (replace with your own values), in the same column order as X
new_house = pd.DataFrame(
    [[0.02731, 0.0, 7.07, 0.0, 0.469, 6.421, 78.9, 4.9671, 2.0, 242.0, 17.8, 396.90, 9.14]],
    columns=X.columns
)

# Predict the price (MEDV is expressed in $1000s)
predicted_price = model.predict(new_house)
print(f"Predicted Price: ${predicted_price[0] * 1000:.2f}")

Real-World Applications

Predicting house prices is just one example of how ML can be applied in the real world. Here are some other applications of regression models:

Stock Price Prediction: Predicting the future price of stocks based on historical data.
Sales Forecasting: Estimating future sales based on past trends and external factors.
Healthcare: Predicting patient outcomes based on medical data.

Conclusion

We just built our first ML project using the Boston Housing dataset. Here’s a quick recap of what we covered:

Loaded and explored the dataset.
Prepared the data for training.
Built and trained a Linear Regression model.
Evaluated the model’s performance.
Made predictions on new data.

This is just the beginning of your ML journey. As you continue learning, you can explore more advanced algorithms, work with larger datasets, and tackle real-world problems.

If you have any questions or want to share your results, feel free to leave a comment below.

Follow Me

For more updates, check out my Mastodon blog: @prezaei@mastodon.social.

High Availability PostgreSQL: Clustering with Patroni

Pejman Rezaei — Tue, 28 Jan 2025 17:29:27 +0000

A complete high-availability architecture involves a number of components and processes working together to replicate data. Any organization implementing a high-availability solution should define target metrics for database uptime, switchover recovery time, and acceptable data loss.

Some of the most important concepts involving database high availability are as follows:

Data Replication: Data replication generates multiple copies of the original database data. It logs any database additions and updates and transmits them to all nodes in the HA Cluster. These changes can be database data transactions or alterations to the database schema or table structure. Replication can be either synchronous or asynchronous.
High Availability Cluster (HA Cluster): A HA Cluster is a collection of nodes that each have a copy of the same underlying data. Having multiple copies of the dataset is essential for data redundancy. Any one of the database servers can respond to queries, and any node can potentially become the master node. From the user’s point of view, the HA Cluster appears as a single database. In most cases, users do not know which node responded to their query.
Primary Node: This is the master node for the HA cluster. It is the recipient of all database changes, including writes and schema updates. Therefore, it always has the most current data set. It replicates these changes to the other instances in the HA cluster, sending them the transactions in either list or stream format. Primary nodes can also handle read requests, but these are typically distributed between the different nodes for load-balancing purposes. The primary node is elected through a primary election.
Replica Node: Also known as a secondary node, a replica receives updates from the primary node. During regular operation, these nodes can handle read requests. However, depending on the HA architecture, the data in the replica data set might not be completely up to date. Each HA cluster can contain multiple replica nodes for added redundancy and load balancing.
Failover: In the event of a primary node failure, a failover event occurs. One of the secondary nodes becomes the primary node and supervises database updates. Administrators can initiate a manual failover for database maintenance purposes. This scheduled activity is sometimes known as a manual switchover. A switch back to the original master is known as a failback.
Write-ahead log (WAL): This log stores a record of all changes to the database. A unique sequence number identifies each WAL record. In PostgreSQL, the WAL is stored in segment files, each of which typically contains a large number of records.

Methods for Implementing Database Replication

There are two main forms of data replication and two methods of implementing it. The two main approaches are as follows:

Synchronous replication: In this approach, the primary node waits for confirmation from at least one replica before confirming the transaction. This guarantees the database is consistent across the HA cluster in the event of a failure. Consistency eliminates potential data loss and is vital for organizations that demand transactional data integrity. However, it introduces latency and can reduce throughput.
Asynchronous replication: In asynchronous replication, the primary node sends updates to the replicas without waiting for a response. It immediately confirms a successful commit after updating its own database, reducing latency. However, this approach increases the chances of data loss in the event of an unexpected failover. This is the default PostgreSQL replication method.

The following algorithms are used to implement replication:

File-based log shipping: In this replication method, the primary node asynchronously transmits segment files containing the WAL logs to the replicas. This method cannot be used synchronously because the WAL files build up over a large number of transactions. The primary node continually records all transactions, but the replicas only process the changes after they receive a copy of the file. This is a good approach for latency-sensitive loss-tolerant applications.
Streaming replication: A streaming-based replication algorithm immediately transmits each update to the replicas. The primary node does not have to wait for transactions to build up in the WAL before transmitting the updates. This results in more timely updates on the replicas. Streaming can be either asynchronous, which is the default setting, or synchronous. In both cases, the updates are immediately sent over to the replicas. However, in synchronous streaming, the primary waits for a response from the replicas before confirming the commit. Users can enable synchronous streaming on PostgreSQL through the synchronous_commit configuration option, together with synchronous_standby_names, which lists the replicas that must confirm.

Another relevant set of concepts relates to how the HA cluster handles a split-brain condition. This occurs when multiple segments of the HA cluster are active but are not able to communicate with each other. In some circumstances, more than one node might attempt to become the primary. To handle this situation, the replication manager structures the rules for a primary election or adds a quorum. This problem can also be eliminated through the use of an external monitor.
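The quorum rule behind split-brain protection is simple majority arithmetic. The sketch below uses helper names of my own and shows why a three-node cluster split 2|1 can only ever have one side electing a primary.

```python
def quorum_size(cluster_size: int) -> int:
    # A strict majority: more than half of all nodes.
    return cluster_size // 2 + 1


def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    # A partition may elect a primary only if it can see a majority.
    return reachable_nodes >= quorum_size(cluster_size)


cluster_size = 3
print(quorum_size(cluster_size))    # 2
print(has_quorum(2, cluster_size))  # True: this side may elect a primary
print(has_quorum(1, cluster_size))  # False: the isolated node must not
```

Because two disjoint partitions can never both hold a majority, at most one of them can promote a primary. This is also why odd cluster sizes are the usual choice.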

Patroni High Availability Solution

A specialized replication manager application is almost always used to configure PostgreSQL HA Clusters. These applications automatically handle data replication and node monitoring, which are otherwise very difficult to implement. There are a number of different choices, each with its own strengths and drawbacks. This article focuses on Patroni.

Patroni is a Python-based software template for enabling high availability in PostgreSQL databases. This framework requires some template customization to work most effectively. It also requires a distributed configuration store (DCS) but supports a number of different storage solutions. Patroni works well on a two-node HA cluster consisting of a primary node and a single replica.

Patroni configures a set of nodes into an HA cluster and configures streaming replication to share updates. It runs an agent on each node in the HA cluster to share node health updates between the members. The primary node is responsible for regularly updating the leader key, which is stored in the DCS. If it fails to do so, it is evicted as the primary and another node is elected to take over. After a switchover, the replicas coordinate their position with respect to the database updates. The most up-to-date node typically takes over. In the event of a tie, the first node to create a new leader key wins. Only one node can hold the leader key at any time. This reduces any ambiguity about the identity of the primary node and avoids a split-brain scenario.
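The leader key mechanism can be illustrated with a toy lock that expires unless it is renewed. ToyDCS is a hypothetical class written for this article: it is not Patroni's code, and a real DCS such as etcd provides the same guarantee through leases and atomic compare-and-set operations. Time is passed in explicitly to keep the example deterministic.

```python
class ToyDCS:
    """Hypothetical single-key store with a time-to-live; NOT Patroni or etcd code."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.leader = None
        self.expires_at = 0.0

    def acquire_or_renew(self, node: str, now: float) -> bool:
        key_is_free = self.leader is None or now >= self.expires_at
        if key_is_free or self.leader == node:
            # Take the key, or extend it if we already hold it.
            self.leader = node
            self.expires_at = now + self.ttl
            return True
        return False  # someone else holds a live leader key


dcs = ToyDCS(ttl=30)
print(dcs.acquire_or_renew("node1", now=0))   # True:  node1 becomes primary
print(dcs.acquire_or_renew("node2", now=10))  # False: key is held by node1
print(dcs.acquire_or_renew("node1", now=20))  # True:  node1 renews until t=50
# node1 crashes here and stops renewing
print(dcs.acquire_or_renew("node2", now=40))  # False: key has not expired yet
print(dcs.acquire_or_renew("node2", now=50))  # True:  node2 takes over
print(dcs.leader)                             # node2
```

The TTL is the trade-off knob: a shorter value means faster failover but less tolerance for brief network hiccups.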

Patroni can be installed on Linux nodes using pip. Mandatory configuration settings can be configured globally, locally using a YAML file, or through environment variables. The global settings are dynamic and are applied asynchronously to all nodes in the HA cluster. However, local configuration always takes precedence over any global settings. Patroni supports a REST API, which is useful for monitoring and automation purposes. This API is used to determine the status and role of each node in the HA cluster.

Advantages:

It is a mature open-source product.
It performs very well in standard high-availability test scenarios. It is able to handle more failure scenarios than the alternatives.
In some circumstances, it is able to restore a failed PostgreSQL process. It also includes a fallback function to restore the HA cluster to a healthy state after failures. This involves initializing the affected node as a replica.
It enables a standard end-to-end solution on all nodes in the HA cluster based on global configuration settings.
It has a wide set of features and is highly configurable.
It includes monitoring functionality.
The associated REST API permits script access to all attributes.
It includes watchdog support and callbacks for event notifications.
It can be integrated with HaProxy, a popular high-performance load balancer.
Patroni works well with Kubernetes as part of an automated pipeline.
Storing the leader key in the DCS enforces consensus about the primary node and avoids multiple masters.

Drawbacks:

It is unable to detect a misconfigured replica node.
It requires manual intervention in a few cases, such as when the Patroni process itself fails.
It requires a separate DCS application, which must be configured by the user. DCS requires two open communications ports in addition to the main Patroni port.
Configuration is more complex than the other solutions.
It uses more memory and CPU than the alternatives.

For more information on Patroni, see the Patroni website and documentation or the Patroni GitHub repository.

This implementation consists of nine nodes, following the best practice of running supporting services on separate hosts: three VMs for HAProxy and Pacemaker, three VMs for the etcd servers, and three VMs for the PostgreSQL servers. All nodes run Debian 12. I also use a separate LVM volume, mounted at /data, for service data.

No.	Hostname	Role	CPU (Cores)	RAM (GB)	Disk (GB)	NIC	IP	OS
1	dc1-psql-node1	psql, patroni	16	16	100	1	192.168.34.124	Debian 12
2	dc1-psql-node2	psql, patroni	16	16	100	1	192.168.34.136	Debian 12
3	dc1-psql-node3	psql, patroni	16	16	100	1	192.168.34.113	Debian 12
4	hap1	HAProxy, Pacemaker	4	8	25	1	192.168.34.132	Debian 12
5	hap2	HAProxy, Pacemaker	4	8	25	1	192.168.34.133	Debian 12
6	hap3	HAProxy, Pacemaker	4	8	25	1	192.168.34.134	Debian 12
7	etcd1	etcd	4	8	25	1	192.168.34.137	Debian 12
8	etcd2	etcd	4	8	25	1	192.168.34.138	Debian 12
9	etcd3	etcd	4	8	25	1	192.168.34.139	Debian 12

The final setup will look like this:

Load Balancing Layer

In the first step, we configure the load balancers. Add the nodes and their IP addresses to /etc/hosts on each load balancer node:

127.0.0.1 localhost

192.168.34.132 node1

192.168.34.133 node2

192.168.34.134 node3

To reach multiple hosts through a single address, we create a cluster of load balancer nodes that is managed by PCS. Install PCS and the related packages, then enable the services:

sudo apt-get install pacemaker corosync crmsh pcs haproxy

systemctl enable corosync

systemctl enable pacemaker

systemctl enable pcsd

While installing PCS and the other packages, the package manager also creates a user named “hacluster”, which PCS uses to configure the cluster nodes. Before we can use PCS, we need to set the password for the “hacluster” user on all nodes:

passwd hacluster

Now, using the “hacluster” user and its password, we authenticate the nodes for the PCS cluster:

sudo pcs host auth node1 node2 node3 -u hacluster

We can then set up the cluster by running:

sudo pcs cluster setup haproxy node1 addr=192.168.34.132 node2 addr=192.168.34.133 node3 addr=192.168.34.134 --force

pcs cluster start --all

pcs status cluster

pcs property set stonith-enabled=false

pcs property set no-quorum-policy=ignore

Virtual IP (VIP) and Pacemaker

A Virtual IP (VIP) is an IP address that is not tied to a specific physical network interface or device but can be dynamically assigned to one or more nodes in a high-availability (HA) cluster. In the context of using Pacemaker for HA, a VIP allows clients to connect to a single address regardless of which node is currently active or serving requests. Pacemaker manages the VIP by monitoring the health of the nodes in the cluster. If the primary node fails or becomes unavailable, Pacemaker automatically reassigns the VIP to a standby node, ensuring continuous availability of services. This seamless failover process allows developers and applications to interact with a consistent endpoint, minimizing downtime and enhancing reliability in distributed applications.

Next, we configure PCS to use the VIP. My VIP is 192.168.34.85, so I run:

sudo pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=192.168.34.85 cidr_netmask=24 op monitor interval=30s

sudo pcs resource create haproxy systemd:haproxy op monitor interval=10s

sudo pcs resource group add HAproxyGroup virtual_ip haproxy

sudo pcs constraint order virtual_ip then haproxy

The last command ensures that the VIP is assigned to a node before HAProxy is started on it.

Then, to check the status of the cluster resources, run:

sudo pcs status resources

The output displays the current status of the resources managed by the cluster. This includes information about each resource, such as:

Resource Name: The name of the resource being monitored.
Resource Type: The type of resource (e.g., virtual IP, service, etc.).
Current State: The current status of the resource (e.g., started, stopped, or failed).
Node Location: The node on which the resource is currently active or running.
Resource Stickiness: If applicable, it shows how much the resource prefers to stay on its current node.
Fail Count: The number of times the resource has failed and been restarted.

Resource Group: HAproxyGroup:

* virtual_ip (ocf:heartbeat:IPaddr2): Started node3

* haproxy (systemd:haproxy): Started node3

After successfully setting up PCS, we configure HAProxy by editing the /etc/haproxy/haproxy.cfg file:

# Global configuration settings
global
    # Maximum connections globally
    maxconn 4096
    # Logging settings
    log /data/logs/haproxy local0
    user haproxy
    group haproxy
    daemon

# Default settings
defaults
    # Global log configuration
    log global
    # Number of retries
    retries 2
    # Client timeout
    timeout client 30m
    # Connect timeout
    timeout connect 4s
    # Server timeout
    timeout server 30m
    # Check timeout
    timeout check 5s

# Stats configuration
listen stats
    # Set mode to HTTP
    mode http
    # Bind to port 7000
    bind *:7000
    # Enable stats
    stats enable
    # Stats URI
    stats uri /
    stats auth pejman:**************

# Frontend for Write Requests
listen production
    # Bind to port 5000
    bind *:5000
    # Enable HTTP check
    option httpchk OPTIONS /master
    # Expect status 200
    http-check expect status 200
    # Server settings
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    # Define PostgreSQL servers
    server dc1-psql-node1 192.168.34.124:5432 maxconn 100 check port 8008
    server dc1-psql-node2 192.168.34.136:5432 maxconn 100 check port 8008
    server dc1-psql-node3 192.168.34.113:5432 maxconn 100 check port 8008

# Backend for Standby Databases (for read requests)
listen standby
    bind *:5001
    option httpchk OPTIONS /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server dc1-psql-node1 192.168.34.124:5432 maxconn 100 check port 8008
    server dc1-psql-node2 192.168.34.136:5432 maxconn 100 check port 8008
    server dc1-psql-node3 192.168.34.113:5432 maxconn 100 check port 8008

# Frontend for Primary Read Requests
frontend read_requests
    bind *:5002
    acl is_primary path_beg /read_primary
    use_backend production if is_primary
    default_backend standby

Based on the above configuration:

Each HAProxy node accepts connections for the primary node on port 5000 and for the replica nodes on port 5001. All three Postgres servers are listed in both blocks, but only the servers that pass the block's health check against the Patroni REST API on port 8008 (/master for production, /replica for standby) receive traffic. Write requests are therefore sent to the primary node, while read requests are distributed in a round-robin manner among the replicas.
The read_requests frontend at the end of the configuration uses HAProxy's path-based routing, so that requests to the /read_primary path on port 5002 are served by the leader, which holds the most up-to-date data. Note that path_beg inspects the path of an HTTP request, so this rule only applies to HTTP traffic; native PostgreSQL connections carry no path and should use port 5000 or 5001 directly.
According to the settings of the listen stats block, the HAProxy status dashboard is accessible on port 7000, and access to it is protected with authentication.
In both the listen production and listen standby blocks, the default-server line sets the parameters that determine HAProxy's behavior towards the back-end servers:
- The interval between health checks is 3 seconds (inter 3s).
- If the health check fails three times in a row, that node is considered down (fall 3).
- If the health check succeeds two times in a row, that node is considered back up (rise 2).
- If a server is considered down, HAProxy immediately closes all sessions for that server. This helps to quickly remove faulty servers and redirect traffic to healthy servers (on-marked-down shutdown-sessions).
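
The fall and rise counters described above can be pictured as a small state machine. The following is a simplified model for illustration only, not HAProxy's implementation; the ServerHealth class and its field names are made up for this sketch, and only the thresholds (fall 3, rise 2) come from the configuration.

```python
from dataclasses import dataclass


@dataclass
class ServerHealth:
    """Simplified model of one backend under `fall 3 rise 2` (hypothetical class)."""
    fall: int = 3      # consecutive failed checks before the server is marked down
    rise: int = 2      # consecutive passed checks before it is marked up again
    up: bool = True
    _streak: int = 0   # consecutive results that contradict the current state

    def observe(self, check_passed: bool) -> bool:
        """Feed one health-check result and return the resulting up/down state."""
        if check_passed == self.up:
            # The result agrees with the current state, so the opposing streak resets.
            self._streak = 0
            return self.up
        self._streak += 1
        threshold = self.fall if self.up else self.rise
        if self._streak >= threshold:
            self.up = not self.up
            self._streak = 0
        return self.up


if __name__ == "__main__":
    server = ServerHealth()
    checks = [True, False, False, False, True, True]
    print([server.observe(result) for result in checks])
    # prints [True, True, True, False, False, True]
```

With checks running every 3 seconds, this model marks a failed server down after the third consecutive failure and brings it back after the second consecutive success.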

To validate your configuration syntax, run:

haproxy -c -f /etc/haproxy/haproxy.cfg

Then restart the haproxy resource by running:

sudo pcs resource restart haproxy

etcd cluster

Before configuring the Patroni nodes, we need to set up the infrastructure that records and maintains the state of the Postgres cluster. For this purpose, we use etcd, a key-value database, as the distributed configuration store (DCS). For this store to be resilient and keep operating when a node fails, we run an instance of etcd on each of the three etcd nodes and cluster them together. Note that etcd uses the Raft algorithm for consensus and leader election, which needs a majority of members to agree, so the cluster should have an odd number of nodes: an even-sized cluster tolerates no more failures than the odd-sized cluster that is one node smaller. To install etcd and the other required packages, we use the following command on the etcd nodes:

sudo apt install etcd-server etcd-client python3-etcd

The etcd-server package contains the etcd daemon binaries.
The etcd-client package contains the etcd client binaries.
The python3-etcd package is a Python client for interacting with etcd, allowing Python programs to communicate with and manage etcd clusters.
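
The reason for preferring an odd number of etcd nodes can be checked with a little arithmetic. This is a sketch of the majority rule only; the function names are made up for the example and are not part of etcd.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority can still be formed."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Three members tolerate one failure and so do four, which is why the fourth node adds cost without adding resilience.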

To configure each etcd node, it is necessary to first delete the files located in the /var/lib/etcd/ directory. The presence of default files in this path causes all etcd nodes to be created with the same UUID, and for this reason, they cannot recognize each other as members of a cluster.

sudo systemctl stop etcd
sudo mkdir -p /data/etcd
sudo chown -R etcd:etcd /data/etcd
sudo rm -rf /var/lib/etcd/*

Then we open the /etc/default/etcd file in an editor and add the following lines to it. Be sure to change the values according to your setup (the example below is for the second etcd node):

ETCD_NAME="etcd-2"
ETCD_DATA_DIR="/data/etcd"
ETCD_LISTEN_PEER_URLS="http://192.168.34.138:2380"
ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://192.168.34.138:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.34.138:2380"
ETCD_INITIAL_CLUSTER="etcd-1=http://192.168.34.137:2380,etcd-2=http://192.168.34.138:2380,etcd-3=http://192.168.34.139:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.34.138:2379"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_ENABLE_V2="true"

At the time of writing, the etcd section of the Patroni configuration talks to etcd through the v2 API (the v3 API is used through the separate etcd3 section), so we enabled v2 above.
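
Since the file differs between nodes only in the node name and IP address, it can be rendered from a single node list. The following is a minimal sketch; render_etcd_env and ETCD_NODES are hypothetical names introduced for this example, while the variable names, ports, and addresses are the ones used above.

```python
# Hypothetical helper: renders the /etc/default/etcd content for one cluster member.
ETCD_NODES = {
    "etcd-1": "192.168.34.137",
    "etcd-2": "192.168.34.138",
    "etcd-3": "192.168.34.139",
}


def render_etcd_env(name: str, nodes: dict[str, str], data_dir: str = "/data/etcd") -> str:
    """Return the environment file content for the member called `name`."""
    ip = nodes[name]  # raises KeyError if the name is not in the node list
    initial_cluster = ",".join(f"{n}=http://{addr}:2380" for n, addr in nodes.items())
    settings = {
        "ETCD_NAME": name,
        "ETCD_DATA_DIR": data_dir,
        "ETCD_LISTEN_PEER_URLS": f"http://{ip}:2380",
        "ETCD_LISTEN_CLIENT_URLS": f"http://localhost:2379,http://{ip}:2379",
        "ETCD_INITIAL_ADVERTISE_PEER_URLS": f"http://{ip}:2380",
        "ETCD_INITIAL_CLUSTER": initial_cluster,
        "ETCD_ADVERTISE_CLIENT_URLS": f"http://{ip}:2379",
        "ETCD_INITIAL_CLUSTER_TOKEN": "etcd-cluster",
        "ETCD_INITIAL_CLUSTER_STATE": "new",
        "ETCD_ENABLE_V2": "true",
    }
    return "".join(f'{key}="{value}"\n' for key, value in settings.items())


if __name__ == "__main__":
    print(render_etcd_env("etcd-2", ETCD_NODES))
```

Calling the helper once per name in ETCD_NODES yields the three files, which keeps ETCD_INITIAL_CLUSTER identical on every node.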

I also changed ETCD_DATA_DIR in the etcd systemd unit so that the service uses /data/etcd as its data directory:

[Unit]
Description=etcd - highly-available key value store
Documentation=https://etcd.io/docs
Documentation=man:etcd
After=network.target
Wants=network-online.target

[Service]
Environment=DAEMON_ARGS=
Environment=ETCD_NAME=%H
Environment=ETCD_DATA_DIR=/data/etcd
EnvironmentFile=-/etc/default/%p
Type=notify
User=etcd
PermissionsStartOnly=true
#ExecStart=/bin/sh -c "GOMAXPROCS=$(nproc) /usr/bin/etcd $DAEMON_ARGS"
ExecStart=/usr/bin/etcd $DAEMON_ARGS
Restart=on-abnormal
#RestartSec=10s
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
Alias=etcd2.service

Then restart the etcd-server:

sudo systemctl restart etcd

Database Layer

We will use version 17 of PostgreSQL. To install this version, we first need to add the PostgreSQL APT repository to the machines. According to the PostgreSQL documentation, we proceed as follows:

# Import the repository signing key:
sudo apt install curl ca-certificates
sudo install -d /usr/share/postgresql-common/pgdg
sudo curl -o /usr/share/postgresql-common/pgdg/apt.postgresql.org.asc --fail https://www.postgresql.org/media/keys/ACCC4CF8.asc

# Create the repository configuration file:
sudo sh -c 'echo "deb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.asc] https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'

After creating pgdg.list, we update the APT cache and install PostgreSQL alongside Patroni on all Postgres nodes:

sudo apt update
sudo apt -y install postgresql-17 postgresql-server-dev-17 patroni etcd-client python3-etcd python3-psycopg2

Since we need to delegate the control of Postgres to Patroni, we will stop the currently running service that has been started automatically:

which psql
sudo systemctl stop postgresql
sudo systemctl stop patroni

Patroni uses some of the tools that are installed with Postgres, so it is necessary to create a symbolic link (symlink) from its binaries in the /usr/sbin/ directory to ensure that Patroni will have access to them:

sudo ln -s /usr/lib/postgresql/17/bin/* /usr/sbin/

Now, following the Patroni docs, we configure Patroni through the /etc/patroni/config.yml file. Mine looks like this:

scope: postgres
namespace: /db/
name: node1

restapi:
    listen: 192.168.34.124:8008
    connect_address: 192.168.34.124:8008

etcd:
    hosts: 192.168.34.137:2379,192.168.34.138:2379,192.168.34.139:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        wal_level: hot_standby # or replica
        hot_standby: "on"
        wal_keep_size: 128MB # replaces wal_keep_segments (removed in PostgreSQL 13); 8 segments x 16MB
        max_wal_senders: 5
        max_replication_slots: 5
        max_connections: 100
        max_worker_processes: 8
        max_locks_per_transaction: 64
        wal_log_hints: "on"
        track_commit_timestamp: "off"
        archive_mode: "on"
        archive_timeout: 1800s
        # Command to archive WAL files. This command creates a directory named 'wal_archive', checks if the file doesn't already exist, and then copies it
        archive_command: mkdir -p ../wal_archive && test ! -f ../wal_archive/%f && cp %p ../wal_archive/%f
      recovery_conf:
        # Command used to retrieve archived WAL files during recovery. It copies files from the 'wal_archive' directory.
        restore_command: cp ../wal_archive/%f %p

  initdb:
  - auth: scram-sha-256
  - encoding: UTF8
  - data-checksums

  pg_hba:
  - host replication replicator 127.0.0.1/32 scram-sha-256
  - host replication replicator 192.168.34.124/32 scram-sha-256
  - host replication replicator 192.168.34.136/32 scram-sha-256
  - host replication replicator 192.168.34.113/32 scram-sha-256
  - host all all 0.0.0.0/0 scram-sha-256

  users:
    admin:
      password: admin
      options:
        - createrole
        - createdb

postgresql:
  listen: 192.168.34.124:5432 # This can remain as is for local connections
  connect_address: 192.168.34.124:5432 # or Point to PgBouncer if you have it externally
  data_dir: /data/patroni
  pgpass: /tmp/pgpass
  authentication:
    replication:
      username: replicator
      password: ***********
    superuser:
      username: postgres
      password: ************
  parameters:
      unix_socket_directories: '.'
      max_connections: 100
      shared_buffers: 512MB
      wal_level: replica
      hot_standby: "on"
      max_wal_senders: 5
      max_replication_slots: 5
      password_encryption: 'scram-sha-256'

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false

In the bootstrap section above, we define the default values and settings that the Postgres node uses during the initial startup. We also configure how Patroni interacts with the distributed configuration store (DCS), which is etcd, here.
Next, we determined how the clients of the cluster are authenticated (pg_hba.conf) and allowed the user 'replicator', which is used for replication between the nodes of the cluster, to access the PostgreSQL cluster nodes using the scram-sha-256 mechanism, which is the most secure method of authentication via password. We also set the password for the admin user in this section.
Then, we entered the remaining configurations for Postgres and Patroni, such as the Postgres address on each node and the passwords for the replicator and postgres users, and saved the changes.
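
The address column of a pg_hba line is an IP/mask pair, and the mask length decides how many clients the rule covers. As a quick sanity check, the sketch below uses Python's standard ipaddress module to compare mask lengths; the two helper functions are hypothetical names introduced for this example.

```python
import ipaddress


def hosts_matched(cidr: str) -> int:
    """Number of IPv4 addresses that an address/mask entry covers."""
    # strict=False accepts host bits in the address part, e.g. 192.168.34.124/0
    return ipaddress.ip_network(cidr, strict=False).num_addresses


def rule_matches(cidr: str, client_ip: str) -> bool:
    """True if a client connecting from `client_ip` falls inside `cidr`."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(cidr, strict=False)


if __name__ == "__main__":
    for entry in ("192.168.34.124/32", "192.168.34.0/24", "192.168.34.124/0"):
        print(f"{entry}: {hosts_matched(entry)} address(es)")
    print(rule_matches("192.168.34.124/0", "8.8.8.8"))
```

A /32 entry covers exactly one host, while a /0 entry covers every IPv4 address regardless of the address written in front of it, so a per-node replication rule should end in /32.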

After editing the config.yml file, it is necessary to transfer the ownership of the directory we designated for storing Patroni data (/data/patroni/) to the postgres user and restrict read and write access to only the owner of the directory:

sudo mkdir -p /data/patroni

sudo chown -R postgres:postgres /data/patroni

sudo chmod 700 /data/patroni

Also check the Patroni systemd unit to ensure that its ExecStart line points to the correct config file:

ExecStart=/usr/bin/patroni /etc/patroni/config.yml

Now we can start the Patroni service on all related nodes:

sudo systemctl restart patroni

If it becomes necessary to make changes to the parameters in the bootstrap.dcs section after the initial setup, we should use the command patronictl edit-config.

To perform a final check on the PostgreSQL nodes managed by Patroni, execute the following command on one of the nodes:

patronictl -c /etc/patroni/config.yml list

The output should look something like this:

+ Cluster: postgres (7431525978089455740) ------+----+-----------+
| Member | Host           | Role    | State     | TL | Lag in MB |
+--------+----------------+---------+-----------+----+-----------+
| node1  | 192.168.34.124 | Leader  | running   |  2 |           |
| node2  | 192.168.34.136 | Replica | streaming |  2 |         0 |
| node3  | 192.168.34.113 | Replica | streaming |  2 |         0 |
+--------+----------------+---------+-----------+----+-----------+
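
For an automated check, the table above can be parsed and tested for exactly one running leader with all replicas streaming. This is only a sketch built around the sample output shown here; parse_members and cluster_ok are hypothetical helpers, and the column names are taken from that sample.

```python
SAMPLE = """\
+ Cluster: postgres (7431525978089455740) ------+----+-----------+
| Member | Host           | Role    | State     | TL | Lag in MB |
+--------+----------------+---------+-----------+----+-----------+
| node1  | 192.168.34.124 | Leader  | running   |  2 |           |
| node2  | 192.168.34.136 | Replica | streaming |  2 |         0 |
| node3  | 192.168.34.113 | Replica | streaming |  2 |         0 |
+--------+----------------+---------+-----------+----+-----------+
"""


def parse_members(table: str) -> list[dict[str, str]]:
    """Turn the pretty-printed table into one dict per member."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table.splitlines()
        if line.startswith("|")  # skip the title and border lines
    ]
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def cluster_ok(members: list[dict[str, str]]) -> bool:
    """One running leader, and every replica in the streaming state."""
    leaders = [m for m in members if m["Role"] == "Leader"]
    replicas = [m for m in members if m["Role"] == "Replica"]
    return (
        len(leaders) == 1
        and leaders[0]["State"] == "running"
        and all(m["State"] == "streaming" for m in replicas)
    )


if __name__ == "__main__":
    members = parse_members(SAMPLE)
    print([m["Member"] for m in members], cluster_ok(members))
```

In practice the text would come from running the patronictl command above rather than from a constant.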

Finally, we disable the automatic execution of the Postgres service after the system reboots so that the control of the Postgres cluster remains with Patroni. If the execution of this command is successful, there will be no specific output:

sudo systemctl disable --now postgresql

Now, if we open the VIP address on port 7000, the HAProxy dashboard displays the status of the nodes:

- one leader in the production section
- the two replica nodes serving read operations in the standby section, which excludes the leader
- read requests routed to the leader node when the is_primary rule matches
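
With the cluster up, client applications connect through the VIP instead of an individual Postgres node: port 5000 for writes and port 5001 for reads. The sketch below builds such connection URIs, for example for a tracking backend; postgres_uri is a hypothetical helper, and the mlflow user and database names are placeholders that are not created anywhere in this setup.

```python
from urllib.parse import quote

VIP = "192.168.34.85"                   # virtual IP managed by Pacemaker
PORTS = {"write": 5000, "read": 5001}   # HAProxy listen ports from haproxy.cfg


def postgres_uri(role: str, user: str, password: str, database: str, host: str = VIP) -> str:
    """Build a PostgreSQL URI that goes through HAProxy for the given role."""
    port = PORTS[role]  # raises KeyError for anything other than "write" or "read"
    # Percent-encode credentials so characters such as '@' or '/' do not break the URI.
    return f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}/{database}"


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    print(postgres_uri("write", "mlflow", "p@ss/word", "mlflow"))
    print(postgres_uri("read", "mlflow", "p@ss/word", "mlflow"))
```

Because the URI points at the VIP, it stays valid when Patroni promotes a different node or Pacemaker moves the VIP to another HAProxy node.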


Supervised vs. Unsupervised Learning

Pejman Rezaei — Sat, 25 Jan 2025 18:45:41 +0000

Machine Learning (ML) is a powerful tool that enables computers to learn from data and make predictions or decisions. But not all ML is the same—there are different types of learning, each suited for specific tasks. Two of the most common types are Supervised Learning and Unsupervised Learning. In this article, we’ll explore the differences between them, provide real-world examples, and walk through code snippets to help you understand how they work.

What is Supervised Learning?

Supervised Learning is a type of ML where the algorithm learns from labeled data. In other words, the data you provide to the model includes both input features and the correct output (labels). The goal is for the model to learn the relationship between the inputs and outputs so it can make accurate predictions on new, unseen data.

Real-World Examples of Supervised Learning

Email Spam Detection:

Input: The text of an email.
Output: A label indicating whether the email is "spam" or "not spam."
The model learns to classify emails based on labeled examples.

House Price Prediction:

Input: Features of a house (e.g., square footage, number of bedrooms, location).
Output: The price of the house.
The model learns to predict prices based on historical data.

Medical Diagnosis:

Input: Patient data (e.g., symptoms, test results).
Output: A diagnosis (e.g., "healthy" or "diabetic").
The model learns to diagnose conditions based on labeled medical records.

What is Unsupervised Learning?

Unsupervised Learning is a type of ML where the algorithm learns from unlabeled data. Unlike supervised learning, there are no correct outputs provided. Instead, the model tries to find patterns, structures, or relationships in the data on its own.

Real-World Examples of Unsupervised Learning

Customer Segmentation:

Input: Customer data (e.g., age, purchase history, location).
Output: Groups of similar customers (e.g., "frequent buyers," "budget shoppers").
The model identifies clusters of customers with similar behaviors.

Anomaly Detection:

Input: Network traffic data.
Output: Identification of unusual patterns that could indicate a cyberattack.
The model detects outliers or anomalies in the data.

Market Basket Analysis:

Input: Transaction data from a grocery store.
Output: Groups of products frequently bought together (e.g., "bread and butter").
The model identifies associations between products.

Key Differences Between Supervised and Unsupervised Learning

Aspect	Supervised Learning	Unsupervised Learning
Data	Labeled (inputs and outputs provided)	Unlabeled (only inputs provided)
Goal	Predict outcomes or classify data	Discover patterns or structures in data
Examples	Classification, Regression	Clustering, Dimensionality Reduction
Evaluation	Easier to evaluate (known outputs)	Harder to evaluate (no ground truth)
Use Cases	Spam detection, price prediction	Customer segmentation, anomaly detection

Code Examples

Let’s dive into some code to see how supervised and unsupervised learning work in practice. We’ll use Python and the popular Scikit-learn library.

Supervised Learning Example: Predicting House Prices

We’ll use a simple linear regression model to predict house prices based on features like square footage.

# Import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Create a sample dataset
data = {
    'SquareFootage': [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700],
    'Price': [245000, 312000, 279000, 308000, 199000, 219000, 405000, 324000, 319000, 255000]
}
df = pd.DataFrame(data)

# Features (X) and labels (y)
X = df[['SquareFootage']]
y = df['Price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
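
To make the metric less of a black box, here is what Mean Squared Error computes, written out in plain Python. This is a sketch with made-up prices, not the output of the model above; the function names are chosen for this example. The root of the MSE (RMSE) is often easier to read because it is in the same unit as the price.

```python
import math


def mse_by_hand(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse_by_hand(y_true, y_pred):
    """Square root of the MSE, expressed in the same unit as the target."""
    return math.sqrt(mse_by_hand(y_true, y_pred))


if __name__ == "__main__":
    actual = [200000, 300000]      # made-up prices
    predicted = [210000, 280000]   # made-up predictions
    print(f"MSE:  {mse_by_hand(actual, predicted):.2f}")
    print(f"RMSE: {rmse_by_hand(actual, predicted):.2f}")
```

Here the errors are 10,000 and 20,000, so the MSE is 250,000,000 and the RMSE is roughly 15,811.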

Unsupervised Learning Example: Customer Segmentation

We’ll use the K-Means clustering algorithm to group customers based on their age and spending habits.

# Import libraries
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Create a sample dataset
data = {
    'Age': [25, 34, 22, 45, 32, 38, 41, 29, 35, 27],
    'SpendingScore': [30, 85, 20, 90, 50, 75, 80, 40, 60, 55]
}
df = pd.DataFrame(data)

# Features (X)
X = df[['Age', 'SpendingScore']]

# Train a K-Means clustering model
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(X)

# Visualize the clusters
plt.scatter(df['Age'], df['SpendingScore'], c=df['Cluster'], cmap='viridis')
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.title('Customer Segmentation')
plt.show()
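
To see what K-Means does under the hood, here is a minimal pure-Python sketch of its two alternating steps: assign each point to the nearest centroid, then move each centroid to the mean of its points. It is not Scikit-learn's implementation; in particular, it assumes the initial centroids are passed in by hand (here simply the first three customers), whereas Scikit-learn chooses them with its own initialization strategy.

```python
import math


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    moved = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            moved.append(old)  # an empty cluster keeps its previous centroid
            continue
        moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return moved


def kmeans(points, centroids, max_iter=100):
    """Alternate assign/update until the centroids stop moving (max_iter >= 1)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        moved = update(points, labels, centroids)
        if moved == centroids:
            break
        centroids = moved
    return labels, centroids


if __name__ == "__main__":
    ages = [25, 34, 22, 45, 32, 38, 41, 29, 35, 27]
    scores = [30, 85, 20, 90, 50, 75, 80, 40, 60, 55]
    points = list(zip(ages, scores))
    labels, centroids = kmeans(points, points[:3])  # assumption: first three rows as seeds
    print(labels)
    print(centroids)
```

Different starting centroids can lead to different clusters, which is why library implementations usually run several initializations and keep the best result.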

When to Use Supervised vs. Unsupervised Learning

Use Supervised Learning when:

You have labeled data.
You want to predict outcomes or classify data.
Examples: Predicting sales, classifying images, detecting fraud.

Use Unsupervised Learning when:

You have unlabeled data.
You want to discover hidden patterns or structures.
Examples: Grouping customers, reducing data dimensions, finding anomalies.

Conclusion

Supervised and Unsupervised Learning are two fundamental approaches in Machine Learning, each with its own strengths and use cases. Supervised Learning is great for making predictions when you have labeled data, while Unsupervised Learning shines when you want to explore and uncover patterns in unlabeled data.

By understanding the differences and practicing with real-world examples (like the ones in this article), you’ll be well on your way to mastering these essential ML techniques. If you have any questions or want to share your own experiences, feel free to leave a comment below.

Understanding Unix Sockets: A Deep Dive into Inter-Process Communication

Pejman Rezaei — Tue, 21 Jan 2025 21:05:00 +0000

If you've ever worked on a Unix-based system, chances are you've encountered the term "Unix sockets." But what exactly are they, and why should you care? In this article, we'll explore Unix sockets, how they work, and why they are a powerful tool for inter-process communication (IPC).

What Are Unix Sockets?

Unix sockets, also known as Unix domain sockets, are a form of IPC that allows processes running on the same machine to communicate with each other. Unlike network sockets, which use IP addresses and ports to facilitate communication between machines, Unix sockets operate entirely within the kernel, making them faster and more efficient for local communication.

Key Characteristics of Unix Sockets

Local Communication: Unix sockets are designed for communication between processes on the same machine. This makes them ideal for scenarios where you need high-speed, low-latency communication.

File-Based: Unix sockets are represented as files in the filesystem. This means you can use standard file operations to manage them, such as creating, deleting, and setting permissions.

Stream and Datagram Support: Unix sockets support both stream-oriented (like TCP) and datagram-oriented (like UDP) communication, giving you flexibility in how you design your IPC mechanisms.

How Do Unix Sockets Work?

Unix sockets operate using a client-server model. One process acts as the server, creating a socket and listening for incoming connections. Another process acts as the client, connecting to the server's socket to establish communication.

Creating a Unix Socket

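To create a Unix socket, you typically follow the steps below. The snippets are fragments of a single server program and assume that <stdio.h>, <stdlib.h>, <string.h>, <unistd.h>, <sys/socket.h>, and <sys/un.h> have been included: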

Create the Socket: Use the socket() system call to create a new socket. You'll specify the domain as AF_UNIX to indicate that you're creating a Unix socket.

int sockfd = socket(AF_UNIX, SOCK_STREAM, 0);
if (sockfd == -1) {
    perror("socket");
    exit(EXIT_FAILURE);
}

Bind the Socket: Use the bind() system call to bind the socket to a filesystem path. This path will be used by clients to connect to the socket.

struct sockaddr_un addr;
memset(&addr, 0, sizeof(struct sockaddr_un));
addr.sun_family = AF_UNIX;
strncpy(addr.sun_path, "/tmp/my_socket", sizeof(addr.sun_path) - 1);

// Remove a stale socket file from a previous run; otherwise bind() fails with EADDRINUSE
unlink(addr.sun_path);

if (bind(sockfd, (struct sockaddr *)&addr, sizeof(struct sockaddr_un)) == -1) {
    perror("bind");
    exit(EXIT_FAILURE);
}

Listen for Connections: If you're creating a server, use the listen() system call to start listening for incoming connections.

if (listen(sockfd, 5) == -1) {
    perror("listen");
    exit(EXIT_FAILURE);
}

Accept Connections: Use the accept() system call to accept incoming connections from clients.

int client_sockfd = accept(sockfd, NULL, NULL);
if (client_sockfd == -1) {
    perror("accept");
    exit(EXIT_FAILURE);
}

Communicate: Once a connection is established, you can use read() and write() system calls to communicate between the client and server.

char buffer[256];
// Leave room for the terminating null byte: read() does not add one
ssize_t n = read(client_sockfd, buffer, sizeof(buffer) - 1);
if (n == -1) {
    perror("read");
    exit(EXIT_FAILURE);
}
buffer[n] = '\0';

printf("Received: %s\n", buffer);

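Close the Socket: Finally, use the close() system call to close both the client connection and the listening socket when you're done. The socket file stays in the filesystem after close(), so a server normally also removes it with unlink().

close(client_sockfd);
close(sockfd);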

Connecting to a Unix Socket (Client Side)

On the client side, the process is simpler:

Create the Socket: Just like on the server side, you start by creating a socket.

int sockfd = socket(AF_UNIX, SOCK_STREAM, 0);
if (sockfd == -1) {
    perror("socket");
    exit(EXIT_FAILURE);
}

Connect to the Server: Use the connect() system call to connect to the server's socket.

struct sockaddr_un addr;
memset(&addr, 0, sizeof(struct sockaddr_un));
addr.sun_family = AF_UNIX;
strncpy(addr.sun_path, "/tmp/my_socket", sizeof(addr.sun_path) - 1);

if (connect(sockfd, (struct sockaddr *)&addr, sizeof(struct sockaddr_un)) == -1) {
    perror("connect");
    exit(EXIT_FAILURE);
}

Communicate: Once connected, you can use read() and write() to communicate with the server.

char *message = "Hello, Server!";
if (write(sockfd, message, strlen(message)) == -1) {
    perror("write");
    exit(EXIT_FAILURE);
}

Close the Socket: Don't forget to close the socket when you're done.

close(sockfd);
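
To try the whole exchange end to end without compiling two C programs, the same server and client steps can be sketched with Python's standard socket module. This is an illustration under a few assumptions: it runs on a Unix-like system, the server answers a single client from a background thread, and the socket path lives in a temporary directory instead of /tmp/my_socket. The function names are made up for the example.

```python
import os
import socket
import tempfile
import threading


def serve_once(server_sock: socket.socket) -> None:
    """Accept one client, read its message, and send it back with a prefix."""
    conn, _ = server_sock.accept()
    with conn:
        data = conn.recv(256)
        conn.sendall(b"echo: " + data)


def round_trip(message: bytes) -> bytes:
    """Start a one-shot Unix socket server, send `message`, and return the reply."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "my_socket")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            server.bind(path)   # creates the socket file
            server.listen(5)    # same backlog as in the C example
            worker = threading.Thread(target=serve_once, args=(server,))
            worker.start()
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
                client.connect(path)
                client.sendall(message)
                reply = client.recv(256)
            worker.join()
        return reply  # the temporary directory and socket file are removed on exit


if __name__ == "__main__":
    print(round_trip(b"Hello, Server!"))
```

The calls map one to one onto the C system calls above: socket(), bind(), listen(), accept(), and connect(), with recv() and sendall() standing in for read() and write().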

Why Use Unix Sockets?

Performance

Since Unix sockets operate entirely within the kernel, they are significantly faster than network sockets for local communication. There's no overhead associated with network protocols, making them ideal for high-performance applications.

Security

Unix sockets can be secured using filesystem permissions. You can control which users and processes have access to the socket by setting appropriate permissions on the socket file.
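
As a small illustration of permission-based access control, the sketch below binds a socket and restricts its file to the owner. It assumes a Unix-like system; bind_private and describe are hypothetical helper names.

```python
import os
import socket
import stat
import tempfile


def bind_private(path: str, mode: int = 0o600) -> socket.socket:
    """Bind a Unix stream socket at `path` and restrict its file permissions."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)        # creates the socket file
    os.chmod(path, mode)   # owner-only read/write by default
    return sock


def describe(path: str) -> tuple[bool, int]:
    """Return (is_socket, permission_bits) for the file at `path`."""
    info = os.stat(path)
    return stat.S_ISSOCK(info.st_mode), stat.S_IMODE(info.st_mode)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "my_socket")
        with bind_private(path):
            is_socket, perms = describe(path)
            print(is_socket, oct(perms))
```

There is a short window between bind() and chmod() in which the file still has its default permissions, so placing the socket in a directory that only the owner can enter (as the temporary directory here is) is a common extra safeguard.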

Simplicity

Unix sockets are straightforward to use, especially for developers already familiar with network programming. The API is similar to that of network sockets, so the learning curve is minimal.

Flexibility

Unix sockets support both stream and datagram communication, giving you the flexibility to choose the best approach for your application. Stream sockets provide a reliable, ordered byte stream, while datagram sockets preserve message boundaries. Unlike UDP over a network, Unix domain datagram sockets are reliable and do not reorder datagrams on most Unix implementations, including Linux.

Real-World Use Cases

Databases

Many databases, such as PostgreSQL, use Unix sockets for local connections. This allows the database server to communicate with client applications running on the same machine with minimal overhead.

Web Servers

Web servers like Nginx and Apache can use Unix sockets to communicate with backend application servers, such as PHP-FPM or uWSGI. This setup is common in high-performance web applications.

Containerization

In containerized environments, Unix sockets are often used to facilitate communication between processes and containers running on the same host. For example, the Docker daemon listens on the Unix socket /var/run/docker.sock by default; the Docker CLI uses it to talk to the daemon, and it can be mounted into a container that needs to do the same.

Conclusion

Unix sockets are a powerful tool for inter-process communication on Unix-based systems. They offer high performance, security, and flexibility, making them an excellent choice for a wide range of applications. Whether you're building a database, a web server, or a containerized application, understanding Unix sockets can help you design more efficient and secure systems.

So next time you're working on a project that requires local IPC, consider using Unix sockets. They might just be the perfect solution for your needs.

Getting Started with Python for Machine Learning

Pejman Rezaei — Sat, 18 Jan 2025 21:18:01 +0000

Python has become the go-to programming language for Machine Learning (ML) thanks to its simplicity, versatility, and the vast ecosystem of libraries it offers. If you’re new to ML and want to get started with Python, this guide will walk you through the basics, introduce you to essential libraries, and show you how to build a simple ML model.

Why Python for Machine Learning?

Python is widely used in the ML community because:

It’s easy to learn and read, even for beginners.
It has a rich set of libraries for data manipulation, visualization, and ML.
It’s supported by a large and active community.

Whether you’re analyzing data, training models, or deploying ML solutions, Python has the tools to make your life easier.

Essential Python Libraries for Machine Learning

Before diving into ML, let’s take a look at some of the most important Python libraries you’ll need:

NumPy:
NumPy (Numerical Python) is the foundation for numerical computing in Python. It provides support for arrays, matrices, and mathematical functions.

Use it for: Basic numerical operations, linear algebra, and array manipulation.

Pandas:
Pandas is a powerful library for data manipulation and analysis. It introduces data structures like DataFrames, which make it easy to work with structured data.

Use it for: Loading, cleaning, and exploring datasets.

Scikit-learn:
Scikit-learn is the most popular library for ML in Python. It provides simple and efficient tools for data mining and analysis, including algorithms for classification, regression, clustering, and more.

Use it for: Building and evaluating ML models.

Setting Up Your Environment

To get started, you’ll need to install these libraries. If you haven’t already, you can install them using pip:

pip install numpy pandas scikit-learn

Once installed, you’re ready to start coding!

A Simple Machine Learning Workflow

Let’s walk through a basic ML workflow using Python. We’ll use the famous Iris dataset, which contains information about different species of iris flowers. Our goal is to build a model that can classify the species based on features like petal length and width.

Step 1: Import Libraries

First, import the necessary libraries:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

Step 2: Load the Dataset

Scikit-learn provides built-in datasets, including the Iris dataset. Let’s load it:

# Load the Iris dataset
iris = load_iris()

# Convert it to a Pandas DataFrame for easier manipulation
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target

Step 3: Explore the Data

Before building a model, it’s important to understand the data:

# Display the first few rows
print(data.head())

# Check for missing values
print(data.isnull().sum())

# Get basic statistics
print(data.describe())

Step 4: Prepare the Data

Split the data into features (X) and labels (y), and then split it into training and testing sets:

# Features (X) and labels (y)
X = data.drop('species', axis=1)
y = data['species']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 5: Train a Model

Let’s use a Random Forest classifier, a popular ML algorithm:

# Initialize the model
model = RandomForestClassifier(random_state=42)

# Train the model
model.fit(X_train, y_train)

Step 6: Make Predictions and Evaluate the Model

Use the trained model to make predictions on the test set and evaluate its accuracy:

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
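To see what the accuracy figure means, here is a minimal plain-Python sketch of the same idea: the fraction of predictions that match the true labels. The helper name and the labels are hypothetical and are not taken from the Iris run above.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels for illustration only
y_true_demo = [0, 1, 2, 2, 1]
y_pred_demo = [0, 1, 2, 1, 1]

demo_accuracy = accuracy(y_true_demo, y_pred_demo)
print(f"Accuracy: {demo_accuracy * 100:.2f}%")  # 4 of 5 predictions match
```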

Congratulations! You’ve just built your first ML model using Python. Here are some next steps to continue your learning journey:

Experiment with other datasets from Kaggle or the UCI Machine Learning Repository.
Explore different ML algorithms like linear regression, decision trees, or support vector machines.
Learn about data preprocessing techniques like scaling, encoding, and feature selection.

Resources to Learn More

If you’re interested in diving deeper, here are some great resources:

Scikit-learn Documentation: The official guide to using Scikit-learn.
Kaggle Learn: Hands-on tutorials for ML beginners.
Python Machine Learning by Sebastian Raschka: A beginner-friendly book on ML with Python.

Advanced Load Balancing with Traefik: An Introduction to Progressive Delivery, Mirroring, Sticky Sessions, and Health Checks

Pejman Rezaei — Wed, 15 Jan 2025 14:35:53 +0000

In modern cloud-native environments, efficient traffic management is critical for ensuring high availability, scalability, and reliability of applications. Traefik Proxy, a popular cloud-native edge router, offers advanced load-balancing features that go beyond simple round-robin or least-connections algorithms. In this article, I’ll explore some of Traefik’s advanced load-balancing capabilities, including progressive delivery, traffic mirroring, sticky sessions, and nested health checks. I’ll also provide practical examples and configurations to help you implement these features in your environment.

1. Progressive Delivery with Traefik’s Weighted Round Robin (WRR)

What is Progressive Delivery?

Progressive delivery is a deployment strategy that allows you to gradually roll out new versions of an application to a subset of users or traffic. This approach minimizes risk by enabling you to test new versions in production with real traffic before fully committing to the rollout.

How Traefik’s Weighted Round Robin (WRR) Works

Traefik’s Weighted Round Robin (WRR) load-balancing algorithm allows you to distribute traffic across multiple services based on predefined weights. This is particularly useful for progressive delivery, where you want to route a small percentage of traffic to a new version of your application while keeping the majority of traffic on the stable version.

Example: Progressive Rollout of a New Application Version

Let’s say you have two versions of an application:

app-v1: The stable version (handles 90% of traffic).
app-v2: The new version (handles 10% of traffic).

Here’s how you can configure Traefik to achieve this:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: app-ingressroute
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`app.example.com`)
      kind: Rule
      services:
        - name: app-v1
          port: 80    # assumed Service port; adjust to your Service
          weight: 90  # roughly 90% of requests go to app-v1
        - name: app-v2
          port: 80    # assumed Service port; adjust to your Service
          weight: 10  # roughly 10% of requests go to app-v2
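To build intuition for how weights turn into a traffic split, here is a minimal Python sketch of a "smooth" weighted round robin. It is a generic textbook variant and not Traefik's actual implementation; the function name is hypothetical.

```python
from collections import Counter


def smooth_wrr(weights, n):
    """Return n picks from a dict of {service: weight}, spread out evenly."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        # Every service earns credit equal to its weight on each round
        for name, weight in weights.items():
            current[name] += weight
        # The service with the most credit is chosen and pays back the total
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


picks = smooth_wrr({"app-v1": 90, "app-v2": 10}, 100)
counts = Counter(picks)
print(counts)
```

Over 100 requests this yields a 90/10 split, with the app-v2 picks interleaved rather than bunched at the end.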

Key Benefits of WRR for Progressive Delivery

Risk Mitigation: Gradually expose new versions to real traffic, reducing the risk of widespread failures.
Flexibility: Easily adjust weights to increase or decrease traffic to the new version.
Real-World Testing: Test new versions in production without affecting all users.

2. Traffic Mirroring for Testing New Application Versions

What is Traffic Mirroring?

Traffic mirroring (or shadowing) is a technique where a copy of live traffic is sent to a new version of an application without affecting the response returned to the client. This allows you to test new versions under real-world conditions without impacting users.

How Traefik Implements Traffic Mirroring

Traefik supports traffic mirroring through its service mirroring feature. You can configure Traefik to send a copy of incoming requests to a secondary service while still routing the original request to the primary service.

Example: Mirroring Traffic to a New Version

Suppose you want to test app-v2 by mirroring 10% of traffic from app-v1. Here’s how you can configure it:

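# Mirroring is defined on a TraefikService, which the IngressRoute then references.
apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: app-mirror
spec:
  mirroring:
    name: app-v1     # primary service; its response is returned to the client
    port: 80         # assumed Service port
    mirrors:
      - name: app-v2
        port: 80     # assumed Service port
        percent: 10  # 10% of requests are copied to app-v2
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: app-ingressroute
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`app.example.com`)
      kind: Rule
      services:
        - name: app-mirror
          kind: TraefikService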
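The behaviour can be sketched in a few lines of Python. This is an illustration of the idea only, not Traefik's code: the handler names are hypothetical, and the mirror is called synchronously here for simplicity, whereas a real proxy would not make the client wait for it.

```python
def make_mirroring_handler(primary, mirror, percent):
    """Wrap primary so that about `percent` of requests are also sent to mirror."""
    seen = 0
    mirrored = 0

    def handle(request):
        nonlocal seen, mirrored
        seen += 1
        # Mirror whenever the mirrored share has fallen below the target
        if mirrored * 100 < seen * percent:
            mirrored += 1
            try:
                mirror(request)  # the mirror's response is discarded
            except Exception:
                pass  # a failing mirror must never affect the client
        return primary(request)

    return handle


mirror_log = []


def app_v1(request):
    return f"v1 handled {request}"


def app_v2(request):
    mirror_log.append(request)
    raise RuntimeError("bug in the new version")


handle = make_mirroring_handler(app_v1, app_v2, percent=10)
responses = [handle(f"req-{i}") for i in range(100)]
print(len(mirror_log), responses[0])
```

Even though app_v2 raises on every call, all 100 clients still receive the app_v1 response.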

Key Benefits of Traffic Mirroring

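Low-Risk Testing: Test new versions with real traffic. Responses from the mirror are discarded, so users only see the primary’s responses. Mirrored requests still execute on the new version, so keep it away from shared state such as production databases.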
Performance Insights: Observe how the new version performs under real-world conditions.
Debugging: Identify and fix issues in the new version before full deployment.

3. Sticky Sessions and Session Replication

What are Sticky Sessions?

Sticky sessions (or session affinity) ensure that requests from the same client are always routed to the same backend server. This is particularly useful for stateful applications where user sessions are stored locally on the server.

How Traefik Implements Sticky Sessions

Traefik supports sticky sessions using cookies. When a client makes its first request, Traefik assigns a cookie to the client, which is then used to route subsequent requests to the same backend server.

Example: Enabling Sticky Sessions

Here’s how you can configure sticky sessions in Traefik:

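apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: app-ingressroute
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`app.example.com`)
      kind: Rule
      services:
        - name: app-v1
          port: 80  # assumed Service port; adjust to your Service
          sticky:
            cookie:
              name: traefik-sticky-cookie
              secure: true    # browsers only send Secure cookies over HTTPS, so serve this route over TLS
              httpOnly: true  # hide the cookie from client-side scripts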
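The cookie mechanism can be sketched in Python as follows. This is a simplified illustration, not Traefik's implementation: the class and backend names are hypothetical, and the cookie stores the backend name in plain text for readability.

```python
from itertools import cycle

COOKIE_NAME = "traefik-sticky-cookie"


class StickyRouter:
    """Route a client to the backend named in its cookie, else pick one."""

    def __init__(self, backends):
        self.backends = set(backends)
        self._next = cycle(backends)  # round robin for first-time clients

    def route(self, cookies):
        """Return (backend, cookies_to_set) for a request's cookie dict."""
        backend = cookies.get(COOKIE_NAME)
        if backend in self.backends:
            return backend, {}  # known client: keep the same backend
        backend = next(self._next)
        return backend, {COOKIE_NAME: backend}  # new client: assign a cookie


router = StickyRouter(["pod-a", "pod-b"])
first_backend, set_cookies = router.route({})           # first visit
repeat_backend, _ = router.route(set_cookies)           # same client again
other_backend, _ = router.route({})                     # a different client
print(first_backend, repeat_backend, other_backend)
```

If the backend named in the cookie no longer exists, the router falls back to picking a new one, which is also why locally stored sessions can be lost when a backend goes away.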

Key Benefits of Sticky Sessions

Session Consistency: Ensures that user sessions are maintained on the same server.
Improved Performance: Reduces the overhead of session replication or database lookups.
Stateful Applications: Ideal for applications that rely on local session storage.

4. Nested Health Checks for Advanced Routing

What are Nested Health Checks?

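Nested health checks let health status propagate through a hierarchy of services. When every server behind a child service fails its health check, the parent service (for example, a weighted round robin or failover service) treats that child as down and stops routing to it. This is particularly useful in complex environments where routing depends on the health of multiple components or data centers.

How Traefik Implements Nested Health Checks

Traefik configures health checks on a service’s load balancer, not through a middleware. You define a health-check path, interval, and timeout per service, and Traefik takes servers that fail the check out of rotation until they recover.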

Example: Routing Between Two Data Centers

Suppose you have two data centers (dc1 and dc2) and want to route traffic to the healthier data center. Here’s how you can configure it:

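apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: app-ingressroute
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`app.example.com`)
      kind: Rule
      services:
        # In recent Traefik versions, healthCheck on an IngressRoute service applies to
        # Services of type ExternalName (assumed here for dc1 and dc2).
        - name: dc1
          port: 80  # assumed Service port
          healthCheck:
            path: /health
            interval: 10s
            timeout: 5s
        - name: dc2
          port: 80  # assumed Service port
          healthCheck:
            path: /health
            interval: 10s
            timeout: 5s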
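The failover logic behind the data-center example can be sketched in Python. This is an illustration under simplifying assumptions, not Traefik's implementation: the function and server names are hypothetical, and health is a simple lookup rather than a periodic HTTP probe.

```python
def first_healthy(data_centers, is_healthy):
    """Return (dc_name, server) from the first data center with a healthy server.

    A data center counts as healthy when at least one of its servers is,
    which is how child health "bubbles up" to the parent.
    """
    for dc_name, servers in data_centers.items():
        healthy = [s for s in servers if is_healthy(s)]
        if healthy:
            return dc_name, healthy[0]
    return None  # nothing is healthy anywhere


data_centers = {"dc1": ["dc1-a", "dc1-b"], "dc2": ["dc2-a"]}

# One server in dc1 is down: dc1 is still healthy overall
down = {"dc1-a"}
partial_outage = first_healthy(data_centers, lambda s: s not in down)

# Every server in dc1 is down: traffic fails over to dc2
down = {"dc1-a", "dc1-b"}
full_outage = first_healthy(data_centers, lambda s: s not in down)

print(partial_outage, full_outage)
```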

Example: Routing to a Non-Kubernetes Service (e.g., a VM)

If you have a service running on a VM outside Kubernetes, you can still use Traefik’s health checks to route traffic:

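# IngressRoute services have no url field. One option is to expose the VM through a
# Service of type ExternalName (requires allowExternalNameServices on the provider).
apiVersion: v1
kind: Service
metadata:
  name: vm-service
spec:
  type: ExternalName
  externalName: vm.example.internal  # hypothetical DNS name of the VM
  ports:
    - port: 8080
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: app-ingressroute
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`app.example.com`)
      kind: Rule
      services:
        - name: vm-service
          port: 8080
          healthCheck:
            path: /health
            interval: 10s
            timeout: 5s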

Key Benefits of Nested Health Checks

Custom Health Logic: Define health checks tailored to your application’s needs.
Failover Support: Automatically route traffic to healthy services or data centers.
Hybrid Environments: Seamlessly integrate Kubernetes and non-Kubernetes services.

Conclusion

Traefik Proxy’s advanced load-balancing features, such as Weighted Round Robin (WRR), traffic mirroring, sticky sessions, and nested health checks, provide powerful tools for managing traffic in modern cloud-native environments. By leveraging these features, you can implement progressive delivery, test new application versions safely, maintain session consistency, and ensure high availability across complex infrastructures.

Whether you’re running applications on Kubernetes, VMs, or hybrid environments, Traefik’s flexibility and extensibility make it an excellent choice for advanced traffic management. Start experimenting with these features today and take your load-balancing strategies to the next level!

What is Machine Learning? A Beginner’s Guide

Pejman Rezaei — Thu, 09 Jan 2025 19:53:50 +0000

Machine Learning (ML) is one of the most exciting and transformative technologies of our time. From personalized Netflix recommendations to self-driving cars, ML is powering innovations across industries. But what exactly is Machine Learning, and how does it work? If you’re new to the field, this guide will break it down in simple terms and help you get started.

What is Machine Learning?

At its core, Machine Learning is a subset of Artificial Intelligence (AI) that enables computers to learn from data and make decisions without being explicitly programmed. Instead of writing rules for every possible scenario, we feed data to an algorithm, and it learns patterns to make predictions or decisions.

For example, if you want to build a system that can identify cats in photos, you don’t need to write rules like "cats have pointy ears and whiskers." Instead, you show the algorithm thousands of cat pictures, and it learns to recognize cats on its own.

Types of Machine Learning

There are three main types of Machine Learning:

Supervised Learning:
The algorithm learns from labeled data. For example, if you’re training a model to predict house prices, you provide it with data that includes features (e.g., square footage, number of bedrooms) and labels (the actual prices). The model learns the relationship between the features and the labels.
Unsupervised Learning:
The algorithm learns from unlabeled data. It identifies patterns or groups in the data without any guidance. A common example is clustering, where the algorithm groups similar data points together (e.g., grouping customers based on purchasing behavior).
Reinforcement Learning:
The algorithm learns by interacting with an environment and receiving feedback in the form of rewards or penalties. This is how AI systems like AlphaGo (which plays the board game Go) learn to make strategic decisions.
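To make supervised learning concrete, here is a minimal plain-Python sketch that fits a straight line to the house-price idea above using ordinary least squares. The numbers are made up for illustration, and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Features (square footage) and labels (price); hypothetical data
square_feet = [1000, 1500, 2000, 2500]
prices = [150_000, 200_000, 250_000, 300_000]

slope, intercept = fit_line(square_feet, prices)
predicted = slope * 1800 + intercept  # price estimate for an unseen house
print(slope, intercept, predicted)
```

The "learning" here is nothing more than estimating the slope and intercept from labeled examples; the model then applies that relationship to a house it has never seen.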

Machine Learning is everywhere! Here are a few examples of how it’s used in the real world:

Recommendation Systems: Platforms like Netflix and Spotify use ML to recommend movies, shows, and songs based on your preferences.

Healthcare: ML models can analyze medical images to detect diseases like cancer or predict patient outcomes.

Finance: Banks use ML to detect fraudulent transactions and assess credit risk.

Autonomous Vehicles: Self-driving cars use ML to recognize objects, navigate roads, and make driving decisions.

How Does Machine Learning Work?

Here’s a simplified breakdown of the ML process:

Collect Data: Gather relevant data for your problem. For example, if you’re building a spam filter, you’ll need a dataset of emails labeled as "spam" or "not spam."

Preprocess Data: Clean and prepare the data for training. This might involve handling missing values, scaling features, or splitting the data into training and testing sets.

Choose a Model: Select an algorithm (e.g., linear regression, decision trees, neural networks) that fits your problem.

Train the Model: Feed the training data to the algorithm so it can learn patterns.

Evaluate the Model: Test the model on unseen data to see how well it performs.

Deploy the Model: Once the model is trained and tested, you can use it to make predictions on new data.
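The steps above can be sketched end to end in plain Python. This is a toy illustration: the data points are made up, the split is done by hand, and the "model" is a simple nearest-centroid classifier chosen only because it fits in a few lines.

```python
from math import dist

# 1. Collect data: (features, label) pairs; hypothetical values
data = [
    ((1.0, 1.2), "a"), ((0.8, 1.0), "a"), ((1.1, 0.9), "a"),
    ((3.0, 3.2), "b"), ((3.1, 2.9), "b"), ((2.9, 3.0), "b"),
    ((1.0, 1.0), "a"), ((3.0, 3.0), "b"),
]

# 2. Preprocess: hold out the last two samples as unseen test data
train_set, test_set = data[:6], data[6:]


# 3-4. Choose and train a model: average the training points of each label
def train(samples):
    grouped = {}
    for point, label in samples:
        grouped.setdefault(label, []).append(point)
    return {
        label: tuple(sum(axis) / len(points) for axis in zip(*points))
        for label, points in grouped.items()
    }


def predict(centroids, point):
    """Return the label whose centroid is closest to the point."""
    return min(centroids, key=lambda label: dist(centroids[label], point))


centroids = train(train_set)

# 5. Evaluate on the held-out samples
predictions = [predict(centroids, point) for point, _ in test_set]
correct = sum(p == label for p, (_, label) in zip(predictions, test_set))
test_accuracy = correct / len(test_set)

# 6. "Deploy": use the trained model on a new data point
new_prediction = predict(centroids, (2.8, 3.1))
print(test_accuracy, new_prediction)
```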

Getting Started with Machine Learning

If you’re eager to dive in, here’s how you can get started:

Learn Python: Python is the most popular programming language for ML. Start with libraries like NumPy, Pandas, and Scikit-learn.
Explore Datasets: Websites like Kaggle and UCI Machine Learning Repository offer free datasets to practice on.
Build Simple Projects: Start with beginner-friendly projects like predicting house prices or classifying iris flowers.

Machine Learning is a powerful tool that’s changing the way we solve problems. While it might seem complex at first, breaking it down into simple concepts makes it much more approachable. Whether you’re interested in building recommendation systems, analyzing data, or creating AI-powered applications, ML offers endless possibilities.

So what excites you most about Machine Learning? Let me know in the comments, and feel free to ask any questions you have about getting started. Don’t forget to follow me for more beginner-friendly guides on ML and MLOps!

Sources and credits:

https://www.ntiva.com/blog/what-is-machine-learning
https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained
https://www.spiceworks.com/tech/artificial-intelligence/articles/what-is-ml/
https://www.scribbr.com/ai-tools/machine-learning/
"Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron – This book was my go-to guide for understanding ML concepts and implementing them in Python.
"Python Machine Learning" by Sebastian Raschka and Vahid Mirjalili – A beginner-friendly book that helped me grasp the fundamentals of ML algorithms.