Forem: Hikaru

Let's build a simple MLOps workflow on AWS! #3 - Running ML training as a container

Hikaru — Thu, 30 May 2024 09:38:16 +0000

About this post

This post is a sequel to the previous one below. Please refer to the earlier post before reading this one.

Let's build a simple MLOps workflow on AWS! #2 - Building infrastructure on AWS - DEV Community
https://dev.to/hikarunakatani/lets-build-a-simple-mlops-workflow-on-aws-2-building-infrastructure-on-aws-3h2j

Overview

We prepared the ML model in the first post and set up the infrastructure in the second. In this post, we'll configure the settings needed to run the training process as a container image and see how the whole workflow behaves.

Access to S3 bucket

Because we run this training task on an ECS cluster in a private subnet, we can't use an internet connection to download the training data or upload the trained model. Instead, we place the data in an S3 bucket beforehand and access it through a VPC endpoint using the AWS SDK. To GET and PUT objects in an S3 bucket, you can implement code like the following:

import zipfile
import boto3
import botocore
from botocore.exceptions import ClientError


def download_data():
    """Download training data from Amazon S3 bucket
    Used when run as a ECS task
    """

    s3 = boto3.client("s3")
    bucket_name = "cifar10-mlops-bucket"
    file_key = "data.zip"
    local_file_path = "data.zip"
    extract_to = "./"

    botocore.session.Session().set_debug_logger()

    # Download the file from S3
    try:
        s3.download_file(bucket_name, file_key, local_file_path)
        print("File downloaded successfully.")

        # Extract the contents of the zip file
        with zipfile.ZipFile(local_file_path, "r") as zip_ref:
            zip_ref.extractall(extract_to)
            print(f"Zip file extracted successfully to '{extract_to}'.")

    except ClientError as e:
        print(f"An error occurred while downloading training data {e}")


def upload_model():
    """Upload pre-trained model to S3 bucket"""

    s3_client = boto3.client("s3")
    file_path = "model.pth"
    bucket_name = "cifar10-mlops-bucket"
    object_key = "model.pth"

    try:
        s3_client.upload_file(file_path, bucket_name, object_key)
        print(f"Uploaded {file_path} to {bucket_name}/{object_key}")
    except ClientError as e:
        print(f"An error occurred while uploading {e}")

When accessing an S3 bucket from a private subnet, ensure that the access policy of the VPC endpoint and the S3 bucket itself allows the necessary permissions. Additionally, the target ECS task must be explicitly permitted to access the bucket through its task role.
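
To make the task role part more concrete, here is a minimal sketch of an IAM policy document that would cover the download and upload calls above. The helper name build_task_role_policy is hypothetical, and scoping the policy to exactly these two actions is my assumption; your VPC endpoint policy and bucket policy may need matching statements.

```python
import json


def build_task_role_policy(bucket_name: str) -> dict:
    """Return an IAM policy document allowing object reads and writes on one bucket.

    Hypothetical helper for illustration; not taken from the repository.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # download_file needs GetObject, upload_file needs PutObject
                "Action": ["s3:GetObject", "s3:PutObject"],
                # Object-level actions apply to the keys inside the bucket
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


policy = build_task_role_policy("cifar10-mlops-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON can be attached to the ECS task role, for example through Terraform.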

Making a Docker image

# Use Python baseimage
FROM python:3.8-slim-buster

# Set the working directory in the container to /app
WORKDIR /app

# Add the current directory contents into the container at /app
ADD . /app

# Install any needed packages specified in requirements.txt
RUN pip install --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Make port 80 available to the world outside this container
EXPOSE 80

# Run main.py when the container launches
ENTRYPOINT ["python", "main.py"]
CMD ["--env", "ecs"]

This is a basic example of a Dockerfile for running the training task as a Docker container. An important point here is specifying the environment by passing CMD ["--env", "ecs"] as the default command argument. This is necessary because, in a local environment, the training data is downloaded over the internet, as shown below, whereas on ECS it is fetched from S3. With this argument, the program can change its behavior depending on the environment:

download_flag = False

# Dataset directory
data_dir = "./data"

# Download training data if it doesn't exist
if not os.path.exists(os.path.join(data_dir, "cifar-10-batches-py")):
    if env == "local":
        download_flag = True
    elif env == "ecs":
        aws_action.download_data()

# Preprocess data
# Transform PIL Image to tensor
# Normalize a tensor image in each channel with mean and standard deviation
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)

# Download training data of CIFAR-10
trainset = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=download_flag, transform=transform
)
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=4, shuffle=True, num_workers=os.cpu_count()
)
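
The snippet above assumes an env variable already exists. How main.py reads it isn't shown here, so the following is only a sketch of one way to parse the --env argument with argparse. The function name parse_args and the default value "local" are my assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments for the training script (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Train the CIFAR-10 model")
    parser.add_argument(
        "--env",
        choices=["local", "ecs"],  # the two environments used in this post
        default="local",           # assumed default; the container overrides it via CMD
        help="'local' downloads data over the internet, 'ecs' downloads it from S3",
    )
    return parser.parse_args(argv)


# Simulate what the container runs: python main.py --env ecs
args = parse_args(["--env", "ecs"])
env = args.env
print(f"Running in '{env}' mode")
```

Because ENTRYPOINT and CMD are separate in the Dockerfile, you can still override the argument at run time, for example with docker run IMAGE --env local.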

Setting up GitHub Actions

Now that we have a Dockerfile, we can build a Docker image locally and manually push it to the ECR repository. However, our initial goal is to fully automate the entire training process, so we want to avoid doing this manually. Instead, let's manage our training program on GitHub and push the image using GitHub Actions.

Here's a sample GitHub Actions workflow:

name: Push Docker image to Amazon ECR (manual)

on:
  workflow_dispatch:

env:
    AWS_REGION: ap-northeast-1

jobs:
  push:
    runs-on: ubuntu-latest

    permissions:
        id-token: write
        contents: read
        pull-requests: write

    steps:
    - name: Checkout
      uses: actions/checkout@v2

    - name: Get OIDC token
      uses: aws-actions/configure-aws-credentials@v1 # Use OIDC token
      with:
        role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
        aws-region: ${{ env.AWS_REGION }}

    - name: Login to Amazon ECR
      id: login-ecr
      uses: aws-actions/amazon-ecr-login@v1

    - name: Build, tag, and push image to Amazon ECR
      env:
        ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
        ECR_REPOSITORY: cifar10-mlops-repository
        # IMAGE_TAG: ${{ github.sha }}
        IMAGE_TAG: latest
      run: |
        docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
        docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG

In this example, the GitHub Actions workflow runs the docker build command to build a container image and the docker push command to push it to an ECR repository. To keep the deployment process simple, the workflow tags the image as latest. However, using the latest tag in a production environment is not recommended because it gives you no way to track which version of the image is deployed. Instead, it's better to use a unique tag based on the Git commit SHA, such as github.sha, as shown in the official GitHub example.

When changing the tag name of the Docker image, you also need to update the image specified in the ECS task definition. Therefore, in the example below, the task definition is dynamically rendered and deployed with the appropriate image tag.

- name: Build, tag, and push image to Amazon ECR
  id: build-image
  env:
    ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
    IMAGE_TAG: ${{ github.sha }}
  run: |
    # Build a docker container and
    # push it to ECR so that it can
    # be deployed to ECS.
    docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
    docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
    echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> $GITHUB_OUTPUT

- name: Fill in the new image ID in the Amazon ECS task definition
  id: task-def
  uses: aws-actions/amazon-ecs-render-task-definition@c804dfbdd57f713b6c079302a4c01db7017a36fc
  with:
    task-definition: ${{ env.ECS_TASK_DEFINITION }}
    container-name: ${{ env.CONTAINER_NAME }}
    image: ${{ steps.build-image.outputs.image }}

- name: Deploy Amazon ECS task definition
  uses: aws-actions/amazon-ecs-deploy-task-definition@df9643053eda01f169e64a0e60233aacca83799a
  with:
    task-definition: ${{ steps.task-def.outputs.task-definition }}
    service: ${{ env.ECS_SERVICE }}
    cluster: ${{ env.ECS_CLUSTER }}
    wait-for-service-stability: true

Deploying to Amazon Elastic Container Service - GitHub Docs
https://docs.github.com/en/actions/deployment/deploying-to-your-cloud-provider/deploying-to-amazon-elastic-container-service
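
Conceptually, the "render" step swaps the image URI of one container inside the task definition JSON. The sketch below illustrates that idea in plain Python. It is not the action's actual implementation, and the function name render_task_definition and the sample values are hypothetical.

```python
import copy


def render_task_definition(task_definition: dict, container_name: str, image: str) -> dict:
    """Return a copy of the task definition with one container's image replaced."""
    rendered = copy.deepcopy(task_definition)  # leave the input untouched
    for container in rendered.get("containerDefinitions", []):
        if container.get("name") == container_name:
            container["image"] = image
            return rendered
    raise ValueError(f"Container '{container_name}' not found in task definition")


# Placeholder task definition and image URIs, for illustration only
task_def = {
    "family": "example-task",
    "containerDefinitions": [
        {"name": "example-container", "image": "example-registry/example-repo:latest"}
    ],
}
new_task_def = render_task_definition(
    task_def, "example-container", "example-registry/example-repo:abc1234"
)
print(new_task_def["containerDefinitions"][0]["image"])
```

Registering the rendered definition creates a new task definition revision that points at the commit-specific image.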

Testing the whole process

Since we have all the components for the entire CI/CD process, we can now test the complete training process. When you push changes to the model repository, GitHub Actions will automatically build the image and push it to the ECR repository. If it succeeds, you'll see a message like the one below in the Actions tab.

After that, the Lambda function will be triggered by an EventBridge rule. Let's check if the Lambda function is running properly.

In the response of the Lambda function, you'll see information like the following:

{
  "tasks": [
    {
      "attachments": [
        {
          "id": "********-****-****-****-************",
          "type": "ElasticNetworkInterface",
          "status": "PRECREATED",
          "details": [
            {
              "name": "subnetId",
              "value": "subnet-************"
            }
          ]
        }
      ],
      "attributes": [
        {
          "name": "ecs.cpu-architecture",
          "value": "x86_64"
        }
      ],
      "availabilityZone": "ap-northeast-1a",
      "clusterArn": "arn:aws:ecs:ap-northeast-1:************:cluster/*************-cluster",
      "containers": [
        {
          "containerArn": "arn:aws:ecs:ap-northeast-1:************:container/*************-cluster/*************/************",
          "taskArn": "arn:aws:ecs:ap-northeast-1:************:task/*************-cluster/*************",
          "name": "*************-container",
          "image": "************.dkr.ecr.ap-northeast-1.amazonaws.com/*************-repository:latest",
          "lastStatus": "PENDING",
          "networkInterfaces": [],
          "cpu": "2048",
          "memory": "4098"
        }
      ],
      "cpu": "2048",
      "createdAt": "2024-05-26T03:40:35.437000+00:00",
      "desiredStatus": "RUNNING",
      "enableExecuteCommand": false,
      "group": "family:*************-task",
      "lastStatus": "PROVISIONING",
      "launchType": "FARGATE",
      "memory": "8192",
      "overrides": {
        "containerOverrides": [
          {
            "name": "*************-container"
          }
        ],
        "inferenceAcceleratorOverrides": []
      },
      "platformVersion": "1.4.0",
      "platformFamily": "Linux",
      "tags": [],
      "taskArn": "arn:aws:ecs:ap-northeast-1:************:task/*************-cluster/*************",
      "taskDefinitionArn": "arn:aws:ecs:ap-northeast-1:************:task-definition/*************-task:15",
      "version": 1,
      "ephemeralStorage": {
        "sizeInGiB": 20
      }
    }
  ],
  "failures": [],
  "ResponseMetadata": {
    "RequestId": "********-****-****-****-************",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amzn-requestid": "********-****-****-****-************",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "1556",
      "date": "Sun, 26 May 2024 03:40:34 GMT"
    },
    "RetryAttempts": 0
  }
}

The response is quite long, but essentially you need to check that the status code is 200 and that the "failures" array is empty. If both hold, the Lambda function was triggered by the EventBridge rule and its request to start the task was accepted.
Next, let's check whether the ECS task was actually started. If it was, you'll see the new task on the ECS console.
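
If you'd rather not read the response by eye, the two checks on the Lambda response described above can be sketched as a small function. The name run_task_succeeded is hypothetical, and the extra check that at least one task was returned is my own addition.

```python
def run_task_succeeded(response: dict) -> bool:
    """Check a RunTask-style response for HTTP 200, no failures, and at least one task."""
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    failures = response.get("failures", [])
    tasks = response.get("tasks", [])
    return status == 200 and not failures and len(tasks) > 0


# Trimmed-down version of the response shown above
sample_response = {
    "tasks": [{"lastStatus": "PROVISIONING", "desiredStatus": "RUNNING"}],
    "failures": [],
    "ResponseMetadata": {"HTTPStatusCode": 200, "RetryAttempts": 0},
}
print(run_task_succeeded(sample_response))
```

Note that a passing check only means ECS accepted the request. The task itself can still fail later, which is why the console and the logs are worth checking as well.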

Once the training has started successfully, you can follow its progress in the logs.

After the training finishes, if the trained model has been uploaded to the S3 bucket, you can confirm that the training process completed properly.

Well done! We've completed the entire workflow! 👏
There are indeed many improvements and tasks to implement if you want to work on a production-level MLOps setup, but it's perfectly fine to start from what you're familiar with now. Building upon your current knowledge and gradually expanding your skills is a solid approach to mastering MLOps.

Let's build a simple MLOps workflow on AWS! #2 - Building infrastructure on AWS

Hikaru — Thu, 23 May 2024 09:37:40 +0000

About this post

This post is a sequel to the previous one below. Please refer to the earlier post before reading this one.

Let's build a simple MLOps workflow on AWS! #1 - ML model preperation - DEV Community
https://dev.to/hikarunakatani/lets-build-a-simple-mlops-workflow-on-aws-1-ml-model-preperation-3af8

Overview

In the previous post, I showed how to implement a simple deep learning model. However, that code was intended for a local laptop environment and was purely experimental. By containerizing the application, you can ensure consistent and reproducible execution across different environments. This approach also enables the use of container orchestration tools like Kubernetes, which simplify managing, scaling, and orchestrating ML training jobs. Running machine learning tasks on a container orchestration tool is especially beneficial for training large ML models, as it allows for distributed training across multiple nodes and efficient resource utilization.
In this post, I'll explain how to run the training code as a Docker container on Amazon ECS. Additionally, I'll demonstrate how to automatically build and deploy the container when changes are made to the model.
Without further ado, let's first look at the overall architecture needed to implement this workflow!

Architecture of the system

In this system, the following workflow will be executed:

A developer pushes an ML model to the GitHub repository
The training task, including the model, is automatically built as a Docker image and pushed to the ECR repository
EventBridge detects the push in the ECR repository and invokes a Lambda function
The Lambda function invokes the ECS task
The trained ML model is automatically saved to an S3 bucket

To achieve this, we'll tackle the following tasks step-by-step:

Preparing AWS resources to automate the deployment process of the training task.
Building a CI/CD pipeline for the ML model to automatically push Docker images to the repository.
Testing that the automated deployment process works properly.

In this post, I'll only explain how to implement the first step. Regarding building AWS resources, I chose Terraform so that we can test the code experimentally.

Preparing AWS resources with Terraform

The whole system needs a number of small resources, but I'll focus on the core service settings required to implement the workflow.

EventBridge

In order to trigger the ECS task in an event-driven manner, you have to prepare the event pattern in EventBridge. I used an event pattern to detect the push event in the ECR repository. After that, you need to set the Lambda as a target of the event rule.

# EventBridge
resource "aws_cloudwatch_event_rule" "ecr_push_rule" {
  name        = "${var.project_name}-run-ecs-task"
  description = "Trigger an ECS task when an image is pushed to ECR"

  event_pattern = jsonencode({
    "source" : ["aws.ecr"],
    "detail-type" : ["ECR Image Action"],
    "detail" : {
      "repository-name" : [aws_ecr_repository.main.name],
      "action-type" : ["PUSH"],
    },
  })
}

resource "aws_cloudwatch_event_target" "ecr_push_target" {
  rule      = aws_cloudwatch_event_rule.ecr_push_rule.name
  target_id = "run-index-py-function"
  arn       = aws_lambda_function.invoke_task.arn
}
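
To get a feel for what this rule reacts to, here is a simplified sketch of event pattern matching in Python. It only handles the exact-value matching used in the pattern above, not the full EventBridge pattern syntax. The sample event follows the shape of an ECR Image Action event as I understand it, and the repository name and digest are placeholders.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern key must exist and hold one of the listed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested part of the event
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


pattern = {
    "source": ["aws.ecr"],
    "detail-type": ["ECR Image Action"],
    "detail": {"repository-name": ["example-repository"], "action-type": ["PUSH"]},
}

push_event = {
    "source": "aws.ecr",
    "detail-type": "ECR Image Action",
    "detail": {
        "result": "SUCCESS",
        "repository-name": "example-repository",
        "image-digest": "sha256:placeholder",
        "action-type": "PUSH",
        "image-tag": "latest",
    },
}

print(matches(pattern, push_event))
```

An event with a different action type or repository name would not match, so the Lambda function only runs for pushes to the training image repository.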

Lambda

We use a Lambda function to invoke the training task in ECS. The content of the Lambda function is as follows:

import json
import logging
import os
import sys
import boto3

# Setting up logging
logger = logging.getLogger()
for h in logger.handlers:
    logger.removeHandler(h)
h = logging.StreamHandler(sys.stdout)
FORMAT = "%(levelname)s [%(funcName)s] %(message)s"
h.setFormatter(logging.Formatter(FORMAT))
logger.addHandler(h)
logger.setLevel(logging.INFO)

ecs = boto3.client("ecs")


def run_ecs_task(cluster, task_definition, subnets, security_groups):
    """
    Function to run an ECS task.

    Parameters:
    cluster (str): The name of the ECS cluster.
    task_definition (str): The ARN of the task definition.
    subnets (str): Comma-separated subnet IDs for the task.
    security_groups (str): Comma-separated security group IDs for the task.

    Returns:
    None
    """
    try:
        response = ecs.run_task(
            cluster=cluster,
            taskDefinition=task_definition,
            launchType="FARGATE",
            count=1,
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": subnets.split(","),
                    "securityGroups": security_groups.split(","),
                    "assignPublicIp": "ENABLED",
                }
            },
        )
        logger.info(f"Response: {response}")
        failures = response.get("failures", [])
        if failures:
            logger.error(f"Task failures: {failures}")
    except Exception as e:
        logger.error(f"Error running ECS task: {e}")


def lambda_handler(event, context):
    """
    AWS Lambda function handler.

    Parameters:
    event (dict): The event data passed by AWS Lambda service.
    context (LambdaContext): The context data passed by AWS Lambda service.

    Returns:
    None
    """
    try:
        # Get configuration from environmental variables
        ECS_CLUSTER = os.environ["ECS_CLUSTER"]
        TASK_DEFINITION_ARN = os.environ["TASK_DEFINITION_ARN"]
        AWSVPC_CONF_SUBNETS = os.environ["AWSVPC_CONF_SUBNETS"]
        AWSVPC_CONF_SECURITY_GROUPS = os.environ["AWSVPC_CONF_SECURITY_GROUPS"]

        logger.info(f"ECS_CLUSTER: {ECS_CLUSTER}")
        logger.info(f"TASK_DEFINITION_ARN: {TASK_DEFINITION_ARN}")
        run_ecs_task(
            ECS_CLUSTER,
            TASK_DEFINITION_ARN,
            AWSVPC_CONF_SUBNETS,
            AWSVPC_CONF_SECURITY_GROUPS,
        )
    except Exception as e:
        logger.error(f"An error occurred while running ECS task: {e}")

Basically, it sends an API call to an ECS cluster to start the task using the AWS SDK (boto3). Please note that you need to specify some settings, such as the ECS cluster name, task definition ARN, VPC subnet, and security groups, to invoke the task. These settings are acquired through the environment variables embedded in the Lambda runtime.
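
Since the subnets and security groups arrive as comma-separated strings, it can help to isolate the conversion into the structure that run_task expects. The following is a sketch with hypothetical helper names. The default of "DISABLED" for assignPublicIp is my assumption for a task in a private subnet, whereas the handler above passes "ENABLED".

```python
def split_csv(value: str) -> list:
    """Split a comma-separated string, dropping surrounding whitespace and empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_network_configuration(
    subnets: str, security_groups: str, assign_public_ip: str = "DISABLED"
) -> dict:
    """Build the networkConfiguration argument for ecs.run_task (awsvpc network mode)."""
    return {
        "awsvpcConfiguration": {
            "subnets": split_csv(subnets),
            "securityGroups": split_csv(security_groups),
            "assignPublicIp": assign_public_ip,
        }
    }


# Placeholder IDs, for illustration only
config = build_network_configuration("subnet-aaa, subnet-bbb", "sg-111")
print(config)
```

Stripping whitespace makes the function tolerant of values such as "subnet-aaa, subnet-bbb", which a plain split(",") would turn into an ID with a leading space.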

To build this handler, we need to prepare the Lambda function in Terraform as shown below:

# Lambda function
resource "aws_lambda_function" "invoke_task" {
  # If the file is not in the current working directory you will need to include a
  # path.module in the filename.
  filename         = "lambda_function.zip"
  function_name    = "${var.project_name}-invoke-task"
  role             = aws_iam_role.lambda_execution_role.arn
  handler          = "invoke_task.lambda_handler"
  source_code_hash = data.archive_file.lambda.output_base64sha256
  runtime          = "python3.9"
  environment {
    variables = {
      ECS_CLUSTER                 = aws_ecs_cluster.main.name
      TASK_DEFINITION_ARN         = aws_ecs_task_definition.main.arn
      AWSVPC_CONF_SUBNETS         = "${aws_subnet.private1a.id}"
      AWSVPC_CONF_SECURITY_GROUPS = "${aws_security_group.ecs.id}"
    }
  }
}

An important point here is setting the environment variables properly so that the Lambda function gets the information it needs to run the training task. Also, keep in mind to avoid hardcoding their values, for better security and operational efficiency. For a more secure solution, I highly recommend using AWS Secrets Manager or AWS Systems Manager Parameter Store instead of environment variables.
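
As a rough idea of what the Parameter Store approach could look like inside the handler, here is a minimal sketch. The function name load_setting and the parameter name are hypothetical. The client is passed in so the function is easy to test, and in the Lambda function you would create it with boto3.client("ssm").

```python
def load_setting(ssm_client, name: str) -> str:
    """Read one value from AWS Systems Manager Parameter Store."""
    # WithDecryption only matters for SecureString parameters
    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]


class FakeSSMClient:
    """Stand-in for boto3.client("ssm") so the sketch runs without AWS access."""

    def get_parameter(self, Name, WithDecryption=False):
        return {"Parameter": {"Name": Name, "Value": "example-cluster"}}


# In Lambda: ssm = boto3.client("ssm")
cluster = load_setting(FakeSSMClient(), "/example-project/ecs-cluster")
print(cluster)
```

With this approach, the Lambda execution role also needs permission to read the parameter.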

ECS cluster

# Task Definition
resource "aws_ecs_task_definition" "main" {
  family                   = "${var.project_name}-task"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "2048" # 2 vCPU
  memory                   = "8192" # 8GB RAM
  task_role_arn            = aws_iam_role.ecs_task_role.arn
  execution_role_arn       = aws_iam_role.ecs_task_exec.arn

  container_definitions = jsonencode([
    {
      name      = "${var.project_name}-container"
      image     = "${aws_ecr_repository.main.repository_url}:latest"
      cpu       = 2048
      memory    = 4098
      essential = true
      portMappings = [
        {
          "containerPort" : 80,
          "hostPort" : 80
        }
      ],
      logConfiguration = {
        options = {
          "awslogs-create-group"  = "true"
          "awslogs-region"        = "ap-northeast-1"
          "awslogs-group"         = "${var.project_name}-log-group"
          "awslogs-stream-prefix" = "ecs"
        }
        logDriver = "awslogs"
      }
    }
  ])
}

Training ML models usually calls for a GPU, but I chose CPU because this model doesn't demand many computing resources. Also, GPUs are only supported for ECS on EC2, which requires more complex settings.

There's a bunch of resources you need to define, but I won't cover all of them here to keep this post simple. If you're interested in the complete resource settings, please refer to the repository below:

hikarunakatani/cifar10-aws: Simple MLOps workflows
https://github.com/hikarunakatani/cifar10-aws

CI/CD of Infrastructure using GitHub Actions

Because we defined the infrastructure using Terraform, we can apply CI/CD practices to it. We use GitHub Actions to build the CI/CD pipeline. The workflow definition is as follows:

# Execute terraform apply when changes are merged to main branch

name: "Terraform Apply"
on:
  push:
    branches: main
env:
  TF_VERSION: 1.6.5
  AWS_REGION: ap-northeast-1

jobs:
  terraform:
    name: terraform
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: write
      pull-requests: write
      issues: write
      statuses: write
    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - uses: aws-actions/configure-aws-credentials@v1 # Use OIDC token
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform setup
        uses: hashicorp/setup-terraform@v1
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Setup tfcmt
        env:
          TFCMT_VERSION: v3.4.1
        run: |
          wget "https://github.com/suzuki-shunsuke/tfcmt/releases/download/${TFCMT_VERSION}/tfcmt_linux_amd64.tar.gz" -O /tmp/tfcmt.tar.gz
          tar xzf /tmp/tfcmt.tar.gz -C /tmp
          mv /tmp/tfcmt /usr/local/bin
          tfcmt --version

      - name: Terraform init
        run: terraform init

      - name: Terraform fmt
        run: terraform fmt

      - name: Terraform apply
        id: apply
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        # Make apply results comment on commit 
        run: tfcmt apply -- terraform apply -auto-approve -no-color -input=false

This is an example of a workflow for the apply process. I set the trigger for this workflow to pushes to the main branch, so the terraform apply command runs when changes are merged into main.
When you want to manipulate AWS resources from GitHub, you obviously need AWS credentials. However, storing long-lived secrets in your repository poses security risks. Instead, you can use an OIDC token to obtain temporary AWS credentials. This way, you only need to store the ARN of the IAM role as a GitHub secret, which is much safer.

Once the workflows have executed properly, you can view the results in the "Actions" tab on GitHub, like this:

If you see the output saying "Apply complete!", you can confirm that your infrastructure has been successfully deployed to the AWS environment.

In the next post, I'll explain how to integrate the training code we created in the first post into the system.

Let's build a simple MLOps workflow on AWS! #1 - ML model preperation

Hikaru — Mon, 20 May 2024 09:07:50 +0000

About this post

In this series, I'll explain how I implemented a simple MLOps workflow on AWS.

Intro

Currently, I'm working as a cloud engineer, but I was studying machine learning when I was in graduate school.
Back then, ML models were trained by manually sending jobs to an on-premise server. So, whenever I changed experimental settings such as the training dataset or hyperparameters, I had to run the job manually. I wasn't even familiar with concepts like version control, containers, and CI/CD, so I had no clue how I could improve the efficiency of this training process.
However, after starting my career in my current position, I noticed that there are so many useful technologies or concepts to accelerate the development process. Nowadays, I guess it's pretty common to run ML tasks in containerized applications on container orchestration tools like Kubernetes and automate testing and model deployment processes using CI/CD tools. So I felt like connecting the dots by doing some experiments that combine these two different areas of knowledge I've learned so far.

Training task

I trained a pretty basic CNN-based image classification model in PyTorch using the CIFAR-10 dataset. To implement the model, I followed the official PyTorch tutorial below.

Training a Classifier — PyTorch Tutorials 2.3.0+cu121 documentation
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html

The CIFAR-10 dataset is a collection of 60,000 32x32 color images in 10 classes, with 6,000 images per class. It is one of the most widely used datasets for image classification tasks in the field of machine learning and computer vision. The images in this dataset look like below:

CIFAR-10 and CIFAR-100 datasets
https://www.cs.toronto.edu/~kriz/cifar.html

For the image classification model, I chose a convolutional neural network (CNN), as it's commonly used in image classification tasks. The model architecture is as follows:

import torch.nn as nn
import torch.nn.functional as F


# Define CNN
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()

        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(
            in_channels=32, out_channels=64, kernel_size=3, padding=1
        )
        self.conv3 = nn.Conv2d(
            in_channels=64, out_channels=128, kernel_size=3, padding=1
        )
        self.pool = nn.MaxPool2d(2, 2)
        self.dropout = nn.Dropout(0.5)
        self.fc1 = nn.Linear(128 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.view(-1, 128 * 4 * 4)  # flatten tensor
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        x = self.fc3(x)

        return x
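
The 128 * 4 * 4 input size of fc1 follows from the image size: each 3x3 convolution with padding=1 keeps the spatial size, and each 2x2 max pooling halves it. Here is a small sketch that checks the arithmetic. The helper names are mine, and the formulas are the standard output-size formulas for convolution and pooling without dilation.

```python
def conv_output_size(size: int, kernel: int, padding: int, stride: int = 1) -> int:
    """Spatial output size of a convolution layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size: int, kernel: int = 2, stride: int = 2) -> int:
    """Spatial output size of a max pooling layer (no padding)."""
    return (size - kernel) // stride + 1


size = 32  # CIFAR-10 images are 32x32
for _ in range(3):  # conv1/conv2/conv3, each followed by pooling
    size = conv_output_size(size, kernel=3, padding=1)  # size is unchanged
    size = pool_output_size(size)                       # size is halved

flattened = 128 * size * size  # 128 output channels from conv3
print(size, flattened)  # 4 2048
```

So the feature map shrinks from 32 to 16 to 8 to 4, and the flattened vector passed to fc1 has 2048 elements.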

Compared with a plain CNN architecture, I added dropout after the first two fully connected layers to prevent overfitting. Since the main goal of this series is automating the training process, I won't dive deep into the model itself or the accuracy of the classification task. However, we at least need to confirm that the training process works properly.
Let's add the training settings. The basic training settings are as follows:

Loss function: Cross-entropy loss function
Optimizer: SGD with momentum
Learning rate: 0.001
Momentum: 0.9
The number of training epochs: 10
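
The training code in this post passes the learning rate and momentum to optim.SGD. To show how these two values interact, here is a framework-free sketch of a single-parameter update. It follows the update rule of PyTorch's SGD as I understand it (no dampening, no Nesterov, no weight decay), and the function name is hypothetical.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update for a single scalar parameter."""
    velocity = momentum * velocity + grad  # accumulate past gradients
    param = param - lr * velocity          # step against the accumulated direction
    return param, velocity


param, velocity = 1.0, 0.0
for step in range(2):
    # A constant gradient of 2.0 is used purely for illustration
    param, velocity = sgd_momentum_step(param, grad=2.0, velocity=velocity)
    print(step + 1, param, velocity)
```

With a constant gradient, the velocity grows from 2.0 to 3.8 in the second step, so the parameter moves further per step than it would with plain SGD.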

An overview of the training process is like this:

# Define loss function and optimization method
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=lr, momentum=momentum)

    num_epochs = num_epochs

    for epoch in range(num_epochs):  # Loop for the specified number of times
        print(f"Epoch {epoch+1}/{num_epochs}")

        for phase in ["train", "val"]:
            if phase == "train":
                net.train()
            else:
                net.eval()
            epoch_loss = 0.0
            epoch_corrects = 0

            for input_data, label in tqdm(dataloaders_dict[phase]):
                input_data, label = input_data.to(device), label.to(
                    device
                )  # Enable GPU
                optimizer.zero_grad()

                # Compute gradient in training phase
                with torch.set_grad_enabled(phase == "train"):
                    predicted_label = net(input_data)
                    loss = criterion(predicted_label, label)
                    # Predicted class = index of the largest logit per sample (dim=1).
                    # CIFAR-10 labels are already class indices, so they need no argmax.
                    _, pred_index = torch.max(predicted_label, dim=1)

                    if phase == "train":
                        loss.backward()
                        optimizer.step()

                    # Update loss summary
                    epoch_loss += loss.item() * input_data.size(0)
                    # Update the number of correct predictions
                    epoch_corrects += torch.sum(pred_index == label).item()

            # show loss and accuracy for each epoch
            epoch_loss = epoch_loss / len(dataloaders_dict[phase].dataset)
            epoch_acc = epoch_corrects / len(dataloaders_dict[phase].dataset)

            print(f"{phase} Loss: {epoch_loss} Acc: {epoch_acc}")

To see the whole source, please refer to the repository below:

https://github.com/hikarunakatani/cifar10-model

Basically, I put both training and validation processes in each epoch. In each epoch, the training loss and the model accuracy are computed and shown in the standard output.
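
The per-epoch accuracy mentioned above boils down to counting the samples whose highest-scoring class equals the label. Here is a framework-free sketch of that counting with made-up numbers. The helper names are mine, and plain Python lists stand in for tensors.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def count_correct(logits, labels):
    """Count samples whose predicted class (argmax over classes) equals the label."""
    return sum(1 for row, label in zip(logits, labels) if argmax(row) == label)


# Three samples, three classes; each row holds the model outputs for one sample
logits = [
    [0.1, 2.0, -1.0],  # predicted class 1
    [1.5, 0.2, 0.3],   # predicted class 0
    [0.0, 0.1, 0.9],   # predicted class 2
]
labels = [1, 2, 2]

correct = count_correct(logits, labels)
accuracy = correct / len(labels)
print(correct, accuracy)
```

In PyTorch, the same argmax is taken over the class dimension of the output tensor, and the counts are summed over all batches before dividing by the dataset size.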

Results

Let's run the code and see how it works. I added an option to run this task on a GPU, but you can run this model on the CPU, as it doesn't require many computing resources.
In my environment, I used TensorBoard to see the logs of the training process graphically. The results of the training task are as follows:

The improvement in model accuracy is quite small, but we can at least observe a steady increase in accuracy and a decrease in loss.

Testing the model

Finally, let's prepare a random image and see which category the model classifies it into. The testing code is as follows:

import os
import torch
import torchvision.transforms as transforms
from torch.autograd import Variable
from PIL import Image
import model
from model import MODEL_PATH


def main():
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
    )
    classes = (
        "plane",
        "car",
        "bird",
        "cat",
        "deer",
        "dog",
        "frog",
        "horse",
        "ship",
        "truck",
    )

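    net = model.CNN()
    net.cpu()
    # Switch to evaluation mode so dropout is disabled at inference time;
    # without this call, predictions can vary from run to run
    net.eval()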
    if os.path.exists(MODEL_PATH):
        state_dict = torch.load(MODEL_PATH, map_location=torch.device("cpu"))
        net.load_state_dict(state_dict)
        print("Model loaded successfully.")
    else:
        print("No model checkpoint found at the specified path.")

    img = Image.open("bird.jpg")
    img = img.resize((32, 32))
    img = transform(img).float()
    img = Variable(img)
    img = img.unsqueeze(0)

    with torch.no_grad():
        outputs = net(img)

        # print(outputs)
        _, predicted = torch.max(outputs, 1)
        print(classes[predicted])


if __name__ == "__main__":
    main()

I prepared the image of a bird below to test the model.

The result of the test is as follows:

$ python test.py
Model loaded successfully.
deer

Hmm, as the training results suggested, it misrecognized the bird as a deer. I ran the script several times, and about half of the runs misclassified it as a deer.

In the next article, I'll introduce how we can run this as a containerized application on AWS.