<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: gdemarcq</title>
    <description>The latest articles on Forem by gdemarcq (@gdemarcq).</description>
    <link>https://forem.com/gdemarcq</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F956459%2F51968529-a460-40d3-8e97-70ee2bfcbba7.jpeg</url>
      <title>Forem: gdemarcq</title>
      <link>https://forem.com/gdemarcq</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gdemarcq"/>
    <language>en</language>
    <item>
      <title>Easy stable diffusion inpainting with Segment Anything Model (SAM)</title>
      <dc:creator>gdemarcq</dc:creator>
      <pubDate>Tue, 03 Oct 2023 16:37:28 +0000</pubDate>
      <link>https://forem.com/gdemarcq/easy-stable-diffusion-inpainting-with-segment-anything-model-sam-807</link>
      <guid>https://forem.com/gdemarcq/easy-stable-diffusion-inpainting-with-segment-anything-model-sam-807</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oTsr4_CH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3190/0%2A4UXWrijJCEX8CfCC.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oTsr4_CH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3190/0%2A4UXWrijJCEX8CfCC.jpg" alt="Featured image" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the Ikomia API, creating a workflow using Segment Anything Model (SAM) for segmentation followed by Stable diffusion inpainting becomes effortless, requiring only a few lines of code. To get started, you need to install the API in a virtual environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.ikomia.ai/blog/a-step-by-step-guide-to-creating-virtual-environments-in-python"&gt;How to install a virtual environment&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ikomia
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://ikomia-dev.github.io/python-api-documentation/getting_started.html"&gt;API documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Ikomia-dev/IkomiaApi"&gt;API repo&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Run SAM and stable diffusion inpainting with a few lines of code
&lt;/h2&gt;

&lt;p&gt;You can also run directly the open-source &lt;a href="https://github.com/Ikomia-dev/notebooks/blob/main/examples/HOWTO_use_SAM_and_SD_inpaint_with_Ikomia_API.ipynb"&gt;notebook&lt;/a&gt; we have prepared.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: The workflow below requires 6.1 GB of GPU RAM. By choosing the smallest SAM model (ViT-B), memory usage can be reduced to 4.9 GB of GPU RAM.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.dataprocess.workflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ik&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.utils.displayIO&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;

&lt;span class="c1"&gt;# Init your workflow
&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Add the SAM algorithm
&lt;/span&gt;&lt;span class="n"&gt;sam&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;infer_segment_anything&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'vit_l'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;input_box&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'[204.8, 221.8, 769.7, 928.5]'&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add the stable diffusion inpainting algorithm
&lt;/span&gt;&lt;span class="n"&gt;sd_inpaint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;infer_hf_stable_diffusion_inpaint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'stabilityai/stable-diffusion-2-inpainting'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'dog, high resolution'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;negative_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'low quality'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_inference_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'100'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;guidance_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'7.5'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_images_per_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'1'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run directly on your image
&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"https://raw.githubusercontent.com/Ikomia-dev/notebooks/main/examples/img/img_cat.jpg"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Inspect your result
&lt;/span&gt;&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_image_with_mask&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sd_inpaint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get_image&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mWUA8k6T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2200/1%2AB4Xu8CTOqI1id2S5PbNNug.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mWUA8k6T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2200/1%2AB4Xu8CTOqI1id2S5PbNNug.jpeg" alt="Box drawing as prompt for SAM — segmentation output — stable diffusion inpaint output. [Cat image [source](https://www.pexels.com/photo/selective-focus-photo-of-grey-cat-1521304/)]" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing SAM: The Segment Anything Model
&lt;/h2&gt;

&lt;p&gt;Image segmentation is a critical task in Computer Vision, enabling machines to understand and analyze the contents of images at a pixel level. The &lt;a href="https://github.com/facebookresearch/segment-anything"&gt;Segment Anything Model&lt;/a&gt; (SAM) is a groundbreaking instance segmentation model developed by Meta Research, which has taken the field by storm since its release in April 2023.&lt;/p&gt;

&lt;p&gt;SAM offers unparalleled versatility and efficiency in image analysis tasks, making it a powerful tool for a wide range of applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  SAM's promptable features
&lt;/h2&gt;

&lt;p&gt;SAM was specifically designed to address the limitations of existing image segmentation models and to introduce new capabilities that revolutionize the field.&lt;/p&gt;

&lt;p&gt;One of SAM's standout features is its promptable segmentation task, which allows users to generate valid segmentation masks from prompts, such as spatial clues (points and boxes) or text clues (a feature not yet released at the time of writing), that identify specific objects within an image.&lt;/p&gt;

&lt;p&gt;This flexibility empowers users to obtain precise and tailored segmentation results effortlessly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate segmentation masks for all objects SAM can detect.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2X4aSyDc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2560/0%2AykH5jS8aWo-9DqPZ.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2X4aSyDc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2560/0%2AykH5jS8aWo-9DqPZ.jpg" alt="Automated SAM segmentation (32 masks). [Original image [source]](https://www.pexels.com/photo/red-coupe-on-parking-space-590481/)" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Provide boxes to guide SAM in generating a mask for specific objects in an image.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--70R3ygus--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4486/0%2Ayk4AVeQ914MewAwr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--70R3ygus--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4486/0%2Ayk4AVeQ914MewAwr.jpg" alt="box guided" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Provide a box and a point to guide SAM in generating a mask with an area to exclude.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MvcKzg2l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4486/0%2AvYqAQNpfGIAEvmzC.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MvcKzg2l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4486/0%2AvYqAQNpfGIAEvmzC.jpg" alt="box and point guided" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key features of the Segment Anything Model (SAM)
&lt;/h2&gt;

&lt;p&gt;At the core of SAM lies its advanced architecture, which comprises three key components: an image encoder, a prompt encoder, and a lightweight mask decoder. This design enables SAM to perform real-time mask computation, adapt to new image distributions and tasks without prior knowledge, and exhibit ambiguity awareness in segmentation tasks.&lt;/p&gt;

&lt;p&gt;By leveraging these capabilities, SAM offers remarkable flexibility and adaptability, setting new standards in image segmentation models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yZOgqWtZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4586/0%2A0o9ovTJzt5nOr80g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yZOgqWtZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4586/0%2A0o9ovTJzt5nOr80g.png" alt="Segment Anything Model diagram [[Source](https://github.com/facebookresearch/segment-anything)]" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The SA-1B dataset: enabling unmatched training data scale
&lt;/h2&gt;

&lt;p&gt;A fundamental factor contributing to SAM's exceptional performance is the &lt;a href="https://ai.meta.com/datasets/segment-anything/"&gt;SA-1B dataset&lt;/a&gt;, the largest segmentation dataset to date, introduced by the Segment Anything project. With over 1 billion masks spread across 11 million carefully curated images, the SA-1B dataset provides SAM with a diverse and extensive training data source.&lt;/p&gt;

&lt;p&gt;This abundance of high-quality training data equips SAM with a comprehensive understanding of various object categories, enhancing its ability to generalize and perform accurately across different segmentation tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Zero-shot transfer: adapting to new tasks without prior knowledge
&lt;/h2&gt;

&lt;p&gt;One of SAM's most impressive attributes is its zero-shot transfer capability. SAM has been trained to achieve outstanding zero-shot performance, surpassing previous fully supervised results in numerous cases.&lt;/p&gt;

&lt;p&gt;Zero-shot transfer refers to SAM's ability to adapt to new tasks and object categories without requiring explicit training or prior exposure to specific examples. This feature allows users to leverage SAM for diverse applications with minimal need for prompt engineering, making it a truly versatile and ready-to-use tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Diverse applications of SAM in image segmentation
&lt;/h2&gt;

&lt;p&gt;With its numerous applications and innovative features, SAM unlocks new possibilities in the field of image segmentation. As a zero-shot detection model, SAM can be paired with object detection models to assign labels to specific objects accurately. Additionally, SAM serves as an annotation assistant, supporting the annotation process by generating masks for objects that require manual labeling.&lt;/p&gt;

&lt;p&gt;Moreover, SAM can be used as a standalone tool for feature extraction. It allows users to extract object features or remove backgrounds from images effortlessly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Versatility in image analysis tasks
&lt;/h2&gt;

&lt;p&gt;In conclusion, the Segment Anything Model represents a significant leap forward in the field of image segmentation. With its promptable segmentation task, advanced architecture, zero-shot transfer capability, and access to the SA-1B dataset, SAM offers unparalleled versatility and performance.&lt;/p&gt;

&lt;p&gt;As the capabilities of Computer Vision continue to expand, SAM paves the way for cutting-edge applications and facilitates breakthroughs in various industries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploring stable diffusion inpainting
&lt;/h2&gt;

&lt;p&gt;Inpainting refers to the process of restoring or repairing an image by filling in missing or damaged parts. It is a valuable technique widely used in image editing and restoration, enabling the removal of flaws and unwanted objects to achieve a seamless and natural-looking final image. Inpainting finds applications in film restoration, photo editing, and digital art, among others.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding stable diffusion inpainting
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/runwayml/stable-diffusion"&gt;Stable Diffusion&lt;/a&gt; Inpainting is a specific type of inpainting technique that leverages the properties of heat diffusion to fill in missing or damaged areas of an image. It accomplishes this by applying a heat diffusion process to the surrounding pixels.&lt;/p&gt;

&lt;p&gt;During this process, values are assigned to these pixels based on their proximity to the affected area. The heat equation is then utilized to redistribute intensity values, resulting in a seamless and natural patch. The repetition of this equation ensures the complete filling of the image patch, ultimately creating a smooth and seamless result that blends harmoniously with the rest of the image.&lt;/p&gt;
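&lt;p&gt;In practice, diffusion inpainting pipelines keep the visible pixels faithful to the original by blending, at each step, the generated values inside the mask with the original image outside it. A minimal sketch of that blending idea (plain Python, with images flattened to lists of pixel values; the values here are arbitrary placeholders):&lt;/p&gt;

```python
# Toy sketch of per-step mask blending in diffusion inpainting:
# inside the mask we keep the model's generated values, outside it
# we keep the original image, so visible pixels are never altered.

def blend(generated, original, mask):
    """mask[i] == 1 means 'inpaint this pixel', 0 means 'keep original'."""
    return [g if m == 1 else o for g, o, m in zip(generated, original, mask)]

original = [0.2, 0.4, 0.6, 0.8]
generated = [0.9, 0.9, 0.9, 0.9]   # model output at some denoising step
mask = [0, 1, 1, 0]                # inpaint the two middle pixels

result = blend(generated, original, mask)
print(result)  # [0.2, 0.9, 0.9, 0.8]
```

&lt;p&gt;Repeating this blend at every denoising step is what makes the final patch line up exactly with the untouched parts of the image.&lt;/p&gt;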

&lt;h2&gt;
  
  
  Unique advantages of stable diffusion inpainting
&lt;/h2&gt;

&lt;p&gt;Stable Diffusion Inpainting sets itself apart from other inpainting techniques through its stability and smoothness. Where simpler alternatives often leave visible artifacts, it reliably produces stable, seamless patches, and it excels particularly on images with complex structures, including textures, edges, and sharp transitions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xrd5qK-x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3072/0%2AXxywtqKPuVIc22bW.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xrd5qK-x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3072/0%2AXxywtqKPuVIc22bW.jpg" alt="Replacing object with stable diffusion v1.5 [[Source](https://github.com/runwayml/stable-diffusion)]" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Applications of stable diffusion inpainting
&lt;/h2&gt;

&lt;p&gt;Stable Diffusion Inpainting finds practical applications in various fields.&lt;/p&gt;

&lt;h3&gt;
  
  
  In photography
&lt;/h3&gt;

&lt;p&gt;It proves valuable for removing unwanted objects or blemishes from images.&lt;/p&gt;

&lt;h3&gt;
  
  
  In film restoration
&lt;/h3&gt;

&lt;p&gt;It aids in repairing damaged or missing frames.&lt;/p&gt;

&lt;h3&gt;
  
  
  In medical imaging
&lt;/h3&gt;

&lt;p&gt;It helps remove artifacts from scans and enhance overall image quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  In digital art
&lt;/h3&gt;

&lt;p&gt;It can be utilized to create seamless compositions or eliminate undesired elements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d-krQhDJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2566/0%2AcBoMZKbneGFZ8qtd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d-krQhDJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2566/0%2AcBoMZKbneGFZ8qtd.jpg" alt="Stable Diffusion inpainting v2 for object removal" width="800" height="1072"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful tips for effective inpainting
&lt;/h2&gt;

&lt;p&gt;To achieve optimal inpainting results, consider the following tips:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Experiment with different inpainting techniques to find the most suitable one for your specific use case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Utilize good-quality source images to achieve accurate and efficient inpainting results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adjust the parameters of Stable Diffusion Inpainting to optimize outcomes for your particular needs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Combine Stable Diffusion Inpainting with other segmentation algorithms, such as &lt;a href="https://app.ikomia.ai/hub/algorithms/infer_yolo_v8_seg/"&gt;YOLOv8-seg&lt;/a&gt;, for enhanced results.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Stable Diffusion Inpainting stands out as an advanced and effective image processing technique for restoring or repairing missing or damaged parts of an image. Its applications include film restoration, photography, medical imaging, and digital art.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step by step segmentation and inpainting with the Ikomia API
&lt;/h2&gt;

&lt;p&gt;In this section, we will demonstrate how to use the Ikomia API to create the segmentation and diffusion inpainting workflow presented above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: import
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.dataprocess.workflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ik&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.utils.displayIO&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;‘Workflow’&lt;/strong&gt; class is the base object for creating a workflow. It provides methods for setting inputs (image, video, directory), configuring task parameters, obtaining time metrics, and retrieving specific task outputs, such as graphics, segmentation masks, and texts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;‘&lt;strong&gt;ik’&lt;/strong&gt; is an auto-completion system designed for convenient and easy access to algorithms and settings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;‘display’&lt;/strong&gt; function offers a flexible and customizable way to display images (input/output) and graphics, such as bounding boxes and segmentation masks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 2: create workflow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We initialize a workflow instance. The &lt;strong&gt;“wf”&lt;/strong&gt; object can then be used to add tasks to the workflow instance, configure their parameters, and run them on input data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: add and connect SAM
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sam&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;infer_segment_anything&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'vit_l'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;input_box&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'[204.8, 221.8, 769.7, 928.5]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;‘model_name’&lt;/strong&gt;: The SAM model can be loaded with three different encoders: ‘vit_b’, ‘vit_l’, ‘vit_h’. The encoders differ in parameter counts, with ViT-B (base) containing 91M, ViT-L (large) containing 308M, and ViT-H (huge) containing 636M parameters.&lt;/li&gt;
&lt;li&gt;ViT-H offers significant improvements over ViT-B, though the gains over ViT-L are minimal.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Based on our tests, ViT-L offers the best trade-off between speed and accuracy: ViT-H is the most accurate but also the slowest, while ViT-B is the fastest but sacrifices accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;'input_box'&lt;/strong&gt; (list): An Nx4 array of box prompts given to the model, in [XYXY] format (or [[XYXY], [XYXY]] for multiple boxes).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Additional SAM parameters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;'draw_graphic_input'&lt;/strong&gt; (Boolean): When set to True, it allows you to draw graphics (box or point) over the object you wish to segment. If set to False, SAM will automatically generate masks for the entire image.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;'points_per_side'&lt;/strong&gt; (int or None): The number of points to be sampled for mask generation when running automatic segmentation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;'input_point'&lt;/strong&gt; (list): An Nx2 array of point prompts given to the model. Each point is given as [X, Y] in pixels.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;'input_point_label'&lt;/strong&gt; (list): A length N array of labels for the point prompts. 1 indicates a foreground point and 0 indicates a background point.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
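&lt;p&gt;As with the 'input_box' string used above, these prompt parameters are passed as string-serialized lists. One convenient way to build them is with json.dumps (the coordinates below are arbitrary placeholder values, not taken from the tutorial image):&lt;/p&gt;

```python
import json

# Build SAM point-prompt parameters as string-serialized lists,
# matching the style of the 'input_box' string shown earlier.
points = [[320.0, 450.0], [510.0, 380.0]]  # [X, Y] in pixels
labels = [1, 0]  # 1 = foreground point to keep, 0 = background point to exclude

input_point = json.dumps(points)
input_point_label = json.dumps(labels)

print(input_point)        # [[320.0, 450.0], [510.0, 380.0]]
print(input_point_label)  # [1, 0]
```

&lt;p&gt;These strings can then be passed to the 'input_point' and 'input_point_label' parameters in the same way 'input_box' is passed above.&lt;/p&gt;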

&lt;h2&gt;
  
  
  Step 4: add and connect the stable diffusion inpainting algorithm
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sd_inpaint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;infer_hf_stable_diffusion_inpaint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'stabilityai/stable-diffusion-2-inpainting'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'tiger, high resolution'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;negative_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'low quality'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_inference_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'100'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;guidance_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'7.5'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_images_per_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'1'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;'prompt'&lt;/strong&gt; (str): Input prompt.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;'negative_prompt'&lt;/strong&gt; (str): The prompt not to guide the image generation. Ignored when not using guidance (i.e., ignored if &lt;code&gt;guidance_scale&lt;/code&gt; is less than &lt;code&gt;1&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;‘&lt;strong&gt;num_inference_steps’&lt;/strong&gt;: Number of denoising steps (minimum: 1; maximum: 500).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;‘&lt;strong&gt;guidance_scale’&lt;/strong&gt;: Scale for classifier-free guidance (minimum: 1; maximum: 20).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;‘&lt;strong&gt;num_images_per_prompt’&lt;/strong&gt;: Number of images to output.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
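&lt;p&gt;The guidance scale controls classifier-free guidance, where the final noise prediction extrapolates from the unconditional prediction toward the prompt-conditioned one. A toy numeric sketch of the formula (scalars stand in for the model's noise predictions, for illustration only):&lt;/p&gt;

```python
# Classifier-free guidance: extrapolate from the unconditional
# prediction toward the prompt-conditioned one. A scale of 1 keeps
# only the conditional prediction; larger scales pull harder
# toward the prompt.

def guided_prediction(uncond, cond, guidance_scale):
    return uncond + guidance_scale * (cond - uncond)

uncond, cond = 1.0, 3.0

print(guided_prediction(uncond, cond, 1.0))  # 3.0  (conditional only)
print(guided_prediction(uncond, cond, 7.5))  # 16.0 (stronger pull toward prompt)
```

&lt;p&gt;This is why a higher guidance scale makes outputs follow the prompt more closely, at the cost of diversity.&lt;/p&gt;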

&lt;h2&gt;
  
  
  Step 5: Apply your workflow to your image
&lt;/h2&gt;

&lt;p&gt;You can apply the workflow to your image using the &lt;strong&gt;‘run_on()’&lt;/strong&gt; function. In this example, we use the image path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"path/to/your/image"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 6: Display your results
&lt;/h2&gt;

&lt;p&gt;Finally, you can display your image results using the display function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_image_with_mask&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sd_inpaint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get_image&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, we display the segmentation mask produced by the Segment Anything Model, then the stable diffusion inpainting output.&lt;/p&gt;

&lt;p&gt;Here are some more stable diffusion inpainting outputs (prompts: ‘dog’, ‘fox’, ‘lioness’, ‘tiger’, ‘white cat’):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WUiMxALp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3190/0%2Aje69brqXYvP4sn7m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WUiMxALp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3190/0%2Aje69brqXYvP4sn7m.jpg" alt="A cute grey cat replaced by a dog, fox, lioness, tiger and white cat using stable diffusion inpainting" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Image Segmentation with SAM and the Ikomia ecosystem
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we have explored the process of creating a workflow for image segmentation with SAM, followed by stable diffusion inpainting.&lt;/p&gt;

&lt;p&gt;The Ikomia API simplifies the development of Computer Vision workflows and provides easy experimentation with different parameters to achieve optimal results.&lt;/p&gt;

&lt;p&gt;To learn more about the API, refer to the &lt;a href="https://ikomia-dev.github.io/python-api-documentation/getting_started.html"&gt;documentation&lt;/a&gt;. You may also check out the list of state-of-the-art algorithms on &lt;a href="https://app.ikomia.ai/hub/"&gt;Ikomia HUB&lt;/a&gt; and try out &lt;a href="https://github.com/Ikomia-dev/IkomiaStudio"&gt;Ikomia STUDIO&lt;/a&gt;, which offers a friendly UI with the same features as the API.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Complete OpenPose guide</title>
      <dc:creator>gdemarcq</dc:creator>
      <pubDate>Tue, 26 Sep 2023 12:05:14 +0000</pubDate>
      <link>https://forem.com/gdemarcq/complete-openpose-guide-3nii</link>
      <guid>https://forem.com/gdemarcq/complete-openpose-guide-3nii</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TrdUu9yf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2400/0%2Aji63EJJ7N5-NNViX.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TrdUu9yf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2400/0%2Aji63EJJ7N5-NNViX.jpg" alt="Featured image" width="800" height="687"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenPose is one of the most popular pose estimation libraries. Its 2D and 3D keypoint detection features are widely used by data science researchers all over the world.&lt;/p&gt;

&lt;p&gt;Here is an analysis of its features, application fields, cost for commercial use and alternatives. This should help you decide whether OpenPose is the right choice for your project in artificial intelligence.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is OpenPose?
&lt;/h2&gt;

&lt;p&gt;OpenPose is a real-time multi-person keypoint detection library for body, face, and hand estimation. It is capable of detecting 135 keypoints.&lt;/p&gt;

&lt;p&gt;It is a deep learning-based approach that can infer the 2D location of key body joints (such as elbows, knees, shoulders, and hips), facial landmarks (such as eyes, nose, mouth), and hand keypoints (such as fingertips, wrist, and palm) from RGB images or videos.&lt;/p&gt;

&lt;p&gt;The library was created by a group of researchers from Carnegie Mellon University and is now maintained by two of its initial creators.&lt;/p&gt;

&lt;p&gt;OpenPose is known for its robustness in multi-person pose estimation settings and is the winner of the COCO 2016 Keypoints Challenge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5bCGHb-I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2496/0%2ALZxJ38f128KBBkL8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5bCGHb-I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2496/0%2ALZxJ38f128KBBkL8.jpeg" alt="Hand pose" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How does OpenPose work?
&lt;/h2&gt;

&lt;p&gt;The initial step of the OpenPose pipeline involves extracting features from the image using the first layers of a VGG-19 network.&lt;/p&gt;

&lt;p&gt;These extracted features are then fed into two separate divisions of convolutional neural network layers. One division is responsible for predicting 18 confidence maps, each representing a specific part of the human pose skeleton.&lt;/p&gt;

&lt;p&gt;Simultaneously, the other division predicts a set of 38 Part Affinity Fields (PAFs) that indicate the level of association between different body parts. The subsequent stages are utilized to refine the predictions generated by these divisions.&lt;/p&gt;

&lt;p&gt;Confidence maps assist in constructing bipartite graphs between pairs of body parts, while the PAF values help identify and eliminate weaker connections within these bipartite graphs.&lt;/p&gt;

&lt;p&gt;By following these steps, it becomes possible to estimate and allocate human pose skeletons to each individual depicted in the image.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose Pipeline Steps
&lt;/h2&gt;

&lt;p&gt;So, in summary, OpenPose performs these tasks in sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Initially, the entire image (or video frame) is taken as input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Next, two-branch Convolutional Neural Networks (CNNs) work together to predict confidence maps, which aid in body part detection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The estimation of Part Affinity Fields (PAFs) comes next, which enables the association of different body parts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A collection of bipartite matchings is then created to link body part candidates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, these matched body parts are assembled to form complete full-body poses for all individuals present in the image.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
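&lt;p&gt;The matching in step 4 can be sketched in a few lines of plain Python. This is an illustrative toy, not OpenPose’s actual implementation: in the real pipeline the association scores come from integrating the predicted PAF vector field along the segment between two candidate keypoints, whereas here the scores and the threshold value are assumed as given inputs.&lt;/p&gt;

```python
def match_parts(scores, threshold=0.2):
    """Greedy bipartite matching of body-part candidates (toy sketch).

    scores: dict mapping (src_candidate, dst_candidate) index pairs to a
    PAF-style association score. Each candidate is used at most once;
    weak connections below the threshold are eliminated, mirroring
    step 4 of the pipeline above.
    """
    pairs, used_src, used_dst = [], set(), set()
    # Consider the strongest associations first
    for (s, d), score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if score < threshold or s in used_src or d in used_dst:
            continue  # weak or conflicting connection: discard it
        pairs.append((s, d))
        used_src.add(s)
        used_dst.add(d)
    return pairs

# Two shoulder candidates vs. two elbow candidates (fabricated scores)
print(match_parts({(0, 0): 0.9, (0, 1): 0.3, (1, 1): 0.8}))
# [(0, 0), (1, 1)]
```

&lt;p&gt;Each matched pair then becomes a limb; step 5 chains the limbs that share candidates into full-body skeletons.&lt;/p&gt;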

&lt;h2&gt;
  
  
  OpenPose features
&lt;/h2&gt;

&lt;p&gt;OpenPose allows computer science professionals across the globe to use a vast selection of features for different computer vision applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  2D real-time multi-person keypoint detection
&lt;/h3&gt;

&lt;p&gt;2D human pose estimation is one of the most appreciated tasks that the OpenPose model can perform. Here are a few frequently used estimations that can be achieved with OpenPose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;15, 18 or 25-keypoint body/foot keypoint estimation, including 6 foot keypoints. Runtime is invariant to the number of detected people.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;2x21-keypoint hand keypoint estimation. Runtime depends on the number of detected people.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;70-keypoint face keypoint estimation. Runtime depends on the number of detected people. See OpenPose Training for a runtime-invariant alternative.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w-HkXMGF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AjohVtgt5Z5ERWQm8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w-HkXMGF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AjohVtgt5Z5ERWQm8.png" alt="Body keypoints" width="800" height="1080"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3D real-time single-person keypoint detection
&lt;/h3&gt;

&lt;p&gt;3D pose estimation is another OpenPose feature that makes this a very powerful library of algorithms.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;3D triangulation from multiple single views.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Synchronization of Flir cameras handled.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compatible with Flir/Point Grey cameras.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Calibration toolbox
&lt;/h3&gt;

&lt;p&gt;Estimation of distortion, intrinsic, and extrinsic camera parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-person tracking&lt;/strong&gt; for further speedup or visual smoothing.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenPose input
&lt;/h3&gt;

&lt;p&gt;Input can be image, video, webcam, Flir/Point Grey, IP camera, and support to add your own custom input source (e.g., depth camera). This means you can estimate human movement in real time as well as analyze still images.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenPose output
&lt;/h3&gt;

&lt;p&gt;Basic image + keypoint display/saving (PNG, JPG, AVI, …), keypoint saving (JSON, XML, YML, …), keypoints as array class, and support to add your own custom output code (e.g., some fancy UI).&lt;/p&gt;

&lt;p&gt;OpenPose can output the keypoints as 2D coordinates, 3D coordinates, or heatmap values, providing flexibility for different applications.&lt;/p&gt;
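&lt;p&gt;As a small illustration, the JSON files written by OpenPose store each detected person’s keypoints under &lt;code&gt;pose_keypoints_2d&lt;/code&gt; as a flat list of (x, y, confidence) triplets, which can be regrouped with the standard library alone. The numbers below are fabricated sample data:&lt;/p&gt;

```python
import json

# Fabricated sample in the shape of OpenPose's per-frame JSON output:
# a "people" list, each entry holding a flat "pose_keypoints_2d" array
sample = json.loads("""
{"version": 1.3,
 "people": [{"pose_keypoints_2d": [120.0, 45.5, 0.92, 118.2, 80.1, 0.88]}]}
""")

for person in sample["people"]:
    flat = person["pose_keypoints_2d"]
    # Regroup the flat list into (x, y, confidence) triplets
    keypoints = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
    print(keypoints)  # [(120.0, 45.5, 0.92), (118.2, 80.1, 0.88)]
```

&lt;p&gt;A confidence of 0 conventionally marks a keypoint that was not detected, so downstream code usually filters triplets on that third value.&lt;/p&gt;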

&lt;h3&gt;
  
  
  OpenPose OS
&lt;/h3&gt;

&lt;p&gt;Ubuntu (20, 18, 16, 14), Windows (10, 8), Mac OSX, Nvidia TX2.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenPose hardware compatibility
&lt;/h3&gt;

&lt;p&gt;CUDA (Nvidia GPU), OpenCL (AMD GPU), and non-GPU (CPU-only) versions.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenPose APIs
&lt;/h3&gt;

&lt;p&gt;OpenPose has APIs in several programming languages such as Python, C++, and MATLAB, and can be integrated with other machine learning libraries and frameworks such as TensorFlow, PyTorch, and Caffe.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose applications
&lt;/h2&gt;

&lt;p&gt;Before we jump into the areas where the OpenPose human pose estimation algorithm is used, let’s first take a look at the most important tasks you can do with OpenPose.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-person pose estimation
&lt;/h3&gt;

&lt;p&gt;OpenPose can detect the poses of multiple people in the same image or video stream simultaneously, making it ideal for applications such as action recognition, gesture recognition, and human-computer interaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ryCfuBpu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2A7H5_hHbBrtiY0Qmu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ryCfuBpu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2A7H5_hHbBrtiY0Qmu.png" alt="Firemen pose" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-time performance
&lt;/h3&gt;

&lt;p&gt;OpenPose can process images and videos in real-time on modern GPUs, making it suitable for real-time applications such as sports analysis, gaming, and virtual reality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accurate keypoint detection
&lt;/h3&gt;

&lt;p&gt;OpenPose can detect key body, face, and hand keypoints with high accuracy, even in challenging scenarios such as occlusion and cluttered backgrounds.&lt;/p&gt;

&lt;p&gt;OpenPose has a wide range of applications in various fields. Here are some examples of OpenPose applications in different domains.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose in different industries
&lt;/h2&gt;

&lt;p&gt;Due to its outstanding ability to find and track human poses, OpenPose became a Computer Vision staple in many different industries.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose for sports analysis
&lt;/h2&gt;

&lt;p&gt;The OpenPose algorithm can be used in many different sports applications, such as injury prevention and gaming.&lt;/p&gt;

&lt;h3&gt;
  
  
  Human kinetics analysis
&lt;/h3&gt;

&lt;p&gt;Analyzing movements and techniques of athletes to improve their performance in sports like basketball, tennis, and golf.&lt;/p&gt;

&lt;h3&gt;
  
  
  Injury prevention
&lt;/h3&gt;

&lt;p&gt;Identifying improper posture or movement that could lead to injuries in sports like running, weightlifting, and football.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gaming
&lt;/h3&gt;

&lt;p&gt;Using motion tracking to control game characters using the player’s body movements, as seen in games like Kinect Sports and Just Dance.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose for robotics
&lt;/h2&gt;

&lt;p&gt;As you might imagine, OpenPose has multiple applications within the robotics industry.&lt;/p&gt;

&lt;h3&gt;
  
  
  Human-Robot interaction
&lt;/h3&gt;

&lt;p&gt;Developing robots that can interact with humans using natural body movements, like in personal assistance robots, factory automation, and social robots.&lt;/p&gt;

&lt;h3&gt;
  
  
  Object manipulation
&lt;/h3&gt;

&lt;p&gt;Controlling robotic arms using hand and finger movements detected by OpenPose, like in manufacturing and assembly line robots.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gesture recognition
&lt;/h3&gt;

&lt;p&gt;Detecting and recognizing human gestures, like waving, pointing, and hand signals, to control robots, like in home automation and virtual assistants.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose for healthcare
&lt;/h2&gt;

&lt;p&gt;Healthcare is another area where OpenPose can assist with many tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Physical therapy
&lt;/h3&gt;

&lt;p&gt;Monitoring patients’ movements during rehabilitation exercises and providing real-time feedback to improve their posture and technique.&lt;/p&gt;

&lt;h3&gt;
  
  
  Elderly care
&lt;/h3&gt;

&lt;p&gt;Detecting falls and monitoring the activities of elderly people in their homes using OpenPose-based cameras.&lt;/p&gt;

&lt;h3&gt;
  
  
  Surgery
&lt;/h3&gt;

&lt;p&gt;Providing surgeons with real-time feedback on the positioning and movement of their hands during surgical procedures.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose for security and surveillance
&lt;/h2&gt;

&lt;p&gt;When it comes to security and surveillance, OpenPose has many applications, from monitoring people to tracking objects and animals.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/T0xr48iULvc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Intrusion detection
&lt;/h3&gt;

&lt;p&gt;Detecting and tracking human movements in restricted areas or identifying suspicious activities in real-time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Crowd monitoring
&lt;/h3&gt;

&lt;p&gt;Analyzing crowd behavior, detecting anomalies, and providing insights for crowd management and public safety.&lt;/p&gt;

&lt;h3&gt;
  
  
  Perimeter security
&lt;/h3&gt;

&lt;p&gt;Monitoring and analyzing human presence along the perimeter of secure areas, detecting unauthorized entry attempts or potential breaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Crowd behavior analysis
&lt;/h3&gt;

&lt;p&gt;Analyzing crowd dynamics, crowd density, and movement patterns in crowded public spaces, assisting in crowd management, event planning, and emergency response.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traffic surveillance
&lt;/h3&gt;

&lt;p&gt;Tracking and analyzing pedestrian movements at intersections, crosswalks, or public transportation hubs, facilitating traffic management and improving pedestrian safety.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I5UXVybj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AaylS6wkxn8DiK69Q.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I5UXVybj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AaylS6wkxn8DiK69Q.jpeg" alt="Body pose" width="512" height="683"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose for Entertainment
&lt;/h2&gt;

&lt;p&gt;OpenPose is used by the entertainment industry for various applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Virtual reality
&lt;/h3&gt;

&lt;p&gt;Tracking body movements to provide an immersive experience in virtual reality environments, like in VR games and simulations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Animation
&lt;/h3&gt;

&lt;p&gt;Capturing the motion of actors’ bodies and facial expressions to create realistic and expressive animated characters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Film and TV
&lt;/h3&gt;

&lt;p&gt;Tracking actors’ movements during motion capture sessions and applying them to digital characters in movies and TV shows.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose for retail and e-commerce
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Virtual try-on
&lt;/h3&gt;

&lt;p&gt;Helping customers virtually try on clothes, accessories, or makeup, providing a more personalized and engaging shopping experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Customer behavior analysis
&lt;/h3&gt;

&lt;p&gt;Tracking and analyzing customers’ movements within a store, allowing retailers to optimize store layouts and product placements.&lt;/p&gt;

&lt;h2&gt;
  
  
  How much does OpenPose cost?
&lt;/h2&gt;

&lt;p&gt;OpenPose is freely available for non-commercial use and may be redistributed under these conditions.&lt;/p&gt;

&lt;p&gt;The standard license covers academic or non-profit, non-commercial research only.&lt;/p&gt;

&lt;p&gt;A non-exclusive commercial license is also available; it requires a non-refundable annual royalty of 25,000 USD.&lt;/p&gt;

&lt;p&gt;Note that the commercial license cannot be used in the field of sports.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to use OpenPose?
&lt;/h2&gt;

&lt;p&gt;The code base is open-sourced on &lt;a href="https://github.com/CMU-Perceptual-Computing-Lab/openpose"&gt;GitHub&lt;/a&gt; and is very well documented.&lt;/p&gt;

&lt;p&gt;You can read the &lt;a href="https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/doc/installation/0_index.md"&gt;official installation documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install OpenPose
&lt;/h2&gt;

&lt;p&gt;The first step is to install OpenPose on your system. OpenPose is available for various platforms, including Windows, Linux, and macOS.&lt;/p&gt;

&lt;p&gt;You can download the latest version of the OpenPose package from the official website.&lt;/p&gt;

&lt;p&gt;The package includes pre-trained models and configurations that are ready to use, but can also be further customized according to your application needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prepare the input data
&lt;/h2&gt;

&lt;p&gt;OpenPose requires input data in the form of images or video streams. The input data can be captured using a camera or loaded from a file.&lt;/p&gt;

&lt;p&gt;Preprocessing the data before inputting it into OpenPose is necessary to ensure the best performance and accuracy of the model. This can be done through resizing, cropping, and filtering.&lt;/p&gt;
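&lt;p&gt;For instance, resizing typically preserves the aspect ratio and snaps dimensions to a size the network accepts. The sketch below is illustrative: 368 is a typical OpenPose net height, and rounding to a multiple of 16 is an example constraint, not an OpenPose requirement:&lt;/p&gt;

```python
def fit_resolution(width, height, target_height=368, multiple=16):
    """Scale (width, height) so the height becomes target_height while
    keeping the aspect ratio, rounding both sides to a multiple of 16."""
    scale = target_height / height
    new_w = int(round(width * scale / multiple)) * multiple
    new_h = int(round(target_height / multiple)) * multiple
    return new_w, new_h

# A 1920x1080 frame scaled down to a 368-pixel-high network input
print(fit_resolution(1920, 1080))  # (656, 368)
```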

&lt;h2&gt;
  
  
  Configure OpenPose
&lt;/h2&gt;

&lt;p&gt;Configuring OpenPose is an essential step in optimizing the model’s performance and accuracy. OpenPose provides various configuration options that can be adjusted.&lt;/p&gt;

&lt;p&gt;The configuration options include model type, output format, resolution, and keypoint detection threshold. These options can be selected according to your application’s specific requirements to achieve the best results.&lt;/p&gt;
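&lt;p&gt;As an illustration, here is a configuration in the style of the OpenPose Python API’s parameter dictionary. The flag names (&lt;code&gt;model_folder&lt;/code&gt;, &lt;code&gt;model_pose&lt;/code&gt;, &lt;code&gt;net_resolution&lt;/code&gt;, &lt;code&gt;render_threshold&lt;/code&gt;, &lt;code&gt;write_json&lt;/code&gt;) exist in OpenPose; the values shown are example choices, not recommendations:&lt;/p&gt;

```python
# Example OpenPose parameter map (values are illustrative)
params = {
    "model_folder": "models/",     # location of the pre-trained models
    "model_pose": "BODY_25",       # model type: BODY_25, COCO, or MPI
    "net_resolution": "-1x368",    # -1 preserves the input aspect ratio
    "render_threshold": 0.05,      # keypoint rendering threshold
    "write_json": "output/",       # save detected keypoints as JSON
}
```

&lt;p&gt;Such a dictionary is passed to the wrapper before starting it; the same names are also accepted as command-line flags by the demo binary.&lt;/p&gt;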

&lt;h2&gt;
  
  
  Run OpenPose
&lt;/h2&gt;

&lt;p&gt;Once the input data is prepared and the configuration options are set, OpenPose can be run on the data. OpenPose will analyze the input data and detect the keypoints of the human body, including the position, orientation, and movement of various body parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualize the output
&lt;/h2&gt;

&lt;p&gt;The final step is to visualize the output of OpenPose. OpenPose provides various output formats, including JSON, XML, and CSV, which can be used to display the detected keypoints in real time or in post-processing analysis. The output can be visualized using various tools, such as OpenCV, Matplotlib, or Unity.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose Alternatives and Comparisons
&lt;/h2&gt;

&lt;p&gt;As powerful as OpenPose is, it’s always worth exploring alternative pose estimation algorithms to determine which is best suited for your use case.&lt;/p&gt;

&lt;p&gt;Here are a few OpenPose alternatives to consider.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose vs Mediapipe
&lt;/h2&gt;

&lt;p&gt;MediaPipe is a lightweight, cross-platform framework for mobile devices and desktops that enables real-time, high-accuracy hand, facial, and pose tracking.&lt;/p&gt;

&lt;p&gt;One of the major advantages of MediaPipe is that it is optimized for mobile devices and can run on resource-constrained devices.&lt;/p&gt;

&lt;p&gt;However, it has limited support for 3D pose estimation and requires a significant amount of preprocessing for input data.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose vs Detectron2
&lt;/h2&gt;

&lt;p&gt;Detectron2 provides pre-trained models for keypoint detection and pose estimation. It is highly customizable and supports a wide range of models, including Mask R-CNN and RetinaNet.&lt;/p&gt;

&lt;p&gt;However, it is more complex than other libraries, and its performance may be affected by hardware limitations.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose vs MMPose
&lt;/h2&gt;

&lt;p&gt;MMPose is a high-accuracy pose estimation framework that includes support for multi-person, 3D, and hand pose estimation. It also includes a variety of pre-trained models and data augmentation techniques for improved performance.&lt;/p&gt;

&lt;p&gt;However, it may require more computational resources than some of the other algorithms, and it is currently only available in PyTorch.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose vs Lightweight-human-pose-estimation.pytorch
&lt;/h2&gt;

&lt;p&gt;Lightweight-human-pose-estimation.pytorch is a PyTorch-based pose estimation algorithm designed to be lightweight and fast. It uses a human pose estimation model optimized for devices with limited computational resources, such as mobile devices and Raspberry Pi boards.&lt;/p&gt;

&lt;p&gt;It can achieve real-time performance, making it suitable for applications such as human-computer interaction and sports analysis.&lt;/p&gt;

&lt;p&gt;However, its accuracy may be lower than some of the more complex algorithms.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose vs Freemocap
&lt;/h2&gt;

&lt;p&gt;Freemocap is an open-source, markerless motion capture system that uses computer vision techniques to estimate the 3D position of a person’s joints from a video stream. It includes support for multi-person pose estimation, as well as body and facial expression recognition.&lt;/p&gt;

&lt;p&gt;It can be used for a variety of applications, including animation, gaming, and biomechanics research.&lt;/p&gt;

&lt;p&gt;However, it may require more computational resources than some of the other algorithms, and its accuracy may be lower in challenging lighting conditions or with occlusions.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose vs AlphaPose
&lt;/h2&gt;

&lt;p&gt;AlphaPose offers faster performance than OpenPose and can detect multiple people in a single image or video stream.&lt;/p&gt;

&lt;p&gt;However, it may have lower accuracy for small or occluded body parts due to its reliance on bottom-up detection and clustering.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose vs DeeperCut
&lt;/h2&gt;

&lt;p&gt;DeeperCut offers higher accuracy than OpenPose, making it a good choice for fine-grained pose estimation and occluded body parts.&lt;/p&gt;

&lt;p&gt;However, it is slower than OpenPose due to its reliance on graphical models and requires careful tuning of its hyperparameters.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose vs HRNet
&lt;/h2&gt;

&lt;p&gt;HRNet boasts state-of-the-art accuracy and fast inference time, making it well-suited for real-time pose estimation and multi-person scenarios.&lt;/p&gt;

&lt;p&gt;However, it requires more computational resources than OpenPose due to its use of a deeper network architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose vs EfficientPose
&lt;/h2&gt;

&lt;p&gt;EfficientPose offers efficient inference time and improved accuracy compared to other lightweight models, making it ideal for mobile and embedded applications.&lt;/p&gt;

&lt;p&gt;However, it may not be as accurate as some of the more complex algorithms due to its lightweight nature.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenPose vs DensePose
&lt;/h2&gt;

&lt;p&gt;DensePose can handle more complex poses and motions and estimate detailed body part textures, making it a good choice for fashion and retail applications, virtual try-ons, and gaming and animation.&lt;/p&gt;

&lt;p&gt;However, it requires higher quality input images and is only available for non-commercial use due to licensing restrictions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compare OpenPose to Other Human Pose Estimation Algorithms
&lt;/h2&gt;

&lt;p&gt;Here is a table with these OpenPose alternatives:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jPX9JIzK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Ad8wMjk490zByQ5JHChXXUA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jPX9JIzK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2Ad8wMjk490zByQ5JHChXXUA.png" alt="Comparison table" width="800" height="1186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: The license type and cost may vary depending on the specific use case and the terms of the license agreement. Please refer to the individual project websites for more information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best alternatives to OpenPose for commercial use
&lt;/h2&gt;

&lt;p&gt;If you are planning to create a solution for commercial use requiring multi-person keypoint detection, the Ikomia team advises choosing either Detectron2 or MMPose.&lt;/p&gt;

&lt;p&gt;Both of these alternatives are freely available for commercial use under the Apache 2.0 license and are actively maintained by a strong community. You can also discover these resources within the &lt;a href="https://app.ikomia.ai/hub/?task=KEYPOINTS_DETECTION"&gt;Ikomia HUB&lt;/a&gt; and leverage them through either the open-source &lt;a href="https://github.com/Ikomia-dev/IkomiaApi"&gt;Ikomia API&lt;/a&gt; or &lt;a href="https://github.com/Ikomia-dev/IkomiaStudio"&gt;Ikomia STUDIO&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>🚀 Exciting News: Beta Launch of a SaaS for Computer Vision Projects! 🚀</title>
      <dc:creator>gdemarcq</dc:creator>
      <pubDate>Wed, 20 Sep 2023 15:44:14 +0000</pubDate>
      <link>https://forem.com/gdemarcq/exciting-news-beta-launch-of-a-saas-for-computer-vision-projects-c3i</link>
      <guid>https://forem.com/gdemarcq/exciting-news-beta-launch-of-a-saas-for-computer-vision-projects-c3i</guid>
      <description>&lt;p&gt;Hey there, AI developers and data scientists! The Ikomia team is thrilled to announce the Beta launch of our new SaaS platform for deploying Computer Vision projects. If you're into AI and Computer Vision, this might pique your interest.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://github.com/Ikomia-dev/IkomiaApi"&gt;Python API&lt;/a&gt;: First, a quick intro for those who aren't familiar with us. We already have a powerful Python API and a desktop app that lets you prototype workflows across multiple frameworks at a lightning-fast 5x pace. This includes popular frameworks like HuggingFace🤗, OpenMMlab, YOLO, and more.&lt;/p&gt;

&lt;p&gt;🆓 What's in it for you? The Beta we are opening up allows you to deploy endpoint APIs for free while contributing to the platform's improvement. 💝&lt;/p&gt;

&lt;p&gt;In exchange for your valuable feedback, here's what we offer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;2 months of free serverless CPU deployment, letting you process up to 1000 images per month.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Access to a collaborative workspace where you can share and collaborate on up to 10 projects.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A fantastic 30% lifetime discount on all your future SCALE deployments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✏️ Ready to get started? You can register here: &lt;a href="https://www.ikomia.ai/plans#Beta"&gt;Ikomia Beta Registration&lt;/a&gt;. If you have any questions or need more information, feel free to contact me.&lt;/p&gt;

&lt;p&gt;🌟 Don't wait too long! Please note that this Beta program is limited to 100 users, so grab your spot while you can. Let's shape the future of Computer Vision together! 🌟&lt;/p&gt;

</description>
      <category>computervision</category>
      <category>saas</category>
      <category>ai</category>
      <category>startup</category>
    </item>
    <item>
      <title>How to train a classification model on a custom dataset</title>
      <dc:creator>gdemarcq</dc:creator>
      <pubDate>Tue, 19 Sep 2023 09:47:49 +0000</pubDate>
      <link>https://forem.com/gdemarcq/how-to-train-a-classification-model-on-a-custom-dataset-21ba</link>
      <guid>https://forem.com/gdemarcq/how-to-train-a-classification-model-on-a-custom-dataset-21ba</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qzE1csDu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4586/0%2AdS6Im1Ewpt6GXWTg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qzE1csDu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4586/0%2AdS6Im1Ewpt6GXWTg.jpg" alt="Featured image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this blog post, we will cover the necessary steps to train a custom image classification model and test it on images.&lt;/p&gt;

&lt;p&gt;The Ikomia API simplifies the development of Computer Vision workflows and provides an easy way to experiment with different parameters to achieve optimal results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get started with Ikomia API
&lt;/h2&gt;

&lt;p&gt;You can train a custom classification model with just a few lines of code. To begin, you will need to install the API within a virtual environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.ikomia.ai/blog/a-step-by-step-guide-to-creating-virtual-environments-in-python"&gt;How to install a virtual environment&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ikomia
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://ikomia-dev.github.io/python-api-documentation/getting_started.html"&gt;API documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Ikomia-dev/IkomiaApi"&gt;API repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this tutorial, we will use the &lt;a href="https://universe.roboflow.com/ds/48QIfZoFjO?key=NTZLzFA0Q2"&gt;Rock, Paper, Scissors dataset from Roboflow&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Ensure that the dataset is organized in the correct format, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Note: The “validation” folder should be renamed to “val”.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jdh0RsTq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AytoXxXZtouFjcBTY.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jdh0RsTq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AytoXxXZtouFjcBTY.jpg" alt="Folder tree" width="357" height="459"&gt;&lt;/a&gt;&lt;/p&gt;
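&lt;p&gt;The rename noted above can be scripted with the standard library; &lt;code&gt;dataset_root&lt;/code&gt; here is a placeholder for wherever you extracted the dataset:&lt;/p&gt;

```python
import os

def prepare_splits(dataset_root):
    """Rename Roboflow's 'validation' folder to the expected 'val' name."""
    src = os.path.join(dataset_root, "validation")
    dst = os.path.join(dataset_root, "val")
    # Only rename when the source exists and the target name is still free
    if os.path.isdir(src) and not os.path.exists(dst):
        os.rename(src, dst)
```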

&lt;h2&gt;
  
  
  Run the train ResNet algorithm
&lt;/h2&gt;

&lt;p&gt;You can also load the open-source &lt;a href="https://github.com/Ikomia-dev/notebooks/blob/main/examples/HOWTO_train_Classification_Model_with_Ikomia_API.ipynb"&gt;notebook&lt;/a&gt; we have prepared directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.dataprocess.workflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ik&lt;/span&gt;

&lt;span class="c1"&gt;# Init your workflow
&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Add the training task to the workflow
&lt;/span&gt;&lt;span class="n"&gt;resnet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train_torchvision_resnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"resnet34"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"16"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_folder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Path/To/Output/Folder"&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set the input path of your dataset
&lt;/span&gt;&lt;span class="n"&gt;dataset_folder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Path/To/Rock Paper Scissors.v1-hugggingface.folder"&lt;/span&gt;
&lt;span class="c1"&gt;# Launch your training on your data
&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset_folder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After 5 epochs of training, you will see the following metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;train Loss: 0.3751, Acc: 0.8468&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;val Loss: 0.5611, Acc: 0.7231&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;val per-class Acc: tensor([0.75806, 1.00000, 0.41129])&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Training complete in 1m 57s&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Best accuracy: 0.838710&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Image Classification
&lt;/h2&gt;

&lt;p&gt;Before experimenting with TorchVision ResNet, let’s dive deeper into image classification and the characteristics of this particular algorithm.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Image Classification?
&lt;/h2&gt;

&lt;p&gt;Image classification is a fundamental task in Computer Vision that involves categorizing images into predefined classes based on their visual content. It enables computers to recognize objects, scenes, and patterns within images. The importance of image classification lies in its various applications:&lt;/p&gt;

&lt;h3&gt;
  
  
  Object Recognition
&lt;/h3&gt;

&lt;p&gt;It allows computers to identify and categorize objects in images, essential for applications like autonomous vehicles and surveillance systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Image Understanding
&lt;/h3&gt;

&lt;p&gt;Classification helps machines interpret image content and extract meaningful information, enabling advanced analysis and decision-making based on visual data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visual Search and Retrieval
&lt;/h3&gt;

&lt;p&gt;By assigning tags or labels to images, classification models facilitate efficient searching and retrieval of specific images from large databases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Content Filtering and Moderation
&lt;/h3&gt;

&lt;p&gt;Image classification aids in automatically detecting and flagging inappropriate or offensive content, ensuring safer online environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Medical Imaging and Diagnosis
&lt;/h3&gt;

&lt;p&gt;Classification assists in diagnosing diseases and analyzing medical images, enabling faster and more accurate diagnoses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quality Control and Inspection
&lt;/h3&gt;

&lt;p&gt;By classifying images, defects or anomalies in manufactured products can be identified, ensuring quality control in various industries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visual Recommendation Systems
&lt;/h3&gt;

&lt;p&gt;Image classification enhances recommendation systems by analyzing visual content and suggesting related items or content.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security and Surveillance
&lt;/h3&gt;

&lt;p&gt;Classification enables the identification of objects or individuals of interest in security and surveillance applications, enhancing threat detection and public safety.&lt;/p&gt;

&lt;p&gt;In summary, image classification is essential for object recognition, image understanding, search and retrieval, content moderation, medical imaging, quality control, recommendation systems, and security applications in computer vision.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is TorchVision ResNet?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A DCNN architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://pytorch.org/vision/stable/index.html"&gt;TorchVision&lt;/a&gt; is a popular Computer Vision library in PyTorch that provides pre-trained models and tools for working with image data. One of the widely used models in TorchVision is &lt;a href="https://pytorch.org/vision/main/models/resnet.html"&gt;ResNet&lt;/a&gt;. ResNet, short for Residual Network, is a deep convolutional neural network architecture introduced by &lt;a href="https://arxiv.org/abs/1512.03385"&gt;Kaiming He et al. in 2015&lt;/a&gt;. It was designed to address the challenge of training deep neural networks by introducing a residual learning framework.&lt;/p&gt;

&lt;h3&gt;
  
  
  Residual blocks to train deeper networks
&lt;/h3&gt;

&lt;p&gt;ResNet uses residual blocks with skip connections to facilitate information flow between layers, mitigating the &lt;a href="https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484"&gt;vanishing gradient&lt;/a&gt; problem and enabling the training of deeper networks.&lt;/p&gt;

&lt;p&gt;The key idea behind ResNet is the use of &lt;a href="https://paperswithcode.com/method/residual-block"&gt;residual blocks&lt;/a&gt;, which allow the network to learn residual mappings. These residual blocks contain &lt;strong&gt;skip connections&lt;/strong&gt; that bypass one or more layers, enabling the flow of information from earlier layers to later layers.&lt;/p&gt;

&lt;p&gt;This helps alleviate the vanishing gradient problem and facilitates the training of deeper networks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zl4-a5S8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AtkpvhjyQGWqXQOfU.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zl4-a5S8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AtkpvhjyQGWqXQOfU.png" alt="Skip connections" width="385" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The residual connection creates a shortcut path by adding the value at the beginning of the block, x, directly to the end of the block (F(x) + x) [&lt;a href="https://paperswithcode.com/method/residual-connection"&gt;Source&lt;/a&gt;].&lt;/p&gt;

&lt;p&gt;This allows information to pass through multiple layers without degradation, making training and optimization easier.‍&lt;/p&gt;
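&lt;p&gt;The skip-connection idea above can be sketched in a few lines of NumPy. This is a deliberately simplified toy (plain matrix products instead of the convolutions and batch normalization torchvision actually uses), just to show how the input x is added back to the learned residual F(x):&lt;/p&gt;

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy residual block: output = ReLU(F(x) + x)."""
    fx = relu(x @ w1) @ w2  # F(x): the learned residual mapping
    return relu(fx + x)     # skip connection adds the input back unchanged

# If the learned mapping collapses to zero (w2 all zeros), the block
# reduces to the identity (up to the final ReLU), so stacking many
# blocks cannot degrade the signal - the key to training deep networks.
x = np.array([1.0, -2.0, 3.0])
out = residual_block(x, np.eye(3), np.zeros((3, 3)))
print(out)  # [1. 0. 3.] - same as relu(x)
```

&lt;p&gt;Seen this way, the vanishing-gradient argument becomes concrete: the addition gives the gradient a direct identity path from the output back to x, regardless of what F learns.&lt;/p&gt;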

&lt;p&gt;The Microsoft Research team won the ImageNet 2015 competition using these deep residual layers with skip connections. They used the ResNet-152 convolutional neural network architecture, comprising a total of 152 layers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZSez2Dlp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2754/0%2Ayxf0R2B_O1dGgCFJ.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZSez2Dlp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2754/0%2Ayxf0R2B_O1dGgCFJ.jpg" alt="ResNet34 architecture" width="800" height="114"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ResNet34 Architecture [&lt;a href="https://arxiv.org/pdf/1512.03385.pdf"&gt;Source&lt;/a&gt;].&lt;/p&gt;

&lt;h3&gt;
  
  
  Various ResNet models
&lt;/h3&gt;

&lt;p&gt;ResNet models are available in torchvision with different depths, including ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. These pre-trained models have been trained on large-scale image classification tasks, such as the ImageNet dataset, and achieved state-of-the-art performance.&lt;/p&gt;

&lt;p&gt;By using pre-trained ResNet models from torchvision, researchers and developers can leverage the learned representations for various Computer Vision tasks, including image classification, object detection, and feature extraction.‍&lt;/p&gt;

&lt;h2&gt;
  
  
  Step by step: Train ResNet Image Classification Model using Ikomia API
&lt;/h2&gt;

&lt;p&gt;With the dataset of Rock, Paper &amp;amp; Scissors images that you have downloaded, you can easily train a custom ResNet model using the Ikomia API. Let’s go through the process together:&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: import
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.dataprocess.workflow&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Workflowfrom&lt;/span&gt; &lt;span class="n"&gt;ikomia&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;ik&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Workflow&lt;/strong&gt; is the base object to create a workflow. It provides methods for setting inputs such as images, videos, and directories, configuring task parameters, obtaining time metrics, and accessing specific task outputs such as graphics, segmentation masks, and texts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ik&lt;/strong&gt; is an auto-completion system designed for convenient and easy access to algorithms and settings.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 2: create workflow
&lt;/h2&gt;

&lt;p&gt;Initialize a workflow instance by creating a ‘&lt;strong&gt;wf&lt;/strong&gt;’ object. This object will be used to add tasks to the workflow, configure their parameters, and run them on input data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: add the torchvision ResNet algorithm and set the parameters
&lt;/h2&gt;

&lt;p&gt;Now, let’s add the train_torchvision_resnet task to train our custom image classifier. We also need to specify a few parameters for the task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;resnet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train_torchvision_resnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"resnet34"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"16"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"5"&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;model_name&lt;/strong&gt;: Name of the pre-trained model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;batch_size&lt;/strong&gt;: Number of samples processed before the model is updated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;epochs&lt;/strong&gt;: Number of complete passes through the training dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;input_size&lt;/strong&gt;: Input image size during training.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;learning_rate&lt;/strong&gt;: Step size at which the model’s parameters are updated during training.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;momentum&lt;/strong&gt;: Optimization technique that accelerates convergence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;weight_decay&lt;/strong&gt;: Regularization technique that reduces the magnitude of the model’s weights to prevent overfitting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;output_folder&lt;/strong&gt;: Path to where the trained model will be saved.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 4: set the input path of your dataset
&lt;/h2&gt;

&lt;p&gt;Next, provide the path to the dataset folder for the task input.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataset_folder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Path/To/Rock Paper Scissors.v1-raw-300x300.folder"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: run your workflow
&lt;/h2&gt;

&lt;p&gt;Finally, it’s time to run the workflow and start the training process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset_folder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Test your custom ResNet image classifier
&lt;/h2&gt;

&lt;p&gt;First, we can run a rock/paper/scissors image through the pre-trained ResNet34 model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.dataprocess.workflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.utils.displayIO&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ik&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the workflow
&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Add the image classification algorithm  
&lt;/span&gt;&lt;span class="n"&gt;resnet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;infer_torchvision_resnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"resnet34"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Run on your image
&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Path/To/Rock Paper Scissors/Dataset/test/rock/rock8_png.rf.8b06573ed8208e085c3b2e3cf06c7888.jpg"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Inspect your results
&lt;/span&gt;&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_image_with_graphics&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tbC_vAeZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AjfujAg8VsaX7G6Vp.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tbC_vAeZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AjfujAg8VsaX7G6Vp.PNG" alt="Result knee pod" width="300" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can observe that the pre-trained ResNet34 model doesn’t recognize the rock sign. This is because the model was trained on the ImageNet dataset, which does not contain images of rock/paper/scissors hand signs.&lt;/p&gt;

&lt;p&gt;To test the model we just trained, we specify the path to our custom model and class names using the &lt;strong&gt;model_weight_file&lt;/strong&gt; and &lt;strong&gt;class_file&lt;/strong&gt; parameters. We then run the workflow on the same image we used previously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add the custom ResNet model  
&lt;/span&gt;&lt;span class="n"&gt;resnet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;infer_torchvision_resnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"resnet34"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_weight_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"path/to/output_folder/timestamp/06-06-2023T14h32m40s/resnet34.pth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;class_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"path/to/output_folder/timestamp/classes.txt"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--451xx2iv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AcEffp4fCnW4W_DZz.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--451xx2iv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AcEffp4fCnW4W_DZz.PNG" alt="Result rock" width="300" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some more examples of image classification using the pre-trained (left) and our custom model (right):‍&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---IX6mDoL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4126/0%2ATOtvc0nP0TWdd0yJ.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---IX6mDoL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4126/0%2ATOtvc0nP0TWdd0yJ.jpg" alt="Result scissors" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x4KwwDSV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4126/0%2AbZcZdFrqaoYeU4i-.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x4KwwDSV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4126/0%2AbZcZdFrqaoYeU4i-.png" alt="paper" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Build your own Computer Vision workflow
&lt;/h2&gt;

&lt;p&gt;To learn more about the API, refer to the &lt;a href="https://ikomia-dev.github.io/python-api-documentation/getting_started.html"&gt;documentation&lt;/a&gt;. You may also check out the list of state-of-the-art algorithms on the &lt;a href="https://app.ikomia.ai/hub/"&gt;Ikomia HUB&lt;/a&gt; and try out &lt;a href="https://github.com/Ikomia-dev/IkomiaStudio"&gt;Ikomia STUDIO&lt;/a&gt;, which offers a friendly UI with the same features as the API.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to train YOLOv8 instance segmentation on a custom dataset</title>
      <dc:creator>gdemarcq</dc:creator>
      <pubDate>Mon, 11 Sep 2023 14:12:01 +0000</pubDate>
      <link>https://forem.com/gdemarcq/how-to-train-yolov8-instance-segmentation-on-a-custom-dataset-3ipm</link>
      <guid>https://forem.com/gdemarcq/how-to-train-yolov8-instance-segmentation-on-a-custom-dataset-3ipm</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0OjqElKI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AxA5VryZMXWJ7SKtP8oOHDg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0OjqElKI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AxA5VryZMXWJ7SKtP8oOHDg.jpeg" alt="YOLOv8 segmentation illustration" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case study, we will cover the process of fine-tuning the YOLOv8-seg pre-trained model to improve its accuracy for specific object classes. The &lt;a href="https://www.ikomia.ai/api"&gt;Ikomia API&lt;/a&gt; simplifies the development of Computer Vision workflows and allows for easy experimentation with different parameters to achieve the best results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get started with Ikomia API
&lt;/h2&gt;

&lt;p&gt;With the Ikomia API, we can train a custom YOLOv8 Instance Segmentation model with just a few lines of code. To get started, you need to install the API in a virtual environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ikomia-dev.github.io/python-api-documentation/bonus/virtual_env.html"&gt;How to install a virtual environment&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install ikomia
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://ikomia-dev.github.io/python-api-documentation/getting_started.html"&gt;API documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ikomia-dev.github.io/python-api-documentation/getting_started.html"&gt;API repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this tutorial, we will use the coral dataset from Roboflow. You can download this dataset by following this link: &lt;a href="https://universe.roboflow.com/ds/Ap7v6sRXMc?key=ecveMLIdNa"&gt;Dataset Download Link&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run the train YOLOv8 instance segmentation algorithm with a few lines of code
&lt;/h2&gt;

&lt;p&gt;You can also load the open-source &lt;a href="https://github.com/Ikomia-dev/notebooks/blob/main/examples/HOWTO_train_YOLO_v7_with_Ikomia_API_aerial_plane_dataset.ipynb"&gt;notebook&lt;/a&gt; we have prepared directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ikomia.dataprocess.workflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the workflow
&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Add the dataset loader to load your custom data and annotations
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset_coco&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set the parameters of the dataset loader
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_parameters&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;json_file&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Path/To/Mesophotic Coral/Dataset/train/_annotations.coco.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;image_folder&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Path/To/Mesophotic Coral/Dataset/train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;instance_segmentation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Add the YOLOv8 segmentation algorithm
&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train_yolo_v8_seg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set the parameters of the YOLOv8 segmentation algorithm
&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_parameters&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;yolov8m-seg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;batch_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;epochs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;50&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;640&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset_split_ratio&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output_folder&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Path/To/Folder/Where/Model-weights/Will/Be/Saved&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The training process for 50 epochs was completed in approximately one hour using an NVIDIA GeForce RTX 3060 Laptop GPU with 6143.5 MB of memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is YOLOv8 instance segmentation?
&lt;/h2&gt;

&lt;p&gt;Before going through the step-by-step approach with all parameter details, let's dive deeper into instance segmentation and YOLOv8.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is instance segmentation?
&lt;/h2&gt;

&lt;p&gt;Instance segmentation is a Computer Vision task that involves identifying and delineating individual objects within an image. Unlike semantic segmentation, which classifies each pixel into pre-defined categories, instance segmentation aims to differentiate and separate instances of objects from one another.&lt;/p&gt;

&lt;p&gt;In instance segmentation, the goal is to not only classify each pixel but also assign a unique label or identifier to each distinct object instance. This means that objects of the same class are treated as separate entities. For example, if there are multiple instances of cars in an image, instance segmentation algorithms will assign a unique label to each car, allowing for precise identification and differentiation.&lt;/p&gt;
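&lt;p&gt;The distinction can be made concrete with a toy mask. Below, SciPy's connected-component labeling stands in for an instance segmentation model (real models such as YOLOv8-seg predict instance masks directly; this only illustrates how one class label becomes several instance ids):&lt;/p&gt;

```python
import numpy as np
from scipy import ndimage

# Semantic mask: every pixel of the "car" class shares the same label (1)
semantic = np.array([
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
])

# Instance view: each connected blob gets its own id (1, 2, ...)
instances, n_objects = ndimage.label(semantic)
print(n_objects)  # 2: the single class splits into two distinct instances
print(instances)
```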

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Uq1Sr5cO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2ANRybCMtuJHh2z13k" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Uq1Sr5cO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2ANRybCMtuJHh2z13k" alt="*Comparison between Object Detection, Semantic Segmentation and Instance Segmentation.*" width="704" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instance segmentation provides more detailed and granular information about object boundaries and spatial extent compared to other segmentation techniques. It is widely used in various applications, including autonomous driving, robotics, object detection, medical imaging, and video analysis.&lt;/p&gt;

&lt;p&gt;Many modern instance segmentation algorithms, like YOLOv8-seg, employ deep learning techniques, particularly convolutional neural networks (CNNs), to perform pixel-wise classification and object localization simultaneously. These algorithms often combine the strengths of object detection and semantic segmentation to achieve accurate instance-level segmentation results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of YOLOv8
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Release and benefits
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/ultralytics/ultralytics"&gt;YOLOv8&lt;/a&gt;, developed by &lt;a href="https://docs.ultralytics.com/"&gt;Ultralytics&lt;/a&gt;, is a model that specializes in object detection, image classification, and instance segmentation tasks. It is known for its accuracy and compact model size, making it a notable addition to the YOLO series, which has seen success with &lt;a href="https://github.com/ultralytics/yolov5"&gt;YOLOv5&lt;/a&gt;. With its improved architecture and user-friendly enhancements, YOLOv8 offers a great option for Computer Vision projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JgtyqHho--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3200/0%2Aq4tR6lbtWZLXArhO" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JgtyqHho--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3200/0%2Aq4tR6lbtWZLXArhO" alt="Comparison with other real-time object detectors: YOLOv8 achieves state-of-the-art (SOTA) performance [[Source](https://github.com/ultralytics/ultralytics)]" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture and innovations
&lt;/h3&gt;

&lt;p&gt;While an official research paper for YOLOv8 is currently unavailable, an analysis of the repository and the available information provides insight into its architecture. YOLOv8 introduces anchor-free detection, which predicts object centers directly instead of relying on anchor boxes. This approach simplifies the model and improves post-processing steps like Non-Maximum Suppression.&lt;/p&gt;
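&lt;p&gt;To make the Non-Maximum Suppression step concrete, here is a minimal greedy NMS sketch (an illustration, not the Ultralytics implementation) operating on [x1, y1, x2, y2] boxes:&lt;/p&gt;

```python
def iou(a, b):
    # Intersection-over-Union of two [x1, y1, x2, y2] boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thres=0.7):
    # Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thres]
    return keep

boxes = [[0, 0, 10, 10], [0, 1, 10, 11], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the two overlapping boxes collapse to one
```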

&lt;p&gt;The architecture also incorporates new convolutions and module configurations, leaning towards a ResNet-like structure. For a detailed visualization of the network's architecture, refer to the image created by GitHub user RangeKing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NBqY_C-3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2758/0%2AbS3JqYHRO5SJFycA" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NBqY_C-3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2758/0%2AbS3JqYHRO5SJFycA" alt="YOLOv8 model structure (non-official) [[Source](https://github.com/ultralytics/ultralytics/issues/189)]" width="800" height="838"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Training routine and augmentation
&lt;/h3&gt;

&lt;p&gt;The training routine of YOLOv8 incorporates mosaic augmentation, where multiple images are combined to expose the model to variations in object locations, occlusion, and surrounding pixels. However, this augmentation is turned off during the final training epochs to prevent performance degradation.&lt;/p&gt;
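&lt;p&gt;The idea behind mosaic augmentation can be sketched in a few lines of NumPy: four source images are resized and tiled into a single training image (a simplified version with a fixed center; the real pipeline also samples a random mosaic center and remaps the annotations accordingly):&lt;/p&gt;

```python
import numpy as np

def mosaic(imgs, size=640):
    # Tile 4 images into one 2x2 mosaic (simplified: fixed center, no labels)
    half = size // 2
    out = np.zeros((size, size, 3), dtype=np.uint8)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]  # (y, x) offsets
    for img, (y, x) in zip(imgs, corners):
        # naive nearest-neighbour resize to the quadrant size
        ys = np.linspace(0, img.shape[0] - 1, half).astype(int)
        xs = np.linspace(0, img.shape[1] - 1, half).astype(int)
        out[y:y + half, x:x + half] = img[ys][:, xs]
    return out

# Four flat-colored dummy images standing in for dataset samples
imgs = [np.full((100, 120, 3), c, dtype=np.uint8) for c in (50, 100, 150, 200)]
m = mosaic(imgs)
print(m.shape)  # (640, 640, 3)
```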

&lt;h3&gt;
  
  
  Accuracy and performance
&lt;/h3&gt;

&lt;p&gt;The accuracy improvements of YOLOv8 have been validated on the widely used COCO benchmark, where the model achieves impressive mean Average Precision (mAP) scores. For instance, the YOLOv8m-seg model achieves a remarkable 49.9% mAP on COCO. The following table provides a summary of the model sizes, mAP scores, and other performance metrics for different variants of YOLOv8-seg:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hpeihz0N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AV59Hv1ACwrnNRnYSBayKbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hpeihz0N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AV59Hv1ACwrnNRnYSBayKbg.png" alt="Comparison table" width="628" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is an example of outputs using YOLOv8x detection and instance segmentation models:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7XBKPIbp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AVdp5JUHIxcfELomN" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7XBKPIbp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AVdp5JUHIxcfELomN" alt="YOLOv8x detection and instance segmentation models [[Source](https://learnopencv.com/ultralytics-yolov8/)]" width="600" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step by step: Fine tune a pre-trained YOLOv8-seg model using Ikomia API
&lt;/h2&gt;

&lt;p&gt;With the coral image dataset that you downloaded, you can train a custom YOLOv8-seg model using the Ikomia API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: import and create workflow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ikomia.dataprocess.workflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;

&lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workflow&lt;/strong&gt; is the base object to create a workflow. It provides methods for setting inputs such as images, videos, and directories, configuring task parameters, obtaining time metrics, and accessing specific task outputs such as graphics, segmentation masks, and texts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We initialize a workflow instance. The “&lt;strong&gt;wf&lt;/strong&gt;” object can then be used to add tasks to the workflow instance, configure their parameters, and run them on input data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: add the dataset loader
&lt;/h2&gt;

&lt;p&gt;The downloaded COCO dataset includes two main components: .json annotation files and image files. The images are split into train, val and test folders, each with an associated .json file containing the image annotations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Image file name&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Image size (width and height)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;List of objects with the following information: Object class (e.g., "person," "car"); Bounding box coordinates (x, y, width, height) and Segmentation mask (polygon)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
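&lt;p&gt;For reference, here is the general shape of such a .json file, written out as a Python dictionary (a hand-made minimal example, not an excerpt from the actual coral dataset):&lt;/p&gt;

```python
import json

# Minimal COCO-style annotation structure: one image, one segmented object
coco = {
    "images": [
        {"id": 1, "file_name": "reef_001.jpg", "width": 640, "height": 640}
    ],
    "categories": [
        {"id": 1, "name": "coral"}
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [120, 200, 80, 60],  # x, y, width, height
            # polygon mask as a flat [x1, y1, x2, y2, ...] list
            "segmentation": [[120, 200, 200, 200, 200, 260, 120, 260]],
            "area": 4800,
        }
    ],
}

# COCO files are plain JSON, so this round-trips losslessly
assert json.loads(json.dumps(coco)) == coco
```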

&lt;p&gt;We will use the dataset_coco module provided by Ikomia API to load the custom data and annotations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add the dataset loader to load your custom data and annotations
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset_coco&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set the parameters of the dataset loader
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_parameters&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;json_file&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Path/To/Mesophotic Coral/Dataset/train/_annotations.coco.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;image_folder&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Path/To/Mesophotic Coral/Dataset/train,
                    &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;instance_segmentation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
})
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: add the YOLOv8 segmentation model and set the parameters
&lt;/h2&gt;

&lt;p&gt;We add the &lt;strong&gt;train_yolo_v8_seg&lt;/strong&gt; task to our workflow to train our custom YOLOv8-seg model. To customize the training, we specify the following parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add the YOLOv8 segmentation algorithm
&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train_yolo_v8_seg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set the parameters of the YOLOv8 segmentation algorithm
&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_parameters&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;yolov8m-seg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;batch_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;epochs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;50&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;640&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset_split_ratio&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output_folder&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Path/To/Folder/Where/Model-weights/Will/Be/Saved&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here are the configurable parameters and their respective descriptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;batch_size&lt;/strong&gt;: Number of samples processed before the model is updated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;epochs&lt;/strong&gt;: Number of complete passes through the training dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;input_size&lt;/strong&gt;: Input image size during training and validation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;dataset_split_ratio&lt;/strong&gt;: The algorithm automatically divides the dataset into training and evaluation sets. A value of 0.8 means that 80% of the data is used for training and 20% for evaluation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You also have the option to modify the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;workers&lt;/strong&gt;: Number of worker threads for data loading. Currently set to '0'.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;optimizer&lt;/strong&gt;: The optimizer to use. Available choices include SGD, Adam, Adamax, AdamW, NAdam, RAdam, RMSProp, and auto.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;weight_decay&lt;/strong&gt;: The weight decay for the optimizer. Currently set to '5e-4'.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;momentum&lt;/strong&gt;: The SGD momentum/Adam beta1 value. Currently set to '0.937'.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;lr0&lt;/strong&gt;: Initial learning rate. For SGD, it is set to 1E-2, and for Adam, it is set to 1E-3.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;lrf&lt;/strong&gt;: Final learning rate, calculated as lr0 * lrf. Currently set to '0.01'.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
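&lt;p&gt;The relationship between &lt;strong&gt;lr0&lt;/strong&gt; and &lt;strong&gt;lrf&lt;/strong&gt; is worth making concrete: with lr0 = 0.01 and lrf = 0.01, the learning rate decays from 0.01 at the start of training toward 0.01 * 0.01 = 1e-4 at the end. A simple linear schedule illustrating this (the exact Ultralytics schedule may differ):&lt;/p&gt;

```python
def linear_lr(epoch, epochs=50, lr0=0.01, lrf=0.01):
    # Linearly interpolate from lr0 (epoch 0) down to lr0 * lrf (last epoch)
    frac = epoch / max(epochs - 1, 1)
    return lr0 * (1 - frac) + (lr0 * lrf) * frac

print(linear_lr(0))              # 0.01 at the first epoch
print(round(linear_lr(49), 6))   # 0.0001 at the last epoch
```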

&lt;h2&gt;
  
  
  Step 4: run your workflow
&lt;/h2&gt;

&lt;p&gt;Finally, we run the workflow to start the training process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can monitor the progress of your training using tools like Tensorboard or MLflow.&lt;/p&gt;

&lt;p&gt;Once the training is complete, the train_yolo_v8_seg task saves the best model in a time-stamped folder inside the output_folder. You will find your best.pt model in the weights subfolder of that time-stamped folder.&lt;/p&gt;
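&lt;p&gt;Because the run folder name is a timestamp, it can be convenient to locate the newest best.pt programmatically, e.g. with a small helper like this (a sketch; adapt the glob pattern to the exact layout you observe in your output folder):&lt;/p&gt;

```python
import glob
import os
import tempfile

def latest_best_weights(output_folder):
    # Find every best.pt under the time-stamped run folders, keep the newest
    pattern = os.path.join(output_folder, "**", "best.pt")
    candidates = glob.glob(pattern, recursive=True)
    if not candidates:
        raise FileNotFoundError("no best.pt under " + output_folder)
    return max(candidates, key=os.path.getmtime)

# Demo on a throwaway folder mimicking the training output layout
root = tempfile.mkdtemp()
weights_dir = os.path.join(root, "20231003T163728", "train", "weights")
os.makedirs(weights_dir)
open(os.path.join(weights_dir, "best.pt"), "w").close()
print(os.path.basename(latest_best_weights(root)))  # best.pt
```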

&lt;h2&gt;
  
  
  Test your fine-tuned YOLOv8-seg model
&lt;/h2&gt;

&lt;p&gt;First, let’s run a coral image through the pre-trained YOLOv8-seg model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ikomia.dataprocess.workflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ikomia.utils.displayIO&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the workflow
&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Add the YOLOv8 segmentation alrogithm
&lt;/span&gt;&lt;span class="n"&gt;yolov8seg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;infer_yolo_v8_seg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set the parameters of the YOLOv8 segmentation algorithm
&lt;/span&gt;&lt;span class="n"&gt;yolov8seg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_parameters&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;yolov8m-seg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;conf_thres&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;iou_thres&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.7&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;# Run on your image
&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Path/To/Mesophotic Coral Identification.v1i.coco-segmentation/valid/TCRMP20221021_clip_LBP_T109_jpg.rf.a4cf5c963d5eb62b6dab06b8d4b540f2.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Inspect your results
&lt;/span&gt;&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yolov8seg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_image_with_mask_and_graphics&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0IJWtKT7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2A_NZHJcaJkXsBzi6l" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0IJWtKT7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2A_NZHJcaJkXsBzi6l" alt="Coral detection using YOLOv8-seg pre-trained model" width="640" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can observe that the default pre-trained &lt;strong&gt;infer_yolo_v8_seg&lt;/strong&gt; model mistakes a coral for a bear. This is because the model has been trained on the COCO dataset, which does not contain any coral objects.&lt;/p&gt;

&lt;p&gt;To test the model we just trained, we specify the path to our custom model using the &lt;strong&gt;model_weight_file&lt;/strong&gt; parameter. We then run the workflow on the same image we used previously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Set the path of you custom YOLOv8-seg model to the parameter
&lt;/span&gt;&lt;span class="n"&gt;yolov8seg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_parameters&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model_weight_file&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Path/To/Output_folder/[timestamp]/train/weights/best.pt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;conf_thres&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;iou_thres&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.7&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2FB8LqI4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AzPf_-wkqKUhFXSy-" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2FB8LqI4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AzPf_-wkqKUhFXSy-" alt="Coral detection using custom model" width="640" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparing our results to the &lt;a href="https://universe.roboflow.com/uvi/mesophotic-coral-identification/dataset/1/images/a4cf5c963d5eb62b6dab06b8d4b540f2"&gt;ground truth&lt;/a&gt;, we successfully identified the species &lt;em&gt;Orbicella spp&lt;/em&gt;. Nevertheless, we did observe some instances of false negatives. To enhance the performance of our custom model, further training for additional epochs and augmenting our dataset with more images could be beneficial.&lt;/p&gt;

&lt;p&gt;Another example showcasing effective detection results is demonstrated with the &lt;em&gt;Agaricia agaricites&lt;/em&gt; species:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QQJzUXKO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3120/0%2AnlMPw36loUJxBa4J" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QQJzUXKO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3120/0%2AnlMPw36loUJxBa4J" alt="YOLOv8 detectin of the coral species: Agaricia agaricites" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Start training easily with Ikomia
&lt;/h2&gt;

&lt;p&gt;To learn more about the API, you can refer to the documentation. Additionally, you can explore the list of state-of-the-art algorithms on &lt;a href="https://app.ikomia.ai/hub/"&gt;Ikomia HUB&lt;/a&gt; and try out&lt;a href="https://www.ikomia.ai/studio"&gt; Ikomia STUDIO&lt;/a&gt;, which provides a user-friendly interface with the same features as the API.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>YOLOv7 real-time object detection</title>
      <dc:creator>gdemarcq</dc:creator>
      <pubDate>Mon, 04 Sep 2023 08:53:29 +0000</pubDate>
      <link>https://forem.com/gdemarcq/yolov7-real-time-object-detection-312j</link>
      <guid>https://forem.com/gdemarcq/yolov7-real-time-object-detection-312j</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XmKNnP-r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2462/0%2AtRPs4vhOtJcozj3m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XmKNnP-r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2462/0%2AtRPs4vhOtJcozj3m.jpg" alt="YOLOv7 illustration" width="800" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this blog post, we will outline the essential steps for achieving real-time object detection with your webcam.&lt;/p&gt;

&lt;p&gt;To this end, I will use the &lt;a href="https://github.com/Ikomia-dev/IkomiaApi"&gt;Ikomia API&lt;/a&gt; which enables you to utilize a ready-to-use detection model for real-time object detection in a video stream captured from your camera. To begin, you’ll need to install the API within a virtual environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ikomia-dev.github.io/python-api-documentation/bonus/virtual_env.html"&gt;How to install a virtual environment&lt;/a&gt;‍&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ikomia
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://ikomia-dev.github.io/python-api-documentation/getting_started.html"&gt;API documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ikomia-dev.github.io/python-api-documentation/getting_started.html"&gt;API repo&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running YOLOv7 algorithm on your webcam using Ikomia API
&lt;/h2&gt;

&lt;p&gt;Alternatively, you can directly access the open-source &lt;a href="https://github.com/Ikomia-dev/notebooks/blob/main/examples/HOWTO_run_Camera_Stream_Processing_with_Ikomia_API.ipynb"&gt;notebook&lt;/a&gt; that we have prepared.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.dataprocess.workflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ik&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.utils.displayIO&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;cv2&lt;/span&gt;

&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VideoCapture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Init the workflow
&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Add color conversion
&lt;/span&gt;&lt;span class="n"&gt;cvt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ocv_color_conversion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COLOR_BGR2RGB&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add YOLOv7 detection
&lt;/span&gt;&lt;span class="n"&gt;yolo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;infer_yolo_v7&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf_thres&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"0.7"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ret&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Test if streaming is OK
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;ret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="c1"&gt;# Run workflow on image
&lt;/span&gt;    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Display results from "yolo"
&lt;/span&gt;    &lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;yolo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_image_with_graphics&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Object Detection - press 'q' to quit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;viewer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"opencv"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Press 'q' to quit the streaming process
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;waitKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0xFF&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;ord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'q'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

&lt;span class="c1"&gt;# After the loop release the stream object
&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;release&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Destroy all windows
&lt;/span&gt;&lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destroyAllWindows&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/P4DJ7vFnR1c"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Camera stream processing
&lt;/h2&gt;

&lt;p&gt;Camera stream processing involves the real-time analysis and manipulation of images and video streams captured from a camera. This technique finds widespread application in diverse fields such as Computer Vision, surveillance, robotics, and entertainment.&lt;/p&gt;

&lt;p&gt;In Computer Vision, camera stream processing plays a pivotal role in tasks like object detection and recognition, face detection, motion tracking, and image segmentation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For &lt;strong&gt;surveillance&lt;/strong&gt; purposes, camera stream processing aids in detecting anomalies and events such as intrusion detection and crowd behavior analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the realm of &lt;strong&gt;robotics&lt;/strong&gt;, camera stream processing facilitates autonomous navigation, object detection, and obstacle avoidance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;entertainment&lt;/strong&gt; industry leverages camera stream processing for exciting applications like augmented reality, virtual reality, and gesture recognition.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Camera stream processing assumes a critical role across various domains, enabling the realization of numerous exciting applications that were once considered unattainable.&lt;/p&gt;

&lt;p&gt;To embark on camera stream processing, we will make use of OpenCV and VideoCapture with the YOLOv7 algorithm.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_X4c7k9s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2048/0%2A9IaJY_W-xY2GxUPv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_X4c7k9s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2048/0%2A9IaJY_W-xY2GxUPv.jpg" alt="YoloV7 detection (Original [photo](https://www.pexels.com/photo/people-in-motion-on-platform-of-train-station-4485867/source) by Gustavo Juliette)" width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step by step: camera stream processing for object detection using Ikomia API
&lt;/h2&gt;

&lt;p&gt;Here are the detailed steps followed in the first code snippet with all parameters explained.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: import dependencies
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.dataprocess.workflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ik&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.utils.displayIO&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;cv2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;‘Workflow’&lt;/strong&gt; class is the base object for creating a workflow. It provides methods for setting inputs (image, video, directory), configuring task parameters, obtaining time metrics, and retrieving specific task outputs, such as graphics, segmentation masks, and texts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;‘ik’&lt;/strong&gt; is an auto-completion system designed for convenient and easy access to algorithms and settings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;‘display’&lt;/strong&gt; function offers a flexible and customizable way to display images (input/output) and graphics, such as bounding boxes and segmentation masks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;‘cv2’&lt;/strong&gt; corresponds to the popular OpenCV library.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 2: define the video stream
&lt;/h2&gt;

&lt;p&gt;Initialize a video capture object to retrieve frames from a camera device. Use the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VideoCapture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The parameter &lt;code&gt;0&lt;/code&gt; passed to VideoCapture indicates that you want to capture video from the default camera device connected to your system. If you have multiple cameras connected, you can specify a different index to capture video from a specific camera (e.g., &lt;code&gt;1&lt;/code&gt; for the second camera), or you can provide the path to a video file.&lt;/p&gt;
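&lt;p&gt;As a minimal sketch, a small helper (hypothetical, not part of OpenCV or Ikomia) can accept either a camera index or a file path and hand the right type to VideoCapture:&lt;/p&gt;

```python
def normalize_source(source):
    """Interpret a capture source: digit strings become integer camera
    indices; anything else is treated as a video file path or URL.
    (Hypothetical helper, not part of OpenCV.)"""
    if isinstance(source, str) and source.isdigit():
        return int(source)
    return source
```

&lt;p&gt;For example, &lt;code&gt;cv2.VideoCapture(normalize_source("1"))&lt;/code&gt; opens the second camera, while &lt;code&gt;normalize_source("video.mp4")&lt;/code&gt; passes the path through unchanged.&lt;/p&gt;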

&lt;h2&gt;
  
  
  Step 3: create workflow
&lt;/h2&gt;

&lt;p&gt;We initialize a workflow instance using the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;‘wf’&lt;/strong&gt; object can then be used to add tasks to the workflow instance, configure their parameters, and run them on input data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: add the OpenCV color conversion algorithm
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cvt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ocv_color_conversion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COLOR_BGR2RGB&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, OpenCV uses the BGR color format, whereas Ikomia works with RGB images. To display the output image with the right colors, we need to swap the blue and red channels.&lt;/p&gt;
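&lt;p&gt;The conversion itself is just a channel swap. As a quick sketch with a plain NumPy array (no OpenCV required), reversing the last axis turns a BGR pixel into an RGB one:&lt;/p&gt;

```python
import numpy as np

# A 1x1 "image" holding a pure-blue pixel in BGR order: (B, G, R).
bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)

def bgr_to_rgb(image):
    """Reverse the channel axis; for 3-channel images this matches
    cv2.cvtColor with the cv2.COLOR_BGR2RGB code."""
    return image[..., ::-1]

rgb = bgr_to_rgb(bgr)  # blue now sits in the last channel: (0, 0, 255)
```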

&lt;h2&gt;
  
  
  Step 5: add the YOLOv7 Object Detection Model
&lt;/h2&gt;

&lt;p&gt;Add the &lt;strong&gt;‘infer_yolo_v7’&lt;/strong&gt; task, setting the pre-trained model and the confidence threshold parameter using the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;yolo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;infer_yolo_v7&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'yolov7'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conf_thres&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"0.7"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 6: run the workflow on the stream
&lt;/h2&gt;

&lt;p&gt;We read the frames from a video stream using a continuous loop. If there is an issue reading a frame, it skips to the next iteration.&lt;/p&gt;

&lt;p&gt;It then runs the workflow on the current frame and displays the results using OpenCV. The displayed image includes graphics generated by the "YOLO" object detection system.&lt;/p&gt;

&lt;p&gt;The displayed window allows the user to quit the streaming process by pressing the 'q' key. If the 'q' key is pressed, the loop is broken, and the streaming process ends.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ret&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Test if streaming is OK
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;ret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="c1"&gt;# Run workflow on image
&lt;/span&gt;    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Display results from "yolo"
&lt;/span&gt;    &lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;yolo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_image_with_graphics&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Object Detection - press 'q' to quit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;viewer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"opencv"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Press 'q' to quit the streaming process
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;waitKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0xFF&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;ord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'q'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
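&lt;p&gt;A note on the &lt;code&gt;cv2.waitKey(1) &amp;amp;amp; 0xFF&lt;/code&gt; idiom above: on some platforms waitKey returns a value with extra modifier bits set, so masking with 0xFF keeps only the low byte before comparing it to a key code. A quick sketch of the arithmetic (the raw value is illustrative):&lt;/p&gt;

```python
raw = 0x200000 | ord('q')  # hypothetical waitKey return with modifier bits set
key = raw & 0xFF           # keep only the low byte: the actual key code
print(key == ord('q'))     # prints True
```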



&lt;h2&gt;
  
  
  Step 7: end the video stream
&lt;/h2&gt;

&lt;p&gt;After the loop, release the stream object and destroy all windows created by OpenCV.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After the loop release the stream object
&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;release&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Destroy all windows
&lt;/span&gt;&lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destroyAllWindows&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/f-jhZ2A9RBM"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Perform real-time object detection from your own video stream
&lt;/h2&gt;

&lt;p&gt;By leveraging Ikomia API, developers can streamline the creation of Computer Vision workflows and explore various parameters to attain the best possible outcomes.&lt;/p&gt;

&lt;p&gt;For additional insights into the API, we recommend referring to the comprehensive &lt;a href="https://ikomia-dev.github.io/python-api-documentation/getting_started.html"&gt;documentation&lt;/a&gt;. Additionally, you can explore the selection of cutting-edge algorithms available on &lt;a href="https://github.com/Ikomia-hub"&gt;Ikomia HUB&lt;/a&gt; and experiment with &lt;a href="https://github.com/Ikomia-dev/IkomiaStudio"&gt;Ikomia STUDIO&lt;/a&gt;, a user-friendly interface that encompasses the same functionality as the API. Take advantage of these resources to further enhance your Computer Vision endeavors.&lt;/p&gt;

&lt;p&gt;Source of the illustration image: &lt;a href="https://www.freepik.com/free-photo/young-happy-businesswoman-wearing-headset-waving-while-video-conference-office_26346501.htm"&gt;Photo by Drazen Zigic.&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to train YOLOv7 object detection on a custom dataset</title>
      <dc:creator>gdemarcq</dc:creator>
      <pubDate>Wed, 30 Aug 2023 13:27:39 +0000</pubDate>
      <link>https://forem.com/gdemarcq/how-to-train-yolov7-object-detection-on-a-custom-dataset-g21</link>
      <guid>https://forem.com/gdemarcq/how-to-train-yolov7-object-detection-on-a-custom-dataset-g21</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wkwp0Y8f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2630/0%2AHnpwl4Ecbg7Qw5xl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wkwp0Y8f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2630/0%2AHnpwl4Ecbg7Qw5xl.jpg" alt="Featured image" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to train YOLOv7 object detection on a custom dataset using Ikomia API
&lt;/h2&gt;

&lt;p&gt;With the Ikomia API, we can train a custom YOLOv7 model with just a few lines of code. To get started, you need to install the API in a virtual environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ikomia-dev.github.io/python-api-documentation/bonus/virtual_env.html"&gt;How to install a virtual environment&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install ikomia
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://ikomia-dev.github.io/python-api-documentation/getting_started.html"&gt;API documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ikomia-dev.github.io/python-api-documentation/getting_started.html"&gt;API repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this tutorial, we will use the aerial airport dataset from Roboflow. You can download this dataset by following this link: &lt;a href="https://universe.roboflow.com/ds/W8grf6DmCF?key=QkQVA4pg66"&gt;Dataset Download Link&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run the train YOLOv7 algorithm with a few lines of code using Ikomia API
&lt;/h2&gt;

&lt;p&gt;You can also directly load the open-source &lt;a href="https://github.com/Ikomia-dev/notebooks/blob/main/examples/HOWTO_train_YOLO_v7_with_Ikomia_API_aerial_plane_dataset.ipynb"&gt;notebook&lt;/a&gt; we have prepared.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.dataprocess.workflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ik&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the workflow
&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Add the dataset loader to load your custom data and annotations
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataset_yolo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_folder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"path/to/aerial/dataset/train"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;class_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"path/to/aerial/dataset/train/_darknet.labels"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add the Yolov7 training algorithm
&lt;/span&gt;&lt;span class="n"&gt;yolo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train_yolo_v7&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"10"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_folder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"path/to/output/folder"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Launch your training on your data
&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The training process for 10 epochs completed in approximately 14 minutes on an NVIDIA GeForce RTX 3060 Laptop GPU with 6143.5 MB of memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is YOLOv7?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Makes YOLO popular for object detection?
&lt;/h3&gt;

&lt;p&gt;YOLO stands for “You Only Look Once”; it is a popular family of real-time object detection algorithms. The original YOLO object detector was first &lt;a href="https://arxiv.org/pdf/1506.02640.pdf"&gt;released &lt;/a&gt;in 2016. It was created by Joseph Redmon, Ali Farhadi, and Santosh Divvala. At release, this architecture was much faster than other object detectors and became state-of-the-art for real-time Computer Vision applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  High mean Average Precision (mAP)
&lt;/h3&gt;

&lt;p&gt;YOLO (You Only Look Once) has gained popularity in the field of object detection due to several key factors. Chief among them is its speed, making it ideal for real-time applications. Additionally, YOLO achieves a higher mean Average Precision (mAP) than other real-time systems, further enhancing its appeal.&lt;/p&gt;

&lt;h3&gt;
  
  
  High detection accuracy
&lt;/h3&gt;

&lt;p&gt;Another reason for YOLO's popularity is its high detection accuracy. It outperforms other state-of-the-art models with minimal background errors, making it reliable for object detection tasks.&lt;/p&gt;

&lt;p&gt;YOLO also demonstrates good generalization capabilities, especially in its newer versions. It exhibits better generalization for new domains, making it suitable for applications that require fast and robust object detection. For example, studies comparing different versions of YOLO have shown improvements in mean average precision for specific tasks like the &lt;a href="https://ieeexplore.ieee.org/document/8970033"&gt;automatic detection of melanoma disease&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  An open-source algorithm
&lt;/h3&gt;

&lt;p&gt;Furthermore, YOLO's open-source nature has contributed to its success. The community's continuous improvements and contributions have helped refine the model over time.&lt;/p&gt;

&lt;p&gt;YOLO's outstanding combination of speed, accuracy, generalization, and open-source nature has positioned it as the leading choice for object detection in the tech community. Its impact in the field of real-time Computer Vision cannot be overstated.&lt;/p&gt;

&lt;h2&gt;
  
  
  YOLO architecture
&lt;/h2&gt;

&lt;p&gt;The YOLO architecture shares similarities with &lt;a href="https://arxiv.org/pdf/1409.4842.pdf"&gt;GoogleNet&lt;/a&gt;, featuring convolutional layers, max-pooling layers, and fully connected layers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--toFb3X3o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2614/0%2AP2UBhFj_oTtoVup-.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--toFb3X3o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2614/0%2AP2UBhFj_oTtoVup-.png" alt="GoogLeNet CNN based YOLO network architecture ([Source](https://arxiv.org/pdf/1506.02640.pdf))." width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architecture follows a streamlined approach to object detection and works as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Starts by resizing the input image to a fixed size, typically 448x448 pixels.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This resized image is then passed through a series of convolutional layers, which extract features and capture spatial information.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The YOLO architecture employs a 1x1 convolution followed by a 3x3 convolution to reduce the number of channels and generate a cuboidal output.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Rectified Linear Unit (ReLU) activation function is used throughout the network, except for the final layer, which utilizes a linear activation function.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To improve the model's performance and prevent overfitting, techniques such as batch normalization and dropout are employed. Batch normalization normalizes the output of each layer, making the training process more stable. Dropout randomly ignores a portion of the neurons during training, which helps prevent the network from relying too heavily on specific features.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does YOLO object detection work?
&lt;/h2&gt;

&lt;p&gt;In terms of how YOLO performs object detection, it follows a four-step approach:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ytmJ9tqd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2216/0%2A3PIFIVlYFD0iWfaa.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ytmJ9tqd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2216/0%2A3PIFIVlYFD0iWfaa.jpg" alt="YOLO model ([source](https://www.ikomia.ai/blog/how-to-train-a-custom-yolov7-model-with-the-ikomia-api#))" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;First, the image is divided into grid cells (SxS) responsible for localizing and predicting the object's class and confidence values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Next, bounding box regression is used to determine the rectangles highlighting the objects in the image. The attributes of these bounding boxes are represented by a vector containing probability scores, coordinates, and dimensions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Intersection Over Unions (IoU) is then employed to select relevant grid cells based on a user-defined threshold.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, Non-Max Suppression (NMS) is applied to retain only the boxes with the highest probability scores, filtering out potential noise.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
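&lt;p&gt;Steps 3 and 4 above can be sketched in a few lines of plain Python (a simplified illustration, not the implementation YOLO uses internally):&lt;/p&gt;

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy Non-Max Suppression: keep the highest-scoring box, drop
    boxes overlapping it above the IoU threshold, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

&lt;p&gt;For example, two near-identical boxes collapse to the higher-scoring one, while a distant box survives.&lt;/p&gt;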

&lt;h2&gt;
  
  
  Overview of the YOLOv7 model
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NX_xer9z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2474/0%2AwaxMUWhVpJpF8DNV.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NX_xer9z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2474/0%2AwaxMUWhVpJpF8DNV.png" alt="Overview" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cuRO3hXR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AS9d2tHHY0egJkD_n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cuRO3hXR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AS9d2tHHY0egJkD_n.png" alt="YOLOv7 performance comparison with other real-time object detectors ([Source](https://github.com/WongKinYiu/yolov7))" width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Compared to its predecessors, YOLOv7 introduces several architectural reforms that contribute to improved performance. These include:&lt;/p&gt;

&lt;p&gt;● Architectural reform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Model scaling for concatenation-based models allows the model to meet the needs of different inference speeds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;E-ELAN (Extended Efficient Layer Aggregation Network) which allows the model to learn more diverse features for better learning.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;● Trainable Bag-of-Freebies (BoF) improving the model’s accuracy without increasing the training cost using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Planned re-parameterized convolution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Coarse-for-auxiliary and fine-for-lead loss, which assigns coarse labels to the auxiliary head and fine labels to the lead head.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;YOLOv7 introduces a notable improvement in resolution compared to its predecessors. It operates at a higher image resolution of 608 by 608 pixels, surpassing the 416 by 416 resolution employed in YOLOv3. By adopting this higher resolution, YOLOv7 becomes capable of detecting smaller objects more effectively, thereby enhancing its overall accuracy.&lt;/p&gt;
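&lt;p&gt;To see why resolution matters, compare the number of grid cells produced at each input size, assuming a typical detection stride of 32 (the stride value is an assumption for illustration, not taken from the YOLOv7 paper):&lt;/p&gt;

```python
STRIDE = 32  # assumed downsampling factor of the detection head

def grid_cells(resolution, stride=STRIDE):
    """Number of cells in the detection grid for a square input image."""
    side = resolution // stride
    return side * side

print(grid_cells(416))  # 13 x 13 = 169 cells
print(grid_cells(608))  # 19 x 19 = 361 cells
```

&lt;p&gt;More cells mean each cell covers a smaller image region, which helps localize small objects.&lt;/p&gt;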

&lt;p&gt;These enhancements result in a 13.7% higher Average Precision (AP) on the COCO dataset compared to YOLOv6.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parameters and FPS
&lt;/h2&gt;

&lt;p&gt;The YOLOv7 model has six versions with varying parameters and FPS (Frames per Second) performance. Here are the details:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y7uzzIQ0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2330/0%2A5wMqp-7mpxnC8dYl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y7uzzIQ0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2330/0%2A5wMqp-7mpxnC8dYl.png" alt="* FPS comparisons were done on Tesla V100 GPU." width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step by step: Fine tune a pre-trained YOLOv7 model using Ikomia API
&lt;/h2&gt;

&lt;p&gt;With the dataset of aerial images that you downloaded, you can train a custom YOLO v7 model using the &lt;a href="https://github.com/Ikomia-dev/IkomiaApi"&gt;Ikomia API&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: import
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.dataprocess.workflow&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Workflowfrom&lt;/span&gt; &lt;span class="n"&gt;ikomia&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;ik&lt;/span&gt;&lt;span class="err"&gt;‍&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;“Workflow”&lt;/strong&gt; is the base object to create a workflow. It provides methods for setting inputs such as images, videos, and directories, configuring task parameters, obtaining time metrics, and accessing specific task outputs such as graphics, segmentation masks, and texts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;“ik”&lt;/strong&gt; is an auto-completion system designed for convenient and easy access to algorithms and settings.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 2: create workflow
&lt;/h2&gt;

&lt;p&gt;We initialize a workflow instance. The &lt;strong&gt;“wf”&lt;/strong&gt; object can then be used to add tasks to the workflow instance, configure their parameters, and run them on input data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: add the dataset loader
&lt;/h2&gt;

&lt;p&gt;The downloaded dataset is in YOLO format, which means that for each image in each folder (test, val, train), there is a corresponding .txt file containing all bounding box and class information associated with airplanes. Additionally, there is a _darknet.labels file containing all class names. We will use the dataset_yolo module provided by Ikomia API to load the custom data and annotations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dataset_yolo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_folder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"path/to/aerial/dataset/train"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;class_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"path/to//aerial/dataset/train/_darknet.labels"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: add the YOLOv7 model and set the parameters
&lt;/h2&gt;

&lt;p&gt;We add a &lt;strong&gt;train_yolo_v7&lt;/strong&gt; task to train our custom YOLOv7 model. We also specify a few parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;yolo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train_yolo_v7&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"10"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_folder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"path/to/output/folder"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;batch_size&lt;/strong&gt;: Number of samples processed before the model is updated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;epochs&lt;/strong&gt;: Number of complete passes through the training dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;train_imgz&lt;/strong&gt;: Input image size during training.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;test_imgz&lt;/strong&gt;: Input image size during testing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;dataset_split_ratio&lt;/strong&gt;: the algorithm automatically divides the dataset into training and evaluation sets. A value of 0.9 means 90% of the data is used for training and 10% for evaluation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;auto_connect=True&lt;/strong&gt; argument ensures that the output of the &lt;strong&gt;dataset_yolo&lt;/strong&gt; task is automatically connected to the input of the &lt;strong&gt;train_yolo_v7&lt;/strong&gt; task.&lt;/p&gt;
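&lt;p&gt;To get a feel for what these values imply, here is a quick back-of-the-envelope calculation of how many weight updates the training run performs. The dataset size of 1,000 images is a made-up figure for illustration:&lt;/p&gt;

```python
import math

# Hypothetical numbers for illustration only
num_images = 1000   # total images in the dataset (assumed)
split_ratio = 0.9   # 90% train / 10% evaluation
batch_size = 4
epochs = 10

train_images = int(num_images * split_ratio)            # images used for training
steps_per_epoch = math.ceil(train_images / batch_size)  # weight updates per epoch
total_steps = steps_per_epoch * epochs                  # updates over the whole run

print(train_images, steps_per_epoch, total_steps)  # 900 225 2250
```

&lt;p&gt;A small batch size keeps GPU memory usage low at the cost of noisier gradient estimates; increase it if your GPU allows.&lt;/p&gt;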

&lt;h2&gt;
  
  
  Step 5: apply your workflow to your dataset
&lt;/h2&gt;

&lt;p&gt;Finally, we run the workflow to start the training process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can monitor the progress of your training using tools like TensorBoard or MLflow.&lt;/p&gt;

&lt;p&gt;Once training is complete, the &lt;strong&gt;train_yolo_v7&lt;/strong&gt; task saves the best model in a timestamped folder inside the output_folder; you will find the best.pt file in that folder's weights sub-folder.&lt;/p&gt;
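&lt;p&gt;Because each run is stored under a timestamped name, a small helper can pick up the latest best.pt automatically. This is a sketch, assuming the run folder names sort chronologically; find_latest_best is a hypothetical helper, not part of the Ikomia API:&lt;/p&gt;

```python
from pathlib import Path

def find_latest_best(output_folder: str) -> Path:
    """Return the best.pt of the most recent training run.

    Assumes each run lives in a timestamped sub-folder of
    output_folder whose name sorts chronologically.
    """
    runs = sorted(p for p in Path(output_folder).iterdir() if p.is_dir())
    if not runs:
        raise FileNotFoundError(f"No training runs found in {output_folder}")
    return runs[-1] / "weights" / "best.pt"
```

&lt;p&gt;You can then pass str(find_latest_best("path/to/output/folder")) as the weights path when testing the model below.&lt;/p&gt;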

&lt;h2&gt;
  
  
  Test your fine-tuned YOLOv7 model
&lt;/h2&gt;

&lt;p&gt;First, we can run the pre-trained YOLOv7 model on an aerial image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.dataprocess.workflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.utils.displayIO&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ikomia.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ik&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the workflow
&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Add an Object Detection algorithm  
&lt;/span&gt;&lt;span class="n"&gt;yolo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;infer_yolo_v7&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thr_conf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"0.4"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run on your image
&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"/path/to/aerial/dataset/test/airport_246_jpg.rf.3d892810357f48026932d5412fa81574.jpg"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Inspect your results
&lt;/span&gt;&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yolo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_image_with_graphics&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
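&lt;p&gt;The thr_conf parameter acts as a confidence filter: detections scoring below the threshold are discarded. Conceptually it behaves like the sketch below, where the detection list is made-up data, not actual API output:&lt;/p&gt;

```python
def filter_detections(detections, thr_conf=0.4):
    """Keep only detections whose confidence reaches the threshold."""
    return [d for d in detections if d["confidence"] >= thr_conf]

detections = [
    {"label": "plane", "confidence": 0.92},  # kept
    {"label": "plane", "confidence": 0.35},  # dropped: below 0.4
]
print(filter_detections(detections))  # [{'label': 'plane', 'confidence': 0.92}]
```

&lt;p&gt;Lowering thr_conf surfaces more candidate detections at the risk of more false positives; raising it does the opposite.&lt;/p&gt;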



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OGSsZdXW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2ASv1ldDSgVEGTQ-Gf.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OGSsZdXW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2ASv1ldDSgVEGTQ-Gf.jpeg" alt="Original image" width="600" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can observe that the default &lt;strong&gt;infer_yolo_v7&lt;/strong&gt; pre-trained model doesn’t detect any planes. This is because the model was trained on the COCO dataset, which contains no aerial images of airports. As a result, the model doesn’t know what an airplane looks like from above.&lt;/p&gt;

&lt;p&gt;To test the model we just trained, we specify the path to our custom model using the &lt;strong&gt;model_weight_file&lt;/strong&gt; argument. We then run the workflow on the same image we used previously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Use your custom YOLOv7 model  
&lt;/span&gt;&lt;span class="n"&gt;yolo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ik&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;infer_yolo_v7&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_weight_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"path/to/output_folder/timestamp/weights/best.pt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;thr_conf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"0.4"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
    &lt;span class="n"&gt;auto_connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sDX5WEcb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AkTNz9KQFzz8oxEr3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sDX5WEcb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AkTNz9KQFzz8oxEr3.png" alt="Planes detection on the image" width="600" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this comprehensive guide, we have walked through the process of fine-tuning a pre-trained YOLOv7 model so that it achieves higher accuracy on specific object classes.&lt;/p&gt;

&lt;p&gt;The Ikomia API serves as a game-changer, streamlining the development of Computer Vision workflows and enabling effortless experimentation with various parameters to unlock remarkable results.&lt;/p&gt;

&lt;p&gt;For a deeper understanding of the API's capabilities, we recommend referring to the &lt;a href="https://ikomia-dev.github.io/python-api-documentation/getting_started.html"&gt;documentation&lt;/a&gt;. Additionally, don't miss the opportunity to explore the impressive roster of advanced algorithms available on &lt;a href="https://app.ikomia.ai/hub/"&gt;Ikomia HUB&lt;/a&gt;, and take a spin with &lt;a href="https://github.com/Ikomia-dev/IkomiaStudio"&gt;Ikomia STUDIO&lt;/a&gt;, a user-friendly interface that mirrors the API's features.&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>datascience</category>
      <category>showdev</category>
    </item>
    <item>
      <title>A new open source tool for Computer Vision developers</title>
      <dc:creator>gdemarcq</dc:creator>
      <pubDate>Thu, 27 Oct 2022 09:59:22 +0000</pubDate>
      <link>https://forem.com/gdemarcq/a-new-open-source-tool-for-computer-vision-developers-101k</link>
      <guid>https://forem.com/gdemarcq/a-new-open-source-tool-for-computer-vision-developers-101k</guid>
      <description>&lt;p&gt;Hi all,&lt;/p&gt;

&lt;p&gt;Here is the news we’ve been eager to share with you for weeks: we just released our Open Source Python API for Computer Vision! ✨&lt;/p&gt;

&lt;p&gt;Why is this a big step for all Computer Vision developers and data scientists?&lt;/p&gt;

&lt;p&gt;Too often, the scientific and technological knowledge required represents a barrier to building solutions based on AI. And yet many fields would benefit from the numerous positive innovations it enables.&lt;/p&gt;

&lt;p&gt;That’s why we have been working on the Ikomia API these last months and can’t wait to receive your feedback 😀.&lt;/p&gt;

&lt;p&gt;“Is it for you?” you might wonder. It is if (multiple answers possible 😅):&lt;/p&gt;

&lt;p&gt;✅ you have a basic to average knowledge of Python and image analysis and need to build a Computer Vision solution,&lt;/p&gt;

&lt;p&gt;✅ you are fed up with searching endlessly for the right model,&lt;/p&gt;

&lt;p&gt;✅ you spend days testing algorithms and want to reduce this time tenfold,&lt;/p&gt;

&lt;p&gt;✅ you can’t wait to use several models together and chain them in a workflow,&lt;/p&gt;

&lt;p&gt;✅ you would like to benchmark your algorithm with others on your dataset and compare the results in real time,&lt;/p&gt;

&lt;p&gt;✅ you want to be able to switch easily between training models,&lt;/p&gt;

&lt;p&gt;✅ you are eager to finally share your work easily.&lt;/p&gt;

&lt;p&gt;Checked any box? Here is your next step:&lt;/p&gt;

&lt;p&gt;Go to our repo &lt;a href="https://github.com/Ikomia-dev/IkomiaApi"&gt;GitHub&lt;/a&gt; and &lt;code&gt;pip install ikomia&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;As simple as that! 😎&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
