Leonhard Kwahle

DBSCAN: Finding Clusters of Any Shape

Exploring how DBSCAN uses density, not distance, to find clusters of any shape, from theory and code to real-world applications like anomaly detection and geospatial mapping.

Before diving into the code or real-world applications, it’s essential to understand what DBSCAN actually is and why it stands apart from traditional clustering techniques.

What is DBSCAN?

At its core, DBSCAN (short for Density-Based Spatial Clustering of Applications with Noise) is a powerful algorithm that clusters data not based on shape or center points, but on the density of data points in a region. Let’s break down the key concepts that make DBSCAN unique.

  • Density-based clustering: Instead of grouping points based on distance from a center (like K-Means), this algorithm groups points based on how crowded a region is.

  • No need for a predefined number of clusters: Unlike K-Means, which requires you to specify the number of clusters up front, DBSCAN discovers the clusters itself from the density of points in each region.

It is like traversing all the points and noting every region of high density you encounter.

  • Identifies noise and outliers: DBSCAN naturally identifies points that do not belong to any cluster, such as isolated data or anomalies. These are labelled as noise, which makes it a good tool for anomaly detection.

  • It can find clusters of any shape: It is not limited to circular or spherical clusters; it can detect clusters of arbitrary shape, such as spirals and other complex, irregular forms.

  • Core, border, and noise points: DBSCAN classifies points into three types.

Types of Points in DBSCAN

  • Core points: have enough neighbors nearby (defined by ε and MinPts). ε is a distance threshold used to determine the neighbors of a given point, and MinPts is the minimum number of neighbors a core point must have to form a dense region.

  • Border points: close to a core point but not dense enough on their own.

  • Noise points: not close to any dense region.
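These three point types can be illustrated with a small self-contained sketch. The helper `classify_points` and the sample coordinates below are made up for illustration; they are not part of the DBSCAN implementation that follows.

```python
import numpy as np

def classify_points(data, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (illustrative helper)."""
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    within_eps = dists <= eps
    # A point's eps-neighborhood includes the point itself
    is_core = within_eps.sum(axis=1) >= min_pts
    labels = []
    for i in range(len(data)):
        if is_core[i]:
            labels.append("core")
        elif is_core[within_eps[i]].any():  # within eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                [0.55, 0.0],   # near the dense blob, but not dense itself
                [3.0, 3.0]])   # isolated
print(classify_points(pts, eps=0.5, min_pts=4))
# → ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point at (0.55, 0.0) has only three neighbors (fewer than MinPts), but it sits within ε of a core point, so it becomes a border point; the isolated point at (3.0, 3.0) is noise.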


Code implementation

1. Using NumPy

  • ALGORITHM: According to this algorithm, a cluster is a contiguous region of high density (it contains a certain number of points whose distance from a core point is less than a threshold ε).

-For each instance (point), count how many instances are located within a small distance ε (epsilon) of it. This region is called the instance’s ε-neighborhood.
-If an instance has at least MinPts instances in its ε-neighborhood (including itself), it is considered a core instance. In other words, core instances are those located in dense regions.
-All instances in the same neighborhood are assigned to the same cluster. This neighborhood may contain points that are core points of other neighborhoods, which is how a cluster grows.
-Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.


1. Import NumPy

#import libraries
import numpy as np

2. Create a DBSCAN class (remember Object-Oriented Programming)

#class
class DBSCAN:
    #constructor function (min_samples is what the fit method and the
    #use case below expect, so we store it under that name)
    def __init__(self, eps, min_samples):
        self.eps = eps
        self.min_samples = min_samples
    #define the fit function

Define the methods for this class.

1. fit method:

  • Initialize labels for all points to noise (-1).
  • Iterate over all points. If a point is already labeled, skip it.
  • Find neighbors within epsilon distance using the get_neighbors method.
  • If a point is not a core point (i.e., it has fewer than min_samples neighbors), label it as noise.
  • Otherwise, label the point as part of a new cluster and expand the cluster using the expand_cluster method.
    def fit(self, data):
        # Initialize labels for all points to noise (-1)
        # np.full creates an array of a given shape filled with a value: np.full(shape, value)
        labels = np.full(data.shape[0], -1)
        cluster_id = 0
        for i in range(data.shape[0]):
            # If the point is already assigned to a cluster, skip it
            if labels[i] != -1:
                continue
            # Find neighbors within epsilon distance
            neighbors = self.get_neighbors(data, i, self.eps)
            # If the point is not a core point, leave it labelled as noise
            if len(neighbors) < self.min_samples:
                labels[i] = -1
            else:
                # Label the point as part of a new cluster and grow that cluster
                labels = self.expand_cluster(data, labels, i, neighbors, cluster_id, self.eps, self.min_samples)
                cluster_id += 1
        return labels

2. get_neighbors method:

  • Calculate the Euclidean distance between a point and all other points.
  • Return the indices of points within epsilon distance.

3. expand_cluster method:

  • Label a point as part of a cluster.
  • Iterate over all neighbors of the point. If a neighbor is noise or unlabeled, label it as part of the cluster.
  • Find the neighbors of each neighbor. If a neighbor is a core point, add its neighbors to the queue.
    def get_neighbors(self, data, index, eps):
        # Calculate the Euclidean distance between the point and all other points
        distances = np.linalg.norm(data - data[index], axis=1)
        # Return the indices of points within epsilon distance
        return np.where(distances <= eps)[0]

    def expand_cluster(self, data, labels, index, neighbors, cluster_id, eps, min_samples):
        # Label the seed point as part of the cluster
        labels[index] = cluster_id
        # Use an explicit queue so neighbors discovered during the loop are also
        # visited (concatenating to a NumPy array inside a `for` loop creates a
        # new array and does not extend the ongoing iteration)
        queue = list(neighbors)
        while queue:
            i = queue.pop(0)
            # If the neighbor is noise or unlabeled, label it as part of the cluster
            if labels[i] == -1:
                labels[i] = cluster_id
                # Find the neighbors of the neighbor
                new_neighbors = self.get_neighbors(data, i, eps)
                # If the neighbor is a core point, add its neighbors to the queue
                if len(new_neighbors) >= min_samples:
                    queue.extend(new_neighbors)
        return labels

Use case

from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

#Generate sample data
data, _ = make_moons(n_samples=200, noise=0.05)

#Create DBSCAN instance
dbscan = DBSCAN(eps=0.3, min_samples=10)

#Fit DBSCAN to data
labels = dbscan.fit(data)

#Plot clusters
plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.show()
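As a sanity check, the same kind of data can be clustered with scikit-learn’s built-in DBSCAN (aliased here to avoid clashing with our class name). The `eps`, `min_samples`, and `random_state` values below are illustrative; a slightly tighter `eps` is used so the two moons stay cleanly separated:

```python
from sklearn.cluster import DBSCAN as SKDBSCAN
from sklearn.datasets import make_moons

# Generate the same kind of two-moons data, seeded for reproducibility
data, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# Fit scikit-learn's DBSCAN; fit_predict returns one label per point,
# with -1 marking noise
labels = SKDBSCAN(eps=0.2, min_samples=10).fit_predict(data)

# Count the clusters found (noise excluded)
n_clusters = len(set(labels) - {-1})
print(n_clusters)
```

With these settings the two moon shapes come out as two separate clusters, something a centroid-based method like K-Means cannot recover.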

