Exploring how DBSCAN uses density, not distance, to find clusters of any shape from theory and code to real-world applications like anomaly detection and geospatial mapping.
Before diving into the code or real-world applications, it’s essential to understand what DBSCAN actually is and why it stands apart from traditional clustering techniques.
What is DBSCAN
At its core, DBSCAN short for Density-Based Spatial Clustering of Applications with Noise is a powerful algorithm that clusters data not based on shape or center points, but on the density of data points in a region. Let’s break down the key concepts that make DBSCAN unique.
Density-Based Clustering : Instead of grouping points based of distance from a center(like K-MEANS). This algorithm groups points based on how crowded a region is
No need for predefined number of clusters : Unlike k-means which needs to figure out the number of clusters. DBSCAN is able to get the different clusters based on density of points in a region
It is like traversing through all the points noting the number of times you encountered regions of high density
Identifies Noise and Outliers : DBSCAN is able to naturally identifies points that do not belong to any cluster like isolated data or anomalies . these are labelled as noise and hence a good tool for anomaly detection
It can find clusters of any shape : It is not limited to circular or spherical clusters. It is able to detect clusters of any shape like spirals and other complex irregular shapes
Core, Border, and Noise Points : DBSCAN classifies points into 3 types
Types of Points in DBSCAN
Core points : Have enough neighbors nearby (defined by ε and MinPts). ε is a defined distance used to determine the neighbors of a given point and MinPts is the minimum number of neighbor a core point must have to form a dense region
Border points : Close to a core point but not dense enough on their own.
Noise points : Points that are not close to any dense region.
Code implementation
1.using numpy
- ALGORITHM According to this algorithm a cluster is a continious region of high density(contains certain number of points whose distant from core point is less that a certain number ε
-For each instance(point) counts how many instances are located within a small distance ε (epsilon) from it. This region is called the instance’s ε- neighborhood.
-If an instance has at least MinPts instances in its ε-neighborhood (including itself), then it is considered a core instance. In other words, core instances are those that are located in dense regions.
-All instances in the same neighborhood will be assigned to the same cluster. this neighborhood may contain points that are core points to other neighborhoods
-Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.
1.Import numpy
#import libraries
import numpy as np
2.create a DBSCAN Class(remember Object-Oriented Programming)
#class
class DBSCAN:
#constructor function
def __init__(self, eps, MinPts):
self.eps = eps
self.MinPts = MinPts
#define the fit function
define methods for this class
1.fit function:
- Initialize labels for all points to noise (-1).
- Iterate over all points. If a point is already labeled, skip it.
- Find neighbors within epsilon distance using the get_neighbors method.
- If a point is not a core point (i.e., it has fewer than min_samples neighbors), label it as noise.
- Otherwise, label the point as part of a new cluster and expand the cluster using the expand_cluster method.
def fit(self, data): # Initialize labels for all points to noise (-1) labels = np.full(data.shape[0], -1)#np.full creates an array of given shape and fill it with a certain value np.full(shape,value) cluster_id = 0 for i in range(data.shape[0]): # If point is already labeled, skip it if labels[i] != -1: continue # Find neighbors within epsilon distance neighbors = self.get_neighbors(data, i, self.eps) # If point is not a core point, label it as noise if len(neighbors) < self.min_samples: labels[i] = -1 else: # Label point as part of a new cluster labels = self.expand_cluster(data, labels, i, neighbors, cluster_id, self.eps, self.min_samples) cluster_id += 1 return labels
2.get_neighbors method
- Calculate the Euclidean distance between a point and all other points.
- Return the indices of points within epsilon distance.
- expand_cluster method:
- Label a point as part of a cluster.
- Iterate over all neighbors of the point. If a neighbor is noise or unlabeled, label it as part of the cluster.
- Find neighbors of each neighbor. If a neighbor is a core point, add its neighbors to the list.
def get_neighbors(self, data, index, eps): # Calculate Euclidean distance between point and all other points distances = np.linalg.norm(data - data[index], axis=1) # Return indices of points within epsilon distance return np.where(distances <= eps)[0] def expand_cluster(self, data, labels, index, neighbors, cluster_id, eps, min_samples): # Label point as part of the cluster labels[index] = cluster_id # Iterate over all neighbors for i in neighbors: # If neighbor is noise or unlabeled, label it as part of the cluster if labels[i] == -1: labels[i] = cluster_id # Find neighbors of the neighbor new_neighbors = self.get_neighbors(data, i, eps) # If neighbor is a core point, add its neighbors to the list if len(new_neighbors) >= min_samples: neighbors = np.concatenate((neighbors, new_neighbors)) return labels
use case
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
#Generate sample data
data, _ = make_moons(n_samples=200, noise=0.05)
#Create DBSCAN instance
dbscan = DBSCAN(eps=0.3, min_samples=10)
#Fit DBSCAN to data
labels = dbscan.fit(data)
#Plot clusters
plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.show()
Top comments (0)