DBSCAN Algorithm from Scratch in Python

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular unsupervised learning method utilized in model building and machine learning algorithms originally proposed by Ester et al in 1996. Before we go any further, we need to define what is “unsupervised” learning method. Unsupervised learning methods are when there is no clear objective or outcome we are seeking to find. Instead, we are clustering the data together based on the similarity of observations.

Definitions

  • ε (epsilon): the radius of a neighborhood centered on a given point
  • Core Point: a given point is considered a Core Point if there are at least minPts points within its ε neighborhood, including itself
  • Border Point: a given point is considered a Borer Point if there fewer than minPts points within its ε neighborhood, including itself
  • Noise: any point that is not a Core Point or Border Point
  • Directly Density Reachable: a given point is Directly Density Reachable (ε Reachable) from another point if the second point is a core point, and the first point lies within the ε neighborhood of the second point
  • Density Reachable: a given point is Density Reachable from another point if there is a chain of points, Directly Density Reachable from each other, that connects them
  • Density Connected: A given point is Density Connected from another point if there is a third point from which both are Density Reachable — These points are said to be Connected Components

DBSCAN in a Nutshell

Given a set of points P, the radius of a neighborhood ε, and a minimum number of points minPts:

  1. Identify Core Points with at least minPts neighbors;
  2. Find all Connected Components of each core point — This Density Connected grouping of points is a cluster
  3. Each Border Point is assigned to a cluster if that cluster is Density Reachable, otherwise, Border Point is considered Noise
  • For each point in the dataset, DBSCAN forms an n-dimensional shape around that data point and then counts how many data points fall within that shape.
  • DBSCAN counts this shape as a cluster. DBSCAN iteratively expands the cluster, by going through each individual point within the cluster, and counting the number of other data points nearby. Take the graphic below for an example:

DBSCAN vs K-Means Clustering

DBSCAN is a popular clustering algorithm that is fundamentally very different from k-means.

  • k-means requires specifying the number of clusters, ‘k’. DBSCAN does not but does require specifying two parameters which influence the decision of whether two nearby points should be linked into the same cluster. These two parameters are a distance threshold, ε (epsilon), and “MinPts” (minimum number of points), to be explained.
  • k-means runs over many iterations to converge on a good set of clusters, and cluster assignments can change on each iteration. DBSCAN makes only a single pass through the data, and once a point has been assigned to a particular cluster, it never changes.

My Approach to the DBSCAN Algorithm

I like the language of trees for describing cluster growth in DBSCAN. It starts with an arbitrary seed point which has at least MinPts points nearby within a distance or “radius” of ε. We do a breadth-first search along each of these nearby points. For a given nearby point, we check how many points it has within its radius. If it has fewer than MinPts neighbors, this point becomes a leaf–we don’t continue to grow the cluster from it. If it does have at least MinPts, however, then it’s a branch, and we add all of its neighbors to the FIFO (“First In, First Out”) queue of our breadth-first search.

Data Science student, geologist, and avid disc golfer