DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular unsupervised learning method utilized in model building and machine learning algorithms originally proposed by Ester et al in 1996. Before we go any further, we need to define what is “unsupervised” learning method. Unsupervised learning methods are when there is no clear objective or outcome we are seeking to find. Instead, we are clustering the data together based on the similarity of observations.
- ε (epsilon): the radius of a neighborhood centered on a given point
- Core Point: a given point is considered a Core Point if there are at least minPts points within its ε neighborhood, including itself
- Border Point: a given point is considered a Borer Point if there fewer than minPts points within its ε neighborhood, including itself
- Noise: any point that is not a Core Point or Border Point
- Directly Density Reachable: a given point is Directly Density Reachable (ε Reachable) from another point if the second point is a core point, and the first point lies within the ε neighborhood of the second point
- Density Reachable: a given point is Density Reachable from another point if there is a chain of points, Directly Density Reachable from each other, that connects them
- Density Connected: A given point is Density Connected from another point if there is a third point from which both are Density Reachable — These points are said to be Connected Components
DBSCAN in a Nutshell
Given a set of points P, the radius of a neighborhood ε, and a minimum number of points minPts:
- Find all points within the ε neighborhood of each point;
- Identify Core Points with at least minPts neighbors;
- Find all Connected Components of each core point — This Density Connected grouping of points is a cluster
- Each Border Point is assigned to a cluster if that cluster is Density Reachable, otherwise, Border Point is considered Noise
Any given point may initially be considered noise and later revised to belong to a cluster, but once assigned to a cluster a point will never be assigned.
Okay if that was too much let's explain it in some simpler terms:
Given that DBSCAN is a density-based clustering algorithm, it does a great job of seeking areas in the data that have a high density of observations, versus areas of the data that are not very dense with observations. DBSCAN can sort data into clusters of varying shapes as well, another strong advantage. DBSCAN works as such:
- Divides the dataset into n dimensions
- For each point in the dataset, DBSCAN forms an n-dimensional shape around that data point and then counts how many data points fall within that shape.
- DBSCAN counts this shape as a cluster. DBSCAN iteratively expands the cluster, by going through each individual point within the cluster, and counting the number of other data points nearby. Take the graphic below for an example:
Going through the process step-by-step, DBSCAN will start by dividing the data into n dimensions. After DBSCAN has done so, it will start at a random point (in this case let's assume it was one of the purple points), and it will count how many other points are nearby. DBSCAN will continue this process until no other data points are nearby, and then it will look to form a second cluster.
As you may have noticed from the graphic, there are a couple of parameters and specifications that we need to give DBSCAN before it does its work.
Referring back to the graphic, the epsilon is the radius given to test the distance between data points. If a point falls within the epsilon distance of another point, those two points will be in the same cluster.
Furthermore, the minimum number of points needed is set to 4 in this scenario. When going through each data point, as long as DBSCAN finds 4 points within epsilon distance of each other, a cluster is formed.
Naftali Harris has created a great web-based visualization of running DBSCAN on a 2-dimensional dataset where you can set epsilon to higher and lower values. Try clicking on the “Smiley” dataset and hitting the GO button.
DBSCAN vs K-Means Clustering
DBSCAN is a popular clustering algorithm that is fundamentally very different from k-means.
- In k-means clustering, each cluster is represented by a centroid, and points are assigned to whichever centroid they are closest to. In DBSCAN, there are no centroids, and clusters are formed by linking nearby points to one another.
- k-means requires specifying the number of clusters, ‘k’. DBSCAN does not but does require specifying two parameters which influence the decision of whether two nearby points should be linked into the same cluster. These two parameters are a distance threshold, ε (epsilon), and “MinPts” (minimum number of points), to be explained.
- k-means runs over many iterations to converge on a good set of clusters, and cluster assignments can change on each iteration. DBSCAN makes only a single pass through the data, and once a point has been assigned to a particular cluster, it never changes.
My Approach to the DBSCAN Algorithm
I like the language of trees for describing cluster growth in DBSCAN. It starts with an arbitrary seed point which has at least MinPts points nearby within a distance or “radius” of ε. We do a breadth-first search along each of these nearby points. For a given nearby point, we check how many points it has within its radius. If it has fewer than MinPts neighbors, this point becomes a leaf–we don’t continue to grow the cluster from it. If it does have at least MinPts, however, then it’s a branch, and we add all of its neighbors to the FIFO (“First In, First Out”) queue of our breadth-first search.
Once the breadth-first search is complete, we are done with that cluster and it will not be revisited any of the points in it. We pick a new arbitrary seed point (which isn’t already part of another cluster) and grow the next cluster. This continues until all of the points have been assigned.
There is one other novel aspect of DBSCAN which affects the algorithm. If a point has fewer than MinPts neighbors, AND it’s not a leaf node of another cluster, then it’s labeled as a “Noise” point that doesn’t belong to any cluster.
Noise points are identified as part of the process of selecting a new seed if a particular seed point does not have enough neighbors, then it is labeled as a Noise point. This label is often temporary, however–these Noise points are often picked up by some cluster as a leaf node.
To understand this, I feel like it is best to just jump headfirst into the code:
Above is a working implementation within Python. Please note that this emphasis in the implementation of the algorithm… the distance calculations, for example, could be optimized significantly.
You can also find this code along with a validation python file on GitHub here.