DBSCAN (for density-based spatial clustering of applications with noise) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.[1] It is a density-based clustering algorithm: it finds a number of clusters starting from the estimated density distribution of the corresponding points. DBSCAN is one of the most common clustering algorithms and also one of the most cited in the scientific literature.[2] OPTICS can be seen as a generalization of DBSCAN to multiple ranges, effectively replacing the ε parameter with a maximum search radius.
Basic idea
DBSCAN's definition of a cluster is based on the notion of density reachability. Basically, a point q is directly density-reachable from a point p if it is not farther away than a given distance ε (i.e., q is part of p's ε-neighborhood) and if p is surrounded by sufficiently many points that one may consider p and q to be part of a cluster. q is called density-reachable (note: this is different from "directly density-reachable") from p if there is a sequence of points p₁, …, pₙ with p₁ = p and pₙ = q, where each pᵢ₊₁ is directly density-reachable from pᵢ. Note that the relation of density-reachability is not symmetric (q might lie on the edge of a cluster, having too few neighbors itself to count as a genuine cluster element), so the notion of density-connectedness is introduced: two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o.
A cluster, which is a subset of the points of the database, satisfies two properties:
1. All points within the cluster are mutually density-connected.
2. If a point is density-reachable from some point of the cluster, it is part of the cluster as well.
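These definitions translate directly into code. The following sketch is illustrative only (the helper names are ours, and a Euclidean metric over points given as coordinate tuples is assumed); it shows the ε-neighborhood query, the core-point test, and an iterative check for density-reachability along chains of directly density-reachable points:

from math import dist  # Euclidean distance (Python 3.8+)

def eps_neighborhood(points, p, eps):
    """All points within distance eps of p (including p itself)."""
    return [q for q in points if dist(p, q) <= eps]

def is_core_point(points, p, eps, min_pts):
    """p is a core point if its eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """q is directly density-reachable from p if p is a core point
    and q lies in p's eps-neighborhood."""
    return is_core_point(points, p, eps, min_pts) and dist(p, q) <= eps

def density_reachable(points, p, q, eps, min_pts):
    """q is density-reachable from p if a chain of directly
    density-reachable points leads from p to q (iterative graph search)."""
    frontier, seen = [p], {p}
    while frontier:
        current = frontier.pop()
        if current == q:
            return True
        if is_core_point(points, current, eps, min_pts):
            # only core points extend a chain of direct density-reachability
            for r in eps_neighborhood(points, current, eps):
                if r not in seen:
                    seen.add(r)
                    frontier.append(r)
    return False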
DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a cluster (minPts). It starts with an arbitrary point that has not been visited. This point's ε-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized ε-neighborhood of a different point and hence be made part of a cluster.
If a point is found to be part of a cluster, its ε-neighborhood is also part of that cluster. Hence, all points that are found within the ε-neighborhood are added, as are their own ε-neighborhoods when they are also dense. This process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of noise.
Pseudocode
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      N = regionQuery(P, eps)
      if sizeof(N) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, N, C, eps, MinPts)

expandCluster(P, N, C, eps, MinPts)
   add P to cluster C
   for each point P' in N
      if P' is not visited
         mark P' as visited
         N' = regionQuery(P', eps)
         if sizeof(N') >= MinPts
            N = N joined with N'
      if P' is not yet member of any cluster
         add P' to cluster C
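For readers who want something executable, here is a minimal Python sketch of the pseudocode above (illustrative only: it uses a naive brute-force regionQuery and assumes points are given as coordinate tuples):

from math import dist

NOISE = -1  # label for noise points; cluster ids start at 0

def region_query(points, p, eps):
    """Return indices of all points within distance eps of points[p]."""
    return [i for i, q in enumerate(points) if dist(points[p], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE; follows the pseudocode above."""
    labels = [None] * len(points)   # None means unvisited
    c = -1                          # current cluster id
    for p in range(len(points)):
        if labels[p] is not None:   # already visited
            continue
        n = region_query(points, p, eps)
        if len(n) < min_pts:
            labels[p] = NOISE
        else:
            c += 1
            labels[p] = c
            seeds = list(n)
            i = 0
            while i < len(seeds):   # seeds grow as neighborhoods are joined
                q = seeds[i]
                if labels[q] is None or labels[q] == NOISE:
                    was_noise = labels[q] == NOISE
                    labels[q] = c   # claim q for cluster c
                    if not was_noise:
                        nq = region_query(points, q, eps)
                        if len(nq) >= min_pts:
                            seeds.extend(nq)  # q is a core point: expand
                i += 1
    return labels

For instance, dbscan([(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (25, 80)], eps=2, min_pts=2) returns [0, 0, 0, 1, 1, -1]: two clusters and one noise point.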
Complexity
DBSCAN visits each point of the database, possibly multiple times (e.g., as a candidate for different clusters). For practical considerations, however, the time complexity is mostly governed by the number of regionQuery invocations. DBSCAN executes exactly one such query for each point, and if an indexing structure is used that executes such a neighborhood query in O(log n), an overall runtime complexity of O(n log n) is obtained. Without the use of an accelerating index structure, the runtime complexity is O(n²). Often the distance matrix of size (n² − n)/2 is materialized to avoid distance recomputations; this, however, also needs O(n²) memory.
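As an illustration of how an index accelerates the neighborhood queries, the sketch below (assuming SciPy is available; the data here is made up) swaps the brute-force regionQuery for a k-d tree lookup:

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(10_000, 2)   # example data: 10,000 points in 2D
eps = 0.05

tree = cKDTree(points)               # built once up front

def region_query(i):
    """Indices of all points within eps of points[i], via the k-d tree;
    typically far cheaper than scanning all n points per query."""
    return tree.query_ball_point(points[i], r=eps)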
See the section on extensions below for algorithmic modifications to handle these issues.
Every data mining task faces the problem of parameter selection, and every parameter influences the algorithm in a specific way. DBSCAN needs the parameters ε and MinPts, which must be specified by the user, since different data sets and different questions require different parameters. An initial value for ε can be determined by a k-distance graph: plot, for each point, the distance to its k-th nearest neighbor, sorted in descending order, and look for an "elbow" in the curve. As a rule of thumb, k can be derived from the number of dimensions of the data set D as k ≥ D + 1. However, larger values are usually better for data sets with noise.
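A minimal sketch of such a k-distance graph, assuming scikit-learn and Matplotlib are available (the function and variable names are ours):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k):
    """Plot each point's distance to its k-th nearest neighbor, sorted
    descending; a knee in the curve suggests a value for eps."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: the query point itself
    distances, _ = nn.kneighbors(X)
    kth = np.sort(distances[:, -1])[::-1]             # k-th neighbor distance, descending
    plt.plot(kth)
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"{k}-NN distance")
    plt.show()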
Although this parameter estimation gives a sufficient initial parameter set, the resulting clustering may turn out not to be the expected partitioning. Therefore, research has been performed on incrementally optimizing the parameters against a specific target value.
OPTICS can be seen as a generalization of DBSCAN that replaces the ε parameter with a maximum value that mostly affects performance. MinPts then essentially becomes the minimum cluster size to find. While the algorithm is much easier to parameterize than DBSCAN, its results are a bit more difficult to use, as it will usually produce a hierarchical clustering instead of the simple data partitioning that DBSCAN produces.
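If scikit-learn is available, its OPTICS implementation illustrates this parameterization (the data here is made up for illustration):

import numpy as np
from sklearn.cluster import OPTICS

X = np.concatenate([
    np.random.randn(50, 2),             # one blob around the origin
    np.random.randn(50, 2) + [10, 10],  # a second blob, shifted away
])

# min_samples plays the role of MinPts; max_eps is only an upper bound
# on the search radius, mostly trading completeness for speed.
clustering = OPTICS(min_samples=5, max_eps=2.0).fit(X)
print(clustering.labels_)               # -1 marks noise, as in DBSCAN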