Agglomerative Clustering and Dendrograms Explained
Agglomerative Clustering is a type of hierarchical clustering algorithm. It is an unsupervised machine learning technique that divides the population into several clusters such that data points in the same cluster are more similar and data points in different clusters are dissimilar.
- Points in the same cluster are closer to each other.
- Points in different clusters are far apart.
In the above sample two-dimensional dataset, it is visible that the data forms 3 clusters that are far apart, while points in the same cluster are close to each other.
The intuition behind Agglomerative Clustering:
Agglomerative Clustering is a bottom-up approach: initially, each data point is a cluster of its own, and pairs of clusters are merged as one moves up the hierarchy.
Steps of Agglomerative Clustering:
- Initially, each data point is a cluster of its own.
- Take the two nearest clusters and join them to form one single cluster.
- Repeat step 2 recursively until you obtain the desired number of clusters.
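The three steps above can be sketched directly in code. The following is a minimal, deliberately naive illustration (single linkage, O(n³); the `agglomerate` function and the toy points are hypothetical, not from the article):

```python
import numpy as np

def agglomerate(points, n_clusters):
    """Naive agglomerative clustering with single linkage.

    points: (n, d) array; n_clusters: desired number of clusters.
    Returns a list of clusters, each a list of point indices.
    """
    # Step 1: every point starts as its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Step 2: find the pair of clusters whose closest members
        # are nearest to each other (single linkage).
        best, best_dist = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    np.linalg.norm(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if d < best_dist:
                    best_dist, best = d, (a, b)
        # Step 3: merge the nearest pair and repeat.
        a, b = best
        clusters[a].extend(clusters.pop(b))
    return clusters

clusters = agglomerate(np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float), 2)
print(clusters)  # → [[0, 1], [2, 3]]
```

Real implementations use priority queues and linkage-specific update rules instead of recomputing every pairwise distance, but the merge loop is the same.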
In the above sample dataset, the 2 clusters are far apart from each other, so we stopped after obtaining 2 clusters.
(Image by Author), Sample dataset separated into 2 clusters
How to join two clusters to form one cluster?
To obtain the desired number of clusters, the number of clusters needs to be reduced from the initial n clusters (n equals the total number of data points). Two clusters are combined by computing the similarity between them.
There are several methods used to calculate the similarity between two clusters:
- Distance between the two closest points in the two clusters.
- Distance between the two farthest points in the two clusters.
- The average distance between all pairs of points in the two clusters.
- Distance between the centroids of the two clusters.
There are several pros and cons of choosing any of the above similarity metrics.
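These four criteria are commonly known as single, complete, average, and centroid linkage. They can be compared on a toy pair of clusters (the points `A` and `B` below are made up for illustration):

```python
import numpy as np

# Two small example clusters (hypothetical points).
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# Distance from every point of A to every point of B, via broadcasting.
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

single   = pairwise.min()   # two closest points (single linkage)
complete = pairwise.max()   # two farthest points (complete linkage)
average  = pairwise.mean()  # average over all pairs (average linkage)
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # centroid linkage

print(single, complete, average, centroid)
```

Single linkage can chain elongated clusters together, while complete linkage tends to produce compact clusters of similar diameter, which is one source of the pros and cons mentioned above.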
Implementation of Agglomerative Clustering:
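The author's original snippet is embedded in the source article and not reproduced here; a minimal sketch using scikit-learn's `AgglomerativeClustering`, on a hypothetical three-blob dataset, might look like this:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical 2-D dataset: three well-separated Gaussian blobs.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 2))
    for center in [(0, 0), (10, 0), (5, 8)]
])

# Fit agglomerative clustering with the desired number of clusters;
# "ward" linkage merges the pair that least increases within-cluster variance.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)

print(labels.shape)  # one cluster label per point
```

The `linkage` parameter also accepts `"single"`, `"complete"`, and `"average"`, corresponding to the similarity criteria listed above.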
How to obtain the optimal number of clusters?
The implementation of the Agglomerative Clustering algorithm accepts the number of desired clusters. There are several ways to find the optimal number of clusters such that the population is divided into k clusters in a way that:
- Points in the same cluster are closer to each other.
- Points in different clusters are far apart.
By observing the dendrograms, one can find the desired number of clusters.
A dendrogram is a diagrammatic representation of the hierarchical relationship between the data points. It illustrates the arrangement of the clusters produced by the corresponding analysis and is used to inspect the output of hierarchical (agglomerative) clustering.
Implementation of Dendrograms:
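Again, the author's embedded snippet is not reproduced here; a minimal sketch using `scipy.cluster.hierarchy` and matplotlib, with a hypothetical stand-in for the sample dataset, might look like this:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical 2-D dataset: three well-separated Gaussian blobs
# (a stand-in for the downloadable sample dataset).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(20, 2))
    for c in [(0, 0), (10, 0), (5, 8)]
])

# linkage() builds the full merge hierarchy; each row of Z records
# one merge: the two cluster indices, the merge distance, and the size.
Z = linkage(X, method="ward")

# dendrogram() draws the hierarchy; the y-axis is the merge distance.
plt.figure(figsize=(8, 4))
dendrogram(Z)
plt.title("Dendrogram for the sample dataset")
plt.savefig("dendrogram.png")
```

For n points, `Z` has n − 1 rows, one per merge, which is exactly the bottom-up process described earlier.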
Download the sample two-dimensional dataset from here.
Left Image: Visualize the sample dataset, Right Image: Visualize the 3 clusters of the sample dataset
For the above sample dataset, it is observed that the optimal number of clusters is 3. But for a high-dimensional dataset, where visualization of the dataset is not possible, dendrograms play an important role in finding the optimal number of clusters.
How to find the optimal number of clusters by observing the dendrograms:
(Image by Author), Dendrogram for the above sample dataset
From the above dendrogram plot, find the horizontal rectangle with maximum height that does not cross any horizontal dendrogram line.
Left: Separating into 2 clusters, Right: Separating into 3 clusters
Cutting the dendrogram inside the portion where the max-height rectangle fits yields the optimal number of clusters, which is 3, as observed in the right part of the above image. The max-height rectangle is chosen because it represents the maximum Euclidean distance between the resulting clusters.
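This visual rule can be approximated in code: the tallest uncrossed band corresponds to the largest gap between consecutive merge heights in the linkage matrix, and `fcluster` cuts the tree at a chosen height. This is a heuristic sketch on a hypothetical three-blob dataset, not the article's code:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D dataset: three well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(20, 2))
    for c in [(0, 0), (10, 0), (5, 8)]
])
Z = linkage(X, method="ward")

# Column 2 of the linkage matrix holds the merge heights in
# non-decreasing order. The largest gap between consecutive heights
# is the tallest horizontal band no dendrogram line crosses.
heights = Z[:, 2]
widest = np.diff(heights).argmax()
cut = (heights[widest] + heights[widest + 1]) / 2

# Cutting at that height yields the suggested clustering.
labels = fcluster(Z, t=cut, criterion="distance")
print(len(set(labels)))  # → 3
```

On noisy or overlapping data the largest gap is less pronounced, so the dendrogram should still be inspected visually when possible.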
Conclusion:
In this article, we discussed the intuition behind the agglomerative hierarchical clustering algorithm in depth. The algorithm has some disadvantages: it is not suitable for large datasets because of its large space and time complexity, and even inspecting the dendrogram to find the optimal number of clusters for a large dataset is very difficult.
Thank You for Reading
Translated from: https://towardsdatascience.com/agglomerative-clustering-and-dendrograms-explained-29fc12b85f23