层次聚类 hierarchical clustering-猿圈-程序猿的知识社区

定义

层次聚类的策略一般分为两类

Agglomerative: This is "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
这是一种“自底向上”的方法：每个观测值都是单独的一个簇，并且当一个簇向上移动时，成对的簇被合并为一个簇。
Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
这是一种“自顶向下”的方法：所有的观测都从一个簇开始，当一个观察向下移动时，递归地执行分裂为多个簇。

The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of \(O(n^3)\) and requires \(O(n^2)\) memory, which makes it too slow for even medium data sets.

标准的层次聚类算法（自底向上）需要\(O(n^3)\)的时间，\(O(n^2)\)的空间。

Metric

For text or other non-numeric data, metrics such as the Hamming distance or Levenshtein distance are often used.

对于文本或其他非数字数据，通常使用诸如Hamming距离或Levenshtein距离之类的度量。

Linkage criteria

链接标准用于确定两个观测集合之间的距离计算。

The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.

常见的链接标准有：

Names
Maximum or complete-linkage clustering	\(max(d(a,b):a \in A, b\in B)\)
Minimum or single-linkage clustering	\(min(d(a,b):a \in A, b\in B)\)
Unweighted average linkage clustering (or UPGMA)	\(\frac{1}{\\|A\\|\cdot \\|B\\|}\sum_{a\in A}\sum_{b\in B}d(a,b)\)

Names
完全（最远）链接聚合聚类	\(max(d(a,b):a \in A, b\in B)\)
单连接（最近）聚合聚类	\(min(d(a,b):a \in A, b\in B)\)
平均聚合聚类	\(\frac{1}{\\|A\\|\cdot \\|B\\|}\sum_{a\in A}\sum_{b\in B}d(a,b)\)

Example

Agglomerative clustering example

For example, suppose this data is to be clustered, and the Euclidean distance is the distance metric.

UPGMA

Wiki

UPGMA (unweighted pair group method with arithmetic mean) is a simple agglomerative (bottom-up) hierarchical clustering method.

UPGMA是一种简单的聚合型（自底向上）层次聚类方法。

算法步骤

第一步：

我们假设我们有5个元素，和他们两两之间距离矩阵。

	a	b	c	d	e
a	0	17	21	31	23
b	17	0	30	34	21
c	21	30	0	28	39
d	31	34	28	0	43
e	23	21	39	43	0

在这个例子中，\(D_1 (a,b) = 17\)是最小的值，所以我们首先合并元素a和b。

First branch length estimation

第一次分支长度估算

假设 \(u\) 当做连接a和b的节点，有\(\delta(a,u)=\delta(b,u)=D_1(a,b)/2=8.5\).

因此连接a,b与 \(u\) 的分支长度为8.5
First distance matrix update

第一次距离矩阵更新

\[D_2((a,b),c) = (D_1(a,c)+D_1(b,c))/2=25.5 \\ D_2((a,b),d) = (D_1(a,d)+D_1(b,d))/2=32.5 \\ D_2((a,b),e) = (D_1(a,e)+D_1(b,e))/2=22 \]

第二步：

更新之后的距离矩阵

	(a,b)	c	d	e
(a,b)	0	25.5	32.5	22
c	25.5	0	28	39
d	32.5	28	0	43
e	22	39	43	0

(a,b) 与 e 的距离最小，所以合并这两个元素

Second branch length estimation

第二次分支长度估算

假设 \(v\) 当做连接(a,b)和e的节点，有

\(\delta(a,v)=\delta(b,v)=\delta(e,v)=D_2((a,b),e)/2=11\)

\(\delta(u,v)=\delta(e,v)-\delta(a,u) = 11-8.5 = 2.5\)

因此连接a,b与 \(u\) 的分支长度为8.5
Second distance matrix update

第二次距离矩阵更新

\[D_3(((a,b),e),c) = (D_2((a,b),c)*2+D_2(e,c))/(2+1)=30 \\ D_3(((a,b),e),d) = (D_2((a,b),d)*2+D_2(e,d))/(2+1)=36 \]

第三步：

	((a,b),e)	c	d
((a,b),e)	0	30	36
c	30	0	28
d	36	28	0

c和d的距离最小，合并cd

Third branch length estimation

第三次分支长度估算

\(\delta(c,w)=\delta(d,w)=D_1(c,d)/2=14\)
Third distance matrix update

第三次距离矩阵更新

\[D_4((c,d),((a,b),e)) = (D_3(c,((a,b),e))+D_3(d,((a,b),e)))/2=33 \]

第四步

	((a,b),e)	(c,d)
((a,b),e)	0	33
(c,d)	33	0

直接合并两个节点

分支长度估算

\[\delta(((a,b),e),r) = \delta((c,d),r) = 33/2 = 16.5\\ \delta(v,r) = \delta(((a,b),e),r)-\delta(e,v)=5.5\\ \delta(w,r) = \delta((c,d),r)-\delta(c,w)=2.5 \]

最终生成的树

20210410232252

其中每个节点到根节点的距离是一样的。

菜单 学习猿地 - LMONKEY

开通学习猿地VIP

尊享10项VIP特权 持续新增

知识通关挑战

打卡带练！告别无效练习

接私单赚外块

VIP优先接，累计金额超百万

学习猿地私房课免费学

大厂实战课仅对VIP开放

你的一对一导师

每月可免费咨询大牛30次

领取更多软件工程师实用特权

Java开发工程师

何以解忧，唯有 Java！

Python开发工程师

人生苦短我要学Python！

PHP开发工程师

PHP是世界上最好的编程语言!

GO开发工程师

想优雅的写程序，赶紧GO!

大数据开发工程师

弄大数据，就是在搞革命!

前端开发工程师

不仅最好，而且最全!

UI开发工程师

这个世界从来不缺少美，缺少你来创造美!

Linux运维工程师

不只是说说而已！

WEB前端1+X

不仅为考证，轻松做开发

计算机二级（C语言）

一切只为考证！

猿工手册

各种工具精挑细选

猿材料

各种工具精挑细选

猿代码

各种工具精挑细选

猿著课件

各种工具精挑细选

知识题库

知识闯关节节高，刷题涨知识！

阶段练习

单元测试知识学习状况秒掌握！

期末考试

期末测试，学习情况即刻知道！

面试题库

最新的企业技术人员招聘真题练习！

Java专区

2948篇 ｜ 24.5万人浏览

Python专区

2593篇 ｜ 19.5万人浏览

大数据专区

2948篇 ｜ 18.5万人浏览

PHP专区

3953篇 ｜ 27.5万人浏览

Go专区

1382篇 ｜ 12.5万人浏览

Web前端专区

1453篇 ｜ 31.5万人浏览

Linux云计算

1230篇 ｜ 4.5万人浏览

其他专区

53232篇文章 ｜ 11.5万人浏览

搜索

历史记录 清除记录

近期热搜

项目开发全程实录（电商EW_Shop）

8678 人 1年前

轻松学会Laravel-项目篇（商城API）

2022 人 1年前

Python数据分析2.0-金融

325 人 1年前

Python办公自动化2.0

241 人 1年前

使用账号登录

启用更安全省心的  微信登录

Java开发工程师

何以解忧，唯有 Java！

Python开发工程师

菜单学习猿地 - LMONKEY

尊享10项VIP特权持续新增

2948篇｜ 24.5万人浏览

2593篇｜ 19.5万人浏览

2948篇｜ 18.5万人浏览

3953篇｜ 27.5万人浏览

1382篇｜ 12.5万人浏览

1453篇｜ 31.5万人浏览

1230篇｜ 4.5万人浏览

53232篇文章｜ 11.5万人浏览

历史记录清除记录

2948篇｜ 24.5万人浏览

2593篇｜ 19.5万人浏览

2948篇｜ 18.5万人浏览

3953篇｜ 27.5万人浏览

1382篇｜ 12.5万人浏览

1453篇｜ 31.5万人浏览

1230篇｜ 4.5万人浏览

53232篇文章｜ 11.5万人浏览