http://blog.csdn.net/pipisorry/article/details/48814183
The most important module in scipy.spatial is arguably the distance computation module, distance.
from scipy import spatial
Each row of the matrix argument is one observation, and the result is the metric distance between every pair of rows. Distance matrix computation from a collection of raw observation vectors stored in a rectangular array.
Distance functions between two vectors u and v. Computing distances over a large collection of vectors is inefficient for these functions. Use pdist for this purpose.
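The inputs should be vectors, i.e. arrays of shape (n,). An array of shape (1, n) also works, because squeeze is applied automatically to drop dimensions of size 1; a multi-dimensional array with at least two dimensions larger than 1 raises an error.
e.g. spatial.distance.correlation(u, v) # correlation distance between u and v, i.e. 1 minus the Pearson correlation coefficient (centered cosine)
Note: if u and v each contain only one element, or all elements of one of the vectors are equal (so the denominator norm(u - u.mean()) is 0), the correlation is undefined and nan is returned.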
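To make the relation between the correlation distance and the Pearson coefficient concrete, here is a minimal sketch. The sample vectors are made up for illustration, and np.corrcoef is used only as an independent cross-check; it is not part of the original post.

```python
import numpy as np
from scipy.spatial import distance

# Made-up sample vectors, chosen only for illustration.
u = np.array([1.0, 2.0, 3.0, 4.0])
v = np.array([2.0, 1.0, 4.0, 3.0])

d = distance.correlation(u, v)   # correlation distance
r = np.corrcoef(u, v)[0, 1]      # Pearson correlation coefficient
print(d, 1.0 - r)                # the two values should agree

# A (1, n) array can always be flattened explicitly before the call.
d_flat = distance.correlation(u.reshape(1, -1).ravel(), v)

# Constant vector: the centered norm is 0, which is the nan case
# described in the note above (floating-point warnings are silenced).
with np.errstate(divide="ignore", invalid="ignore"):
    d_const = distance.correlation(np.ones(4), v)
print(d_const)
```

Flattening with ravel before the call avoids relying on any automatic squeezing of the input.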
braycurtis(u, v) | Computes the Bray-Curtis distance between two 1-D arrays. |
canberra(u, v) | Computes the Canberra distance between two 1-D arrays. |
chebyshev(u, v) | Computes the Chebyshev distance. |
cityblock(u, v) | Computes the City Block (Manhattan) distance. |
correlation(u, v) | Computes the correlation distance between two 1-D arrays. |
cosine(u, v) | Computes the Cosine distance between 1-D arrays. |
dice(u, v) | Computes the Dice dissimilarity between two boolean 1-D arrays. |
euclidean(u, v) | Computes the Euclidean distance between two 1-D arrays. |
hamming(u, v) | Computes the Hamming distance between two 1-D arrays. |
jaccard(u, v) | Computes the Jaccard-Needham dissimilarity between two boolean 1-D arrays. |
kulsinski(u, v) | Computes the Kulsinski dissimilarity between two boolean 1-D arrays. |
mahalanobis(u, v, VI) | Computes the Mahalanobis distance between two 1-D arrays. |
matching(u, v) | Computes the Matching dissimilarity between two boolean 1-D arrays. |
minkowski(u, v, p) | Computes the Minkowski distance between two 1-D arrays. |
rogerstanimoto(u, v) | Computes the Rogers-Tanimoto dissimilarity between two boolean 1-D arrays. |
russellrao(u, v) | Computes the Russell-Rao dissimilarity between two boolean 1-D arrays. |
seuclidean(u, v, V) | Returns the standardized Euclidean distance between two 1-D arrays. |
sokalmichener(u, v) | Computes the Sokal-Michener dissimilarity between two boolean 1-D arrays. |
sokalsneath(u, v) | Computes the Sokal-Sneath dissimilarity between two boolean 1-D arrays. |
sqeuclidean(u, v) | Computes the squared Euclidean distance between two 1-D arrays. |
wminkowski(u, v, p, w) | Computes the weighted Minkowski distance between two 1-D arrays. |
yule(u, v) | Computes the Yule dissimilarity between two boolean 1-D arrays. |
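As a quick illustration of how the two-vector functions in the table are called, the sketch below evaluates a few of them on a pair of made-up unit vectors. The vectors and the results dictionary are assumptions for the example only.

```python
import numpy as np
from scipy.spatial import distance

# Two orthogonal unit vectors, chosen so the results are easy to verify by hand.
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])

results = {
    "euclidean": distance.euclidean(u, v),      # sqrt(2)
    "sqeuclidean": distance.sqeuclidean(u, v),  # 2
    "cityblock": distance.cityblock(u, v),      # 2
    "chebyshev": distance.chebyshev(u, v),      # 1
    "cosine": distance.cosine(u, v),            # 1, the vectors are orthogonal
    "hamming": distance.hamming(u, v),          # 2/3, two of three positions differ
}

for name, value in results.items():
    print(f"{name:12s} {value:.6f}")
```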
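pdist(X[, metric, p, w, V, VI]) Pairwise distances between observations in n-dimensional space. The larger the distance, the weaker the relation between two observations.
Note: when converting distances to similarities, keep in mind that the distance of an observation to itself is not computed (it is implicitly 0). So first convert the result to a dense matrix with dist = spatial.distance.squareform(dist), and then compute the similarity as 1 - dist.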
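The conversion described above can be sketched as follows, assuming the cosine metric and a small made-up matrix X (both are choices for this example, not taken from the original post).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up observations, one per row.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])

condensed = pdist(X, metric="cosine")  # length n*(n-1)/2 = 3, no self-distances
dist = squareform(condensed)           # dense (3, 3) matrix, zeros on the diagonal
sim = 1.0 - dist                       # cosine similarity, diagonal becomes 1

print(sim)
```

Computing 1 - condensed directly would also work for the off-diagonal pairs, but the dense form is what gives each observation a similarity of 1 with itself.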
metric:
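1 The distance can be computed with a function you write yourself. Y = pdist(X, f) Computes the distance between all pairs of vectors in X using the user supplied 2-arity function f.
For example, the Euclidean distance can be computed like this: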
dm = pdist(X, lambda u, v: np.sqrt(((u-v)**2).sum()))
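However, if SciPy already provides the distance function, do not compute it as dm = pdist(X, sokalsneath): that form calls the Python function sokalsneath C(n, 2) times. Use SciPy's optimized C version instead by passing the metric name as a string: dm = pdist(X, 'sokalsneath').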
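The sketch below checks that the three ways of specifying a metric give the same result. It uses the Euclidean metric rather than sokalsneath so that random real-valued data can be used; the data and the variable names are assumptions for the example.

```python
import numpy as np
from scipy.spatial import distance
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((50, 4))  # 50 made-up observations with 4 features

# User-supplied lambda: called once per pair from Python.
dm_lambda = pdist(X, lambda u, v: np.sqrt(((u - v) ** 2).sum()))
# SciPy's Python-level function passed as a callable.
dm_func = pdist(X, distance.euclidean)
# Metric name as a string: selects the optimized built-in implementation.
dm_name = pdist(X, "euclidean")

print(np.allclose(dm_lambda, dm_name), np.allclose(dm_func, dm_name))
```

All three produce the same condensed vector; the string form is the one to prefer for larger inputs.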
As another example, all cause-effect values between the rows of a matrix can be computed like this:
def causal_effect(m):
    effect = lambda u, v: u.dot(v) / sum(u) - (1 - u).dot(v) / sum(1 - u)
    return spatial.distance.squareform(spatial.distance.pdist(m, metric=effect))
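One detail worth keeping in mind with a custom metric like this one: pdist evaluates the function only for row pairs i < j, and squareform mirrors those values, so a function that is not symmetric in u and v still yields a symmetric matrix. A minimal sketch on a made-up binary matrix follows; the data and the standalone effect function are assumptions for illustration.

```python
import numpy as np
from scipy import spatial

def effect(u, v):
    # Same formula as the lambda in causal_effect, written as a named function.
    return u.dot(v) / u.sum() - (1 - u).dot(v) / (1 - u).sum()

def causal_effect(m):
    return spatial.distance.squareform(spatial.distance.pdist(m, metric=effect))

# Made-up binary rows; each row contains both zeros and ones,
# so neither denominator in effect() is zero.
m = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 1.0]])

D = causal_effect(m)
forward = effect(m[0], m[2])   # what pdist computes for the pair (0, 2)
backward = effect(m[2], m[0])  # never computed by pdist

print(D[0, 2], D[2, 0], forward, backward)
```

If both directions are needed, cdist(m, m, metric=effect) evaluates every ordered pair instead.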
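2 What is computed here is the pairwise distance, not the similarity. For example, after computing the cosine distance, use 1 - cosine to obtain the similarity. This follows from the definition of the cosine distance, 1 - u·v / (||u|| ||v||).
Y = pdist(X, 'euclidean') # d = sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2)
Y = pdist(X, 'minkowski', p=p) # p is the order of the Minkowski norm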
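A short sketch of the two calls above, on made-up points, also shows how the Minkowski metric relates to the others: p=2 reproduces the Euclidean distance and p=1 the city block distance. The data and variable names are assumptions for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Made-up points on a line through the origin, easy to check by hand.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

d_euclid = pdist(X, "euclidean")      # pairs (0,1), (0,2), (1,2): 5, 10, 5
d_mink2 = pdist(X, "minkowski", p=2)  # same as Euclidean
d_mink1 = pdist(X, "minkowski", p=1)  # same as 'cityblock': 7, 14, 7
d_city = pdist(X, "cityblock")

print(d_euclid, d_mink2, d_mink1, d_city)
```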
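cdist(XA, XB[, metric, p, V, VI, w]) Computes distance between each pair of the two collections of inputs.
The simplest form of XA and XB is a 2-D array holding a single vector each (they must be 2-D, otherwise you get ValueError: XA must be a 2-dimensional array.), in which case the result is the metric distance between those two vectors.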
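The following sketch illustrates both uses of cdist described above: two collections with several rows each, and two single vectors reshaped to one-row 2-D arrays. The arrays are made up for the example, and the try/except block only demonstrates the 2-D requirement.

```python
import numpy as np
from scipy.spatial import distance

XA = np.array([[0.0, 0.0],
               [1.0, 0.0]])
XB = np.array([[0.0, 3.0],
               [4.0, 0.0],
               [1.0, 1.0]])

# One row per observation in XA, one column per observation in XB.
D = distance.cdist(XA, XB, metric="euclidean")  # shape (2, 3)

# Two single vectors: reshape each to a (1, n) array first.
u = np.array([1.0, 2.0])
v = np.array([2.0, 4.0])
d_uv = distance.cdist(u.reshape(1, -1), v.reshape(1, -1), metric="cityblock")

# Passing 1-D arrays directly is rejected.
try:
    distance.cdist(u, v)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

print(D)
print(d_uv)
```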
squareform(X[, force, checks]) Converts a vector-form distance vector to a square-form distance matrix, and vice-versa.
In other words, it turns the condensed (vector-form) distances into a dense square matrix, and back.
Note: when a square matrix is passed in, the distance matrix 'X' must be symmetric and its diagonal must be zero.
皮皮blog
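>>> import numpy as np
>>> from scipy.spatial import distance as dis  # assumed alias; the import is not shown in the original
>>> x = np.array([[0, 2, 3], [2, 0, 6], [3, 6, 0]])
>>> x
array([[0, 2, 3],
       [2, 0, 6],
       [3, 6, 0]])
>>> y = dis.pdist(x)
>>> y
array([ 4.12310563,  5.83095189,  8.54400375])
>>> z = dis.squareform(y)
>>> z
array([[ 0.        ,  4.12310563,  5.83095189],
       [ 4.12310563,  0.        ,  8.54400375],
       [ 5.83095189,  8.54400375,  0.        ]])
>>> type(z)
numpy.ndarray
>>> type(y)
numpy.ndarray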
print(sim)
print(spatial.distance.cdist(sim[0].reshape((1, 2)), sim[1].reshape((1, 2)), metric='cosine'))
print(spatial.distance.pdist(sim, metric='cosine'))
[[-2.85 -0.45]
[[ 0.14790689]]
[ 0.14790689]
Predicates for checking the validity of distance matrices, both condensed and redundant. Also contained in this module are functions for computing the number of observations in a distance matrix.
is_valid_dm(D[, tol, throw, name, warning]) | Returns True if input array is a valid distance matrix. |
is_valid_y(y[, warning, throw, name]) | Returns True if the input array is a valid condensed distance matrix. |
num_obs_dm(d) | Returns the number of original observations that correspond to a square, redundant distance matrix. |
num_obs_y(Y) | Returns the number of original observations that correspond to a condensed distance matrix. |
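A minimal sketch of the four helpers in the table, applied to a condensed vector and its square form. The observations and the deliberately broken matrix bad are made up for the example.

```python
import numpy as np
from scipy.spatial.distance import (
    pdist, squareform, is_valid_dm, is_valid_y, num_obs_dm, num_obs_y,
)

# Four made-up observations: the corners of the unit square.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

y = pdist(X)       # condensed form, length 4*3/2 = 6
D = squareform(y)  # redundant square form, shape (4, 4)

ok_y = is_valid_y(y)    # True: valid condensed distance matrix
ok_dm = is_valid_dm(D)  # True: symmetric with a zero diagonal
n_y = num_obs_y(y)      # 4 observations recovered from the condensed form
n_dm = num_obs_dm(D)    # 4 observations recovered from the square form

# Breaking the symmetry makes the square matrix invalid.
bad = D.copy()
bad[0, 1] += 1.0
ok_bad = is_valid_dm(bad)  # False

print(ok_y, ok_dm, n_y, n_dm, ok_bad)
```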
ref: Distance computations (scipy.spatial.distance)
Spatial algorithms and data structures (scipy.spatial)
scipy-ref-0.14.0-p933