This article belongs to the column 《机器学习数学通关指南》, written by the author; please credit the source when quoting, and point out any gaps or mistakes in the comments. Thank you!
For the column's table of contents and references, see 《机器学习数学通关指南》.
Vector spaces are the mathematical foundation of machine learning: they provide a powerful framework for describing and manipulating high-dimensional data. In essence, a vector space is a set of vectors closed under certain algebraic operations:
Machine learning perspective: the space of feature vectors is a vector space, and every data point is a vector in that space.
Vector addition: $\mathbf{a} + \mathbf{b} = (a_1 + b_1, \dots, a_n + b_n)$, computed component-wise.
Scalar multiplication: $\lambda \mathbf{a} = (\lambda a_1, \dots, \lambda a_n)$, scaling every component by the same factor.
Inner product (dot product): $\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}||\mathbf{b}|\cos\theta = \sum_{i=1}^{n} a_i b_i$
Outer product (cross product): $\mathbf{a} \times \mathbf{b}$, defined for 3-dimensional vectors, yields a vector orthogonal to both operands.
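A minimal NumPy sketch of these four operations (the vectors are made-up examples):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(a + b)           # vector addition: [5. 7. 9.]
print(2.0 * a)         # scalar multiplication: [2. 4. 6.]
print(np.dot(a, b))    # inner product: 1*4 + 2*5 + 3*6 = 32.0
print(np.cross(a, b))  # cross product (3-D only): [-3.  6. -3.]
```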
Text vectorization:
from sklearn.feature_extraction.text import TfidfVectorizer
docs = [
    "machine learning is fun",
    "deep learning is a subset of machine learning",
    "vector spaces matter in NLP",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(X.toarray())  # TF-IDF vector representation of each document
Image representation: every image can be represented as a high-dimensional vector of pixel values (e.g., a 28×28 MNIST image is a 784-dimensional vector).
Recommender systems: both users and items can be represented as feature vectors, with the inner product measuring similarity / affinity, as in the sketch below.
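A toy sketch of inner-product scoring; the embeddings here are invented purely for illustration:

```python
import numpy as np

# made-up 4-dimensional embeddings for one user and three items
user = np.array([0.9, 0.1, 0.5, 0.0])
items = np.array([
    [1.0, 0.0, 0.4, 0.1],  # item 0
    [0.0, 1.0, 0.1, 0.9],  # item 1
    [0.8, 0.2, 0.6, 0.0],  # item 2
])

scores = items @ user            # one inner product per item
print(np.argsort(scores)[::-1])  # items ranked by predicted affinity: [0 2 1]
```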
A norm is a function that measures the "size" of a vector and satisfies three basic properties: non-negativity ($||\mathbf{x}|| \ge 0$, with equality iff $\mathbf{x} = \mathbf{0}$), absolute homogeneity ($||\lambda\mathbf{x}|| = |\lambda|\,||\mathbf{x}||$), and the triangle inequality ($||\mathbf{x} + \mathbf{y}|| \le ||\mathbf{x}|| + ||\mathbf{y}||$).
Common norms:
L1 norm (Manhattan distance): $||\mathbf{x}||_1 = \sum_{i=1}^n |x_i|$
L2 norm (Euclidean distance): $||\mathbf{x}||_2 = \sqrt{\sum_{i=1}^n x_i^2}$
L∞ norm (Chebyshev distance): $||\mathbf{x}||_\infty = \max(|x_1|, \dots, |x_n|)$
Frobenius norm (for matrices): $||A||_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n |a_{ij}|^2}$
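All four are available through np.linalg.norm; a quick check on a made-up vector and matrix:

```python
import numpy as np

x = np.array([3.0, -4.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

print(np.linalg.norm(x, 1))       # L1: |3| + |-4| = 7.0
print(np.linalg.norm(x, 2))       # L2: sqrt(9 + 16) = 5.0
print(np.linalg.norm(x, np.inf))  # L-infinity: max(3, 4) = 4.0
print(np.linalg.norm(A, 'fro'))   # Frobenius: sqrt(1+4+9+16) ≈ 5.477
```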
Unit ball shape: the L1 unit ball is a cross-polytope (a diamond in 2D), the L2 ball is a sphere, and the L∞ ball is a cube.
Sparsity and regularization: the corners of the L1 ball sit on the coordinate axes, which is why L1 penalties tend to produce sparse solutions, while L2 penalties shrink all weights smoothly. Normalizing feature vectors to unit norm is a related preprocessing step:
from sklearn.preprocessing import normalize
# L2 normalization: rescale each row to unit Euclidean length
X_normalized = normalize(X, norm='l2')
# L1 normalization: rescale each row to unit absolute sum
X_normalized = normalize(X, norm='l1')
L1 regularization (LASSO): $J(\theta) = \mathrm{MSE}(\theta) + \lambda ||\theta||_1$
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)  # alpha controls regularization strength
lasso.fit(X_train, y_train)
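As a quick check of the induced sparsity (assuming the X_train, y_train used above), you can count the coefficients Lasso drives exactly to zero:

```python
import numpy as np
print(np.sum(lasso.coef_ == 0), "of", lasso.coef_.size, "coefficients are exactly zero")
```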
L2 regularization (Ridge): $J(\theta) = \mathrm{MSE}(\theta) + \lambda ||\theta||_2^2$
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
Hybrid regularization (Elastic Net): combines the strengths of L1 and L2
from sklearn.linear_model import ElasticNet
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio sets the mix between L1 and L2
elastic.fit(X_train, y_train)
Norms also determine the distance metric in nearest-neighbor methods:
from sklearn.neighbors import KNeighborsClassifier
# Euclidean (L2) distance
knn_l2 = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
# Manhattan (L1) distance
knn_l1 = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
PCA: maximizes the variance of projections, measured with the L2 norm
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
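After fitting, the explained_variance_ratio_ attribute reports how much variance each retained component captures:

```python
print(pca.explained_variance_ratio_)        # fraction of variance per component
print(pca.explained_variance_ratio_.sum())  # total variance retained by 2 components
```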
t-SNE: a dimensionality-reduction method for visualization that preserves local similarity relations from the high-dimensional space. The end-to-end example below ties these pieces together on MNIST, comparing L1 and L2 regularization:
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load the MNIST dataset (as_frame=False returns plain NumPy arrays,
# which keeps the positional indexing used later in this article correct)
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False, parser='auto')
X = X.astype('float32')
y = y.astype('int')

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features (zero mean, unit variance)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare the effect of different regularization norms
models = {
    "L1 regularization (Lasso)": LogisticRegression(penalty='l1', solver='saga', C=0.1, max_iter=1000),
    "L2 regularization (Ridge)": LogisticRegression(penalty='l2', C=0.1, max_iter=1000)
}
results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = accuracy

    # Analyze weight sparsity
    coef = model.coef_
    non_zeros = np.mean(np.abs(coef) > 1e-5) * 100  # percentage of non-zero weights
    print(f"Model: {name}, accuracy: {accuracy:.4f}, non-zero weights: {non_zeros:.2f}%")

    # Visualize a few weight vectors
    plt.figure(figsize=(10, 5))
    for i in range(5):  # show only the first 5 classes
        plt.subplot(1, 5, i + 1)
        plt.imshow(coef[i].reshape(28, 28), cmap='viridis')
        plt.title(f"Class {i}")
        plt.axis('off')
    plt.tight_layout()
    plt.suptitle(f"Weight visualization: {name}")
    plt.show()
Kernel methods implicitly map the original feature space into a higher-dimensional space, turning linearly inseparable problems into linearly separable ones:
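For example, the RBF kernel used below evaluates similarity in that implicit space through a norm, $K(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma\,||\mathbf{x} - \mathbf{x}'||_2^2\right)$, so the geometry of the original space still enters via the Euclidean distance.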
from sklearn.svm import SVC
# SVMs with different kernel functions
svm_linear = SVC(kernel='linear')
svm_poly = SVC(kernel='poly', degree=3)
svm_rbf = SVC(kernel='rbf', gamma='scale')
# Compare performance on the same data
# (note: kernel SVMs scale poorly; fitting all of MNIST is slow, so consider subsampling)
svm_rbf.fit(X_train_scaled, y_train)
print(f"RBF-kernel SVM accuracy: {svm_rbf.score(X_test_scaled, y_test):.4f}")
When data lie on a low-dimensional manifold, Euclidean distance may no longer reflect the true relationships between points:
from sklearn.manifold import TSNE, Isomap
# t-SNE: dimensionality reduction for visualization, based on local similarity
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_train_scaled[:1000])  # subsample for speed
# Isomap: dimensionality reduction that preserves geodesic distances
isomap = Isomap(n_components=2, n_neighbors=10)
X_isomap = isomap.fit_transform(X_train_scaled[:1000])

plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_train[:1000], cmap='viridis', alpha=0.7)
plt.title("t-SNE")
plt.subplot(122)
plt.scatter(X_isomap[:, 0], X_isomap[:, 1], c=y_train[:1000], cmap='viridis', alpha=0.7)
plt.title("Isomap")
plt.show()
Modern deep learning relies heavily on tensors, the higher-order generalization of vector spaces:
import torch
import torch.nn as nn

# A simple convolutional neural network
class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        # Convolutional layers: feature extractor
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        # Fully connected layer: classifier
        self.fc = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        # Reshape: [batch, 784] -> [batch, 1, 28, 28]
        x = x.view(-1, 1, 28, 28)
        # Convolution + pooling
        x = self.pool(self.relu(self.conv1(x)))  # -> [batch, 16, 14, 14]
        x = self.pool(self.relu(self.conv2(x)))  # -> [batch, 32, 7, 7]
        # Flatten the tensor for classification
        x = x.view(-1, 32 * 7 * 7)
        x = self.fc(x)
        return x
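A quick shape check with a dummy batch (random data, purely to confirm the tensor shapes):

```python
model = ConvNet()
dummy = torch.randn(8, 784)  # a fake batch of 8 flattened images
print(model(dummy).shape)    # torch.Size([8, 10]): one logit per class
```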
The choice of norm shapes the optimization result:
Elastic Net: combines the L1 and L2 norms, balancing sparsity and stability
$J(\theta) = \mathrm{MSE}(\theta) + \lambda_1 ||\theta||_1 + \lambda_2 ||\theta||_2^2$
Group sparsity: mixed L2/L1 norms over groups of weights, encouraging whole feature groups to be selected or dropped together
$||w||_{2,1} = \sum_{j=1}^{m} \sqrt{\sum_{i=1}^{n_j} w_{ji}^2}$
Nuclear norm: the sum of a matrix's singular values, used for low-rank matrix approximation (e.g., in recommender systems); a small sketch follows
$||A||_* = \sum_{i} \sigma_i(A)$
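A minimal sketch computing the nuclear norm via SVD, on an arbitrary example matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
singular_values = np.linalg.svd(A, compute_uv=False)
print(singular_values.sum())     # nuclear norm ||A||_*
print(np.linalg.norm(A, 'nuc'))  # same value directly from np.linalg.norm
```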
# A custom Group Lasso penalty
def group_lasso_penalty(model, group_indices):
    """Sum of L2 norms over groups of parameters (promotes group sparsity)."""
    params = list(model.parameters())  # parameters() yields a generator, so materialize it
    penalty = 0.0
    for group in group_indices:
        # concatenate all parameters belonging to this group
        group_params = torch.cat([params[i].flatten() for i in group])
        penalty = penalty + torch.sqrt(torch.sum(group_params ** 2))
    return penalty
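A hedged usage sketch, assuming the ConvNet defined above and two made-up parameter groups (each group is a list of indices into list(model.parameters())):

```python
model = ConvNet()
logits = model(torch.randn(8, 784))                 # random inputs, for illustration only
loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (8,)))
loss = loss + 0.01 * group_lasso_penalty(model, group_indices=[[0, 1], [2, 3]])
loss.backward()
```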
Unit balls defined by different norms have different geometric shapes, which in turn affects the character of optimization solutions; the sketch below plots them in 2D.
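A small matplotlib sketch of the 2-D unit balls (boundary curves only):

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 400)
x, y = np.cos(theta), np.sin(theta)
for p, label in [(1, 'L1'), (2, 'L2'), (np.inf, 'L-inf')]:
    # rescale each direction vector so its p-norm is exactly 1
    norms = np.linalg.norm(np.stack([x, y]), ord=p, axis=0)
    plt.plot(x / norms, y / norms, label=label)
plt.gca().set_aspect('equal')
plt.legend()
plt.title("Unit balls of different norms")
plt.show()
```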
Orthogonal projection: $\mathrm{proj}_{\mathbf{u}}(\mathbf{v}) = \dfrac{\mathbf{u} \cdot \mathbf{v}}{\mathbf{u} \cdot \mathbf{u}}\,\mathbf{u}$
Gram-Schmidt orthogonalization: converts any linearly independent set of vectors into an orthonormal basis, as implemented below.
import numpy as np

def gram_schmidt(vectors):
    """Gram-Schmidt orthonormalization of a linearly independent set of vectors."""
    result = []
    for v in vectors:
        v = np.asarray(v, dtype=float)
        # subtract the projection of v onto each vector already in the basis
        for u in result:
            v = v - np.dot(v, u) / np.dot(u, u) * u
        # normalize to unit length
        result.append(v / np.linalg.norm(v))
    return np.array(result)
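A quick sanity check on made-up input; the rows of the result should satisfy $Q Q^T \approx I$:

```python
Q = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
print(np.round(Q @ Q.T, 6))  # close to the 3x3 identity matrix
```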
Feature selection with the L1 norm:
from sklearn.feature_selection import SelectFromModel
# Select features via an L1-regularized model
lasso = LogisticRegression(penalty='l1', solver='saga', C=0.1, max_iter=1000)
selector = SelectFromModel(lasso, prefit=False)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)
# Train on the reduced feature set
clf = LogisticRegression()
clf.fit(X_train_selected, y_train)
y_pred = clf.predict(X_test_selected)
print(f"Accuracy after feature selection: {accuracy_score(y_test, y_pred):.4f}")
print(f"Original features: {X_train_scaled.shape[1]}, selected features: {X_train_selected.shape[1]}")
Comparing nearest neighbors under different distance measures:
from sklearn.metrics.pairwise import cosine_similarity, manhattan_distances, euclidean_distances

# Pick a few representative test samples as queries
query_indices = [0, 100, 200]  # assumed to come from different classes
query_samples = X_test_scaled[query_indices]

# Find the nearest training neighbor under each metric
for i, query in enumerate(query_samples):
    print(f"\nQuery sample {i} (true label: {y_test[query_indices[i]]})")
    # Euclidean distance
    dist_l2 = euclidean_distances([query], X_train_scaled)[0]
    idx_l2 = np.argmin(dist_l2)
    print(f"Euclidean nearest neighbor: index {idx_l2}, label {y_train[idx_l2]}, distance {dist_l2[idx_l2]:.4f}")
    # Manhattan distance
    dist_l1 = manhattan_distances([query], X_train_scaled)[0]
    idx_l1 = np.argmin(dist_l1)
    print(f"Manhattan nearest neighbor: index {idx_l1}, label {y_train[idx_l1]}, distance {dist_l1[idx_l1]:.4f}")
    # Cosine similarity (larger means more similar)
    cos_sim = cosine_similarity([query], X_train_scaled)[0]
    idx_cos = np.argmax(cos_sim)
    print(f"Cosine nearest neighbor: index {idx_cos}, label {y_train[idx_cos]}, similarity {cos_sim[idx_cos]:.4f}")
Concept | Mathematical definition | ML applications |
---|---|---|
Vector space | A set of vectors closed under linear operations | Feature representation, mappings and transformations |
L1 norm | $\lVert \mathbf{x} \rVert_1 = \sum_{i=1}^n \lvert x_i \rvert$ | Feature selection, sparse models (e.g., LASSO regularization) |
L2 norm | $\lVert \mathbf{x} \rVert_2 = \sqrt{\sum_{i=1}^n x_i^2}$ | Preventing overfitting, distance computation (e.g., Ridge regularization, Euclidean distance) |
Inner product | $\mathbf{x} \cdot \mathbf{y} = \sum_i x_i y_i$ | Similarity measures, projection computations |
Practical advice: a deep understanding of vector spaces and norms comes not just from mastering the formulas but from building intuition through code and visualization. Try regularizing the same dataset with different norms and observe how model performance and weight distributions differ; it will cement these concepts.
I hope this guide helps you build a more solid theoretical footing in the mathematics of machine learning!