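Scikit-learn is one of the most widely used machine learning libraries for Python. It provides a broad set of algorithms and tools that cover the full machine learning workflow, from data preprocessing to model evaluation.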
- sklearn.datasets: sample datasets.
- sklearn.preprocessing: data preprocessing tools.
- sklearn.model_selection: model selection tools.
- sklearn.linear_model: linear models.
- sklearn.tree: decision tree models.
- sklearn.ensemble: ensemble learning models.
- sklearn.metrics: model evaluation metrics.

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
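# Standardize features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Impute missing values with the column mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
# One-hot encode categorical features (returns a sparse matrix by default)
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X)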
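The snippet above applies all three transformers to the same `X`, which is left undefined. As a minimal sketch, the version below runs end to end on a tiny made-up dataset; the arrays `X_num` and `X_cat` are assumptions introduced only for illustration, with imputation and scaling applied to the numeric columns and one-hot encoding applied to the categorical column.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns (one value missing)
# and one categorical column.
X_num = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_cat = np.array([["red"], ["blue"], ["red"]])

# Fill the missing value with the column mean: (10 + 30) / 2 = 20.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_num)

# Standardize each numeric column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_imputed)

# One-hot encode the categorical column. The encoder returns a sparse
# matrix by default, so convert it to a dense array for printing.
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_cat).toarray()

print(X_imputed[1, 1])  # 20.0
print(X_scaled.mean(axis=0))
print(encoder.categories_)
print(X_encoded)
```

Imputing before scaling keeps the filled-in value on the same scale as the observed ones.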
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load the data
data = load_iris()
X, y = data.data, data.target
# Split into training and test sets (fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
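# Train the model (X_train and y_train are assumed to hold a continuous target)
model = LinearRegression()
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Evaluate with mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f'MSE: {mse}')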
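The regression snippet above reuses `X_train` and `y_train` without saying where they come from. If they are the iris split from the classification example, the target is a class label rather than a continuous value. The sketch below is one self-contained alternative. The bundled diabetes dataset and the extra R² metric are choices made here for illustration, not taken from the original text.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load a small regression dataset that ships with scikit-learn.
X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit ordinary least squares and predict on the held-out samples.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# MSE is in squared target units; R^2 is unitless (1.0 is a perfect fit).
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.1f}, R^2: {r2:.3f}")
```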
from sklearn.cluster import KMeans
# Fit the model
model = KMeans(n_clusters=3)
model.fit(X)
# Get the cluster label assigned to each sample
labels = model.labels_
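As a runnable variant of the clustering snippet, the sketch below fits `KMeans` on the iris features and looks at a few more attributes of the fitted model. The `random_state` and `n_init` values are assumptions chosen here to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Clustering is unsupervised, so only the features are used.
X = load_iris().data

# n_init restarts the algorithm from several random initializations
# and keeps the run with the lowest inertia.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)

# One center per cluster, in the same feature space as X.
centers = model.cluster_centers_
print(centers.shape)

# Sum of squared distances from each sample to its nearest center.
print(model.inertia_)

# Assign samples to the nearest learned center.
new_labels = model.predict(X[:5])
print(new_labels)
```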
from sklearn.model_selection import GridSearchCV
# Parameter grid to search over
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}
# Grid search with 5-fold cross-validation
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
# Best parameter combination
print(f'Best parameters: {grid_search.best_params_}')
from sklearn.pipeline import Pipeline
# Create the pipeline: impute, then scale, then classify
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])
# Train the model
pipeline.fit(X_train, y_train)
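Grid search and pipelines can be combined, so that preprocessing is refit inside every cross-validation fold. The sketch below is one way to do that on the iris data. The smaller parameter grid and the fixed `random_state` values are assumptions made here to keep the example quick and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "classifier__n_estimators": [10, 50],
    "classifier__max_depth": [None, 10],
}

# Each candidate is scored with 5-fold cross-validation on the training set.
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# By default the best pipeline is refit on the full training set,
# so the search object can score the held-out test set directly.
test_accuracy = grid_search.score(X_test, y_test)
print(grid_search.best_params_)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Searching over the whole pipeline means the imputer and scaler never see validation-fold data during fitting.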
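Statsmodels is a library focused on statistical modeling and hypothesis testing. It provides a wide range of statistical models and tools and is well suited to statistical analysis.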
- statsmodels.api: the main API for statistical modeling.
- statsmodels.formula.api: formula-based model specification.
- statsmodels.tsa: time series analysis tools.

import statsmodels.api as sm
# Add a constant column (OLS does not include an intercept by default)
X = sm.add_constant(X)
# Fit the model
model = sm.OLS(y, X)
results = model.fit()
# Print the summary
print(results.summary())
# Fit a binomial GLM, i.e. logistic regression (y is assumed to be a binary 0/1 response)
model = sm.GLM(y, X, family=sm.families.Binomial())
results = model.fit()
# Print the summary
print(results.summary())
from statsmodels.tsa.arima.model import ARIMA
# Fit an ARIMA model with order (p, d, q) = (1, 1, 1)
model = ARIMA(y, order=(1, 1, 1))
results = model.fit()
# Print the summary
print(results.summary())
from statsmodels.stats.weightstats import ttest_ind
# Independent two-sample t-test; returns the statistic, the p-value, and the degrees of freedom
t_stat, p_value, dof = ttest_ind(X1, X2)
print(f'T-statistic: {t_stat}, P-value: {p_value}')
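The t-test snippet leaves `X1` and `X2` undefined. The sketch below is a self-contained run on two simulated samples whose means differ by one; the sample sizes and distributions are assumptions made for illustration. It also shows the `usevar='unequal'` option, which gives Welch's test when the two variances may differ.

```python
import numpy as np
from statsmodels.stats.weightstats import ttest_ind

rng = np.random.default_rng(0)
# Hypothetical samples: same spread, means 0 and 1.
X1 = rng.normal(loc=0.0, scale=1.0, size=100)
X2 = rng.normal(loc=1.0, scale=1.0, size=100)

# Default: pooled-variance t-test, degrees of freedom = n1 + n2 - 2.
t_stat, p_value, dof = ttest_ind(X1, X2)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, dof = {dof}")

# Welch's t-test: does not assume equal variances.
t_welch, p_welch, dof_welch = ttest_ind(X1, X2, usevar="unequal")
print(f"t = {t_welch:.2f}, p = {p_welch:.3g}, dof = {dof_welch:.1f}")
```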
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
# Fit the model; C(group) treats group as a categorical factor
model = ols('y ~ C(group)', data=df).fit()
# Run the ANOVA
anova_results = anova_lm(model)
print(anova_results)
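The ANOVA snippet assumes a DataFrame `df` with columns `y` and `group`. As a minimal sketch, the example below builds such a frame from three simulated groups and reads the p-value out of the resulting table; the group means (0, 0.5, 2) and sizes are assumptions invented for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
# Hypothetical data: three groups of 30 observations each.
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 30),
    "y": np.concatenate([
        rng.normal(0.0, 1.0, 30),
        rng.normal(0.5, 1.0, 30),
        rng.normal(2.0, 1.0, 30),
    ]),
})

# One-way ANOVA: does the mean of y differ across groups?
model = ols("y ~ C(group)", data=df).fit()
anova_results = anova_lm(model)
print(anova_results)

# The table is a DataFrame, so single entries can be looked up by label.
p_value = anova_results.loc["C(group)", "PR(>F)"]
print(f"p-value for group effect: {p_value:.3g}")
```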
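Feature | Scikit-learn | Statsmodels |
---|---|---|
Focus | Machine learning library | Statistical modeling library |
Main functionality | Classification, regression, clustering, dimensionality reduction | Linear regression, hypothesis testing, time series |
Model evaluation | Provides a variety of evaluation metrics | Provides detailed statistical results |
Typical use | Building predictive models | Statistical analysis and hypothesis testing |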