Day 11 Python Check-in Training Camp

Hyperparameter Tuning, Part 1

Review of Key Concepts

1.  Grid search

2.  Random search (brief overview; not a focus, as it is rarely used in practice and can be skipped)

3.  Bayesian optimization (two implementation approaches, and how to work around the requirement for cross-validation; see the sketch after the Bayesian optimization section)

4.  Timing code with the time library, so later readers can see how long each step takes to run (a minimal pattern is sketched below)
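
A minimal sketch of the timing pattern used throughout this post (time.perf_counter is an alternative to time.time with higher resolution for short intervals):

import time

start_time = time.time()
# ... the code being timed, e.g. model.fit(X_train, y_train) ...
end_time = time.time()
print(f"Elapsed: {end_time - start_time:.2f}s")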

Today's homework:

For the other models on the credit dataset, such as LightGBM and KNN, try Bayesian optimization and grid search (a starter sketch follows).
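
A hedged starter for the KNN half of the homework; the parameter ranges below are illustrative, not tuned, and LGBMClassifier from lightgbm slots into the same pattern:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; adjust the ranges to your data
knn_grid = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={'n_neighbors': [3, 5, 7, 11], 'weights': ['uniform', 'distance']},
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
# knn_grid.fit(X_train, y_train)  # X_train/y_train come from the split shown later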

Data Preprocessing

  1. Data loading and basic configuration: set up Chinese font rendering and load the dataset
  2. Categorical feature handling:
    • Label encoding: convert ordered categorical features (e.g. loan term, home ownership status) to numeric values
    • One-hot encoding: convert multi-class features (e.g. loan purpose) to binary features
  3. Missing value handling: fill continuous features with the median
  4. Feature engineering: renaming columns, type conversions, etc.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set up Chinese font rendering for matplotlib
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
data = pd.read_csv('data.csv')

# Collect the categorical (object-dtype) feature names for reference
discrete_features = data.select_dtypes(include=['object']).columns.tolist()

# 对"Home Ownership"进行标签编码
home_ownership_mapping = {'Own Home':1, 'Rent':2, 'Have Mortgage':3, 'Home Mortgage':4}
data['Home Ownership'] = data['Home Ownership'].map(home_ownership_mapping)

# 对"Years in current job"进行标签编码
years_in_job_mapping = {'< 1 year':1, '1 year':2, '2 years':3, '3 years':4, 
                       '4 years':5, '5 years':6, '6 years':7, '7 years':8,
                       '8 years':9, '9 years':10, '10+ years':11}
data['Years in current job'] = data['Years in current job'].map(years_in_job_mapping)

# 对"Purpose"进行独热编码
data = pd.get_dummies(data, columns=['Purpose'])
data2 = pd.read_csv("data.csv")
list_final = [i for i in data.columns if i not in data2.columns]
for i in list_final:
    data[i] = data[i].astype(int)

# 对"Term"进行映射并重命名
term_mapping = {'Short Term':0, 'Long Term':1}
data['Term'] = data['Term'].map(term_mapping)
data.rename(columns={'Term':'Long Term'}, inplace=True)

# Fill missing values in continuous features with the median
continuous_features = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
for feature in continuous_features:
    median_value = data[feature].median()
    data[feature] = data[feature].fillna(median_value)
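
A quick check (not in the original notes) that preprocessing did what the list above claims, i.e. every feature is now numeric and nothing is missing:

print("Object columns left:", data.select_dtypes(include=['object']).columns.tolist())
print("Missing values left:", data.isnull().sum().sum())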

Dataset Split

  • Stratified sampling keeps the class distribution consistent across splits
  • Final split ratio: 80% train, 10% validation, 10% test
  • A fixed random_state ensures reproducibility (a sketch of the split code follows this list)
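
The split code itself does not appear in the original notes. Below is a minimal sketch; the target column name 'Credit Default' is an assumption about this dataset, so adjust it to your copy. Two chained train_test_split calls yield the 80/10/10 stratified split, and X_val/y_val are reused by the cross-validation-free Bayesian sketch at the end of the post.

from sklearn.model_selection import train_test_split

# Assumed label column name; change it if your dataset differs
X = data.drop(columns=['Credit Default'])
y = data['Credit Default']

# First carve out the 80% training set, stratified on the label
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Split the remaining 20% evenly into validation and test sets (10% each overall)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)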

Baseline Model

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import time

print("--- 默认参数随机森林 ---")
start_time = time.time()
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
end_time = time.time()

print(f"训练耗时: {end_time-start_time:.2f}s")
print(classification_report(y_test, rf_pred))
print("混淆矩阵:\n", confusion_matrix(y_test, rf_pred))

Grid Search Tuning

  • Parameter grid: exhaustive search over 3-4 candidate values for each of 4 parameters (3 x 4 x 3 x 3 = 108 combinations, each evaluated with 5-fold CV, i.e. 540 model fits)
  • 5-fold cross-validation: ensures reliable evaluation
  • Took about 56 seconds (roughly 56x slower than the baseline model)
  • Best parameters: n_estimators=200, max_depth=20, etc.
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

print("--- 网格搜索 ---")
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    scoring='accuracy'
)

start_time = time.time()
grid_search.fit(X_train, y_train)
end_time = time.time()

print(f"最佳参数: {grid_search.best_params_}")
best_pred = grid_search.predict(X_test)
print(classification_report(y_test, best_pred))

Bayesian Optimization

  • Search space: the same ranges as the grid search, but explored adaptively; a surrogate model proposes each new candidate based on earlier results instead of enumerating every combination
  • Iterations: 32 (far fewer than the grid search's 108 combinations)
  • Took about 43 seconds, a good balance between search quality and time cost
  • Best parameters: n_estimators=85, max_depth=21, etc.
from skopt import BayesSearchCV
from skopt.space import Integer

search_space = {
    'n_estimators': Integer(50, 200),
    'max_depth': Integer(10, 30),
    'min_samples_split': Integer(2, 10),
    'min_samples_leaf': Integer(1, 4)
}

print("--- 贝叶斯优化 ---")
bayes_search = BayesSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    search_spaces=search_space,
    n_iter=32,
    cv=5,
    n_jobs=-1,
    scoring='accuracy',
    random_state=42  # also fix the sampler's seed so the search trajectory is reproducible
)

start_time = time.time()
bayes_search.fit(X_train, y_train)
end_time = time.time()

print(f"最佳参数: {bayes_search.best_params_}")
best_pred = bayes_search.predict(X_test)
print(classification_report(y_test, best_pred))
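
The review list promises a second implementation approach and a way around mandatory cross-validation. BayesSearchCV always cross-validates; skopt's lower-level gp_minimize lets you score each candidate on the held-out validation set instead. A minimal sketch, assuming the X_val/y_val split from above (the objective returns negative accuracy because gp_minimize minimizes):

from skopt import gp_minimize
from skopt.space import Integer
from skopt.utils import use_named_args

# Same ranges as above, expressed as a list of named dimensions
space = [
    Integer(50, 200, name='n_estimators'),
    Integer(10, 30, name='max_depth'),
    Integer(2, 10, name='min_samples_split'),
    Integer(1, 4, name='min_samples_leaf'),
]

@use_named_args(space)
def objective(**params):
    model = RandomForestClassifier(random_state=42, **params)
    model.fit(X_train, y_train)
    # Score on the held-out validation set; no cross-validation needed
    return -model.score(X_val, y_val)

result = gp_minimize(objective, space, n_calls=32, random_state=42)
print("Best parameters:", dict(zip([d.name for d in space], result.x)))
print(f"Best validation accuracy: {-result.fun:.4f}")

This trades cross-validation's robustness for speed: each of the 32 candidates is fit once instead of five times.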

@浙大疏锦行
