day 23

Machine learning pipelines

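The general workflow for building a pipeline:

1. Build several transformers to preprocess the features.

2. Build a ColumnTransformer that applies each preprocessing step to its own subset of columns, giving one complete preprocessor.

3. Build the full Pipeline, chaining the preprocessor and the model together.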
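The three steps can be sketched on a tiny example. This is only a minimal illustration: the toy DataFrame, its column names (`age`, `city`, `label`) and the choice of `LogisticRegression` are assumptions made up for the sketch, not part of the original notes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data invented for illustration; None in a float column becomes NaN.
df = pd.DataFrame({
    "age": [25.0, 32.0, None, 41.0, 29.0, 38.0],
    "city": ["A", "B", "A", "A", "B", "A"],
    "label": [0, 1, 0, 1, 1, 0],
})
X, y = df[["age", "city"]], df["label"]

# Step 1: one transformer per kind of feature.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Step 2: route each transformer to its own column subset.
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["age"]),
    ("cat", categorical_transformer, ["city"]),
])

# Step 3: chain the preprocessor and the model.
toy_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression()),
])
toy_pipeline.fit(X, y)
toy_predictions = toy_pipeline.predict(X)
print(toy_predictions)
```

Because the preprocessing lives inside the Pipeline, calling `fit` learns the imputation and scaling statistics from the training data only, and `predict` reuses them on new data.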

A generic pipeline

If you want a simple pipeline that works with any machine learning model, I think wrapping it in a class is a good way to do it:

The code is as follows:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
import pandas as pd
from sklearn.model_selection import train_test_split


class GenericPipelineBuilder:
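    # The constructor takes the model, the numeric features, the features to one-hot encode,
    # the features to ordinal encode, and a list giving the category order for each ordinal feature.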
    def __init__(self, model, numeric_features, onehot_features, ordinal_features, ordinal_categories=None):
        self.model = model
        self.numeric_features = numeric_features
        self.onehot_features = onehot_features
        self.ordinal_features = ordinal_features
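        # Fall back to 'auto' (categories inferred from the data) when no order is given;
        # an empty list per feature would make the encoder treat every value as unknown.
        self.ordinal_categories = ordinal_categories if ordinal_categories else 'auto'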

    # Build the pipeline for numeric features: missing-value imputation, then standardization
    def build_numeric_transformer(self):
        return Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ])

    # Build the pipeline for one-hot features: missing-value imputation, then one-hot encoding
    def build_onehot_transformer(self):
        return Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
        ])

    # Build the pipeline for ordinal features: missing-value imputation, then ordinal encoding
    def build_ordinal_transformer(self):
        return Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('ordinal', OrdinalEncoder(categories=self.ordinal_categories, handle_unknown='use_encoded_value',
                                       unknown_value=-1))
        ])

    # Build the overall ColumnTransformer, applying each feature pipeline to its own columns
    def build_preprocessor(self):
        return ColumnTransformer(
            transformers=[
                ('num', self.build_numeric_transformer(), self.numeric_features),
                ('onehot', self.build_onehot_transformer(), self.onehot_features),
                ('ordinal', self.build_ordinal_transformer(), self.ordinal_features)
            ],
            remainder='passthrough'
        )

    # Build the complete machine learning pipeline: preprocessing plus model
    def build_pipeline(self):
        return Pipeline(steps=[
            ('preprocessor', self.build_preprocessor()),
            ('classifier', self.model)
        ])
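To see what the ordinal step does with an explicit category order and with unseen values, here is a small standalone sketch. The education levels are the ones used in the example in this post; the unseen value `"Diploma"` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Category order from lowest to highest; position in the list becomes the code.
levels = ["High School", "Bachelor", "Master", "PhD"]

encoder = OrdinalEncoder(
    categories=[levels],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)

# Fit on one column of training values (2D input, one row per sample).
train_col = np.array([["Bachelor"], ["PhD"], ["High School"], ["Master"]], dtype=object)
encoder.fit(train_col)

# "Diploma" was never declared as a category, so it is mapped to unknown_value.
new_col = np.array([["High School"], ["PhD"], ["Diploma"]], dtype=object)
codes = encoder.transform(new_col).ravel().tolist()
print(codes)  # [0.0, 3.0, -1.0]
```

Passing the order explicitly matters here: with inferred categories the encoder sorts the values, which for these strings would put "Bachelor" before "High School".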

To use it, just instantiate the class and call its methods.

For example:

    data = pd.read_csv('your_data.csv')
    y = data['target_column']
    X = data.drop('target_column', axis=1)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    numeric_features = X.select_dtypes(include=['number']).columns.tolist()
    onehot_features = ['Color', 'City']
    ordinal_features = ['Education_Level']
    ordinal_categories = [['High School', 'Bachelor', 'Master', 'PhD']]

    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(random_state=42)

    builder = GenericPipelineBuilder(model, numeric_features, onehot_features, ordinal_features, ordinal_categories)
    pipeline = builder.build_pipeline()
    pipeline.fit(X_train, y_train)
    print("Model training complete")
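Since the point of the wrapper is to work with any model, one way to check that is to swap the final step and score each candidate on held-out data. The sketch below is standalone and rests on assumptions: the synthetic DataFrame and its `income` column are invented, and a plain Pipeline stands in for the builder class so the block can run on its own.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data invented for illustration only.
demo = pd.DataFrame({
    "income": [float(i) for i in range(40)],
    "Color": ["red", "blue"] * 20,
    "target": [0] * 20 + [1] * 20,
})
X_demo = demo.drop("target", axis=1)
y_demo = demo["target"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Same step names as in the post: 'preprocessor' followed by 'classifier'.
base = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["income"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Color"]),
    ])),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=42)),
])

candidates = [
    ("random_forest", RandomForestClassifier(n_estimators=20, random_state=42)),
    ("logistic_regression", LogisticRegression()),
]

scores = {}
for name, estimator in candidates:
    # clone() gives an unfitted copy; set_params replaces the 'classifier' step.
    candidate = clone(base).set_params(classifier=estimator)
    candidate.fit(X_tr, y_tr)
    scores[name] = candidate.score(X_te, y_te)  # accuracy on the held-out split

print(scores)
```

With the class from this post, the same idea should apply to the object returned by `build_pipeline()`, since its last step is also named `classifier`.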

@浙大疏锦行
