Extracting & Plotting Feature Names & Importance from Scikit-Learn Pipelines

If you have ever been tasked with productionalizing a machine learning model, you probably know that the Scikit-Learn library offers one of the best ways, if not the best way, of creating production-quality machine learning workflows. The ecosystem's Pipeline, ColumnTransformer, preprocessor, imputer & feature-selection classes are powerful tools that transform raw data into model-ready features.

However, before anyone will let you deploy to production, you will want at least a minimal understanding of how the new model works. The most common way to explain how a black-box model works is by plotting feature names and importance values. If you have ever tried to extract the feature names from a heterogeneous dataset processed by ColumnTransformer, you know that this is no easy task. Exhaustive Internet searches have only turned up threads where others have asked the same question or offered a partial answer, rather than a comprehensive and satisfying solution.

To remedy this situation, I have developed a class called FeatureImportance that extracts feature names and importance values from a Pipeline instance and then uses the Plotly library to plot the feature importance with only a few lines of code. In this post, I will load a fitted Pipeline, demonstrate how to use my class, and then give an overview of how it works. The complete code can be found here or at the end of this blog post.

There are two things I should note before continuing:

  1. I credit Joey Gao’s code on this thread with showing the way to tackle this problem.

  2. My post assumes that you have worked with Scikit-Learn and Pandas before and are familiar with how ColumnTransformer, Pipeline & preprocessing classes facilitate reproducible feature engineering processes. If you need a refresher, check out this Scikit-Learn example.

Creating a Pipeline

For the purposes of demonstration, I’ve written a script called fit_pipeline_ames.py. It loads the Ames housing training data from Kaggle and fits a moderately complex Pipeline. I’ve plotted its visual representation below.

from sklearn import set_config
from sklearn.utils import estimator_html_repr
from IPython.core.display import display, HTML
from fit_pipeline_ames import *  # create & fit the pipeline; exposes the fitted `pipe` object
set_config(display='diagram')  # render scikit-learn estimators as HTML diagrams
display(HTML(estimator_html_repr(pipe)))  # draw the pipeline diagram in the notebook
[Figure 1: HTML diagram of the fitted Pipeline]

This pipe instance contains the following 4 steps (a sketch of the overall structure follows the list):

  1. The ColumnTransformer instance is composed of 3 Pipelines, containing a total of 4 transformer instances, including SimpleImputer, OneHotEncoder & GLMMEncoder from the category_encoders package. See my previous blog post for a full explanation of how I dynamically constructed this particular ColumnTransformer.

  2. The VarianceThreshold uses the default threshold of 0, which removes any features that contain only a single value. Some models will fail if a feature has no variance.

  3. The SelectPercentile uses the f_regression scoring function with a percentile threshold of 90. These settings retain the top 90% of features and discard the bottom 10%.

  4. The CatBoostRegressor model is fit to the SalePrice dependent variable using the features created and selected in the preceding steps.

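To make the structure concrete, below is a minimal sketch of how a pipeline with this shape might be assembled. The column lists and hyperparameters are illustrative placeholders, not the actual choices made in fit_pipeline_ames.py:

from catboost import CatBoostRegressor  # assumes the catboost package is installed
from category_encoders import GLMMEncoder  # assumes the category_encoders package is installed
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectPercentile, VarianceThreshold, f_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# numeric columns: impute & flag the imputed cells
numeric_pipe = Pipeline([('imputer', SimpleImputer(add_indicator=True))])

# low-cardinality categoricals: impute, then one-hot encode
# (the column-adding OneHotEncoder comes last, as FeatureImportance requires)
onehot_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                        ('encoder', OneHotEncoder(handle_unknown='ignore'))])

# high-cardinality categoricals: impute, then target-encode
glmm_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('encoder', GLMMEncoder())])

preprocessor = ColumnTransformer([
    ('numeric', numeric_pipe, ['GrLivArea', 'LotArea']),  # hypothetical column lists
    ('onehot', onehot_pipe, ['LotShape', 'LandSlope']),
    ('glmm', glmm_pipe, ['Neighborhood'])])

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('variance', VarianceThreshold(threshold=0)),  # drop zero-variance features
    ('select', SelectPercentile(f_regression, percentile=90)),  # keep the top 90% of features
    ('model', CatBoostRegressor(verbose=False))])

# pipe.fit(X_train, y_train)  # y_train is the SalePrice column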

Plotting the Feature Importance

With the help of FeatureImportance, we can extract the feature names and importance values and plot them with 3 lines of code.

from feature_importance import FeatureImportance
feature_importance = FeatureImportance(pipe)
feature_importance.plot(top_n_features=25)
[Figure 2: Plotly bar chart of the top 25 feature importances]

The plot method takes a number of arguments that control the plot's display. The most important ones are described below; an example call follows the list.

  • top_n_features: This controls how many features will be plotted. The default value is 100. The plot's title will indicate this value as well as how many features there are in total. To plot all features, just set top_n_features to a number larger than the total number of features.

  • rank_features: This argument controls whether the integer ranks are displayed in front of the feature names. The default is True. I find that this aids with interpretation, especially when comparing the feature importance from multiple models.

  • max_scale: This determines whether the importance values are scaled by the maximum value & multiplied by 100. The default is True. I find that this enables an intuitive way to compare how important other features are vis-à-vis the most important one. For instance, in the plot above, we can say that GrLivArea is about 81% as important to the model as the top feature, OverallQual.

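For instance, a call like the following (a hypothetical combination of arguments, purely to illustrate them) plots every feature with its raw importance values and no rank prefixes:

feature_importance.plot(top_n_features=1000,  # larger than the total feature count, so all features are plotted
                        rank_features=False,  # no integer ranks in front of the feature names
                        max_scale=False)  # raw importance values instead of scaling by the maximum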

How It Works

The FeatureImportance class should be instantiated using a fitted Pipeline instance. (You can also set the verbose argument to True if you want all of the diagnostics printed to your console.) My class validates that this Pipeline starts with a ColumnTransformer instance and ends with a regression or classification model that has the feature_importances_ attribute. The Pipeline can include any number of classes from sklearn.feature_selection as intermediate steps, or none at all.

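For example, assuming the fitted pipe from earlier, enabling the diagnostics is a one-liner:

feature_importance = FeatureImportance(pipe, verbose=True)  # prints diagnostics for every step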

The FeatureImportance class is composed of 4 methods.

  1. get_feature_names was the hardest method to devise. It iterates through the ColumnTransformer transformers, uses the hasattr function to discern what type of class it is dealing with, and pulls the feature names accordingly. (Special note: If the ColumnTransformer contains Pipelines and one of the transformers in a Pipeline adds completely new columns, it must come last in that Pipeline. For example, OneHotEncoder, MissingIndicator & SimpleImputer(add_indicator=True) add columns to the dataset that didn't exist before, so they should come last in the Pipeline.)

  2. get_selected_features calls get_feature_names. Then it tests whether the main Pipeline contains any classes from sklearn.feature_selection, based upon the existence of the get_support method. If it does, this method returns only the feature names that were retained by the selector class or classes. It stores the features that were not selected in the discarded_features attribute. Here are the 24 features that were removed by the selectors in my pipeline:

feature_importance.discarded_features

['BsmtUnfSF', 'HalfBath', 'LotShape_IR2', 'LandSlope_Mod', 'LandSlope_Sev', 'Condition1_Artery', 'Condition1_Norm', 'BldgType_2fmCon', 'HouseStyle_2.5Fin', 'HouseStyle_2.5Unf', 'MasVnrType_None', 'ExterQual_Gd', 'ExterCond_Fa', 'ExterCond_TA', 'Foundation_BrkTil', 'Foundation_CBlock', 'BsmtCond_missing_value', 'BsmtFinType1_LwQ', 'BsmtFinType1_Rec', 'Electrical_Mix', 'GarageType_Detchd', 'GarageCond_Fa', 'SaleType_ConLw']
  3. get_feature_importance calls get_selected_features and then creates a Pandas Series where the values are the feature importance values from the model and the index is the feature names created by the first 2 methods. This Series is then stored in the feature_importance attribute.

  4. plot calls get_feature_importance and plots the output based upon the specifications. (A short usage sketch of these methods follows this list.)

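Each of these methods can also be called directly; here is a small usage sketch (the exact output will depend on your pipeline):

all_names = feature_importance.get_feature_names()  # names produced by the ColumnTransformer
kept_names = feature_importance.get_selected_features()  # names retained after feature selection
importances = feature_importance.get_feature_importance()  # Pandas Series of importance values
print(len(all_names), len(kept_names))
print(importances.nlargest(5))  # the five most important features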

Code

The original notebook for this blog post can be found here. The complete code for FeatureImportance is shown below and can be found here.

If you create a Pipeline that you believe should be supported by FeatureImportance but is not, please provide a reproducible example, and I will consider making the necessary changes.

Stay tuned for further posts on training & regularizing models with Scikit-Learn ColumnTransformers and Pipelines. Let me know if you found this post helpful or have any ideas for improvement. Thanks!

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.utils.validation import check_is_fitted
import plotly.express as px

class FeatureImportance:

    """
    Extract & plot the feature names & importance values from a Scikit-Learn Pipeline.

    The input is a Pipeline that starts with a ColumnTransformer & ends with a regression or classification model.
    As intermediate steps, the Pipeline can have any number of instances from sklearn.feature_selection, or none.

    Note:
    If the ColumnTransformer contains Pipelines and if one of the transformers in the Pipeline is adding completely new columns,
    it must come last in the pipeline. For example, OneHotEncoder, MissingIndicator & SimpleImputer(add_indicator=True) add columns
    to the dataset that didn't exist before, so they should come last in the Pipeline.

    Parameters
    ----------
    pipeline : a Scikit-Learn Pipeline where a ColumnTransformer is the first element and a model estimator is the last element
    verbose : a boolean. Whether to print all of the diagnostics. Default is False.

    Attributes
    ----------
    feature_importance : a Pandas Series containing the feature importance values, with the feature names as the index.
    discarded_features : a list of the feature names that were not selected by a sklearn.feature_selection instance.
    plot_importances_df : a Pandas DataFrame containing the subset of features and values that are actually displayed in the plot.
    """
    def __init__(self, pipeline, verbose=False):
        self.pipeline = pipeline
        self.verbose = verbose

    def get_feature_names(self, verbose=None):
        """

        Get the column names from the a ColumnTransformer containing transformers & pipelines

        Parameters
        ----------
        verbose : a boolean indicating whether to print summaries. 
            default = False


        Returns
        -------
        a list of the correct feature names

        Note: 
        If the ColumnTransformer contains Pipelines and if one of the transformers in the Pipeline is adding completely new columns, 
        it must come last in the pipeline. For example, OneHotEncoder, MissingIndicator & SimpleImputer(add_indicator=True) add columns 
        to the dataset that didn't exist before, so there should come last in the Pipeline.

        Inspiration: https://github.com/scikit-learn/scikit-learn/issues/12525 

        """


        if verbose is None:
            verbose = self.verbose
            
        if verbose: print('''\n\n---------\nRunning get_feature_names\n---------\n''')
        
        column_transformer = self.pipeline[0]        
        assert isinstance(column_transformer, ColumnTransformer), "Input isn't a ColumnTransformer"
        check_is_fitted(column_transformer)

        new_feature_names = []

        for i, transformer_item in enumerate(column_transformer.transformers_): 
            
            transformer_name, transformer, orig_feature_names = transformer_item
            orig_feature_names = list(orig_feature_names)
            
            if verbose:
                print('\n\n', i, '. Transformer/Pipeline: ', transformer_name, ',',
                      transformer.__class__.__name__, '\n')
                print('\tn_orig_feature_names:', len(orig_feature_names))

            if transformer_name == 'remainder' and transformer == 'drop':
                # the remaining columns were dropped, so they contribute no feature names
                continue

            if isinstance(transformer, Pipeline):
                # if pipeline, get the last transformer in the Pipeline
                transformer = transformer.steps[-1][1]

            if hasattr(transformer, 'get_feature_names'):
                # transformers like OneHotEncoder generate their own output names;
                # some accept the input feature names as an argument, others do not
                if 'input_features' in transformer.get_feature_names.__code__.co_varnames:
                    names = list(transformer.get_feature_names(orig_feature_names))
                else:
                    names = list(transformer.get_feature_names())

            elif hasattr(transformer, 'indicator_') and transformer.add_indicator:
                # is this transformer one of the imputers & did it call the MissingIndicator?
                missing_indicator_indices = transformer.indicator_.features_
                missing_indicators = [orig_feature_names[idx] + '_missing_flag'
                                      for idx in missing_indicator_indices]
                names = orig_feature_names + missing_indicators

            elif hasattr(transformer, 'features_'):
                # is this a MissingIndicator class? if so, it outputs only the indicator columns
                missing_indicator_indices = transformer.features_
                missing_indicators = [orig_feature_names[idx] + '_missing_flag'
                                      for idx in missing_indicator_indices]
                names = missing_indicators

            else:
                # otherwise, the transformer doesn't change the feature names
                names = orig_feature_names

            if verbose:
                print('\tn_new_features:', len(names))
                print('\tnew_features:\n', names)

            new_feature_names.extend(names)

        return new_feature_names

    def get_selected_features(self, verbose=None):
        """

        Get the Feature Names that were retained after Feature Selection (sklearn.feature_selection)

        Parameters
        ----------
        verbose : a boolean indicating whether to print summaries. default = False

        Returns
        -------
        a list of the correct feature names


        """


        if verbose is None:
            verbose = self.verbose


        assert isinstance(self.pipeline, Pipeline), "Input isn't a Pipeline"


        features = self.get_feature_names()
        
        if verbose: print('\n\n---------\nRunning get_selected_features\n---------\n')
            
        all_discarded_features = []

        for i, step_item in enumerate(self.pipeline.steps):

            step_name, step = step_item

            if hasattr(step, 'get_support'):

                if verbose: print('\nStep ', i, ": ", step_name, ',',
                                  step.__class__.__name__, '\n')

                check_is_fitted(step)

                feature_mask = step.get_support()
                # compute both lists from the incoming features *before* reassigning,
                # so the discarded names are zipped against the mask they belong to
                retained_features = [feature for feature, is_retained in zip(features, feature_mask)
                                     if is_retained]
                discarded_features = [feature for feature, is_retained in zip(features, feature_mask)
                                      if not is_retained]
                features = retained_features
                all_discarded_features.extend(discarded_features)

                if verbose:
                    print(f'\t{len(features)} retained, {len(discarded_features)} discarded')
                    if len(discarded_features) > 0:
                        print('\n\tdiscarded_features:\n\n', discarded_features)

        self.discarded_features = all_discarded_features

        return features


    def get_feature_importance(self):
        
        """
        Creates a Pandas Series where values are the feature importance values from the model and feature names are set as the index. 
        
        This Series is stored in the `feature_importance` attribute.

        Returns
        -------
        A pandas Series containing the feature importance values and feature names as the index.
        
        """
        
        assert isinstance(self.pipeline, Pipeline), "Input isn't a Pipeline"

        features = self.get_selected_features()

        assert hasattr(self.pipeline[-1], 'feature_importances_'),\
            "The last element in the pipeline isn't an estimator with a feature_importances_ attribute"
        
        importance_values = self.pipeline[-1].feature_importances_
        
        assert len(features) == len(importance_values),\
            "The number of feature names & importance values doesn't match"
        
        feature_importance = pd.Series(importance_values, index=features)
        self.feature_importance = feature_importance
        
        return feature_importance
        
    
    def plot(self, top_n_features=100, rank_features=True, max_scale=True, 
             display_imp_values=True, display_imp_value_decimals=1,
             height_per_feature=25, orientation='h', width=750, height=None, 
             str_pad_width=15, yaxes_tickfont_family='Courier New', 
             yaxes_tickfont_size=15):
        """

        Plot the Feature Names & Importances 


        Parameters
        ----------

        top_n_features : the number of features to plot, default is 100
        rank_features : whether to rank the features with integers, default is True
        max_scale : Should the importance values be scaled by the maximum value & mulitplied by 100?  Default is True.
        display_imp_values : Should the importance values be displayed? Default is True.
        display_imp_value_decimals : If display_imp_values is True, how many decimal places should be displayed. Default is 1.
        height_per_feature : if height is None, the plot height is calculated by top_n_features * height_per_feature. 
        This allows all the features enough space to be displayed
        orientation : the plot orientation, 'h' (default) or 'v'
        width :  the width of the plot, default is 500
        height : the height of the plot, the default is top_n_features * height_per_feature
        str_pad_width : When rank_features=True, this number of spaces to add between the rank integer and feature name. 
            This will enable the rank integers to line up with each other for easier reading. 
            Default is 15. If you have long feature names, you can increase this number to make the integers line up more.
            It can also be set to 0.
        yaxes_tickfont_family : the font for the feature names. Default is Courier New.
        yaxes_tickfont_size : the font size for the feature names. Default is 15.

        Returns
        -------
        plot

        """
        if height is None:
            height = top_n_features * height_per_feature
            
        # prep the data
        
        all_importances = self.get_feature_importance()
        n_all_importances = len(all_importances)
        
        plot_importances_df =\
            all_importances\
            .nlargest(top_n_features)\
            .sort_values()\
            .to_frame('value')\
            .rename_axis('feature')\
            .reset_index()
                
        if max_scale:
            plot_importances_df['value'] = \
                                plot_importances_df.value.abs() /\
                                plot_importances_df.value.abs().max() * 100
            
        self.plot_importances_df = plot_importances_df.copy()
        
        if len(all_importances) < top_n_features:
            title_text = 'All Feature Importances'
        else:
            title_text = f'Top {top_n_features} (of {n_all_importances}) Feature Importances'       
        
        if rank_features:
            padded_features = \
                plot_importances_df.feature\
                .str.pad(width=str_pad_width)\
                .values
            
            ranked_features =\
                plot_importances_df.index\
                .to_series()\
                .sort_values(ascending=False)\
                .add(1)\
                .astype(str)\
                .str.cat(padded_features, sep='. ')\
                .values

            plot_importances_df['feature'] = ranked_features
        
        if display_imp_values:
            text = plot_importances_df.value.round(display_imp_value_decimals)
        else:
            text = None

        # create the plot
        fig = px.bar(plot_importances_df, 
                     x='value', 
                     y='feature',
                     orientation=orientation, 
                     width=width, 
                     height=height,
                     text=text)
        fig.update_layout(title_text=title_text, title_x=0.5) 
        fig.update(layout_showlegend=False)
        fig.update_yaxes(tickfont=dict(family=yaxes_tickfont_family, 
                                       size=yaxes_tickfont_size),
                         title='')
        fig.show()

Source: https://towardsdatascience.com/extracting-plotting-feature-names-importance-from-scikit-learn-pipelines-eb5bfa6a31f4
