[Feature Engineering] Getting Started with Feature Engineering

Baseline Model

First, a definition of the concept, from "What does 'baseline' mean in machine learning?" on Quora:

A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a dataset. You can use these predictions to measure the baseline's performance (e.g., accuracy) -- this metric will then become what you compare any other machine learning algorithm against.
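As a concrete illustration (not from the original article), a majority-class baseline can be sketched with scikit-learn's DummyClassifier:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data: 70 negatives, 30 positives
X = np.zeros((100, 1))
y = np.array([0] * 70 + [1] * 30)

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

# Accuracy of the baseline; any real model should beat this
acc = baseline.score(X, y)
```

Here the baseline accuracy is simply the majority-class rate; a model that cannot beat it has learned nothing useful.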

Now let's walk through the example in detail.
Load the data:

import pandas as pd
ks = pd.read_csv('../input/kickstarter-projects/ks-projects-201801.csv',
                 parse_dates=['deadline', 'launched'])
ks.head(10)

As the preview shows, the task is to predict whether a Kickstarter project will succeed. The state column is the target; category, currency, funding goal, country, and the launched timestamp serve as features.

pd.unique(ks.state) returns the unique values of the state column (pandas.unique). Like SQL's SELECT DISTINCT, it returns the distinct labels.

ks.groupby('state')['ID'].count() counts the rows for each state.
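A quick sketch on toy data (illustrative, not the actual Kickstarter file) shows what these two calls return:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'state': ['successful', 'failed', 'failed', 'live'],
})

# Unique labels, like SQL's SELECT DISTINCT
states = pd.unique(df['state'])

# Row count per state
counts = df.groupby('state')['ID'].count()
```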

pandas.DataFrame.query filters the data. In the example below it selects all rows whose state is not live (analogous to a SQL WHERE clause):

ks = ks.query('state != "live"')
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))

Add an outcome column: set it to 1 if the state is successful, 0 otherwise.
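On the same kind of toy data, the query/assign pair behaves like this:

```python
import pandas as pd

df = pd.DataFrame({'state': ['successful', 'failed', 'live']})

# Drop projects that are still live
df = df.query('state != "live"')

# Binary target: 1 for successful, 0 otherwise
df = df.assign(outcome=(df['state'] == 'successful').astype(int))
```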

Datetime features

Using pandas.DataFrame.assign together with the Series datetime properties (.dt), we can split the launch timestamp into components:

ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)

Preprocessing categorical labels

sklearn.preprocessing.LabelEncoder automatically builds a dictionary index over the label values, assigning each label an integer:

from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()

# Apply the label encoder to each column
encoded = ks[cat_features].apply(encoder.fit_transform)
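A small sketch of what LabelEncoder produces (toy values; note that the classes are sorted alphabetically before being numbered):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'currency': ['USD', 'GBP', 'USD', 'EUR']})

encoder = LabelEncoder()
# fit_transform learns the sorted classes, then maps each value to its index
df['currency_enc'] = encoder.fit_transform(df['currency'])
```

When applied column by column via DataFrame.apply, as above, a fresh mapping is learned for each column.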

LightGBM

LightGBM describes itself as:

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Categorical Encodings

For categorical features there are several different encoding techniques. The Python package Category Encoders implements many of them; a few common methods follow:

Count Encoding

Import the category_encoders package and list the categorical feature columns:

import category_encoders as ce
cat_features = ['category', 'currency', 'country']

CountEncoder is used as category_encoders.CountEncoder().fit_transform(categorical columns):

count_enc = ce.CountEncoder()
count_encoded = count_enc.fit_transform(ks[cat_features])
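Conceptually, count encoding replaces each category with its frequency in the column; the same logic in plain pandas (toy data) looks like:

```python
import pandas as pd

df = pd.DataFrame({'country': ['US', 'GB', 'US', 'US', 'DE']})

# Each category is replaced by how many times it occurs in the column
df['country_count'] = df['country'].map(df['country'].value_counts())
```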

Join the count-encoded columns (with a _count suffix) back onto the original frame to build the training data:

data = baseline_data.join(count_encoded.add_suffix("_count"))

# Training a model on the baseline data
train, valid, test = get_data_splits(data)
bst = train_model(train, valid)

The get_data_splits() function splits the dataset into training, validation, and test sets. It simply slices by row index; no shuffling or stratification is used:

def get_data_splits(dataframe, valid_fraction=0.1):
    valid_size = int(len(dataframe) * valid_fraction)

    train = dataframe[:-valid_size * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_size * 2:-valid_size]
    test = dataframe[-valid_size:]
    
    return train, valid, test
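Applied to a 100-row frame, this yields an 80/10/10 split (the function is restated here so the snippet is self-contained):

```python
import pandas as pd

def get_data_splits(dataframe, valid_fraction=0.1):
    valid_size = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_size * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_size * 2:-valid_size]
    test = dataframe[-valid_size:]
    return train, valid, test

df = pd.DataFrame({'x': range(100)})
train, valid, test = get_data_splits(df)
```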

Target Encoder

The difference between TargetEncoder and CountEncoder is that fitting the encoder also requires the target column:

target_enc = ce.TargetEncoder(cols=cat_features)
target_enc.fit(train[cat_features], train['outcome'])
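Target encoding replaces each category with the mean of the target over that category; the core idea in plain pandas terms (toy data) is:

```python
import pandas as pd

train = pd.DataFrame({
    'country': ['US', 'US', 'GB', 'GB', 'US'],
    'outcome': [1, 0, 1, 1, 1],
})

# Mean target per category, mapped back onto the rows
means = train.groupby('country')['outcome'].mean()
train['country_target'] = train['country'].map(means)
```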

CatBoost: unbiased boosting with categorical features

CatBoostEncoder is similar to TargetEncoder, except that the target statistics for each row are computed from only the rows that precede it, which mitigates target leakage:

cb_enc = ce.CatBoostEncoder(cols=cat_features)
cb_enc.fit(train[cat_features], train['outcome'])
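The ordered scheme can be sketched in plain pandas: for each row, only earlier rows of the same category contribute to the statistic (CatBoostEncoder additionally blends in a prior, omitted here for clarity):

```python
import pandas as pd

train = pd.DataFrame({
    'country': ['US', 'US', 'US', 'GB'],
    'outcome': [1, 0, 1, 1],
})

# For each row, the mean outcome of PRIOR rows in the same category
# (NaN when a category has no earlier rows)
prior_mean = train.groupby('country')['outcome'].transform(
    lambda s: s.shift().expanding().mean())
```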

More methods are covered in:
Categorical Encoding Methods
One-hot Encoding

Feature Generation

Feature Selection
