Baseline Model
First, let's introduce the concept (from "What does 'baseline' mean in machine learning?" on Quora):
A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a dataset. You can use these predictions to measure the baseline's performance (e.g., accuracy)-- this metric will then become what you compare any other machine learning algorithm against.
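As a concrete illustration (a sketch of mine, not from the quote above): scikit-learn's DummyClassifier implements exactly this kind of heuristic baseline, for example always predicting the most frequent class:

```python
from sklearn.dummy import DummyClassifier

X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, 1, 1, 0, 0]  # majority class is 1

# Heuristic baseline: always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
acc = baseline.score(X, y)  # 4 of the 6 labels are the majority class
```

Any real model we train later should beat this accuracy, otherwise it has learned nothing useful.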
Now let's walk through this case in detail:
Reading the data:
import pandas as pd
ks = pd.read_csv('../input/kickstarter-projects/ks-projects-201801.csv',
                 parse_dates=['deadline', 'launched'])
ks.head(10)
As we can see, the task is to predict whether a Kickstarter project will succeed. The state column is the target; category, currency, funding goal, country, and launched time can serve as features.
pd.unique(ks.state)
This returns the unique values of the state column. pandas.unique finds the unique values, much like SQL's SELECT DISTINCT, and returns them as an array of labels.
ks.groupby('state')['ID'].count()
This counts the records for each state.
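On a small made-up frame, the two calls above behave as follows (a sketch; the real ks data has many more columns):

```python
import pandas as pd

ks = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'state': ['successful', 'failed', 'failed', 'live']})

print(pd.unique(ks.state))                # unique labels, in order of appearance
print(ks.groupby('state')['ID'].count())  # number of rows per state
```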
pandas.DataFrame.query filters the data. In the example below we keep every row whose state is not live (comparable to a SQL SELECT with a WHERE clause):
ks = ks.query('state != "live"')
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))
This adds an outcome column: 1 if the state is successful, 0 otherwise.
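Continuing with a made-up frame (a sketch), filtering out live projects and deriving the binary outcome:

```python
import pandas as pd

ks = pd.DataFrame({'state': ['successful', 'failed', 'live', 'canceled']})
ks = ks.query('state != "live"')  # drop in-progress projects
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))
print(ks['outcome'].tolist())  # [1, 0, 0]
```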
Datetime handling
Using pandas.DataFrame.assign together with the datetime properties (the .dt accessor), we split the launch time into its components:
ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)
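For example, on a hand-made launched series (a sketch):

```python
import pandas as pd

launched = pd.Series(pd.to_datetime(['2017-08-12 15:30:00',
                                     '2018-01-01 00:05:00']))
# The .dt accessor exposes each datetime component as a column of integers
parts = pd.DataFrame({'hour': launched.dt.hour,
                      'day': launched.dt.day,
                      'month': launched.dt.month,
                      'year': launched.dt.year})
print(parts)
```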
Preprocessing categorical variables
sklearn.preprocessing.LabelEncoder automatically builds an index over the label values, assigning each distinct label an integer:
from sklearn.preprocessing import LabelEncoder
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
# Apply the label encoder to each column
encoded = ks[cat_features].apply(encoder.fit_transform)
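To see the mapping LabelEncoder builds (a sketch with made-up values; the integers follow the alphabetical order of the labels):

```python
from sklearn.preprocessing import LabelEncoder

countries = ['US', 'GB', 'US', 'CA']
encoder = LabelEncoder()
encoded = encoder.fit_transform(countries)

print(list(encoder.classes_))  # labels sorted alphabetically: ['CA', 'GB', 'US']
print(list(encoded))           # [2, 1, 2, 0]
```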
LightGBM
An introduction to LightGBM:
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
Categorical Encodings
For categorical features there are several different feature-engineering approaches. The Python package Category Encoders implements many of them; below are a few commonly used methods:
Count Encoding
Import the category_encoders package and specify the feature columns:
import category_encoders as ce
cat_features = ['category', 'currency', 'country']
CountEncoder is used as category_encoders.CountEncoder().fit_transform(feature columns):
count_enc = ce.CountEncoder()
count_encoded = count_enc.fit_transform(ks[cat_features])
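CountEncoder replaces each category with how often it appears. The same effect can be reproduced by hand with pandas (a sketch with made-up data):

```python
import pandas as pd

s = pd.Series(['US', 'GB', 'US', 'CA', 'US', 'GB'])
counts = s.value_counts()  # US: 3, GB: 2, CA: 1
encoded = s.map(counts)    # replace each value with its frequency
print(encoded.tolist())    # [3, 2, 3, 1, 3, 2]
```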
Join the count-encoded columns (with a _count suffix) back onto the original data to prepare the dataset:
data = baseline_data.join(count_encoded.add_suffix("_count"))
# Training a model on the baseline data
train, valid, test = get_data_splits(data)
bst = train_model(train, valid)
The get_data_splits() function splits the dataset into training, validation, and test sets. It simply slices by row index; no shuffling or stratification is applied:
def get_data_splits(dataframe, valid_fraction=0.1):
    valid_size = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_size * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_size * 2:-valid_size]
    test = dataframe[-valid_size:]
    return train, valid, test
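Applied to a small frame, the split sizes come out as expected (a sketch that repeats the definition so it runs standalone):

```python
import pandas as pd

def get_data_splits(dataframe, valid_fraction=0.1):
    valid_size = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_size * 2]
    valid = dataframe[-valid_size * 2:-valid_size]
    test = dataframe[-valid_size:]
    return train, valid, test

df = pd.DataFrame({'x': range(100)})
train, valid, test = get_data_splits(df)
print(len(train), len(valid), len(test))  # 80 10 10
```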
Target Encoder
TargetEncoder differs from CountEncoder in that fitting the encoder also requires the target column:
target_enc.fit(train[cat_features], train['outcome'])
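Target encoding replaces each category with the mean of the target over rows of that category. By hand with pandas (a sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'country': ['US', 'US', 'GB', 'GB', 'US'],
                   'outcome': [1, 0, 1, 1, 1]})
# Mean target per category: GB -> 1.0, US -> 2/3
means = df.groupby('country')['outcome'].mean()
df['country_target'] = df['country'].map(means)
print(df['country_target'].tolist())
```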
CatBoost: unbiased boosting with categorical features
CatBoost encoding is similar to target encoding, except that CatBoost computes the target statistics only from the rows that come before the current one:
cb_enc.fit(train[cat_features], train['outcome'])
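The "previous rows only" idea can be sketched with a cumulative statistic in pandas (my own illustration, not the actual CatBoostEncoder, which additionally smooths with a prior):

```python
import pandas as pd

df = pd.DataFrame({'country': ['US', 'US', 'GB', 'US'],
                   'outcome': [1, 0, 1, 1]})

# For each row: mean of the target over *earlier* rows of the same category
prev_sum = df.groupby('country')['outcome'].cumsum() - df['outcome']
prev_cnt = df.groupby('country').cumcount()
df['country_cb'] = prev_sum / prev_cnt  # NaN for a category's first row
print(df['country_cb'].tolist())  # [nan, 1.0, nan, 0.5]
```

Because each row only sees earlier rows, the encoding never leaks the row's own target value, which is the source of CatBoost's "unbiased" claim.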
For more methods, see:
Categorical Encoding Methods
One-hot Encoding