Baseline Model
First, let's introduce the concept (from "What does 'baseline' mean in machine learning?" on Quora):
A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a dataset. You can use these predictions to measure the baseline's performance (e.g., accuracy)-- this metric will then become what you compare any other machine learning algorithm against.
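As a concrete illustration (a sketch of mine, not from the quote above): scikit-learn's DummyClassifier implements exactly this kind of heuristic baseline, for example always predicting the most frequent class:

```python
from sklearn.dummy import DummyClassifier

X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, 1, 1, 0, 0]  # majority class is 1

# Heuristic baseline: always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
acc = baseline.score(X, y)  # 4 of the 6 labels are the majority class
```

Any real model we train later should beat this accuracy, otherwise it has learned nothing useful.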
Now let's walk through this case in detail:
Reading the data:
import pandas as pd
ks = pd.read_csv('../input/kickstarter-projects/ks-projects-201801.csv',
                 parse_dates=['deadline', 'launched'])
ks.head(10)
As we can see, the task is to predict whether a Kickstarter project will succeed. The state column is the target; category, currency, funding goal, country, and launched time can serve as features.
pd.unique(ks.state)
This returns the unique values of the state column. pandas.unique finds the unique values, much like SQL's SELECT DISTINCT, and returns them as an array of labels.
ks.groupby('state')['ID'].count()
This counts the records for each state.
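On a small made-up frame, the two calls above behave as follows (a sketch; the real ks data has many more columns):

```python
import pandas as pd

ks = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'state': ['successful', 'failed', 'failed', 'live']})

print(pd.unique(ks.state))                # unique labels, in order of appearance
print(ks.groupby('state')['ID'].count())  # number of rows per state
```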
pandas.DataFrame.query filters the data. In the example below we keep every row whose state is not live (comparable to a SQL SELECT with a WHERE clause):
ks = ks.query('state != "live"')
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))
This adds an outcome column: 1 if the state is successful, 0 otherwise.
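Continuing with a made-up frame (a sketch), filtering out live projects and deriving the binary outcome:

```python
import pandas as pd

ks = pd.DataFrame({'state': ['successful', 'failed', 'live', 'canceled']})
ks = ks.query('state != "live"')  # drop in-progress projects
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))
print(ks['outcome'].tolist())  # [1, 0, 0]
```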
Datetime handling
Using pandas.DataFrame.assign together with the datetime properties (the .dt accessor), we split the launch time into its components:
ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)
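For example, on a hand-made launched series (a sketch):

```python
import pandas as pd

launched = pd.Series(pd.to_datetime(['2017-08-12 15:30:00',
                                     '2018-01-01 00:05:00']))
# The .dt accessor exposes each datetime component as a column of integers
parts = pd.DataFrame({'hour': launched.dt.hour,
                      'day': launched.dt.day,
                      'month': launched.dt.month,
                      'year': launched.dt.year})
print(parts)
```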
Preprocessing categorical variables
sklearn.preprocessing.LabelEncoder automatically builds an index over the label values, assigning each distinct label an integer:
from sklearn.preprocessing import LabelEncoder
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
# Apply the label encoder to each column
encoded = ks[cat_features].apply(encoder.fit_transform)
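To see the mapping LabelEncoder builds (a sketch with made-up values; the integers follow the alphabetical order of the labels):

```python
from sklearn.preprocessing import LabelEncoder

countries = ['US', 'GB', 'US', 'CA']
encoder = LabelEncoder()
encoded = encoder.fit_transform(countries)

print(list(encoder.classes_))  # labels sorted alphabetically: ['CA', 'GB', 'US']
print(list(encoded))           # [2, 1, 2, 0]
```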
LightGBM
An introduction to LightGBM:
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
Categorical Encodings
For categorical features there are several different feature-engineering approaches. The Python package Category Encoders implements many of them; below are a few commonly used methods:
Count Encoding
Import the category_encoders package and specify the feature columns:
import category_encoders as ce
cat_features = ['category', 'currency', 'country']
CountEncoder is used as category_encoders.CountEncoder().fit_transform(feature columns):
count_enc = ce.CountEncoder()
count_encoded = count_enc.fit_transform(ks[cat_features])
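CountEncoder replaces each category with how often it appears. The same effect can be reproduced by hand with pandas (a sketch with made-up data):

```python
import pandas as pd

s = pd.Series(['US', 'GB', 'US', 'CA', 'US', 'GB'])
counts = s.value_counts()  # US: 3, GB: 2, CA: 1
encoded = s.map(counts)    # replace each value with its frequency
print(encoded.tolist())    # [3, 2, 3, 1, 3, 2]
```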
Join the count-encoded columns (with a _count suffix) back onto the original data to prepare the dataset:
data = baseline_data.join(count_encoded.add_suffix("_count"))
# Training a model on the baseline data
train, valid, test = get_data_splits(data)
bst = train_model(train, valid)
The get_data_splits() function splits the dataset into training, validation, and test sets. It simply slices by row index; no shuffling or stratification is applied:
def get_data_splits(dataframe, valid_fraction=0.1):
    valid_size = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_size * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_size * 2:-valid_size]
    test = dataframe[-valid_size:]
    return train, valid, test
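Applied to a small frame, the split sizes come out as expected (a sketch that repeats the definition so it runs standalone):

```python
import pandas as pd

def get_data_splits(dataframe, valid_fraction=0.1):
    valid_size = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_size * 2]
    valid = dataframe[-valid_size * 2:-valid_size]
    test = dataframe[-valid_size:]
    return train, valid, test

df = pd.DataFrame({'x': range(100)})
train, valid, test = get_data_splits(df)
print(len(train), len(valid), len(test))  # 80 10 10
```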
Target Encoder
TargetEncoder differs from CountEncoder in that fitting the encoder also requires the target column:
target_enc.fit(train[cat_features], train['outcome'])
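Target encoding replaces each category with the mean of the target over rows of that category. By hand with pandas (a sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'country': ['US', 'US', 'GB', 'GB', 'US'],
                   'outcome': [1, 0, 1, 1, 1]})
# Mean target per category: GB -> 1.0, US -> 2/3
means = df.groupby('country')['outcome'].mean()
df['country_target'] = df['country'].map(means)
print(df['country_target'].tolist())
```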
CatBoost: unbiased boosting with categorical features
CatBoost encoding is similar to target encoding, except that CatBoost computes the target statistics only from the rows that come before the current one:
cb_enc.fit(train[cat_features], train['outcome'])
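The "previous rows only" idea can be sketched with a cumulative statistic in pandas (my own illustration, not the actual CatBoostEncoder, which additionally smooths with a prior):

```python
import pandas as pd

df = pd.DataFrame({'country': ['US', 'US', 'GB', 'US'],
                   'outcome': [1, 0, 1, 1]})

# For each row: mean of the target over *earlier* rows of the same category
prev_sum = df.groupby('country')['outcome'].cumsum() - df['outcome']
prev_cnt = df.groupby('country').cumcount()
df['country_cb'] = prev_sum / prev_cnt  # NaN for a category's first row
print(df['country_cb'].tolist())  # [nan, 1.0, nan, 0.5]
```

Because each row only sees earlier rows, the encoding never leaks the row's own target value, which is the source of CatBoost's "unbiased" claim.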
For more methods, see:
Categorical Encoding Methods
One-hot Encoding