Tianchi Dragon Ball Training Camp: Machine Learning Task 04

These are my study notes for the Alibaba Cloud Tianchi Dragon Ball Plan machine learning training camp. Course link: https://tianchi.aliyun.com/competition/entrance/231702/introduction?spm=5176.20222472.J_3678908510.8.8f5e67c2RKrT98
Overall approach: build several individual learners with LightGBM, XGBoost, GBDT, and CatBoost (adding a bagging-style strategy by randomly resampling the data), then fit a ridge regression on the base learners' outputs as the final learner to further improve accuracy (a sketch of that blend appears at the end). The code is below.

Possible improvements:
1. Analyze the fields in more detail; some fields may deserve special treatment.
2. The hyperparameters can be tuned further. I did not use grid search, only simple manual tuning (see the sketch after this list).
3. If the goal is purely to raise the score, try several different random seeds and keep the best run.
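
As a minimal sketch of what point 2 could look like: a grid search over a few LGBMRegressor hyperparameters with scikit-learn's GridSearchCV. The parameter ranges here are illustrative assumptions, not tuned results, and X and Y refer to the feature matrix and labels constructed in the preprocessing code below.

from sklearn.model_selection import GridSearchCV
from lightgbm.sklearn import LGBMRegressor

# Illustrative grid -- the value ranges are assumptions, not tuned results.
param_grid = {
    "num_leaves": [7, 11, 15],
    "learning_rate": [0.03, 0.05, 0.1],
    "n_estimators": [200, 400, 600],
}
search = GridSearchCV(
    estimator=LGBMRegressor(n_jobs=-1),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",  # the competition metric is MSE
    cv=5,
)
search.fit(X, Y)
print(search.best_params_, -search.best_score_)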
 

import pandas as pd
import numpy as np

# Load the training data (the file is GB2312-encoded) and shuffle the rows.
df = pd.read_csv("happiness_train_complete.csv", encoding="GB2312")
df = df.sample(frac=1, replace=False, random_state=11)
df.reset_index(inplace=True)

# Drop rows whose label is an invalid (negative) code.
df = df[df["happiness"] > 0]
Y = df["happiness"]

# Parse survey_time ("YYYY/M/D H:MM") into month / day / hour features.
df["survey_month"] = df["survey_time"].map(lambda line: line.split(" ")[0].split("/")[1]).astype("int64")
df["survey_day"] = df["survey_time"].map(lambda line: line.split(" ")[0].split("/")[2]).astype("int64")
df["survey_hour"] = df["survey_time"].map(lambda line: line.split(" ")[1].split(":")[0]).astype("int64")

# Drop identifiers, the raw timestamp, and the free-text columns.
X = df.drop(columns=["id", "index", "happiness", "survey_time", "edu_other", "property_other", "invest_other"])
 
from lightgbm.sklearn import LGBMRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
import joblib  # sklearn.externals.joblib was removed from scikit-learn; import joblib directly

# 15-fold cross-validation: one LightGBM model is trained and saved per fold.
kfold = KFold(n_splits=15, shuffle=True, random_state=12)
model = LGBMRegressor(n_jobs=-1,
                      learning_rate=0.051,
                      n_estimators=400,
                      num_leaves=11,
                      reg_alpha=2.0,
                      reg_lambda=2.1,
                      min_child_samples=6,
                      min_split_gain=0.5,
                      colsample_bytree=0.2)
mse = []
for i, (train, test) in enumerate(kfold.split(X)):
    X_train, y_train = X.iloc[train], Y.iloc[train]
    X_test, y_test = X.iloc[test], Y.iloc[test]
    model.fit(X_train, y_train)
    # (Abandoned experiment: stacking a second model on pred_leaf indices.)
    # model2.fit(model.predict(X_train, pred_leaf=True), y_train)
    # y_pred = model2.predict(model.predict(X=X_test, pred_leaf=True))
    y_pred = model.predict(X=X_test)
    e = mean_squared_error(y_true=y_test, y_pred=y_pred)
    mse.append(e)
    print(e)
    joblib.dump(value=model, filename="light" + str(i))  # save the fold's model
print("lightgbm", np.mean(mse), mse)
# CatBoostRegressor -- reload and preprocess the data (same steps as above).
import pandas as pd
import numpy as np

df = pd.read_csv("happiness_train_complete.csv", encoding="GB2312")
df = df.sample(frac=1, replace=False, random_state=11)
df.reset_index(inplace=True)

df = df[df["happiness"] > 0]
Y = df["happiness"]
df["survey_month"] = df["survey_time"].map(lambda line: line.split(" ")[0].split("/")[1]).astype("int64")
df["survey_day"] = df["survey_time"].map(lambda line: line.split(" ")[0].split("/")[2]).astype("int64")
df["survey_hour"] = df["survey_time"].map(lambda line: line.split(" ")[1].split(":")[0]).astype("int64")
X = df.drop(columns=["id", "index", "happiness", "survey_time", "edu_other", "property_other", "invest_other"])
 
 
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
import joblib

kfold = KFold(n_splits=15, shuffle=True, random_state=12)
# The original post is cut off mid-line here; the constructor is completed
# and the cross-validation loop below mirrors the LightGBM loop above.
model = CatBoostRegressor(colsample_bylevel=0.1, thread_count=6, silent=True)

mse = []
for i, (train, test) in enumerate(kfold.split(X)):
    X_train, y_train = X.iloc[train], Y.iloc[train]
    X_test, y_test = X.iloc[test], Y.iloc[test]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    e = mean_squared_error(y_true=y_test, y_pred=y_pred)
    mse.append(e)
    print(e)
    joblib.dump(value=model, filename="cat" + str(i))
print("catboost", np.mean(mse), mse)
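
Finally, the ridge-regression blend described in the overview. The XGBoost and GBDT learners mentioned there (not shown in the surviving code) would be trained with the same K-fold pattern. A minimal sketch, assuming each base learner's predictions are stacked as one column of a feature matrix; the alpha value and the use of the fold-0 models are illustrative assumptions.

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Assumed setup: each column holds one base learner's predictions on the
# training rows. For a proper stack these should be out-of-fold predictions
# collected during the loops above, to avoid leakage.
base_preds = np.column_stack([
    joblib.load("light0").predict(X),
    joblib.load("cat0").predict(X),
])

# Ridge regression as the final (level-2) learner; alpha is an assumption.
blender = Ridge(alpha=1.0)
blender.fit(base_preds, Y)
final_pred = blender.predict(base_preds)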
