9. ML Learning Workflow

1. A Data Science Framework

  1. Define the Problem:
  • Problems before requirements, requirements before solutions, solutions before design, and design before technology.
  • Question → requirements → solution → approach → technology

#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from lightgbm import LGBMClassifier

#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics
  2. Gather the Data:
  • We will worry about transforming "dirty data" into "clean data."
  • Data cleaning covers the four C's:
    • Correcting: fix aberrant or extreme values
    • Completing: fill in missing values
    • Creating: engineer new features
    • Converting: encode or standardize formats
   [pandas.DataFrame]
   [pandas.DataFrame.info]
   [pandas.DataFrame.describe]
* Cleaning and completing helpers
   [pandas.isnull]
   [pandas.DataFrame.sum]
   [pandas.DataFrame.mode]
   [pandas.DataFrame.copy]
   [pandas.DataFrame.fillna]
   [pandas.DataFrame.drop]
   [pandas.Series.value_counts]
   [pandas.DataFrame.loc]
* Encoding and converting helpers
   [Categorical Encoding]
   [Sklearn LabelEncoder]
   [Sklearn OneHotEncoder]
   [Pandas Categorical dtype]
   [pandas.get_dummies]
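The four C's above can be sketched with the listed pandas helpers. This is a minimal illustration on a hypothetical toy frame (the column names `Age`, `Fare`, `Sex` are assumptions, not from any particular dataset):

```python
import pandas as pd

# A toy frame standing in for "dirty data" (hypothetical columns)
df = pd.DataFrame({
    'Age':  [22.0, None, 38.0, 990.0],   # 990 is an impossible value
    'Fare': [7.25, 71.3, None, 8.05],
    'Sex':  ['male', 'female', 'female', 'male'],
})

# Correcting: treat the obviously impossible age as missing
df.loc[df['Age'] > 120, 'Age'] = None

# Completing: fill missing numerics with the median, categoricals with the mode
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
df['Sex'] = df['Sex'].fillna(df['Sex'].mode()[0])

# Creating: derive a new feature from existing columns
df['FarePerYear'] = df['Fare'] / df['Age']

# Converting: one-hot encode the categorical column
df = pd.get_dummies(df, columns=['Sex'])

print(df.isnull().sum().sum())  # → 0, no missing values remain
```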

  3. Prepare Data for Consumption:

  4. Perform Exploratory Analysis:

  • In addition, data categorization (i.e., qualitative vs. quantitative) is also important for understanding and selecting the correct hypothesis test or data model.
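A quick way to split columns into qualitative and quantitative groups, as a starting point for choosing tests, is `select_dtypes`. A minimal sketch on an assumed toy frame:

```python
import pandas as pd

# Hypothetical mixed-type frame
df = pd.DataFrame({
    'Age': [22, 35, 58],
    'Fare': [7.25, 71.3, 8.05],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'S'],
})

# Quantitative columns get summary statistics and correlation analysis;
# qualitative columns get frequency tables and categorical tests.
quantitative = df.select_dtypes(include='number').columns.tolist()
qualitative = df.select_dtypes(exclude='number').columns.tolist()

print(quantitative)  # → ['Age', 'Fare']
print(qualitative)   # → ['Sex', 'Embarked']
```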
  5. Model Data:
#Machine Learning Algorithm (MLA) Selection and Initialization
MLA = [
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    #Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),
    
    #GLM
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),
    
    #Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    
    #Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    
    #SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    svm.LinearSVC(),
    
    #Trees    
    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(),
    
    #Discriminant Analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(), 

    #lightgbm
    LGBMClassifier()    
    ]
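The MLA list is typically consumed by looping over it with `cross_validate` and tabulating scores. A minimal sketch using a synthetic dataset and two of the listed estimators (the dataset and the result-table column names are assumptions):

```python
import pandas as pd
from sklearn import ensemble, naive_bayes, model_selection
from sklearn.datasets import make_classification

# Synthetic stand-in for the real training data
X, y = make_classification(n_samples=200, random_state=0)
cv_split = model_selection.ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

# A small slice of the MLA list above
mla = [ensemble.RandomForestClassifier(random_state=0),
       naive_bayes.GaussianNB()]

rows = []
for alg in mla:
    # cross_validate trains and scores each algorithm on every split
    cv = model_selection.cross_validate(alg, X, y, cv=cv_split)
    rows.append({'MLA': alg.__class__.__name__,
                 'Test Accuracy Mean': cv['test_score'].mean()})

# Rank the candidates by mean test accuracy
compare = pd.DataFrame(rows).sort_values('Test Accuracy Mean', ascending=False)
print(compare)
```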
  6. Validate and Implement Data Model:
  • Cross-validation run this way does not yield the out-of-fold predictions themselves; it can only show whether the model is overfitting:
cv_results = model_selection.cross_validate(alg, data1[data1_x_bin], data1[Target], cv  = cv_split)
  • To obtain the out-of-fold predictions from cross-validation (and then predict new values), iterate over the folds manually:
cv_split = model_selection.KFold(n_splits=n_split, shuffle=True, random_state=15)
for i, (train_index,test_index) in enumerate(cv_split.split(train_x,train_y)):       
    pass
  • Blending & stacking
import numpy as np
from sklearn.model_selection import StratifiedKFold

def get_oof(clf, train_x, train_y, test_x):
    '''
    Blending: return out-of-fold (OOF) predictions for stacking.

    clf: estimator
    train_x: pd.DataFrame
    train_y: np.array
    test_x: pd.DataFrame
    '''
    NFolds = 5
    len_train_x = train_x.shape[0]
    len_test_x = test_x.shape[0]
    oof_train = np.zeros((len_train_x,))
    oof_test = np.zeros((len_test_x,))
    oof_test_skf = np.empty((NFolds, len_test_x))  # one test prediction per fold
    kf = StratifiedKFold(n_splits=NFolds)
    for i, (train_index, test_index) in enumerate(kf.split(train_x, train_y)):
        x_tr = train_x.iloc[train_index]
        y_tr = train_y[train_index]
        x_te = train_x.iloc[test_index]

        clf.fit(x_tr, y_tr)
        # Predict the held-out fold: each training row gets exactly one OOF prediction
        oof_train[test_index] = clf.predict(x_te)
        # Predict the full test set with this fold's model
        oof_test_skf[i, :] = clf.predict(test_x)

    # Average the per-fold test-set predictions
    oof_test = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
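The out-of-fold predictions from `get_oof` become the meta-model's training features. A minimal sketch of that stacking step, using sklearn's `cross_val_predict` (which computes equivalent out-of-fold predictions) and assumed base/meta model choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real data
X, y = make_classification(n_samples=300, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=0)

base_models = [RandomForestClassifier(random_state=0), GaussianNB()]

# Out-of-fold predictions on the training set become meta-features
# (the same role as oof_train); base-model predictions on the test set
# play the role of oof_test.
meta_train = np.column_stack([
    cross_val_predict(m, train_x, train_y, cv=5) for m in base_models])
meta_test = np.column_stack([
    m.fit(train_x, train_y).predict(test_x) for m in base_models])

# Second-level model trained on the stacked predictions
meta_model = LogisticRegression().fit(meta_train, train_y)
stacked_pred = meta_model.predict(meta_test)
print(stacked_pred.shape)  # → (75,)
```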
  7. Optimize and Strategize:
  • Should hyperparameters be tuned with grid search?
  • feature_selection.RFECV(dtree, step = 1, scoring = 'accuracy', cv =2)
    At which step should feature selection happen?
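One common answer to both questions is to run recursive feature elimination first, then grid-search hyperparameters on the reduced feature set. A minimal sketch under that assumption (the parameter grid and synthetic data are illustrative, not prescriptive):

```python
from sklearn import feature_selection, model_selection, tree
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Step 1: RFECV prunes features by cross-validated accuracy
dtree = tree.DecisionTreeClassifier(random_state=0)
rfecv = feature_selection.RFECV(dtree, step=1, scoring='accuracy', cv=2)
X_selected = rfecv.fit_transform(X, y)

# Step 2: grid-search hyperparameters on the selected features
param_grid = {'max_depth': [2, 4, 6], 'criterion': ['gini', 'entropy']}
grid = model_selection.GridSearchCV(
    tree.DecisionTreeClassifier(random_state=0),
    param_grid=param_grid, scoring='accuracy', cv=5)
grid.fit(X_selected, y)
print(grid.best_params_)
```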

In summary, the general steps are:

  • Introduction - Loading libraries and dataset
    * Load the dataset, whether from a database or read straight into memory
  • Exploration, engineering and cleaning features (or variables)
    * Explore relationships and clean the data
  • Correlation analysis - Tri-variate analysis
    * Correlation and tri-variate analysis; possibly feature crosses
  • Prediction models with cross-validation
    * Cross-validate many models and select the better-performing class
    * Grid-search the candidate models to fix their hyperparameters
  • Stacking predictions
    * Stack multiple models
