1. A Data Science Framework
- Define the Problem:
- Problems before requirements, requirements before solutions, solutions before design, and design before technology.
- Pose the problem > requirements > solution > design > technology
#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from lightgbm import LGBMClassifier
#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics
- Gather the Data:
- We will worry about transforming "dirty data" into "clean data."
- Clean the data (the four C's; a sketch follows the reference list below):
- Correcting: fix aberrant values and extreme outliers
- Completing: fill in missing values
- Creating: engineer new features
- Converting: encode or standardize
[pandas.DataFrame]
[pandas.DataFrame.info]
[pandas.DataFrame.describe]
* Cleaning and imputation:
[pandas.isnull]
[pandas.DataFrame.sum]
[pandas.DataFrame.mode]
[pandas.DataFrame.copy]
[pandas.DataFrame.fillna]
[pandas.DataFrame.drop]
[pandas.Series.value_counts]
[pandas.DataFrame.loc]
* Encoding:
[Categorical Encoding]
[Sklearn LabelEncoder]
[Sklearn OneHotEncoder]
[Pandas Categorical dtype]
[pandas.get_dummies]
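A minimal sketch of the four C's on a Titanic-style data1; the file name and the column names (Age, Embarked, Fare, SibSp, Parch, Sex) are illustrative assumptions, not from this document:
#four C's sketch; column and file names are hypothetical
import pandas as pd
data1 = pd.read_csv('train.csv')
#Completing: fill missing values with the median / most frequent value
data1['Age'] = data1['Age'].fillna(data1['Age'].median())
data1['Embarked'] = data1['Embarked'].fillna(data1['Embarked'].mode()[0])
#Correcting: cap extreme fares at the 99th percentile
fare_cap = data1['Fare'].quantile(0.99)
data1.loc[data1['Fare'] > fare_cap, 'Fare'] = fare_cap
#Creating: engineer a new feature from existing ones
data1['FamilySize'] = data1['SibSp'] + data1['Parch'] + 1
#Converting: label-encode and one-hot-encode categoricals
data1['Sex_Code'] = LabelEncoder().fit_transform(data1['Sex'])
data1 = pd.get_dummies(data1, columns=['Embarked'])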
- Prepare Data for Consumption:
- Perform Exploratory Analysis:
- In addition, data categorization (i.e. qualitative vs quantitative) is also important to understand and select the correct hypothesis test or data model.
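To see that split quickly, pandas can separate numeric (quantitative) from non-numeric (qualitative) columns; a small sketch, assuming data1 is the loaded training DataFrame:
#split columns by data category
quantitative = data1.select_dtypes(include='number').columns.tolist()
qualitative = data1.select_dtypes(exclude='number').columns.tolist()
print('quantitative:', quantitative)
print('qualitative:', qualitative)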
- Model Data:
#Machine Learning Algorithm (MLA) Selection and Initialization
MLA = [
#Ensemble Methods
ensemble.AdaBoostClassifier(),
ensemble.BaggingClassifier(),
ensemble.ExtraTreesClassifier(),
ensemble.GradientBoostingClassifier(),
ensemble.RandomForestClassifier(),
#Gaussian Processes
gaussian_process.GaussianProcessClassifier(),
#GLM
linear_model.LogisticRegressionCV(),
linear_model.PassiveAggressiveClassifier(),
linear_model.RidgeClassifierCV(),
linear_model.SGDClassifier(),
linear_model.Perceptron(),
#Naive Bayes
naive_bayes.BernoulliNB(),
naive_bayes.GaussianNB(),
#Nearest Neighbor
neighbors.KNeighborsClassifier(),
#SVM
svm.SVC(probability=True),
svm.NuSVC(probability=True),
svm.LinearSVC(),
#Trees
tree.DecisionTreeClassifier(),
tree.ExtraTreeClassifier(),
#Discriminant Analysis
discriminant_analysis.LinearDiscriminantAnalysis(),
discriminant_analysis.QuadraticDiscriminantAnalysis(),
#lightgbm
LGBMClassifier()
]
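Each candidate can then be scored on the same cross-validation split and ranked; a sketch that reuses data1, data1_x_bin, Target, and the cv_split defined in the validation step below:
#run every MLA through identical cross-validation and rank by mean test accuracy
import pandas as pd
MLA_compare = pd.DataFrame(columns=['MLA Name', 'MLA Test Accuracy Mean'])
for row_index, alg in enumerate(MLA):
    cv_results = model_selection.cross_validate(alg, data1[data1_x_bin], data1[Target], cv = cv_split)
    MLA_compare.loc[row_index] = [alg.__class__.__name__, cv_results['test_score'].mean()]
MLA_compare.sort_values(by = 'MLA Test Accuracy Mean', ascending = False, inplace = True)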
- Validate and Implement Data Model:
- Cross-validation run this way does not hand back the validation predictions themselves; it only shows whether the model overfits (compare cv_results['train_score'] against cv_results['test_score']):
cv_results = model_selection.cross_validate(alg, data1[data1_x_bin], data1[Target], cv = cv_split, return_train_score = True)
- To obtain the out-of-fold validation predictions themselves, and then predict new values, split manually:
cv_split = model_selection.KFold(n_splits=n_split, shuffle=True, random_state=15)
for i, (train_index, test_index) in enumerate(cv_split.split(train_x, train_y)):
    alg.fit(train_x.iloc[train_index], train_y[train_index])
    fold_pred = alg.predict(train_x.iloc[test_index])
- Blending & stacking:
import numpy as np
from sklearn.model_selection import StratifiedKFold

def get_oof(clf, train_x, train_y, test_x):
    '''
    Blending: out-of-fold (OOF) predictions for stacking.
    clf: estimator
    train_x: pd.DataFrame
    train_y: np.array
    test_x: pd.DataFrame
    '''
    NFolds = 5
    len_train_x = train_x.shape[0]
    len_test_x = test_x.shape[0]
    oof_train = np.zeros((len_train_x,))
    oof_test = np.zeros((len_test_x,))
    oof_test_skf = np.empty((NFolds, len_test_x))  #shape: (NFolds, len_test_x)
    kf = StratifiedKFold(n_splits=NFolds)
    for i, (train_index, test_index) in enumerate(kf.split(train_x, train_y)):
        x_tr = train_x.iloc[train_index]
        y_tr = train_y[train_index]
        x_te = train_x.iloc[test_index]
        clf.fit(x_tr, y_tr)
        #each training row is predicted exactly once, by the fold that held it out
        oof_train[test_index] = clf.predict(x_te)
        #the test set is predicted once per fold and averaged after the loop
        oof_test_skf[i, :] = clf.predict(test_x)
    oof_test = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
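A hedged usage sketch: build OOF columns from a few base models and fit a second-level learner on them; the base-model and variable choices are illustrative:
#level 1: OOF predictions from two hypothetical base models
rf_oof_train, rf_oof_test = get_oof(ensemble.RandomForestClassifier(), train_x, train_y, test_x)
gb_oof_train, gb_oof_test = get_oof(ensemble.GradientBoostingClassifier(), train_x, train_y, test_x)
#level 2: stack the OOF columns and fit a meta-learner
stacked_train = np.concatenate([rf_oof_train, gb_oof_train], axis=1)
stacked_test = np.concatenate([rf_oof_test, gb_oof_test], axis=1)
meta = linear_model.LogisticRegression()
meta.fit(stacked_train, train_y)
final_pred = meta.predict(stacked_test)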
- Optimize and Strategize:
- How do we optimize hyperparameters with a grid search? A sketch follows.
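A minimal GridSearchCV sketch; the estimator and parameter grid are illustrative assumptions:
#exhaustive search over a small, illustrative parameter grid
param_grid = {'criterion': ['gini', 'entropy'], 'max_depth': [2, 4, 6, 8, None]}
tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid = param_grid, scoring = 'accuracy', cv = cv_split)
tune_model.fit(data1[data1_x_bin], data1[Target])
print(tune_model.best_params_, tune_model.best_score_)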
- At which step of the pipeline should feature selection happen? For example, recursive feature elimination with cross-validation (where dtree is a tree estimator such as tree.DecisionTreeClassifier()):
feature_selection.RFECV(dtree, step = 1, scoring = 'accuracy', cv = 2)
Summarized as the following general steps:
- Introduction: loading libraries and dataset
* Load the dataset, whether from a database or simply straight into memory
- Exploration, engineering and cleaning features (or variables)
* Explore relationships and clean the data
- Correlation analysis, tri-variate analysis
* Correlations, three-variable analysis, feature crosses?
- Prediction models with cross-validation
* Cross-validate many candidate models and keep the better-performing family
* Grid-search the shortlisted models to settle on hyperparameters
- Stacking predictions
* Stack the predictions of several models