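After finishing Andrew Ng's Machine Learning course on Coursera, I couldn't wait to enter a Kaggle competition, only to find that going from theory to practice is really hard. This post records my learning process.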
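Most tutorials recommend Jupyter Notebook for data science programming. We install Jupyter Notebook and the Python libraries we need through Anaconda. I reinstalled Anaconda on Windows 10 by following this guide:
Installing Anaconda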
I configured Jupyter Notebook and learned its basic operations by following these two articles:
What Jupyter can do
Common Jupyter Notebook keyboard shortcuts
NumPy tutorial (in Chinese)
Official NumPy documentation
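Official matplotlib tutorial
Official matplotlib tutorial (Chinese translation)
Introductory matplotlib tutorial
The matplotlib lecture on Jupyter Notebook Viewer
I suggest starting with the official tutorial and getting familiar with the basic operations through line plots, then reading chapters 3 to 6 of the introductory tutorial to learn how to draw the various chart types.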
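10 Minutes to Pandas
10 Minutes to Pandas (Chinese translation of the above)
Python for Data Analysis
The first two tutorials are for getting up to speed quickly. Python for Data Analysis was written by the author of pandas and is better suited to a thorough study.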
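Python machine learning in practice and hands-on Kaggle
Sklearn quick start
Official scikit-learn documentation
Official scikit-learn documentation (Chinese translation)
Feature engineering:
Feature processing is a very important step in machine learning. Following the article below, we first list how some common feature-processing methods are used in sklearn.
Single-machine feature engineering with sklearn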
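    # Standardization: zero mean and unit variance per feature.
    # Fit on the training set only, then apply the same statistics to the test set.
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler().fit(data_train)
    data_train = scaler.transform(data_train)
    data_test = scaler.transform(data_test)

    # Min-max scaling: rescale each feature to the range [0, 1]
    from sklearn.preprocessing import MinMaxScaler
    data = MinMaxScaler().fit_transform(data)

    # Normalization: rescale each sample (row) to unit norm
    from sklearn.preprocessing import Normalizer
    data = Normalizer().fit_transform(data)

    # Binarization: values above the threshold become 1, the rest 0
    # (epsilon is a threshold you choose)
    from sklearn.preprocessing import Binarizer
    data = Binarizer(threshold=epsilon).fit_transform(data)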
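To make the difference between these four transformers concrete, here is a small sketch on made-up data. The arrays and the threshold value 2.0 are assumptions chosen only for illustration. Note that the scaler is fitted on the training data and then reused for the test data.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, MinMaxScaler, Normalizer, StandardScaler

# Toy data, made up for illustration: 3 training samples, 1 test sample, 2 features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

# Standardization: learn mean and std from the training set, reuse them on the test set
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Min-max scaling works per feature (column)
X_minmax = MinMaxScaler().fit_transform(X_train)

# Normalizer works per sample (row): each row ends up with unit L2 norm
X_norm = Normalizer().fit_transform(X_train)

# Binarizer: values strictly greater than the threshold become 1, the rest 0
X_bin = Binarizer(threshold=2.0).fit_transform(X_train)

print(X_train_std)
print(X_minmax)
print(X_norm)
print(X_bin)
```

StandardScaler and MinMaxScaler operate column by column, while Normalizer operates row by row, which is easy to mix up.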
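DictVectorizer essentially keeps numeric features as they are and converts each category into dummy variables (one-hot encoding). See: Using DictVectorizer in Python
    from sklearn.feature_extraction import DictVectorizer
    vec = DictVectorizer(sparse=False)
    # X_train is a DataFrame; turn it into a list of dicts, one per row
    X_train = vec.fit_transform(X_train.to_dict(orient='records'))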
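A minimal sketch of what DictVectorizer produces, using a made-up two-column DataFrame (the column names and values are assumptions for illustration only):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Made-up data: one numeric column and one categorical column
df = pd.DataFrame({"age": [22.0, 38.0, 26.0],
                   "sex": ["male", "female", "female"]})

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(df.to_dict(orient="records"))

# The numeric column is kept; each category gets its own 0/1 column
print(vec.feature_names_)  # ['age', 'sex=female', 'sex=male']
print(X)
```

The numeric column passes through unchanged, and the string column is expanded into one indicator column per category.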
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    # Select the K best features (chi-squared test) and return the reduced data
    skb = SelectKBest(chi2, k=10).fit(X_train, y_train)
    X_train = skb.transform(X_train)
    X_test = skb.transform(X_test)
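As a runnable sketch of chi-squared feature selection, the snippet below uses the iris dataset bundled with sklearn and k=2. Both are assumptions for illustration, since the original snippet does not say which data it was run on.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

# Fit the selector on the training set only, then apply it to both sets
skb = SelectKBest(chi2, k=2).fit(X_train, y_train)
X_train_sel = skb.transform(X_train)
X_test_sel = skb.transform(X_test)

print(skb.scores_)         # chi-squared score of each original feature
print(skb.get_support())   # boolean mask of the features that were kept
print(X_train_sel.shape, X_test_sel.shape)
```

Note that chi2 requires non-negative feature values, which holds for iris.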
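    import numpy as np
    from minepy import MINE
    from sklearn.feature_selection import SelectKBest

    # MINE's API is not functional, so wrap it in a function that returns a pair:
    # the MIC score and a fixed placeholder p-value of 0.5
    def mic(x, y):
        m = MINE()
        m.compute_score(x, y)
        return (m.mic(), 0.5)

    # Score every feature column and return a (scores, p-values) pair
    def mic_scores(X, y):
        scores, pvalues = zip(*[mic(x, y) for x in X.T])
        return np.array(scores), np.array(pvalues)

    # Select the K best features and return the reduced data
    # (iris is assumed to be a dataset object with .data and .target)
    SelectKBest(mic_scores, k=2).fit_transform(iris.data, iris.target)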
    from sklearn.decomposition import PCA
    estimator = PCA(n_components=2)  # number of principal components to keep
    X_pca = estimator.fit_transform(X_data)
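A small sketch of PCA on the iris dataset (the choice of dataset is an assumption for illustration). It also shows explained_variance_ratio_, which helps when deciding how many components to keep.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_data, _ = load_iris(return_X_y=True)

estimator = PCA(n_components=2)
X_pca = estimator.fit_transform(X_data)

print(X_pca.shape)  # (150, 2): same samples, two principal components
# Fraction of the total variance captured by each kept component
print(estimator.explained_variance_ratio_)
```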
Learning algorithms:
Splitting into training and test sets:
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
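Training:
    # General pattern: LearnAlgorithm is a placeholder, not a real sklearn name.
    # Import the class for the algorithm you want from its own submodule.
    from sklearn import LearnAlgorithm
    la = LearnAlgorithm()
    la.fit(X_train, y_train)
    y_predict = la.predict(X_test)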
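To show the split, fit and predict pattern with a real estimator, here is a sketch that plugs in DecisionTreeClassifier and the iris dataset. Both choices are assumptions for illustration; any sklearn classifier follows the same three calls.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

la = DecisionTreeClassifier(random_state=0)  # stands in for "LearnAlgorithm"
la.fit(X_train, y_train)                     # learn from the training set
y_predict = la.predict(X_test)               # predict labels for unseen data

print(y_predict[:10])
print(y_test[:10])
```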
Stochastic gradient descent (SGD):
    # Classification
    from sklearn.linear_model import SGDClassifier
    sgdc = SGDClassifier()

    # Regression
    from sklearn.linear_model import SGDRegressor
    sgdr = SGDRegressor(loss='squared_error', penalty=None, random_state=42)
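Support vector machines (SVM):
Support vector classification (SVC):
    from sklearn.svm import SVC
    svc_linear = SVC(kernel='linear')  # linear kernel; other kernels can be used
Support vector regression (SVR):
    from sklearn.svm import SVR
    svr_linear = SVR(kernel='linear')  # linear kernel; other kernels such as poly or rbf can be used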
Naive Bayes (NaiveBayes):
    from sklearn.naive_bayes import MultinomialNB
    mnb = MultinomialNB()
Decision tree (DecisionTreeClassifier):
    from sklearn.tree import DecisionTreeClassifier
    # max_depth and min_samples_leaf limit tree growth to prevent overfitting
    dtc = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
Random forest (RandomForestClassifier):
    from sklearn.ensemble import RandomForestClassifier
    rfc = RandomForestClassifier(max_depth=3, min_samples_leaf=5)
Gradient boosted decision trees (GBDT):
    from sklearn.ensemble import GradientBoostingClassifier
    gbc = GradientBoostingClassifier(max_depth=3, min_samples_leaf=5)
Extremely randomized trees for regression (ExtraTreesRegressor):
    from sklearn.ensemble import ExtraTreesRegressor
    etr = ExtraTreesRegressor()
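Evaluation:
    from sklearn import metrics
    accuracy_rate = metrics.accuracy_score(y_test, y_predict)
    # Per-class precision, recall, F1 and support
    # (data is the dataset object that provides target_names)
    print(metrics.classification_report(y_test, y_predict, target_names=data.target_names))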
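The sketch below runs these metrics on a tiny set of hand-written labels so the numbers can be checked by eye. The labels and class names are made up for illustration, and the confusion matrix is an extra call that the original snippet does not use.

```python
from sklearn import metrics

# Made-up true and predicted labels for three classes
y_test = [0, 0, 1, 1, 2, 2]
y_predict = [0, 0, 1, 2, 2, 2]

# 5 of the 6 predictions are correct
accuracy_rate = metrics.accuracy_score(y_test, y_predict)

# Text table with precision, recall, F1 and support per class
report = metrics.classification_report(
    y_test, y_predict, target_names=["class_a", "class_b", "class_c"])

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_test, y_predict)

print(accuracy_rate)
print(report)
print(cm)
```

The single mistake (a true class 1 predicted as class 2) shows up as the only off-diagonal entry in the confusion matrix.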
K-fold cross-validation:
    from sklearn.model_selection import cross_val_score, KFold
    # K is the number of folds; clf is the estimator to evaluate
    cv = KFold(n_splits=K, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv)
or
    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(dt, X_train, y_train, cv=K)
Note that X and y here need to be ndarrays. If they are DataFrames, convert them with df.values and df.values.flatten() respectively.
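A sketch of 5-fold cross-validation that starts from DataFrames and converts them as described. The iris dataset, the DataFrame names and the decision tree are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
df_X = pd.DataFrame(iris.data, columns=iris.feature_names)
df_y = pd.DataFrame({"target": iris.target})

# Convert to ndarrays; flatten() turns the (n, 1) label column into shape (n,)
X = df_X.values
y = df_y.values.flatten()

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average over the 5 folds
```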
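Pipeline mechanism:
The Pipeline mechanism wraps and manages all processing steps as one streamlined workflow, so that the same sequence of steps and parameters can be reused across datasets. A Pipeline object takes a list of 2-tuples: the first element is a name of your choice, and the second is an sklearn transformer or estimator, i.e. the method used to process features or to learn. Taking naive Bayes as an example, different feature-processing methods give the following code:
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    clf_1 = Pipeline([('count_vec', CountVectorizer()), ('mnb', MultinomialNB())])
    # alternate_sign=False keeps the hashed features non-negative, as MultinomialNB requires
    clf_2 = Pipeline([('hash_vec', HashingVectorizer(alternate_sign=False)), ('mnb', MultinomialNB())])
    clf_3 = Pipeline([('tfidf_vec', TfidfVectorizer()), ('mnb', MultinomialNB())])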
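To show how such a pipeline is used end to end, here is a sketch that trains the CountVectorizer plus MultinomialNB pipeline on four made-up sentences. The texts and labels are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training texts: label 1 for spam-like, 0 for work-like
docs = ["free prize money now", "win free money",
        "meeting at noon", "project meeting notes"]
labels = [1, 1, 0, 0]

clf = Pipeline([("count_vec", CountVectorizer()), ("mnb", MultinomialNB())])

# fit() runs the vectorizer and then the classifier in order;
# predict() pushes new raw text through the same fitted steps
clf.fit(docs, labels)
pred = clf.predict(["free money", "meeting notes"])

print(pred)
```

Because the vectorizer lives inside the pipeline, raw strings can be passed straight to fit and predict without a separate transform step.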
Feature selection:
    from sklearn import feature_selection
    # per is the percentage of top-scoring features to keep (a value you choose)
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=per)
    X_train_fs = fs.fit_transform(X_train, y_train)
Taking feature selection with 5-fold cross-validation as an example, we implement a complete parameter selection process:
    import numpy as np
    from sklearn import feature_selection
    from sklearn.model_selection import cross_val_score

    # dt is a classifier created earlier
    percentiles = range(1, 100)
    results = []
    for i in percentiles:
        fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=i)
        X_train_fs = fs.fit_transform(X_train, y_train)
        scores = cross_val_score(dt, X_train_fs, y_train, cv=5)
        results.append(scores.mean())

    # Map the index of the best mean score back to its percentile value
    opt = percentiles[int(np.argmax(results))]
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=opt)
    X_train_fs = fs.fit_transform(X_train, y_train)
    dt.fit(X_train_fs, y_train)
    # Apply the same feature selection to the test set before predicting
    X_test_fs = fs.transform(X_test)
    y_predict = dt.predict(X_test_fs)
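Hyperparameters:
Hyperparameters are the configuration parameters of a machine learning model that are set before training rather than learned from the data, for example the maximum depth of a decision tree. They matter a great deal in both competitions and engineering work.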
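As a small illustration of the distinction, the sketch below sets a hyperparameter in the constructor and then inspects what the fitted model learned. The decision tree and the iris dataset are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth is a hyperparameter: we choose it, the model does not learn it
dtc = DecisionTreeClassifier(max_depth=3, random_state=0)
params = dtc.get_params()   # dict of all hyperparameters and their values

dtc.fit(X, y)
depth = dtc.get_depth()     # depth of the tree that was actually learned

print(params["max_depth"])
print(depth)
```

The learned tree can never be deeper than the max_depth that was set, which is how this hyperparameter constrains the model.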
Ensemble learning:
Combining multiple models improves overall performance. Random forests and XGBoost are examples. See:
Ensemble Learning: model fusion, implemented in Python
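One simple way to combine several models in sklearn is majority voting. The sketch below uses VotingClassifier on the iris dataset. The choice of base models and dataset is an assumption for illustration and is not taken from the linked article.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Three different base models; "hard" voting picks the majority class
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gnb", GaussianNB()),
    ],
    voting="hard",
)

scores = cross_val_score(voter, X, y, cv=5)
print(scores.mean())
```

Voting tends to help most when the base models make different kinds of mistakes, which is why three different model families are used here.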
Parallel grid search:
Used to find the optimal parameters. See:
Sklearn GridSearchCV grid search
    import numpy as np
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.svm import SVC
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline

    # news is a text dataset loaded earlier, with .data (documents) and .target (labels)
    X_train, X_test, y_train, y_test = train_test_split(news.data[:3000], news.target[:3000], test_size=0.25, random_state=33)

    clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', analyzer='word')), ('svc', SVC())])
    # Parameter names are prefixed with the pipeline step name and two underscores
    parameters = {'svc__gamma': np.logspace(-2, 1, 4), 'svc__C': np.logspace(-1, 1, 3)}
    # n_jobs=-1 runs the search in parallel on all available CPU cores
    gs = GridSearchCV(clf, parameters, verbose=2, refit=True, cv=3, n_jobs=-1)
    %time _ = gs.fit(X_train, y_train)  # %time is a Jupyter magic command
    gs.best_params_, gs.best_score_
    print(gs.score(X_test, y_test))
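For a version that runs without the text dataset, here is a sketch of the same grid search on iris, with a StandardScaler in place of the text vectorizer. The dataset and the scaler step are assumptions for illustration, and n_jobs=1 is used only to keep the example simple.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

clf = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# 4 gamma values x 3 C values = 12 candidate settings, each scored with 3-fold CV
parameters = {"svc__gamma": np.logspace(-2, 1, 4),
              "svc__C": np.logspace(-1, 1, 3)}

gs = GridSearchCV(clf, parameters, refit=True, cv=3, n_jobs=1)
gs.fit(X_train, y_train)

print(gs.best_params_, gs.best_score_)
# With refit=True, gs.score uses the best pipeline retrained on all of X_train
test_score = gs.score(X_test, y_test)
print(test_score)
```

Setting n_jobs=-1 instead would spread the 36 fits over all CPU cores, as in the snippet above.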