A First Look at Kaggle: Titanic Survival Prediction

Continuing my study of data mining, I tried the Titanic survival prediction competition on Kaggle.

Titanic for Machine Learning

Imports and loading the data

# data processing
import numpy as np
import pandas as pd
import re
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
train = pd.read_csv('D:/data/titanic/train.csv')
test = pd.read_csv('D:/data/titanic/test.csv')
train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
train.info()

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

The features are:
- PassengerId: no particular meaning.
- Pclass: cabin class. Does it affect survival? Did higher classes have a better chance?
- Name: can help us infer sex and approximate age.
- Sex: did women have a higher survival rate?
- Age: do different age groups survive at different rates?
- SibSp and Parch: counts of siblings/spouses and parents/children aboard. Does traveling with relatives raise or lower the survival rate?
- Fare: did a higher fare mean a better chance?
- Cabin and Embarked: cabin and port of embarkation… intuitively these should not affect survival.
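Each of these hypotheses reduces to a one-line group-by. A minimal sketch on a hypothetical mini-frame with the same column names (not the real data; the sections below run the real checks):

```python
import pandas as pd

# hypothetical mini-frame standing in for train
mini = pd.DataFrame({'Sex': ['female', 'male', 'female', 'male'],
                     'Survived': [1, 0, 1, 1]})

# mean of the 0/1 label per group is the per-group survival rate
rate = mini.groupby('Sex')['Survived'].mean()
print(rate)
```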

train.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
train.describe(include=['O'])# include=['O'] selects object (categorical) columns
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Hippach, Mrs. Louis Albert (Ida Sophia Fischer) male 1601 C23 C25 C27 S
freq 1 577 7 4 644

The target feature: Survived

survive_num = train.Survived.value_counts()
survive_num.plot.pie(explode=[0,0.1],autopct='%1.1f%%',labels=['died','survived'],shadow=True)
plt.show()

(figure: pie chart of died vs. survived proportions)

x=[0,1]
plt.bar(x,survive_num,width=0.35)
plt.xticks(x,('died','survived'))
plt.show()

(figure: bar chart of survival counts)

Feature analysis

num_f = [f for f in train.columns if train.dtypes[f] != 'object']
cat_f = [f for f in train.columns if train.dtypes[f]=='object']
print('there are %d numerical features:'%len(num_f),num_f)
print('there are %d category features:'%len(cat_f),cat_f)

there are 7 numerical features: ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
there are 5 category features: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

Feature types:
- numerical
- categorical: ordinal / nominal
- nominal (unordered) categorical: Sex, Embarked

Categorical features

Sex

train.groupby(['Sex'])['Survived'].count()
Sex
female    314
male      577
Name: Survived, dtype: int64
f,ax = plt.subplots(figsize=(8,6))
fig = sns.countplot(x='Sex',hue='Survived',data=train)
fig.set_title('Sex:Survived vs Dead')
plt.show()

(figure: survival counts by sex)

train.groupby(['Sex'])['Survived'].sum()/train.groupby(['Sex'])['Survived'].count()
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

Far more men than women were aboard, yet women survived at about 74%, versus 18-19% for men. Sex is clearly an important feature.

Embarked

sns.factorplot('Embarked','Survived',data=train)
plt.show()

(figure: survival rate by port of embarkation)

f,ax = plt.subplots(1,3,figsize=(24,6))
sns.countplot('Embarked',data=train,ax=ax[0])
ax[0].set_title('No. Of Passengers Boarded')
sns.countplot(x='Embarked',hue='Survived',data=train,ax=ax[1])
ax[1].set_title('Embarked vs Survived')
sns.countplot('Embarked',hue='Pclass',data=train,ax=ax[2])
ax[2].set_title('Embarked vs Pclass')
#plt.subplots_adjust(wspace=0.2,hspace=0.5)
plt.show()

(figure: passenger counts, survival, and Pclass by embarkation port)

#pd.pivot_table(train,index='Embarked',columns='Pclass',values='Fare')
sns.boxplot(x='Embarked',y='Fare',hue='Pclass',data=train)
plt.show()

(figure: fare distribution by port and Pclass)

The plots show that most passengers boarded at port S, mostly in class 3, although S also carries the largest number of class 1 passengers of the three ports. Port C has the highest survival rate, about 0.55, because it has a relatively high proportion of class 1 passengers; port Q is almost entirely class 3. Mean fares for classes 1 and 2 at port C are higher, perhaps hinting at higher social status. Logically, though, the port of embarkation should not itself affect survival, so we can convert it to dummy variables or drop it.
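The dummy-variable option mentioned here is what the processing section does later; as a minimal sketch on a hypothetical mini-column:

```python
import pandas as pd

# one indicator column per port; the original column can then be dropped
ports = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
dummies = pd.get_dummies(ports['Embarked'], prefix='Embarked')
print(sorted(dummies.columns))
```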

Pclass

train.groupby('Pclass')['Survived'].value_counts()
Pclass  Survived
1       1           136
        0            80
2       0            97
        1            87
3       0           372
        1           119
Name: Survived, dtype: int64
plt.subplots(figsize=(8,6))
f = sns.countplot('Pclass',hue='Survived',data=train)

(figure: survival counts by Pclass)

sns.factorplot('Pclass','Survived',hue='Sex',data=train)
plt.show()

(figure: survival rate by Pclass, split by sex)

Classes 1 and 2 have clearly higher survival rates: over half of class 1 survived and class 2 is roughly even, and women in classes 1 and 2 survived at rates close to 1. Cabin class therefore has a strong effect on survival.

SibSp

train[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
SibSp Survived
1 1 0.535885
2 2 0.464286
0 0 0.345395
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000
sns.factorplot('SibSp','Survived',data=train)
plt.show()

(figure: survival rate by SibSp)

#pd.pivot_table(train,values='Survived',index='SibSp',columns='Pclass')
sns.countplot(x='SibSp',hue='Pclass',data=train)
plt.show()

(figure: Pclass counts by SibSp)

With no siblings or spouse aboard, the survival rate is about 0.3; with one companion it peaks above 0.5, perhaps because such passengers are more often in classes 1 and 2. Beyond that it falls as the count grows, likely because parties of three or more were concentrated in class 3, where their survival rate was very low.

Parch

#pd.pivot_table(train,values='Survived',index='Parch',columns='Pclass')
sns.countplot(x='Parch',hue='Pclass',data=train)
plt.show()

(figure: Pclass counts by Parch)

sns.factorplot('Parch','Survived',data=train)
plt.show()

(figure: survival rate by Parch)

The trend resembles SibSp: traveling alone means a lower survival rate, 1-3 parents/children means a higher one, and it drops sharply beyond that, since most such passengers came from class 3.

Age

train.groupby('Survived')['Age'].describe()
count mean std min 25% 50% 75% max
Survived
0 424.0 30.626179 14.172110 1.00 21.0 28.0 39.0 74.0
1 290.0 28.343690 14.950952 0.42 19.0 28.0 36.0 80.0
f,ax = plt.subplots(1,2,figsize=(16,6))
sns.violinplot('Pclass','Age',hue='Survived',data=train,split=True,ax=ax[0])
ax[0].set_title('Pclass Age & Survived')
sns.violinplot('Sex','Age',hue='Survived',data=train,split=True,ax=ax[1])
ax[1].set_title('Sex Age & Survived')
plt.show()

(figure: violin plots of Age vs. Survived, split by Pclass and by sex)

Survivors in class 1 skew somewhat younger, with a wide surviving age range; survival between roughly 20 and 50 is comparatively high there, perhaps because class 1 passengers are older on average. Children around age 10 show a clear survival bump in classes 2 and 3, and the same bump appears for boys. Surviving women are concentrated in young and middle adulthood, while adults aged roughly 20-40 account for the most deaths.

Name

Name is mainly useful for inferring sex, and for filling missing ages using passengers who share the same title.
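A simpler alternative worth knowing (not the notebook's approach, and it behaves differently on odd names such as the one that yields 'Mrs,L' below): a single `str.extract` pattern that grabs the word between the comma and the first period.

```python
import pandas as pd

# hypothetical sample names in the Titanic format "Last, Title. First ..."
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
])
# capture everything after the comma up to the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)
print(titles.tolist())
```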

#use a regular expression to extract the title from each name
def getTitle(data):

    name_sal = []
    for i in range(len(data['Name'])):
        name_sal.append(re.findall(r'.\w*\.',data.Name[i]))

    Salut = []
    for i in range(len(name_sal)):
        name = str(name_sal[i])
        name = name[1:-1].replace("'","")
        name = name.replace(".","").strip()
        name = name.replace(" ","")
        Salut.append(name)

    data['Title'] = Salut

getTitle(train)
train.head(2)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C Mrs
pd.crosstab(train['Title'],train['Sex'])
Sex female male
Title
Capt 0 1
Col 0 2
Countess 1 0
Don 0 1
Dr 1 6
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 40
Miss 182 0
Mlle 2 0
Mme 1 0
Mr 0 517
Mrs 124 0
Mrs,L 1 0
Ms 1 0
Rev 0 6
Sir 0 1

A quick English lesson: Mme: address for "upper-class" married or professional women of non-English nationality, equivalent to Mrs; Jonkheer: country squire; Capt: captain; Lady: noblewoman; Don: Spanish honorific for nobles and men of standing; the Countess: countess; Ms: a woman of unspecified marital status; Col: colonel; Major: major; Mlle: mademoiselle (miss); Rev: reverend.

Fare

train.groupby('Pclass')['Fare'].mean()
Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64
sns.distplot(train['Fare'].dropna())
plt.xlim((0,200))
plt.xticks(np.arange(0,200,10))
plt.show()

(figure: distribution of Fare)

Preliminary conclusions:
- Women survived at a much higher rate than men.
- First class had a very high survival rate and class 3 a very low one; women in classes 1 and 2 survived at rates close to 1.
- Children around age 10 show a clear survival bump.
- SibSp and Parch behave alike: traveling alone means lower survival, 1-2 siblings/spouses or 1-3 parents/children means higher, and larger parties fare much worse.
- Name and Age can be processed across the full data set: extract the title from Name, then fill missing ages with the per-title mean.

Data processing

#combine the training and test sets
passID = test['PassengerId']
all_data = pd.concat([train,test],keys=["train","test"])
all_data.shape
#all_data.head()
(1309, 13)
#count missing values
NAs = pd.concat([train.isnull().sum(),
                 train.isnull().sum()/train.isnull().count(),
                 test.isnull().sum(),
                 test.isnull().sum()/test.isnull().count()],
                axis=1, keys=["train","percent_train","test","percent"])
NAs[NAs.sum(axis=1)>1].sort_values(by="percent",ascending=False)
train percent_train test percent
Cabin 687 0.771044 327.0 0.782297
Age 177 0.198653 86.0 0.205742
Fare 0 0.000000 1.0 0.002392
Embarked 2 0.002245 0.0 0.000000
#drop uninformative features
all_data.drop(['PassengerId','Cabin'],axis=1,inplace=True)

all_data.head(2)
Age Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket Title
train 0 22.0 S 7.2500 Braund, Mr. Owen Harris 0 3 male 1 0.0 A/5 21171 Mr
1 38.0 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th… 0 1 female 1 1.0 PC 17599 Mrs

Handling Age

#first extract the title from Name
getTitle(all_data)
pd.crosstab(all_data['Title'], all_data['Sex'])
Sex female male
Title
Capt 0 1
Col 0 4
Countess 1 0
Don 0 1
Dona 1 0
Dr 1 7
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 61
Miss 260 0
Mlle 2 0
Mme 1 0
Mr 0 757
Mrs 196 0
Mrs,L 1 0
Ms 2 0
Rev 0 8
Sir 0 1

all_data['Title'] = all_data['Title'].replace(
    ['Lady','Dr','Dona','Mme','Countess'],'Mrs')
all_data['Title'] =all_data['Title'].replace('Mlle','Miss')
all_data['Title'] =all_data['Title'].replace('Mrs,L','Mrs')
all_data['Title'] = all_data['Title'].replace('Ms', 'Miss')
#all_data['Title'] = all_data['Title'].replace('Mme', 'Mrs')
all_data['Title'] = all_data['Title'].replace(['Capt','Col','Don','Major','Rev','Jonkheer','Sir'],'Mr')
'''
all_data['Title'] = all_data.Title.replace({'Mlle':'Miss','Mme':'Mrs','Ms':'Miss','Dr':'Mrs',
                        'Major':'Mr','Lady':'Mrs','Countess':'Mrs',
                        'Jonkheer':'Mr','Col':'Mr','Rev':'Mr',
                        'Capt':'Mr','Sir':'Mr','Don':'Mr','Mrs,L':'Mrs'})

'''
all_data.Title.isnull().sum()
0
all_data[:train.shape[0]].groupby('Title')['Age'].mean()
Title
Master     4.574167
Miss      21.845638
Mr        32.891990
Mrs       36.188034
Name: Age, dtype: float64
#fill with the mean age per title, computed on the training set
all_data.loc[(all_data.Age.isnull()) & (all_data.Title=='Mr'),'Age']=32
all_data.loc[(all_data.Age.isnull())&(all_data.Title=='Mrs'),'Age']=36
all_data.loc[(all_data.Age.isnull())&(all_data.Title=='Master'),'Age']=5
all_data.loc[(all_data.Age.isnull())&(all_data.Title=='Miss'),'Age']=22
#all_data.loc[(all_data.Age.isnull())&(all_data.Title=='other'),'Age']=46

all_data.Age.isnull().sum()
0
all_data[:train.shape[0]][['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
Title Survived
0 Master 0.575000
1 Miss 0.702703
2 Mr 0.158192
3 Mrs 0.777778
f,ax = plt.subplots(1,2,figsize=(16,6))
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Sex=='female','Age'],color='red',ax=ax[0])
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Sex=='male','Age'],color='blue',ax=ax[0])

sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==0,'Age' ],
                 color='red', label='Not Survived', ax=ax[1])
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==1,'Age' ],
                 color='blue', label='Survived', ax=ax[1])
plt.legend(loc='best')
plt.show()

(figure: Age distributions by sex and by survival)

  • Children up to about 16 survived at higher rates, and the oldest passenger (80) survived
  • Large numbers of 16-40-year-olds did not survive
  • Most passengers were between 16 and 40
  • To help classification, bin Age into bands as a new feature, and add a child feature

add isChild

def male_female_child(passenger):
    # unpack age and sex
    age,sex = passenger
    # flag children separately from adult males/females
    if age < 16:
        return 'child'
    else:
        return sex
# create the new feature
all_data['person'] = all_data[['Age','Sex']].apply(male_female_child,axis=1)
# ages run 0-80; split into three bands: young, middle-aged, old

all_data['Age_band']=0
all_data.loc[all_data['Age']<=16,'Age_band']=0
all_data.loc[(all_data['Age']>16)&(all_data['Age']<=40),'Age_band']=1
all_data.loc[all_data['Age']>40,'Age_band']=2

Handling Name

df = pd.get_dummies(all_data['Title'],prefix='Title')
all_data = pd.concat([all_data,df],axis=1)
all_data.drop('Title',axis=1,inplace=True)
#drop name
all_data.drop('Name',axis=1,inplace=True)

Fill missing Embarked

all_data.loc[all_data.Embarked.isnull()]
Age Embarked Fare Parch Pclass Sex SibSp Survived Ticket Title person Age_band
train 61 38.0 NaN 80.0 0 1 female 0 1.0 113572 2 female 1
829 62.0 NaN 80.0 0 1 female 0 1.0 113572 3 female 2

Fare 80, first class: most likely boarded at port C.
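That guess can be sanity-checked by comparing typical first-class fares per port; on the real data the check would be `train[train.Pclass==1].groupby('Embarked')['Fare'].median()`. A sketch on hypothetical fare values:

```python
import pandas as pd

# hypothetical first-class fares by port; C's median sits near 80 here
fares = pd.DataFrame({'Embarked': ['C', 'C', 'Q', 'S', 'S'],
                      'Fare':     [79.0, 81.0, 30.0, 52.0, 26.0]})
median_fare = fares.groupby('Embarked')['Fare'].median()
print(median_fare)
```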

all_data['Embarked'].fillna('C',inplace=True)

all_data.Embarked.isnull().any()
False
embark_dummy = pd.get_dummies(all_data.Embarked)
all_data = pd.concat([all_data,embark_dummy],axis=1)
all_data.head(2)
Age Embarked Fare Parch Pclass Sex SibSp Survived Ticket person Age_band Title_Master Title_Miss Title_Mr Title_Mrs C Q S
train 0 22.0 S 7.2500 0 3 male 1 0.0 A/5 21171 male 1 0 0 1 0 0 0 1
1 38.0 C 71.2833 0 1 female 1 1.0 PC 17599 female 1 0 0 0 1 1 0 0

Combine SibSp and Parch

#create two new features: family size and alone
all_data['Family_size'] = all_data['SibSp']+all_data['Parch']# total relatives aboard
all_data['alone'] = 0# 0 = traveling with family
all_data.loc[all_data.Family_size==0,'alone']=1# 1 = traveling alone
f,ax=plt.subplots(1,2,figsize=(16,6))
sns.factorplot('Family_size','Survived',data=all_data[:train.shape[0]],ax=ax[0])
ax[0].set_title('Family_size vs Survived')
sns.factorplot('alone','Survived',data=all_data[:train.shape[0]],ax=ax[1])
ax[1].set_title('alone vs Survived')
plt.close(2)
plt.close(3)
plt.show()

(figure: survival rate by Family_size and by alone)

Passengers traveling alone survived at only about 0.3; with 1-3 family members the rate rises, but beyond 4 it drops sharply again.

#then bin family size
all_data['Family_size'] = np.where(all_data['Family_size']==0, 'solo',
                                    np.where(all_data['Family_size']<=3, 'normal', 'big'))
sns.factorplot('alone','Survived',hue='Sex',data=all_data[:train.shape[0]],col='Pclass')
plt.show()

(figure: survival rate by alone, split by sex and Pclass)

For women in classes 1 and 2, traveling alone barely affects survival; for class 3 women, being alone actually raises the survival rate.

all_data['poor_girl'] = 0
all_data.loc[(all_data['Sex']=='female')&(all_data['Pclass']==3)&(all_data['alone']==1),'poor_girl']=1

Filling and binning the continuous Fare

#fill the missing fare with the per-class mean fares found earlier
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==1),'Fare']=84
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==2),'Fare']=21
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==3),'Fare']=14
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==0,'Fare' ],
                 color='red', label='Not Survived')
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==1,'Fare' ],
                 color='blue', label='Survived')
plt.xlim((0,100))
(0, 100)

(figure: Fare distributions for survived vs. not survived)

sns.lmplot('Fare','Survived',data=all_data[:train.shape[0]])
plt.show()

(figure: regression plot of Survived vs. Fare)

#split Fare into 3 equal-frequency bins and look at mean survival per bin
all_data['Fare_band'] = pd.qcut(all_data['Fare'],3)

all_data[:train.shape[0]].groupby('Fare_band')['Survived'].mean()
Fare_band
(-0.001, 8.662]     0.198052
(8.662, 26.0]       0.402778
(26.0, 512.329]     0.559322
Name: Survived, dtype: float64
#discretize the continuous Fare into those bins

all_data['Fare_cut'] = 0
all_data.loc[all_data['Fare']<=8.662,'Fare_cut'] = 0
all_data.loc[((all_data['Fare']>8.662) & (all_data['Fare']<=26)),'Fare_cut'] = 1
#all_data.loc[((all_data['Fare']>14.454) & (all_data['Fare']<=31.275)),'Fare_cut'] = 2
all_data.loc[((all_data['Fare']>26) & (all_data['Fare']<513)),'Fare_cut'] = 2
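The three .loc assignments above can also be written as a single pd.cut using the qcut edges found earlier, sketched here on a few hypothetical fares:

```python
import pandas as pd

# same three-way split as the manual assignments, via explicit bin edges;
# labels=False returns the bin index (0, 1, 2) directly
fare = pd.Series([5.0, 10.0, 30.0, 512.0])
fare_cut = pd.cut(fare, bins=[-0.001, 8.662, 26.0, 513.0], labels=False)
print(fare_cut.tolist())
```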

sns.factorplot('Fare_cut','Survived',hue='Sex',data=all_data[:train.shape[0]])
plt.show()

(figure: survival rate by Fare_cut, split by sex)

Survival rises with fare, especially for men.

# create a feature flagging wealthy men
all_data['rich_man'] = 0
all_data.loc[((all_data['Fare']>=80) & (all_data['Sex']=='male')),'rich_man'] = 1

Encoding categorical features numerically

all_data.head()
Age Embarked Fare Parch Pclass Sex SibSp Survived Ticket person Title_Mrs C Q S Family_size alone poor_girl Fare_band Fare_cut rich_man
train 0 22.0 S 7.2500 0 3 male 1 0.0 A/5 21171 male 0 0 0 1 normal 0 0 (-0.001, 8.662] 0 0
1 38.0 C 71.2833 0 1 female 1 1.0 PC 17599 female 1 1 0 0 normal 0 0 (26.0, 512.329] 2 0
2 26.0 S 7.9250 0 3 female 0 1.0 STON/O2. 3101282 female 0 0 0 1 solo 1 1 (-0.001, 8.662] 0 0
3 35.0 S 53.1000 0 1 female 1 1.0 113803 female 1 0 0 1 normal 0 0 (26.0, 512.329] 2 0
4 35.0 S 8.0500 0 3 male 0 0.0 373450 male 0 0 0 1 solo 1 0 (-0.001, 8.662] 0 0

5 rows × 24 columns

Features to drop: Embarked (already one-hot encoded; keep Q and S and drop the redundant C dummy), Fare and Fare_band (replaced by Fare_cut), Sex (replaced by person), Age (replaced by Age_band), Ticket, SibSp, Parch.

'''
Drop features no longer needed: Age (Age_band replaces it),
Fare (Fare_cut replaces Fare and Fare_band),
Ticket carries no signal
'''
#all_data.drop(['Age','Fare','Fare_band','Ticket'],axis=1,inplace=True)
#all_data.drop(['Age','Fare','Fare_band','Ticket','Embarked','C'],axis=1,inplace=True)
all_data.drop(['Age','Fare','Ticket','Embarked','C','Fare_band','SibSp','Parch'],axis=1,inplace=True)
all_data.head(2)
Pclass Sex Survived person Age_band Title_Master Title_Miss Title_Mr Title_Mrs Q S Family_size alone poor_girl Fare_cut rich_man
train 0 3 male 0.0 male 1 0 0 1 0 0 1 normal 0 0 0 0
1 1 female 1.0 female 1 0 0 0 1 0 0 normal 0 0 2 0
df1 = pd.get_dummies(all_data['Family_size'],prefix='Family_size')
df2 = pd.get_dummies(all_data['person'],prefix='person')
df3 = pd.get_dummies(all_data['Age_band'],prefix='age')
all_data = pd.concat([all_data,df1,df2,df3],axis=1)
all_data.head()
Pclass Sex Survived person Age_band Title_Master Title_Miss Title_Mr Title_Mrs Q rich_man Family_size_big Family_size_normal Family_size_solo person_child person_female person_male age_0 age_1 age_2
train 0 3 male 0.0 male 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0
1 1 female 1.0 female 1 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0
2 3 female 1.0 female 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0
3 1 female 1.0 female 1 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0
4 3 male 0.0 male 1 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0

5 rows × 25 columns

all_data.drop(['Sex','person','Age_band','Family_size'],axis=1,inplace=True)
all_data.head()
Pclass Survived Title_Master Title_Miss Title_Mr Title_Mrs Q S alone poor_girl rich_man Family_size_big Family_size_normal Family_size_solo person_child person_female person_male age_0 age_1 age_2
train 0 3 0.0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0
1 1 1.0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0
2 3 1.0 0 1 0 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0
3 1 1.0 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 1 0
4 3 0.0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 0

5 rows × 21 columns

Building models

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix# matrix of predictions vs. targets
from sklearn.model_selection import cross_val_predict# returns cross-validated predictions

from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
train_data = all_data[:train.shape[0]]
test_data = all_data[train.shape[0]:]
print('train data:'+str(train_data.shape))
print('test data:'+str(test_data.shape))
train data:(891, 21)
test data:(418, 21)

# split a hold-out validation set; use fresh names so the original
# train/test DataFrames are not shadowed
tr, val = train_test_split(train_data,test_size = 0.25, random_state=0,stratify=train_data['Survived'])
train_x = tr.drop('Survived',axis=1)
train_y = tr['Survived']
test_x = val.drop('Survived',axis=1)
test_y = val['Survived']
print(train_x.shape)
print(test_x.shape)
(668, 20)
(223, 20)
# define score on train and test data
def cv_score(model):
    cv_result = cross_val_score(model,train_x,train_y,cv=10,scoring = "accuracy")
    return(cv_result)

def cv_score_test(model):
    cv_result_test = cross_val_score(model,test_x,test_y,cv=10,scoring = "accuracy")
    return(cv_result_test)
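Note that cv_score_test re-runs cross-validation on the holdout, re-fitting the model on holdout folds, rather than scoring the train-fit model once. The more conventional holdout check would look like this (a sketch on synthetic data, not the notebook's features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in: the label depends on the first column only
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# fit once on the training split, score once on the untouched holdout
model = LogisticRegression().fit(X_tr, y_tr)
holdout_acc = model.score(X_te, y_te)
print(holdout_acc)
```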

RBF SVM

# RBF SVM model

param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1], }
clf_svc = GridSearchCV(svm.SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf_svc = clf_svc.fit(train_x, train_y)
print("Best estimator found by grid search:")
print(clf_svc.best_estimator_)
acc_svc_train = cv_score(clf_svc.best_estimator_).mean()
acc_svc_test = cv_score_test(clf_svc.best_estimator_).mean()
print(acc_svc_train)
print(acc_svc_test)
Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
0.826306967835
0.816196122718

Decision tree

#a simple tree

clf_tree = DecisionTreeClassifier()
clf_tree.fit(train_x,train_y)
acc_tree_train = cv_score(clf_tree).mean()
acc_tree_test = cv_score_test(clf_tree).mean()
print(acc_tree_train)
print(acc_tree_test)
0.808216271583
0.811631846414

KNN

#test n_neighbors 

pred = []
for i in range(1,11):
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(train_x,train_y)
    pred.append(cv_score(model).mean())
n = list(range(1,11))
plt.plot(n,pred)
plt.xticks(range(1,11))
plt.show()  

(figure: CV accuracy vs. n_neighbors)

clf_knn = KNeighborsClassifier(n_neighbors=4)
clf_knn.fit(train_x,train_y)
acc_knn_train = cv_score(clf_knn).mean()
acc_knn_test = cv_score_test(clf_knn).mean()
print(acc_knn_train)
print(acc_knn_test)
0.826239790353
0.829653679654

Logistic regression

#logistic regression

clf_LR = LogisticRegression()
clf_LR.fit(train_x,train_y)
acc_LR_train = cv_score(clf_LR).mean()
acc_LR_test = cv_score_test(clf_LR).mean()
print(acc_LR_train)
print(acc_LR_test)
0.838226647511
0.811848296631

Gaussian Naive Bayes



clf_gb = GaussianNB()
clf_gb.fit(train_x,train_y)
acc_gb_train = cv_score(clf_gb).mean()
acc_gb_test = cv_score_test(clf_gb).mean()
print(acc_gb_train)
print(acc_gb_test)
0.794959693511
0.789695087521

Random forest



n_estimators = range(100,1000,100)
grid = {'n_estimators':n_estimators}

clf_forest = GridSearchCV(RandomForestClassifier(random_state=0),param_grid=grid,verbose=True)
clf_forest.fit(train_x,train_y)
print(clf_forest.best_estimator_)
print(clf_forest.best_score_)
#print(cv_score(clf_forest).mean())
#print(cv_score_test(clf_forest).mean())
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 32.2s finished
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
        max_depth=None, max_features='auto', max_leaf_nodes=None,
        min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2,
        min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
        oob_score=False, random_state=0, verbose=0, warm_start=False)
0.817365269461
clf_forest = RandomForestClassifier(n_estimators=200)
clf_forest.fit(train_x,train_y)
acc_forest_train = cv_score(clf_forest).mean()
acc_forest_test = cv_score_test(clf_forest).mean()
print(acc_forest_train)
print(acc_forest_test)
0.811178066885
0.811434217956
pd.Series(clf_forest.feature_importances_,train_x.columns).sort_values(ascending=True).plot.barh(width=0.8)
plt.show()

(figure: random forest feature importances)


models = pd.DataFrame({
    'model':['SVM','Decision Tree','KNN','Logistic regression','Gaussian Bayes','Random Forest'],
    'score on train':[acc_svc_train,acc_tree_train,acc_knn_train,acc_LR_train,acc_gb_train,acc_forest_train],
    'score on test':[acc_svc_test,acc_tree_test,acc_knn_test,acc_LR_test,acc_gb_test,acc_forest_test]
})
models.sort_values(by='score on test', ascending=False)
model score on test score on train
2 KNN 0.829654 0.826240
0 SVM 0.816196 0.826307
3 Logistic regression 0.811848 0.838227
1 Decision Tree 0.811632 0.808216
5 Random Forest 0.811434 0.811178
4 Gaussian Bayes 0.789695 0.794960

Ensemble

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import GradientBoostingClassifier

# bagging with the tuned SVM as the base estimator
from sklearn.ensemble import BaggingClassifier
bag_tree = BaggingClassifier(base_estimator=clf_svc.best_estimator_,n_estimators=200,random_state=0)
bag_tree.fit(train_x,train_y)
acc_bagtree_train = cv_score(bag_tree).mean()
acc_bagtree_test =cv_score_test(bag_tree).mean()
print(acc_bagtree_train)
print(acc_bagtree_test)
0.82782211935
0.816196122718

AdaBoost

n_estimators = range(100,1000,100)
a = [0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
grid = {'n_estimators':n_estimators,'learning_rate':a}
ada = GridSearchCV(AdaBoostClassifier(),param_grid=grid,verbose=True)
ada.fit(train_x,train_y)
print(ada.best_estimator_)
print(ada.best_score_)
#acc_ada_train = cv_score(ada).mean()
#acc_ada_test = cv_score_test(ada).mean()

#print(acc_ada_train)
#print(acc_ada_test)
Fitting 3 folds for each of 90 candidates, totalling 270 fits


[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed:  5.4min finished


AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.05, n_estimators=200, random_state=None)
0.835329341317
ada = AdaBoostClassifier(n_estimators=200,random_state=0,learning_rate=0.2)
ada.fit(train_x,train_y)

acc_ada_train = cv_score(ada).mean()
acc_ada_test = cv_score_test(ada).mean()

print(acc_ada_train)
print(acc_ada_test)
0.829248144305
0.825719932242
#confusion matrix to inspect the predictions

y_pred = cross_val_predict(ada,test_x,test_y,cv=10)
sns.heatmap(confusion_matrix(test_y,y_pred),cmap='winter',annot=True,fmt='2.0f')
plt.show()

(figure: confusion matrix heatmap)

GradientBoosting


n_estimators = range(100,1000,100)
a = [0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
grid = {'n_estimators':n_estimators,'learning_rate':a}
grad = GridSearchCV(GradientBoostingClassifier(),param_grid=grid,verbose=True)
grad.fit(train_x,train_y)
print(grad.best_estimator_)
print(grad.best_score_)
Fitting 3 folds for each of 90 candidates, totalling 270 fits


[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed:  2.4min finished


GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.05, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=200, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)
0.824850299401
#use the best gradient boosting settings found by the grid search

clf_grad=GradientBoostingClassifier(n_estimators=200,random_state=0,learning_rate=0.05)
clf_grad.fit(train_x,train_y)
acc_grad_train = cv_score(clf_grad).mean()
acc_grad_test = cv_score_test(clf_grad).mean()

print(acc_grad_train)
print(acc_grad_test)
0.818709926304
0.807500470544
from sklearn.metrics import precision_score
class Ensemble(object):
    """Simple stacking: fit the base estimators, then learn a logistic
    regression on top of their predictions."""

    def __init__(self,estimators):
        self.estimator_names = []
        self.estimators = []
        for i in estimators:
            self.estimator_names.append(i[0])
            self.estimators.append(i[1])
        self.clf = LogisticRegression()

    def fit(self, train_x, train_y):
        for i in self.estimators:
            i.fit(train_x,train_y)
        x = np.array([i.predict(train_x) for i in self.estimators]).T
        y = train_y
        self.clf.fit(x, y)

    def predict(self,x):
        x = np.array([i.predict(x) for i in self.estimators]).T
        #print(x)
        return self.clf.predict(x)


    def score(self,x,y):
        # note: this reports precision, not accuracy
        s = precision_score(y,self.predict(x))
        return s
ensem = Ensemble([('Ada',ada),('Bag',bag_tree),('SVM',clf_svc.best_estimator_),('LR',clf_LR),('gbdt',clf_grad)])
score = 0
for i in range(0,10):
    ensem.fit(train_x, train_y)
    sco = round(ensem.score(test_x,test_y) * 100, 2)
    score+=sco
print(score/10)
89.83
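sklearn's VotingClassifier (imported earlier but never used) packages a related idea, hard majority voting rather than the stacking done above, in much less code. A sketch on synthetic data, not the notebook's features:

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# synthetic two-class problem
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# hard voting: each base model gets one vote per sample
vote = VotingClassifier(estimators=[('lr', LogisticRegression()),
                                    ('tree', DecisionTreeClassifier(random_state=0))],
                        voting='hard')
vote.fit(X, y)
print(vote.score(X, y))
```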

Submission

pre = ensem.predict(test_data.drop('Survived',axis=1))# drop the empty label column before predicting
submission = pd.DataFrame({'PassengerId':passID,'Survived':pre.astype(int)})
submission.to_csv('submission.csv',index=False)

Judging by the submitted results, the ensemble gives no clear gain over the single models. Likely causes: the base models are strongly correlated, the training data is limited, and the one-hot encoding may introduce collinearity. Although the train and holdout scores were close, the leaderboard score dropped noticeably, probably because the data are limited, training is insufficient, and the features are few and strongly correlated; introducing more features could help.
