About this Dataset

In this challenge, you are given a list of users along with their demographics, web session records, and some summary statistics. You are asked to predict which country a new user’s first booking destination will be. All the users in this dataset are from the USA.
There are 12 possible outcomes for the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, ‘DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’. Note that ‘NDF’ is different from ‘other’: ‘other’ means there was a booking, but to a country not included in the list, while ‘NDF’ means there was no booking at all.
train_users_2.csv - the training set of users
* id: user id
* date_account_created: the date of account creation
* timestamp_first_active: timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
* date_first_booking: date of first booking
* gender
* age
* signup_method
* signup_flow: the page a user came to sign up from
* language: international language preference
* affiliate_channel: what kind of paid marketing
* affiliate_provider: where the marketing is, e.g. google, craigslist, other
* first_affiliate_tracked: the first marketing channel the user interacted with before signing up
* signup_app
* first_device_type
* first_browser
* country_destination: the target variable you are to predict
test_users.csv - the test set of users
sessions.csv - web sessions log for users
* user_id: to be joined with the column ‘id’ in the users table (see the merge sketch after this list)
* action: the action taken by the user
* action_type: the type of the action
* action_detail: the detail of the action
* device_type
* secs_elapsed: seconds elapsed (dwell time)
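The two tables are linked through the user id; a minimal sketch of the join (using the CSV paths from later in this notebook):

import pandas as pd

# One row per session record, with that user's profile columns attached
users = pd.read_csv("../input/train_users_2.csv")
sessions = pd.read_csv("../input/sessions.csv")
merged = users.merge(sessions, left_on="id", right_on="user_id", how="left")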
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pickle
import datetime
import os
train = pd.read_csv("../input/train_users_2.csv")
test = pd.read_csv("../input/test_users.csv")
train.head()
| | id | date_account_created | timestamp_first_active | date_first_booking | gender | age | signup_method | signup_flow | language | affiliate_channel | affiliate_provider | first_affiliate_tracked | signup_app | first_device_type | first_browser | country_destination |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gxn3p5htnn | 2010-06-28 | 20090319043255 | NaN | -unknown- | NaN | facebook | 0 | en | direct | direct | untracked | Web | Mac Desktop | Chrome | NDF |
| 1 | 820tgsjxq7 | 2011-05-25 | 20090523174809 | NaN | MALE | 38.0 | facebook | 0 | en | seo | google | untracked | Web | Mac Desktop | Chrome | NDF |
| 2 | 4ft3gnwmtx | 2010-09-28 | 20090609231247 | 2010-08-02 | FEMALE | 56.0 | basic | 3 | en | direct | direct | untracked | Web | Windows Desktop | IE | US |
| 3 | bjjt8pjhuk | 2011-12-05 | 20091031060129 | 2012-09-08 | FEMALE | 42.0 | facebook | 0 | en | direct | direct | untracked | Web | Mac Desktop | Firefox | other |
| 4 | 87mebub9p4 | 2010-09-14 | 20091208061105 | 2010-02-18 | -unknown- | 41.0 | basic | 0 | en | direct | direct | untracked | Web | Mac Desktop | Chrome | US |
print("Column names for training dataset : ")
for column in train.columns:
print("-", column)
Column names for training dataset :
- id
- date_account_created
- timestamp_first_active
- date_first_booking
- gender
- age
- signup_method
- signup_flow
- language
- affiliate_channel
- affiliate_provider
- first_affiliate_tracked
- signup_app
- first_device_type
- first_browser
- country_destination
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213451 entries, 0 to 213450
Data columns (total 16 columns):
id 213451 non-null object
date_account_created 213451 non-null object
timestamp_first_active 213451 non-null int64
date_first_booking 88908 non-null object
gender 213451 non-null object
age 125461 non-null float64
signup_method 213451 non-null object
signup_flow 213451 non-null int64
language 213451 non-null object
affiliate_channel 213451 non-null object
affiliate_provider 213451 non-null object
first_affiliate_tracked 207386 non-null object
signup_app 213451 non-null object
first_device_type 213451 non-null object
first_browser 213451 non-null object
country_destination 213451 non-null object
dtypes: float64(1), int64(2), object(13)
memory usage: 26.1+ MB
1. The train file contains 213,451 rows and 16 features.
2. The output above shows each feature's data type and null counts.
3. age has many nulls; during feature extraction, consider treating null as a category of its own.
4. date_first_booking has many nulls; consider dropping it during feature extraction (a quick check of the null ratios follows).
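A quick check of the per-column null ratios on the train frame loaded above:

# Fraction of missing values per column, largest first
print(train.isnull().mean().sort_values(ascending=False).head())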
### Exploratory Analysis
dac_train = pd.to_datetime(train.date_account_created).value_counts()
dac_test = pd.to_datetime(test.date_account_created).value_counts()
# Days elapsed relative to the earliest account-creation date in the training set
dac_train_days = dac_train.index - dac_train.index.min()
dac_test_days = dac_test.index - dac_train.index.min()
plt.figure(figsize=[10,6])
plt.scatter(dac_train_days.days, dac_train.values,color='b', label="train set")
plt.scatter(dac_test_days.days, dac_test.values,color='r', label="test set")
plt.title("Accounts Created vs. Days")
plt.xlabel('Days')
plt.ylabel('Accounts Created')
plt.legend(loc="best")
The number of accounts created rises sharply over time.
train.timestamp_first_active.head()
0 20090319043255
1 20090523174809
2 20090609231247
3 20091031060129
4 20091208061105
Name: timestamp_first_active, dtype: int64
# Convert the integer timestamp to datetime
tfa_train = train.timestamp_first_active.astype(str).apply(lambda x: datetime.datetime(int(x[:4]), int(x[4:6]), int(x[6:8]),
int(x[8:10]), int(x[10:12]), int(x[12:])))
tfa_test = test.timestamp_first_active.astype(str).apply(lambda x: datetime.datetime(int(x[:4]), int(x[4:6]), int(x[6:8]),
int(x[8:10]), int(x[10:12]), int(x[12:])))
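The manual slicing can also be written as a single pd.to_datetime call with an explicit format string; an equivalent sketch:

# Equivalent conversion using a format string instead of manual slicing
tfa_train = pd.to_datetime(train.timestamp_first_active.astype(str), format="%Y%m%d%H%M%S")
tfa_test = pd.to_datetime(test.timestamp_first_active.astype(str), format="%Y%m%d%H%M%S")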
# Days elapsed relative to the earliest first-active time in the training set
tfa_train_days = (tfa_train - tfa_train.min()).apply(lambda x: x.days).value_counts()
tfa_test_days = (tfa_test - tfa_train.min()).apply(lambda x: x.days).value_counts()
plt.figure(figsize=[10,6])
plt.scatter(tfa_train_days.index, tfa_train_days.values,color='b', label="train set")
plt.scatter(tfa_test_days.index, tfa_test_days.values,color='r', label="test set")
plt.title("User First Active vs. Days")
plt.xlabel('Days')
plt.ylabel('User First Active')
plt.legend(loc="best")
print(train.date_account_created.shape[0])
print(train.date_first_booking.isnull().sum())
print(test.date_account_created.shape[0])
print(test.date_first_booking.isnull().sum())
213451
124543
62096
62096
date_first_booking is null in more than 50% of the training rows (124,543 of 213,451) and in every test row, so we drop this field during feature extraction.
train.age.value_counts().head()
30.0 6124
31.0 6016
29.0 5963
28.0 5939
32.0 5855
Name: age, dtype: int64
User ages are concentrated around 30.
age_step = 10
def ageProcess(age):
    '''
    Pre-process and discretize the age field.
    1. Some values look like 19xx; presumably those users filled in their
       birth year instead of their age, so convert them.
    2. Discretize age into buckets of width age_step.
    '''
    if age >= 1900 and age < 2014:
        age = 2014 - age  # treat as a birth year (the dataset runs to 2014)
    if pd.isnull(age) or age < 0:
        return "NA"
    if age < age_step:
        return "< %d" % age_step
    for k in range(1, 9):
        if age < (k + 1) * age_step:
            return "%d ~ %d" % (k * age_step, (k + 1) * age_step)
    if age < 110:
        return "%d ~ 110" % (9 * age_step)
    return "non-physical"
age_train = train.age.apply(ageProcess)
age_test = test.age.apply(ageProcess)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 6))
ax1.bar(age_train.value_counts().index, age_train.value_counts().values)
ax2.bar(age_test.value_counts().index, age_test.value_counts().values)
def feature_barplot(df_train, df_test, feature, figsize=(12,6), rot=90, saveimg=False):
    feature_train_counts = df_train[feature].value_counts()
    feature_test_counts = df_test[feature].value_counts()
    fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=False, figsize=figsize)
    sns.barplot(x=feature_train_counts.index, y=feature_train_counts.values, ax=ax1)
    sns.barplot(x=feature_test_counts.index, y=feature_test_counts.values, ax=ax2)
    ax1.set_xticklabels(ax1.xaxis.get_majorticklabels(), rotation=rot)
    ax2.set_xticklabels(ax2.xaxis.get_majorticklabels(), rotation=rot)
    ax1.set_title(feature + ' of training set')
    ax2.set_title(feature + ' of test set')
    ax1.set_ylabel('Counts')
    plt.tight_layout()
    if saveimg:
        fig.savefig(feature + ".png", dpi=75)
For categorical fields, bar plots show the concrete distribution in the training and test sets.
feature_pool = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked',
'signup_app', 'first_device_type', 'first_browser']
for feature in feature_pool:
feature_barplot(train, test, feature)
### Feature Extraction
# Split the training set into features and label
X_train = train.loc[:, train.columns != 'country_destination'].copy()
y_train = train.loc[:, train.columns == 'country_destination'].copy()
X_test = test.copy()
# Concatenate train and test features so they can be processed together
X_train["train/test"] = "train"
X_test["train/test"] = "test"
X_total = pd.concat([X_train, X_test], axis=0, ignore_index=True)
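Tagging each row with a train/test flag makes the combined frame easy to split back once feature processing is done; a sketch of the eventual split (variable names are illustrative):

# Recover the two sets after joint feature processing
X_train_processed = X_total[X_total["train/test"] == "train"].drop("train/test", axis=1)
X_test_processed = X_total[X_total["train/test"] == "test"].drop("train/test", axis=1)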
Y = 2000  # reference year; 2000 is a leap year, so Feb 29 dates map correctly
season = [(0, (datetime.date(Y,1,1), datetime.date(Y,3,20))),
(1, (datetime.date(Y,3,21), datetime.date(Y,6,20))),
(2, (datetime.date(Y,6,21), datetime.date(Y,9,20))),
(3, (datetime.date(Y,9,21), datetime.date(Y,12,20))),
(0, (datetime.date(Y,12,21), datetime.date(Y,12,31)))]
def get_season(dt):
dt = datetime.date(Y, dt.month, dt.day)
for season_id, (start_date, end_date) in season:
if start_date <= dt and dt <= end_date:
return season_id
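A quick check of the lookup (illustrative date):

# July 4th falls in the Jun 21 - Sep 20 range, i.e. season 2
print(get_season(datetime.date(2014, 7, 4)))  # -> 2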
def create_dt_feature(df, feature):
year = "%s_year" % feature
month = "%s_month" % feature
day_of_month = "%s_day_of_month" % feature
weekday = "%s_weekday" % feature
season = "%s_season" % feature
df[year] = df[feature].apply(lambda x: x.year)
df[month] = df[feature].apply(lambda x: x.month)
df[day_of_month] = df[feature].apply(lambda x: x.day)
df[weekday] = df[feature].apply(lambda x: x.isoweekday())
df[season] = df[feature].apply(get_season)
return df
X_total["dac"] = pd.to_datetime(X_total["date_account_created"])
X_total = create_dt_feature(X_total, "dac")
X_total.drop(["date_account_created"], axis=1, inplace=True)
X_total.head()
| | id | timestamp_first_active | date_first_booking | gender | age | signup_method | signup_flow | language | affiliate_channel | affiliate_provider | … | signup_app | first_device_type | first_browser | train/test | dac | dac_year | dac_month | dac_day_of_month | dac_weekday | dac_season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gxn3p5htnn | 20090319043255 | NaN | -unknown- | NaN | facebook | 0 | en | direct | direct | … | Web | Mac Desktop | Chrome | train | 2010-06-28 | 2010 | 6 | 28 | 1 | 2 |
| 1 | 820tgsjxq7 | 20090523174809 | NaN | MALE | 38.0 | facebook | 0 | en | seo | google | … | Web | Mac Desktop | Chrome | train | 2011-05-25 | 2011 | 5 | 25 | 3 | 1 |
| 2 | 4ft3gnwmtx | 20090609231247 | 2010-08-02 | FEMALE | 56.0 | basic | 3 | en | direct | direct | … | Web | Windows Desktop | IE | train | 2010-09-28 | 2010 | 9 | 28 | 2 | 3 |
| 3 | bjjt8pjhuk | 20091031060129 | 2012-09-08 | FEMALE | 42.0 | facebook | 0 | en | direct | direct | … | Web | Mac Desktop | Firefox | train | 2011-12-05 | 2011 | 12 | 5 | 1 | 3 |
| 4 | 87mebub9p4 | 20091208061105 | 2010-02-18 | -unknown- | 41.0 | basic | 0 | en | direct | direct | … | Web | Mac Desktop | Chrome | train | 2010-09-14 | 2010 | 9 | 14 | 2 | 2 |
5 rows × 21 columns
X_total["tfa"] = X_total["timestamp_first_active"].astype("str").apply(lambda x: datetime.datetime(int(x[:4]),
int(x[4:6]),
int(x[6:8]),
int(x[8:10]),
int(x[10:12]),
int(x[12:])))
X_total = create_dt_feature(X_total, "tfa")
X_total.drop(["timestamp_first_active"], axis=1, inplace=True)
X_total.head()
| | id | date_first_booking | gender | age | signup_method | signup_flow | language | affiliate_channel | affiliate_provider | first_affiliate_tracked | … | dac_month | dac_day_of_month | dac_weekday | dac_season | tfa | tfa_year | tfa_month | tfa_day_of_month | tfa_weekday | tfa_season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gxn3p5htnn | NaN | -unknown- | NaN | facebook | 0 | en | direct | direct | untracked | … | 6 | 28 | 1 | 2 | 2009-03-19 04:32:55 | 2009 | 3 | 19 | 4 | 0 |
| 1 | 820tgsjxq7 | NaN | MALE | 38.0 | facebook | 0 | en | seo | google | untracked | … | 5 | 25 | 3 | 1 | 2009-05-23 17:48:09 | 2009 | 5 | 23 | 6 | 1 |
| 2 | 4ft3gnwmtx | 2010-08-02 | FEMALE | 56.0 | basic | 3 | en | direct | direct | untracked | … | 9 | 28 | 2 | 3 | 2009-06-09 23:12:47 | 2009 | 6 | 9 | 2 | 1 |
| 3 | bjjt8pjhuk | 2012-09-08 | FEMALE | 42.0 | facebook | 0 | en | direct | direct | untracked | … | 12 | 5 | 1 | 3 | 2009-10-31 06:01:29 | 2009 | 10 | 31 | 6 | 3 |
| 4 | 87mebub9p4 | 2010-02-18 | -unknown- | 41.0 | basic | 0 | en | direct | direct | untracked | … | 9 | 14 | 2 | 2 | 2009-12-08 06:11:05 | 2009 | 12 | 8 | 2 | 3 |
5 rows × 26 columns
X_total["dt_span"] = X_total['dac'].subtract(X_total["tfa"]).dt.days
X_total["dt_span"].value_counts().head()
-1 275369
0 7
6 4
5 4
1 4
Name: dt_span, dtype: int64
X_total.dt_span.describe()
count 275547.000000
mean -0.820423
std 10.516688
min -1.000000
25% -1.000000
50% -1.000000
75% -1.000000
max 1455.000000
Name: dt_span, dtype: float64
def get_span_class(dt_span):
    # dac is a bare date (midnight) while tfa carries a time of day, so a
    # same-day signup yields dt_span == -1, hence the majority value above
    if dt_span == -1:
        return "One day"
elif dt_span < 7:
return "One week"
elif dt_span < 30:
return "One month"
elif dt_span < 365:
return "One year"
else:
return "Over one year"
X_total["dt_span"] = X_total["dt_span"].apply(get_span_class)
X_total["age"] = X_total["age"].apply(ageProcess)
X_total["age"].value_counts()
NA 116866
30 ~ 40 58598
20 ~ 30 48776
40 ~ 50 24623
50 ~ 60 12774
60 ~ 70 6191
10 ~ 20 2984
90 ~ 110 1879
70 ~ 80 1472
non-physical 960
80 ~ 90 325
< 10 99
Name: age, dtype: int64
# Select the features to one-hot encode, including the dac, tfa, dt_span and age features generated above
feature_pool = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked',
'signup_app', 'first_device_type', 'first_browser']
feature_pool = feature_pool + ["dac_weekday", "dac_season", "tfa_weekday", "tfa_season", "dt_span","age"]
for feature in feature_pool:
df_temp = pd.get_dummies(X_total[feature], prefix=feature)
X_total = pd.concat([X_total, df_temp], axis=1)
X_total.drop(["dac", "tfa", "date_first_booking"], axis=1, inplace=True)
X_total.drop(feature_pool, axis=1, inplace=True)
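For reference, the explicit loop plus the final drop is equivalent to a single pandas call; a sketch (an alternative, not an additional step):

# Expand every column in feature_pool into dummy columns (prefixed with the
# original column name) and drop the originals in one call
X_total = pd.get_dummies(X_total, columns=feature_pool)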
X_total.shape[1]
200
session = pd.read_csv("../input/sessions.csv")
session.head()
| | user_id | action | action_type | action_detail | device_type | secs_elapsed |
|---|---|---|---|---|---|---|
| 0 | d1mm9tcy42 | lookup | NaN | NaN | Windows Desktop | 319.0 |
| 1 | d1mm9tcy42 | search_results | click | view_search_results | Windows Desktop | 67753.0 |
| 2 | d1mm9tcy42 | lookup | NaN | NaN | Windows Desktop | 301.0 |
| 3 | d1mm9tcy42 | search_results | click | view_search_results | Windows Desktop | 22141.0 |
| 4 | d1mm9tcy42 | lookup | NaN | NaN | Windows Desktop | 435.0 |
session["id"] = session["user_id"]
session.drop(["user_id"], inplace=True, axis=1)
session.head()
| | action | action_type | action_detail | device_type | secs_elapsed | id |
|---|---|---|---|---|---|---|
| 0 | lookup | NaN | NaN | Windows Desktop | 319.0 | d1mm9tcy42 |
| 1 | search_results | click | view_search_results | Windows Desktop | 67753.0 | d1mm9tcy42 |
| 2 | lookup | NaN | NaN | Windows Desktop | 301.0 | d1mm9tcy42 |
| 3 | search_results | click | view_search_results | Windows Desktop | 22141.0 | d1mm9tcy42 |
| 4 | lookup | NaN | NaN | Windows Desktop | 435.0 | d1mm9tcy42 |
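The copy-and-drop above is just a column rename; an equivalent one-step sketch:

# Equivalent: rename user_id to id in a single call (column order aside)
session = session.rename(columns={"user_id": "id"})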
session.isnull().sum()
action 79626
action_type 1126204
action_detail 1126204
device_type 0
secs_elapsed 136031
id 34496
dtype: int64
session.shape[0]
10567737
session.action.fillna("NAN", inplace=True)
session.action_type.fillna("NAN", inplace=True)
session.action_detail.fillna("NAN", inplace=True)
session.isnull().sum()
action 0
action_type 0
action_detail 0
device_type 0
secs_elapsed 136031
id 34496
dtype: int64
session1 = session.copy()
Refine the features action, action_detail, action_type, device_type and secs_elapsed:
- First group the session records by user id.
- action: count each user's total actions, the occurrences of each action category, and the number, mean and standard deviation of those per-category counts.
- action_detail: the occurrences of each action_detail category, plus the number, mean and standard deviation of the per-category counts.
- action_type: the occurrences of each action_type category, plus the number, mean and standard deviation of the per-category counts, and the total dwell time per action_type (log-transformed).
- device_type: the occurrences of each device_type category, plus the number, mean and standard deviation of the per-category counts.
- secs_elapsed: fill missing values with 0; compute each user's sum, mean, standard deviation and median (log-transformed), the ratio of the log sum to the number of records, and a histogram of the log-scaled secs_elapsed values.
A simplified version of this aggregation is sketched right after this list; the full loop follows it.
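As promised, a much-simplified sketch of the same idea using groupby/agg (this assumes pandas >= 0.25 for named aggregation and is not part of the pipeline); the explicit loop below computes the far richer per-category counts:

# Simplified per-user aggregates (sketch only)
agg_simple = session1.groupby("id").agg(
    n_records=("action", "size"),            # total session records per user
    n_unique_actions=("action", "nunique"),  # distinct actions per user
    secs_sum=("secs_elapsed", "sum"),        # total dwell time
    secs_mean=("secs_elapsed", "mean"),      # mean dwell time
)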
for feature in ["action", "action_detail", "action_type", "device_type"]:
print(len(session1[feature].value_counts().index))
360
156
11
14
# Collapse rare categories (1000 occurrences or fewer) into "other"
action_list = dict(session1.action.value_counts())
action_detail_list = dict(session1.action_detail.value_counts())
session1["action"] = session1["action"].apply(lambda x: x if action_list[x] > 1000 else "other")
session1["action_detail"] = session1["action_detail"].apply(lambda x: x if action_detail_list[x] > 1000 else "other")
print(len(session1.action.value_counts()))
print(len(session1.action_detail.value_counts()))
149
96
# value_counts().argsort() maps each category name to a unique integer index;
# since value_counts sorts by descending frequency, the most frequent category
# gets the highest index (see the head() output below)
f_act = session1.action.value_counts().argsort()
f_act_detail = session1.action_detail.value_counts().argsort()
f_act_type = session1.action_type.value_counts().argsort()
f_dev_type = session1.device_type.value_counts().argsort()
f_act.head()
show 148
index 147
search_results 146
personalize 145
search 144
Name: action, dtype: int64
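Only the uniqueness of these integer slots matters; an equivalent, more explicit mapping would be (illustrative name):

# Map each category name to its position in the frequency ranking
act_index = {a: i for i, a in enumerate(session1.action.value_counts().index)}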
dgr_session = session1.groupby(["id"])
samples = []
ln = len(dgr_session)
for g in dgr_session:
    gr = g[1]  # DataFrame of all session records for this id
    l = []  # temporary feature list
    # id
    l.append(g[0])
    # number of total actions
    l.append(len(gr))  # number of session records for this id
    # fill missing secs_elapsed with 0, then take the dwell-time values
    sev = gr.secs_elapsed.fillna(0).values
# action
    # occurrence count of each action for this user, plus stats over the distinct actions
c_act = [0] * len(f_act)
for i, v in enumerate(gr.action.values):
c_act[f_act[v]] += 1
_, c_act_uqc = np.unique(gr.action.values, return_counts=True)
    # number of distinct actions, and the mean and std of their occurrence counts
c_act += [len(c_act_uqc), np.mean(c_act_uqc), np.std(c_act_uqc)]
l = l + c_act
# action_type
    c_act_type = [0] * len(f_act_type)
    l_act_type = [0] * len(f_act_type)
for i, v in enumerate(gr.action_type.values):
c_act_type[f_act_type[v]] += 1
l_act_type[f_act_type[v]] += sev[i]
_, c_act_type_uqc = np.unique(gr.action_type.values, return_counts=True)
l_act_type = np.log(1 + np.array(l_act_type)).tolist()
c_act_type += [len(c_act_type_uqc), np.mean(c_act_type_uqc), np.std(c_act_type_uqc)]
l = l + c_act_type + l_act_type
# action_detail
    c_act_detail = [0] * len(f_act_detail)
for i, v in enumerate(gr.action_detail.values):
c_act_detail[f_act_detail[v]] += 1
_, c_act_detail_uqc = np.unique(gr.action_detail.values, return_counts=True)
c_act_detail += [len(c_act_detail_uqc), np.mean(c_act_detail_uqc), np.std(c_act_detail_uqc)]
l = l + c_act_detail
# device type
    c_dev_type = [0] * len(f_dev_type)
for i, v in enumerate(gr.device_type.values):
c_dev_type[f_dev_type[v]] += 1
c_dev_type.append(len(np.unique(gr.device_type.values)))
_, c_dev_type_uqc = np.unique(gr.device_type.values, return_counts=True)
c_dev_type += [len(c_dev_type_uqc), np.mean(c_dev_type_uqc), np.std(c_dev_type_uqc)]
l = l + c_dev_type
    # secs_elapsed (dwell time) features
l_secs = [0] * 5
l_log = [0] * 15
if len(sev) > 0:
        # Simple statistics about the secs_elapsed values
l_secs[0] = np.log(1 + np.sum(sev))
l_secs[1] = np.log(1 + np.mean(sev))
l_secs[2] = np.log(1 + np.std(sev))
l_secs[3] = np.log(1 + np.median(sev))
        l_secs[4] = l_secs[0] / float(l[1])  # log total dwell time per session record
        # bucket the log-scaled dwell times; clip defensively so bincount
        # always returns exactly the 15 slots that l_log was initialized with
        log_sev = np.clip(np.log(1 + sev).astype(int), 0, 14)
        l_log = np.bincount(log_sev, minlength=15).tolist()
l = l + l_secs + l_log
samples.append(l)
samples = np.array(samples)
samp_ar = samples[:, 1:].astype(np.float16)
samp_id = samples[:, 0]
col_names = []
for i in range(len(samples[0])-1):
col_names.append('c_' + str(i))
df_agg_sess = pd.DataFrame(samp_ar, columns=col_names)
df_agg_sess['id'] = samp_id
df_agg_sess.index = df_agg_sess.id
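pickle is imported at the top but unused so far; presumably the aggregated session features get persisted for the modeling stage. A minimal sketch, with a hypothetical file name:

# Save the aggregated session features for later use
# ("df_agg_sess.pkl" is a hypothetical file name)
with open("df_agg_sess.pkl", "wb") as f:
    pickle.dump(df_agg_sess, f)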
# To be continued