Predicting New Airbnb Users' First Booking Destinations

1. Background

About this dataset: in this challenge, you are given a list of users along with their demographics, web session records, and some summary statistics. You are asked to predict which country a new user's first booking destination will be. All the users in this dataset are from the USA.

There are 12 possible outcomes for the destination country: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL', 'DE', 'AU', 'NDF' (no destination found), and 'other'. Note that 'NDF' is different from 'other': 'other' means there was a booking, but to a country not included in the list, while 'NDF' means there was no booking at all.

2. Data Description

train_users_2.csv - the training set of users
* id: user id
* date_account_created: the date of account creation
* timestamp_first_active: timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
* date_first_booking: date of the first booking
* gender
* age
* signup_method: how the account was created
* signup_flow: the page a user came to sign up from
* language: international language preference
* affiliate_channel: what kind of paid marketing
* affiliate_provider: where the marketing is, e.g. google, craigslist, other
* first_affiliate_tracked: the first marketing channel the user interacted with before signing up
* signup_app: the app used to sign up
* first_device_type: type of the first device used
* first_browser: type of the first browser used
* country_destination: the target variable to predict
test_users.csv - the test set of users
sessions.csv - web session logs for users
* user_id: to be joined with the column 'id' in the users tables
* action: the user action
* action_type: type of the action
* action_detail: detail of the action
* device_type: device type
* secs_elapsed: time spent (seconds)

3. Exploratory Analysis and Feature Engineering

3.1 The train_users_2.csv and test_users.csv files

Import the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pickle
import datetime
import os

Read the files

train = pd.read_csv("../input/train_users_2.csv")
test = pd.read_csv("../input/test_users.csv")
train.head()
id date_account_created timestamp_first_active date_first_booking gender age signup_method signup_flow language affiliate_channel affiliate_provider first_affiliate_tracked signup_app first_device_type first_browser country_destination
0 gxn3p5htnn 2010-06-28 20090319043255 NaN -unknown- NaN facebook 0 en direct direct untracked Web Mac Desktop Chrome NDF
1 820tgsjxq7 2011-05-25 20090523174809 NaN MALE 38.0 facebook 0 en seo google untracked Web Mac Desktop Chrome NDF
2 4ft3gnwmtx 2010-09-28 20090609231247 2010-08-02 FEMALE 56.0 basic 3 en direct direct untracked Web Windows Desktop IE US
3 bjjt8pjhuk 2011-12-05 20091031060129 2012-09-08 FEMALE 42.0 facebook 0 en direct direct untracked Web Mac Desktop Firefox other
4 87mebub9p4 2010-09-14 20091208061105 2010-02-18 -unknown- 41.0 basic 0 en direct direct untracked Web Mac Desktop Chrome US

List the features contained in the data

print("Column names for training dataset : ")
for column in train.columns:
    print("-", column)
    Column names for training dataset : 
    - id
    - date_account_created
    - timestamp_first_active
    - date_first_booking
    - gender
    - age
    - signup_method
    - signup_flow
    - language
    - affiliate_channel
    - affiliate_provider
    - first_affiliate_tracked
    - signup_app
    - first_device_type
    - first_browser
    - country_destination

Inspect the data

train.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 213451 entries, 0 to 213450
    Data columns (total 16 columns):
    id                         213451 non-null object
    date_account_created       213451 non-null object
    timestamp_first_active     213451 non-null int64
    date_first_booking         88908 non-null object
    gender                     213451 non-null object
    age                        125461 non-null float64
    signup_method              213451 non-null object
    signup_flow                213451 non-null int64
    language                   213451 non-null object
    affiliate_channel          213451 non-null object
    affiliate_provider         213451 non-null object
    first_affiliate_tracked    207386 non-null object
    signup_app                 213451 non-null object
    first_device_type          213451 non-null object
    first_browser              213451 non-null object
    country_destination        213451 non-null object
    dtypes: float64(1), int64(2), object(13)
    memory usage: 26.1+ MB

1. The train file contains 213,451 rows and 16 features.
2. The data types and the number of non-null values of each feature are shown above.
3. age has many missing values; during feature extraction we treat the missing values as their own category.
4. date_first_booking has many missing values; we consider dropping it during feature extraction (see the quick check below).
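
A quick check of the missing-value ratios supports points 3 and 4 (a minimal sketch using the train DataFrame loaded above):

# Fraction of missing values per column, largest first
print(train.isnull().mean().sort_values(ascending=False).head())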

Exploratory analysis

The date_account_created feature

dac_train = pd.to_datetime(train.date_account_created).value_counts()
dac_test = pd.to_datetime(test.date_account_created).value_counts()

# Days elapsed since the earliest account creation date in the training set
dac_train_days = dac_train.index - dac_train.index.min()
dac_test_days = dac_test.index - dac_train.index.min()

plt.figure(figsize=[10,6])
plt.scatter(dac_train_days.days, dac_train.values,color='b', label="train set")
plt.scatter(dac_test_days.days, dac_test.values,color='r', label="test set")
plt.title("Accounts Created vs. Days")
plt.xlabel('Days')
plt.ylabel('Accounts Created')
plt.legend(loc="best")

[Figure 1: daily account creations vs. days since the earliest registration, training vs. test set]

The number of accounts created per day rises sharply over time.
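
To quantify the growth, the daily counts can also be aggregated by month (a minimal sketch; the name dac_monthly is only for illustration):

# Number of accounts created per calendar month in the training set
dac_monthly = pd.to_datetime(train.date_account_created).dt.to_period("M").value_counts().sort_index()
print(dac_monthly.tail(12))   # counts for the most recent 12 months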

The timestamp_first_active feature

train.timestamp_first_active.head()
    0    20090319043255
    1    20090523174809
    2    20090609231247
    3    20091031060129
    4    20091208061105
    Name: timestamp_first_active, dtype: int64
# Convert to datetime
tfa_train = train.timestamp_first_active.astype(str).apply(lambda x: datetime.datetime(int(x[:4]), int(x[4:6]), int(x[6:8]),
                                                                                      int(x[8:10]), int(x[10:12]), int(x[12:])))
tfa_test = test.timestamp_first_active.astype(str).apply(lambda x: datetime.datetime(int(x[:4]), int(x[4:6]), int(x[6:8]),
                                                                                      int(x[8:10]), int(x[10:12]), int(x[12:])))
# Days elapsed since the earliest first-active time in the training set
tfa_train_days = (tfa_train - tfa_train.min()).apply(lambda x: x.days).value_counts()
tfa_test_days = (tfa_test - tfa_train.min()).apply(lambda x: x.days).value_counts()

plt.figure(figsize=[10,6])
plt.scatter(tfa_train_days.index, tfa_train_days.values,color='b', label="train set")
plt.scatter(tfa_test_days.index, tfa_test_days.values,color='r', label="test set")
plt.title("User First Active vs. Days")
plt.xlabel('Days')
plt.ylabel('User First Active')
plt.legend(loc="best")

[Figure 2: daily first-active user counts vs. days since the earliest first activity, training vs. test set]

The date_first_booking feature

print(train.date_account_created.shape[0])
print(train.date_first_booking.isnull().sum())

print(test.date_account_created.shape[0])
print(test.date_first_booking.isnull().sum())
    213451
    124543
    62096
    62096

The date_first_booking field is missing for more than 50% of the rows, so we plan to drop it during feature extraction.
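
The ratio can be verified directly (a minimal sketch; the numbers follow from the counts printed above):

print(train.date_first_booking.isnull().mean())  # ~0.58 (124543 / 213451)
print(test.date_first_booking.isnull().mean())   # 1.0 -- the column is entirely empty in the test set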

The age feature

train.age.value_counts().head()
30.0    6124
31.0    6016
29.0    5963
28.0    5939
32.0    5855
Name: age, dtype: int64

User ages are concentrated around 30.
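
Before binning, it is worth checking the implausible values: some entries look like birth years (19xx) rather than ages, which motivates the correction applied in ageProcess below (a minimal sketch):

# Values far above any realistic age -- presumably birth years entered by mistake
print(train.age[train.age > 1000].value_counts().head())
print((train.age > 110).sum(), "training rows have a non-physical age value")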

age_step = 10

def ageProcess(age):
    '''
    Pre-process and discretize the age field.
    1. Some values are of the form 19xx; these users presumably entered their
       birth year instead of their age, so convert such values to an age.
    2. Discretize age into bins of width age_step.
    '''
    if age >= 1900 and age < 2014:
        age = 2014 - age              # treat 19xx values as birth years
    if age < 0 or pd.isnull(age):
        return "NA"
    if age < age_step:
        return "< %d" % age_step
    for i in range(1, 9):             # bins 10 ~ 20, 20 ~ 30, ..., 80 ~ 90
        if age < (i + 1) * age_step:
            return "%d ~ %d" % (i * age_step, (i + 1) * age_step)
    if age < 110:
        return "%d ~ 110" % (9 * age_step)
    return "non-physical"


age_train = train.age.apply(ageProcess)
age_test = test.age.apply(ageProcess)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 6), sharey=True)
ax1.bar(age_train.value_counts().index, age_train.value_counts().values)
ax2.bar(age_test.value_counts().index, age_test.value_counts().values)
ax1.set_title('train set')
ax2.set_title('test set')

[Figure 3: distribution of age groups]

Define a helper function to visualize the distribution of a categorical feature

def feature_barplot(df_train, df_test, feature, figsize=(12,6), rot=90, saveimg=False):

    feature_train_counts = df_train[feature].value_counts()
    feature_test_counts = df_test[feature].value_counts()

    fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=False, figsize=figsize)
    sns.barplot(x=feature_train_counts.index, y=feature_train_counts.values, ax=ax1)
    sns.barplot(x=feature_test_counts.index, y=feature_test_counts.values, ax=ax2)

    ax1.set_xticklabels(ax1.xaxis.get_majorticklabels(), rotation=rot)
    ax2.set_xticklabels(ax2.xaxis.get_majorticklabels(), rotation=rot)
    ax1.set_title(feature + ' of training set')
    ax2.set_title(feature + ' of test set')
    ax1.set_ylabel('Counts')
    plt.tight_layout()
    if saveimg:
        figname = feature + ".png"
        fig.savefig(figname, dpi=75)

For each categorical field, use bar plots to compare its distribution in the training and test sets.

feature_pool = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked',
               'signup_app', 'first_device_type', 'first_browser']

for feature in feature_pool:
    feature_barplot(train, test, feature)

[Figures 4-13: bar plots of gender, signup_method, signup_flow, language, affiliate_channel, affiliate_provider, first_affiliate_tracked, signup_app, first_device_type and first_browser for the training and test sets]

Feature extraction

# Split the training set into features and label
X_train = train.loc[:, train.columns != 'country_destination'].copy()
y_train = train.loc[:, train.columns == 'country_destination'].copy()

X_test = test.copy()
# Combine the training and test features so that subsequent feature processing can be applied consistently
X_train["train/test"] = "train"
X_test["train/test"] = "test"

X_total = pd.concat([X_train, X_test], axis=0, ignore_index=True)
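
The train/test flag makes it straightforward to split the combined frame back apart once all feature processing is done (a minimal sketch; the *_processed names are only for illustration):

X_train_processed = X_total[X_total["train/test"] == "train"].drop("train/test", axis=1)
X_test_processed = X_total[X_total["train/test"] == "test"].drop("train/test", axis=1)
assert len(X_train_processed) == len(train) and len(X_test_processed) == len(test)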

The date_account_created feature

Y = 2000

season = [(0, (datetime.date(Y,1,1), datetime.date(Y,3,20))),
          (1, (datetime.date(Y,3,21), datetime.date(Y,6,20))),
          (2, (datetime.date(Y,6,21), datetime.date(Y,9,20))),
          (3, (datetime.date(Y,9,21), datetime.date(Y,12,20))),
          (0, (datetime.date(Y,12,21), datetime.date(Y,12,31)))]

def get_season(dt):
    dt = datetime.date(Y, dt.month, dt.day)
    for season_id, (start_date, end_date) in season:
        if start_date <= dt and dt <= end_date:
            return season_id
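
A quick sanity check of the season mapping, including the wrap-around at the end of December (a minimal sketch):

print(get_season(datetime.date(2014, 1, 15)))   # 0 (winter)
print(get_season(datetime.date(2014, 7, 4)))    # 2 (summer)
print(get_season(datetime.date(2013, 12, 25)))  # 0 (winter, via the Dec 21-31 entry)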

def create_dt_feature(df, feature):

    year = "%s_year" % feature
    month = "%s_month" % feature
    day_of_month = "%s_day_of_month" % feature
    weekday = "%s_weekday" % feature
    season_col = "%s_season" % feature   # local name; avoids shadowing the global season table

    df[year] = df[feature].apply(lambda x: x.year)
    df[month] = df[feature].apply(lambda x: x.month)
    df[day_of_month] = df[feature].apply(lambda x: x.day)

    df[weekday] = df[feature].apply(lambda x: x.isoweekday())
    df[season_col] = df[feature].apply(get_season)

    return df
X_total["dac"] = pd.to_datetime(X_total["date_account_created"])
X_total = create_dt_feature(X_total, "dac")
X_total.drop(["date_account_created"], axis=1, inplace=True)
X_total.head()
id timestamp_first_active date_first_booking gender age signup_method signup_flow language affiliate_channel affiliate_provider signup_app first_device_type first_browser train/test dac dac_year dac_month dac_day_of_month dac_weekday dac_season
0 gxn3p5htnn 20090319043255 NaN -unknown- NaN facebook 0 en direct direct Web Mac Desktop Chrome train 2010-06-28 2010 6 28 1 2
1 820tgsjxq7 20090523174809 NaN MALE 38.0 facebook 0 en seo google Web Mac Desktop Chrome train 2011-05-25 2011 5 25 3 1
2 4ft3gnwmtx 20090609231247 2010-08-02 FEMALE 56.0 basic 3 en direct direct Web Windows Desktop IE train 2010-09-28 2010 9 28 2 3
3 bjjt8pjhuk 20091031060129 2012-09-08 FEMALE 42.0 facebook 0 en direct direct Web Mac Desktop Firefox train 2011-12-05 2011 12 5 1 3
4 87mebub9p4 20091208061105 2010-02-18 -unknown- 41.0 basic 0 en direct direct Web Mac Desktop Chrome train 2010-09-14 2010 9 14 2 2

5 rows × 21 columns

The timestamp_first_active feature

X_total["tfa"] = X_total["timestamp_first_active"].astype("str").apply(lambda x: datetime.datetime(int(x[:4]),
                                                                                                                     int(x[4:6]),
                                                                                                                     int(x[6:8]),
                                                                                                                     int(x[8:10]),
                                                                                                                     int(x[10:12]),
                                                                                                                     int(x[12:])))
X_total = create_dt_feature(X_total, "tfa")
X_total.drop(["timestamp_first_active"], axis=1, inplace=True)
X_total.head()
id date_first_booking gender age signup_method signup_flow language affiliate_channel affiliate_provider first_affiliate_tracked dac_month dac_day_of_month dac_weekday dac_season tfa tfa_year tfa_month tfa_day_of_month tfa_weekday tfa_season
0 gxn3p5htnn NaN -unknown- NaN facebook 0 en direct direct untracked 6 28 1 2 2009-03-19 04:32:55 2009 3 19 4 0
1 820tgsjxq7 NaN MALE 38.0 facebook 0 en seo google untracked 5 25 3 1 2009-05-23 17:48:09 2009 5 23 6 1
2 4ft3gnwmtx 2010-08-02 FEMALE 56.0 basic 3 en direct direct untracked 9 28 2 3 2009-06-09 23:12:47 2009 6 9 2 1
3 bjjt8pjhuk 2012-09-08 FEMALE 42.0 facebook 0 en direct direct untracked 12 5 1 3 2009-10-31 06:01:29 2009 10 31 6 3
4 87mebub9p4 2010-02-18 -unknown- 41.0 basic 0 en direct direct untracked 9 14 2 2 2009-12-08 06:11:05 2009 12 8 2 3

5 rows × 26 columns

Extract the time between a user's first activity and account creation as a new feature. Note that date_account_created has no time-of-day component while timestamp_first_active does, so a user whose first activity falls on the same day the account was created gets a span of -1 days, which is why -1 dominates the value counts below.

X_total["dt_span"] = X_total['dac'].subtract(X_total["tfa"]).dt.days

X_total["dt_span"].value_counts().head()
-1    275369
 0         7
 6         4
 5         4
 1         4
Name: dt_span, dtype: int64
X_total.dt_span.describe()
count    275547.000000
mean         -0.820423
std          10.516688
min          -1.000000
25%          -1.000000
50%          -1.000000
75%          -1.000000
max        1455.000000
Name: dt_span, dtype: float64
def get_span_class(dt_span):
    if dt_span == -1:   # account created on the same calendar day as the first activity
        return "One day"
    elif dt_span < 7:
        return "One week"
    elif dt_span < 30:
        return "One month"
    elif dt_span < 365:
        return "One year"
    else:
        return "Over one year"
X_total["dt_span"] = X_total["dt_span"].apply(get_span_class)

The age feature

X_total["age"] = X_total["age"].apply(ageProcess)
X_total["age"].value_counts()
NA              116866
30 ~ 40          58598
20 ~ 30          48776
40 ~ 50          24623
50 ~ 60          12774
60 ~ 70           6191
10 ~ 20           2984
90 ~ 110          1879
70 ~ 80           1472
non-physical       960
80 ~ 90            325
< 10                99
Name: age, dtype: int64

One-hot encoding of the categorical features

# Features to one-hot encode, including the dac, tfa, dt_span and age features generated above
feature_pool = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked',
'signup_app', 'first_device_type', 'first_browser']
feature_pool = feature_pool + ["dac_weekday", "dac_season", "tfa_weekday", "tfa_season", "dt_span","age"]
for feature in feature_pool:

    df_temp = pd.get_dummies(X_total[feature], prefix=feature)
    X_total = pd.concat([X_total, df_temp], axis=1)

Drop the intermediate columns and date_first_booking

X_total.drop(["dac", "tfa", "date_first_booking"], axis=1, inplace=True)
X_total.drop(feature_pool, axis=1, inplace=True)
X_total.shape[1]
200

3.2 The sessions.csv file

session = pd.read_csv("../input/sessions.csv")
session.head()
user_id action action_type action_detail device_type secs_elapsed
0 d1mm9tcy42 lookup NaN NaN Windows Desktop 319.0
1 d1mm9tcy42 search_results click view_search_results Windows Desktop 67753.0
2 d1mm9tcy42 lookup NaN NaN Windows Desktop 301.0
3 d1mm9tcy42 search_results click view_search_results Windows Desktop 22141.0
4 d1mm9tcy42 lookup NaN NaN Windows Desktop 435.0

To simplify the later merge with the user tables, rename the user_id column to id.

session["id"] = session["user_id"]
session.drop(["user_id"], inplace=True, axis=1)
session.head()
action action_type action_detail device_type secs_elapsed id
0 lookup NaN NaN Windows Desktop 319.0 d1mm9tcy42
1 search_results click view_search_results Windows Desktop 67753.0 d1mm9tcy42
2 lookup NaN NaN Windows Desktop 301.0 d1mm9tcy42
3 search_results click view_search_results Windows Desktop 22141.0 d1mm9tcy42
4 lookup NaN NaN Windows Desktop 435.0 d1mm9tcy42
session.isnull().sum()
action             79626
action_type      1126204
action_detail    1126204
device_type            0
secs_elapsed      136031
id                 34496
dtype: int64
session.shape[0]
10567737

Fill missing values

session.action.fillna("NAN", inplace=True)
session.action_type.fillna("NAN", inplace=True)
session.action_detail.fillna("NAN", inplace=True)
session.isnull().sum()
action                0
action_type           0
action_detail         0
device_type           0
secs_elapsed     136031
id                34496
dtype: int64
session1 = session.copy()

Refine the features action, action_detail, action_type, device_type and secs_elapsed, as implemented in the aggregation loop below:
- First group the session records by user id.
- action: for each user, count the total number of actions and the count of each action category, together with the number of distinct categories and the mean and standard deviation of those counts.
- action_detail: the same statistics as for action.
- action_type: the count of each action_type category, the number of distinct categories with the mean and standard deviation of those counts, plus the total elapsed time per action_type (log-transformed).
- device_type: the same statistics as for action.
- secs_elapsed: fill missing values with 0, then compute the log-transformed sum, mean, standard deviation and median per user, the ratio of the log-sum to the number of actions, and a histogram of the log-transformed values.

for feature in ["action", "action_detail", "action_type", "device_type"]:
    print(len(session1[feature].value_counts().index))
360
156
11
14
# Collapse rare categories (1000 occurrences or fewer) into 'other'
action_list = dict(session1.action.value_counts())
action_detail_list = dict(session1.action_detail.value_counts())
session1["action"] = session1["action"].apply(lambda x: x if action_list[x] > 1000 else "other")
session1["action_detail"] = session1["action_detail"].apply(lambda x: x if action_detail_list[x] > 1000 else "other")
print(len(session1.action.value_counts()))
print(len(session1.action_detail.value_counts()))
149
96
f_act = session1.action.value_counts().argsort()
f_act_detail = session1.action_detail.value_counts().argsort()
f_act_type = session1.action_type.value_counts().argsort()
f_dev_type = session1.device_type.value_counts().argsort()
f_act.head()
show              148
index             147
search_results    146
personalize       145
search            144
Name: action, dtype: int64
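
The argsort call above turns each value_counts result into a lookup table from category name to an integer slot; in the aggregation loop below, f_act[v] gives the position reserved for action v in a fixed-length count vector (a minimal illustration, assuming the objects above):

# Each category name maps to a unique integer in [0, len(f_act))
print(f_act["show"])   # 148 in the output above: the slot reserved for 'show'
print(len(f_act))      # length of the per-user action count vector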
dgr_session = session1.groupby(["id"])
samples = []
ln = len(dgr_session)
for g in dgr_session:
    gr = g[1]   # DataFrame holding all session records for this user id

    l = []      # temporary feature list for this user

    # id
    l.append(g[0])

    # total number of session records (actions) for this user
    l.append(len(gr))

    # secs_elapsed values, with missing entries filled with 0
    sev = gr.secs_elapsed.fillna(0).values

    # action
    # count of each action category for this user, plus the number of distinct
    # categories and the mean / standard deviation of those counts
    c_act = [0] * len(f_act)
    for i, v in enumerate(gr.action.values):
        c_act[f_act[v]] += 1
    _, c_act_uqc = np.unique(gr.action.values, return_counts=True)
    # number of distinct action categories, mean and standard deviation of their counts
    c_act += [len(c_act_uqc), np.mean(c_act_uqc), np.std(c_act_uqc)]
    l = l + c_act

    # action_type: per-category counts and per-category total elapsed time
    c_act_type = [0] * len(f_act_type)
    l_act_type = [0] * len(f_act_type)
    for i, v in enumerate(gr.action_type.values):
        c_act_type[f_act_type[v]] += 1
        l_act_type[f_act_type[v]] += sev[i]
    _, c_act_type_uqc = np.unique(gr.action_type.values, return_counts=True)
    l_act_type = np.log(1 + np.array(l_act_type)).tolist()
    c_act_type += [len(c_act_type_uqc), np.mean(c_act_type_uqc), np.std(c_act_type_uqc)]
    l = l + c_act_type + l_act_type

    # action_detail
    c_act_detail = [0] * len(f_act_detail)
    for i, v in enumerate(gr.action_detail.values):
        c_act_detail[f_act_detail[v]] += 1
    _, c_act_detail_uqc = np.unique(gr.action_detail.values, return_counts=True)
    c_act_detail += [len(c_act_detail_uqc), np.mean(c_act_detail_uqc), np.std(c_act_detail_uqc)]
    l = l + c_act_detail

    # device_type
    c_dev_type = [0] * len(f_dev_type)
    for i, v in enumerate(gr.device_type.values):
        c_dev_type[f_dev_type[v]] += 1
    c_dev_type.append(len(np.unique(gr.device_type.values)))
    _, c_dev_type_uqc = np.unique(gr.device_type.values, return_counts=True)
    c_dev_type += [len(c_dev_type_uqc), np.mean(c_dev_type_uqc), np.std(c_dev_type_uqc)]
    l = l + c_dev_type

    # secs_elapsed features (time spent)
    l_secs = [0] * 5 
    l_log = [0] * 15
    if len(sev) > 0:
        #Simple statistics about the secs_elapsed values.
        l_secs[0] = np.log(1 + np.sum(sev))
        l_secs[1] = np.log(1 + np.mean(sev)) 
        l_secs[2] = np.log(1 + np.std(sev))
        l_secs[3] = np.log(1 + np.median(sev))
        l_secs[4] = l_secs[0] / float(l[1])   # log-sum divided by the total number of actions

        log_sev = np.log(1 + sev).astype(int)  
        l_log = np.bincount(log_sev, minlength=15).tolist()                    
    l = l + l_secs + l_log

    samples.append(l)

samples = np.array(samples)
samp_ar = samples[:, 1:].astype(np.float16)
samp_id = samples[:, 0]

col_names = []
for i in range(len(samples[0])-1):
    col_names.append('c_' + str(i))  
df_agg_sess = pd.DataFrame(samp_ar, columns=col_names)
df_agg_sess['id'] = samp_id
df_agg_sess.index = df_agg_sess.id
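
For the secs_elapsed part alone, a compact groupby/agg gives closely corresponding per-user statistics and can serve as a rough cross-check of the loop above (a minimal sketch; the loop additionally applies log(1 + x), uses np.std with ddof=0, and builds a bincount histogram, all omitted here):

# Per-user statistics of secs_elapsed, with missing values treated as 0
secs = session1.assign(secs_elapsed=session1.secs_elapsed.fillna(0))
secs_stats = secs.groupby("id")["secs_elapsed"].agg(["sum", "mean", "std", "median", "count"])
secs_stats.head()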

3.3 Combining all extracted features

# To be continued
