DS Wannabe Prep Study Notes: 5. Technical Interview: Coding (Part 1)

Sample Data- and ML-Related Interview Questions

Mock Interview Questions

Question 5-1 (a)

At [the social media company that you’re interviewing for], we are looking into user behavior. The data format we have is [sample data in .json format]. The data is provided in the following two .json objects (referred to as “tables” for convenience).

Table 1:

user_signups = {
  "user_signups": [
      { "user_id": 31876, "timestamp": "2023-05-14 09:18:15" },
      { "user_id": 59284, "timestamp": "2023-05-13 15:12:45" },
      { "user_id": 86729, "timestamp": "2023-06-18 09:03:30" },
  ]
}

Table 2: 

user_logins = {
  "user_logins": [
      { "user_identifier": 31876, "login_time": "2023-05-15 10:28:15", 
"logoff_time": "2023-07-15 13:47:30" },
      { "user_identifier": 31876, "login_time": "2023-06-17 15:12:45", 
"logoff_time": "2023-07-17 18:31:20" },
      { "user_identifier": 31876, "login_time": "2023-06-20 09:03:30", 
"logoff_time": "2023-07-20 12:17:10" },
      { "user_identifier": 59284, "login_time": "2023-05-16 14:49:10", 
"logoff_time": "2023-07-16 18:02:45" },
      { "user_identifier": 59284, "login_time": "2023-05-18 09:33:25", 
"logoff_time": "2023-07-18 12:48:15" },
      { "user_identifier": 59284, "login_time": "2023-06-19 14:06:40", 
"logoff_time": "2023-07-19 16:34:50" },
      { "user_identifier": 59284, "login_time": "2023-06-21 08:20:05", 
"logoff_time": "2023-07-21 11:36:25" },
      { "user_identifier": 59284, "login_time": "2023-07-23 15:28:50", 
"logoff_time": "2023-07-23 18:44:40" },
      { "user_identifier": 86729, "login_time": "2023-06-18 10:48:30", 
"logoff_time": "2023-07-18 10:58:20" },
      { "user_identifier": 86729, "login_time": "2023-06-19 13:31:05", 
"logoff_time": "2023-07-19 15:50:40" },
      { "user_identifier": 86729, "login_time": "2023-06-21 10:10:25", 
"logoff_time": "2023-06-21 12:21:15" }
  ]
}

Table 1 records each new user's signup time, with these fields:

  • user_id

  • timestamp

Table 2 records login sessions for current accounts, with these fields:

  • user_identifier

  • login_time

  • logoff_time

Question: Given these two tables, which users’ latest activity occurred more than 60 days after signup?

Here are some things you should do as you begin to answer this first question:

  • Confirm what each data type means if unclear: is user_id in Table 1 the same as user_identifier in Table 2? (For the example answer, we assume that it is.)

  • Load the .json data into your format of choice—for example, into a pandas DataFrame (a minimal loading sketch follows this list).

  • Think out loud—explain your approach and thoughts as you code. Confirm the question if you’re unsure; in this case, even though there are several columns, you won’t need some of them in the end, so you can simplify the results.
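
Here’s a minimal loading sketch. In the question as given, the objects are already Python dicts, but if the data arrived as a .json file you could read it with the json module (the filename is an assumption for illustration):

# python

import json
import pandas as pd

# Assumption: the signup data was saved to a file named user_signups.json
with open("user_signups.json") as f:
    user_signups = json.load(f)

user_signups_df = pd.DataFrame(user_signups["user_signups"])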

Steps:

1. Read in the .json objects as pandas DataFrames.

2. For each user, get their latest login time, and store in a DataFrame named latest_login_times.

3. Merge the two DataFrames. Now, for each user, it displays their signup time (timestamp) and latest login time (login_time). (Tip: if time allows in the interview, rename these columns to have clearer names.)

4. Calculate the time between signup timestamp and the latest login_time, putting the results into a new column.

5. Filter the users, keeping only those whose latest login is more than 60 days after signup.

# python

import json
import pandas as pd

# Step 1: read the .json objects in as DataFrames
user_signups_df = pd.DataFrame(user_signups["user_signups"])

user_logins_df = pd.DataFrame(user_logins["user_logins"])

# Step 2: latest login per user (these ISO-format timestamp strings
# sort chronologically, so max() works before datetime conversion)
latest_login_times = user_logins_df.groupby(
    'user_identifier')['login_time'].max().reset_index()

# Step 3: merge signup times with latest login times
merged_df = user_signups_df.merge(
    latest_login_times,
    left_on="user_id",
    right_on="user_identifier",
    how="inner"
)

merged_df['timestamp'] = pd.to_datetime(merged_df['timestamp'])
merged_df['last_login_time'] = pd.to_datetime(merged_df['login_time'])

# merged_df
 
|user_id   |timestamp               |login_time
|31876     |2023-05-14 09:18:15     |2023-06-20 09:03:30
|59284     |2023-05-13 15:12:45     |2023-07-23 15:28:50
|86729     |2023-06-18 09:03:30     |2023-06-21 10:10:25

# Step 4: time between signup and latest login
merged_df['time_between_signup_and_latest_login'] = (
    merged_df['last_login_time'] - merged_df['timestamp'])


# merged_df

|user_id |timestamp           |login_time          |time_between_signup_and_latest_login
|31876   |2023-05-14 09:18:15 |2023-06-20 09:03:30 |36 days 23:45:15
|59284   |2023-05-13 15:12:45 |2023-07-23 15:28:50 |71 days 00:16:05
|86729   |2023-06-18 09:03:30 |2023-06-21 10:10:25 |3 days 01:06:55

# Step 5: keep only users who logged in more than 60 days after signup
filtered_users = merged_df[
    merged_df['time_between_signup_and_latest_login'] > pd.Timedelta(days=60)]

filtered_users['user_id']
# Result: 59284

Table 3:

{
  "user_information": [
    {
      "user_id": "31876",
      "feature_id": "profile_completion",
      "feature_value": "55%"
    },
    {
      "user_id": "31876",
      "feature_id": "friend_connections",
      "feature_value": "127"
    },
    {
      "user_id": "31876",
      "feature_id": "posts",
      "feature_value": "42"
    },
    {
      "user_id": "31876",
      "feature_id": "saved_posts",
      "feature_value": "3"
    },
    {
      "user_id": "59284",
      "feature_id": "profile_completion",
      "feature_value": "92%"
    },
    ...
  ]
}

Question 5-1 (b)

Question: Based on the tables in the previous question as well as this new table (Table 3), how would you approach building a predictive churn model with this toy data, assuming that the model can be run on much more data than this? Assume that the day the analysis is run is July 25, 2023. Create the churn indicators and feature table, and verbally describe how you would proceed with modeling.

Here are some tips for answering question 5-1 (b):
  • You should analyze the data, even if it’s a toy dataset, and walk through what you would do if it were larger.

  • Define what it means that the user will stay or leave. Does the company have a definition of churned users? (For example, are they users who haven’t logged on in 30 days?) Get clarity on anything that’s unclear to you.

  • Suggest some possible ways to find correlation—for example, if a user has a lower profile completion rate, are they more likely to churn? Also, outline and code out how you can test and confirm those assumptions in the dataset (a minimal sketch follows this list).

  • For modeling, what are some low-effort baseline models you could try? Perhaps regression or a simple tree-based model?

  • For a more complex model, what would you do?

  • If you notice that your time is running out, tell the interviewer you’ll give a quick high-level overview of how a complex model might work, and wrap it up.
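
As one way to act on the correlation tip above, here’s a minimal sketch that compares average feature values between churned and retained users. It assumes a long-format features_df with user_id, feature_id, feature_value, and churn_status columns, like the one built later in this answer:

# python

import pandas as pd

# Assumption: features_df is the long-format features-plus-labels table
# built below. Make feature_value numeric first (strip the '%' from
# profile_completion values).
features_df["feature_value"] = pd.to_numeric(
    features_df["feature_value"].str.rstrip("%"))

# Mean of each feature split by churn status; large gaps between the two
# groups hint at features worth testing more formally on a larger dataset.
features_df.groupby(["feature_id", "churn_status"])["feature_value"].mean()

On the toy data this is only suggestive; on real data you would follow up with a proper statistical test or a holdout evaluation.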

For illustration, the user_info_df.head() output below shows what some rows of the table look like once loaded into a DataFrame.

The following code is the first part of an example answer for question 5-1 (b), which loads Table 3:

# python

import pandas as pd

user_info_df = pd.DataFrame(user_info["user_information"])

user_info_df.head() # print top 5 rows


    |user_id    |feature_id            |feature_value
    |31876      |profile_completion    |55%
    |31876      |friend_connections    |127
    |31876      |posts                 |42
    |31876      |saved_posts           |3
    |59284      |profile_completion    |92%

The interviewer confirmed that if a user hasn’t logged in for 30 days, they can be considered churned. Note that we assume the current date is July 25, 2023. You create a binary column that indicates churn. The following code is the second part of the answer for question 5-1 (b), creating a churn indicator:

# python
import numpy as np

# Add churn indicator: churned if no login in the 30 days before 2023-07-25
merged_df['churn_status'] = np.where(
    (pd.to_datetime('2023-07-25') - merged_df['last_login_time'])
    >= pd.Timedelta(days=30),
    1, 0)

# merged_df
|user_id |timestamp           |login_time           |time_between_signup_and_latest_login |churn_status
|31876   |2023-05-14 09:18:15 |2023-06-20 09:03:30  |36 days 23:45:15                     |1
|59284   |2023-05-13 15:12:45 |2023-07-23 15:28:50  |71 days 00:16:05                     |0
|86729   |2023-06-18 09:03:30 |2023-06-21 10:10:25  |3 days 01:06:55                      |1

From here you can join it to the features table:

# python
user_info_df["user_id"] = pd.to_numeric(user_info_df["user_id"])
features_df = user_info_df.merge(merged_df_2[["user_id", "churn_status"]], 
                                 left_on="user_id", right_on="user_id")  1
# features_df

|user_id    |feature_id            |feature_value    |churn_status
|31876      |profile_completion    |55%              |1
|31876      |friend_connections    |127              |1
|31876      |posts                 |42               |1
|31876      |saved_posts           |3                |1
|59284      |profile_completion    |92%              |0
|59284      |friend_connections    |95               |0
|59284      |posts                 |63               |0
|59284      |saved_posts           |8                |0
|86729      |profile_completion    |75%              |1
|86729      |friend_connections    |58               |1
|86729      |posts                 |31               |1
|86729      |saved_posts           |1                |1

Next, you decide on a simple ML model, CatBoost, and convert this DataFrame to the required format; in this case, it’s easiest if each feature becomes its own column (a minimal sketch follows).
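
Here’s a minimal sketch of that conversion plus a baseline CatBoost fit. The pivot follows directly from the tables above; the model fit is illustrative only (three users are far too few to actually train on) and assumes the catboost package is installed:

# python

import pandas as pd
from catboost import CatBoostClassifier

# Make feature_value numeric (strip the '%' from profile_completion)
user_info_df["feature_value"] = pd.to_numeric(
    user_info_df["feature_value"].str.rstrip("%"))

# Pivot long -> wide: one row per user, one column per feature
features_wide = user_info_df.pivot(
    index="user_id", columns="feature_id", values="feature_value")

# Attach the churn labels computed earlier
y = merged_df.set_index("user_id").loc[features_wide.index, "churn_status"]

# Low-effort baseline; on real data you'd hold out a validation set
model = CatBoostClassifier(iterations=100, verbose=False)
model.fit(features_wide, y)

On real data you would also mark categorical features explicitly and tune against a validation set, but this shows the shape of the pipeline.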

An easy rules-based baseline: if a user hasn’t logged on for 20 days, they may already be well on their way to meeting the 30-day churn definition. It’s a simple “wait and see” rule, but it’s an option. Another rule: a user is likely to churn if they haven’t logged in for 14 days and have added no friends; our guess is that without friends they may lack an incentive to come back. (A quick sketch of this second rule follows.)
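
Here’s a quick sketch of that second rule, reusing merged_df and the features_wide table from the previous sketch (the 14-day threshold and the July 25, 2023 analysis date are the assumptions stated above):

# python

import pandas as pd

# Rule: no login in the last 14 days AND zero friend connections
# as of the analysis date => flag the user as likely to churn.
as_of = pd.to_datetime("2023-07-25")
merged_df["days_since_login"] = (as_of - merged_df["last_login_time"]).dt.days

rule_df = merged_df.merge(
    features_wide.reset_index()[["user_id", "friend_connections"]],
    on="user_id")
likely_churn = rule_df.loc[
    (rule_df["days_since_login"] > 14)
    & (rule_df["friend_connections"] == 0), "user_id"]
# In this toy data no user has zero friend connections, so the rule
# flags nobody here; it only becomes useful on larger datasets.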
