Pandas
*pandas* is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python. *pandas* build upon *numpy* and *scipy* providing easy-to-use data structures and data manipulation functions with integrated indexing. The main data structures *pandas* provides are *Series* and *DataFrames*. After a brief introduction to these two data structures and data ingestion, the key features of *pandas* this notebook covers are: * Generating descriptive statistics on data * Data cleaning using built in pandas functions * Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data * Merging multiple datasets using dataframes * Working with timestamps and time-series data **Additional Recommended Resources:** * *pandas* Documentation: http://pandas.pydata.org/pandas-docs/stable/ * *Python for Data Analysis* by Wes McKinney * *Python Data Science Handbook* by Jake VanderPlas Let’s get started with our first *pandas* notebook! Import Libraries
import pandas as pd
Introduction to pandas Data Structures
*pandas* has two main data structures it uses, namely, *Series* and *DataFrames*.
pandas Series
*pandas Series* one-dimensional labeled array.
ser = pd.Series(data = [100, 'foo', 300, 'bar', 500], index = ['tom', 'bob', 'nancy', 'dan', 'eric'])
ser
tom 100 bob foo nancy 300 dan bar eric 500 dtype: object
ser.index
Index([‘tom’, ‘bob’, ‘nancy’, ‘dan’, ‘eric’], dtype=’object’)
ser.loc[['nancy','bob']]
nancy 300 bob foo dtype: object
ser[[4, 3, 1]]
eric 500 dan bar bob foo dtype: object
ser.iloc[2]
300
'bob' in ser
True
ser
tom 100 bob foo nancy 300 dan bar eric 500 dtype: object
ser * 2
tom 200 bob foofoo nancy 600 dan barbar eric 1000 dtype: object
ser[['nancy', 'eric']] ** 2
nancy 90000 eric 250000 dtype: object
pandas DataFrame
*pandas DataFrame* is a 2-dimensional labeled data structure.
Create DataFrame from dictionary of Python Series
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)
print(df)
one two apple 100.0 111.0 ball 200.0 222.0 cerill NaN 333.0 clock 300.0 NaN dancy NaN 4444.0
df.index
Index([‘apple’, ‘ball’, ‘cerill’, ‘clock’, ‘dancy’], dtype=’object’)
df.columns
Index([‘one’, ‘two’], dtype=’object’)
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])
|
one |
two |
dancy |
NaN |
4444.0 |
ball |
200.0 |
222.0 |
apple |
100.0 |
111.0 |
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
|
two |
five |
dancy |
4444.0 |
NaN |
ball |
222.0 |
NaN |
apple |
111.0 |
NaN |
Create DataFrame from list of Python dictionaries
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
pd.DataFrame(data)
|
alex |
alice |
dora |
ema |
joe |
0 |
1.0 |
NaN |
NaN |
NaN |
2.0 |
1 |
NaN |
20.0 |
10.0 |
5.0 |
NaN |
pd.DataFrame(data, index=['orange', 'red'])
|
alex |
alice |
dora |
ema |
joe |
orange |
1.0 |
NaN |
NaN |
NaN |
2.0 |
red |
NaN |
20.0 |
10.0 |
5.0 |
NaN |
pd.DataFrame(data, columns=['joe', 'dora','alice'])
|
joe |
dora |
alice |
0 |
2.0 |
NaN |
NaN |
1 |
NaN |
10.0 |
20.0 |
Basic DataFrame operations
df
|
one |
two |
apple |
100.0 |
111.0 |
ball |
200.0 |
222.0 |
cerill |
NaN |
333.0 |
clock |
300.0 |
NaN |
dancy |
NaN |
4444.0 |
df['one']
apple 100.0 ball 200.0 cerill NaN clock 300.0 dancy NaN Name: one, dtype: float64
df['three'] = df['one'] * df['two']
df
|
one |
two |
three |
apple |
100.0 |
111.0 |
11100.0 |
ball |
200.0 |
222.0 |
44400.0 |
cerill |
NaN |
333.0 |
NaN |
clock |
300.0 |
NaN |
NaN |
dancy |
NaN |
4444.0 |
NaN |
df['flag'] = df['one'] > 250
df
|
one |
two |
three |
flag |
apple |
100.0 |
111.0 |
11100.0 |
False |
ball |
200.0 |
222.0 |
44400.0 |
False |
cerill |
NaN |
333.0 |
NaN |
False |
clock |
300.0 |
NaN |
NaN |
True |
dancy |
NaN |
4444.0 |
NaN |
False |
three = df.pop('three')
three
apple 11100.0 ball 44400.0 cerill NaN clock NaN dancy NaN Name: three, dtype: float64
df
|
one |
two |
flag |
apple |
100.0 |
111.0 |
False |
ball |
200.0 |
222.0 |
False |
cerill |
NaN |
333.0 |
False |
clock |
300.0 |
NaN |
True |
dancy |
NaN |
4444.0 |
False |
del df['two']
df
|
one |
flag |
apple |
100.0 |
False |
ball |
200.0 |
False |
cerill |
NaN |
False |
clock |
300.0 |
True |
dancy |
NaN |
False |
df.insert(2, 'copy_of_one', df['one'])
df
|
one |
flag |
copy_of_one |
apple |
100.0 |
False |
100.0 |
ball |
200.0 |
False |
200.0 |
cerill |
NaN |
False |
NaN |
clock |
300.0 |
True |
300.0 |
dancy |
NaN |
False |
NaN |
df['one_upper_half'] = df['one'][:2]
df
|
one |
flag |
copy_of_one |
one_upper_half |
apple |
100.0 |
False |
100.0 |
100.0 |
ball |
200.0 |
False |
200.0 |
200.0 |
cerill |
NaN |
False |
NaN |
NaN |
clock |
300.0 |
True |
300.0 |
NaN |
dancy |
NaN |
False |
NaN |
NaN |
Case Study: Movie Data Analysis
This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore with it using *pandas*. ## Download the Dataset Please note that **you will need to download the dataset**. Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to fit on the edX platform due to size constraints. Here are the links to the data source and location: * **Data Source: ** MovieLens web site (filename: ml-20m.zip) * **Location:** https://grouplens.org/datasets/movielens/ Once the download completes, please make sure the data files are in a directory called *movielens* in your *Week-3-pandas* folder. Let us look at the files in this dataset using the UNIX command ls.
!ls ./movielens
!cat ./movielens/movies.csv | wc -l
!head -5 ./movielens/ratings.csv
Use Pandas to Read the Dataset
In this notebook, we will be using three CSV files:
- ratings.csv : userId,movieId,rating, timestamp
- tags.csv : userId,movieId, tag, timestamp
- movies.csv : movieId, title, genres
Using the read_csv function in pandas, we will ingest these three files.
movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)
tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()
|
userId |
movieId |
tag |
timestamp |
0 |
18 |
4141 |
Mark Waters |
1240597180 |
1 |
65 |
208 |
dark hero |
1368150078 |
2 |
65 |
353 |
dark hero |
1368150079 |
3 |
65 |
521 |
noir thriller |
1368149983 |
4 |
65 |
592 |
dark hero |
1368150078 |
ratings = pd.read_csv('./movielens/ratings.csv', sep=',', parse_dates=['timestamp'])
ratings.head()
|
userId |
movieId |
rating |
timestamp |
0 |
1 |
2 |
3.5 |
1112486027 |
1 |
1 |
29 |
3.5 |
1112484676 |
2 |
1 |
32 |
3.5 |
1112484819 |
3 |
1 |
47 |
3.5 |
1112484727 |
4 |
1 |
50 |
3.5 |
1112484580 |
del ratings['timestamp']
del tags['timestamp']
Data Structures
Series
row_0 = tags.iloc[0]
type(row_0)
pandas.core.series.Series
print(row_0)
userId 18 movieId 4141 tag Mark Waters Name: 0, dtype: object
row_0.index
Index([‘userId’, ‘movieId’, ‘tag’], dtype=’object’)
row_0['userId']
18
'rating' in row_0
False
row_0.name
0
row_0 = row_0.rename('first_row')
row_0.name
‘first_row’
DataFrames
tags.head()
|
userId |
movieId |
tag |
0 |
18 |
4141 |
Mark Waters |
1 |
65 |
208 |
dark hero |
2 |
65 |
353 |
dark hero |
3 |
65 |
521 |
noir thriller |
4 |
65 |
592 |
dark hero |
tags.index
RangeIndex(start=0, stop=465564, step=1)
tags.columns
Index([‘userId’, ‘movieId’, ‘tag’], dtype=’object’)
tags.iloc[ [0,11,2000] ]
|
userId |
movieId |
tag |
0 |
18 |
4141 |
Mark Waters |
11 |
65 |
1783 |
noir thriller |
2000 |
910 |
68554 |
conspiracy theory |
Descriptive Statistics
Let’s look how the ratings are distributed!
ratings['rating'].describe()
count 2.000026e+07 mean 3.525529e+00 std 1.051989e+00 min 5.000000e-01 25% 3.000000e+00 50% 3.500000e+00 75% 4.000000e+00 max 5.000000e+00 Name: rating, dtype: float64
ratings.describe()
|
userId |
movieId |
rating |
count |
2.000026e+07 |
2.000026e+07 |
2.000026e+07 |
mean |
6.904587e+04 |
9.041567e+03 |
3.525529e+00 |
std |
4.003863e+04 |
1.978948e+04 |
1.051989e+00 |
min |
1.000000e+00 |
1.000000e+00 |
5.000000e-01 |
25% |
3.439500e+04 |
9.020000e+02 |
3.000000e+00 |
50% |
6.914100e+04 |
2.167000e+03 |
3.500000e+00 |
75% |
1.036370e+05 |
4.770000e+03 |
4.000000e+00 |
max |
1.384930e+05 |
1.312620e+05 |
5.000000e+00 |
ratings['rating'].mean()
3.5255285642993797
ratings.mean()
userId 69045.872583 movieId 9041.567330 rating 3.525529 dtype: float64
ratings['rating'].min()
0.5
ratings['rating'].max()
5.0
ratings['rating'].std()
1.051988919275684
ratings['rating'].mode()
0 4.0 dtype: float64
ratings.corr()
|
userId |
movieId |
rating |
userId |
1.000000 |
-0.000850 |
0.001175 |
movieId |
-0.000850 |
1.000000 |
0.002606 |
rating |
0.001175 |
0.002606 |
1.000000 |
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 False … 20000233 False 20000234 False 20000235 False 20000236 False 20000237 False 20000238 False 20000239 False 20000240 False 20000241 False 20000242 False 20000243 False 20000244 False 20000245 False 20000246 False 20000247 False 20000248 False 20000249 False 20000250 False 20000251 False 20000252 False 20000253 False 20000254 False 20000255 False 20000256 False 20000257 False 20000258 False 20000259 False 20000260 False 20000261 False 20000262 False Name: rating, dtype: bool False
filter_2 = ratings['rating'] > 0
filter_2.all()
True
Data Cleaning: Handling Missing Data
movies.shape
(27278, 3)
movies.isnull().any()
movieId False title False genres False dtype: bool Thats nice ! No NULL values !
ratings.shape
(20000263, 3)
ratings.isnull().any()
userId False movieId False rating False dtype: bool Thats nice ! No NULL values !
tags.shape
(465564, 3)
tags.isnull().any()
userId False movieId False tag True dtype: bool We have some tags which are NULL.
tags = tags.dropna()
tags.isnull().any()
userId False movieId False tag False dtype: bool
tags.shape
(465548, 3) Thats nice ! No NULL values ! Notice the number of lines have reduced.
Data Visualization
%matplotlib inline
ratings.hist(column='rating', figsize=(15,10))
array([[
ratings.boxplot(column='rating', figsize=(15,20))
tags['tag'].head()
0 Mark Waters 1 dark hero 2 dark hero 3 noir thriller 4 dark hero Name: tag, dtype: object
movies[['title','genres']].head()
|
title |
genres |
0 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1 |
Jumanji (1995) |
Adventure|Children|Fantasy |
2 |
Grumpier Old Men (1995) |
Comedy|Romance |
3 |
Waiting to Exhale (1995) |
Comedy|Drama|Romance |
4 |
Father of the Bride Part II (1995) |
Comedy |
ratings[-10:]
|
userId |
movieId |
rating |
20000253 |
138493 |
60816 |
4.5 |
20000254 |
138493 |
61160 |
4.0 |
20000255 |
138493 |
65682 |
4.5 |
20000256 |
138493 |
66762 |
4.5 |
20000257 |
138493 |
68319 |
4.5 |
20000258 |
138493 |
68954 |
4.5 |
20000259 |
138493 |
69526 |
4.5 |
20000260 |
138493 |
69644 |
3.0 |
20000261 |
138493 |
70286 |
5.0 |
20000262 |
138493 |
71619 |
2.5 |
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]
Venice Film Festival Winner 2002 1 based on the life of Buford Pusser 1 but no way as good as the other two 1 see 1 Jeffrey Kimball 1 tolerable 1 Fake History - Don’t Believe a Thing 1 Boy 1 urlaub 1 conservative 1 Name: tag, dtype: int64
tag_counts[:10].plot(kind='bar', figsize=(15,10))
is_highly_rated = ratings['rating'] >= 4.0
ratings[is_highly_rated][30:50]
|
userId |
movieId |
rating |
68 |
1 |
2021 |
4.0 |
69 |
1 |
2100 |
4.0 |
70 |
1 |
2118 |
4.0 |
71 |
1 |
2138 |
4.0 |
72 |
1 |
2140 |
4.0 |
73 |
1 |
2143 |
4.0 |
74 |
1 |
2173 |
4.0 |
75 |
1 |
2174 |
4.0 |
76 |
1 |
2193 |
4.0 |
79 |
1 |
2288 |
4.0 |
80 |
1 |
2291 |
4.0 |
81 |
1 |
2542 |
4.0 |
82 |
1 |
2628 |
4.0 |
90 |
1 |
2762 |
4.0 |
92 |
1 |
2872 |
4.0 |
94 |
1 |
2944 |
4.0 |
96 |
1 |
2959 |
4.0 |
97 |
1 |
2968 |
4.0 |
101 |
1 |
3081 |
4.0 |
102 |
1 |
3153 |
4.0 |
is_animation = movies['genres'].str.contains('Animation')
movies[is_animation][5:15]
|
movieId |
title |
genres |
310 |
313 |
Swan Princess, The (1994) |
Animation|Children |
360 |
364 |
Lion King, The (1994) |
Adventure|Animation|Children|Drama|Musical|IMAX |
388 |
392 |
Secret Adventures of Tom Thumb, The (1993) |
Adventure|Animation |
547 |
551 |
Nightmare Before Christmas, The (1993) |
Animation|Children|Fantasy|Musical |
553 |
558 |
Pagemaster, The (1994) |
Action|Adventure|Animation|Children|Fantasy |
582 |
588 |
Aladdin (1992) |
Adventure|Animation|Children|Comedy|Musical |
588 |
594 |
Snow White and the Seven Dwarfs (1937) |
Animation|Children|Drama|Fantasy|Musical |
589 |
595 |
Beauty and the Beast (1991) |
Animation|Children|Fantasy|Musical|Romance|IMAX |
590 |
596 |
Pinocchio (1940) |
Animation|Children|Fantasy|Musical |
604 |
610 |
Heavy Metal (1981) |
Action|Adventure|Animation|Horror|Sci-Fi |
movies[is_animation].head(15)
|
movieId |
title |
genres |
0 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
12 |
13 |
Balto (1995) |
Adventure|Animation|Children |
47 |
48 |
Pocahontas (1995) |
Animation|Children|Drama|Musical|Romance |
236 |
239 |
Goofy Movie, A (1995) |
Animation|Children|Comedy|Romance |
241 |
244 |
Gumby: The Movie (1995) |
Animation|Children |
310 |
313 |
Swan Princess, The (1994) |
Animation|Children |
360 |
364 |
Lion King, The (1994) |
Adventure|Animation|Children|Drama|Musical|IMAX |
388 |
392 |
Secret Adventures of Tom Thumb, The (1993) |
Adventure|Animation |
547 |
551 |
Nightmare Before Christmas, The (1993) |
Animation|Children|Fantasy|Musical |
553 |
558 |
Pagemaster, The (1994) |
Action|Adventure|Animation|Children|Fantasy |
582 |
588 |
Aladdin (1992) |
Adventure|Animation|Children|Comedy|Musical |
588 |
594 |
Snow White and the Seven Dwarfs (1937) |
Animation|Children|Drama|Fantasy|Musical |
589 |
595 |
Beauty and the Beast (1991) |
Animation|Children|Fantasy|Musical|Romance|IMAX |
590 |
596 |
Pinocchio (1940) |
Animation|Children|Fantasy|Musical |
604 |
610 |
Heavy Metal (1981) |
Action|Adventure|Animation|Horror|Sci-Fi |
Group By and Aggregate
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count
|
movieId |
rating |
|
0.5 |
239125 |
1.0 |
680732 |
1.5 |
279252 |
2.0 |
1430997 |
2.5 |
883398 |
3.0 |
4291193 |
3.5 |
2200156 |
4.0 |
5561926 |
4.5 |
1534824 |
5.0 |
2898660 |
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()
|
rating |
movieId |
|
1 |
3.921240 |
2 |
3.211977 |
3 |
3.151040 |
4 |
2.861393 |
5 |
3.064592 |
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()
|
rating |
movieId |
|
1 |
49695 |
2 |
22243 |
3 |
12735 |
4 |
2756 |
5 |
12161 |
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()
|
rating |
movieId |
|
131254 |
1 |
131256 |
1 |
131258 |
1 |
131260 |
1 |
131262 |
1 |
Merge Dataframes
tags.head()
|
userId |
movieId |
tag |
0 |
18 |
4141 |
Mark Waters |
1 |
65 |
208 |
dark hero |
2 |
65 |
353 |
dark hero |
3 |
65 |
521 |
noir thriller |
4 |
65 |
592 |
dark hero |
movies.head()
|
movieId |
title |
genres |
0 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1 |
2 |
Jumanji (1995) |
Adventure|Children|Fantasy |
2 |
3 |
Grumpier Old Men (1995) |
Comedy|Romance |
3 |
4 |
Waiting to Exhale (1995) |
Comedy|Drama|Romance |
4 |
5 |
Father of the Bride Part II (1995) |
Comedy |
t = movies.merge(tags, on='movieId', how='inner')
t.head()
|
movieId |
title |
genres |
userId |
tag |
0 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1644 |
Watched |
1 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1741 |
computer animation |
2 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1741 |
Disney animated feature |
3 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1741 |
Pixar animation |
4 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1741 |
Téa Leoni does not star in this movie |
More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html
Combine aggreagation, merging, and filters to get useful analytics
avg_ratings = ratings.groupby('movieId', as_index=False).mean().rename(columns={'rating':'avg_rating'})
del avg_ratings['userId']
avg_ratings.head()
|
movieId |
avg_rating |
0 |
1 |
3.921240 |
1 |
2 |
3.211977 |
2 |
3 |
3.151040 |
3 |
4 |
2.861393 |
4 |
5 |
3.064592 |
avg_ratings = avg_ratings.rename({'ratings':'avg_rating'})
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.head()
|
movieId |
title |
genres |
avg_rating |
0 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
3.921240 |
1 |
2 |
Jumanji (1995) |
Adventure|Children|Fantasy |
3.211977 |
2 |
3 |
Grumpier Old Men (1995) |
Comedy|Romance |
3.151040 |
3 |
4 |
Waiting to Exhale (1995) |
Comedy|Drama|Romance |
2.861393 |
4 |
5 |
Father of the Bride Part II (1995) |
Comedy |
3.064592 |
is_highly_rated = box_office['avg_rating'] >= 4.0
box_office[is_highly_rated][-5:]
|
movieId |
title |
genres |
avg_rating |
26737 |
131250 |
No More School (2000) |
Comedy |
4.0 |
26738 |
131252 |
Forklift Driver Klaus: The First Day on the Jo… |
Comedy|Horror |
4.0 |
26739 |
131254 |
Kein Bund für’s Leben (2007) |
Comedy |
4.0 |
26740 |
131256 |
Feuer, Eis & Dosenbier (2002) |
Comedy |
4.0 |
26743 |
131262 |
Innocence (2014) |
Adventure|Fantasy|Horror |
4.0 |
is_comedy = box_office['genres'].str.contains('Comedy')
box_office[is_comedy][:5]
|
movieId |
title |
genres |
avg_rating |
0 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
3.921240 |
2 |
3 |
Grumpier Old Men (1995) |
Comedy|Romance |
3.151040 |
3 |
4 |
Waiting to Exhale (1995) |
Comedy|Drama|Romance |
2.861393 |
4 |
5 |
Father of the Bride Part II (1995) |
Comedy |
3.064592 |
6 |
7 |
Sabrina (1995) |
Comedy|Romance |
3.366484 |
box_office[is_comedy & is_highly_rated][-5:]
|
movieId |
title |
genres |
avg_rating |
26736 |
131248 |
Brother Bear 2 (2006) |
Adventure|Animation|Children|Comedy|Fantasy |
4.0 |
26737 |
131250 |
No More School (2000) |
Comedy |
4.0 |
26738 |
131252 |
Forklift Driver Klaus: The First Day on the Jo… |
Comedy|Horror |
4.0 |
26739 |
131254 |
Kein Bund für’s Leben (2007) |
Comedy |
4.0 |
26740 |
131256 |
Feuer, Eis & Dosenbier (2002) |
Comedy |
4.0 |
Vectorized String Operations
movies.head()
|
movieId |
title |
genres |
0 |
1 |
Toy Story (1995) |
Adventure|Animation|Children|Comedy|Fantasy |
1 |
2 |
Jumanji (1995) |
Adventure|Children|Fantasy |
2 |
3 |
Grumpier Old Men (1995) |
Comedy|Romance |
3 |
4 |
Waiting to Exhale (1995) |
Comedy|Drama|Romance |
4 |
5 |
Father of the Bride Part II (1995) |
Comedy |
Split ‘genres’ into multiple columns
movie_genres = movies['genres'].str.split('|', expand=True)
movie_genres[:10]
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
0 |
Adventure |
Animation |
Children |
Comedy |
Fantasy |
None |
None |
None |
None |
None |
1 |
Adventure |
Children |
Fantasy |
None |
None |
None |
None |
None |
None |
None |
2 |
Comedy |
Romance |
None |
None |
None |
None |
None |
None |
None |
None |
3 |
Comedy |
Drama |
Romance |
None |
None |
None |
None |
None |
None |
None |
4 |
Comedy |
None |
None |
None |
None |
None |
None |
None |
None |
None |
5 |
Action |
Crime |
Thriller |
None |
None |
None |
None |
None |
None |
None |
6 |
Comedy |
Romance |
None |
None |
None |
None |
None |
None |
None |
None |
7 |
Adventure |
Children |
None |
None |
None |
None |
None |
None |
None |
None |
8 |
Action |
None |
None |
None |
None |
None |
None |
None |
None |
None |
9 |
Action |
Adventure |
Thriller |
None |
None |
None |
None |
None |
None |
None |
Add a new column for comedy genre flag
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')
movie_genres[:10]
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
isComedy |
0 |
Adventure |
Animation |
Children |
Comedy |
Fantasy |
None |
None |
None |
None |
None |
True |
1 |
Adventure |
Children |
Fantasy |
None |
None |
None |
None |
None |
None |
None |
False |
2 |
Comedy |
Romance |
None |
None |
None |
None |
None |
None |
None |
None |
True |
3 |
Comedy |
Drama |
Romance |
None |
None |
None |
None |
None |
None |
None |
True |
4 |
Comedy |
None |
None |
None |
None |
None |
None |
None |
None |
None |
True |
5 |
Action |
Crime |
Thriller |
None |
None |
None |
None |
None |
None |
None |
False |
6 |
Comedy |
Romance |
None |
None |
None |
None |
None |
None |
None |
None |
True |
7 |
Adventure |
Children |
None |
None |
None |
None |
None |
None |
None |
None |
False |
8 |
Action |
None |
None |
None |
None |
None |
None |
None |
None |
None |
False |
9 |
Action |
Adventure |
Thriller |
None |
None |
None |
None |
None |
None |
None |
False |
Extract year from title e.g. (1995)
movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)
movies.tail()
|
movieId |
title |
genres |
year |
27273 |
131254 |
Kein Bund für’s Leben (2007) |
Comedy |
2007 |
27274 |
131256 |
Feuer, Eis & Dosenbier (2002) |
Comedy |
2002 |
27275 |
131258 |
The Pirates (2014) |
Adventure |
2014 |
27276 |
131260 |
Rentun Ruusu (2001) |
(no genres listed) |
2001 |
27277 |
131262 |
Innocence (2014) |
Adventure|Fantasy|Horror |
2014 |
More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
Parsing Timestamps
Timestamps are common in sensor data or other time series datasets. Let us revisit the *tags.csv* dataset and read the timestamps!
tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.dtypes
userId int64 movieId int64 tag object timestamp int64 dtype: object
Unix time / POSIX time / epoch time records time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970
tags.head(5)
|
userId |
movieId |
tag |
timestamp |
0 |
18 |
4141 |
Mark Waters |
1240597180 |
1 |
65 |
208 |
dark hero |
1368150078 |
2 |
65 |
353 |
dark hero |
1368150079 |
3 |
65 |
521 |
noir thriller |
1368149983 |
4 |
65 |
592 |
dark hero |
1368150078 |
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')
Data Type datetime64[ns] maps to either
tags['parsed_time'].dtype
dtype(‘
tags.head(2)
|
userId |
movieId |
tag |
timestamp |
parsed_time |
0 |
18 |
4141 |
Mark Waters |
1240597180 |
2009-04-24 18:19:40 |
1 |
65 |
208 |
dark hero |
1368150078 |
2013-05-10 01:41:18 |
Selecting rows based on timestamps
greater_than_t = tags['parsed_time'] > '2015-02-01'
selected_rows = tags[greater_than_t]
tags.shape, selected_rows.shape
((465564, 5), (12130, 5))
Sorting the table using the timestamps
tags.sort_values(by='parsed_time', ascending=True)[:10]
|
userId |
movieId |
tag |
timestamp |
parsed_time |
333932 |
100371 |
2788 |
monty python |
1135429210 |
2005-12-24 13:00:10 |
333927 |
100371 |
1732 |
coen brothers |
1135429236 |
2005-12-24 13:00:36 |
333924 |
100371 |
1206 |
stanley kubrick |
1135429248 |
2005-12-24 13:00:48 |
333923 |
100371 |
1193 |
jack nicholson |
1135429371 |
2005-12-24 13:02:51 |
333939 |
100371 |
5004 |
peter sellers |
1135429399 |
2005-12-24 13:03:19 |
333922 |
100371 |
47 |
morgan freeman |
1135429412 |
2005-12-24 13:03:32 |
333921 |
100371 |
47 |
brad pitt |
1135429412 |
2005-12-24 13:03:32 |
333936 |
100371 |
4011 |
brad pitt |
1135429431 |
2005-12-24 13:03:51 |
333937 |
100371 |
4011 |
guy ritchie |
1135429431 |
2005-12-24 13:03:51 |
333920 |
100371 |
32 |
bruce willis |
1135429442 |
2005-12-24 13:04:02 |
Average Movie Ratings over Time
## Are Movie ratings related to the year of launch?
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()
|
movieId |
rating |
26739 |
131254 |
4.0 |
26740 |
131256 |
4.0 |
26741 |
131258 |
2.5 |
26742 |
131260 |
3.0 |
26743 |
131262 |
4.0 |
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined.corr()
|
movieId |
rating |
movieId |
1.000000 |
-0.090369 |
rating |
-0.090369 |
1.000000 |
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[:10]
|
year |
rating |
0 |
1891 |
3.000000 |
1 |
1893 |
3.375000 |
2 |
1894 |
3.071429 |
3 |
1895 |
3.125000 |
4 |
1896 |
3.183036 |
5 |
1898 |
3.850000 |
6 |
1899 |
3.625000 |
7 |
1900 |
3.166667 |
8 |
1901 |
5.000000 |
9 |
1902 |
3.738189 |
import matplotlib.pyplot as plt
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)
plt.show()
Do some years look better for the boxoffice movies than others?
Does any data point seem like an outlier in some sense?
注意: 本文为edx上 UCSD 的课程py for data science笔记