In this project, you will use functions from the NumPy, Pandas, matplotlib, and seaborn libraries to explore a movie dataset.
Download the dataset:
TMDb movie data
Meaning of each column in the dataset:
| Column name | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | keywords | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Meaning | ID | IMDB ID | Popularity | Budget | Revenue | Title | Cast | Homepage | Director | Tagline | Keywords | Overview | Runtime | Genres | Production companies | Release date | Vote count | Vote average | Release year | Budget (adjusted) | Revenue (adjusted) |
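**Task 1.1:** Import libraries and data
Import the required libraries: NumPy, Pandas, matplotlib, and seaborn. Then use the Pandas library to read the data in tmdb-movies.csv and save it as movie_data.
import numpy as np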
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
%matplotlib inline
%timeit
movie_data = pd.read_csv('C:/Users/Administrator/Documents/Explore_Movie_Dataset/Explore Movie Dataset/tmdb-movies.csv')
**Task 1.2:** Understand the data
movie_data.head(2)
| | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action\|Adventure\|Science Fiction\|Thriller | Universal Studios\|Amblin Entertainment\|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Action\|Adventure\|Science Fiction\|Thriller | Village Roadshow Pictures\|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
2 rows × 21 columns
movie_data.dtypes
id int64
imdb_id object
popularity float64
budget int64
revenue int64
original_title object
cast object
homepage object
director object
tagline object
keywords object
overview object
runtime int64
genres object
production_companies object
release_date object
vote_count int64
vote_average float64
release_year int64
budget_adj float64
revenue_adj float64
dtype: object
movie_data.isnull().sum()
id 0
imdb_id 10
popularity 0
budget 0
revenue 0
original_title 0
cast 76
homepage 7930
director 44
tagline 2824
keywords 1493
overview 4
runtime 0
genres 23
production_companies 1030
release_date 0
vote_count 0
vote_average 0
release_year 0
budget_adj 0
revenue_adj 0
dtype: int64
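The raw counts above are easier to compare as fractions of the 10866 rows (for example, homepage is missing in 7930 rows, roughly 73%). A minimal sketch of that calculation follows; the `toy` frame and its values are made up for illustration and stand in for `movie_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for movie_data; the values are invented for illustration
toy = pd.DataFrame({
    'original_title': ['A', 'B', 'C', 'D'],
    'homepage': [np.nan, 'http://example.com', np.nan, np.nan],
    'director': ['X', np.nan, 'Y', 'Z'],
})

# isnull() yields a boolean table; the mean of booleans is the share of True values
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting the result puts the sparsest columns first, which helps when deciding whether to fill a column or drop it.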
movie_data.shape
(10866, 21)
movie_data.describe()
| | id | popularity | budget | revenue | runtime | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10866.000000 | 10866.000000 | 1.086600e+04 | 1.086600e+04 | 10866.000000 | 10866.000000 | 10866.000000 | 10866.000000 | 1.086600e+04 | 1.086600e+04 |
| mean | 66064.177434 | 0.646441 | 1.462570e+07 | 3.982332e+07 | 102.070863 | 217.389748 | 5.974922 | 2001.322658 | 1.755104e+07 | 5.136436e+07 |
| std | 92130.136561 | 1.000185 | 3.091321e+07 | 1.170035e+08 | 31.381405 | 575.619058 | 0.935142 | 12.812941 | 3.430616e+07 | 1.446325e+08 |
| min | 5.000000 | 0.000065 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 10.000000 | 1.500000 | 1960.000000 | 0.000000e+00 | 0.000000e+00 |
| 25% | 10596.250000 | 0.207583 | 0.000000e+00 | 0.000000e+00 | 90.000000 | 17.000000 | 5.400000 | 1995.000000 | 0.000000e+00 | 0.000000e+00 |
| 50% | 20669.000000 | 0.383856 | 0.000000e+00 | 0.000000e+00 | 99.000000 | 38.000000 | 6.000000 | 2006.000000 | 0.000000e+00 | 0.000000e+00 |
| 75% | 75610.000000 | 0.713817 | 1.500000e+07 | 2.400000e+07 | 111.000000 | 145.750000 | 6.600000 | 2011.000000 | 2.085325e+07 | 3.369710e+07 |
| max | 417859.000000 | 32.985763 | 4.250000e+08 | 2.781506e+09 | 900.000000 | 9767.000000 | 9.200000 | 2015.000000 | 4.250000e+08 | 2.827124e+09 |
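**Task 1.3:** Clean the data
In real-world work, data processing is often the most time-consuming and laborious step. Fortunately, the TMDb dataset provided here is very "clean", so it does not need much cleaning or processing. In this step, your main job is to handle the missing values in the table. You can use .fillna() to fill in missing values, or .dropna() to discard the rows or columns that contain them.
Task: use an appropriate method to handle the missing values, and save the resulting data.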
movie_data_nonul = movie_data.dropna(axis=1).copy()
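Note that `dropna(axis=1)` removes every column that contains at least one missing value, which is why the later outputs of `movie_data_nonul` show 12 columns instead of 21. The sketch below contrasts that choice with two alternatives on a made-up `toy` frame; the placeholder string `'Unknown'` is an assumption for illustration, not a value from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for movie_data; the values are invented for illustration
toy = pd.DataFrame({
    'original_title': ['A', 'B', 'C'],
    'director': ['X', np.nan, 'Z'],
    'budget': [100, 200, 300],
})

# Drop every column that has any missing value (what the line above does)
dropped_cols = toy.dropna(axis=1)

# Drop only the rows where a specific column is missing
dropped_rows = toy.dropna(subset=['director'])

# Keep all rows and columns, filling the gaps with a placeholder
filled = toy.fillna({'director': 'Unknown'})

print(dropped_cols.columns.tolist())
print(len(dropped_rows))
print(filled['director'].tolist())
```

Dropping columns keeps all rows but loses fields such as director and genres; dropping rows or filling values keeps those fields at the cost of fewer rows or placeholder entries.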
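Compared with data-analysis software such as Excel, one of Pandas's great strengths is that it makes it easy to select the right data based on complex logic. Knowing how to pull the appropriate data out of a table according to given requirements is therefore a very important Pandas skill, and it is the focus of this section.
**Task 2.1:** Simple selection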
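Read the data in the columns named id, popularity, budget, runtime, and vote_average. Read the first 20 rows and rows 48 and 49. Read the popularity column for rows 50 to 60. Requirement: each statement must be a single line of code.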
movie_data_nonul.id.head()
movie_data_nonul['popularity']
movie_data_nonul['vote_average'].head()
0 6.5
1 7.1
2 6.3
3 7.5
4 7.3
Name: vote_average, dtype: float64
movie_data_nonul[:20]
movie_data_nonul[48:50]
| | id | popularity | budget | revenue | original_title | runtime | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48 | 265208 | 2.932340 | 30000000 | 0 | Wild Card | 92 | 1/14/15 | 481 | 5.3 | 2015 | 2.759999e+07 | 0.000000e+00 |
| 49 | 254320 | 2.885126 | 4000000 | 9064511 | The Lobster | 118 | 10/8/15 | 638 | 6.6 | 2015 | 3.679998e+06 | 8.339346e+06 |
movie_data_nonul[50:61]['popularity']
50 2.883233
51 2.814802
52 2.798017
53 2.793297
54 2.614499
55 2.584264
56 2.578919
57 2.575711
58 2.557859
59 2.550747
60 2.487849
Name: popularity, dtype: float64
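The slices above rely on positional slicing, where the end of the range is excluded (hence `[50:61]` for rows 50 to 60). As a small sketch on a made-up `toy` frame, the same selection can be written with `.iloc` (positional, end excluded) or `.loc` (label based, end included):

```python
import pandas as pd

# Toy stand-in for movie_data_nonul; the values are invented for illustration
toy = pd.DataFrame({
    'popularity': [3.0, 2.5, 2.0, 1.5, 1.0],
    'runtime': [120, 95, 101, 88, 140],
})

by_slice = toy[1:3]['popularity']      # positions 1 and 2; the end is excluded
by_iloc = toy.iloc[1:3]['popularity']  # same rows, explicitly positional
by_loc = toy.loc[1:3, 'popularity']    # labels 1 through 3; the end is included

# Several columns at once need a list of names inside the brackets
cols = toy[['popularity', 'runtime']]

print(by_slice.tolist(), by_loc.tolist())
```

With the default integer index, positions and labels coincide, so the only visible difference is whether the last row is included.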
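**Task 2.2:** Logical indexing
Read all rows with popularity greater than 5. Read all rows with popularity greater than 5 and a release year after 1996. Hint: the logical operators & and | in Pandas stand for "and" and "or" respectively.
Requirement: use logical indexing.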
movie_data_nonul[movie_data_nonul['popularity'] > 5]
| | id | popularity | budget | revenue | original_title | runtime | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
| 1 | 76341 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | 120 | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
| 2 | 262500 | 13.112507 | 110000000 | 295238201 | Insurgent | 119 | 3/18/15 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 |
| 3 | 140607 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | 136 | 12/15/15 | 5292 | 7.5 | 2015 | 1.839999e+08 | 1.902723e+09 |
| 4 | 168259 | 9.335014 | 190000000 | 1506249360 | Furious 7 | 137 | 4/1/15 | 2947 | 7.3 | 2015 | 1.747999e+08 | 1.385749e+09 |
| 5 | 281957 | 9.110700 | 135000000 | 532950503 | The Revenant | 156 | 12/25/15 | 3929 | 7.2 | 2015 | 1.241999e+08 | 4.903142e+08 |
| 6 | 87101 | 8.654359 | 155000000 | 440603537 | Terminator Genisys | 125 | 6/23/15 | 2598 | 5.8 | 2015 | 1.425999e+08 | 4.053551e+08 |
| 7 | 286217 | 7.667400 | 108000000 | 595380321 | The Martian | 141 | 9/30/15 | 4572 | 7.6 | 2015 | 9.935996e+07 | 5.477497e+08 |
| 8 | 211672 | 7.404165 | 74000000 | 1156730962 | Minions | 91 | 6/17/15 | 2893 | 6.5 | 2015 | 6.807997e+07 | 1.064192e+09 |
| 9 | 150540 | 6.326804 | 175000000 | 853708609 | Inside Out | 94 | 6/9/15 | 3935 | 8.0 | 2015 | 1.609999e+08 | 7.854116e+08 |
| 10 | 206647 | 6.200282 | 245000000 | 880674609 | Spectre | 148 | 10/26/15 | 3254 | 6.2 | 2015 | 2.253999e+08 | 8.102203e+08 |
| 11 | 76757 | 6.189369 | 176000003 | 183987723 | Jupiter Ascending | 124 | 2/4/15 | 1937 | 5.2 | 2015 | 1.619199e+08 | 1.692686e+08 |
| 12 | 264660 | 6.118847 | 15000000 | 36869414 | Ex Machina | 108 | 1/21/15 | 2854 | 7.6 | 2015 | 1.379999e+07 | 3.391985e+07 |
| 13 | 257344 | 5.984995 | 88000000 | 243637091 | Pixels | 105 | 7/16/15 | 1575 | 5.8 | 2015 | 8.095996e+07 | 2.241460e+08 |
| 14 | 99861 | 5.944927 | 280000000 | 1405035767 | Avengers: Age of Ultron | 141 | 4/22/15 | 4304 | 7.4 | 2015 | 2.575999e+08 | 1.292632e+09 |
| 15 | 273248 | 5.898400 | 44000000 | 155760117 | The Hateful Eight | 167 | 12/25/15 | 2389 | 7.4 | 2015 | 4.047998e+07 | 1.432992e+08 |
| 16 | 260346 | 5.749758 | 48000000 | 325771424 | Taken 3 | 109 | 1/1/15 | 1578 | 6.1 | 2015 | 4.415998e+07 | 2.997096e+08 |
| 17 | 102899 | 5.573184 | 130000000 | 518602163 | Ant-Man | 115 | 7/14/15 | 3779 | 7.0 | 2015 | 1.195999e+08 | 4.771138e+08 |
| 18 | 150689 | 5.556818 | 95000000 | 542351353 | Cinderella | 112 | 3/12/15 | 1495 | 6.8 | 2015 | 8.739996e+07 | 4.989630e+08 |
| 19 | 131634 | 5.476958 | 160000000 | 650523427 | The Hunger Games: Mockingjay - Part 2 | 136 | 11/18/15 | 2380 | 6.5 | 2015 | 1.471999e+08 | 5.984813e+08 |
| 20 | 158852 | 5.462138 | 190000000 | 209035668 | Tomorrowland | 130 | 5/19/15 | 1899 | 6.2 | 2015 | 1.747999e+08 | 1.923127e+08 |
| 21 | 307081 | 5.337064 | 30000000 | 91709827 | Southpaw | 123 | 6/15/15 | 1386 | 7.3 | 2015 | 2.759999e+07 | 8.437300e+07 |
| 629 | 157336 | 24.949134 | 165000000 | 621752480 | Interstellar | 169 | 11/5/14 | 6498 | 8.0 | 2014 | 1.519800e+08 | 5.726906e+08 |
| 630 | 118340 | 14.311205 | 170000000 | 773312399 | Guardians of the Galaxy | 121 | 7/30/14 | 5612 | 7.9 | 2014 | 1.565855e+08 | 7.122911e+08 |
| 631 | 100402 | 12.971027 | 170000000 | 714766572 | Captain America: The Winter Soldier | 136 | 3/20/14 | 3848 | 7.6 | 2014 | 1.565855e+08 | 6.583651e+08 |
| 632 | 245891 | 11.422751 | 20000000 | 78739897 | John Wick | 101 | 10/22/14 | 2712 | 7.0 | 2014 | 1.842182e+07 | 7.252661e+07 |
| 633 | 131631 | 10.739009 | 125000000 | 752100229 | The Hunger Games: Mockingjay - Part 1 | 123 | 11/18/14 | 3590 | 6.6 | 2014 | 1.151364e+08 | 6.927528e+08 |
| 634 | 122917 | 10.174599 | 250000000 | 955119788 | The Hobbit: The Battle of the Five Armies | 144 | 12/10/14 | 3110 | 7.1 | 2014 | 2.302728e+08 | 8.797523e+08 |
| 635 | 177572 | 8.691294 | 165000000 | 652105443 | Big Hero 6 | 102 | 10/24/14 | 4185 | 7.8 | 2014 | 1.519800e+08 | 6.006485e+08 |
| 636 | 205596 | 8.110711 | 14000000 | 233555708 | The Imitation Game | 113 | 11/14/14 | 3478 | 8.0 | 2014 | 1.289527e+07 | 2.151261e+08 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2876 | 10681 | 5.678119 | 180000000 | 521311860 | WALL·E | 98 | 6/22/08 | 4209 | 7.6 | 2008 | 1.823016e+08 | 5.279777e+08 |
| 3371 | 161337 | 8.411577 | 0 | 0 | Underworld: Endless War | 18 | 10/19/11 | 21 | 5.9 | 2011 | 0.000000e+00 | 0.000000e+00 |
| 3372 | 1771 | 7.959228 | 140000000 | 370569774 | Captain America: The First Avenger | 124 | 7/22/11 | 5025 | 6.5 | 2011 | 1.357157e+08 | 3.592296e+08 |
| 3373 | 64690 | 5.903353 | 15000000 | 76175166 | Drive | 100 | 1/10/11 | 2347 | 7.3 | 2011 | 1.454097e+07 | 7.384406e+07 |
| 3374 | 12445 | 5.711315 | 125000000 | 1327817822 | Harry Potter and the Deathly Hallows: Part 2 | 130 | 7/7/11 | 3750 | 7.7 | 2011 | 1.211748e+08 | 1.287184e+09 |
| 3911 | 121 | 8.095275 | 79000000 | 926287400 | The Lord of the Rings: The Two Towers | 179 | 12/18/02 | 5114 | 7.8 | 2002 | 9.576865e+07 | 1.122902e+09 |
| 3912 | 672 | 6.012584 | 100000000 | 876688482 | Harry Potter and the Chamber of Secrets | 161 | 11/13/02 | 3458 | 7.2 | 2002 | 1.212261e+08 | 1.062776e+09 |
| 4177 | 680 | 8.093754 | 8000000 | 213928762 | Pulp Fiction | 154 | 10/14/94 | 5343 | 8.1 | 1994 | 1.176889e+07 | 3.147131e+08 |
| 4178 | 278 | 7.192039 | 25000000 | 28341469 | The Shawshank Redemption | 142 | 9/10/94 | 5754 | 8.4 | 1994 | 3.677779e+07 | 4.169346e+07 |
| 4179 | 13 | 6.715966 | 55000000 | 677945399 | Forrest Gump | 142 | 7/6/94 | 4856 | 8.1 | 1994 | 8.091114e+07 | 9.973333e+08 |
| 4361 | 24428 | 7.637767 | 220000000 | 1519557910 | The Avengers | 143 | 4/25/12 | 8903 | 7.3 | 2012 | 2.089437e+08 | 1.443191e+09 |
| 4362 | 52520 | 7.031452 | 70000000 | 132400000 | Underworld: Awakening | 88 | 1/19/12 | 1426 | 6.0 | 2012 | 6.648210e+07 | 1.257461e+08 |
| 4363 | 49026 | 6.591277 | 250000000 | 1081041287 | The Dark Knight Rises | 165 | 7/16/12 | 6723 | 7.5 | 2012 | 2.374361e+08 | 1.026713e+09 |
| 4364 | 68718 | 5.944518 | 100000000 | 425368238 | Django Unchained | 165 | 12/25/12 | 7375 | 7.7 | 2012 | 9.497443e+07 | 4.039911e+08 |
| 4365 | 37724 | 5.603587 | 200000000 | 1108561013 | Skyfall | 143 | 10/25/12 | 6137 | 6.8 | 2012 | 1.899489e+08 | 1.052849e+09 |
| 4949 | 122 | 7.122455 | 94000000 | 1118888979 | The Lord of the Rings: The Return of the King | 201 | 12/1/03 | 5636 | 7.9 | 2003 | 1.114231e+08 | 1.326278e+09 |
| 4950 | 277 | 6.887883 | 22000000 | 95708457 | Underworld | 121 | 9/19/03 | 1708 | 6.5 | 2003 | 2.607776e+07 | 1.134483e+08 |
| 4951 | 22 | 6.864067 | 140000000 | 655011224 | Pirates of the Caribbean: The Curse of the Bla... | 143 | 7/9/03 | 4223 | 7.3 | 2003 | 1.659494e+08 | 7.764193e+08 |
| 4952 | 24 | 6.174132 | 30000000 | 180949000 | Kill Bill: Vol. 1 | 111 | 10/10/03 | 2932 | 7.6 | 2003 | 3.556058e+07 | 2.144884e+08 |
| 5230 | 13590 | 6.668990 | 0 | 0 | Eddie Izzard: Glorious | 99 | 11/17/97 | 11 | 5.5 | 1997 | 0.000000e+00 | 0.000000e+00 |
| 5422 | 109445 | 6.112766 | 150000000 | 1274219009 | Frozen | 102 | 11/27/13 | 3369 | 7.5 | 2013 | 1.404050e+08 | 1.192711e+09 |
| 5423 | 49047 | 5.242753 | 105000000 | 716392705 | Gravity | 91 | 9/27/13 | 3775 | 7.4 | 2013 | 9.828350e+07 | 6.705675e+08 |
| 5424 | 76338 | 5.111900 | 170000000 | 479765000 | Thor: The Dark World | 112 | 10/29/13 | 3025 | 6.8 | 2013 | 1.591257e+08 | 4.490760e+08 |
| 6081 | 105 | 6.095293 | 19000000 | 381109762 | Back to the Future | 116 | 7/3/85 | 3785 | 7.8 | 1985 | 3.851615e+07 | 7.725728e+08 |
| 6190 | 674 | 5.939927 | 150000000 | 895921036 | Harry Potter and the Goblet of Fire | 157 | 11/5/05 | 3406 | 7.3 | 2005 | 1.674845e+08 | 1.000353e+09 |
| 6191 | 272 | 5.400826 | 150000000 | 374218673 | Batman Begins | 140 | 6/14/05 | 4914 | 7.3 | 2005 | 1.674845e+08 | 4.178388e+08 |
| 6554 | 834 | 5.838503 | 50000000 | 111340801 | Underworld: Evolution | 106 | 1/12/06 | 1015 | 6.3 | 2006 | 5.408346e+07 | 1.204339e+08 |
| 6962 | 673 | 5.827781 | 130000000 | 789804554 | Harry Potter and the Prisoner of Azkaban | 141 | 5/31/04 | 3550 | 7.4 | 2004 | 1.500779e+08 | 9.117862e+08 |
| 7269 | 238 | 5.738034 | 6000000 | 245066411 | The Godfather | 175 | 3/15/72 | 3970 | 8.3 | 1972 | 3.128737e+07 | 1.277914e+09 |
| 7309 | 1891 | 5.488441 | 18000000 | 538400000 | The Empire Strikes Back | 124 | 1/1/80 | 3954 | 8.0 | 1980 | 4.762866e+07 | 1.424626e+09 |
85 rows × 12 columns
movie_data_nonul[(movie_data_nonul['popularity'] > 5) & (movie_data_nonul['release_year'] > 1996)]
| | id | popularity | budget | revenue | original_title | runtime | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
| 1 | 76341 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | 120 | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
| 2 | 262500 | 13.112507 | 110000000 | 295238201 | Insurgent | 119 | 3/18/15 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 |
| 3 | 140607 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | 136 | 12/15/15 | 5292 | 7.5 | 2015 | 1.839999e+08 | 1.902723e+09 |
| 4 | 168259 | 9.335014 | 190000000 | 1506249360 | Furious 7 | 137 | 4/1/15 | 2947 | 7.3 | 2015 | 1.747999e+08 | 1.385749e+09 |
| 5 | 281957 | 9.110700 | 135000000 | 532950503 | The Revenant | 156 | 12/25/15 | 3929 | 7.2 | 2015 | 1.241999e+08 | 4.903142e+08 |
| 6 | 87101 | 8.654359 | 155000000 | 440603537 | Terminator Genisys | 125 | 6/23/15 | 2598 | 5.8 | 2015 | 1.425999e+08 | 4.053551e+08 |
| 7 | 286217 | 7.667400 | 108000000 | 595380321 | The Martian | 141 | 9/30/15 | 4572 | 7.6 | 2015 | 9.935996e+07 | 5.477497e+08 |
| 8 | 211672 | 7.404165 | 74000000 | 1156730962 | Minions | 91 | 6/17/15 | 2893 | 6.5 | 2015 | 6.807997e+07 | 1.064192e+09 |
| 9 | 150540 | 6.326804 | 175000000 | 853708609 | Inside Out | 94 | 6/9/15 | 3935 | 8.0 | 2015 | 1.609999e+08 | 7.854116e+08 |
| 10 | 206647 | 6.200282 | 245000000 | 880674609 | Spectre | 148 | 10/26/15 | 3254 | 6.2 | 2015 | 2.253999e+08 | 8.102203e+08 |
| 11 | 76757 | 6.189369 | 176000003 | 183987723 | Jupiter Ascending | 124 | 2/4/15 | 1937 | 5.2 | 2015 | 1.619199e+08 | 1.692686e+08 |
| 12 | 264660 | 6.118847 | 15000000 | 36869414 | Ex Machina | 108 | 1/21/15 | 2854 | 7.6 | 2015 | 1.379999e+07 | 3.391985e+07 |
| 13 | 257344 | 5.984995 | 88000000 | 243637091 | Pixels | 105 | 7/16/15 | 1575 | 5.8 | 2015 | 8.095996e+07 | 2.241460e+08 |
| 14 | 99861 | 5.944927 | 280000000 | 1405035767 | Avengers: Age of Ultron | 141 | 4/22/15 | 4304 | 7.4 | 2015 | 2.575999e+08 | 1.292632e+09 |
| 15 | 273248 | 5.898400 | 44000000 | 155760117 | The Hateful Eight | 167 | 12/25/15 | 2389 | 7.4 | 2015 | 4.047998e+07 | 1.432992e+08 |
| 16 | 260346 | 5.749758 | 48000000 | 325771424 | Taken 3 | 109 | 1/1/15 | 1578 | 6.1 | 2015 | 4.415998e+07 | 2.997096e+08 |
| 17 | 102899 | 5.573184 | 130000000 | 518602163 | Ant-Man | 115 | 7/14/15 | 3779 | 7.0 | 2015 | 1.195999e+08 | 4.771138e+08 |
| 18 | 150689 | 5.556818 | 95000000 | 542351353 | Cinderella | 112 | 3/12/15 | 1495 | 6.8 | 2015 | 8.739996e+07 | 4.989630e+08 |
| 19 | 131634 | 5.476958 | 160000000 | 650523427 | The Hunger Games: Mockingjay - Part 2 | 136 | 11/18/15 | 2380 | 6.5 | 2015 | 1.471999e+08 | 5.984813e+08 |
| 20 | 158852 | 5.462138 | 190000000 | 209035668 | Tomorrowland | 130 | 5/19/15 | 1899 | 6.2 | 2015 | 1.747999e+08 | 1.923127e+08 |
| 21 | 307081 | 5.337064 | 30000000 | 91709827 | Southpaw | 123 | 6/15/15 | 1386 | 7.3 | 2015 | 2.759999e+07 | 8.437300e+07 |
| 629 | 157336 | 24.949134 | 165000000 | 621752480 | Interstellar | 169 | 11/5/14 | 6498 | 8.0 | 2014 | 1.519800e+08 | 5.726906e+08 |
| 630 | 118340 | 14.311205 | 170000000 | 773312399 | Guardians of the Galaxy | 121 | 7/30/14 | 5612 | 7.9 | 2014 | 1.565855e+08 | 7.122911e+08 |
| 631 | 100402 | 12.971027 | 170000000 | 714766572 | Captain America: The Winter Soldier | 136 | 3/20/14 | 3848 | 7.6 | 2014 | 1.565855e+08 | 6.583651e+08 |
| 632 | 245891 | 11.422751 | 20000000 | 78739897 | John Wick | 101 | 10/22/14 | 2712 | 7.0 | 2014 | 1.842182e+07 | 7.252661e+07 |
| 633 | 131631 | 10.739009 | 125000000 | 752100229 | The Hunger Games: Mockingjay - Part 1 | 123 | 11/18/14 | 3590 | 6.6 | 2014 | 1.151364e+08 | 6.927528e+08 |
| 634 | 122917 | 10.174599 | 250000000 | 955119788 | The Hobbit: The Battle of the Five Armies | 144 | 12/10/14 | 3110 | 7.1 | 2014 | 2.302728e+08 | 8.797523e+08 |
| 635 | 177572 | 8.691294 | 165000000 | 652105443 | Big Hero 6 | 102 | 10/24/14 | 4185 | 7.8 | 2014 | 1.519800e+08 | 6.006485e+08 |
| 636 | 205596 | 8.110711 | 14000000 | 233555708 | The Imitation Game | 113 | 11/14/14 | 3478 | 8.0 | 2014 | 1.289527e+07 | 2.151261e+08 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1922 | 44214 | 5.293180 | 13000000 | 327803731 | Black Swan | 108 | 12/2/10 | 2597 | 7.1 | 2010 | 1.300000e+07 | 3.278037e+08 |
| 2409 | 550 | 8.947905 | 63000000 | 100853753 | Fight Club | 139 | 10/14/99 | 5923 | 8.1 | 1999 | 8.247033e+07 | 1.320229e+08 |
| 2410 | 603 | 7.753899 | 63000000 | 463517383 | The Matrix | 136 | 3/30/99 | 6351 | 7.8 | 1999 | 8.247033e+07 | 6.067687e+08 |
| 2633 | 120 | 8.575419 | 93000000 | 871368364 | The Lord of the Rings: The Fellowship of the Ring | 178 | 12/18/01 | 6079 | 7.8 | 2001 | 1.145284e+08 | 1.073080e+09 |
| 2634 | 671 | 8.021423 | 125000000 | 976475550 | Harry Potter and the Philosopher's Stone | 152 | 11/16/01 | 4265 | 7.2 | 2001 | 1.539360e+08 | 1.202518e+09 |
| 2875 | 155 | 8.466668 | 185000000 | 1001921825 | The Dark Knight | 152 | 7/16/08 | 8432 | 8.1 | 2008 | 1.873655e+08 | 1.014733e+09 |
| 2876 | 10681 | 5.678119 | 180000000 | 521311860 | WALL·E | 98 | 6/22/08 | 4209 | 7.6 | 2008 | 1.823016e+08 | 5.279777e+08 |
| 3371 | 161337 | 8.411577 | 0 | 0 | Underworld: Endless War | 18 | 10/19/11 | 21 | 5.9 | 2011 | 0.000000e+00 | 0.000000e+00 |
| 3372 | 1771 | 7.959228 | 140000000 | 370569774 | Captain America: The First Avenger | 124 | 7/22/11 | 5025 | 6.5 | 2011 | 1.357157e+08 | 3.592296e+08 |
| 3373 | 64690 | 5.903353 | 15000000 | 76175166 | Drive | 100 | 1/10/11 | 2347 | 7.3 | 2011 | 1.454097e+07 | 7.384406e+07 |
| 3374 | 12445 | 5.711315 | 125000000 | 1327817822 | Harry Potter and the Deathly Hallows: Part 2 | 130 | 7/7/11 | 3750 | 7.7 | 2011 | 1.211748e+08 | 1.287184e+09 |
| 3911 | 121 | 8.095275 | 79000000 | 926287400 | The Lord of the Rings: The Two Towers | 179 | 12/18/02 | 5114 | 7.8 | 2002 | 9.576865e+07 | 1.122902e+09 |
| 3912 | 672 | 6.012584 | 100000000 | 876688482 | Harry Potter and the Chamber of Secrets | 161 | 11/13/02 | 3458 | 7.2 | 2002 | 1.212261e+08 | 1.062776e+09 |
| 4361 | 24428 | 7.637767 | 220000000 | 1519557910 | The Avengers | 143 | 4/25/12 | 8903 | 7.3 | 2012 | 2.089437e+08 | 1.443191e+09 |
| 4362 | 52520 | 7.031452 | 70000000 | 132400000 | Underworld: Awakening | 88 | 1/19/12 | 1426 | 6.0 | 2012 | 6.648210e+07 | 1.257461e+08 |
| 4363 | 49026 | 6.591277 | 250000000 | 1081041287 | The Dark Knight Rises | 165 | 7/16/12 | 6723 | 7.5 | 2012 | 2.374361e+08 | 1.026713e+09 |
| 4364 | 68718 | 5.944518 | 100000000 | 425368238 | Django Unchained | 165 | 12/25/12 | 7375 | 7.7 | 2012 | 9.497443e+07 | 4.039911e+08 |
| 4365 | 37724 | 5.603587 | 200000000 | 1108561013 | Skyfall | 143 | 10/25/12 | 6137 | 6.8 | 2012 | 1.899489e+08 | 1.052849e+09 |
| 4949 | 122 | 7.122455 | 94000000 | 1118888979 | The Lord of the Rings: The Return of the King | 201 | 12/1/03 | 5636 | 7.9 | 2003 | 1.114231e+08 | 1.326278e+09 |
| 4950 | 277 | 6.887883 | 22000000 | 95708457 | Underworld | 121 | 9/19/03 | 1708 | 6.5 | 2003 | 2.607776e+07 | 1.134483e+08 |
| 4951 | 22 | 6.864067 | 140000000 | 655011224 | Pirates of the Caribbean: The Curse of the Bla... | 143 | 7/9/03 | 4223 | 7.3 | 2003 | 1.659494e+08 | 7.764193e+08 |
| 4952 | 24 | 6.174132 | 30000000 | 180949000 | Kill Bill: Vol. 1 | 111 | 10/10/03 | 2932 | 7.6 | 2003 | 3.556058e+07 | 2.144884e+08 |
| 5230 | 13590 | 6.668990 | 0 | 0 | Eddie Izzard: Glorious | 99 | 11/17/97 | 11 | 5.5 | 1997 | 0.000000e+00 | 0.000000e+00 |
| 5422 | 109445 | 6.112766 | 150000000 | 1274219009 | Frozen | 102 | 11/27/13 | 3369 | 7.5 | 2013 | 1.404050e+08 | 1.192711e+09 |
| 5423 | 49047 | 5.242753 | 105000000 | 716392705 | Gravity | 91 | 9/27/13 | 3775 | 7.4 | 2013 | 9.828350e+07 | 6.705675e+08 |
| 5424 | 76338 | 5.111900 | 170000000 | 479765000 | Thor: The Dark World | 112 | 10/29/13 | 3025 | 6.8 | 2013 | 1.591257e+08 | 4.490760e+08 |
| 6190 | 674 | 5.939927 | 150000000 | 895921036 | Harry Potter and the Goblet of Fire | 157 | 11/5/05 | 3406 | 7.3 | 2005 | 1.674845e+08 | 1.000353e+09 |
| 6191 | 272 | 5.400826 | 150000000 | 374218673 | Batman Begins | 140 | 6/14/05 | 4914 | 7.3 | 2005 | 1.674845e+08 | 4.178388e+08 |
| 6554 | 834 | 5.838503 | 50000000 | 111340801 | Underworld: Evolution | 106 | 1/12/06 | 1015 | 6.3 | 2006 | 5.408346e+07 | 1.204339e+08 |
| 6962 | 673 | 5.827781 | 130000000 | 789804554 | Harry Potter and the Prisoner of Azkaban | 141 | 5/31/04 | 3550 | 7.4 | 2004 | 1.500779e+08 | 9.117862e+08 |
78 rows × 12 columns
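The two results differ by 7 rows (85 versus 78), which are the movies with popularity above 5 released in 1996 or earlier. The sketch below shows the same pattern on a made-up `toy` frame, with each condition stored as a named boolean mask. When conditions are written inline, each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` in Python.

```python
import pandas as pd

# Toy stand-in for movie_data_nonul; the values are invented for illustration
toy = pd.DataFrame({
    'original_title': ['A', 'B', 'C', 'D'],
    'popularity': [6.1, 7.2, 4.0, 5.5],
    'release_year': [1994, 2001, 2010, 1996],
})

# Each comparison produces a boolean Series aligned with the rows
popular = toy['popularity'] > 5
recent = toy['release_year'] > 1996

both = toy[popular & recent]    # rows where both conditions hold
either = toy[popular | recent]  # rows where at least one condition holds
not_popular = toy[~popular]     # ~ negates a mask

print(both['original_title'].tolist())
```

Naming the masks keeps long filters readable and lets the same condition be reused in several selections.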
**Task 2.3:** Grouped selection
movie_data_nonul.groupby('release_year').agg({'revenue': 'mean'}).head()
| release_year | revenue |
|---|---|
| 1960 | 4.531406e+06 |
| 1961 | 1.089420e+07 |
| 1962 | 6.736870e+06 |
| 1963 | 5.511911e+06 |
| 1964 | 8.118614e+06 |
movie_data.dropna().groupby('director').agg({'popularity': 'mean'}).sort_values(by='popularity', ascending=False).head()
| director | popularity |
|---|---|
| Colin Trevorrow | 32.985763 |
| George Miller | 14.675428 |
| Joe Russo\|Anthony Russo | 12.971027 |
| Chad Stahelski\|David Leitch | 11.422751 |
| Don Hall\|Chris Williams | 8.691294 |
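The top entry, 32.985763, is exactly the popularity of Jurassic World shown earlier, which suggests that this mean is taken over a single movie. A possible refinement, sketched here on made-up data, is to report the group size next to the mean using named aggregation; the output column names `mean_popularity` and `n_movies` are hypothetical.

```python
import pandas as pd

# Toy stand-in for movie_data; the values are invented for illustration
toy = pd.DataFrame({
    'director': ['X', 'X', 'Y', 'Y', 'Z'],
    'popularity': [2.0, 4.0, 1.0, 1.0, 9.0],
})

# Named aggregation: output_column=(input_column, aggregation)
summary = (
    toy.groupby('director')
       .agg(mean_popularity=('popularity', 'mean'),
            n_movies=('popularity', 'size'))
       .sort_values('mean_popularity', ascending=False)
)

print(summary)
```

With the count visible, a ranking can be restricted to directors with some minimum number of movies before sorting.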
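| Visualization goal | Suitable charts |
|---|---|
| Show the distribution of a single attribute | Pie chart, histogram, scatter plot |
| Show how an attribute changes with another variable | Bar chart, line chart, heat map |
| Compare the relationships among multiple attributes | Scatter plot, violin plot, stacked bar chart, stacked line chart |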
**Task 3.1:** Plot the popularity values of the 20 movies with the highest popularity.
# Select the 20 rows with the highest popularity
movie_top20_popularity = movie_data.sort_values(by='popularity', ascending=False).head(20)
base_color = sns.color_palette()[9]
sns.barplot(data=movie_top20_popularity, x='popularity', y='original_title', color=base_color)
# The title is set through matplotlib; assigning to sns.title has no effect on the plot
plt.title('movie_top20_popularity')
**Task 3.2:** Analyze how movie net profit (revenue minus budget) changes over the years, and briefly discuss the result.
# Net profit per movie
movie_data['profit'] = movie_data['revenue'] - movie_data['budget']
movies_profit = movie_data[['release_year', 'profit']]
# Despite its name, profit_mean holds the total (summed) profit per release year
profit_mean = movie_data.groupby('release_year').agg({'profit': 'sum'})
plt.title('movie_profit')
base_color = sns.color_palette()[9]
plt.xlabel('release_year')
plt.ylabel('profit')
plt.errorbar(x=profit_mean.index, y=profit_mean['profit'], color=base_color)
plt.title('movie_profit')
base_color = sns.color_palette()[0]
plt.xlabel('year')
plt.ylabel('profit')
# One left-closed bin per year; the last edge is 2016 so that 2015 gets its own bin
xbin_edges = np.arange(1960, 2017, 1)
xbin_centers = (xbin_edges)[:-1]
data_xbins = pd.cut(movie_data['release_year'], xbin_edges, right = False, include_lowest = True)
y_sum = movie_data['profit'].groupby(data_xbins).sum()
# Note: sem() is the standard error of the yearly mean, not of the yearly sum plotted here
y_sems = movie_data['profit'].groupby(data_xbins).sem()
plt.errorbar(x = xbin_centers, y = y_sum, yerr = y_sems, color = base_color)
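Because `release_year` is already an integer, the per-year statistics can also be computed with a plain `groupby`, without `pd.cut`. The sketch below uses made-up numbers and adds one assumption that is not in the original: if the movies in a year are treated as independent, the standard error of the yearly sum is the count times the standard error of the mean. The column name `sum_se` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for movie_data; the values are invented for illustration
toy = pd.DataFrame({
    'release_year': [2000, 2000, 2000, 2001, 2001],
    'profit': [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Several statistics per year in one pass
yearly = toy.groupby('release_year')['profit'].agg(['sum', 'mean', 'sem', 'count'])

# Assumption: independent movies, so SE(sum) = count * SEM = sqrt(count) * std
yearly['sum_se'] = yearly['count'] * yearly['sem']

print(yearly)
```

Plotting `mean` with `sem`, or `sum` with `sum_se`, keeps the error bars consistent with the quantity on the y axis.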
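**Task 3.3:** Select the 10 most prolific directors (those with the most movies), plot the revenue of each director's top 3 movies, and briefly discuss the result.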
movie_data[['director', 'original_title', 'revenue']].head()
| | director | original_title | revenue |
|---|---|---|---|
| 0 | Colin Trevorrow | Jurassic World | 1513528810 |
| 1 | George Miller | Mad Max: Fury Road | 378436354 |
| 2 | Robert Schwentke | Insurgent | 295238201 |
| 3 | J.J. Abrams | Star Wars: The Force Awakens | 2068178225 |
| 4 | James Wan | Furious 7 | 1506249360 |
director_info = movie_data[['director', 'original_title']].\
groupby('director').count()['original_title'].nlargest(10).index
director_info
Index(['Woody Allen', 'Clint Eastwood', 'Martin Scorsese', 'Steven Spielberg',
'Ridley Scott', 'Ron Howard', 'Steven Soderbergh', 'Joel Schumacher',
'Brian De Palma', 'Barry Levinson'],
dtype='object', name='director')
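# Collect each director's top 3 movies by inflation-adjusted revenue
top3_per_director = []
for i in director_info:
    data = movie_data.loc[movie_data['director'] == i][['director', 'original_title',
        'revenue', 'revenue_adj']].nlargest(3, 'revenue_adj')
    top3_per_director.append(data)
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same table
analysis = pd.concat(top3_per_director)
analysis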
| | director | original_title | revenue | revenue_adj |
|---|---|---|---|---|
| 3429 | Woody Allen | Midnight in Paris | 151119219 | 1.464947e+08 |
| 1332 | Woody Allen | Annie Hall | 38251425 | 1.376203e+08 |
| 7835 | Woody Allen | Manhattan | 39946780 | 1.200223e+08 |
| 657 | Clint Eastwood | American Sniper | 542307423 | 4.995145e+08 |
| 2888 | Clint Eastwood | Gran Torino | 269958228 | 2.734101e+08 |
| 8092 | Clint Eastwood | The Bridges of Madison County | 182016617 | 2.604597e+08 |
| 5428 | Martin Scorsese | The Wolf of Wall Street | 392000694 | 3.669257e+08 |
| 6563 | Martin Scorsese | The Departed | 289847354 | 3.135189e+08 |
| 1927 | Martin Scorsese | Shutter Island | 294804195 | 2.948042e+08 |
| 9806 | Steven Spielberg | Jaws | 470654000 | 1.907006e+09 |
| 8889 | Steven Spielberg | E.T. the Extra-Terrestrial | 792910554 | 1.791694e+09 |
| 10223 | Steven Spielberg | Jurassic Park | 920100000 | 1.388863e+09 |
| 8661 | Ridley Scott | Gladiator | 457640427 | 5.795065e+08 |
| 7 | Ridley Scott | The Martian | 595380321 | 5.477497e+08 |
| 2778 | Ridley Scott | Hannibal | 351692268 | 4.331048e+08 |
| 6558 | Ron Howard | The Da Vinci Code | 758239851 | 8.201647e+08 |
| 8076 | Ron Howard | Apollo 13 | 355237933 | 5.083337e+08 |
| 8663 | Ron Howard | How the Grinch Stole Christmas | 345141403 | 4.370498e+08 |
| 2641 | Steven Soderbergh | Ocean's Eleven | 450717150 | 5.550528e+08 |
| 6980 | Steven Soderbergh | Ocean's Twelve | 362744280 | 4.187685e+08 |
| 7425 | Steven Soderbergh | Ocean's Thirteen | 311312624 | 3.273977e+08 |
| 8082 | Joel Schumacher | Batman Forever | 336529144 | 4.815621e+08 |
| 5236 | Joel Schumacher | Batman & Robin | 238207122 | 3.235949e+08 |
| 8471 | Joel Schumacher | A Time to Kill | 152266007 | 2.116828e+08 |
| 8458 | Brian De Palma | Mission: Impossible | 457696359 | 6.362971e+08 |
| 9610 | Brian De Palma | The Untouchables | 76270454 | 1.463691e+08 |
| 7988 | Brian De Palma | Scarface | 65884703 | 1.442422e+08 |
| 9454 | Barry Levinson | Rain Man | 354825435 | 6.542594e+08 |
| 4209 | Barry Levinson | Disclosure | 214015089 | 3.148401e+08 |
| 9611 | Barry Levinson | Good Morning, Vietnam | 123922370 | 2.378170e+08 |
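The same "top 3 per director" table can be built without an explicit loop. This is an alternative sketch on made-up data, not the approach used above: sort once, then keep the first three rows of each group.

```python
import pandas as pd

# Toy stand-in for movie_data; the values are invented for illustration
toy = pd.DataFrame({
    'director': ['X', 'X', 'X', 'X', 'Y', 'Y'],
    'original_title': ['x1', 'x2', 'x3', 'x4', 'y1', 'y2'],
    'revenue_adj': [5.0, 9.0, 1.0, 7.0, 3.0, 8.0],
})

top3 = (
    toy.sort_values('revenue_adj', ascending=False)  # highest revenue first
       .groupby('director', sort=False)
       .head(3)                                      # first 3 rows of each group
       .sort_values(['director', 'revenue_adj'], ascending=[True, False])
)

print(top3)
```

For the real data, the frame would first be restricted to the ten directors in `director_info`, for example with `isin`.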
fig = plt.figure(figsize=(15, 6))
ax = sns.barplot(data=analysis, x='original_title', y='revenue_adj', hue='director', dodge=False, palette="Set2")
plt.xticks(rotation = 90)
plt.ylabel('Revenue')
plt.xlabel('Original Title')
Text(0.5,0,'Original Title')
**[Optional] Task 3.4:** Analyze how the number of movies released in June changed from 1968 to 2015.
# Count the movies released in each year
movie_data_summary = movie_data[['id', 'release_year']].groupby('release_year')['id'].agg(len)
fig = plt.figure(figsize=(15, 6))
movie_data_summary[:5]
plt.errorbar(x=movie_data_summary.index, y=movie_data_summary.values)
movies_public_number = movie_data[['id', 'release_year']]
# Number of movies per release year
public_number = movie_data.groupby('release_year').agg({'id': 'count'})
plt.title('movie_number')
base_color = sns.color_palette()[2]
plt.xlabel('release_year')
plt.ylabel('number')
plt.errorbar(x=public_number.index, y=public_number['id'], color=base_color)
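Counting rows per year can also be done directly on the `release_year` column. A minimal sketch with made-up years:

```python
import pandas as pd

# Toy stand-in for movie_data['release_year']; the values are invented
years = pd.Series([2000, 2001, 2000, 2002, 2000], name='release_year')

# value_counts orders by frequency, so sort_index restores chronological order
per_year = years.value_counts().sort_index()

print(per_year)
```

The result is a Series indexed by year, which can be passed to a line plot as is.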
**Task 3.5:** Analyze how the numbers of Comedy and Drama movies released in June changed from 1968 to 2015.
movie_year = np.arange(1968, 2016)
movie_df = movie_data[movie_data['release_year'].isin(movie_year)][['release_year', 'genres']]
movie_df.dropna(inplace=True)
Comedy_trendy = movie_df[movie_df['genres'].str.contains(r'Comedy')]
Comedy_sum = Comedy_trendy.groupby('release_year').agg(len)
Comedy_sum.head(2)
| release_year | genres |
|---|---|
| 1968 | 9 |
| 1969 | 12 |
Drama_trendy = movie_df[movie_df['genres'].str.contains(r'Drama')]
Dram_sum = Drama_trendy.groupby('release_year').agg(len)
Dram_sum.rename(columns={'genres': 'D_genres'}, inplace=True)
Dram_sum.head(2)
| release_year | D_genres |
|---|---|
| 1968 | 20 |
| 1969 | 13 |
trendy_cd= Comedy_sum.join(Dram_sum)
trendy_cd.head(2)
| release_year | genres | D_genres |
|---|---|---|
| 1968 | 9 | 20 |
| 1969 | 12 | 13 |
fig = plt.figure(figsize=(15, 6))
plt.xticks(rotation = 90)
base_color = sns.color_palette()[1]
bin_edges = np.arange(trendy_cd.index.min(), trendy_cd.index.max()+1)
plt.xlabel('release_year')
plt.ylabel('number')
# Label each line so the two genres can be told apart
plt.errorbar(x=trendy_cd.index, y=trendy_cd['genres'], color=base_color, label='Comedy')
plt.errorbar(x=trendy_cd.index, y=trendy_cd['D_genres'], label='Drama')
plt.legend()
# Split the genres column so that each row holds a single genre
df = movie_data
df_genres = df.drop('genres', axis=1).join(df['genres'].str.split('|', expand=True) \
.stack().reset_index(level=1, drop=True).rename('genres'))
fig, axes = plt.subplots(2, 1, figsize=(20, 10))
# For readability, show only some of the genres
top5_genres = df_genres['genres'].value_counts().nlargest(5).index
btm5_genres = df_genres['genres'].value_counts().nsmallest(5).index
# Draw a reference line: the mean number of movies per genre in each year
mean_cir = df_genres.groupby('release_year')['genres'].value_counts().unstack().mean(axis=1)
mean_cir.plot(ax=axes[0], ls='--', label='mean', legend=True)
mean_cir.plot(ax=axes[1], ls='--', label='mean', legend=True)
# Plot by year
vis_params = {'grid': True, 'marker': 'o', 'markersize': 2, 'linewidth': 1}
df_genres[df_genres['genres'].isin(top5_genres)] \
.groupby('release_year')['genres'].value_counts().unstack().fillna(0) \
.plot(ax=axes[0], title='circulation over years of top 5 genres', **vis_params)
df_genres[df_genres['genres'].isin(btm5_genres)] \
.groupby('release_year')['genres'].value_counts().unstack().fillna(0) \
.plot(ax=axes[1], title='circulation over years of bottom 5 genres', **vis_params)
plt.tight_layout()
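The split, stack, and join chain used above to get one row per genre can also be written with `explode`. This is an alternative sketch on made-up data, not the method used in the cell above:

```python
import pandas as pd

# Toy stand-in for movie_data; the values are invented for illustration
toy = pd.DataFrame({
    'original_title': ['A', 'B', 'C'],
    'release_year': [2000, 2000, 2001],
    'genres': ['Comedy|Drama', 'Drama', None],
})

# Turn 'Comedy|Drama' into a list, then give each list element its own row
exploded = toy.assign(genres=toy['genres'].str.split('|')).explode('genres')

# Movies without genre information end up as missing values; drop them
exploded = exploded.dropna(subset=['genres'])

# Movies per year and genre, one column per genre
counts = exploded.groupby(['release_year', 'genres']).size().unstack(fill_value=0)

print(counts)
```

A movie with several genres is counted once for each of them, the same as in the stacked version.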
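from wordcloud import WordCloud  # third-party package, not imported earlier in this notebook
data = df
# Assumption: `factor` is not defined in the original; 'revenue' is inferred from the name df_kw_rev
factor = 'revenue'
kw_expand = data['keywords'].str.split('|', expand=True).stack().reset_index(level=1, drop=True).rename('keywords')
df_kw_rev = data[[factor]].join(kw_expand)
word_dict = df_kw_rev.groupby('keywords')[factor].sum().to_dict()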
# create wordcloud
params = {'mode': 'RGBA',
'background_color': 'rgba(255, 255, 255, 0)',
'colormap': 'Spectral'}
wordcloud = WordCloud(width=1200, height=800, **params)
wordcloud.generate_from_frequencies(word_dict)
# plot
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
(-0.5, 1199.5, 799.5, -0.5)
df_genres = movie_data.drop('genres', axis=1).join(movie_data['genres'].str.split('|', expand=True).stack().reset_index(level=1, drop=True).rename('genres'))
# Filter 1: year
sel_year = df_genres['release_year'].between(1968, 2015)
# Filter 2: June
sel_June = pd.to_datetime(df_genres['release_date']).dt.month == 6
# Filter 3: genre
sel_genre = df_genres['genres'].isin(['Drama', 'Comedy'])
# Filter the data and plot (see the logical indexing section)
plt.figure(figsize=[18, 5])
sns.countplot(data=df_genres[sel_year&sel_June&sel_genre], x='release_year', hue='genres')
plt.xticks(rotation=90)
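The release dates in this dataset look like `6/9/15`, with a two-digit year. The sketch below passes an explicit `format` to `pd.to_datetime` and uses only the parsed month. Taking the year from the `release_year` column instead of the parsed date is a design choice suggested here, since two-digit years are ambiguous across centuries. The data is made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_genres; the values are invented for illustration
toy = pd.DataFrame({
    'release_date': ['6/9/15', '5/13/15', '6/22/08', '12/25/12'],
    'release_year': [2015, 2015, 2008, 2012],
    'genres': ['Comedy', 'Drama', 'Drama', 'Comedy'],
})

# Month/day/two-digit-year, matching strings such as '6/9/15'
parsed = pd.to_datetime(toy['release_date'], format='%m/%d/%y')
is_june = parsed.dt.month == 6

# Count June releases per year and genre, using release_year for the year
june_counts = toy[is_june].groupby(['release_year', 'genres']).size()

print(june_counts)
```

The resulting counts can be unstacked into one column per genre and drawn as lines, or the filtered frame can be passed to `sns.countplot` as above.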