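Exploratory Analysis of Movie Data

Exploring the Movie Dataset

In this project, we use functions from the NumPy, Pandas, matplotlib, and seaborn libraries to explore a movie dataset.

Download the dataset:
TMDb movie data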

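Meaning of each column in the dataset:

| Column | Meaning |
| --- | --- |
| id | ID |
| imdb_id | IMDB ID |
| popularity | Popularity |
| budget | Budget |
| revenue | Box-office revenue |
| original_title | Title |
| cast | Lead cast |
| homepage | Website |
| director | Director |
| tagline | Tagline |
| keywords | Keywords |
| overview | Synopsis |
| runtime | Runtime |
| genres | Genres |
| production_companies | Production companies |
| release_date | Release date |
| vote_count | Vote count |
| vote_average | Average vote |
| release_year | Release year |
| budget_adj | Budget (adjusted) |
| revenue_adj | Revenue (adjusted) |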


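Section 1: Importing and Processing the Data

**Task 1.1:** Import the libraries and the data

  1. Load the required libraries: NumPy, Pandas, matplotlib, and seaborn.
  2. Use Pandas to read the data in tmdb-movies.csv and save it as movie_data.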
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Render figures inline in the notebook.
%matplotlib inline

# Adjust this path to wherever tmdb-movies.csv is stored on your machine.
movie_data = pd.read_csv('C:/Users/Administrator/Documents/Explore_Movie_Dataset/Explore Movie Dataset/tmdb-movies.csv')

**Task 1.2:** Understand the data

movie_data.head(2)
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08

2 rows × 21 columns

movie_data.dtypes
id                        int64
imdb_id                  object
popularity              float64
budget                    int64
revenue                   int64
original_title           object
cast                     object
homepage                 object
director                 object
tagline                  object
keywords                 object
overview                 object
runtime                   int64
genres                   object
production_companies     object
release_date             object
vote_count                int64
vote_average            float64
release_year              int64
budget_adj              float64
revenue_adj             float64
dtype: object
movie_data.isnull().sum()
id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64
movie_data.shape
(10866, 21)
movie_data.describe()
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10866.000000 10866.000000 1.086600e+04 1.086600e+04 10866.000000 10866.000000 10866.000000 10866.000000 1.086600e+04 1.086600e+04
mean 66064.177434 0.646441 1.462570e+07 3.982332e+07 102.070863 217.389748 5.974922 2001.322658 1.755104e+07 5.136436e+07
std 92130.136561 1.000185 3.091321e+07 1.170035e+08 31.381405 575.619058 0.935142 12.812941 3.430616e+07 1.446325e+08
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 10596.250000 0.207583 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 20669.000000 0.383856 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 75610.000000 0.713817 1.500000e+07 2.400000e+07 111.000000 145.750000 6.600000 2011.000000 2.085325e+07 3.369710e+07
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09
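The describe() output above reports a minimum and a median of 0 for budget and revenue, which suggests that many zeros stand for "not recorded" rather than a real amount. The following is a minimal sketch of how one might measure and flag such values; the toy frame and the decision to treat zero as missing are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for movie_data; the values are made up for illustration.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [0, 150_000_000, 0, 4_000_000],
    "revenue": [0, 378_436_354, 9_064_511, 0],
})

money_cols = ["budget", "revenue"]

# Share of rows in which each column is exactly zero.
zero_share = (toy[money_cols] == 0).mean()

# Assumption: a zero means "not recorded", so mark it as missing.
toy_clean = toy.copy()
toy_clean[money_cols] = toy_clean[money_cols].replace(0, np.nan)

print(zero_share)
print(toy_clean.isnull().sum())
```

Whether to flag zeros this way depends on the question being asked; totals are unaffected, but means and medians of budget and revenue change noticeably.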

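**Task 1.3:** Clean the data

In real-world work, data processing is often the most time-consuming and laborious step. Fortunately, the TMDb dataset provided here is very "clean" and does not need much cleaning or preprocessing. In this step, your main job is to handle the null values in the table. You can use .fillna() to fill them in, or .dropna() to discard the rows or columns that contain them.

Task: Use an appropriate method to clean up the null values and save the result.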

# Drop every column that contains at least one null value (21 columns become 12).
movie_data_nonul = movie_data.dropna(axis=1).copy()
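To make the trade-off between .dropna() and .fillna() concrete, here is a small sketch on a toy frame. The column names mirror the dataset, but the rows and the "unknown" placeholder are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for movie_data with a few gaps.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "tagline": ["The park is open.", np.nan, np.nan],
    "genres": ["Action", np.nan, "Drama"],
    "runtime": [124, 120, 92],
})

# Option 1: drop every column that contains at least one null.
no_null_cols = toy.dropna(axis=1)

# Option 2: drop only the rows missing a column the analysis needs.
has_genre = toy.dropna(subset=["genres"])

# Option 3: keep all rows and fill text gaps with a placeholder.
filled = toy.fillna({"tagline": "unknown", "genres": "unknown"})

print(no_null_cols.columns.tolist())
print(len(has_genre), filled.isnull().sum().sum())
```

Dropping columns keeps every film but loses fields such as director and genres; dropping rows or filling values keeps those fields at the cost of fewer films or placeholder text.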


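Section 2: Reading Data According to Specified Requirements

Compared with data analysis software such as Excel, one of Pandas's strengths is that it can easily select the right data based on complex logic. Knowing how to pull the appropriate data out of a table to meet a given requirement is therefore a very important Pandas skill, and it is the focus of this section.


**Task 2.1:** Simple reads

  1. Read the columns named id, popularity, budget, runtime, and vote_average.
  2. Read rows 1~20 as well as rows 48 and 49.
  3. Read the popularity column for rows 50~60.

Requirement: each statement must be a single line of code.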

movie_data_nonul.id.head()
movie_data_nonul['popularity']
movie_data_nonul['vote_average'].head()
0    6.5
1    7.1
2    6.3
3    7.5
4    7.3
Name: vote_average, dtype: float64
movie_data_nonul[:20]
movie_data_nonul[48:50]
id popularity budget revenue original_title runtime release_date vote_count vote_average release_year budget_adj revenue_adj
48 265208 2.932340 30000000 0 Wild Card 92 1/14/15 481 5.3 2015 2.759999e+07 0.000000e+00
49 254320 2.885126 4000000 9064511 The Lobster 118 10/8/15 638 6.6 2015 3.679998e+06 8.339346e+06
movie_data_nonul[50:61]['popularity']
50    2.883233
51    2.814802
52    2.798017
53    2.793297
54    2.614499
55    2.584264
56    2.578919
57    2.575711
58    2.557859
59    2.550747
60    2.487849
Name: popularity, dtype: float64
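Each of the three reads can also be written as one statement. The sketch below uses a toy frame with a default integer index and follows the same 0-based reading of the row numbers as the slices above; np.r_ is a NumPy helper that concatenates several slices into one index array.

```python
import numpy as np
import pandas as pd

# Toy frame with 70 rows so the positional slices below are valid.
toy = pd.DataFrame({
    "id": np.arange(70),
    "popularity": np.linspace(7.0, 0.1, 70),
    "budget": 0,
    "runtime": 100,
    "vote_average": 6.0,
})

# 1. Several columns at once: pass a list of column names.
cols = toy[["id", "popularity", "budget", "runtime", "vote_average"]]

# 2. Rows 0-19 plus rows 48 and 49 in a single statement (0-based positions).
rows = toy.iloc[np.r_[0:20, 48:50]]

# 3. Rows 50-60 of one column; .loc slices include the end label.
pop = toy.loc[50:60, "popularity"]

print(cols.shape, len(rows), len(pop))
```

Note that .loc[50:60] includes label 60, whereas the positional slice [50:61] needs the end point written one past the last row.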

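**Task 2.2:** Logical reads (Logical Indexing)

  1. Read all rows whose popularity is greater than 5.
  2. Read all rows whose popularity is greater than 5 and whose release year is after 1996.

Hint: Pandas uses logical operators such as & and |, which mean "and" and "or" respectively.

Requirement: use Logical Indexing.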
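As a quick illustration of how the two operators combine boolean masks, here is a sketch on four rows taken from the dataset. The variable names popular and recent are introduced only for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "original_title": ["Jurassic World", "Pulp Fiction", "Wild Card", "Back to the Future"],
    "popularity": [32.985763, 8.093754, 2.932340, 6.095293],
    "release_year": [2015, 1994, 2015, 1985],
})

# Each comparison yields a boolean Series (a mask) aligned with the rows.
popular = toy["popularity"] > 5
recent = toy["release_year"] > 1996

# & keeps rows where both masks are True; | keeps rows where either is True.
both = toy[popular & recent]
either = toy[popular | recent]

# Inline form: the parentheses are required because & binds
# more tightly than > in Python.
both_inline = toy[(toy["popularity"] > 5) & (toy["release_year"] > 1996)]

print(both["original_title"].tolist())
print(len(either))
```

Python's keywords `and` and `or` do not work element-wise on a Series, which is why the bitwise operators are used instead.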

movie_data_nonul[movie_data_nonul['popularity'] > 5]
id popularity budget revenue original_title runtime release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 32.985763 150000000 1513528810 Jurassic World 124 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 28.419936 150000000 378436354 Mad Max: Fury Road 120 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 13.112507 110000000 295238201 Insurgent 119 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 11.173104 200000000 2068178225 Star Wars: The Force Awakens 136 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 9.335014 190000000 1506249360 Furious 7 137 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09
5 281957 9.110700 135000000 532950503 The Revenant 156 12/25/15 3929 7.2 2015 1.241999e+08 4.903142e+08
6 87101 8.654359 155000000 440603537 Terminator Genisys 125 6/23/15 2598 5.8 2015 1.425999e+08 4.053551e+08
7 286217 7.667400 108000000 595380321 The Martian 141 9/30/15 4572 7.6 2015 9.935996e+07 5.477497e+08
8 211672 7.404165 74000000 1156730962 Minions 91 6/17/15 2893 6.5 2015 6.807997e+07 1.064192e+09
9 150540 6.326804 175000000 853708609 Inside Out 94 6/9/15 3935 8.0 2015 1.609999e+08 7.854116e+08
10 206647 6.200282 245000000 880674609 Spectre 148 10/26/15 3254 6.2 2015 2.253999e+08 8.102203e+08
11 76757 6.189369 176000003 183987723 Jupiter Ascending 124 2/4/15 1937 5.2 2015 1.619199e+08 1.692686e+08
12 264660 6.118847 15000000 36869414 Ex Machina 108 1/21/15 2854 7.6 2015 1.379999e+07 3.391985e+07
13 257344 5.984995 88000000 243637091 Pixels 105 7/16/15 1575 5.8 2015 8.095996e+07 2.241460e+08
14 99861 5.944927 280000000 1405035767 Avengers: Age of Ultron 141 4/22/15 4304 7.4 2015 2.575999e+08 1.292632e+09
15 273248 5.898400 44000000 155760117 The Hateful Eight 167 12/25/15 2389 7.4 2015 4.047998e+07 1.432992e+08
16 260346 5.749758 48000000 325771424 Taken 3 109 1/1/15 1578 6.1 2015 4.415998e+07 2.997096e+08
17 102899 5.573184 130000000 518602163 Ant-Man 115 7/14/15 3779 7.0 2015 1.195999e+08 4.771138e+08
18 150689 5.556818 95000000 542351353 Cinderella 112 3/12/15 1495 6.8 2015 8.739996e+07 4.989630e+08
19 131634 5.476958 160000000 650523427 The Hunger Games: Mockingjay - Part 2 136 11/18/15 2380 6.5 2015 1.471999e+08 5.984813e+08
20 158852 5.462138 190000000 209035668 Tomorrowland 130 5/19/15 1899 6.2 2015 1.747999e+08 1.923127e+08
21 307081 5.337064 30000000 91709827 Southpaw 123 6/15/15 1386 7.3 2015 2.759999e+07 8.437300e+07
629 157336 24.949134 165000000 621752480 Interstellar 169 11/5/14 6498 8.0 2014 1.519800e+08 5.726906e+08
630 118340 14.311205 170000000 773312399 Guardians of the Galaxy 121 7/30/14 5612 7.9 2014 1.565855e+08 7.122911e+08
631 100402 12.971027 170000000 714766572 Captain America: The Winter Soldier 136 3/20/14 3848 7.6 2014 1.565855e+08 6.583651e+08
632 245891 11.422751 20000000 78739897 John Wick 101 10/22/14 2712 7.0 2014 1.842182e+07 7.252661e+07
633 131631 10.739009 125000000 752100229 The Hunger Games: Mockingjay - Part 1 123 11/18/14 3590 6.6 2014 1.151364e+08 6.927528e+08
634 122917 10.174599 250000000 955119788 The Hobbit: The Battle of the Five Armies 144 12/10/14 3110 7.1 2014 2.302728e+08 8.797523e+08
635 177572 8.691294 165000000 652105443 Big Hero 6 102 10/24/14 4185 7.8 2014 1.519800e+08 6.006485e+08
636 205596 8.110711 14000000 233555708 The Imitation Game 113 11/14/14 3478 8.0 2014 1.289527e+07 2.151261e+08
... ... ... ... ... ... ... ... ... ... ... ... ...
2876 10681 5.678119 180000000 521311860 WALL·E 98 6/22/08 4209 7.6 2008 1.823016e+08 5.279777e+08
3371 161337 8.411577 0 0 Underworld: Endless War 18 10/19/11 21 5.9 2011 0.000000e+00 0.000000e+00
3372 1771 7.959228 140000000 370569774 Captain America: The First Avenger 124 7/22/11 5025 6.5 2011 1.357157e+08 3.592296e+08
3373 64690 5.903353 15000000 76175166 Drive 100 1/10/11 2347 7.3 2011 1.454097e+07 7.384406e+07
3374 12445 5.711315 125000000 1327817822 Harry Potter and the Deathly Hallows: Part 2 130 7/7/11 3750 7.7 2011 1.211748e+08 1.287184e+09
3911 121 8.095275 79000000 926287400 The Lord of the Rings: The Two Towers 179 12/18/02 5114 7.8 2002 9.576865e+07 1.122902e+09
3912 672 6.012584 100000000 876688482 Harry Potter and the Chamber of Secrets 161 11/13/02 3458 7.2 2002 1.212261e+08 1.062776e+09
4177 680 8.093754 8000000 213928762 Pulp Fiction 154 10/14/94 5343 8.1 1994 1.176889e+07 3.147131e+08
4178 278 7.192039 25000000 28341469 The Shawshank Redemption 142 9/10/94 5754 8.4 1994 3.677779e+07 4.169346e+07
4179 13 6.715966 55000000 677945399 Forrest Gump 142 7/6/94 4856 8.1 1994 8.091114e+07 9.973333e+08
4361 24428 7.637767 220000000 1519557910 The Avengers 143 4/25/12 8903 7.3 2012 2.089437e+08 1.443191e+09
4362 52520 7.031452 70000000 132400000 Underworld: Awakening 88 1/19/12 1426 6.0 2012 6.648210e+07 1.257461e+08
4363 49026 6.591277 250000000 1081041287 The Dark Knight Rises 165 7/16/12 6723 7.5 2012 2.374361e+08 1.026713e+09
4364 68718 5.944518 100000000 425368238 Django Unchained 165 12/25/12 7375 7.7 2012 9.497443e+07 4.039911e+08
4365 37724 5.603587 200000000 1108561013 Skyfall 143 10/25/12 6137 6.8 2012 1.899489e+08 1.052849e+09
4949 122 7.122455 94000000 1118888979 The Lord of the Rings: The Return of the King 201 12/1/03 5636 7.9 2003 1.114231e+08 1.326278e+09
4950 277 6.887883 22000000 95708457 Underworld 121 9/19/03 1708 6.5 2003 2.607776e+07 1.134483e+08
4951 22 6.864067 140000000 655011224 Pirates of the Caribbean: The Curse of the Bla... 143 7/9/03 4223 7.3 2003 1.659494e+08 7.764193e+08
4952 24 6.174132 30000000 180949000 Kill Bill: Vol. 1 111 10/10/03 2932 7.6 2003 3.556058e+07 2.144884e+08
5230 13590 6.668990 0 0 Eddie Izzard: Glorious 99 11/17/97 11 5.5 1997 0.000000e+00 0.000000e+00
5422 109445 6.112766 150000000 1274219009 Frozen 102 11/27/13 3369 7.5 2013 1.404050e+08 1.192711e+09
5423 49047 5.242753 105000000 716392705 Gravity 91 9/27/13 3775 7.4 2013 9.828350e+07 6.705675e+08
5424 76338 5.111900 170000000 479765000 Thor: The Dark World 112 10/29/13 3025 6.8 2013 1.591257e+08 4.490760e+08
6081 105 6.095293 19000000 381109762 Back to the Future 116 7/3/85 3785 7.8 1985 3.851615e+07 7.725728e+08
6190 674 5.939927 150000000 895921036 Harry Potter and the Goblet of Fire 157 11/5/05 3406 7.3 2005 1.674845e+08 1.000353e+09
6191 272 5.400826 150000000 374218673 Batman Begins 140 6/14/05 4914 7.3 2005 1.674845e+08 4.178388e+08
6554 834 5.838503 50000000 111340801 Underworld: Evolution 106 1/12/06 1015 6.3 2006 5.408346e+07 1.204339e+08
6962 673 5.827781 130000000 789804554 Harry Potter and the Prisoner of Azkaban 141 5/31/04 3550 7.4 2004 1.500779e+08 9.117862e+08
7269 238 5.738034 6000000 245066411 The Godfather 175 3/15/72 3970 8.3 1972 3.128737e+07 1.277914e+09
7309 1891 5.488441 18000000 538400000 The Empire Strikes Back 124 1/1/80 3954 8.0 1980 4.762866e+07 1.424626e+09

85 rows × 12 columns

movie_data_nonul[(movie_data_nonul['popularity'] > 5) & (movie_data_nonul['release_year'] > 1996)]
id popularity budget revenue original_title runtime release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 32.985763 150000000 1513528810 Jurassic World 124 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 28.419936 150000000 378436354 Mad Max: Fury Road 120 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 13.112507 110000000 295238201 Insurgent 119 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 11.173104 200000000 2068178225 Star Wars: The Force Awakens 136 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 9.335014 190000000 1506249360 Furious 7 137 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09
5 281957 9.110700 135000000 532950503 The Revenant 156 12/25/15 3929 7.2 2015 1.241999e+08 4.903142e+08
6 87101 8.654359 155000000 440603537 Terminator Genisys 125 6/23/15 2598 5.8 2015 1.425999e+08 4.053551e+08
7 286217 7.667400 108000000 595380321 The Martian 141 9/30/15 4572 7.6 2015 9.935996e+07 5.477497e+08
8 211672 7.404165 74000000 1156730962 Minions 91 6/17/15 2893 6.5 2015 6.807997e+07 1.064192e+09
9 150540 6.326804 175000000 853708609 Inside Out 94 6/9/15 3935 8.0 2015 1.609999e+08 7.854116e+08
10 206647 6.200282 245000000 880674609 Spectre 148 10/26/15 3254 6.2 2015 2.253999e+08 8.102203e+08
11 76757 6.189369 176000003 183987723 Jupiter Ascending 124 2/4/15 1937 5.2 2015 1.619199e+08 1.692686e+08
12 264660 6.118847 15000000 36869414 Ex Machina 108 1/21/15 2854 7.6 2015 1.379999e+07 3.391985e+07
13 257344 5.984995 88000000 243637091 Pixels 105 7/16/15 1575 5.8 2015 8.095996e+07 2.241460e+08
14 99861 5.944927 280000000 1405035767 Avengers: Age of Ultron 141 4/22/15 4304 7.4 2015 2.575999e+08 1.292632e+09
15 273248 5.898400 44000000 155760117 The Hateful Eight 167 12/25/15 2389 7.4 2015 4.047998e+07 1.432992e+08
16 260346 5.749758 48000000 325771424 Taken 3 109 1/1/15 1578 6.1 2015 4.415998e+07 2.997096e+08
17 102899 5.573184 130000000 518602163 Ant-Man 115 7/14/15 3779 7.0 2015 1.195999e+08 4.771138e+08
18 150689 5.556818 95000000 542351353 Cinderella 112 3/12/15 1495 6.8 2015 8.739996e+07 4.989630e+08
19 131634 5.476958 160000000 650523427 The Hunger Games: Mockingjay - Part 2 136 11/18/15 2380 6.5 2015 1.471999e+08 5.984813e+08
20 158852 5.462138 190000000 209035668 Tomorrowland 130 5/19/15 1899 6.2 2015 1.747999e+08 1.923127e+08
21 307081 5.337064 30000000 91709827 Southpaw 123 6/15/15 1386 7.3 2015 2.759999e+07 8.437300e+07
629 157336 24.949134 165000000 621752480 Interstellar 169 11/5/14 6498 8.0 2014 1.519800e+08 5.726906e+08
630 118340 14.311205 170000000 773312399 Guardians of the Galaxy 121 7/30/14 5612 7.9 2014 1.565855e+08 7.122911e+08
631 100402 12.971027 170000000 714766572 Captain America: The Winter Soldier 136 3/20/14 3848 7.6 2014 1.565855e+08 6.583651e+08
632 245891 11.422751 20000000 78739897 John Wick 101 10/22/14 2712 7.0 2014 1.842182e+07 7.252661e+07
633 131631 10.739009 125000000 752100229 The Hunger Games: Mockingjay - Part 1 123 11/18/14 3590 6.6 2014 1.151364e+08 6.927528e+08
634 122917 10.174599 250000000 955119788 The Hobbit: The Battle of the Five Armies 144 12/10/14 3110 7.1 2014 2.302728e+08 8.797523e+08
635 177572 8.691294 165000000 652105443 Big Hero 6 102 10/24/14 4185 7.8 2014 1.519800e+08 6.006485e+08
636 205596 8.110711 14000000 233555708 The Imitation Game 113 11/14/14 3478 8.0 2014 1.289527e+07 2.151261e+08
... ... ... ... ... ... ... ... ... ... ... ... ...
1922 44214 5.293180 13000000 327803731 Black Swan 108 12/2/10 2597 7.1 2010 1.300000e+07 3.278037e+08
2409 550 8.947905 63000000 100853753 Fight Club 139 10/14/99 5923 8.1 1999 8.247033e+07 1.320229e+08
2410 603 7.753899 63000000 463517383 The Matrix 136 3/30/99 6351 7.8 1999 8.247033e+07 6.067687e+08
2633 120 8.575419 93000000 871368364 The Lord of the Rings: The Fellowship of the Ring 178 12/18/01 6079 7.8 2001 1.145284e+08 1.073080e+09
2634 671 8.021423 125000000 976475550 Harry Potter and the Philosopher's Stone 152 11/16/01 4265 7.2 2001 1.539360e+08 1.202518e+09
2875 155 8.466668 185000000 1001921825 The Dark Knight 152 7/16/08 8432 8.1 2008 1.873655e+08 1.014733e+09
2876 10681 5.678119 180000000 521311860 WALL·E 98 6/22/08 4209 7.6 2008 1.823016e+08 5.279777e+08
3371 161337 8.411577 0 0 Underworld: Endless War 18 10/19/11 21 5.9 2011 0.000000e+00 0.000000e+00
3372 1771 7.959228 140000000 370569774 Captain America: The First Avenger 124 7/22/11 5025 6.5 2011 1.357157e+08 3.592296e+08
3373 64690 5.903353 15000000 76175166 Drive 100 1/10/11 2347 7.3 2011 1.454097e+07 7.384406e+07
3374 12445 5.711315 125000000 1327817822 Harry Potter and the Deathly Hallows: Part 2 130 7/7/11 3750 7.7 2011 1.211748e+08 1.287184e+09
3911 121 8.095275 79000000 926287400 The Lord of the Rings: The Two Towers 179 12/18/02 5114 7.8 2002 9.576865e+07 1.122902e+09
3912 672 6.012584 100000000 876688482 Harry Potter and the Chamber of Secrets 161 11/13/02 3458 7.2 2002 1.212261e+08 1.062776e+09
4361 24428 7.637767 220000000 1519557910 The Avengers 143 4/25/12 8903 7.3 2012 2.089437e+08 1.443191e+09
4362 52520 7.031452 70000000 132400000 Underworld: Awakening 88 1/19/12 1426 6.0 2012 6.648210e+07 1.257461e+08
4363 49026 6.591277 250000000 1081041287 The Dark Knight Rises 165 7/16/12 6723 7.5 2012 2.374361e+08 1.026713e+09
4364 68718 5.944518 100000000 425368238 Django Unchained 165 12/25/12 7375 7.7 2012 9.497443e+07 4.039911e+08
4365 37724 5.603587 200000000 1108561013 Skyfall 143 10/25/12 6137 6.8 2012 1.899489e+08 1.052849e+09
4949 122 7.122455 94000000 1118888979 The Lord of the Rings: The Return of the King 201 12/1/03 5636 7.9 2003 1.114231e+08 1.326278e+09
4950 277 6.887883 22000000 95708457 Underworld 121 9/19/03 1708 6.5 2003 2.607776e+07 1.134483e+08
4951 22 6.864067 140000000 655011224 Pirates of the Caribbean: The Curse of the Bla... 143 7/9/03 4223 7.3 2003 1.659494e+08 7.764193e+08
4952 24 6.174132 30000000 180949000 Kill Bill: Vol. 1 111 10/10/03 2932 7.6 2003 3.556058e+07 2.144884e+08
5230 13590 6.668990 0 0 Eddie Izzard: Glorious 99 11/17/97 11 5.5 1997 0.000000e+00 0.000000e+00
5422 109445 6.112766 150000000 1274219009 Frozen 102 11/27/13 3369 7.5 2013 1.404050e+08 1.192711e+09
5423 49047 5.242753 105000000 716392705 Gravity 91 9/27/13 3775 7.4 2013 9.828350e+07 6.705675e+08
5424 76338 5.111900 170000000 479765000 Thor: The Dark World 112 10/29/13 3025 6.8 2013 1.591257e+08 4.490760e+08
6190 674 5.939927 150000000 895921036 Harry Potter and the Goblet of Fire 157 11/5/05 3406 7.3 2005 1.674845e+08 1.000353e+09
6191 272 5.400826 150000000 374218673 Batman Begins 140 6/14/05 4914 7.3 2005 1.674845e+08 4.178388e+08
6554 834 5.838503 50000000 111340801 Underworld: Evolution 106 1/12/06 1015 6.3 2006 5.408346e+07 1.204339e+08
6962 673 5.827781 130000000 789804554 Harry Potter and the Prisoner of Azkaban 141 5/31/04 3550 7.4 2004 1.500779e+08 9.117862e+08

78 rows × 12 columns


**Task 2.3:** Grouped reads

movie_data_nonul.groupby('release_year').agg({'revenue': 'mean'}).head()
revenue
release_year
1960 4.531406e+06
1961 1.089420e+07
1962 6.736870e+06
1963 5.511911e+06
1964 8.118614e+06
movie_data.dropna().groupby('director').agg({'popularity': 'mean'}).sort_values(by='popularity', ascending=False).head()
popularity
director
Colin Trevorrow 32.985763
George Miller 14.675428
Joe Russo|Anthony Russo 12.971027
Chad Stahelski|David Leitch 11.422751
Don Hall|Chris Williams 8.691294
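The two grouped reads above follow the same split, aggregate, and sort pattern. A minimal sketch on made-up rows shows what each step returns; the numbers are illustrative only.

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "director": ["George Miller", "George Miller", "Colin Trevorrow"],
    "release_year": [2015, 1985, 2015],
    "popularity": [28.0, 2.0, 33.0],
    "revenue": [300, 100, 1500],
})

# Mean revenue per year; calling .mean() on the grouped column
# avoids passing np.mean to agg.
revenue_by_year = toy.groupby("release_year")["revenue"].mean()

# Mean popularity per director, highest first.
popularity_by_director = (
    toy.groupby("director")["popularity"]
    .mean()
    .sort_values(ascending=False)
)

print(revenue_by_year)
print(popularity_by_director)
```

Selecting the column before aggregating returns a Series indexed by the group key, while agg with a dict returns a one-column DataFrame as in the notebook output.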


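Section 3: Plotting and Visualization

| Visualization goal | Suitable charts |
| --- | --- |
| Show the distribution of a single attribute | Pie chart, histogram, scatter plot |
| Show how an attribute changes with another variable | Bar chart, line chart, heatmap |
| Compare the relationships among several attributes | Scatter plot, violin plot, stacked bar chart, stacked line chart |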

**Task 3.1:** Plot the popularity values of the 20 most popular films.

movie_top20_popularity = movie_data.loc[movie_data.sort_values(by='popularity', ascending=False)[:20].index]
base_color = sns.color_palette()[9]
sns.barplot(data=movie_top20_popularity, x='popularity', y='original_title', color=base_color)
# Assigning to sns.title has no effect on the figure; set the title through matplotlib.
plt.title('movie_top20_popularity')
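Selecting the top rows can be written more compactly with DataFrame.nlargest. The sketch below, on a toy frame with made-up values, checks that it returns the same rows as sorting and slicing.

```python
import pandas as pd

# Toy stand-in for movie_data.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D", "E"],
    "popularity": [2.9, 32.9, 13.1, 28.4, 0.4],
})

# nlargest orders by the column, descending, and keeps the first n rows.
top3 = toy.nlargest(3, "popularity")

# Same rows via the sort-then-slice approach.
top3_sorted = toy.sort_values(by="popularity", ascending=False)[:3]

print(top3["original_title"].tolist())
```

On the real data this would read movie_data.nlargest(20, 'popularity'), which can be passed to sns.barplot directly.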

[Figure 1: popularity of the 20 most popular films]


**Task 3.2:** Analyze how films' net profit (revenue minus budget) changes by year, and briefly discuss it.

movie_data['profit'] = movie_data['revenue'] - movie_data['budget']
# Despite its name, profit_mean holds the total (summed) profit per release year.
profit_mean = movie_data.groupby('release_year').agg({'profit': 'sum'})

base_color = sns.color_palette()[9]
plt.title('movie_profit')
plt.xlabel('release_year')
plt.ylabel('profit')

# No yerr is given, so this draws a plain line of total profit per year.
plt.errorbar(x=profit_mean.index, y=profit_mean['profit'], color=base_color)

[Figure 2: total profit by release year]

base_color = sns.color_palette()[0]
plt.title('movie_profit')
plt.xlabel('year')
plt.ylabel('profit')

# Half-open bins [year, year + 1): the last edge must be 2016 so that 2015 gets a bin.
xbin_edges = np.arange(1960, 2017, 1)
xbin_centers = xbin_edges[:-1]

data_xbins = pd.cut(movie_data['release_year'], xbin_edges, right=False, include_lowest=True)
y_sum = movie_data['profit'].groupby(data_xbins, observed=False).sum()
# Standard error of the per-film profit within each year.
y_sems = movie_data['profit'].groupby(data_xbins, observed=False).sem()

plt.errorbar(x=xbin_centers, y=y_sum, yerr=y_sems, color=base_color)
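When yearly bins are built with pd.cut and right=False, each bin is half-open on the right, so the final year needs one extra edge after it. The following sketch on a handful of years shows what happens with and without that edge; the variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

years = pd.Series([1960, 1961, 2014, 2015, 2015])

# Edges 1960..2015 give bins [1960, 1961), ..., [2014, 2015):
# with right=False no bin contains 2015 itself.
short_edges = np.arange(1960, 2016)
short_bins = pd.cut(years, short_edges, right=False)

# One extra edge adds the bin [2015, 2016).
full_edges = np.arange(1960, 2017)
full_bins = pd.cut(years, full_edges, right=False)

print(short_bins.isna().sum())
print(full_bins.isna().sum())
```

Values that fall outside every bin become NaN and are silently left out of a later groupby, so counting the NaNs after pd.cut is a cheap sanity check.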


**Task 3.3:** Pick the 10 most prolific directors (those with the most films), plot the revenue of each one's top 3 films, and briefly discuss it.

movie_data[['director', 'original_title', 'revenue']].head()
director original_title revenue
0 Colin Trevorrow Jurassic World 1513528810
1 George Miller Mad Max: Fury Road 378436354
2 Robert Schwentke Insurgent 295238201
3 J.J. Abrams Star Wars: The Force Awakens 2068178225
4 James Wan Furious 7 1506249360
director_info = movie_data[['director', 'original_title']].\
                groupby('director').count()['original_title'].nlargest(10).index
director_info
Index(['Woody Allen', 'Clint Eastwood', 'Martin Scorsese', 'Steven Spielberg',
       'Ridley Scott', 'Ron Howard', 'Steven Soderbergh', 'Joel Schumacher',
       'Brian De Palma', 'Barry Levinson'],
      dtype='object', name='director')
# DataFrame.append was removed in pandas 2.0; collect the pieces and concatenate once.
frames = []
for i in director_info:
    data = movie_data.loc[movie_data['director'] == i][['director', 'original_title',
                                         'revenue','revenue_adj']].nlargest(3, 'revenue_adj')
    frames.append(data)
analysis = pd.concat(frames)

analysis
director original_title revenue revenue_adj
3429 Woody Allen Midnight in Paris 151119219 1.464947e+08
1332 Woody Allen Annie Hall 38251425 1.376203e+08
7835 Woody Allen Manhattan 39946780 1.200223e+08
657 Clint Eastwood American Sniper 542307423 4.995145e+08
2888 Clint Eastwood Gran Torino 269958228 2.734101e+08
8092 Clint Eastwood The Bridges of Madison County 182016617 2.604597e+08
5428 Martin Scorsese The Wolf of Wall Street 392000694 3.669257e+08
6563 Martin Scorsese The Departed 289847354 3.135189e+08
1927 Martin Scorsese Shutter Island 294804195 2.948042e+08
9806 Steven Spielberg Jaws 470654000 1.907006e+09
8889 Steven Spielberg E.T. the Extra-Terrestrial 792910554 1.791694e+09
10223 Steven Spielberg Jurassic Park 920100000 1.388863e+09
8661 Ridley Scott Gladiator 457640427 5.795065e+08
7 Ridley Scott The Martian 595380321 5.477497e+08
2778 Ridley Scott Hannibal 351692268 4.331048e+08
6558 Ron Howard The Da Vinci Code 758239851 8.201647e+08
8076 Ron Howard Apollo 13 355237933 5.083337e+08
8663 Ron Howard How the Grinch Stole Christmas 345141403 4.370498e+08
2641 Steven Soderbergh Ocean's Eleven 450717150 5.550528e+08
6980 Steven Soderbergh Ocean's Twelve 362744280 4.187685e+08
7425 Steven Soderbergh Ocean's Thirteen 311312624 3.273977e+08
8082 Joel Schumacher Batman Forever 336529144 4.815621e+08
5236 Joel Schumacher Batman & Robin 238207122 3.235949e+08
8471 Joel Schumacher A Time to Kill 152266007 2.116828e+08
8458 Brian De Palma Mission: Impossible 457696359 6.362971e+08
9610 Brian De Palma The Untouchables 76270454 1.463691e+08
7988 Brian De Palma Scarface 65884703 1.442422e+08
9454 Barry Levinson Rain Man 354825435 6.542594e+08
4209 Barry Levinson Disclosure 214015089 3.148401e+08
9611 Barry Levinson Good Morning, Vietnam 123922370 2.378170e+08
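The per-director loop can also be expressed without explicit iteration by sorting once and taking the first rows of each group. This is a sketch on made-up data, using two directors and fictional titles instead of ten real ones.

```python
import pandas as pd

# Made-up films; only the column names follow the dataset.
toy = pd.DataFrame({
    "director": ["X", "X", "X", "X", "Y", "Y", "Z"],
    "original_title": ["x1", "x2", "x3", "x4", "y1", "y2", "z1"],
    "revenue_adj": [5.0, 9.0, 1.0, 7.0, 3.0, 8.0, 2.0],
})

# The two most prolific directors by film count.
top_directors = toy["director"].value_counts().nlargest(2).index

# Sort once by revenue, then keep the first three rows of each director.
top3 = (
    toy[toy["director"].isin(top_directors)]
    .sort_values("revenue_adj", ascending=False)
    .groupby("director")
    .head(3)
)

print(top3)
```

Unlike the loop, the result is ordered by revenue across all directors rather than grouped director by director; sorting by director afterwards restores that layout if the plot needs it.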
fig = plt.figure(figsize=(15, 6)) 
ax = sns.barplot(data=analysis, x='original_title', y='revenue_adj', hue='director', dodge=False, palette="Set2")
plt.xticks(rotation = 90)
plt.ylabel('Revenue')
plt.xlabel('Original Title')
Text(0.5,0,'Original Title')

[Figure 3: adjusted revenue of the top 3 films of each of the 10 most prolific directors]


**[Optional] Task 3.4:** Analyze how the number of films released in June changed from 1968 to 2015.

movie_data_summary = movie_data[['id', 'release_year']].groupby('release_year')['id'].agg(len)
fig = plt.figure(figsize=(15, 6))
# Films per release year; note that no month or year filter is applied here.
plt.errorbar(x=movie_data_summary.index, y=movie_data_summary.values)

[Figure 4: number of films by release year]

# id has no nulls, so counting ids gives the number of films per year.
public_number = movie_data.groupby('release_year').agg({'id': 'count'})

base_color = sns.color_palette()[2]
plt.title('movie_number')
plt.xlabel('release_year')
plt.ylabel('number')

plt.errorbar(x=public_number.index, y=public_number['id'], color=base_color)

[Figure 5: number of films by release year]
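The two plots above count films across all months. If the task is read as "films released in June, for the years 1968 to 2015", a month filter is also needed. The sketch below shows one way to build it on five release dates taken from the dataset; the explicit date format string is an assumption based on the m/d/yy values shown earlier.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "release_date": ["6/9/15", "5/13/15", "6/23/15", "6/22/08", "7/3/85"],
    "release_year": [2015, 2015, 2015, 2008, 1985],
})

# Only the month is taken from release_date; the year comes from
# release_year because two-digit years are ambiguous.
month = pd.to_datetime(toy["release_date"], format="%m/%d/%y").dt.month

in_june = month == 6
in_range = toy["release_year"].between(1968, 2015)

# Number of June releases per year.
june_counts = toy[in_june & in_range].groupby("release_year")["id"].count()

print(june_counts)
```

Years without any June release do not appear in june_counts; reindexing over the full year range with a fill value of 0 would make those gaps explicit before plotting.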


**Task 3.5:** Analyze how the numbers of Comedy and Drama films released in June changed from 1968 to 2015.

movie_year = np.arange(1968, 2016)
movie_df = movie_data[movie_data['release_year'].isin(movie_year)][['release_year', 'genres']]
movie_df.dropna(inplace=True)
Comedy_trendy = movie_df[movie_df['genres'].str.contains(r'Comedy')]
Comedy_sum = Comedy_trendy.groupby('release_year').agg(len)
Comedy_sum.head(2)
genres
release_year
1968 9
1969 12
Drama_trendy = movie_df[movie_df['genres'].str.contains(r'Drama')]
Dram_sum = Drama_trendy.groupby('release_year').agg(len)
Dram_sum.rename(columns={'genres': 'D_genres'}, inplace=True)
Dram_sum.head(2)
D_genres
release_year
1968 20
1969 13
trendy_cd= Comedy_sum.join(Dram_sum)
trendy_cd.head(2)
genres D_genres
release_year
1968 9 20
1969 12 13
fig = plt.figure(figsize=(15, 6))

plt.xticks(rotation=90)
base_color = sns.color_palette()[1]
plt.xlabel('release_year')
plt.ylabel('number')

# Label each line so the two genres can be told apart in the legend.
plt.errorbar(x=trendy_cd.index, y=trendy_cd['genres'], color=base_color, label='Comedy')
plt.errorbar(x=trendy_cd.index, y=trendy_cd['D_genres'], label='Drama')
plt.legend()

[Figure 6: number of Comedy and Drama films by release year]
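The Comedy and Drama counts can also be built in one step, which avoids the manual rename and join. This is a sketch on made-up rows; it fills years in which a genre is absent with 0, whereas a left join keeps only the years present in the first table.

```python
import pandas as pd

# Made-up rows in the same pipe-separated genres format.
toy = pd.DataFrame({
    "release_year": [1968, 1968, 1968, 1969, 1969],
    "genres": ["Comedy|Drama", "Drama", "Action|Comedy", "Drama|Romance", "Action"],
})

# One column of yearly counts per genre; a film tagged with both
# genres is counted in both columns.
trend = pd.DataFrame({
    genre: toy[toy["genres"].str.contains(genre, regex=False)]
    .groupby("release_year")
    .size()
    for genre in ["Comedy", "Drama"]
}).fillna(0).astype(int)

print(trend)
```

Calling trend.plot() would then draw one labelled line per genre.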

# Split the pipe-separated genres so that each film/genre pair gets its own row
df = movie_data
df_genres = df.drop('genres', axis=1).join(df['genres'].str.split('|', expand=True) \
                               .stack().reset_index(level=1, drop=True).rename('genres'))
fig, axes = plt.subplots(2, 1, figsize=(20, 10))

# For readability, show only some of the genres
top5_genres = df_genres['genres'].value_counts().nlargest(5).index
btm5_genres = df_genres['genres'].value_counts().nsmallest(5).index

# Draw a mean reference line (average yearly count across genres)
median_cir = df_genres.groupby('release_year')['genres'].value_counts().unstack().mean(axis=1)
median_cir.plot(ax=axes[0], ls='--', label='mean', legend=True)
median_cir.plot(ax=axes[1], ls='--', label='mean', legend=True)

# Plot by year
vis_params = {'grid': True, 'marker': 'o', 'markersize': 2, 'linewidth': 1}
df_genres[df_genres['genres'].isin(top5_genres)] \
                 .groupby('release_year')['genres'].value_counts().unstack().fillna(0) \
                 .plot(ax=axes[0], title='circulation over years of top 5 genres', **vis_params)
df_genres[df_genres['genres'].isin(btm5_genres)] \
                 .groupby('release_year')['genres'].value_counts().unstack().fillna(0) \
                 .plot(ax=axes[1], title='circulation over years of bottom 5 genres', **vis_params)

plt.tight_layout()

[Figure 7: yearly film counts for the top 5 and bottom 5 genres]
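The split, stack, and join chain used to expand genres can also be written with Series.str.split followed by DataFrame.explode. The sketch below uses two made-up films, one with a missing genre, to show the resulting long format.

```python
import pandas as pd

# Made-up films; the second one has no genre recorded.
toy = pd.DataFrame({
    "original_title": ["A", "B"],
    "release_year": [2015, 2015],
    "genres": ["Action|Adventure|Science Fiction|Thriller", None],
})

# Split each string into a list, then give every list element its own row.
long_form = toy.assign(genres=toy["genres"].str.split("|")).explode("genres")

# Count films per (year, genre); rows with a missing genre are skipped.
counts = long_form.groupby("release_year")["genres"].value_counts().unstack()

print(long_form)
print(counts)
```

A film with a missing genre stays in long_form as a single row with a null genre, so it can still be filtered out or counted separately.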

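from wordcloud import WordCloud

data = df
# Assumption: `factor` was not defined in the original code; 'revenue' is used here
# because the name df_kw_rev suggests keywords weighted by revenue.
factor = 'revenue'
kw_expand = data['keywords'].str.split('|', expand=True).stack().reset_index(level=1, drop=True).rename('keywords')
df_kw_rev = data[[factor]].join(kw_expand)
word_dict = df_kw_rev.groupby('keywords')[factor].sum().to_dict()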

# create wordcloud
params = {'mode': 'RGBA', 
          'background_color': 'rgba(255, 255, 255, 0)', 
          'colormap': 'Spectral'}
wordcloud = WordCloud(width=1200, height=800, **params)
wordcloud.generate_from_frequencies(word_dict)

# plot
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
(-0.5, 1199.5, 799.5, -0.5)

[Figure 8: keyword word cloud]
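The word cloud is driven by a dictionary that maps each keyword to a weight. A minimal sketch of building such a dictionary, with made-up keywords and revenue as the assumed weight, looks like this; it stops before the plotting step so that it runs without the wordcloud package.

```python
import pandas as pd

# Made-up rows; revenue is the assumed weighting column.
toy = pd.DataFrame({
    "keywords": ["dinosaur|island", "island|future", None],
    "revenue": [100, 50, 70],
})

# One row per (film, keyword) pair; films without keywords are dropped.
long_form = (
    toy.assign(keywords=toy["keywords"].str.split("|"))
    .explode("keywords")
    .dropna(subset=["keywords"])
)

# Total revenue per keyword; keep only keywords with a positive weight.
weights = long_form.groupby("keywords")["revenue"].sum()
word_dict = weights[weights > 0].to_dict()

print(word_dict)
```

A film's revenue is counted once for every keyword it carries, so keywords attached to a few very high-grossing films dominate the picture.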

df_genres = movie_data.drop('genres', axis=1).join(movie_data['genres'].str.split('|', expand=True).stack().reset_index(level=1, drop=True).rename('genres'))

# Filter 1: year
sel_year = df_genres['release_year'].between(1968, 2015)

# Filter 2: June
sel_June = pd.to_datetime(df_genres['release_date']).dt.month == 6

# Filter 3: genre
sel_genre = df_genres['genres'].isin(['Drama', 'Comedy'])

# Filter the data and plot (see the logical indexing section)
plt.figure(figsize=[18, 5])
sns.countplot(data=df_genres[sel_year&sel_June&sel_genre], x='release_year', hue='genres')
plt.xticks(rotation=90)
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
        34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]),
 )

[Figure 9: Comedy and Drama films released in June, by release year]
