7/1/2020
Investigate_Dataset_TMD
Project: Investigate a Dataset (The Movie Database)
Table of Contents
Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions
Introduction
This dataset contains information about roughly 10,000 movies collected from The Movie Database
(TMDb), including user ratings and revenue. In this report, I explore the following questions:
(1) Which year had the highest patronage of movies? (2) What properties are associated with
movies that have high revenues? (3) Which genres are most popular from year to year? (4) Do
customers prefer productions from a particular company? (5) Which factors or variables
significantly affect the popularity of a movie? (6) In which year did the film industry make the
highest profit?
Data Set Up
In [1]: import pandas as pd
import numpy as np
%matplotlib inline
Data Wrangling
General Properties
The data is loaded and a preview of the dataset is printed. Inspecting the dataset reveals the
duplicates, missing values, and datatype of each feature. The summary below shows that values are
missing in the "imdb_id", "cast", "homepage", "director", "tagline", "keywords", "overview",
"genres", and "production_companies" columns. The datatype of each feature is displayed below.
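The inspection described above can be sketched on a small, made-up frame. This is only an illustration: the three rows are invented, and the variable names `sample`, `missing_per_column`, and `duplicate_rows` are assumptions that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample; the real analysis loads tmdb_movies.csv.
sample = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie B"],
    "homepage": [np.nan, "http://example.com", "http://example.com"],
    "genres": ["Action|Adventure", "Drama", "Drama"],
})

# Number of missing values in each column.
missing_per_column = sample.isnull().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

A per-column count like this makes it easier to see which columns account for most of the missing values before any rows are dropped.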
file:///C:/Users/User/Documents/Documents/FST-TAMU/ANTAEUS/Personal Documents/Upwork/Data Analysis Project with Python.html
1/36
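In [2]: df_movie = pd.read_csv('tmdb_movies.csv')
## Printout of raw data. Note: head(-4) returns every row except the last four,
## which is why the preview reports 10862 of the 10866 rows.
df_movie.head(-4)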
Out[2]: [Table preview garbled in extraction. Visible column headers: id, imdb_id, popularity,
budget, revenue, original_title; the numeric values and the cast names that follow are truncated.]
10862 rows × 21 columns
In [3]: ## The goal is to check the datatypes of features in the dataset
df_movie.info()
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
Data Cleaning
Inspection of the dataset shows that some rows need to be dropped. No column is dropped in this
analysis because all of the columns are treated as relevant. Instead, every row that contains a
null value in any column is dropped, which leaves a compact dataset with no missing values. This
is a strict filter: the summary after cleaning shows that only 1992 of the 10866 rows remain.
Duplicates are also checked and removed; they may be due to human error and would negatively
affect the analysis if left in place.
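The effect of the cleaning rule can be sketched on invented data. The sample rows below are hypothetical, and the `targeted` variant (requiring values only in the columns a question uses) is an assumption shown for comparison, not the approach taken in this report.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names follow the dataset, values are invented.
sample = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie C"],
    "homepage": [np.nan, np.nan, "http://example.com", "http://example.com"],
    "genres": ["Action", np.nan, "Drama", "Drama"],
})

# Strict cleaning, as in this report: drop a row if any column is null.
strict = sample.dropna(how="any")

# Alternative (an assumption, not the report's method): require values only
# in the columns a given question actually uses, here just 'genres'.
targeted = sample.dropna(subset=["genres"])

# Remove exact duplicate rows, keeping the first occurrence.
deduplicated = strict.drop_duplicates()

print(len(sample), len(strict), len(targeted), len(deduplicated))
```

Comparing the row counts shows how much data each rule discards, which matters when a sparsely filled column such as "homepage" is not needed for the question at hand.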
In [4]: #drop rows with any null values in the dataset
df_movie.dropna(how='any',inplace=True)
In [5]: # check if any columns have null values; it should print false
df_movie.isnull().sum().any()
Out[5]: False
In [6]: # check for duplicates in the dataset. If none, it should print out 0
sum(df_movie.duplicated())
Out[6]: 0
In [7]: # The structure of the dataset after it has been cleaned
df_movie.info()
Int64Index: 1992 entries, 0 to 10819
Data columns (total 21 columns):
id                      1992 non-null int64
imdb_id                 1992 non-null object
popularity              1992 non-null float64
budget                  1992 non-null int64
revenue                 1992 non-null int64
original_title          1992 non-null object
cast                    1992 non-null object
homepage                1992 non-null object
director                1992 non-null object
tagline                 1992 non-null object
keywords                1992 non-null object
overview                1992 non-null object
runtime                 1992 non-null int64
genres                  1992 non-null object
production_companies    1992 non-null object
release_date            1992 non-null object
vote_count              1992 non-null int64
vote_average            1992 non-null float64
release_year            1992 non-null int64
budget_adj              1992 non-null float64
revenue_adj             1992 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 342.4+ KB
Exploratory Data Analysis
Research Question 1: Which year has the highest patronage of movies?
Discussion
The total vote count is used as a measure of patronage in this analysis. The bar chart below
shows that movies released in 2012 received the most votes, the highest total of any release
year in the dataset.
In [8]: import matplotlib.pyplot as plt
%matplotlib inline
In [9]: ## Plot of the total number of votes per release year
df_movie.groupby('release_year')['vote_count'].sum().plot(
    kind='bar', figsize=(15, 15), title='Total Votes throughout the Years')
plt.xlabel('release_year', fontsize=12)
plt.ylabel('total votes', fontsize=12)
Out[9]: Text(0,0.5,'total votes')
Research Question 2: Which year did the film industry make the highest profit?
Discussion:
To answer this question, I plotted the total adjusted revenue and the profitability index for
each release year. The profitability index, computed here as total adjusted revenue divided by
total adjusted budget, measures the amount earned per unit invested and is the main indicator of
how profitable the industry is. According to the profitability index plot, the film industry was
most profitable in 1978, although the highest total revenue was recorded in 2011. The plot also
suggests that the financial attractiveness of making a movie has declined over time, although the
index has remained relatively stable at about 5 over the last decade covered by the data.
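The difference between the profitability index (a ratio) and absolute profit (a difference) can be sketched with invented figures. The numbers and the `profit` column below are assumptions for illustration; the report itself only plots revenue and the ratio.

```python
import pandas as pd

# Hypothetical figures; column names follow the dataset, values are invented.
sample = pd.DataFrame({
    "release_year": [1978, 1978, 2011, 2011],
    "budget_adj": [10.0, 10.0, 100.0, 300.0],
    "revenue_adj": [120.0, 80.0, 500.0, 700.0],
})

totals = sample.groupby("release_year")[["revenue_adj", "budget_adj"]].sum()

# Profitability index: revenue earned per unit of budget.
totals["profit_index"] = totals["revenue_adj"] / totals["budget_adj"]

# Absolute profit: revenue minus budget (not used in the report itself).
totals["profit"] = totals["revenue_adj"] - totals["budget_adj"]

most_profitable_year = totals["profit_index"].idxmax()
highest_profit_year = totals["profit"].idxmax()
print(totals)
```

In this toy data the two measures point to different years, which is the same kind of split the discussion describes between the year with the highest index and the year with the highest revenue.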
In [10]: #Plot visualization of adjusted revenue per year
df_movie.groupby('release_year')['revenue_adj'].sum().plot(kind='bar',figsize=
(15,15),title='Total Revenue throughout the Years')
plt.xlabel('release_year',fontsize=12)
plt.ylabel('total revenue',fontsize=12)
Out[10]: Text(0,0.5,'total revenue')
In [11]: #Plot visualization of profitability index per year
revenue = df_movie.groupby('release_year')['revenue_adj'].sum()
expense = df_movie.groupby('release_year')['budget_adj'].sum()
profit_index = revenue / expense
profit_index.plot(kind='bar', figsize=(15, 15),
                  title='Profitability Indices for Film Industry')
plt.xlabel('release_year', fontsize=12)
plt.ylabel('profit_index', fontsize=12)
Out[11]: Text(0,0.5,'profit_index')
Research Question 3: Which genres are most popular from year to year?
Discussion
To analyse this question, three different indicators of genre popularity are considered: (1)
revenue, (2) profitability index, and (3) popularity index. Revenue is included because a popular
genre should attract many paying customers. Judged by average revenue in the plot below,
Adventure, Animation, Science Fiction, and Western movies have been the most popular over the
years. The profitability index, which is derived from revenue, ranks the genres differently:
Adventure, Documentary, Horror, and Science Fiction come out on top.
Judged by the popularity index, Action, Adventure, Science Fiction, and Western are the most
popular. Because the three indicators disagree, revenue and the profitability index do not appear
to be good measures of how popular a genre is.
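Because the "genres" column stores several genres in one pipe-separated string, the way it is split affects every per-genre figure. The sketch below uses invented rows to contrast keeping only the first genre with an alternative that counts a movie under each of its genres. The alternative and all variable names are assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample; values are invented.
sample = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Adventure", "Drama", "Comedy|Romance|Drama"],
    "popularity": [3.0, 1.0, 2.0],
})

# '|' is the regex "or" operator, so the default pattern matches every row.
matches_as_regex = sample["genres"].str.contains("|")
# With regex=False the pipe is matched literally.
has_several_genres = sample["genres"].str.contains("|", regex=False)

# Option 1 (used in this report): keep only the first listed genre.
first_genre = sample["genres"].str.split("|").str[0]

# Option 2 (an alternative, not used here): one row per movie-genre pair.
exploded = sample.assign(genres=sample["genres"].str.split("|")).explode("genres")
mean_popularity = exploded.groupby("genres")["popularity"].mean()
print(mean_popularity)
```

Keeping only the first genre is simpler, but it attributes each movie to a single genre; the exploded form lets a movie contribute to every genre it lists.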
In [12]: ## First, getting all the movies with more than one genre.
## Note: str.contains treats '|' as a regular expression that matches every row,
## so no rows are filtered out here (the info() output below still lists 1992 entries).
genre_more = df_movie[df_movie['genres'].str.contains('|')]
In [13]: # making a copy
df_movie1 = genre_more.copy()
In [14]: # Keep only the first entry of each pipe-separated column
split_columns = ['genres','cast','production_companies']
for c in split_columns:
    df_movie1[c] = df_movie1[c].apply(lambda x: x.split("|")[0])
In [15]: # A view of the cleaned dataset
df_movie1.info()
Int64Index: 1992 entries, 0 to 10819
Data columns (total 21 columns):
id                      1992 non-null int64
imdb_id                 1992 non-null object
popularity              1992 non-null float64
budget                  1992 non-null int64
revenue                 1992 non-null int64
original_title          1992 non-null object
cast                    1992 non-null object
homepage                1992 non-null object
director                1992 non-null object
tagline                 1992 non-null object
keywords                1992 non-null object
overview                1992 non-null object
runtime                 1992 non-null int64
genres                  1992 non-null object
production_companies    1992 non-null object
release_date            1992 non-null object
vote_count              1992 non-null int64
vote_average            1992 non-null float64
release_year            1992 non-null int64
budget_adj              1992 non-null float64
revenue_adj             1992 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 342.4+ KB
In [16]: # Considering the average adjusted revenue generated per genre
df_movie1.groupby('genres')['revenue_adj'].mean().plot(
    kind='bar', figsize=(15, 15), title='Average Revenue From Each Genre')
plt.xlabel('genre', fontsize=12)
plt.ylabel('Average revenue', fontsize=12)
Out[16]: Text(0,0.5,'Average revenue')
In [17]: #Considering profitability index per genre
revenue_genre = df_movie1.groupby('genres')['revenue_adj'].mean()
expense_genre = df_movie1.groupby('genres')['budget_adj'].mean()
profit_index_genre = revenue_genre / expense_genre
profit_index_genre.plot(kind='bar', figsize=(15, 15),
                        title='Average Profitability Indices per Genre')
plt.xlabel('genre', fontsize=12)
plt.ylabel('profit_index', fontsize=12)
Out[17]: Text(0,0.5,'profit_index')
In [18]: # Using popularity index
df_movie1.groupby('genres')['popularity'].mean().plot(
    kind='bar', figsize=(15, 15), title='Genres Popularity Indices')
plt.xlabel('genre', fontsize=12)
plt.ylabel('total popularity index', fontsize=12)
Out[18]: Text(0,0.5,'total popularity index')
Research Question 4: Which of the movies are the most expensive to make?
Discussion
To answer this question, the average adjusted budget per genre is plotted. According to this
plot, Western movies are the most expensive to make, followed by Adventure, Animation, and
Fantasy. To understand what makes these movies expensive, I considered whether the number of cast
members could be a major factor. The plot below suggests that it is not, as the most expensive
genre has the fewest cast members on average. This implies that other factors may be responsible
for the high cost of making these movies; these could be investigated further.
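One way to put budget and cast size side by side is to count the names in the pipe-separated "cast" string for each movie and then average per genre. This is a minimal sketch on invented rows. It assumes the count is taken from the original "cast" strings, before that column is reduced to its first entry, and the names `cast_size` and `per_genre` are hypothetical.

```python
import pandas as pd

# Hypothetical sample with the original pipe-separated 'cast' strings.
sample = pd.DataFrame({
    "genres": ["Western", "Western", "Comedy"],
    "cast": ["Actor A|Actor B", "Actor C", "Actor D|Actor E|Actor F"],
    "budget_adj": [9.0e7, 1.1e8, 2.0e7],
})

# Cast members listed per movie: number of separators plus one.
sample["cast_size"] = sample["cast"].str.count(r"\|") + 1

per_genre = sample.groupby("genres").agg(
    mean_budget=("budget_adj", "mean"),
    mean_cast_size=("cast_size", "mean"),
)
print(per_genre)
```

The pipe is escaped in `str.count` because that method interprets its pattern as a regular expression.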
In [19]: ## Find the average cost to produce
df_movie_avg = df_movie1.groupby(['genres'], as_index=False)['budget_adj'].mean()
df_movie_avg
Out[19]:
    genres           budget_adj
0   Action           -e+07
1   Adventure        -e+07
2   Animation        -e+07
3   Comedy           -e+07
4   Crime            -e+07
5   Documentary      -e+06
6   Drama            -e+07
7   Family           -e+07
8   Fantasy          -e+07
9   History          -e+07
10  Horror           -e+07
11  Music            -e+07
12  Mystery          -e+07
13  Romance          -e+07
14  Science Fiction  -e+07
15  TV Movie         -e+05
16  Thriller         -e+07
17  War              -e+07
18  Western          -e+08
In [20]: df_movie_avg.plot(kind='bar',figsize=(15,15),title='Budget per Genre')
Out[20]:
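In [21]: # Average cast per genre
# Note: 'cast' holds a single name per movie after In [14], so count() gives the number
# of movies per genre; dividing by the constant 55 rescales that count and does not
# measure cast size.
df_movie2 = df_movie1.query('release_year == 2015')  # not used below
df_movie3 = df_movie1.groupby('genres')['cast'].count() / 55
df_movie3.plot(kind='bar', figsize=(15, 15),
               title='Total Number of Casts in Each Genre')
plt.xlabel('genre', fontsize=12)
plt.ylabel('total number of casts', fontsize=12)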
Out[21]: Text(0,0.5,'total number of casts')
Research Question 5: Which year has the highest number of movies produced?
Discussion
From the figure below, the highest number of movies was produced in 2011, followed by 2010 and
2009. Film production has grown rapidly since the early 1960s, with a decline in 2012. The cause
of the decline cannot be determined accurately from the available data, but the budget-per-year
plot shows that the total budget also dropped. Lower investment may therefore explain why fewer
movies were produced.
In [29]: df_movie1.groupby('release_year')['genres'].count().plot(
    kind='bar', figsize=(15, 15), title='Total Movie Production per Year')
plt.xlabel('release year', fontsize=12)
plt.ylabel('total count', fontsize=12)
Out[29]: Text(0,0.5,'total count')
In [26]: df_movie1.groupby('release_year')['budget_adj'].sum().plot(kind='bar',figsize=
(15,15),title='Budget per Year')
plt.xlabel('release year',fontsize=12)
plt.ylabel('budget',fontsize=12)
Out[26]: Text(0,0.5,'budget')
Research Question 6: Which of the three years with the highest number of movies produced (2009,
2010, and 2011) has the highest number of different genres?
Discussion:
The aim of this question is to examine the distribution of genres in each of the years with the
highest movie production. This distribution may indicate which genres customers appreciate most.
In 2009, movies in seventeen different genres were produced, the most of the three years; Comedy
was the most produced genre that year. Averaged over the three years, Drama is the most produced
genre. With more features in the dataset, more could be learned about the demographics, cultural
orientation, and socio-economic status of these customers.
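The number of different genres per year can also be computed directly rather than read off three separate listings. The sketch below uses invented rows and assumes "genres" already holds one genre per movie. The names `top_years`, `distinct_genres`, and `genre_counts` are hypothetical.

```python
import pandas as pd

# Hypothetical sample; 'genres' already holds one genre per movie.
sample = pd.DataFrame({
    "release_year": [2009, 2009, 2009, 2010, 2010, 2011],
    "genres": ["Comedy", "Drama", "Comedy", "Drama", "Drama", "Action"],
})

top_years = sample[sample["release_year"].isin([2009, 2010, 2011])]

# Number of distinct genres released in each year.
distinct_genres = top_years.groupby("release_year")["genres"].nunique()

# Movies per genre within each year (years as rows, genres as columns).
genre_counts = (
    top_years.groupby(["release_year", "genres"]).size().unstack(fill_value=0)
)
print(distinct_genres)
print(genre_counts)
```

A single table with years as rows makes the three distributions directly comparable.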
In [30]: # Distribution of genres produced in 2009
# release_year is an int64 column, so it is compared with an integer
df_movie2009a = df_movie1.query('release_year == 2009')
df_movie2009b = df_movie2009a['genres'].value_counts()
In [31]: df_movie2009b
Out[31]:
Comedy             51
Drama              40
Action             23
Horror             15
Animation          13
Adventure          12
Thriller            8
Documentary         7
Science Fiction     6
Fantasy             5
Crime               4
Music               2
Romance             2
War                 1
Mystery             1
History             1
Family              1
Name: genres, dtype: int64
In [32]: df_movie2009b.plot(kind='bar', figsize=(15, 15),
                   title='Genres Distribution in 2009')
plt.xlabel('genre', fontsize=12)
plt.ylabel('total number of genres', fontsize=12)
Out[32]: Text(0,0.5,'total number of genres')
In [33]: # Distribution of genres produced in 2010
# release_year is an int64 column, so it is compared with an integer
df_movie2010a = df_movie1.query('release_year == 2010')
df_movie2010b = df_movie2010a['genres'].value_counts()
In [34]: df_movie2010b
Out[34]:
Drama              61
Comedy             41
Action             35
Horror             19
Documentary        13
Adventure           9
Animation           6
Crime               5
Fantasy             4
Family              4
Thriller            3
Romance             3
Science Fiction     2
War                 1
Name: genres, dtype: int64
In [35]: df_movie2010b.plot(kind='bar', figsize=(15, 15),
                   title='Genres Distribution in 2010')
plt.xlabel('genre', fontsize=12)
plt.ylabel('total number of genres', fontsize=12)
Out[35]: Text(0,0.5,'total number of genres')
In [36]: # Distribution of genres produced in 2011
# release_year is an int64 column, so it is compared with an integer
df_movie2011a = df_movie1.query('release_year == 2011')
df_movie2011b = df_movie2011a['genres'].value_counts()
In [37]: df_movie2011b
Out[37]:
Drama              53
Comedy             39
Action             38
Adventure          17
Horror             16
Thriller           14
Documentary        10
Crime               9
Animation           8
Fantasy             4
Music               3
Science Fiction     2
Romance             2
Family              2
Mystery             1
History             1
Name: genres, dtype: int64
In [38]: df_movie2011b.plot(kind='bar', figsize=(15, 15),
                   title='Genres Distribution in 2011')
plt.xlabel('genre', fontsize=12)
plt.ylabel('total number of genres', fontsize=12)
Out[38]: Text(0,0.5,'total number of genres')
Research Question 7: Which genre has the longest run time on the average?
Discussion:
The aim of this question is to assess the effect of runtime on the popularity of a genre,
specifically whether a genre with a long average runtime tends to be less popular. On average,
the War genre has the longest runtime. Judging from the genre counts in Question 6 above, the
genre also seems to be less popular. However, the dataset is not sufficient to establish a direct
relationship between popularity and runtime; more features would be needed to confirm one.
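A first, rough check of the runtime question is to compute a correlation between runtime and popularity at the movie level and to compare per-genre averages. The sketch below runs on invented values, so its numbers say nothing about the real dataset. It also only measures linear association and does not establish a cause.

```python
import pandas as pd

# Hypothetical sample; values are invented for illustration.
sample = pd.DataFrame({
    "genres": ["War", "War", "Comedy", "Comedy"],
    "runtime": [150, 130, 95, 105],
    "popularity": [0.4, 0.6, 1.5, 0.5],
})

# Movie-level Pearson correlation between runtime and popularity.
movie_level_corr = sample["runtime"].corr(sample["popularity"])

# Genre-level view: average runtime next to average popularity.
per_genre = sample.groupby("genres")[["runtime", "popularity"]].mean()
longest_genre = per_genre["runtime"].idxmax()
print(round(movie_level_corr, 3))
print(per_genre)
```

`Series.corr` defaults to the Pearson coefficient; a rank-based method such as `method="spearman"` may be more appropriate if the relationship is not linear.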
In [36]: df_movie1.groupby('genres')['runtime'].mean().plot(kind='bar',figsize=(15,15),
title='Average Runtime of each Genre')
plt.xlabel('genre',fontsize=12)
plt.ylabel('average runtime',fontsize=12)
Out[36]: Text(0,0.5,'average runtime')
Research Question 8: Over the years, which genres are watched the most?
Discussion:
This question looks at how patronage is distributed across the genres. A pie chart shows each
genre's share of the average vote count per movie, and a bar chart shows the total vote count per
genre. Western movies have the highest average vote count, followed by Action, Science Fiction,
and Adventure. In total, however, Western movies have the lowest vote count of all the genres.
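The contrast between average and total vote counts arises whenever a genre has few movies. A minimal sketch with invented numbers (the values and variable names are assumptions) shows how a genre can lead on the mean while trailing on the sum:

```python
import pandas as pd

# Hypothetical sample: one heavily voted Western, three Action movies.
sample = pd.DataFrame({
    "genres": ["Western", "Action", "Action", "Action"],
    "vote_count": [900, 400, 500, 600],
})

# Mean, total, and number of movies per genre in one table.
votes = sample.groupby("genres")["vote_count"].agg(["mean", "sum", "count"])

top_by_mean = votes["mean"].idxmax()
top_by_sum = votes["sum"].idxmax()
print(votes)
```

Including the movie count next to the mean and the sum makes it clear when an average rests on only a handful of titles.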
In [25]: df_movie1.groupby('genres')['vote_count'].mean().plot(
    kind='pie', figsize=(16, 16), title='Average Vote Counts per Genre')
Out[25]:
In [42]: df_movie1.groupby('genres')['vote_count'].sum().plot(kind='bar',figsize=(16,16
),title='Total Vote Counts per Genre')
Out[42]:
Conclusions
From the analyses above, the following conclusions can be drawn. Patronage of movies, measured by
vote count, was highest in 2012, although investment by production companies dipped that year
compared with 2011, 2010, and 2009. Over the years, the patronage of movies has grown rapidly.
Despite the high patronage in 2012, the industry recorded its highest total revenue in the
preceding year, 2011; the profitability index peaked much earlier, in 1978.
The high revenue in 2011 may be explained by the fact that the industry produced the highest
number of movies in that year; 2010 and 2009 were also years of high revenue. The highest number
of different genres (seventeen) was produced in 2009, followed by 2011.
War has the longest runtime of all the genres but is one of the least watched; Western is the
least watched. A correlation between these two facts could not be established from the data, but
it may be that customers do not like movies with long runtimes.
The most watched genres are Action, Science Fiction, and Adventure.
Further Works
More features are needed to establish correlations among the different features in this dataset.
Nevertheless, the limited features have helped to produce some tentative conclusions, which in
turn indicate what kinds of features would be needed to draw stronger conclusions or inferences.
For instance, to gain more insight into what drives the patronage of the genres, the demographic
distribution, socio-economic status, and cultural orientation of the voters, among other
features, are needed. The current features could not explain why Drama seemed to be the most
produced genre on average. Another example is the relationship between runtime and popularity:
could the War genre be less popular because of its long runtime?