Femi Oyedokun | Freelancer Portfolio Item #252975

7/1/2020 Investigate_Dataset_TMD

Project: Investigate a Dataset (The Movie Database)

Table of Contents
Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions

Introduction
This dataset contains information about 10,000 movies collected from The Movie Database, including user ratings and revenue. In this report, I explore the following questions:
(1) Which year has the highest patronage of movies?
(2) What kinds of properties are associated with movies that have high revenues?
(3) Which genres are most popular from year to year?
(4) Do customers have a preference for productions from a particular company?
(5) What factors or variables significantly impact the popularity of the movies?
(6) In which year did the film industry make the highest profit?

Data Set Up

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline

Data Wrangling

General Properties
The data is loaded and a preview of the dataset is printed. Inspecting the dataset shows whether it contains duplicates or missing values, and what the datatype of each feature is. Values are missing in the "imdb_id", "cast", "homepage", "director", "tagline", "keywords", "overview", "genres", and "production_companies" columns. The datatype of each feature is displayed below.

file:///C:/Users/User/Documents/Documents/FST-TAMU/ANTAEUS/Personal Documents/Upwork/Data Analysis Project with Python.html

In [2]:
df_movie = pd.read_csv('tmdb_movies.csv')
# Preview of the raw data: head(-4) returns every row except the last four
df_movie.head(-4)

Out[2]: [Table preview of the raw data; columns shown: id, imdb_id, popularity, budget, revenue, original_title, ...]
10862 rows × 21 columns

In [3]:
# Check the datatypes of the features in the dataset
df_movie.info()

RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

Data Cleaning
Some rows of the dataset need to be dropped. No column is dropped in this analysis, because almost all of the columns are relevant to it. Rows with null values are dropped so that the remaining dataset is compact and has no missing values in any column. Duplicates are then checked for and, if present, removed; duplicate rows usually come from data-entry errors and would distort counts and totals in the analysis.
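The effect of the cleaning choice described above can be sketched on a small made-up table. This is only an illustration, not the notebook's own code: the frame `toy`, its values, and the `needed` column list are assumptions introduced here; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical miniature of the movie table; all values are invented.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "homepage": ["http://a.example", None, None, None],
    "genres": ["Drama", "Comedy|Drama", None, "Action"],
    "revenue_adj": [1.0e6, 2.5e6, 3.0e5, 4.0e6],
})

# Missing values per column, most incomplete column first.
null_counts = toy.isnull().sum().sort_values(ascending=False)

# Dropping rows with any null keeps only fully populated rows.
strict = toy.dropna(how="any")

# Dropping only on the columns a question uses (an assumed choice) keeps more rows.
needed = ["genres", "revenue_adj"]
lenient = toy.dropna(subset=needed)

print(null_counts)
print(len(strict), "rows kept with how='any';", len(lenient), "rows kept with subset")
```

Comparing the two row counts shows how strongly a sparsely filled column such as "homepage" drives the number of rows that survive `dropna(how='any')`.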
In [4]:
# Drop rows with any null values in the dataset
df_movie.dropna(how='any', inplace=True)

In [5]:
# Check whether any column still has null values; it should print False
df_movie.isnull().sum().any()

Out[5]: False

In [6]:
# Check for duplicates in the dataset; if there are none, it should print 0
sum(df_movie.duplicated())

Out[6]: 0

In [7]:
# The structure of the dataset after cleaning (1992 of the original 10866 rows remain)
df_movie.info()

Int64Index: 1992 entries, 0 to 10819
Data columns (total 21 columns):
id                      1992 non-null int64
imdb_id                 1992 non-null object
popularity              1992 non-null float64
budget                  1992 non-null int64
revenue                 1992 non-null int64
original_title          1992 non-null object
cast                    1992 non-null object
homepage                1992 non-null object
director                1992 non-null object
tagline                 1992 non-null object
keywords                1992 non-null object
overview                1992 non-null object
runtime                 1992 non-null int64
genres                  1992 non-null object
production_companies    1992 non-null object
release_date            1992 non-null object
vote_count              1992 non-null int64
vote_average            1992 non-null float64
release_year            1992 non-null int64
budget_adj              1992 non-null float64
revenue_adj             1992 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 342.4+ KB

Exploratory Data Analysis

Research Question 1: Which year has the highest patronage of movies?

Discussion
The total vote count is used as the measure of patronage in this analysis. The bar chart below shows that 2012 had the highest patronage of any year in the dataset.
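Besides reading the peak off the bar chart, the year with the most votes can be looked up directly. The following is a minimal sketch on invented numbers; `toy`, `votes_per_year`, `peak_year`, and `peak_votes` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical vote counts; values are invented for illustration.
toy = pd.DataFrame({
    "release_year": [2010, 2010, 2011, 2012, 2012],
    "vote_count": [100, 250, 300, 400, 150],
})

# Total votes per release year, the same aggregation used for the bar chart.
votes_per_year = toy.groupby("release_year")["vote_count"].sum()

# idxmax returns the index label (the year) of the largest total.
peak_year = votes_per_year.idxmax()
peak_votes = votes_per_year.max()

print(peak_year, peak_votes)
```

Applied to `df_movie`, the same two calls would give the peak year as a number instead of a visual estimate.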
In [8]:
import matplotlib.pyplot as plt
%matplotlib inline

In [9]:
# Plot the total number of votes for each release year
df_movie.groupby('release_year')['vote_count'].sum().plot(kind='bar', figsize=(15,15), title='Total Votes throughout the Years')
plt.xlabel('release_year', fontsize=12)
plt.ylabel('total votes', fontsize=12)

Out[9]: Text(0,0.5,'total votes')

Research Question 2: Which year did the film industry make the highest profit?

Discussion:
To answer this question, I plotted the adjusted revenue and the profit index for each year. The profit index is the ratio of total adjusted revenue to total adjusted budget, so it measures the amount made per unit invested; it is the main indicator of how profitable the industry is. According to the profit index figure, the film industry was most profitable in 1978, although the highest revenue was made in 2011. The figure also indicates that the financial attractiveness of making a movie has declined over time, although the index remained relatively stable at about 5 for the last decade of the data.
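The difference between the year with the highest revenue and the year with the highest profit index can be sketched as follows. The numbers are invented so that the two answers differ; `toy`, `totals`, `best_ratio_year`, and `best_revenue_year` are hypothetical names, and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical budgets and revenues; values are invented for illustration.
toy = pd.DataFrame({
    "release_year": [1978, 1978, 2011, 2011],
    "budget_adj": [10.0, 10.0, 100.0, 100.0],
    "revenue_adj": [90.0, 70.0, 500.0, 500.0],
})

# Yearly totals, then the ratio of revenue to budget (the profit index).
totals = toy.groupby("release_year")[["revenue_adj", "budget_adj"]].sum()
totals["profit_index"] = totals["revenue_adj"] / totals["budget_adj"]

# The two questions can have different answers.
best_ratio_year = totals["profit_index"].idxmax()
best_revenue_year = totals["revenue_adj"].idxmax()

print(totals)
print(best_ratio_year, best_revenue_year)
```

In this toy data the early year returns more per unit invested even though the later year earns far more in absolute terms, which is the pattern the discussion describes.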
In [10]:
# Plot the total adjusted revenue per year
df_movie.groupby('release_year')['revenue_adj'].sum().plot(kind='bar', figsize=(15,15), title='Total Revenue throughout the Years')
plt.xlabel('release_year', fontsize=12)
plt.ylabel('total revenue', fontsize=12)

Out[10]: Text(0,0.5,'total revenue')

In [11]:
# Plot the profitability index (total revenue / total budget) per year
revenue = df_movie.groupby('release_year')['revenue_adj'].sum()
expense = df_movie.groupby('release_year')['budget_adj'].sum()
profit_index = revenue / expense
profit_index.plot(kind='bar', figsize=(15,15), title='Profitability Indices for Film Industry')
plt.xlabel('release_year', fontsize=12)
plt.ylabel('profit_index', fontsize=12)

Out[11]: Text(0,0.5,'profit_index')

Research Question 3: Which genres are most popular from year to year?

Discussion
To analyse this problem, three indicators of genre popularity are considered: (1) revenue, (2) profitability index, and (3) popularity index. Revenue is used because a popular genre should attract many paying customers. By average revenue, Adventure, Animation, Science Fiction, and Western movies rank highest. The profitability index, which is derived from revenue, shows a different pattern: Adventure, Documentary, Horror, and Science Fiction rank highest. By the popularity index, Action, Adventure, Science Fiction, and Western are the most popular. Because the three rankings disagree, revenue and the profitability index are not good indicators of how popular a genre is.
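The genre analysis depends on how the pipe-separated "genres" strings are handled. The sketch below, on three made-up strings, shows two points that are easy to miss: `str.contains('|')` treats the pipe as a regular expression and matches every row, and keeping only the first listed genre discards the others. The `explode` variant is an alternative introduced here, not the notebook's method.

```python
import pandas as pd

# Hypothetical genre strings in the dataset's pipe-separated format.
genres = pd.Series(["Action|Adventure", "Drama", "Comedy|Romance|Drama"])

# As a regular expression, '|' is an empty alternation and matches every string.
regex_mask = genres.str.contains("|")

# With regex=False, '|' is treated as a literal character.
literal_mask = genres.str.contains("|", regex=False)

# First listed genre only, versus one row per listed genre.
first_genre = genres.str.split("|").str[0]
all_genres = genres.str.split("|").explode()

print(regex_mask.tolist(), literal_mask.tolist())
print(first_genre.tolist())
print(all_genres.value_counts())
```

With the first-genre approach each movie is counted once; with `explode` a movie contributes to every genre it lists, so the two approaches can rank genres differently.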
In [12]:
# Intended to select the movies with more than one genre.
# Note: '|' is interpreted as a regular expression here and matches every row, so all 1992 rows are kept
genre_more = df_movie[df_movie['genres'].str.contains('|')]

In [13]:
# Make a copy
df_movie1 = genre_more.copy()

In [14]:
# Keep only the first listed value in each pipe-separated column
split_columns = ['genres', 'cast', 'production_companies']
for c in split_columns:
    df_movie1[c] = df_movie1[c].apply(lambda x: x.split("|")[0])

In [15]:
# A view of the cleaned dataset
df_movie1.info()

Int64Index: 1992 entries, 0 to 10819
Data columns (total 21 columns):
id                      1992 non-null int64
imdb_id                 1992 non-null object
popularity              1992 non-null float64
budget                  1992 non-null int64
revenue                 1992 non-null int64
original_title          1992 non-null object
cast                    1992 non-null object
homepage                1992 non-null object
director                1992 non-null object
tagline                 1992 non-null object
keywords                1992 non-null object
overview                1992 non-null object
runtime                 1992 non-null int64
genres                  1992 non-null object
production_companies    1992 non-null object
release_date            1992 non-null object
vote_count              1992 non-null int64
vote_average            1992 non-null float64
release_year            1992 non-null int64
budget_adj              1992 non-null float64
revenue_adj             1992 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 342.4+ KB

In [16]:
# Average adjusted revenue per genre
df_movie1.groupby('genres')['revenue_adj'].mean().plot(kind='bar', figsize=(15,15), title='Average Revenue From Each Genre')
plt.xlabel('genre', fontsize=12)
plt.ylabel('Average revenue', fontsize=12)

Out[16]: Text(0,0.5,'Average revenue')
In [17]:
# Profitability index per genre (average revenue / average budget)
revenue_genre = df_movie1.groupby('genres')['revenue_adj'].mean()
expense_genre = df_movie1.groupby('genres')['budget_adj'].mean()
profit_index_genre = revenue_genre / expense_genre
profit_index_genre.plot(kind='bar', figsize=(15,15), title='Average Profitability Indices per Genre')
plt.xlabel('genre', fontsize=12)
plt.ylabel('profit_index', fontsize=12)

Out[17]: Text(0,0.5,'profit_index')

In [18]:
# Average popularity index per genre
df_movie1.groupby('genres')['popularity'].mean().plot(kind='bar', figsize=(15,15), title='Genres Popularity Indices')
plt.xlabel('genre', fontsize=12)
plt.ylabel('total popularity index', fontsize=12)

Out[18]: Text(0,0.5,'total popularity index')

Research Question 4: Which of the movies are the most expensive to make?

Discussion
To answer this question, the average budget per genre is plotted. Western movies are the most expensive to make, followed by Adventure, Animation, and Fantasy. To find out what could make these movies expensive, I considered whether the number of cast members is a major factor. From the plot below, it does not appear to be, as the most expensive genre has the lowest cast count on average. Other factors may therefore be responsible for the high cost of making these movies; these could be investigated further.
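One way to measure cast size per movie is to count the names in the original pipe-separated "cast" string before it is reduced to its first entry. This is a sketch under that assumption, on invented rows; `toy`, `cast_size`, `main_genre`, and `avg_cast` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Hypothetical rows in the dataset's pipe-separated format; values are invented.
toy = pd.DataFrame({
    "genres": ["Western|Drama", "Action|Thriller", "Action"],
    "cast": ["A|B", "C|D|E|F", "G|H"],
})

# Number of listed cast members per movie.
toy["cast_size"] = toy["cast"].str.split("|").str.len()

# First listed genre, matching the convention used in the report.
toy["main_genre"] = toy["genres"].str.split("|").str[0]

# Average listed cast size per genre.
avg_cast = toy.groupby("main_genre")["cast_size"].mean()

print(avg_cast)
```

The result only reflects the names listed in the dataset, which may be a subset of the full cast, so it is at best a rough proxy for production size.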
In [19]:
# Find the average cost to produce a movie in each genre
df_movie_avg = df_movie1.groupby(['genres'], as_index=False)['budget_adj'].mean()
df_movie_avg

Out[19]:
[mantissas not recoverable from the source; only the exponents survive]
    genres           budget_adj
0   Action           -e+07
1   Adventure        -e+07
2   Animation        -e+07
3   Comedy           -e+07
4   Crime            -e+07
5   Documentary      -e+06
6   Drama            -e+07
7   Family           -e+07
8   Fantasy          -e+07
9   History          -e+07
10  Music            -e+07
11  Mystery          -e+07
12  Romance          -e+07
13  Science Fiction  -e+07
14  TV Movie         -e+05
15  Thriller         -e+07
16  War              -e+07
17  Western          -e+08

In [20]:
# x='genres' labels the bars with the genre names instead of the row index
df_movie_avg.plot(kind='bar', x='genres', y='budget_adj', figsize=(15,15), title='Budget per Genre')

Out[20]:

In [21]:
# Average cast per genre
# Note: 'cast' was reduced to its first name in In [14], so count() gives the number of movies per genre, here divided by 55
df_movie2 = df_movie1.query('release_year == 2015')  # not used below
df_movie3 = df_movie1.groupby('genres')['cast'].count()/55
df_movie3.plot(kind='bar', figsize=(15,15), title='Total Number of Casts in Each Genre')
plt.xlabel('genre', fontsize=12)
plt.ylabel('total number of casts', fontsize=12)

Out[21]: Text(0,0.5,'total number of casts')

Research Question 5: Which year has the highest number of movies produced?

Discussion
From the figure below, the highest number of movies was produced in 2011, followed by 2010 and then 2009. Film production has increased sharply since the early 1960s, with a decline in 2012.
The cause of the decline cannot be determined accurately from the available data, but the plot of budget per year shows that the total budget also dropped in 2012. The drop in budget suggests lower investment, which could explain why fewer movies were produced.

In [29]:
df_movie1.groupby('release_year')['genres'].count().plot(kind='bar', figsize=(15,15), title='Total Movie Production per Year')
plt.xlabel('release year', fontsize=12)
plt.ylabel('total count', fontsize=12)

Out[29]: Text(0,0.5,'total count')

In [26]:
df_movie1.groupby('release_year')['budget_adj'].sum().plot(kind='bar', figsize=(15,15), title='Budget per Year')
plt.xlabel('release year', fontsize=12)
plt.ylabel('budget', fontsize=12)

Out[26]: Text(0,0.5,'budget')

Research Question 6: Which of the three years with the highest number of movies produced (2009, 2010, and 2011) has the highest number of different genres?

Discussion:
The aim of this question is to find the distribution of genres in each of the years with the highest movie production. This distribution may indicate which genre customers appreciate most. In 2009, movies in seventeen different genres were produced, the most of the three years. Comedy was the most produced genre in 2009, while Drama was the most produced in 2010 and 2011. With more features in the dataset, more could be learned about the demographics, cultural orientation, and socio-economic status of these customers.
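The three per-year breakdowns can also be produced in a single table. The sketch below uses `pd.crosstab` on invented rows; it is an alternative to running `value_counts()` once per year, and the names `toy`, `subset`, `counts`, `distinct`, and `top_genre` are assumptions introduced here.

```python
import pandas as pd

# Hypothetical rows with one (first-listed) genre per movie; values are invented.
toy = pd.DataFrame({
    "release_year": [2009, 2009, 2009, 2010, 2010, 2011],
    "genres": ["Comedy", "Drama", "Comedy", "Drama", "Drama", "Action"],
})

# Restrict to the three years of interest (integer comparison).
subset = toy[toy["release_year"].isin([2009, 2010, 2011])]

# Rows are years, columns are genres, cells are movie counts.
counts = pd.crosstab(subset["release_year"], subset["genres"])

# Number of distinct genres per year and the most produced genre per year.
distinct = subset.groupby("release_year")["genres"].nunique()
top_genre = counts.idxmax(axis=1)

print(counts)
print(distinct)
print(top_genre)
```

Having all years side by side makes it easier to compare both the number of genres and the leading genre across years.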
In [30]:
# Distribution of genres produced in 2009 (release_year is an integer column)
df_movie2009a = df_movie1.query('release_year == 2009')
df_movie2009b = df_movie2009a['genres'].value_counts()

In [31]:
df_movie2009b

Out[31]:
Comedy             51
Drama              40
Action             23
Horror             15
Animation          13
Adventure          12
Thriller            8
Documentary         7
Science Fiction     6
Fantasy             5
Crime               4
Music               2
Romance             2
War                 1
Mystery             1
History             1
Family              1
Name: genres, dtype: int64

In [32]:
df_movie2009b.plot(kind='bar', figsize=(15,15), title='Genres Distribution in 2009')
plt.xlabel('genre', fontsize=12)
plt.ylabel('total number of genres', fontsize=12)

Out[32]: Text(0,0.5,'total number of genres')

In [33]:
# Distribution of genres produced in 2010
df_movie2010a = df_movie1.query('release_year == 2010')
df_movie2010b = df_movie2010a['genres'].value_counts()

In [34]:
df_movie2010b

Out[34]:
Drama              61
Comedy             41
Action             35
Horror             19
Documentary        13
Adventure           9
Animation           6
Crime               5
Fantasy             4
Family              4
Thriller            3
Romance             3
Science Fiction     2
War                 1
Name: genres, dtype: int64

In [35]:
df_movie2010b.plot(kind='bar', figsize=(15,15), title='Genres Distribution in 2010')
plt.xlabel('genre', fontsize=12)
plt.ylabel('total number of genres', fontsize=12)

Out[35]: Text(0,0.5,'total number of genres')

In [36]:
# Distribution of genres produced in 2011
df_movie2011a = df_movie1.query('release_year == 2011')
df_movie2011b = df_movie2011a['genres'].value_counts()
In [37]:
df_movie2011b

Out[37]:
Drama              53
Comedy             39
Action             38
Adventure          17
Horror             16
Thriller           14
Documentary        10
Crime               9
Animation           8
Fantasy             4
Music               3
Science Fiction     2
Romance             2
Family              2
Mystery             1
History             1
Name: genres, dtype: int64

In [38]:
df_movie2011b.plot(kind='bar', figsize=(15,15), title='Genres Distribution in 2011')
plt.xlabel('genre', fontsize=12)
plt.ylabel('total number of genres', fontsize=12)

Out[38]: Text(0,0.5,'total number of genres')

Research Question 7: Which genre has the longest run time on the average?

Discussion:
The aim of this question is to understand the impact of runtime on the popularity of the genres, and in particular whether a genre with a long average runtime is less popular. On average, the War genre has the longest runtime. The counts in Question 6 above suggest that this genre is among the less popular ones. Unfortunately, the dataset is not sufficient to prove a direct relationship between popularity and runtime. More features would be needed to establish any relationship.

In [36]:
df_movie1.groupby('genres')['runtime'].mean().plot(kind='bar', figsize=(15,15), title='Average Runtime of each Genre')
plt.xlabel('genre', fontsize=12)
plt.ylabel('average runtime', fontsize=12)

Out[36]: Text(0,0.5,'average runtime')

Research Question 8: Over the years, which genres are watched the most?
Discussion:
This question looks at how patronage, measured by vote count, is distributed across the genres. A pie chart gives a sense of each genre's share of the average vote count, and a bar chart shows the total cumulative vote count. On average per movie, Western movies are the most watched, followed by Action, Science Fiction, and Adventure. Cumulatively, however, Western movies have the lowest total vote count.

In [25]:
# Average vote count per movie in each genre
df_movie1.groupby('genres')['vote_count'].mean().plot(kind='pie', figsize=(16,16), title='Average Vote Count per Genre')

Out[25]:

In [42]:
# Total vote count in each genre
df_movie1.groupby('genres')['vote_count'].sum().plot(kind='bar', figsize=(16,16), title='Total Vote Counts per Genre')

Out[42]:

Conclusions
From the analyses above, the following conclusions can be drawn. The highest patronage of movies by customers was in 2012, although investment by production companies dipped that year compared with 2011, 2010, and 2009. Over the years, the patronage of movies has grown steeply. Despite the high patronage in 2012, the industry made its highest revenue in the preceding year, 2011, while the profit index peaked in 1978. The high revenue in 2011 may be explained by the fact that the industry produced the highest number of movies that year. In 2009, movies in the highest number of different genres (seventeen) were produced, followed by 2011.
War has the longest average runtime of all the genres but is one of the least watched; Western is the least watched in total. Whether these two facts are related could not be established from the data, but I suggest that customers may not like movies with long runtimes. The most watched genres are Action, Science Fiction, and Adventure.

Further Works
More features are needed to establish the correlations among the different features in this dataset. Nevertheless, the limited features in this dataset have been helpful in providing some weak conclusions, which can help determine what kinds of features will be needed to draw strong conclusions or inferences. For instance, to gain more insight into what drives the patronage of the genres, the demographic distribution of the voters, their socioeconomic status, cultural orientation, and similar features are needed. The current features could not help in determining why Drama was the most produced genre in 2010 and 2011. Another example is the relationship between runtime and popularity: could the War genre be less popular because of its long runtime?
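A first look at the runtime question could use a simple correlation coefficient. The sketch below computes the Pearson correlation between "runtime" and "popularity" on invented values; `toy` and `r` are names introduced here, and a correlation of this kind would only describe association, not explain why a genre is more or less popular.

```python
import pandas as pd

# Hypothetical runtimes (minutes) and popularity scores; values are invented.
toy = pd.DataFrame({
    "runtime": [90, 100, 120, 150, 180],
    "popularity": [2.0, 1.8, 1.5, 1.0, 0.6],
})

# Pearson correlation between the two columns (the default method of Series.corr).
r = toy["runtime"].corr(toy["popularity"])

print(round(r, 3))
```

On the real data the same call on `df_movie1` would give a single number between -1 and 1; a value near zero would support the report's caution that runtime alone does not explain popularity.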