We were trying to venture into the film industry.
4/13/25, 1:01 AM
Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i…
Building a Winning Movie Studio
Strategy
Overview
This analysis explores data from Box Office Mojo. Starting with an overview of the
data, we'll clean the data to ensure accuracy, create new metrics through feature
engineering, and use visualizations to highlight key insights. The goal is to identify
areas of focus for our company as we venture into the film industry.
1.Business Understanding
In a world where original video content dominates the entertainment industry, our
company is ready to take the leap into filmmaking. However, we currently lack
experience in the world of cinema. To ensure our success, we must make datadriven decisions about the types of films we produce. This analysis will help us
understand what makes a movie thrive in the box office.
Objectives
Uncover trends and insights by understanding the market.
Discover success factors in terms of genres and studios.
Provide recommendations and actionable insights.
By leveraging insights from the box office, we aim to create a roadmap for success
in this competitive industry. Let’s turn this ambitious vision into a blockbuster
reality!
2.Data Understanding
This dataset was sourced from Box Office Mojo. It includes different movie title
data released from 2010 to 2018 including the studios they were produced in and
how much gross they were able to generate.
The key features include:
-Title: Name of the movie.
-Studio: Studio in which the movie was produced.
-Domestic Gross: Amount of money earned within the country.
-Foreign Gross: Amount of money earned outside the domestic market.
-Year: The year in which the movie was released.
https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb
2/10
4/13/25, 1:01 AM
Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i…
Purpose:
This dataset is going to enable us to analyse Box Office Movies for trends and
patterns in the movie filmmaking industry which will provide insights that are
going to guide us in decision making.
Tools: Python(Pandas, Matplotlib)
In [1]:
#import the necessary libraries and load the data
import pandas as pd
movie_gross_df = pd.read_csv('bom.movie_gross.csv.gz')
movie_gross_df.head()
title
studio
domestic_gross
foreign_gross
year
0
Toy Story 3
BV
-
-
2010
1
Alice in Wonderland (2010)
BV
-
-
2010
2
Harry Potter and the Deathly
Hallows Part 1
WB
-
-
2010
3
Inception
WB
-
-
2010
4
Shrek Forever After
P/DW
-
-
2010
Out[1]:
In [2]:
#check the structure of the data
movie_gross_df.info()
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
#
Column
Non-Null Count Dtype
--- ------------------- ----0
title
3387 non-null
object
1
studio
3382 non-null
object
2
domestic_gross 3359 non-null
float64
3
foreign_gross
2037 non-null
object
4
year
3387 non-null
int64
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB
3. Data Preparation
Since the provided tables lack significant meaning on their own, it's important to
address issues like missing and duplicated data to ensure the analysis is accurate
and reliable.
Data Cleaning
Step 1: Check for Missing Values
We begin by identifying columns with missing values in the dataset:
https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb
3/10
4/13/25, 1:01 AM
In [3]:
Out[3]:
Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i…
#check if there is any nulll values
movie_gross_df.isnull().sum()
title
studio
domestic_gross
foreign_gross
year
dtype: int64
-
Step 2: Fill missing values
Now that we have identified what columns have missing values we are going to fill
them in order to have a more accurate dataset we can work with. Since the
studio column only has 5 missing values we can fill them with the placeholder
Unknown . For domestic_gross , we can use the mean of the other values that
we have to fill the 28 values that are missing. Since foreign_gross has so much
data missing, we are going to drop all the rows that have missing values in that
column. Filling them could make our data a little less accurate than it should be.
In [4]:
#fill domestic gross null values with the mean of the values
movie_gross_df['domestic_gross'] = movie_gross_df['domestic_gross'].fillna
In [5]:
In [6]:
#fill missing studio values with 'Unknown'
movie_gross_df['studio'] = movie_gross_df['studio'].fillna('Unknown')
#drop missing values in the foreign gross column
movie_gross_df = movie_gross_df.dropna(subset=['foreign_gross'])
After cleaning, we need to check again just to verify that we have dealt with all the
missing values:
In [7]:
Out[7]:
#inspect if there are still any missing values
movie_gross_df.isnull().sum()
title
studio
domestic_gross
foreign_gross
year
dtype: int64
0
0
0
0
0
Great! Now that we have no missing values in our dataset, we can remove any
duplicated rows.
In [8]:
#drop any duplicate vales
movie_gross_df = movie_gross_df.drop_duplicates()
Step 3: Converting data types
https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb
4/10
4/13/25, 1:01 AM
Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i…
One of the columns we are working with foreign_gross is a string. We are
going to have to convert it to a numeric type to ensure proper calculations. We are
going to have to create a new column total_gross just to see how much
grossin total a movie was able to generate. We can only do that by adding
domestic_gross and foreign_gross together and that is possible only if they
are both numeric types.
In [9]:
#change foreign gross from a string to a number
movie_gross_df['foreign_gross'] = pd.to_numeric(movie_gross_df['foreign_gr
In [10]:
#create new column 'total gross'
movie_gross_df['total_gross'] = movie_gross_df['foreign_gross'] + movie_gr
movie_gross_df.head()
title
studio
domestic_gross
foreign_gross
year
total_gross
0
Toy Story 3
BV
-
-
2010
-e+09
1
Alice in
Wonderland
(2010)
BV
-
-
2010
-e+09
2
Harry Potter and
the Deathly
Hallows Part 1
WB
-
-
2010
-e+08
3
Inception
WB
-
-
2010
-e+08
4
Shrek Forever
After
P/DW
-
-
2010
-e+08
Out[10]:
Awesome! Our data is now clean and ready for further analysis.
4. Analysis and Visualization
This section focuses on analyzing and visualizing the data to gain insights into
studio performance and box office trends over time.
Studio Performance by Total Gross Revenue
We analyze how different studios have performed in terms of total revenue. We can
filter it to the top 10 studios by total gross revenue and visualized it in a bar chart.
In [11]:
#how different studios have performed in terms of total revenue
import matplotlib.pyplot as plt
studio_gross = movie_gross_df.groupby('studio')['total_gross'].sum().sort_
studio_gross.plot(kind='bar', figsize=(10, 6), color='blue')
plt.title('Top 10 studios by total gross revenue')
l
l b l('
di ')
https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb
5/10
4/13/25, 1:01 AM
Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i…
plt.xlabel('Studio')
plt.ylabel('Total gross revenue')
plt.show()
Key Insight:
From the bar chart above, we can clearly see that BV , Fox , and WB are the top
three studios, generating significantly higher total revenue compared to others.
Yearly trend in Average Gross Revenue
To understand the average box office revenue trends over time, we analyze the
yearly average gross revenue and visualize it with a line chart.
In [12]:
#yearly trends in box office
yearly_gross = movie_gross_df.groupby('year')['total_gross'].mean()
yearly_gross.plot(kind='line', figsize=(10, 6), marker='o', color='green')
plt.title('Average Gross Revenue Over time')
plt.xlabel('Year')
plt.ylabel('Average Gross Revenue')
plt.show()
https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb
6/10
4/13/25, 1:01 AM
Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i…
Key Insight:
There is a noticeable upward trend in average gross revenue over the years,
indicating growth in the film industry.
Genre Analysis and Runtime
Distribution
This section focuses on analyzing movie genres and their average ratings, as well as
examining the runtime distribution of genres by region. These insights will help us
better understand the characteristics of popular genres and their performance
across different regions.
In [13]:
In [14]:
In [15]:
#import necessary libraries
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
#load the data
conn = sqlite3.connect('im.db')
#check out the most popular genres in terms of number of votes
q = """
SELECT mb.genres, mr.numvotes, mr.averagerating
FROM movie_basics mb
JOIN movie_ratings mr ON mb.movie_id = mr.movie_id
ORDER BY numvotes DESC
LIMIT 10
;
"""
df_sql = pd.read_sql(q, conn)
df_sql
Out[15]:
0
genres
numvotes
averagerating
Action,Adventure,Sci-Fi
-
8.8
https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb
7/10
4/13/25, 1:01 AM
In [16]:
Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i…
1
Action,Thriller
-
8.4
2
Adventure,Drama,Sci-Fi
-
8.6
3
Drama,Western
-
8.4
4
Action,Adventure,Sci-Fi
-
8.1
5
Biography,Crime,Drama
-
8.2
6
Mystery,Thriller
-
8.1
7
Action,Adventure,Comedy
948394
8.1
8
Action,Adventure,Comedy
820847
8.0
9
Action,Adventure,Sci-Fi
795227
7.2
df_sql.groupby('genres').size().plot(kind='barh', color=sns.color_palette(
plt.gca().spines[['top', 'right',]].set_visible(False)
In [17]:
# check runtime distribution of different genres by region
q = """
SELECT mb.genres, mb.runtime_minutes, ma.region
FROM movie_basics mb
JOIN movie_akas ma ON mb.movie_id = ma.movie_id
ORDER BY runtime_minutes DESC
LIMIT 10
;
"""
df_sql = pd.read_sql(q, conn)
df_sql
genres
runtime_minutes
region
0
Documentary
51420.0
None
1
Documentary
51420.0
SE
2
Documentary
51420.0
SE
Out[17]:
https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb
8/10
4/13/25, 1:01 AM
In [13]:
Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i…
3
Documentary
51420.0
US
4
Documentary
14400.0
FI
5
Documentary
6017.0
IN
6
Documentary,Drama
6000.0
IT
7
Comedy,Drama,Mystery
5460.0
GB
8
Drama,Western
4980.0
DE
9
Comedy,Drama
4200.0
None
#plot a bar graph to vizualize runtime by genres i different regions
plt.figure(figsize=(10,8))
colors = sns.color_palette("Dark2")
bar_plot = plt.bar(df_sql['genres'], df_sql['runtime_minutes'], color=colo
plt.xticks(rotation=90)
plt.xlabel('genres')
plt.ylabel('runtime_minutes')
plt.title('Distribution of Runtimes of Genres by Region')
plt.legend(title="Region")
plt.show()
https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb
9/10