Frankline Kipchumba

We were trying to venture into the film industry.

4/13/25, 1:01 AM Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i… Building a Winning Movie Studio Strategy Overview This analysis explores data from Box Office Mojo. Starting with an overview of the data, we'll clean the data to ensure accuracy, create new metrics through feature engineering, and use visualizations to highlight key insights. The goal is to identify areas of focus for our company as we venture into the film industry. 1.Business Understanding In a world where original video content dominates the entertainment industry, our company is ready to take the leap into filmmaking. However, we currently lack experience in the world of cinema. To ensure our success, we must make datadriven decisions about the types of films we produce. This analysis will help us understand what makes a movie thrive in the box office. Objectives Uncover trends and insights by understanding the market. Discover success factors in terms of genres and studios. Provide recommendations and actionable insights. By leveraging insights from the box office, we aim to create a roadmap for success in this competitive industry. Let’s turn this ambitious vision into a blockbuster reality! 2.Data Understanding This dataset was sourced from Box Office Mojo. It includes different movie title data released from 2010 to 2018 including the studios they were produced in and how much gross they were able to generate. The key features include: -Title: Name of the movie. -Studio: Studio in which the movie was produced. -Domestic Gross: Amount of money earned within the country. -Foreign Gross: Amount of money earned outside the domestic market. -Year: The year in which the movie was released. https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb 2/10 4/13/25, 1:01 AM Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i… Purpose: This dataset is going to enable us to analyse Box Office Movies for trends and patterns in the movie filmmaking industry which will provide insights that are going to guide us in decision making. Tools: Python(Pandas, Matplotlib) In [1]: #import the necessary libraries and load the data import pandas as pd movie_gross_df = pd.read_csv('bom.movie_gross.csv.gz') movie_gross_df.head() title studio domestic_gross foreign_gross year 0 Toy Story 3 BV - - 2010 1 Alice in Wonderland (2010) BV - - 2010 2 Harry Potter and the Deathly Hallows Part 1 WB - - 2010 3 Inception WB - - 2010 4 Shrek Forever After P/DW - - 2010 Out[1]: In [2]: #check the structure of the data movie_gross_df.info() RangeIndex: 3387 entries, 0 to 3386 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------------------- ----0 title 3387 non-null object 1 studio 3382 non-null object 2 domestic_gross 3359 non-null float64 3 foreign_gross 2037 non-null object 4 year 3387 non-null int64 dtypes: float64(1), int64(1), object(3) memory usage: 132.4+ KB 3. Data Preparation Since the provided tables lack significant meaning on their own, it's important to address issues like missing and duplicated data to ensure the analysis is accurate and reliable. Data Cleaning Step 1: Check for Missing Values We begin by identifying columns with missing values in the dataset: https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb 3/10 4/13/25, 1:01 AM In [3]: Out[3]: Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i… #check if there is any nulll values movie_gross_df.isnull().sum() title studio domestic_gross foreign_gross year dtype: int64 - Step 2: Fill missing values Now that we have identified what columns have missing values we are going to fill them in order to have a more accurate dataset we can work with. Since the studio column only has 5 missing values we can fill them with the placeholder Unknown . For domestic_gross , we can use the mean of the other values that we have to fill the 28 values that are missing. Since foreign_gross has so much data missing, we are going to drop all the rows that have missing values in that column. Filling them could make our data a little less accurate than it should be. In [4]: #fill domestic gross null values with the mean of the values movie_gross_df['domestic_gross'] = movie_gross_df['domestic_gross'].fillna  In [5]: In [6]:  #fill missing studio values with 'Unknown' movie_gross_df['studio'] = movie_gross_df['studio'].fillna('Unknown') #drop missing values in the foreign gross column movie_gross_df = movie_gross_df.dropna(subset=['foreign_gross']) After cleaning, we need to check again just to verify that we have dealt with all the missing values: In [7]: Out[7]: #inspect if there are still any missing values movie_gross_df.isnull().sum() title studio domestic_gross foreign_gross year dtype: int64 0 0 0 0 0 Great! Now that we have no missing values in our dataset, we can remove any duplicated rows. In [8]: #drop any duplicate vales movie_gross_df = movie_gross_df.drop_duplicates() Step 3: Converting data types https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb 4/10 4/13/25, 1:01 AM Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i… One of the columns we are working with foreign_gross is a string. We are going to have to convert it to a numeric type to ensure proper calculations. We are going to have to create a new column total_gross just to see how much grossin total a movie was able to generate. We can only do that by adding domestic_gross and foreign_gross together and that is possible only if they are both numeric types. In [9]: #change foreign gross from a string to a number movie_gross_df['foreign_gross'] = pd.to_numeric(movie_gross_df['foreign_gr  In [10]:  #create new column 'total gross' movie_gross_df['total_gross'] = movie_gross_df['foreign_gross'] + movie_gr movie_gross_df.head()   title studio domestic_gross foreign_gross year total_gross 0 Toy Story 3 BV - - 2010 -e+09 1 Alice in Wonderland (2010) BV - - 2010 -e+09 2 Harry Potter and the Deathly Hallows Part 1 WB - - 2010 -e+08 3 Inception WB - - 2010 -e+08 4 Shrek Forever After P/DW - - 2010 -e+08 Out[10]: Awesome! Our data is now clean and ready for further analysis. 4. Analysis and Visualization This section focuses on analyzing and visualizing the data to gain insights into studio performance and box office trends over time. Studio Performance by Total Gross Revenue We analyze how different studios have performed in terms of total revenue. We can filter it to the top 10 studios by total gross revenue and visualized it in a bar chart. In [11]: #how different studios have performed in terms of total revenue import matplotlib.pyplot as plt studio_gross = movie_gross_df.groupby('studio')['total_gross'].sum().sort_ studio_gross.plot(kind='bar', figsize=(10, 6), color='blue') plt.title('Top 10 studios by total gross revenue') l l b l(' di ') https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb 5/10 4/13/25, 1:01 AM Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i… plt.xlabel('Studio') plt.ylabel('Total gross revenue') plt.show()   Key Insight: From the bar chart above, we can clearly see that BV , Fox , and WB are the top three studios, generating significantly higher total revenue compared to others. Yearly trend in Average Gross Revenue To understand the average box office revenue trends over time, we analyze the yearly average gross revenue and visualize it with a line chart. In [12]: #yearly trends in box office yearly_gross = movie_gross_df.groupby('year')['total_gross'].mean() yearly_gross.plot(kind='line', figsize=(10, 6), marker='o', color='green') plt.title('Average Gross Revenue Over time') plt.xlabel('Year') plt.ylabel('Average Gross Revenue') plt.show()  https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb  6/10 4/13/25, 1:01 AM Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i… Key Insight: There is a noticeable upward trend in average gross revenue over the years, indicating growth in the film industry. Genre Analysis and Runtime Distribution This section focuses on analyzing movie genres and their average ratings, as well as examining the runtime distribution of genres by region. These insights will help us better understand the characteristics of popular genres and their performance across different regions. In [13]: In [14]: In [15]: #import necessary libraries import pandas as pd import sqlite3 import matplotlib.pyplot as plt import seaborn as sns #load the data conn = sqlite3.connect('im.db') #check out the most popular genres in terms of number of votes q = """ SELECT mb.genres, mr.numvotes, mr.averagerating FROM movie_basics mb JOIN movie_ratings mr ON mb.movie_id = mr.movie_id ORDER BY numvotes DESC LIMIT 10 ; """ df_sql = pd.read_sql(q, conn) df_sql Out[15]: 0 genres numvotes averagerating Action,Adventure,Sci-Fi - 8.8 https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb 7/10 4/13/25, 1:01 AM In [16]: Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i… 1 Action,Thriller - 8.4 2 Adventure,Drama,Sci-Fi - 8.6 3 Drama,Western - 8.4 4 Action,Adventure,Sci-Fi - 8.1 5 Biography,Crime,Drama - 8.2 6 Mystery,Thriller - 8.1 7 Action,Adventure,Comedy 948394 8.1 8 Action,Adventure,Comedy 820847 8.0 9 Action,Adventure,Sci-Fi 795227 7.2 df_sql.groupby('genres').size().plot(kind='barh', color=sns.color_palette( plt.gca().spines[['top', 'right',]].set_visible(False)  In [17]:  # check runtime distribution of different genres by region q = """ SELECT mb.genres, mb.runtime_minutes, ma.region FROM movie_basics mb JOIN movie_akas ma ON mb.movie_id = ma.movie_id ORDER BY runtime_minutes DESC LIMIT 10 ; """ df_sql = pd.read_sql(q, conn) df_sql genres runtime_minutes region 0 Documentary 51420.0 None 1 Documentary 51420.0 SE 2 Documentary 51420.0 SE Out[17]: https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb 8/10 4/13/25, 1:01 AM In [13]: Getting-started-in-the-movie-industry/Phase 2 project-checkpoint.ipynb at main · Kipchumba254/Getting-started-in-the-movie-i… 3 Documentary 51420.0 US 4 Documentary 14400.0 FI 5 Documentary 6017.0 IN 6 Documentary,Drama 6000.0 IT 7 Comedy,Drama,Mystery 5460.0 GB 8 Drama,Western 4980.0 DE 9 Comedy,Drama 4200.0 None #plot a bar graph to vizualize runtime by genres i different regions plt.figure(figsize=(10,8)) colors = sns.color_palette("Dark2") bar_plot = plt.bar(df_sql['genres'], df_sql['runtime_minutes'], color=colo plt.xticks(rotation=90) plt.xlabel('genres') plt.ylabel('runtime_minutes') plt.title('Distribution of Runtimes of Genres by Region') plt.legend(title="Region") plt.show()  https://github.com/Kipchumba254/Getting-started-in-the-movie-industry/blob/main/Phase 2 project-checkpoint.ipynb  9/10

Scheduled maintenance