Shankar Rajput | Freelancer Netflix Data Analysis Project

Netflix Data Analysis Project

1/5/25, 3:44 PM swati movies data project In [1]: # importing lib. import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns In [2]: df = pd.read_csv('mymoviedb.csv', lineterminator='\n') df.head() Out[2]: 0 1 2 3 4 Release_Date Title Overview Popularity Vote_Count Vote_Average - SpiderMan: No Way Home Peter Parker is unmasked and no longer able to... - 8940 8.3 The Batman In his second year of fighting crime, Batman u... - 1151 8.1 No Exit Stranded at a rest stop in the mountains durin... - 122 6.3 - Encanto The tale of an extraordinary family, the Madri... - 5076 7.7 - The King's Man As a collection of history's worst tyrants and... - 1793 7.0 - - Original_ In [3]: # viewing dataset info df.info() file:///C:/Users/swati/Downloads/swati movies data project.html 1/13 1/5/25, 3:44 PM swati movies data project RangeIndex: 9827 entries, 0 to 9826 Data columns (total 9 columns): # Column Non-Null Count --- ------------------0 Release_Date 9827 non-null 1 Title 9827 non-null 2 Overview 9827 non-null 3 Popularity 9827 non-null 4 Vote_Count 9827 non-null 5 Vote_Average 9827 non-null 6 Original_Language 9827 non-null 7 Genre 9827 non-null 8 Poster_Url 9827 non-null dtypes: float64(2), int64(1), object(6) memory usage: 691.1+ KB Dtype ----object object object float64 int64 float64 object object object • looks like our dataset has no NaNs! • Overview, Original_Language and Poster-Url wouldn't be so useful during analysis • Release_Date column needs to be casted into date time and to extract only the year value In [8]: # exploring genres column df['Genre'].head() Out[8]: 0 Action, Adventure, Science Fiction 1 Crime, Mystery, Thriller 2 Thriller 3 Animation, Comedy, Family, Fantasy 4 Action, Adventure, Thriller, War Name: Genre, dtype: object • genres are saperated by commas followed by whitespaces. In [11]: # check for duplicated rows df.duplicated().sum() Out[11]: 0 • our dataset has no duplicated rows either. In [15]: # exploring summary statistics df.describe() file:///C:/Users/swati/Downloads/swati movies data project.html 2/13 1/5/25, 3:44 PM swati movies data project Popularity Vote_Count Vote_Average count - - - mean - - - std - - - min - - - 25% - - - 50% - - - 75% - - - max - - - Out[15]: In [ ]: • Exploration Summary • • • • • • • we have a dataframe consisting of 9827 rows and 9 columns. our dataset looks a bit tidy with no NaNs nor duplicated values. Release_Date column needs to be casted into date time and to extract only the Overview, Original_Languege and Poster-Url wouldn't be so useful during analys there is noticable outliers in Popularity column Vote_Average bettter be categorised for proper analysis. Genre column has comma saperated values and white spaces that needs to be hand In [18]: # Data Cleaning Casting Release_Date column and extracing year values In [21]: df.head() file:///C:/Users/swati/Downloads/swati movies data project.html 3/13 1/5/25, 3:44 PM swati movies data project Out[21]: 0 1 2 3 4 Release_Date Title Overview Popularity Vote_Count Vote_Average - SpiderMan: No Way Home Peter Parker is unmasked and no longer able to... - 8940 8.3 The Batman In his second year of fighting crime, Batman u... - 1151 8.1 No Exit Stranded at a rest stop in the mountains durin... - 122 6.3 - Encanto The tale of an extraordinary family, the Madri... - 5076 7.7 - The King's Man As a collection of history's worst tyrants and... - 1793 7.0 - - Original_ In [23]: # casting column a df['Release_Date'] = pd.to_datetime(df['Release_Date']) # confirming changes print(df['Release_Date'].dtypes) datetime64[ns] In [25]: df['Release_Date'] = df['Release_Date'].dt.year df['Release_Date'].dtypes Out[25]: dtype('int32') In [27]: df.info() file:///C:/Users/swati/Downloads/swati movies data project.html 4/13 1/5/25, 3:44 PM swati movies data project RangeIndex: 9827 entries, 0 to 9826 Data columns (total 9 columns): # Column Non-Null Count --- ------------------0 Release_Date 9827 non-null 1 Title 9827 non-null 2 Overview 9827 non-null 3 Popularity 9827 non-null 4 Vote_Count 9827 non-null 5 Vote_Average 9827 non-null 6 Original_Language 9827 non-null 7 Genre 9827 non-null 8 Poster_Url 9827 non-null dtypes: float64(2), int32(1), int64(1), memory usage: 652.7+ KB Dtype ----int32 object object float64 int64 float64 object object object object(5) In [29]: df.head() Out[29]: 0 1 2 3 4 Release_Date Title Overview Popularity Vote_Count Vote_Average 2021 SpiderMan: No Way Home Peter Parker is unmasked and no longer able to... - 8940 8.3 The Batman In his second year of fighting crime, Batman u... - 1151 8.1 No Exit Stranded at a rest stop in the mountains durin... - 122 6.3 2021 Encanto The tale of an extraordinary family, the Madri... - 5076 7.7 2021 The King's Man As a collection of history's worst tyrants and... - 1793 7.0 2022 2022 Original_ Dropping Overview, Original_Languege and Poster-Url In [32]: # making list of column to be dropped cols = ['Overview', 'Original_Language', 'Poster_Url'] file:///C:/Users/swati/Downloads/swati movies data project.html 5/13 1/5/25, 3:44 PM swati movies data project # dropping columns and confirming changes df.drop(cols, axis = 1, inplace = True) df.columns Out[32]: Index(['Release_Date', 'Title', 'Popularity', 'Vote_Count', 'Vote_Average', 'Genre'], dtype='object') In [34]: df.head() Release_Date Title Popularity Vote_Count Vote_Average Genre 0 2021 SpiderMan: No Way Home - 8940 8.3 Action, Adventure, Science Fiction 1 2022 The Batman - 1151 8.1 Crime, Mystery, Thriller 2 2022 No Exit - 122 6.3 Thriller 7.7 Animation, Comedy, Family, Fantasy 7.0 Action, Adventure, Thriller, War Out[34]: 3 4 2021 Encanto 2021 The King's Man - - 5076 1793 categorizing Vote_Average column We would cut the Vote_Average values and make 4 categories: popular average below_avg not_popular to describe it more using catigorize_col() function provided above. In [37]: def catigorize_col (df, col, labels): """ catigorizes a certain column based on its quartiles Args: (df) df - dataframe we are proccesing (col) str - to be catigorized column's name (labels) list - list of labels from min to max Returns: (df) """ df - dataframe with the categorized col # setting the edges to cut the column accordingly edges = [df[col].describe()['min'], df[col].describe()['25%'], df[col].describe()['50%'], df[col].describe()['75%'], df[col].describe()['max']] file:///C:/Users/swati/Downloads/swati movies data project.html 6/13 1/5/25, 3:44 PM swati movies data project df[col] = pd.cut(df[col], edges, labels = labels, duplicates='drop') return df In [39]: # define labels for edges labels = ['not_popular', 'below_avg', 'average', 'popular'] # categorize column based on labels and edges catigorize_col(df, 'Vote_Average', labels) # confirming changes df['Vote_Average'].unique() Out[39]: ['popular', 'below_avg', 'average', 'not_popular', NaN] Categories (4, object): ['not_popular' < 'below_avg' < 'average' < 'popular'] In [41]: df.head() Release_Date Title Popularity Vote_Count Vote_Average Genre 0 2021 SpiderMan: No Way Home - 8940 popular Action, Adventure, Science Fiction 1 2022 The Batman - 1151 popular Crime, Mystery, Thriller 2 2022 No Exit - 122 below_avg Thriller popular Animation, Comedy, Family, Fantasy average Action, Adventure, Thriller, War Out[41]: 3 4 2021 Encanto 2021 The King's Man - - 5076 1793 In [43]: # exploring column df['Vote_Average'].value_counts() Out[43]: Vote_Average not_popular 2467 popular 2450 average 2412 below_avg 2398 Name: count, dtype: int64 In [45]: # dropping NaNs df.dropna(inplace = True) # confirming df.isna().sum() Out[45]: Release_Date Title Popularity Vote_Count Vote_Average Genre dtype: int64 0 0 0 0 0 0 file:///C:/Users/swati/Downloads/swati movies data project.html 7/13 1/5/25, 3:44 PM swati movies data project In [47]: df.head() Release_Date Title Popularity Vote_Count Vote_Average Genre 0 2021 SpiderMan: No Way Home - 8940 popular Action, Adventure, Science Fiction 1 2022 The Batman - 1151 popular Crime, Mystery, Thriller 2 2022 No Exit - 122 below_avg Thriller 3 2021 Encanto - 5076 popular Animation, Comedy, Family, Fantasy 4 2021 The King's Man - 1793 average Action, Adventure, Thriller, War Out[47]: we'd split genres into a list and then explode our dataframe to have only one genre per row for ezch movie In [52]: # split the strings into lists df['Genre'] = df['Genre'].str.split(', ') # explode the lists df = df.explode('Genre').reset_index(drop=True) df.head() Release_Date Title Popularity Vote_Count Vote_Average Genre 0 2021 Spider-Man: No Way Home - 8940 popular Action 1 2021 Spider-Man: No Way Home - 8940 popular Adventure 2 2021 Spider-Man: No Way Home - 8940 popular Science Fiction 3 2022 The Batman - 1151 popular Crime 4 2022 The Batman - 1151 popular Mystery Out[52]: In [55]: # casting column into category df['Genre'] = df['Genre'].astype('category') # confirming changes df['Genre'].dtypes file:///C:/Users/swati/Downloads/swati movies data project.html 8/13 1/5/25, 3:44 PM Out[55]: swati movies data project CategoricalDtype(categories=['Action', 'Adventure', 'Animation', 'Comedy', 'Cri me', 'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western'], , ordered=False, categories_dtype=object) In [57]: df.info() RangeIndex: 25552 entries, 0 to 25551 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------------------- ----0 Release_Date 25552 non-null int32 1 Title 25552 non-null object 2 Popularity 25552 non-null float64 3 Vote_Count 25552 non-null int64 4 Vote_Average 25552 non-null category 5 Genre 25552 non-null category dtypes: category(2), float64(1), int32(1), int64(1), object(1) memory usage: 749.6+ KB In [59]: df.nunique() Out[59]: Release_Date Title Popularity Vote_Count Vote_Average Genre dtype: int64 - Now that our dataset is clean and tidy, we are left with a total of 6 columns and 25551 rows to dig into during our analysis Data Visualization here, we'd use Matplotlib and seaborn for making some informative visuals to gain insights abut our data. In [62]: # setting up seaborn configurations sns.set_style('whitegrid') Q1: What is the most frequent genre in the dataset? In [65]: # showing stats. on genre column df['Genre'].describe() file:///C:/Users/swati/Downloads/swati movies data project.html 9/13 1/5/25, 3:44 PM Out[65]: swati movies data project count 25552 unique 19 top Drama freq 3715 Name: Genre, dtype: object In [67]: # visualizing genre column sns.catplot(y = 'Genre', data = df, kind = 'count', order = df['Genre'].value_counts().index, color = '#4287f5') plt.title('genre column distribution') plt.show() we can notice from the above visual that Drama genre is the most frequent genre in our dataset and has appeared more than 14% of the times among 19 other genres. Q2: What genres has highest votes ? In [71]: # visualizing vote_average column sns.catplot(y = 'Vote_Average', data = df, kind = 'count', order = df['Vote_Average'].value_counts().index, color = '#4287f5') plt.title('votes destribution') plt.show() file:///C:/Users/swati/Downloads/swati movies data project.html 10/13 1/5/25, 3:44 PM swati movies data project Q3: What movie got the highest popularity ? what's its genre ? In [74]: # checking max popularity in dataset df[df['Popularity'] == df['Popularity'].max()] Release_Date Title Popularity Vote_Count Vote_Average Genre 0 2021 Spider-Man: No Way Home - 8940 popular Action 1 2021 Spider-Man: No Way Home - 8940 popular Adventure 2 2021 Spider-Man: No Way Home - 8940 popular Science Fiction Out[74]: Q4: What movie got the lowest popularity? what's its genre? In [86]: # checking max popularity in dataset df[df['Popularity'] == df['Popularity'].min()] file:///C:/Users/swati/Downloads/swati movies data project.html 11/13 1/5/25, 3:44 PM swati movies data project Release_Date Title Popularity Vote_Count Vote_Average Genre 25546 2021 The United States vs. Billie Holiday 13.354 152 average Music 25547 2021 The United States vs. Billie Holiday 13.354 152 average Drama 25548 2021 The United States vs. Billie Holiday 13.354 152 average History 25549 1984 Threads 13.354 186 popular War 25550 1984 Threads 13.354 186 popular Drama 25551 1984 Threads 13.354 186 popular Science Fiction Out[86]: Q5: Which year has the most filmmed movies? In [82]: df['Release_Date'].hist() plt.title('Release_Date column distribution') plt.show() Conclusion file:///C:/Users/swati/Downloads/swati movies data project.html 12/13 1/5/25, 3:44 PM swati movies data project Q1: What is the most frequent genre in the dataset? Drama genre is the most frequent genre in our dataset and has appeared more than 14% of the times among 19 other genres. Q2: What genres has highest votes ? we have 25.5% of our dataset with popular vote (6520 rows). Drama again gets the highest popularity among fans by being having more than 18.5% of movies popularities. Q3: What movie got the highest popularity ? what's its genre ? Spider-Man: No Way Home has the highest popularity rate in our dataset and it has genres of Action , Adventure and Sience Fiction . Q3: What movie got the lowest popularity ? what's its genre ? The united states, thread' has the highest lowest rate in our dataset and it has genres of music , drama , 'war', 'sci-fi' and history`. Q4: Which year has the most filmmed movies? year 2020 has the highest filmming rate in our dataset. In [ ]: file:///C:/Users/swati/Downloads/swati movies data project.html 13/13