Business Report - 3
PG Program in Data Science and
Business Analytics
submitted by
Sangram Keshari Patro
BATCH:PGPDSBA.O.AUG24.B
Contents
1 Objective
2 Data Description
3 Data Overview
  3.1 Importing necessary libraries and the dataset
  3.2 Structure and type of data
  3.3 Statistical summary
4 Exploratory Data Analysis
  4.1 Univariate Analysis
    4.1.1 Numerical columns
    4.1.2 Categorical columns
  4.2 Bivariate Analysis
    4.2.1 Numerical variables
    4.2.2 Categorical vs numerical variables
5 Data preprocessing and Model building
  5.1 Model 1
  5.2 Testing the assumptions of linear regression model
    5.2.1 No Multicollinearity
    5.2.2 TEST FOR LINEARITY AND INDEPENDENCE
    5.2.3 TEST FOR NORMALITY for Model 3
    5.2.4 TEST FOR HOMOSCEDASTICITY
  5.3 Model 3
  5.4 Final model (Model 4) performance evaluation
6 Actionable Insights & Recommendations
List of Figures
1 Dataframe
2 Table depicting the datatype and non-null values in each column
3 Statistical summary of the data
4 Histogram and boxplot of 'views_content' column
5 Histogram and boxplot of 'views_trailer' column
6 Histogram and boxplot of 'visitors' column
7 Histogram and boxplot of 'ad_impression' column
8 Barchart of 'genre', 'season', 'dayofweek' and 'major_sports_event' columns
9 Heatmap of all numerical variables
10 Pairplot of all numerical variables
11 Boxplot and histplot for various aspects across different genres
12 Boxplot and histplot for various aspects across different days of week
13 Boxplot and histplot for various aspects across different seasons
14 Model 1
15 Model 2
16 Residual plots
17 Test for normality for Model 3
18 Model 3
19 Model 4
20 Test for normality and homoscedasticity for final model 4
List of Tables
1 Comparison of Models based on AIC and BIC
2 Comparison of Models based on AIC and BIC
1 Objective
ShowTime is an OTT service provider and offers a wide variety of content (movies, web shows, etc.) for its users. They want to determine the driver variables for first-day content viewership so that they can take necessary measures to improve the viewership of the content on their platform. Some of the reasons for the decline in viewership of content would be the decline in the number of people coming to the platform, decreased marketing spend, content timing clashes, weekends and holidays, etc. They have hired you as a Data Scientist, shared the data of the current content in their platform, and asked you to analyze the data and come up with a linear regression model to determine the driving factors for first-day viewership.
2 Data Description
The data contains the different factors to analyze for the content. The detailed data dictionary is given below:
1. visitors: Average number of visitors, in millions, to the platform in the past week.
2. ad_impressions: Number of ad impressions, in millions, across all ad campaigns for the content (both
running and completed).
3. major_sports_event: Indicates if there was any major sports event on the day.
4. genre: Genre of the content.
5. dayofweek: Day of the week on which the content was released.
6. season: Season during which the content was released.
7. views_trailer: Number of views, in millions, of the content trailer.
8. views_content: Number of first-day views, in millions, of the content.
3 Data Overview
3.1 Importing necessary libraries and the dataset
The dataframe is printed. It has 1000 rows & 8 columns.
Figure 1: Dataframe
3.2 Structure and type of data
Data is explored further. Data doesn't have any duplicate rows.
Figure 2: Table depicting the datatype and Non-Null values in each column.
3.3 Statistical summary
Figure 3: Statistical summary of the data
From this table, we can observe that some of the columns contain outliers.
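The outlier screening implied by the summary table and the boxplots that follow uses the standard 1.5×IQR (boxplot whisker) rule. A minimal sketch of that check — on synthetic stand-in data, since the original dataset is not reproduced in this report:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in columns (names taken from the data dictionary).
df = pd.DataFrame({
    "visitors": rng.normal(1.7, 0.1, 1000),          # roughly symmetric
    "views_trailer": rng.lognormal(4.0, 0.4, 1000),  # right-skewed, like the real column
})

def iqr_outlier_count(s: pd.Series) -> int:
    """Count points outside the 1.5*IQR whiskers (the boxplot rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

print({col: iqr_outlier_count(df[col]) for col in df.columns})
```

Columns with a nonzero count under this rule are the ones flagged as containing outliers.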
4 Exploratory Data Analysis
4.1 Univariate Analysis
4.1.1 Numerical columns
views_content
Figure 4: Histogram and boxplot of 'views_content' column
Observations:
The distribution of first-day views (views_content) is approximately normal with a slight positive skew. The mean first-day views are around 0.47 million. The median is close to the mean, indicating a roughly symmetric distribution. The range of first-day views spans from approximately 0.2 million to 0.9 million, with some content achieving exceptionally high viewership.
Insights:
Most content has moderate viewership on the first day, with a few notable outliers. ShowTime could analyze high-performing content to identify factors driving higher viewership, such as genre or marketing, to apply these insights to other content.
views_trailer
Figure 5: Histogram and boxplot of 'views_trailer' column
Observations:
The distribution of trailer views is positively skewed. Mean trailer views are around 67 million, with the median around 54 million, which is lower than the mean. Trailer views range from 30 million to about 200 million, with most values between 40 and 70 million.
Insights:
The consistency in trailer views suggests that trailers generate stable interest among viewers. Increasing the visibility or promotion of trailers may help drive higher anticipation and, consequently, first-day views.
visitors
Figure 6: Histogram and boxplot of visitors column
Observations:
The visitors distribution has a mean of approximately 1.7 million visitors to the platform in the past week. The median is close to the mean. Visitor counts range from 1.4 million to 2 million, suggesting consistent platform traffic in the past week. There are no outliers in this column.
Insights:
ShowTime has a steady visitor base, but not all visitors engage with new content. Encouraging these visitors to view new releases, possibly through notifications or recommendations, could significantly increase first-day views.
ad_impression
Figure 7: Histogram and boxplot of 'ad_impression' column
Observations:
The ad impression distribution has a mean of approximately 1411 million impressions across all campaigns. The median is close to the mean. There are no outliers in this column.
4.1.2 Categorical columns
(a) 'genre' (b) 'dayofweek' (c) 'season' (d) 'major_sports_event'
Figure 8: Barchart of 'genre', 'season', 'dayofweek' and 'major_sports_event' columns
Observations
1. The day of the week bar chart indicates that Friday has the highest count, followed by Wednesday. Tuesday and Monday have the lowest counts, i.e. most of the content is released on Friday and Wednesday.
2. The season bar chart shows a relatively even distribution across all seasons, with Winter having the highest count, closely followed by Fall, Spring, and Summer.
3. The contents by ShowTime feature various genres, all of which have roughly equal weightage, with Comedy content being released the most frequently.
Insights
1. The company may consider focusing marketing efforts more heavily on weekends as well, apart from Fridays, to maximize audience engagement.
2. The even distribution of counts across seasons indicates that activities or events are spread throughout the year. This suggests that seasonal trends have minimal impact on the data, so the company may maintain a consistent approach throughout the year.
3. The content released on days when major sports events occur should be minimized, as it currently accounts for 40% of releases.
4.2 Bivariate Analysis
4.2.1 Numerical variables
Heatmap
Figure 9: Heatmap of all numerical variables
Key Insights:
(a) views_content has a strong positive correlation with views_trailer (0.75), suggesting that customers who watch trailers are highly likely to watch the content as well.
(b) The average number of visitors to the platform last week shows a moderate positive correlation (0.24) with views_content.
(c) Interestingly, ad_impressions does not show a strong correlation with views_content, highlighting the need to reconsider the type or platform of advertisements being used.
Pairplot
Figure 10: Pairplot of all numerical variables
This graph shows that customers who watch the trailer mostly prefer to watch the content too, as the correlation is high.
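The values shown in the heatmap are pairwise Pearson correlations, which pandas computes directly. A minimal sketch — the data here is synthetic (built so that trailer and content views correlate), standing in for the ShowTime dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 1000
views_trailer = rng.lognormal(4.0, 0.4, n)
df = pd.DataFrame({
    "views_trailer": views_trailer,
    # Content views constructed to correlate strongly with trailer views.
    "views_content": 0.005 * views_trailer + rng.normal(0.4, 0.05, n),
    "visitors": rng.normal(1.7, 0.1, n),
})

# Pairwise Pearson correlation matrix, the input to the heatmap.
corr = df.corr()
print(corr.round(2))
```

A seaborn call such as `sns.heatmap(corr, annot=True)` would then render the figure; the matrix itself carries the insight.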
4.2.2 Categorical vs numerical variables
'genre' vs 'views_content'
Figure 11: Boxplot and histplot for various aspects across different genres
Observations:
The distributions across different genres are quite similar, exhibiting a right-skewed pattern. These distributions generally peak around 0.4 to 0.5 million view counts, with a range between 0.3 and 0.8 million view counts. Among all genres, Sci-Fi has the highest mean and median content views, while the Thriller genre has the lowest content views.
Insights:
ShowTime should focus on adding more Sci-Fi content to their platform, as it attracts higher viewership, and consider producing similar content. Additionally, since ad impressions are lowest for the Thriller genre, resulting in minimal content views, ShowTime should work on strategies to boost the viewership of Thriller movies on their platform. Ad impressions for Sci-Fi are significantly higher compared to other genres, resulting in increased view counts. The pattern of median content views aligns closely with ad impressions across genres, except for Comedy and Romance. This suggests that increasing ad impressions for the Comedy genre as compared to the Romance genre could help boost content views for the Comedy genre more effectively.
'dayofweek' vs 'views_content'
Figure 12: Boxplot and histplot for various aspects across different days of week
Observations:
The distributions across different days of the week vary significantly, with a right-skewed pattern. Content views are higher on Wednesdays, Saturdays, and Sundays. Ad impressions are higher on Wednesdays, Fridays, Mondays, and Sundays.
Insights:
ShowTime should consider increasing advertisements on Saturdays, as these days attract higher viewership. Additionally, ad impressions are lowest on Thursdays, leading to minimal content views. While content views on Fridays are relatively low, a significant amount of money is being spent on advertisements on that day. This should be adjusted, with more advertisements allocated to Wednesdays, Saturdays, and Sundays to optimize impact.
'season' vs 'views_content'
Figure 13: Boxplot and histplot for various aspects across different seasons
Observations:
The distributions across different seasons are quite similar, exhibiting a right-skewed pattern. Content views are higher in the summer and winter seasons. Ad impressions are higher in the fall, summer and winter seasons.
Insights:
ShowTime should consider increasing advertisements in the Spring season, as it attracts higher viewership. While content views in Fall are relatively low, a significant amount of money is being spent on advertisements in that season. This should be adjusted, with more advertisements allocated to the summer and winter seasons to optimize impact. The number of visitors during summer in the past week was lower than in winter, yet content views remained high. This suggests that allocating more budget to advertisements in summer compared to winter could be beneficial.
5 Data preprocessing and Model building
The dataset contains no missing or duplicate values. Outliers in the 'visitors' and 'ad_impressions' columns have been addressed, but outliers in 'trailer views' and 'content views' have not been treated, as doing so would negatively impact the model's performance: removing outliers from these two columns resulted in an R² value of approximately 0.5, whereas retaining the outliers produced an R² value of around 0.78.
We have split the data for training and testing purposes. The model summary is as follows.
5.1 Model 1
(a) Model summary
(b) Model Performance on train data
(c) Model Performance on test data
Figure 14: Model 1
The R-squared value tells us that our model can explain 77.5% of the variance in the training set.
The coefficients tell us how a one-unit change in X affects y. The sign of the coefficient indicates whether the relationship is positive or negative.
In this data set, for example, a one-unit increase in visitors occurs with a 0.1548 increase in view count; in other words, 1 million view counts can be gained from 1/0.1548 ≈ 6.46 ≈ 6.5 million visitors visiting the platform in the past week. A unit increase in major_sports_event occurs with a 0.0568 million decrease in the view count. Similarly, the same explanation applies to the other coefficients as well.
Multicollinearity occurs when predictor variables in a regression model are correlated. This correlation is a problem because predictor variables should be independent. If the collinearity between variables is high, we might not be able to trust the p-values to identify independent variables that are statistically significant. When we have multicollinearity in the linear model, the coefficients that the model suggests are unreliable.
(a) Null hypothesis - Contribution of the column, i.e. the coefficient, is zero.
(b) Alternate hypothesis - Contribution of the column, i.e. the coefficient, is not zero.
If the p-value is > 0.05, then we fail to reject the null hypothesis.
5.2 Testing the assumptions of linear regression model
We will be checking the following Linear Regression assumptions:
No Multicollinearity
Linearity of variables
Independence of error terms
Normality of error terms
No Heteroscedasticity
5.2.1 No Multicollinearity
Variance inflation factors measure the inflation in the variances of the regression parameter estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient β_k is "inflated" by the existence of correlation among the predictor variables in the model. We can clearly observe that the VIFs of all the factors are less than 3. Hence we can look into the p-values and then drop the columns which are not statistically significant.
(a) Model summary
(b) Model Performance on train data
(c) Model Performance on test data
Figure 15: Model 2
Dropping the high p-value predictor variables has not adversely affected the model performance, as R² is unchanged. This shows that these variables do not significantly impact the target variable.
5.2.2 TEST FOR LINEARITY AND INDEPENDENCE
Why the test?
Linearity describes a straight-line relationship between two variables; predictor variables must have a linear relation with the dependent variable.
How to check linearity?
Make a plot of fitted values vs residuals. If they don't follow any pattern (the curve is a straight line), then we say the model is linear; otherwise the model is showing signs of non-linearity.
How to fix if this assumption is not followed?
We can try different transformations. I have plotted the pairplot between all the parameters to observe any pattern in the data and observed that 'views_content' is non-linear to 'views_trailer'. Hence we try taking the square root of one column and re-consider the model again.
(a) Model 2
(b) Model 3
Figure 16: Residual plots
Residual Analysis
Model 2:
The residuals show a visible curve (non-linear pattern), especially towards higher fitted values. This suggests that Model 2 might be missing some non-linear relationships or has some other specification issues.
Model 3:
The residuals appear more randomly scattered with a lesser tendency to show a trend.
There is slight heteroscedasticity, but it is not very pronounced.
AIC and BIC
1. Akaike Information Criterion (AIC)
Definition: Measures model quality, penalizing complexity to avoid overfitting.
Formula: AIC = -2 ln(L) + 2k
Where:
ln(L): Log-likelihood of the model.
k: Number of parameters (including intercept).
Interpretation: Lower AIC is better. Penalizes complex models to ensure simplicity.
2. Bayesian Information Criterion (BIC)
Definition: Similar to AIC but applies a stronger penalty for complexity, based on Bayesian principles.
Formula: BIC = -2 ln(L) + k ln(n)
Where:
ln(L): Log-likelihood.
k: Number of parameters.
n: Number of observations.
Interpretation: Lower BIC is better. Stronger penalty than AIC, especially for large n.
3. Comparison
AIC: Penalizes complexity less, works well with small datasets.
BIC: Stronger penalty, prefers simpler models with large datasets.
4. When to Use
AIC: Focus on prediction; small sample size.
BIC: Focus on simplicity; large sample size.
Model Comparison

Model     AIC                    BIC
Model 2   -2016.42               -1962.70
Model 3   -2021.22 (lower AIC)   -1963.02 (lower BIC)

Table 1: Comparison of Models based on AIC and BIC
Conclusion: Model 3 is preferred due to its lower AIC and BIC, indicating a better trade-off between fit and complexity.
5.2.3 TEST FOR NORMALITY for Model 3
What is the test?
Error terms/residuals should be normally distributed. If the error terms are not normally distributed, confidence intervals may become too wide or too narrow. Once the confidence interval becomes unstable, it leads to difficulty in estimating coefficients based on minimization of least squares.
What does non-normality indicate?
It suggests that there are a few unusual data points which must be studied closely to make a better model.
How to check the normality?
It can be checked via a QQ plot - residuals following a normal distribution will make a straight-line plot, otherwise not. Another test to check for normality is the Shapiro-Wilk test.
How to make residuals normal?
We can apply transformations like log, exponential, arcsinh, etc. as per our data.
Figure 17: Test for normality for Model 3
The Shapiro-Wilk test can also be used for checking the normality. The null and alternate hypotheses
of the test are as follows:
Null hypothesis - Data is normally distributed.
Alternate hypothesis - Data is not normally distributed.
The Shapiro-Wilk test yields a p-value of 0.295, which is greater than 0.05. Thus, we can conclude
that the residuals follow a normal distribution.
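The Shapiro-Wilk test used above is available in `scipy.stats`. A sketch on synthetic residuals (the p-value of 0.295 quoted in the report comes from the actual model residuals, not this stand-in):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
residuals = rng.normal(0.0, 1.0, 500)  # stand-in for the fitted model's residuals

stat, pvalue = stats.shapiro(residuals)
print(round(pvalue, 3))
# If pvalue > 0.05, we fail to reject the null hypothesis of normality.
```

In practice one would pass `model.resid` from the statsmodels fit instead of synthetic data.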
5.2.4 TEST FOR HOMOSCEDASTICITY
Homoscedasticity - If the variance of the residuals is symmetrically distributed across the regression line, then the data is said to be homoscedastic.
Heteroscedasticity - If the variance is unequal for the residuals across the regression line, then the data is said to be heteroscedastic. In this case the residuals can form an arrow shape or any other non-symmetrical shape.
Why the test?
The presence of non-constant variance in the error terms results in heteroscedasticity. Generally, non-constant variance arises in the presence of outliers.
How to check if the model has heteroscedasticity?
We can use the Goldfeld-Quandt test. If we get a p-value > 0.05, we can say that the residuals are homoscedastic; otherwise they are heteroscedastic.
How to deal with heteroscedasticity?
It can be fixed by adding other important features or making transformations.
The p-value for the Goldfeld-Quandt test is 0.134, indicating that we fail to reject the null hypothesis.
This suggests that the residuals exhibit homoscedasticity.
5.3 Model 3
(a) Model summary
(b) Model Performance on train data
(c) Model Performance on test data
(d) Equation of the model
Figure 18: Model 3
Despite the high VIF values for 'views_trailer_sq' and 'views_trailer', both variables have been retained in the model. This decision considers the observed patterns in the pairplot, the lower AIC and BIC values of Model 3, and the overall improvement in model performance metrics such as RMSE, MAE, MAPE, and Adjusted R². We also note that 'views_trailer' has a p-value of 0.483. Therefore, we exclude this column to develop Model 4.
5.4 Final model (Model 4) performance evaluation
(a) Model summary
(b) Model Performance on train data
(c) Model Performance on test data
(d) Equation of the model
Figure 19: Model 4
Figure 20: Test for normality and homoscedasticity for final model 4

Model     AIC                                 BIC
Model 2   -2016.42                            -1962.70
Model 3   -2021.22 (lower AIC than Model 2)   -1963.02 (lower BIC than Model 2)
Model 4   -2022.72 (lowest AIC)               -1968.99 (lowest BIC)

Table 2: Comparison of Models based on AIC and BIC
Conclusion
Model 4 is the preferred model as it achieves the lowest AIC and BIC values, indicating the best balance between model fit and complexity among the compared models.
This is our final model, with every condition satisfied along with the model assumptions. This decision considers the observed patterns in the pairplot, the lower AIC and BIC values of Model 4, and the overall improvement in model performance metrics such as RMSE, MAE, MAPE, and Adjusted R². This model satisfies both the Shapiro-Wilk and Goldfeld-Quandt tests as well.
We can see that the RMSE on the train and test sets are comparable, so our model is not suffering from overfitting. The MAE indicates that our current model is able to predict view counts within a mean error of 0.04 units on the test data.
Hence, we can conclude that the model "Model 4" is good for prediction as well as inference purposes.
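The RMSE, MAE, and MAPE figures quoted for the final model can be computed from predictions as follows. This is a generic sketch with toy numbers, not the report's actual values:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

# Toy example: first-day views in millions vs hypothetical predictions.
actual = [0.45, 0.52, 0.38, 0.70]
predicted = [0.47, 0.50, 0.40, 0.65]
print(rmse(actual, predicted), mae(actual, predicted), round(mape(actual, predicted), 1))
```

Comparing RMSE between train and test predictions is exactly the overfitting check described above.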
6 Actionable Insights & Recommendations
1. With our linear regression model, we have been able to capture ~77.4% of the variation in our data.
2. The model indicates that the most significant predictors of the number of first-day views, in millions, of the content are the following:
visitors
existence of major_sports_event (on the day of content release)
dayofweek (Monday, Wednesday, Thursday, Saturday and Sunday)
season (Spring, Summer, Winter)
views_trailer
(The p-values for these predictors are less than 0.05 in our final model.)
3. A one-unit increase in visitors occurs with a 0.1532 increase in view count; in other words, 1 million view counts can be gained from 1/0.1532 ≈ 6.53 ≈ 6.5 million visitors visiting the platform in the past week.
It is important to note here that the model uses square(views_trailer) as a predictor, and therefore coefficients have to be converted accordingly to understand their influence on view counts.
It is also important to note here that correlation is not equal to causation.
4. If there is any one major sports event on the day of content release, then there is a 0.063 million decrease in view counts.
5. The categorical variables are a little hard to interpret. It can be seen that all the dayofweek_category variables in the dataset have a positive relationship with the view counts, and the magnitude of this positive relationship is high for Wednesday, Sunday and Saturday, as already visualized through the boxplot (Figure 12).
6. It can be seen that all the season_category variables in the dataset have a positive relationship with the view counts, and the magnitude of this positive relationship is high for Summer, Winter and Spring respectively, following the same pattern as already visualized through the boxplot (Figure 13).
7. As the number of views, in millions, of the content trailer increases, the number of first-day views, in millions, of the content also increases.
Our final Linear Regression model has a MAPE of 23% on the test data, which means that we are able to predict within 23% of the content views. This is a very good model and we can use this model for the benefit of the company.