Business Report - 3
PG Program in Data Science and
Business Analytics
submitted by
Sangram Keshari Patro
BATCH:PGPDSBA.O.AUG24.B
Contents
1 Objective
2 Data Description
3 Data Overview
  3.1 Importing necessary libraries and the dataset
  3.2 Structure and type of data
  3.3 Statistical summary
4 Exploratory Data Analysis
  4.1 Univariate Analysis
    4.1.1 Numerical columns
    4.1.2 Categorical columns
  4.2 Bivariate Analysis
    4.2.1 Numerical variables
    4.2.2 Categorical vs numerical variables
5 Data preprocessing and Model building
  5.1 Model 1
  5.2 Testing the assumptions of linear regression model
    5.2.1 No Multicollinearity
    5.2.2 TEST FOR LINEARITY AND INDEPENDENCE
    5.2.3 TEST FOR NORMALITY for Model 3
    5.2.4 TEST FOR HOMOSCEDASTICITY
  5.3 Model 3
  5.4 Final model (Model 4) performance evaluation
6 Actionable Insights & Recommendations
List of Figures
1 Dataframe
2 Table depicting the datatype and non-null values in each column
3 Statistical summary of the data
4 Histogram and boxplot of 'views_content' column
5 Histogram and boxplot of 'views_trailer' column
6 Histogram and boxplot of 'visitors' column
7 Histogram and boxplot of 'ad_impression' column
8 Barchart of 'genre', 'season', 'dayofweek' and 'major_sports_event' columns
9 Heatmap of all numerical variables
10 Pairplot of all numerical variables
11 Boxplot and histplot for various aspects across different genres
12 Boxplot and histplot for various aspects across different days of week
13 Boxplot and histplot for various aspects across different seasons
14 Model 1
15 Model 2
16 Residual plots
17 Test for normality for Model 3
18 Model 3
19 Model 4
20 Test for normality and homoscedasticity for final model 4
List of Tables
1 Comparison of Models based on AIC and BIC
2 Comparison of Models based on AIC and BIC
1 Objective
ShowTime is an OTT service provider and offers a wide variety of content (movies, web shows, etc.) for its users. They want to determine the driver variables for first-day content viewership so that they can take necessary measures to improve the viewership of the content on their platform. Some of the reasons for the decline in viewership of content would be the decline in the number of people coming to the platform, decreased marketing spend, content timing clashes, weekends and holidays, etc. They have hired you as a Data Scientist, shared the data of the current content in their platform, and asked you to analyze the data and come up with a linear regression model to determine the driving factors for first-day viewership.
2 Data Description
The data contains the different factors to analyze for the content. The detailed data dictionary is given below:
1. visitors: Average number of visitors, in millions, to the platform in the past week.
2. ad_impressions: Number of ad impressions, in millions, across all ad campaigns for the content (both
running and completed).
3. major_sports_event: Indicates if there was any major sports event on the day.
4. genre: Genre of the content.
5. dayofweek: Day of the week on which the content was released.
6. season: Season during which the content was released.
7. views_trailer: Number of views, in millions, of the content trailer.
8. views_content: Number of first-day views, in millions, of the content.
3 Data Overview
3.1 Importing necessary libraries and the dataset
The dataframe is printed. It has 1000 rows & 8 columns.
Figure 1: Dataframe
3.2 Structure and type of data
Data is explored further. Data doesn't have any duplicate rows.
Figure 2: Table depicting the datatype and Non-Null values in each column.
3.3 Statistical summary
Figure 3: Statistical summary of the data
From this table, we can observe that some of the columns contain outliers.
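The outlier screening implied by the summary table and the boxplots that follow uses the standard 1.5×IQR (boxplot whisker) rule. A minimal sketch of that check — on synthetic stand-in data, since the original dataset is not reproduced in this report:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in columns (names taken from the data dictionary).
df = pd.DataFrame({
    "visitors": rng.normal(1.7, 0.1, 1000),          # roughly symmetric
    "views_trailer": rng.lognormal(4.0, 0.4, 1000),  # right-skewed, like the real column
})

def iqr_outlier_count(s: pd.Series) -> int:
    """Count points outside the 1.5*IQR whiskers (the boxplot rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

print({col: iqr_outlier_count(df[col]) for col in df.columns})
```

Columns with a nonzero count under this rule are the ones flagged as containing outliers.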
4 Exploratory Data Analysis
4.1 Univariate Analysis
4.1.1 Numerical columns
views_content
Figure 4: Histogram and boxplot of 'views_content' column
Observations:
The distribution of first-day views (views_content) is approximately normal with a slight positive skew. The mean first-day views are around 0.47 million. The median is close to the mean, indicating a roughly symmetric distribution. The range of first-day views spans from approximately 0.2 million to 0.9 million, with some content achieving exceptionally high viewership.
Insights:
Most content has moderate viewership on the first day, with a few notable outliers. ShowTime could analyze high-performing content to identify factors driving higher viewership, such as genre or marketing, to apply these insights to other content.
views_trailer
Figure 5: Histogram and boxplot of 'views_trailer' column
Observations:
The distribution of trailer views is positively skewed. Mean trailer views are around 67 million, with the median around 54 million, which is lower than the mean. Trailer views range from 30 million to about 200 million, with most values between 40 and 70 million.
Insights:
The consistency in trailer views suggests that trailers generate stable interest among viewers. Increasing the visibility or promotion of trailers may help drive higher anticipation and, consequently, first-day views.
visitors
Figure 6: Histogram and boxplot of visitors column
Observations:
The visitors distribution has a mean of approximately 1.7 million visitors to the platform in the past week. The median is close to the mean. Visitor counts range from 1.4 million to 2 million, suggesting consistent platform traffic in the past week. There are no outliers in this column.
Insights:
ShowTime has a steady visitor base, but not all visitors engage with new content. Encouraging these visitors to view new releases, possibly through notifications or recommendations, could significantly increase first-day views.
ad_impression
Figure 7: Histogram and boxplot of 'ad_impression' column
Observations:
The ad impression distribution has a mean of approximately 1411 million impressions across all campaigns. The median is close to the mean. There are no outliers in this column.
4.1.2 Categorical columns
(a) 'genre' (b) 'dayofweek' (c) 'season' (d) 'major_sports_event'
Figure 8: Barchart of 'genre', 'season', 'dayofweek' and 'major_sports_event' columns
Observations
1. The day of the week bar chart indicates that Friday has the highest count, followed by Wednesday. Tuesday and Monday have the lowest counts, i.e. most of the content is released on Friday and Wednesday.
2. The season bar chart shows a relatively even distribution across all seasons, with Winter having the highest count, closely followed by Fall, Spring, and Summer.
3. The contents by ShowTime feature various genres, all of which have roughly equal weightage, with Comedy content being released the most frequently.
Insights
1. The company may consider focusing marketing efforts more heavily on weekends as well, apart from Fridays, to maximize audience engagement.
2. The even distribution of counts across seasons indicates that activities or events are spread throughout the year. This suggests that seasonal trends have minimal impact on the data, so the company may maintain a consistent approach throughout the year.
3. The content released on days when major sports events occur should be minimized, as it currently accounts for 40% of releases.
4.2 Bivariate Analysis
4.2.1 Numerical variables
Heatmap
Figure 9: Heatmap of all numerical variables
Key Insights:
(a) views_content has a strong positive correlation with views_trailer (0.75), suggesting that customers who watch trailers are highly likely to watch the content as well.
(b) The average number of visitors to the platform last week shows a moderate positive correlation (0.24) with views_content.
(c) Interestingly, ad_impressions does not show a strong correlation with views_content, highlighting the need to reconsider the type or platform of advertisements being used.
Pairplot
Figure 10: Pairplot of all numerical variables
This graph shows that customers who watch the trailer mostly prefer to watch the content too, as the correlation is high.
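The values shown in the heatmap are pairwise Pearson correlations, which pandas computes directly. A minimal sketch — the data here is synthetic (built so that trailer and content views correlate), standing in for the ShowTime dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 1000
views_trailer = rng.lognormal(4.0, 0.4, n)
df = pd.DataFrame({
    "views_trailer": views_trailer,
    # Content views constructed to correlate strongly with trailer views.
    "views_content": 0.005 * views_trailer + rng.normal(0.4, 0.05, n),
    "visitors": rng.normal(1.7, 0.1, n),
})

# Pairwise Pearson correlation matrix, the input to the heatmap.
corr = df.corr()
print(corr.round(2))
```

A seaborn call such as `sns.heatmap(corr, annot=True)` would then render the figure; the matrix itself carries the insight.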
4.2.2 Categorical vs numerical variables
'genre' vs 'views_content'
Figure 11: Boxplot and histplot for various aspects across different genres
Observations:
The distributions across different genres are quite similar, exhibiting a right-skewed pattern. These distributions generally peak around 0.4 to 0.5 million view counts, with a range between 0.3 and 0.8 million view counts. Among all genres, Sci-Fi has the highest mean and median content views, while the Thriller genre has the lowest content views.
Insights:
ShowTime should focus on adding more Sci-Fi content to their platform, as it attracts higher viewership, and consider producing similar content. Additionally, since ad impressions are lowest for the Thriller genre, resulting in minimal content views, ShowTime should work on strategies to boost the viewership of Thriller movies on their platform. Ad impressions for Sci-Fi are significantly higher compared to other genres, resulting in increased view counts. The pattern of median content views aligns closely with ad impressions across genres, except for Comedy and Romance. This suggests that increasing ad impressions for the Comedy genre as compared to the Romance genre could help boost content views for the Comedy genre more effectively.
'dayofweek' vs 'views_content'
Figure 12: Boxplot and histplot for various aspects across different days of week
Observations:
The distributions across different days of the week vary significantly, with a right-skewed pattern. Content views are higher on Wednesdays, Saturdays, and Sundays. Ad impressions are higher on Wednesdays, Fridays, Mondays, and Sundays.
Insights:
ShowTime should consider increasing advertisements on Saturdays, as these days attract higher viewership. Additionally, ad impressions are lowest on Thursdays, leading to minimal content views. While content views on Fridays are relatively low, a significant amount of money is being spent on advertisements on that day. This should be adjusted, with more advertisements allocated to Wednesdays, Saturdays, and Sundays to optimize impact.
'season' vs 'views_content'
Figure 13: Boxplot and histplot for various aspects across different seasons
Observations:
The distributions across different seasons are quite similar, exhibiting a right-skewed pattern. Content views are higher in the summer and winter seasons. Ad impressions are higher in the fall, summer and winter seasons.
Insights:
ShowTime should consider increasing advertisements in the Spring season, as it attracts higher viewership. While content views in Fall are relatively low, a significant amount of money is being spent on advertisements in that season. This should be adjusted, with more advertisements allocated to the summer and winter seasons to optimize impact. The number of visitors during summer in the past week was lower than in winter, yet content views remained high. This suggests that allocating more budget to advertisements in summer compared to winter could be beneficial.
5 Data preprocessing and Model building
The dataset contains no missing or duplicate values. Outliers in the 'visitors' and 'ad_impressions' columns have been addressed, but outliers in 'trailer views' and 'content views' have not been treated, as doing so would negatively impact the model's performance: removing outliers from these two columns resulted in an R² value of approximately 0.5, whereas retaining the outliers produced an R² value of around 0.78.
We have split the data for training and testing purposes. The model summary is as follows.
5.1 Model 1
(a) Model summary
(b) Model Performance on train data
(c) Model Performance on test data
Figure 14: Model 1
The R-squared value tells us that our model can explain 77.5% of the variance in the training set.
The coefficients tell us how a one-unit change in X affects y. The sign of the coefficient indicates whether the relationship is positive or negative.
In this data set, for example, a one-unit increase in visitors occurs with a 0.1548 increase in view count; in other words, 1 million view counts can be gained from 1/0.1548 ≈ 6.46 ≈ 6.5 million visitors visiting the platform in the past week. A unit increase in major_sports_event occurs with a 0.0568 million decrease in the view count. Similarly, the same explanation applies to the other coefficients as well.
Multicollinearity occurs when predictor variables in a regression model are correlated. This correlation is a problem because predictor variables should be independent. If the collinearity between variables is high, we might not be able to trust the p-values to identify independent variables that are statistically significant. When we have multicollinearity in the linear model, the coefficients that the model suggests are unreliable.
(a) Null hypothesis - Contribution of the column, i.e. the coefficient, is zero.
(b) Alternate hypothesis - Contribution of the column, i.e. the coefficient, is not zero.
If the p-value is > 0.05, then we fail to reject the null hypothesis.
5.2 Testing the assumptions of linear regression model
We will be checking the following Linear Regression assumptions:
No Multicollinearity
Linearity of variables
Independence of error terms
Normality of error terms
No Heteroscedasticity
5.2.1 No Multicollinearity
Variance inflation factors measure the inflation in the variances of the regression parameter estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient β_k is "inflated" by the existence of correlation among the predictor variables in the model. We can clearly observe that the VIFs of all the factors are less than 3. Hence we can look into the p-values and then drop the columns which are not statistically significant.
(a) Model summary
(b) Model Performance on train data
(c) Model Performance on test data
Figure 15: Model 2
Dropping the high p-value predictor variables has not adversely affected the model performance, as R² is unchanged. This shows that these variables do not significantly impact the target variable.
5.2.2 TEST FOR LINEARITY AND INDEPENDENCE
Why the test?
Linearity describes a straight-line relationship between two variables; predictor variables must have a linear relation with the dependent variable.
How to check linearity?
Make a plot of fitted values vs residuals. If they don't follow any pattern (the curve is a straight line), then we say the model is linear; otherwise the model is showing signs of non-linearity.
How to fix if this assumption is not followed?
We can try different transformations. I have plotted the pairplot between all the parameters to observe any pattern in the data and observed that 'views_content' is non-linear to 'views_trailer'. Hence we try taking the square root of one column and re-consider the model again.
(a) Model 2
(b) Model 3
Figure 16: Residual plots
Residual Analysis
Model 2:
The residuals show a visible curve (non-linear pattern), especially towards higher fitted values. This suggests that Model 2 might be missing some non-linear relationships or has some other specification issues.
Model 3:
The residuals appear more randomly scattered with a lesser tendency to show a trend.
There is slight heteroscedasticity, but it is not very pronounced.
AIC and BIC
1. Akaike Information Criterion (AIC)
Definition: Measures model quality, penalizing complexity to avoid overfitting.
Formula: AIC = -2 ln(L) + 2k
Where:
ln(L): Log-likelihood of the model.
k: Number of parameters (including intercept).
Interpretation: Lower AIC is better. Penalizes complex models to ensure simplicity.
2. Bayesian Information Criterion (BIC)
Definition: Similar to AIC but applies a stronger penalty for complexity, based on Bayesian principles.
Formula: BIC = -2 ln(L) + k ln(n)
Where:
ln(L): Log-likelihood.
k: Number of parameters.
n: Number of observations.
Interpretation: Lower BIC is better. Stronger penalty than AIC, especially for large n.
3. Comparison
AIC: Penalizes complexity less, works well with small datasets.
BIC: Stronger penalty, prefers simpler models with large datasets.
4. When to Use
AIC: Focus on prediction; small sample size.
BIC: Focus on simplicity; large sample size.
Model Comparison

Model     AIC                    BIC
Model 2   -2016.42               -1962.70
Model 3   -2021.22 (lower AIC)   -1963.02 (lower BIC)

Table 1: Comparison of Models based on AIC and BIC
Conclusion: Model 3 is preferred due to its lower AIC and BIC, indicating a better trade-off between fit and complexity.
5.2.3 TEST FOR NORMALITY for Model 3
What is the test?
Error terms/residuals should be normally distributed. If the error terms are not normally distributed, confidence intervals may become too wide or too narrow. Once the confidence interval becomes unstable, it leads to difficulty in estimating coefficients based on minimization of least squares.
What does non-normality indicate?
It suggests that there are a few unusual data points which must be studied closely to make a better model.
How to check the normality?
It can be checked via a QQ plot - residuals following a normal distribution will make a straight-line plot, otherwise not. Another test to check for normality is the Shapiro-Wilk test.
How to make residuals normal?
We can apply transformations like log, exponential, arcsinh, etc. as per our data.
Figure 17: Test for normality for Model 3
The Shapiro-Wilk test can also be used for checking the normality. The null and alternate hypotheses
of the test are as follows:
Null hypothesis - Data is normally distributed.
Alternate hypothesis - Data is not normally distributed.
The Shapiro-Wilk test yields a p-value of 0.295, which is greater than 0.05. Thus, we can conclude
that the residuals follow a normal distribution.
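The Shapiro-Wilk test used above is available in `scipy.stats`. A sketch on synthetic residuals (the p-value of 0.295 quoted in the report comes from the actual model residuals, not this stand-in):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
residuals = rng.normal(0.0, 1.0, 500)  # stand-in for the fitted model's residuals

stat, pvalue = stats.shapiro(residuals)
print(round(pvalue, 3))
# If pvalue > 0.05, we fail to reject the null hypothesis of normality.
```

In practice one would pass `model.resid` from the statsmodels fit instead of synthetic data.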
5.2.4 TEST FOR HOMOSCEDASTICITY
Homoscedasticity - If the variance of the residuals is symmetrically distributed across the regression line, then the data is said to be homoscedastic.
Heteroscedasticity - If the variance is unequal for the residuals across the regression line, then the data is said to be heteroscedastic. In this case the residuals can form an arrow shape or any other non-symmetrical shape.
Why the test?
The presence of non-constant variance in the error terms results in heteroscedasticity. Generally, non-constant variance arises in the presence of outliers.
How to check if the model has heteroscedasticity?
We can use the Goldfeld-Quandt test. If we get a p-value > 0.05, we can say that the residuals are homoscedastic; otherwise they are heteroscedastic.
How to deal with heteroscedasticity?
It can be fixed by adding other important features or making transformations.
The p-value for the Goldfeld-Quandt test is 0.134, indicating that we fail to reject the null hypothesis.
This suggests that the residuals exhibit homoscedasticity.
5.3 Model 3
(a) Model summary
(b) Model Performance on train data
(c) Model Performance on test data
(d) Equation of the model
Figure 18: Model 3
Despite the high VIF values for 'views_trailer_sq' and 'views_trailer', both variables have been retained in the model. This decision considers the observed patterns in the pairplot, the lower AIC and BIC values of Model 3, and the overall improvement in model performance metrics such as RMSE, MAE, MAPE, and Adjusted R². We also note that 'views_trailer' has a p-value of 0.483. Therefore, we exclude this column to develop Model 4.
5.4 Final model (Model 4) performance evaluation
(a) Model summary
(b) Model Performance on train data
(c) Model Performance on test data
(d) Equation of the model
Figure 19: Model 4
Figure 20: Test for normality and homoscedasticity for final model 4

Model     AIC                                 BIC
Model 2   -2016.42                            -1962.70
Model 3   -2021.22 (lower AIC than Model 2)   -1963.02 (lower BIC than Model 2)
Model 4   -2022.72 (lowest AIC)               -1968.99 (lowest BIC)

Table 2: Comparison of Models based on AIC and BIC
Conclusion
Model 4 is the preferred model as it achieves the lowest AIC and BIC values, indicating the best balance between model fit and complexity among the compared models.
This is our final model, with every condition satisfied along with the model assumptions. This decision considers the observed patterns in the pairplot, the lower AIC and BIC values of Model 4, and the overall improvement in model performance metrics such as RMSE, MAE, MAPE, and Adjusted R². This model satisfies both the Shapiro-Wilk and Goldfeld-Quandt tests as well.
We can see that the RMSE on the train and test sets are comparable, so our model is not suffering from overfitting. The MAE indicates that our current model is able to predict view counts within a mean error of 0.04 units on the test data.
Hence, we can conclude that the model "Model 4" is good for prediction as well as inference purposes.
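The RMSE, MAE, and MAPE figures quoted for the final model can be computed from predictions as follows. This is a generic sketch with toy numbers, not the report's actual values:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

# Toy example: first-day views in millions vs hypothetical predictions.
actual = [0.45, 0.52, 0.38, 0.70]
predicted = [0.47, 0.50, 0.40, 0.65]
print(rmse(actual, predicted), mae(actual, predicted), round(mape(actual, predicted), 1))
```

Comparing RMSE between train and test predictions is exactly the overfitting check described above.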
6 Actionable Insights & Recommendations
1. With our linear regression model, we have been able to capture ~77.4% of the variation in our data.
2. The model indicates that the most significant predictors of the number of first-day views, in millions, of the content are the following:
visitors
existence of major_sports_event (on the day of content release)
dayofweek (Monday, Wednesday, Thursday, Saturday and Sunday)
season (Spring, Summer, Winter)
views_trailer
(The p-values for these predictors are less than 0.05 in our final model.)
3. A one-unit increase in visitors occurs with a 0.1532 increase in view count; in other words, 1 million view counts can be gained from 1/0.1532 ≈ 6.53 ≈ 6.5 million visitors visiting the platform in the past week.
It is important to note here that the model uses square(views_trailer) as a predictor, and therefore coefficients have to be converted accordingly to understand their influence on view counts.
It is also important to note here that correlation is not equal to causation.
4. If there is any one major sports event on the day of content release, then there is a 0.063 million decrease in view counts.
5. The categorical variables are a little hard to interpret. It can be seen that all the dayofweek_category variables in the dataset have a positive relationship with the view counts, and the magnitude of this positive relationship is high for Wednesday, Sunday and Saturday, as already visualized through the boxplot (Figure 12).
6. It can be seen that all the season_category variables in the dataset have a positive relationship with the view counts, and the magnitude of this positive relationship is high for Summer, Winter and Spring respectively, following the same pattern as already visualized through the boxplot (Figure 13).
7. As the number of views, in millions, of the content trailer increases, the number of first-day views, in millions, of the content also increases.
Our final Linear Regression model has a MAPE of 23% on the test data, which means that we are able to predict within 23% of the content views. This is a very good model and we can use this model for the benefit of the company.