Time Series ARIMA Model
Forecasting Time Series Data Using the ARIMA Model
Project Synopsis
Introduction
In this project, I aim to analyze and forecast time series data. Time series data is a sequence of data points collected or recorded at
specific time intervals, such as monthly or daily. My goal is to understand the patterns in this data and make future predictions based on
those patterns.
I start by generating synthetic time series data that includes both a trend and some random noise, which provides a controlled dataset for
testing and evaluating forecasting methods. I then use statistical tests to check whether the data is stationary, meaning that its statistical
properties stay consistent over time.
Next, I apply differencing to make the data stationary if needed. Differencing subtracts the previous data point from the current one to
remove trends and seasonality.
Finally, I compare the original data with the forecasts to see how well the predictions match the actual values. I visualize both the original
and predicted data to assess forecast accuracy and understand how well the model performs.
Generate Random Time Series Data
This code generates synthetic time series data with a linear trend and random noise, useful for testing and analysis. The function
generate_random_data creates a dataset with a specified number of monthly data points, starting from a given date. Each data point is
computed by adding a linearly increasing trend to a base value of 3.5, along with random noise to introduce variability. The result is
stored in a Pandas DataFrame with columns for dates and values. The generated dataset is printed and saved to a CSV file named
random_data.csv.
import pandas as pd
import random

# Function to generate random time series data
def generate_random_data(start_date, periods, trend_slope=0.1, noise_level=1.0):
    """
    Generate random data similar to the provided dataset.

    Parameters:
        start_date (str): Start date of the time series (YYYY-MM-DD).
        periods (int): Number of monthly data points to generate.
        trend_slope (float): Controls the overall increase in the values over time.
        noise_level (float): Controls the variability in the values.

    Returns:
        pd.DataFrame: DataFrame with generated dates and values.
    """
    # Generate a date range; 'MS' gives the start of each month
    dates = pd.date_range(start=start_date, periods=periods, freq='MS')

    # Initialize the starting value
    base_value = 3.5  # Starting value close to the provided dataset

    values = []
    for i in range(periods):
        # Add a trend and noise to the base value
        trend = base_value + (i * trend_slope)             # Linearly increasing trend
        noise = random.uniform(-noise_level, noise_level)  # Random noise for variability
        values.append(trend + noise)

    # Create a DataFrame
    return pd.DataFrame({'date': dates, 'value': values})

# Generate random data ('2000-01-01' is a placeholder start date; the original value is not shown)
random_data = generate_random_data(start_date='2000-01-01', periods=200, trend_slope=0.15, noise_level=2.0)

# Display the first few rows
print(random_data.head())

# Save to a CSV file if needed
random_data.to_csv('random_data.csv', index=False)
[Output: the first five rows of random_data (columns: date, value); values elided.]
random_data
[Output: the full random_data DataFrame, 200 rows × 2 columns (date, value); values elided.]
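One optional refinement, not part of the original notebook: seeding Python's random module makes the generated series reproducible
across runs. The start date below is again a placeholder.
# Optional reproducibility sketch (assumption: not in the original notebook)
import random

random.seed(42)  # fix the noise sequence so reruns produce identical data
random_data = generate_random_data(start_date='2000-01-01',  # placeholder date
                                   periods=200, trend_slope=0.15, noise_level=2.0)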
This code performs the Augmented Dickey-Fuller (ADF) test, a statistical test used to check whether a time series is stationary (i.e., it has
a constant mean and variance over time):
The code imports the adfuller function from statsmodels.tsa.stattools to conduct an Augmented Dickey-Fuller test on the generated time
series data (random_data). It drops any missing values from the value column and runs the test to determine if the time series is
stationary. The ADF test results include the ADF statistic and the p-value, which are printed to help assess whether the null hypothesis
(that the series has a unit root, indicating non-stationarity) can be rejected.
from statsmodels.tsa.stattools import adfuller

result = adfuller(random_data.value.dropna())
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
[Output: the ADF statistic and p-value; values elided.]
The Augmented Dickey-Fuller (ADF) test returns an ADF statistic and a p-value (the numeric values are elided above). Since the
p-value is significantly higher than common significance levels (e.g., 0.05 or 0.01), we fail to reject the null hypothesis. This suggests that
the time series data is non-stationary, meaning it likely has a changing mean or variance over time. Therefore, transformations or
differencing are needed to make the data stationary for time series modeling.
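As a quick illustration of this decision rule (a sketch, not part of the original notebook), the p-value can be checked programmatically
against a chosen significance level:
# A minimal sketch: apply the 5% decision rule to the ADF result above
alpha = 0.05
if result[1] < alpha:
    print('Reject H0: the series appears stationary.')
else:
    print('Fail to reject H0: the series appears non-stationary; consider differencing.')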
Differencing For Stationarity
This code performs a first-order differencing on the generated time series data to transform it into a stationary series and then conducts
an Augmented Dickey-Fuller (ADF) test on the differenced data. Here's a summary:
The code begins by calculating the first difference of the time series (diff_1) to remove any trends, making the data more stationary. After
dropping the initial NaN value resulting from differencing, it performs an ADF test on the differenced data to check for stationarity. The
ADF statistic and p-value are printed to evaluate the results. Additionally, the code plots two graphs: one showing the original time series
and another displaying the first-differenced series, allowing a visual comparison of the transformations.
from statsmodels.tsa.stattools import adfuller
import matplotlib.pyplot as plt
# First differencing
random_data['diff_1'] = random_data['value'].diff()
# Drop the first row (NaN after differencing)
data_diff = random_data.dropna()
# Perform ADF test on differenced data
result = adfuller(data_diff['diff_1'])
adf_statistic = result[0]
p_value = result[1]
print(f'ADF Statistic: {adf_statistic}')
print(f'p-value: {p_value}')
# Plot original and differenced series
plt.figure(figsize=(10,6))
plt.subplot(2,1,1)
plt.plot(random_data['date'], random_data['value'], label='Original Data')
plt.legend()
plt.subplot(2,1,2)
plt.plot(data_diff['date'], data_diff['diff_1'], label='First Difference', color='orange')
plt.legend()
plt.tight_layout()
plt.show()
[Output: ADF statistic elided; p-value on the order of 1e-11.]
This code performs both first-order and second-order differencing on the time series data to progressively eliminate any trends and make
the data stationary. Here's a summary:
The code first calculates the first difference of the original time series (diff_1) and removes any resulting NaN values. Next, it calculates
the second difference (diff_2) on the first-differenced data using .loc to avoid a Pandas warning and again drops any NaN values. An
Augmented Dickey-Fuller (ADF) test is then conducted on the second-order differenced data to assess its stationarity, and the ADF
statistic and p-value are printed for evaluation. Finally, the code visualizes the original series, the first differencing, and the second
differencing in three separate subplots to help visually compare the transformations and their effects on the data's stationarity.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
# Assuming 'random_data' is already defined and contains 'date' and 'value'
# First differencing
random_data['diff_1'] = random_data['value'].diff()
# Drop the first row (NaN after first differencing); .copy() gives an independent
# DataFrame so the assignment below cannot trigger SettingWithCopyWarning
data_diff_1 = random_data.dropna().copy()
# Second differencing
data_diff_1['diff_2'] = data_diff_1['diff_1'].diff()
# Drop NaN values after second differencing
data_diff_2 = data_diff_1.dropna()
# Perform ADF test on second-order differenced data
result_2 = adfuller(data_diff_2['diff_2'])
adf_statistic_2 = result_2[0]
p_value_2 = result_2[1]
print(f'ADF Statistic (2nd order): {adf_statistic_2}')
print(f'p-value (2nd order): {p_value_2}')
# Plot original, first differencing, and second differencing series
plt.figure(figsize=(10,8))
# Plot Original Data
plt.subplot(3, 1, 1)
plt.plot(random_data['date'], random_data['value'], label='Original Data')
plt.legend()
# Plot First Differencing
plt.subplot(3, 1, 2)
plt.plot(data_diff_1['date'], data_diff_1['diff_1'], label='First Difference', color='orange')
plt.legend()
# Plot Second Differencing
plt.subplot(3, 1, 3)
plt.plot(data_diff_2['date'], data_diff_2['diff_2'], label='Second Difference', color='green')
plt.legend()
plt.tight_layout()
plt.show()
ADF Statistic (2nd order): -8.5017
p-value (2nd order): 1.23e-13
The results of the Augmented Dickey-Fuller (ADF) test on the second-order differenced data show an ADF statistic of -8.5017 and a p-
value of approximately 1.23e-13. Since the p-value is extremely low (much smaller than common significance levels such as 0.05 or
0.01), we can reject the null hypothesis that the series has a unit root. This indicates that the second-order differenced data is stationary,
meaning the data now has a constant mean and variance over time, making it suitable for time series modeling.
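To make this workflow explicit, here is a small helper (not in the original notebook) that encodes the same logic: keep differencing until
the ADF test rejects the unit-root null at the 5% level, and report the order d that was needed.
# A minimal sketch of automated differencing-order selection via the ADF test
from statsmodels.tsa.stattools import adfuller

def find_differencing_order(series, max_d=3, alpha=0.05):
    """Difference `series` until the ADF test rejects the unit-root null."""
    s = series.dropna()
    for d in range(max_d + 1):
        p_value = adfuller(s)[1]
        if p_value < alpha:
            return d, p_value  # stationary after d differences
        s = s.diff().dropna()
    return max_d, p_value  # give up after max_d differences

d, p = find_differencing_order(random_data['value'])
print(f'Series appears stationary after d={d} differences (p-value: {p:.2e})')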
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.stattools import adfuller
# Assuming 'random_data' is already defined and contains 'date' and 'value'
# First differencing
random_data['diff_1'] = random_data['value'].diff()
# Drop the first row (NaN after first differencing); .copy() avoids SettingWithCopyWarning
data_diff_1 = random_data.dropna().copy()
# Second differencing
data_diff_1['diff_2'] = data_diff_1['diff_1'].diff()
# Drop NaN values after second differencing
data_diff_2 = data_diff_1.dropna()
# Perform ADF test on second-order differenced data
result_2 = adfuller(data_diff_2['diff_2'])
adf_statistic_2 = result_2[0]
p_value_2 = result_2[1]
print(f'ADF Statistic (2nd order): {adf_statistic_2}')
print(f'p-value (2nd order): {p_value_2}')
# Plot original, first differencing, and second differencing series
plt.figure(figsize=(10,10))
# Plot Original Data
plt.subplot(3, 1, 1)
plt.plot(random_data['date'], random_data['value'], label='Original Data')
plt.legend()
# Plot First Differencing
plt.subplot(3, 1, 2)
plt.plot(data_diff_1['date'], data_diff_1['diff_1'], label='First Difference', color='orange')
plt.legend()
# Plot Second Differencing
plt.subplot(3, 1, 3)
plt.plot(data_diff_2['date'], data_diff_2['diff_2'], label='Second Difference', color='green')
plt.legend()
plt.tight_layout()
plt.show()
# Plot ACF for the first differencing
fig, axes = plt.subplots(2, 1, figsize=(10,8))
axes[0].set_title("ACF - First Differencing")
plot_acf(data_diff_1['diff_1'].dropna(), ax=axes[0])
# Plot ACF for the second differencing
axes[1].set_title("ACF - Second Differencing")
plot_acf(data_diff_2['diff_2'].dropna(), ax=axes[1])
plt.tight_layout()
plt.show()
ADF Statistic (2nd order): -8.5017
p-value (2nd order): 1.23e-13
This code performs time series differencing and plots the Partial Autocorrelation Function (PACF) to identify the order of autoregression
(AR) terms. First, it computes the first difference of the value column in a dataset, then computes the second difference of the already
differenced values. After handling missing values that result from differencing, the code visualizes the PACF for both the first and second
differenced series. This analysis helps determine the number of AR terms needed in a time series model by examining the PACF plots at
different differencing stages.
# Importing the required libraries
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
# Assuming 'random_data' is already defined and contains 'date' and 'value'
# First differencing
random_data['diff_1'] = random_data['value'].diff()
# Drop the first row (NaN after first differencing); .copy() avoids SettingWithCopyWarning
data_diff_1 = random_data.dropna().copy()
# Second differencing
data_diff_1['diff_2'] = data_diff_1['diff_1'].diff()
# Drop NaN values after second differencing
data_diff_2 = data_diff_1.dropna()
# Plot PACF for the first and second differencing
fig, axes = plt.subplots(2, 1, figsize=(10,8))
# First Differencing PACF
axes[0].set_title("PACF - First Differencing")
plot_pacf(data_diff_1['diff_1'].dropna(), ax=axes[0])
# Second Differencing PACF
axes[1].set_title("PACF - Second Differencing")
plot_pacf(data_diff_2['diff_2'].dropna(), ax=axes[1])
plt.tight_layout()
plt.show()
The Partial Autocorrelation Function (PACF) is used to identify the number of autoregressive (AR) terms required in a time series model.
Specifically, the PACF plot shows the direct correlation of a time series with its lagged values after removing the influence of any
intermediary lags. By analyzing the PACF for both the first and second differenced series, the code helps in determining how many lags
have significant partial autocorrelations, which indicates how many AR terms should be included when building a time series model like
ARIMA. This is useful for understanding the structure and dependencies within the differenced series.
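As a rough complement to reading the plots by eye (a sketch, not in the original notebook), the significant lags can also be counted
numerically using the approximate 95% band of ±1.96/√n for a white-noise series:
# A minimal sketch: flag PACF lags outside the approximate 95% confidence band
import numpy as np
from statsmodels.tsa.stattools import pacf

series = random_data['value'].diff().dropna()
pacf_vals = pacf(series, nlags=20)
conf_band = 1.96 / np.sqrt(len(series))  # approximate 95% band under white noise
significant = [lag for lag, v in enumerate(pacf_vals[1:], start=1) if abs(v) > conf_band]
print('Lags with significant partial autocorrelation:', significant)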
# PACF plot of 1st differenced series
plt.rcParams.update({'figure.figsize': (9, 3), 'figure.dpi': 120})
fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].plot(random_data.value.diff()); axes[0].set_title('1st Differencing')
axes[1].set(ylim=(-1.1, 1.1))  # keep the full PACF range, including negative values, visible
plot_pacf(random_data.value.diff().dropna(), ax=axes[1])
plt.show()
First Differencing Plot (Left): This plot displays the values of the random_data['value'] series after applying first differencing. First
differencing removes trends or seasonality by subtracting the previous value from each data point. The resulting plot shows fluctuations
around zero, suggesting that the data has been transformed into a stationary time series. This is a common preprocessing step in time
series modeling to meet the stationarity assumption.
Partial Autocorrelation Function (PACF) Plot (Right): The PACF plot illustrates the partial autocorrelation of the first-differenced series at
various lags. In this case, there is significant partial autocorrelation at lag 1, which suggests that there may be an autoregressive (AR)
component in the data. The rest of the lags seem to have very low values, close to zero, implying that no further AR terms are strongly
needed beyond the first lag.
Overall, this analysis indicates that after first differencing, the time series may be stationary and might have an AR(1) structure due to the
significant lag 1 in the PACF. This is useful for selecting the appropriate autoregressive terms when building time series models such as
ARIMA.
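As an optional cross-check on this manual order selection (not part of the original notebook, and it assumes the third-party pmdarima
package is installed), an automated stepwise search can propose a (p, d, q) order:
# A hedged sketch: let pmdarima search for an order and compare with the manual choice
import pmdarima as pm

# seasonal=False because the generated data contains no seasonal component
auto_model = pm.auto_arima(random_data['value'], seasonal=False, trace=True)
print('Suggested order:', auto_model.order)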
fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].plot(random_data.value.diff()); axes[0].set_title('1st Differencing')
axes[1].set(ylim=(0,1.2))
plot_acf(random_data.value.diff().dropna(), ax=axes[1])
plt.show()
# Import the new ARIMA class
from statsmodels.tsa.arima.model import ARIMA
# ARIMA(1,1,1) model
model = ARIMA(random_data['value'], order=(1,1,1))
model_fit = model.fit()
# Print the model summary
print(model_fit.summary())
                               SARIMAX Results
==============================================================================
Dep. Variable:                  value   No. Observations:                  200
Model:                 ARIMA(1, 1, 1)   Log Likelihood                -347.667
Date:                Wed, 28 Aug 2024   AIC                            701.335
Time:                        05:45:53   BIC                            711.215
Sample:                             0   HQIC                           705.333
                                - 200
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1               -          -          -          -           -           -
ma.L1               -          -          -          -      -0.763      -0.397
sigma2              -          -          -          -           -           -
==============================================================================
Ljung-Box (L1) (Q):              2.30   Jarque-Bera (JB):                 7.37
Prob(Q):                         0.13   Prob(JB):                         0.03
Heteroskedasticity (H):          0.93   Skew:                             0.03
Prob(H) (two-sided):             0.78   Kurtosis:                         2.06
==============================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
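The headline fit statistics quoted in the summary can also be read directly from the fitted results object; a minimal sketch:
# Read AIC/BIC/HQIC from the ARIMAResults object rather than the printed summary
print(f'AIC:  {model_fit.aic:.3f}')
print(f'BIC:  {model_fit.bic:.3f}')
print(f'HQIC: {model_fit.hqic:.3f}')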
# Plot residual errors
residuals = pd.DataFrame(model_fit.resid)
fig, ax = plt.subplots(1,2)
residuals.plot(title="Residuals", ax=ax[0])
residuals.plot(kind='kde', title='Density', ax=ax[1])
plt.show()
Residuals Plot (Left): This plot visualizes the residuals, which are the differences between the actual values and the fitted values from a
model. The residuals appear to fluctuate randomly around zero, which suggests that the model has captured the underlying structure of
the data well. There are no obvious patterns, indicating that the residuals are white noise and the model assumptions hold.
Density Plot (Right): This plot shows the kernel density estimate (KDE) of the residuals, which provides a smoothed probability
distribution of the residual values. The distribution is approximately centered around zero and appears fairly symmetric, suggesting that
the residuals follow a somewhat normal distribution.
In summary, while the residuals are centered around zero, the two peaks suggest that they do not follow a perfect normal distribution,
indicating potential room for model improvement or further investigation into the data.
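To quantify the "white noise" impression from the residual plot (a sketch, not in the original notebook), the Ljung-Box test can be run on
the residuals; p-values above 0.05 are consistent with uncorrelated residuals.
# A minimal sketch: Ljung-Box test for autocorrelation in the residuals up to lag 10
from statsmodels.stats.diagnostic import acorr_ljungbox

print(acorr_ljungbox(model_fit.resid, lags=[10]))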
The residual errors seem fine, with near-zero mean and uniform variance. Let's plot the in-sample fitted values obtained from
predict() against the actuals.
predict = model_fit.predict()
plt.plot(predict)
plt.show()
predict
[Output: the predict series (Name: predicted_mean, Length: 200, dtype: float64); values elided.]
# Convert 'predict' to a DataFrame
predict_df = pd.DataFrame(predict)
# If 'random_data' has the dates in its index, use that index for the 'date' column
predict_df['date'] = random_data['date']
predict_df
[Output: predict_df, 200 rows × 2 columns (predicted_mean, date); values elided.]
The code below converts the predict series into a DataFrame and sets the date column as the index of the DataFrame.
# Set 'date' column as the index
predict_df.set_index('date', inplace=True)
predict_df
[Output: predict_df indexed by date, 200 rows × 1 column (predicted_mean); values elided.]
random_data
[Output: random_data, 200 rows × 3 columns (date, value, diff_1; the first diff_1 entry is NaN); values elided.]
# Specify the filename and sheet name
filename = 'predictions_df.xlsx'
predict_df.to_excel(filename, sheet_name='Predictions')
# Read the saved predictions back in
predictions = pd.read_excel(filename)
predictions
[Output: predictions, 200 rows × 2 columns (date, predicted_mean); values elided.]
Plotting the Predictions vs Actual
This code creates a line plot comparing the original time series data with its predictions. It plots the original data in blue and the predicted
values in red with a dashed line for easy differentiation. The plot includes a title, axis labels, and a legend to clearly identify each series.
By visualizing the original data alongside the predictions, this plot helps assess how well the predicted values align with the actual data
over time.
import matplotlib.pyplot as plt
# The original series is in `random_data` and the predictions in `predictions`
plt.figure(figsize=(10, 6))
# Plot original data
plt.plot(random_data['date'], random_data['value'], label='Original Data', color='blue')
# Plot predicted data
plt.plot(predictions['date'], predictions['predicted_mean'], label='Predicted Data', color='red', linestyle='--')
# Add title and labels
plt.title('Original Data vs Predicted Data')
plt.xlabel('Date')
plt.ylabel('Value')
# Add legend
plt.legend()
# Show plot
plt.show()
The plot displayed shows a comparison between the original data and the predicted data from the time series model. The blue line
represents the original data, while the red dashed line indicates the predicted values. The predicted line closely follows the upward
trajectory of the original data, suggesting that the model has captured the overall trend well. Many of the fluctuations in the original data
are also reflected in the predictions, indicating that the model accounts for both the trend and the short-term variation. Keep in mind that
these are in-sample one-step-ahead predictions, so close alignment is expected; the train/test evaluation below gives a fairer picture of
out-of-sample forecasting performance.
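To put a number on the visual fit (a sketch, not in the original notebook), the in-sample errors between the actuals and the one-step-ahead
predictions can be summarized:
# A minimal sketch: in-sample error metrics for the fitted ARIMA(1,1,1) model
import numpy as np

errors = random_data['value'].values - predict.values
print(f'In-sample MAE:  {np.mean(np.abs(errors)):.3f}')
print(f'In-sample RMSE: {np.sqrt(np.mean(errors ** 2)):.3f}')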
from statsmodels.tsa.stattools import acf
# Create Training and Test
train = random_data.value[:85]
test = random_data.value[85:]
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
# Assuming train and test are defined DataFrames or Series
# Build Model
model = ARIMA(train, order=(3, 2, 1))
fitted = model.fit()
print(fitted.summary())
# Forecast the next 115 steps (the length of the test set: 200 - 85 observations)
forecast_object = fitted.get_forecast(steps=115)
# Extract forecast mean and confidence intervals
fc_series = forecast_object.predicted_mean # Forecasted values
conf_int = forecast_object.conf_int(alpha=0.05) # 95% confidence intervals
lower_series = conf_int.iloc[:, 0] # Lower bound of the confidence interval
upper_series = conf_int.iloc[:, 1] # Upper bound of the confidence interval
# Plot
plt.figure(figsize=(12, 5), dpi=100)
plt.plot(train, label='training')
plt.plot(test, label='actual')
plt.plot(fc_series, label='forecast')
plt.fill_between(fc_series.index, lower_series, upper_series,
color='k', alpha=.15)
plt.title('Forecast vs Actuals')
plt.legend(loc='upper left', fontsize=8)
plt.show()
                               SARIMAX Results
==============================================================================
Dep. Variable:                  value   No. Observations:                   85
Model:                 ARIMA(3, 2, 1)   Log Likelihood                -148.285
Date:                Wed, 28 Aug 2024   AIC                            306.570
Time:                        05:45:56   BIC                            318.664
Sample:                             0   HQIC                           311.428
                                 - 85
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.6885          -          -      0.000      -0.948      -0.429
ar.L2         -0.4211          -          -      0.005      -0.718      -0.124
ar.L3         -0.1579          -          -      0.189           -           -
ma.L1         -0.9996          -          -      0.913           -           -
sigma2         1.9266          -          -      0.913           -           -
==============================================================================
Ljung-Box (L1) (Q):              0.01   Jarque-Bera (JB):                 2.22
Prob(Q):                         0.94   Prob(JB):                         0.33
Heteroskedasticity (H):          1.15   Skew:                            -0.08
Prob(H) (two-sided):             0.71   Kurtosis:                         2.21
==============================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
ARIMA Model Stats Report
The summary is for a SARIMAX model (specifically an ARIMA model with three autoregressive terms, two differencing steps,
and one moving average term: ARIMA(3, 2, 1)) applied to a time series with 85 observations. Key points include:
Model Fit: The Log Likelihood is -148.285, and the model's Akaike Information Criterion (AIC) is 306.570, with a Bayesian Information
Criterion (BIC) of 318.664. These criteria help compare the fit of different models, with lower values generally indicating a better fit.
Coefficients: The first autoregressive (AR) term (ar.L1) has a coefficient of -0.6885, which is statistically significant (p-value < 0.001).
The second AR term (ar.L2) has a coefficient of -0.4211, also statistically significant (p-value = 0.005). The third AR term (ar.L3) has a
coefficient of -0.1579, which is not statistically significant (p-value = 0.189). The moving average (MA) term (ma.L1) has a coefficient of
-0.9996, which is not statistically significant (p-value = 0.913). The variance of the error term (sigma2) is estimated to be 1.9266, but it is
also not statistically significant (p-value = 0.913).
Diagnostics: The Ljung-Box test statistic for the first lag (Q) is 0.01 with a p-value of 0.94, indicating that there is no significant
autocorrelation in the residuals at lag 1. The Jarque-Bera test for normality of the residuals has a test statistic of 2.22 with a p-value of
0.33, suggesting that the residuals are normally distributed. The Heteroskedasticity (H) test indicates no significant evidence of
heteroskedasticity (p-value = 0.71).
Overall, the model seems to fit the data reasonably well, with significant AR terms, no significant autocorrelation in residuals, and
normally distributed residuals. However, the moving average term and the variance of the error term are not statistically significant.
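The coefficients and p-values discussed above can be pulled from the fitted results object directly; a minimal sketch:
# Read the coefficient table programmatically from the ARIMA(3,2,1) fit
print(fitted.params)   # coefficients: ar.L1-ar.L3, ma.L1, sigma2
print(fitted.pvalues)  # corresponding p-values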
Plot Summary
The plot displayed is a "Forecast vs Actuals" graph, which compares the forecasted data with actual data over time. The graph includes
three lines: "training" (the historical data used to fit the model), "actual" (the observed values in the test period), and "forecast" (the
model's predictions). The forecasted values closely follow the actual data, indicating that the model is performing well. The shaded band
around the forecast is the 95% confidence interval produced by conf_int(alpha=0.05), representing prediction uncertainty.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
# Assuming train and test are defined DataFrames or Series
# Build Model
model = ARIMA(train, order=(1, 2, 1))
fitted = model.fit()
print(fitted.summary())
# Forecast
forecast = fitted.get_forecast(steps=115)
forecast_df = forecast.summary_frame(alpha=0.05)  # 95% confidence interval
# Extract forecast mean and confidence intervals
fc_series = forecast_df['mean'] # Forecasted values
lower_series = forecast_df['mean_ci_lower'] # Lower bound of confidence interval
upper_series = forecast_df['mean_ci_upper'] # Upper bound of confidence interval
# Plot
plt.figure(figsize=(12, 5), dpi=100)
plt.plot(train, label='training')
plt.plot(test, label='actual')
plt.plot(fc_series, label='forecast')
plt.fill_between(fc_series.index, lower_series, upper_series,
color='k', alpha=.15)
plt.title('Forecast vs Actuals')
plt.legend(loc='upper left', fontsize=8)
plt.show()
                               SARIMAX Results
==============================================================================
Dep. Variable:                  value   No. Observations:                   85
Model:                 ARIMA(1, 2, 1)   Log Likelihood                -153.612
Date:                Wed, 28 Aug 2024   AIC                            313.225
Time:                        07:54:04   BIC                            320.481
Sample:                             0   HQIC                           316.140
                                 - 85
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.4839          -          -      0.000      -0.704      -0.264
ma.L1         -0.9995          -          -      0.909           -           -
sigma2         2.2213          -          -      0.909           -           -
==============================================================================
Ljung-Box (L1) (Q):              1.19   Jarque-Bera (JB):                 1.97
Prob(Q):                         0.28   Prob(JB):                         0.37
Heteroskedasticity (H):          1.10   Skew:                            -0.04
Prob(H) (two-sided):             0.80   Kurtosis:                         2.25
==============================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
ARIMA Model Stats Report
This output presents the results of a SARIMAX model, specifically an ARIMA(1, 2, 1) model, applied to a time series with 85
observations. Key points from the summary include:
Model Summary: The model's Log Likelihood is -153.612, which indicates how well the model fits the data. The Akaike Information
Criterion (AIC) is 313.225, and the Bayesian Information Criterion (BIC) is 320.481. These criteria are used to compare models, with
lower values generally indicating a better fit. The Hannan-Quinn Information Criterion (HQIC) is 316.140.
Coefficients: The autoregressive term (ar.L1) has a coefficient of -0.4839, which is statistically significant with a p-value < 0.001,
indicating that it contributes meaningfully to the model. The moving average term (ma.L1) has a coefficient of -0.9995, but it is not
statistically significant (p-value = 0.909), suggesting it may not contribute significantly to the model's performance. The variance of the
error term (sigma2) is estimated at 2.2213, which is also not statistically significant (p-value = 0.909).
Diagnostics: The Ljung-Box test statistic for the first lag (Q) is 1.19 with a p-value of 0.28, indicating no significant autocorrelation in the
residuals at lag 1. The Jarque-Bera test for normality of the residuals has a test statistic of 1.97 with a p-value of 0.37, suggesting that the
residuals are approximately normally distributed. The Heteroskedasticity (H) test indicates no significant evidence of heteroskedasticity,
with a p-value of 0.80.
Overall, the model's AR(1) term is statistically significant, while the MA(1) term and the variance of the error term are not, implying that
the AR(1) component is the main driver of the model's performance. The diagnostics suggest that the model residuals do not exhibit
significant autocorrelation, non-normality, or heteroskedasticity, indicating a reasonably well-fitting model. However, the AIC and BIC
values are higher than those of the ARIMA(3, 2, 1) fit above, suggesting that this model may not be the most efficient one available.
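A small sketch (not in the original notebook) that makes the AIC/BIC comparison between the two candidate orders explicit; lower
values indicate the preferred model:
# Refit both candidate orders on the training data and compare information criteria
from statsmodels.tsa.arima.model import ARIMA

for order in [(3, 2, 1), (1, 2, 1)]:
    fit = ARIMA(train, order=order).fit()
    print(f'ARIMA{order}: AIC={fit.aic:.3f}, BIC={fit.bic:.3f}')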
Testing The Model Accuracy
The provided Python code defines a function forecast_accuracy that calculates several accuracy metrics to evaluate the performance of
a forecasting model. The function takes in two inputs: forecast (the predicted values) and actual (the observed values), both of which are
converted to NumPy arrays for processing. It computes various metrics such as Mean Absolute Percentage Error (MAPE), Mean Error
(ME), Mean Absolute Error (MAE), Mean Percentage Error (MPE), Root Mean Square Error (RMSE), and the correlation coefficient
between the forecasted and actual values. Additionally, the function calculates the MinMax Error, which quantifies the error relative to the
range of values, and the first lag of the autocorrelation function (ACF1) of the residuals. These metrics provide a comprehensive
assessment of the model's forecasting accuracy. Finally, the function is applied to evaluate the forecast (fc_series.values) against the
actual test data (test.values).
import numpy as np
from statsmodels.tsa.stattools import acf

def forecast_accuracy(forecast, actual):
    forecast = np.array(forecast)  # Convert to NumPy array
    actual = np.array(actual)      # Convert to NumPy array
    mape = np.mean(np.abs(forecast - actual) / np.abs(actual))  # MAPE
    me = np.mean(forecast - actual)                             # ME
    mae = np.mean(np.abs(forecast - actual))                    # MAE
    mpe = np.mean((forecast - actual) / actual)                 # MPE
    rmse = np.mean((forecast - actual)**2)**.5                  # RMSE
    corr = np.corrcoef(forecast, actual)[0, 1]                  # Correlation
    # MinMax Error: one minus the average ratio of the element-wise min to max
    mins = np.amin(np.hstack([forecast[:, None], actual[:, None]]), axis=1)
    maxs = np.amax(np.hstack([forecast[:, None], actual[:, None]]), axis=1)
    minmax = 1 - np.mean(mins / maxs)
    acf1 = acf(forecast - actual)[1]                            # ACF of residuals at lag 1
    return {
        'mape': mape,
        'me': me,
        'mae': mae,
        'mpe': mpe,
        'rmse': rmse,
        'acf1': acf1,
        'corr': corr,
        'minmax': minmax,
    }

forecast_accuracy(fc_series.values, test.values)
{'mape': 0.0443,
 'me': -0.64,
 'mae': 1.08,
 'mpe': -0.0246,
 'rmse': 1.31,
 'acf1': -0.20,
 'corr': 0.97,
 'minmax': 0.0439}
Forecast Report
The forecast accuracy report presents various statistical measures that assess the performance of the forecasted data compared to the
actual values.
Here is a summary:
Mean Absolute Percentage Error (MAPE): 4.43%
– This indicates that, on average, the forecasted values are off by 4.43% from the actual values, which represents a relatively low
forecasting error.
Mean Error (ME): -0.64
– The negative value suggests that, on average, the forecasted values are slightly lower than the actual values.
Mean Absolute Error (MAE): 1.08
– This indicates that, on average, the absolute difference between the forecast and the actual values is 1.08 units.
Mean Percentage Error (MPE): -2.46%
– The negative value shows a slight tendency towards under-forecasting, with forecasts being, on average, 2.46% lower than the actual
values.
Root Mean Squared Error (RMSE): 1.31
– This is the square root of the average squared forecast error: the typical deviation between the forecasted and actual values is about
1.31 units, with larger errors weighted more heavily.
Autocorrelation of Errors at Lag 1 (ACF1): -0.20
– This indicates a slight negative autocorrelation in the forecast errors at lag 1, suggesting that the errors are slightly negatively
correlated with the previous period's errors.
Correlation (Corr): 0.97
– A high correlation between the forecasted and actual values (close to 1) suggests that the forecasts closely follow the actual data trend.
MinMax Error: 4.39%
– This indicates that the average ratio of the minimum to the maximum value between the forecasted and actual data points is 95.61%,
implying a low overall error in the forecast.
Overall Conclusion:
The forecast shows good accuracy, with a low error rate across various metrics. The high correlation (0.97) suggests that the forecast
aligns well with the actual values, while the low MAPE and MinMax Error reinforce the reliability of the forecast. The negative ME and
MPE indicate a slight underestimation, but the overall performance is robust.