Machine Learning for Predictive Pricing (Predicting House Prices)
This project utilizes machine learning algorithms to predict house prices based on a pool of
characteristics, ranging from the house's square footage to the number of bathrooms to how the house
was graded by the relevant authorities. It was originally completed as part of the final project for my
course, 'Data Analysis with Python', offered online by IBM; however, I expanded upon it to showcase
more of the skills and techniques learnt through the course and to widen its scope.
The dataset used here was taken from Kaggle.com, a popular website for finding and
publishing datasets. You can quickly access it by clicking here. It presents house sales in Seattle-King
County made between May 2014 and May 2015, consisting of different house characteristics and
the corresponding sale price for each house.
You can view each column and its description in the table below:
| Variable | Description |
|---|---|
| id | Unique ID for each house sold |
| date | Date of the house sale |
| price | Price of each house sold |
| bedrooms | Number of bedrooms |
| bathrooms | Number of bathrooms |
| sqft_living | Square footage of the house interior living space |
| sqft_lot | Square footage of the lot (land space) |
| floors | Number of house floors |
| waterfront | Whether a house is overlooking a waterfront (1) or not (0) |
| view | Rating of how good the house view is |
| condition | Rating of the overall house condition |
| grade | Overall grade given to the housing unit, based on King County grading system |
| sqft_above | Square footage of the interior housing space that is above ground level |
| sqft_basement | Square footage of the interior housing space that is below ground level |
| yr_built | Year the house was built |
| yr_renovated | Year when the house was last renovated |
| zipcode | Zip code |
| lat | Latitude coordinate |
| long | Longitude coordinate |
| sqft_living15 | Square footage of the interior housing living space for the closest 15 houses |
| sqft_lot15 | Square footage of the lot (land space) for the closest 15 houses |
To build a machine learning model that can predict house prices, the house attributes most
associated with price are identified, prepared and preprocessed, and finally used to train the model.
More specifically, different models are developed with the data, trained, evaluated, and improved,
before selecting the model that best accounts for the data and therefore proves best at
producing valid and reliable price predictions. Each model is tested and verified through in-sample
evaluation metrics, which assess the model's performance on the data fed to it; out-of-sample
evaluations, which estimate how the model is likely to perform in the real world with novel
datasets; and visualizations that compare the distributions of the predicted prices to the actual
prices in the dataset. Finally, the best model is selected and used to generate predictions.
Overall, the project is broken down into five parts:
1) Loading, Inspecting, and Cleaning the Data
2) Data Preparation and Preprocessing
3) Model Development and Evaluation
4) Hyperparameter Tuning
5) Model Prediction
The aim of this project is to demonstrate my abilities and coding skills to build, evaluate, and deploy
machine learning models for tasks such as predictive pricing.
In [ ]: #If you're using the executable notebook version, please run this cell first
#to install the necessary Python libraries for the task
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn
!pip install scikit-learn-intelex
In [2]: #Enabling Intel's support for scikit-learn to speed up machine learning algorithms
from sklearnex import patch_sklearn
patch_sklearn()
import warnings
warnings.simplefilter("ignore")
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
In [3]: #Importing the modules for use
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
#Adjusting data display options
pd.set_option('display.max_columns', 100)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
Part One: Loading, Inspecting, and Cleaning the Data
1. Loading and reading the dataset
In [4]: #Accessing the file
df = pd.read_excel("House Sales in King County.xlsx")
#Previewing the first 10 entries of the dataset
df.head(10)
Out[4]:

(Only the first 10 of 21 columns are shown; '-' marks a value that was not preserved.)

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | - | -T000000 | 221900 | 3.000 | 1.000 | 1180 | 5650 | 1.000 | 0 | 0 |
| 1 | - | -T000000 | 538000 | 3.000 | 2.250 | 2570 | 7242 | 2.000 | 0 | 0 |
| 2 | - | -T000000 | 180000 | 2.000 | 1.000 | 770 | 10000 | 1.000 | 0 | 0 |
| 3 | - | -T000000 | 604000 | 4.000 | 3.000 | 1960 | 5000 | 1.000 | 0 | 0 |
| 4 | - | -T000000 | 510000 | 3.000 | 2.000 | 1680 | 8080 | 1.000 | 0 | 0 |
| 5 | - | -T000000 | - | 4.000 | 4.500 | 5420 | 101930 | 1.000 | 0 | 0 |
| 6 | - | -T000000 | 257500 | 3.000 | 2.250 | 1715 | 6819 | 2.000 | 0 | 0 |
| 7 | - | -T000000 | 291850 | 3.000 | 1.500 | 1060 | 9711 | 1.000 | 0 | 0 |
| 8 | - | -T000000 | 229500 | 3.000 | 1.000 | 1780 | 7470 | 1.000 | 0 | 0 |
| 9 | - | -T000000 | 323000 | 3.000 | 2.500 | 1890 | 6560 | 2.000 | 0 | 0 |
2. Inspecting the Data
In [5]: shape = df.shape
print('Number of columns:', shape[1])
print('Number of rows:', shape[0])
Number of columns: 21
Number of rows: 21613
In [6]: #Inspecting the coloumn headers, data type, and entries count
df.info()
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   id             21613 non-null  int64
 1   date           21613 non-null  object
 2   price          21613 non-null  int64
 3   bedrooms       21600 non-null  float64
 4   bathrooms      21603 non-null  float64
 5   sqft_living    21613 non-null  int64
 6   sqft_lot       21613 non-null  int64
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64
 9   view           21613 non-null  int64
 10  condition      21613 non-null  int64
 11  grade          21613 non-null  int64
 12  sqft_above     21613 non-null  int64
 13  sqft_basement  21613 non-null  int64
 14  yr_built       21613 non-null  int64
 15  yr_renovated   21613 non-null  int64
 16  zipcode        21613 non-null  int64
 17  lat            21613 non-null  float64
 18  long           21613 non-null  float64
 19  sqft_living15  21613 non-null  int64
 20  sqft_lot15     21613 non-null  int64
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB
In [7]: #Get statistical summary of the dataset
df.describe()
Out[7]:

(Display truncated to the first nine numeric columns; '-' marks a value that was not preserved.)

| | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view |
|---|---|---|---|---|---|---|---|---|---|
| count | - | - | - | - | - | - | - | - | 21613 |
| mean | - | - | 3.373 | 2.116 | - | - | 1.494 | 0.008 | 0 |
| std | - | - | 0.927 | 0.769 | 918.441 | - | 0.540 | 0.087 | 0 |
| min | - | - | 1.000 | 0.500 | 290.000 | 520.000 | 1.000 | 0.000 | 0 |
| 25% | - | - | 3.000 | 1.750 | - | - | 1.000 | 0.000 | 0 |
| 50% | - | - | 3.000 | 2.250 | - | - | 1.500 | 0.000 | 0 |
| 75% | - | - | 4.000 | 2.500 | - | - | 2.000 | 0.000 | 0 |
| max | - | - | 33.000 | 8.000 | - | - | 3.500 | 1.000 | 4 |
In [8]: #Show distribution of values across the dataset
df.hist(figsize=(20, 12))
Out[8]:
[A 5×4 grid of histograms, one per numeric column, showing the distribution of values across the dataset]
3. Cleaning Up and Updating the Data
We can see some columns have differing entry counts, which means some data are missing or
invalid. Thus, I will check for any empty or inappropriate (non-numerical) entries in the data
and adjust them.
In [9]: #Reporting the total count of empty/NaN (not-a-number) values for each column
print('Number of empty/NaN entries per column:')
df.isnull().sum()
Number of empty/NaN entries per column:
Out[9]:
id                0
date              0
price             0
bedrooms         13
bathrooms        10
sqft_living       0
sqft_lot          0
floors            0
waterfront        0
view              0
condition         0
grade             0
sqft_above        0
sqft_basement     0
yr_built          0
yr_renovated      0
zipcode           0
lat               0
long              0
sqft_living15     0
sqft_lot15        0
dtype: int64
In [10]: #Now replacing null entries with the mean value for the affected columns
#calculating the mean value for column 'bedrooms'
mean_bedrooms = df['bedrooms'].mean()
#replacing null entries with the mean value
df['bedrooms'].replace(np.nan, mean_bedrooms, inplace=True)
#calculating the mean value for column 'bathrooms'
mean_bathrooms = df['bathrooms'].mean()
#replacing null entries with the mean value
df['bathrooms'].replace(np.nan, mean_bathrooms, inplace=True)
#Previewing the columns that had null values again
print("Number of null values for the column 'bedrooms':", df['bedrooms'].isnull().sum())
print("Number of null values for the column 'bathrooms':", df['bathrooms'].isnull().sum())
Number of null values for the column 'bedrooms': 0
Number of null values for the column 'bathrooms': 0
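Side note: `replace(np.nan, ..., inplace=True)` works here, but column-wise in-place replacement is discouraged in recent pandas versions. A minimal equivalent sketch using `fillna` (same DataFrame `df`, same mean imputation) would be:

```python
#Equivalent mean imputation with fillna, one column at a time
for col in ['bedrooms', 'bathrooms']:
    df[col] = df[col].fillna(df[col].mean())
```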
Part Two: Data Preparation and Preprocessing
In this section, I will identify the subset of data to be used for building and evaluating the model and
make the necessary adjustments to prepare it for use and analysis.
1. Identifying the Variables
First, we want to identify the house attributes that best predict a house's price; these will be the features
with which we train the model to generate price predictions. One way of doing so
is to perform correlational analysis and select the attributes that are most correlated with price.
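For reference, `df.corr()` computes the Pearson correlation coefficient for every pair of variables $x$ and $y$:

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Values close to 1 or -1 indicate a strong linear association with price, whilst values close to 0 indicate a weak one.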
In [11]: #Checking the correlations between all variables in the dataset
correlations_table = df.corr()
correlations_table
Out[11]:

(Only the first 10 of 20 columns are shown.)

| | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 1.000 | -0.017 | 0.001 | 0.005 | -0.012 | -0.132 | 0.019 | -0.003 | 0.012 | -0.024 |
| price | -0.017 | 1.000 | 0.309 | 0.526 | 0.702 | 0.090 | 0.257 | 0.266 | 0.397 | 0.036 |
| bedrooms | 0.001 | 0.309 | 1.000 | 0.514 | 0.578 | 0.032 | 0.178 | -0.007 | 0.080 | 0.027 |
| bathrooms | 0.005 | 0.526 | 0.514 | 1.000 | 0.755 | 0.088 | 0.502 | 0.064 | 0.188 | -0.126 |
| sqft_living | -0.012 | 0.702 | 0.578 | 0.755 | 1.000 | 0.173 | 0.354 | 0.104 | 0.285 | -0.059 |
| sqft_lot | -0.132 | 0.090 | 0.032 | 0.088 | 0.173 | 1.000 | -0.005 | 0.022 | 0.075 | -0.009 |
| floors | 0.019 | 0.257 | 0.178 | 0.502 | 0.354 | -0.005 | 1.000 | 0.024 | 0.029 | -0.264 |
| waterfront | -0.003 | 0.266 | -0.007 | 0.064 | 0.104 | 0.022 | 0.024 | 1.000 | 0.402 | 0.017 |
| view | 0.012 | 0.397 | 0.080 | 0.188 | 0.285 | 0.075 | 0.029 | 0.402 | 1.000 | 0.046 |
| condition | -0.024 | 0.036 | 0.027 | -0.126 | -0.059 | -0.009 | -0.264 | 0.017 | 0.046 | 1.000 |
| grade | 0.008 | 0.667 | 0.357 | 0.665 | 0.763 | 0.114 | 0.458 | 0.083 | 0.251 | -0.145 |
| sqft_above | -0.011 | 0.606 | 0.479 | 0.686 | 0.877 | 0.184 | 0.524 | 0.072 | 0.168 | -0.158 |
| sqft_basement | -0.005 | 0.324 | 0.303 | 0.283 | 0.435 | 0.015 | -0.246 | 0.081 | 0.277 | 0.174 |
| yr_built | 0.021 | 0.054 | 0.156 | 0.507 | 0.318 | 0.053 | 0.489 | -0.026 | -0.053 | -0.361 |
| yr_renovated | -0.017 | 0.126 | 0.018 | 0.051 | 0.055 | 0.008 | 0.006 | 0.093 | 0.104 | -0.061 |
| zipcode | -0.008 | -0.053 | -0.154 | -0.205 | -0.199 | -0.130 | -0.059 | 0.030 | 0.085 | 0.003 |
| lat | -0.002 | 0.307 | -0.010 | 0.024 | 0.053 | -0.086 | 0.050 | -0.014 | 0.006 | -0.015 |
| long | 0.021 | 0.022 | 0.131 | 0.225 | 0.240 | 0.230 | 0.125 | -0.042 | -0.078 | -0.107 |
| sqft_living15 | -0.003 | 0.585 | 0.393 | 0.569 | 0.756 | 0.145 | 0.280 | 0.086 | 0.280 | -0.093 |
| sqft_lot15 | -0.139 | 0.082 | 0.030 | 0.088 | 0.183 | 0.719 | -0.011 | 0.031 | 0.073 | -0.003 |
In [12]: #Showing the correlations with house price only (from highest to lowest)
correlations_ByPrice = df.corr()['price'].sort_values(ascending=False)
correlations_ByPrice
Out[12]:
price            1.000
sqft_living      0.702
grade            0.667
sqft_above       0.606
sqft_living15    0.585
bathrooms        0.526
view             0.397
sqft_basement    0.324
bedrooms         0.309
lat              0.307
waterfront       0.266
floors           0.257
yr_renovated     0.126
sqft_lot         0.090
sqft_lot15       0.082
yr_built         0.054
condition        0.036
long             0.022
id              -0.017
zipcode         -0.053
Name: price, dtype: float64
Based on the correlational analysis, I'll use the top 10 variables most correlated with price to develop and train
the model; these are: 'sqft_living', 'grade', 'sqft_above', 'sqft_living15', 'bathrooms', 'view', 'sqft_basement',
'bedrooms', 'lat', and 'waterfront'.
These will be the independent or 'predictor' variables, whilst house price will be the dependent or 'target'
variable. The model will use the predictors to predict the target.
Selecting the predictor and target variables
In [13]: #Now selecting the variables for training the model
#specifying the independent/predictor variables
predictors = ['sqft_living', 'grade', 'sqft_above', 'sqft_living15', 'bathrooms', 'view',
              'sqft_basement', 'bedrooms', 'lat', 'waterfront']
#assigning the variables to 'x_data'
x_data = df[predictors]
#specifying the dependent/target variable and assigning it to 'y_data'
y_data = df['price']
2. Data Splitting
Now that we have our data, the next step is to split into a training set, for developing and training the
model, and a testing set, for testing the model. Data splitting will allow us to accurately evaluate the model
and estimate how likely it is to perform in the real world with novel datasets. As such, I will be preserving
75% of the dataset for training and 25% for testing. The testing set will not be used for any training so that
it is considered to be novel, previously unseen data which we use to estimate how the model is likely to
function in the real world, thus giving us an estimate of the model's generalization error.
In [14]: #Performing data splitting to obtain a training set (75%) and testing set (25%)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, train_size=0.75, random_state=0)  #fixed random seed (value assumed) for reproducibility
#We can check the sizes of the training and testing sets
print('Number of training samples:', x_train.shape[0])
print('Number of testing samples:', x_test.shape[0])
Number of training samples: 16209
Number of testing samples: 5404
3. Feature Scaling: Standardizing the Scales
Lastly, given that the variables have varying scales, with values falling into vastly different ranges,
some of which show skewed distributions (see the histograms above), we need to ensure that this
diversity of scales doesn't influence the analysis and/or the model predictions. One of the best ways to
control for this issue is to standardize the scales so that the data has the properties of a standard
normal distribution. This means the values across the different variables will all fall into the same
value range, with a mean of 0 and a standard deviation of 1.
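Concretely, `StandardScaler` transforms each feature value $x$ using the mean $\mu$ and standard deviation $\sigma$ of that feature, computed from the training set only:

$$z = \frac{x - \mu}{\sigma}$$

Fitting the scaler on the training set and merely applying it to the testing set (as in the cell below) prevents information about the test data from leaking into training.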
In [15]: #Standardizing the scales
#Get a scaler object
Scaler = StandardScaler()
#fitting and scaling the training set
x_train = Scaler.fit_transform(x_train)
#scaling the testing set
x_test = Scaler.transform(x_test)
Now the data is ready for processing and model building...
Part Three: Model Development and Evaluation
In trying to develop and select the best model for the data, I will compare two models: a linear
regression model and a non-linear, polynomial regression one. Whilst the former looks at the linear
relationship between the independent variables and the dependent variable, the latter examines a
non-linear or curvilinear relationship between the predictors and the target. Thus, I will develop two
models to assess which type of relationship is more appropriate for the data, the first being a multiple
linear regression model and the second a multivariate polynomial regression one, and
accordingly select the model that best fits the data.
I will first train each model using the training set we obtained earlier and then assess it separately
using the testing set, before using the model to generate predictions. To evaluate the performance of
the models, I will use the R-squared metric, also known as the coefficient of determination, which tells us
how much of the variance in the dependent/target variable (price) is accounted for and explained by
the model (i.e. the independent/predictor variables).
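For reference, R-squared compares the model's squared prediction errors against the total variance of the target around its mean:

$$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$$

A score of 1 means perfect predictions, whilst a score of 0 means the model does no better than always predicting the mean price.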
MODEL ONE: MULTIPLE LINEAR REGRESSION MODEL
A multiple linear regression model is a type of model that depicts the linear relationship between multiple
independent or predictor variables and the dependent or target variable. It tries to capture and explain all
the variance in the data via a linear function.
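As a sketch, with our ten predictors the fitted model is an intercept plus one coefficient per feature:

$$\widehat{\text{price}} = b_0 + b_1 \cdot \text{sqft\_living} + b_2 \cdot \text{grade} + \dots + b_{10} \cdot \text{waterfront}$$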
In [16]: #Creating a regression object
multireg_model = LinearRegression()
#Training the model with training data (i.e. fitting the model)
multireg_model.fit(x_train, y_train)
#Evaluating the model with the testing set using R-squared
R2_test = multireg_model.score(x_test, y_test)
print(f'The R-squared score for the multiple regression model is: r2={round(R2_test,3)}')
The R-squared score for the multiple regression model is: r2=0.644
As indicated by the output, we can conclude that approximately 64% of the variance in house price is
explained by the model, i.e. by the selected house attributes.
Now that we have fitted the model, we can generate price predictions.
In [17]: #Generating predictions using the testing set
Y_pred = multireg_model.predict(x_test)
#We can compare the actual prices vs. predicted prices
Actual_vs_Predicted = pd.concat([pd.Series(y_test.values), pd.Series(Y_pred)],
                                axis=1, keys=['Actual Prices', 'Predicted Prices'])  #column labels restored from the output below
Actual_vs_Predicted['Actual Prices'] = Actual_vs_Predicted['Actual Prices'].apply(lambda x: '${:,.2f}'.format(x))
Actual_vs_Predicted['Predicted Prices'] = Actual_vs_Predicted['Predicted Prices'].apply(lambda x: '${:,.2f}'.format(x))
#Previewing the first 10 price comparisons
Actual_vs_Predicted.head(10)
Out[17]:

| | Actual Prices | Predicted Prices |
|---|---|---|
| 0 | $297,000.00 | $483,512.81 |
| 1 | $1,578,000.00 | $1,390,213.10 |
| 2 | $562,100.00 | $469,294.88 |
| 3 | $631,500.00 | $463,164.22 |
| 4 | $780,000.00 | $1,082,358.68 |
| 5 | $485,000.00 | $461,257.02 |
| 6 | $340,000.00 | $311,981.57 |
| 7 | $335,606.00 | $438,062.19 |
| 8 | $425,000.00 | $601,306.92 |
| 9 | $490,000.00 | $1,294,538.81 |
We can see from the price comparisons that some predictions are close to the actual prices, whilst others differ by
quite a large margin. We can get an exact value for how much the predicted prices deviate from the actual
prices on average by calculating the root mean squared error.
Model Evaluation: Root Mean Squared Error
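RMSE is the square root of the average squared difference between the actual and predicted prices, expressed in the same unit as the target (dollars):

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$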
In [18]: #First, calculating the mean squared error (MSE)
MSE = mean_squared_error(y_test, Y_pred)
#Calculating the root MSE (RMSE)
RMSE = np.sqrt(MSE)
print(f'The root mean squared error is: RMSE={round(RMSE,3)}')
The root mean squared error is: RMSE≈217,566
The resulting RMSE score indicates that the predicted prices deviate from the actual prices by approximately
$217,566 on average. We can do much better still.
Model Evaluation: Distribution Plot
We can also visualize the discrepancy between the actual prices and the predicted prices using a distribution
plot (based on kernel density estimation) to get better insight and understanding of where our model falls
short.
In [19]: #Visualizing the distribution of actual vs. predicted prices
#Creating the distribution plot
ax1 = sns.distplot(y_test, hist=False, label='Actual Values')
sns.distplot(Y_pred, ax=ax1, hist=False, label='Predicted Values')
#Adding a title and labeling the axes
plt.title('Actual vs. Predicted Values for House Prices')
plt.xlabel('House Price (in USD)', fontsize=12)
plt.ylabel('Distribution density of price values', fontsize=12)
plt.legend(loc='best')
#Adjusting the x-axis to display the prices in a reader-friendly format
plt.gcf().axes[0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.xticks(rotation=90)
#Displaying the distribution plot
plt.show()
As can be gathered from the plot, the model tracks the actual house prices fairly well, but only in the
lower price ranges; it falls short completely in the higher ranges, especially for houses above $2.5 million.
Thus, the model seems quite underfitted to the data. Let's see how a non-linear/polynomial model
fits the data.
MODEL TWO: Multivariate Polynomial Regression Model
A polynomial regression model depicts a non-linear or 'curvilinear' relationship between the independent or
predictor variables and the dependent or target variable. It tries to capture and explain the variance in the
data via a non-linear function. Note, however, that polynomial models have different polynomial 'orders' or
'degrees'. The order controls the degree of 'curvature' of the regression line: data with high or very high
variance may require higher polynomial orders to capture, whilst simpler data can be fitted with a small
order; an order that is too high, though, risks overfitting the training data. The task here, thus, is to figure
out the optimal polynomial order for the model, particularly in reference to the testing set; after all, we
ultimately need the model to perform well in the real world, not just on our particular training set.
As such, I will employ a loop as well as k-fold cross-validation to iterate over different models and test out
different polynomial orders, in order to select the model with the most optimal order. I will use the training
set for training each model and for cross-validation, before estimating the model's final performance
separately using the testing set. I will again use the R-squared metric to evaluate the models' performances.
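To make the transformation concrete, here is a minimal standalone sketch (with made-up numbers) of what `PolynomialFeatures` does: a degree-2 transform expands two features $[x_1, x_2]$ into the terms $1, x_1, x_2, x_1^2, x_1 x_2, x_2^2$, so the linear regression fitted on top can capture curvature and feature interactions:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])           #one sample with two features, x1=2 and x2=3
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))         #[[1. 2. 3. 4. 6. 9.]] -> 1, x1, x2, x1^2, x1*x2, x2^2
```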
In [20]: #First, specifying the polynomial orders to test out
poly_orders = [2,3,4,5]  #up to five polynomial orders
#Now testing out different orders using cross validation to select the best
cv_scores = {}
for order in poly_orders:
    #creating a polynomial features object
    poly_features = PolynomialFeatures(degree=order)
    #transforming the predictor variables to polynomial features
    x_train_poly = poly_features.fit_transform(x_train)
    #creating a regression object
    polyreg_model = LinearRegression()
    #Now using 10-fold cross validation to determine the best polynomial order
    r2_scores = cross_val_score(polyreg_model, x_train_poly, y_train, cv=10)
    #Retrieving the mean R-squared score for the given polynomial order
    cv_scores[order] = np.mean(r2_scores)
In [21]: #Selecting the best polynomial order
best_order, best_score = None, None
for order, score in cv_scores.items():
    if best_score is None or best_score < score:
        if score > 0:
            best_score = score
            best_order = order
#Reporting the best model with the most optimal polynomial order
print(f'The best model for the data has a polynomial order of {best_order}, and R-squared score of: r2={round(best_score,3)}')
The best model for the data has a polynomial order of 2, and R-squared score of: r2=0.733
Based on the cross-validation results, it seems the best polynomial order for the model is 2, corresponding
to an r-squared score of about 0.73, which means that approximately 73% of the variance in house prices is
explained and accounted for by this model. This is much better than the score obtained earlier with the
multiple linear regression model. Let's see how the model performs on the testing set to get the best
estimate of how reliable it is.
Model Testing
Having now trained and validated the model, I will refit it with the best polynomial order, 2,
generate predictions, and then test it one final time using the testing set to get the best estimate of its
performance in the real world.
In [22]: #Creating a polynomial features object
poly_features = PolynomialFeatures(degree=best_order)
#transforming the predictor variables to polynomial features
x_train_poly = poly_features.fit_transform(x_train)
x_test_poly = poly_features.transform(x_test)  #transform only; the features were already fitted on the training set
#fitting the model
polyreg_model = LinearRegression()
polyreg_model.fit(x_train_poly, y_train)
#Testing the model with the test set (using r-squared)
R2_test = polyreg_model.score(x_test_poly, y_test)
print(f'The r-squared score for the testing set is: r2={round(R2_test,3)}')
The r-squared score for the testing set is: r2=0.733
Evaluating the model using the testing data produced identical results. The resulting r-squared score
indicates that approximately 73% of the variance in house prices can be related back to and explained by
the house attributes chosen as predictors. This is still much better than the 0.64 obtained with the linear
model. Let's do further evaluations.
Model Evaluation: Root Mean Squared Error
In [23]: #Generating predictions using both sets
Y_pred_train = polyreg_model.predict(x_train_poly)
Y_pred_test = polyreg_model.predict(x_test_poly)
#Calculating root mean squared error for both sets to compare them
MSE_train = mean_squared_error(y_train, Y_pred_train)
RMSE_train = np.sqrt(MSE_train)
MSE_test = mean_squared_error(y_test, Y_pred_test)
RMSE_test = np.sqrt(MSE_test)
print('RMSE for training set: {:,.3f}'.format(RMSE_train))
print('RMSE for testing set: {:,.3f}'.format(RMSE_test))
RMSE for training set: 183,534.011
RMSE for testing set: 188,428.897
We can see from the RMSE scores that the discrepancy between the predicted prices and the actual prices
is quite similar for the in-sample (training) data and the out-of-sample (testing) data. This indicates that the
model is producing reliable predictions. Further, for house prices that can range into the millions, a deviation
of about $188,429 (the RMSE for the test set) is not unrealistic. We can visualize the discrepancy in prices again
using a distribution plot.
Model Evaluation: Distribution Plot
In [24]: #Setting the characteristics of the plots
fig, axes = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(10, 5))
#Visualizing model fitting for the training set
ax1 = sns.distplot(y_train, hist=False, ax=axes[0], label='Actual Values (Training)')
sns.distplot(Y_pred_train, hist=False, ax=ax1, label='Predicted Values (Training)')
#Visualizing model fitting for testing set
ax2 = sns.distplot(y_test, hist=False, ax=axes[1], label='Actual Values (Testing)')
sns.distplot(Y_pred_test, hist=False, ax=ax2, label='Predicted Values (Testing)')
#Adding titles and labeling the axes
fig.suptitle('Model performance in-sample vs. out-of-sample')
axes[0].set_title('Model fitting with training set')
axes[0].set_xlabel('House Price (in USD)')
axes[0].set_ylabel('Distribution density of price values')
axes[0].legend(loc='best')
axes[1].set_title('Model fitting with testing set')
axes[1].set_xlabel('House Price (in USD)')
axes[1].set_ylabel('Distribution density of price values')
axes[1].legend(loc='best')
#Adjusting the x-axis to display the prices in a reader-friendly format
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=90)
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=90)
plt.gcf().axes[0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.gcf().axes[1].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
#show plot
plt.show()
The distribution densities look very similar for both sets. This indeed indicates that the model is able to
perform well in the real world, as with the in-sample data. However, we can see from the graphs that the
model seems to be over-predicting house prices in the low range ($0 to ~$1,500,000), and isn't very reliable in
predicting prices in the higher ranges, between $4,000,000 and $7,000,000.
Let's see if we can improve the performance of the model even further. To do so, I will perform
hyperparameter tuning, particularly L1 Regularization with Lasso regression.
Part Four: Hyperparameter Tuning
To improve the model's predictions further, I will perform L1 regularization using lasso regression. Lasso
regression introduces a new hyperparameter to the model, 'alpha', which regularizes the coefficients by
shrinking them. The coefficients are shrunk just enough, as determined through cross-validation, to optimize
the model for better predictions across different datasets. Thus, regularizing the coefficients leads to
better model performance in the real world with new datasets, decreasing the generalization error.
Alpha can take different values: the higher the alpha value, the more the coefficients are shrunk.
One advantage of lasso regression is that tuning alpha appropriately allows us to
optimize the model whilst avoiding both underfitting and overfitting.
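Formally, lasso minimizes the usual sum of squared errors plus an L1 penalty on the coefficients, and it is this penalty, weighted by alpha, that shrinks some coefficients (possibly all the way to zero):

$$\min_{b}\ \sum_{i=1}^{n}\Big(y_i - b_0 - \sum_{j} b_j x_{ij}\Big)^2 + \alpha \sum_{j} |b_j|$$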
Further, lasso regression can be particularly useful for controlling the potential problem of 'multicollinearity'.
Multicollinearity arises when the predictor variables used to train the model exhibit high correlations
among themselves, as is evident from the correlations table presented above. Multicollinearity can
potentially hinder the performance of the model, and lasso regression helps ensure that it doesn't
affect the model's prediction validity. We can take a second look at the correlations between the current
predictors using a heatmap.
In [25]: #Plotting a heatmap
#Specifying the figure size
plt.figure(figsize=(8,4))
mask = np.triu(np.ones_like(df[predictors].corr(), dtype=bool))
sns.heatmap(df[predictors].corr(), annot=True, mask=mask, cmap='Blues', vmin=-1, vmax=1)
plt.title('Correlation Coefficient Of Predictors', fontsize=14)
plt.show()
We can see from the heatmap high correlations among multiple predictors. In particular, the correlation
coefficient exceeds 0.7 between the living space square footage ('sqft_living') and each of the house grade,
the neighbouring houses' living space ('sqft_living15'), the above-ground square footage ('sqft_above'), and
the number of bathrooms. Accordingly, we can suspect multicollinearity in the data. Performing lasso
regression will mitigate this problem by reducing model complexity, making sure that the correlation
between the predictors doesn't affect the model's reliability.
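As a side note, a common way to quantify multicollinearity directly is the variance inflation factor (VIF). The sketch below is not part of the original analysis and assumes the statsmodels package is available; VIF values above roughly 5-10 are usually read as problematic:

```python
#Hypothetical check: one VIF value per predictor (requires statsmodels)
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = add_constant(df[predictors])  #VIF is computed on a design matrix with an intercept
vifs = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                 index=X.columns)
print(vifs.drop('const'))
```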
MODEL DEVELOPMENT: Polynomial Lasso Regression Model
As mentioned, lasso regression can regularize coefficients using different alpha values. The task here is to
find the most optimal value for alpha. To do so, I will apply a cross-validation technique, 'grid search', which
tests out different values for the model's hyperparameters across different iterations and reports back the
best-performing model along with the corresponding hyperparameter values. These will be the
best hyperparameters.
As such, I will use it to determine the best alpha value as well as the best polynomial order for this new
model. Furthermore, to facilitate processing this time around, I will build a 'pipeline' which takes the data,
performs polynomial transform, and fits a lasso regression model automatically without having to write the
code for each step separately. I will again build the model and perform cross validation with the training
data, before running a final test using the testing data.
In [26]: #Creating a pipeline to automate model development
#Specifying the steps
pipe_steps = [('Polynomial', PolynomialFeatures()),  #to perform a polynomial transform
              ('Model', Lasso())]                    #to develop the lasso regression model
#Creating the pipeline to build a lasso regression model with polynomial features
lasso_model = Pipeline(pipe_steps)

#Grid Search
#Now performing grid search to obtain the best polynomial order and alpha value
#specifying the hyperparameters to test out
parameters = {'Polynomial__degree': [2,3,4,5],  #the polynomial orders to test
              'Model__alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]}  #the alpha values to test
#Creating a grid object and specifying the cross-validation characteristics
Grid = GridSearchCV(lasso_model, parameters, scoring='r2', cv=10)
#Fitting the model with the training data for cross validation
Grid.fit(x_train, y_train)
#Reporting the results of the cross validation (best polynomial order, alpha, and r2 score)
best_order = Grid.best_params_['Polynomial__degree']
best_alpha = Grid.best_params_['Model__alpha']
best_r2 = Grid.best_score_
print(f'The best model has a polynomial order of {best_order}, alpha value of: alpha={best_alpha}, and r-squared score of r2={round(best_r2,2)}')
The best model has a polynomial order of 3, alpha value of: alpha=10000, and r-squared score of r2=0.74
Unlike what we saw earlier, before hyperparameter tuning, the best polynomial order for this model is 3.
Further, the model's performance, as indicated by the r-squared score, didn't decrease, remaining around 0.74,
which means the model still accounts for about 74% of the variance in house prices. Let's see what the final
score will be on the out-of-sample testing set.
Model Testing
Now we can test the model one final time using the testing set.
In [27]: #First, extracting the model with the best parameters
Lasso_Model = Grid.best_estimator_
#Calculating the R-squared score for the model using the testing set
R2_test = Lasso_Model.score(x_test, y_test)
print(f'The r-squared score for the testing set is: r2={round(R2_test,3)}')
The r-squared score for the testing set is: r2=0.767
We can see here that the resulting r-squared score (0.767) has improved! The model proved to be even better
on the out-of-sample test data. Our model now accounts for approximately 77% of the variance in house prices,
i.e., 77% of the price variance can be related back to the predictor house attributes we developed the model
with. Thus, the polynomial lasso regression model seems to be the best one so far, and likely the better-performing
one in the real world with novel datasets. We can again calculate the RMSE values to evaluate
the model's predictions and visualize its results to get better insight into its strengths and weaknesses.
Model Evaluation: Root Mean Squared Error
In [28]: #Generating predictions using both sets
Y_pred_train = Lasso_Model.predict(x_train)
Y_pred_test = Lasso_Model.predict(x_test)
#Calculating root mean squared error for both sets to compare them
MSE_train = mean_squared_error(y_train, Y_pred_train)
RMSE_train = np.sqrt(MSE_train)
MSE_test = mean_squared_error(y_test, Y_pred_test)
RMSE_test = np.sqrt(MSE_test)
print('RMSE for training set: {:,.3f}'.format(RMSE_train))
print('RMSE for testing set: {:,.3f}'.format(RMSE_test))
RMSE for training set: 174,505.316
RMSE for testing set: 176,024.745
We can see here too that the RMSE values improved for both the in-sample training set and the out-of-sample
testing set. The discrepancy between the predicted house prices and the actual prices went down to an
average of about $176,000 (on the testing set), which, again, for house prices that can range into the millions
of dollars, is a fairly good result. Let's look again at the distribution density of the predicted vs. actual prices.
Model Evaluation: Distribution Plot
In [29]: #Setting the characteristics of the plots
fig, axes = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(12, 5))
#Visualizing model fitting for the training set
ax1 = sns.distplot(y_train, hist=False, ax=axes[0], label='Actual Values (Training)')
sns.distplot(Y_pred_train, hist=False, ax=ax1, label='Predicted Values (Training)')
#Visualizing model fitting for testing set
ax2 = sns.distplot(y_test, hist=False, ax=axes[1], label='Actual Values (Testing)')
sns.distplot(Y_pred_test, hist=False, ax=ax2, label='Predicted Values (Testing)')
#Adding titles and labeling the axes
fig.suptitle('Model performance in-sample vs. out-of-sample')
axes[0].set_title('Lasso Model fitting with training set')
axes[0].set_xlabel('House Price (in USD)')
axes[0].set_ylabel('Distribution density of price values')
axes[0].legend(loc='best')
axes[1].set_title('Lasso Model fitting with testing set')
axes[1].set_xlabel('House Price (in USD)')
axes[1].set_ylabel('Distribution density of price values')
axes[1].legend(loc='best')
#Adjusting the x-axis to display the prices in a reader-friendly format
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=90)
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=90)
plt.gcf().axes[0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.gcf().axes[1].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
#show plot
plt.show()
Indeed, the performance of the model in-sample and out-of-sample is quite similar, and improved overall. We
can see from the graphs that the model is now better able to track and produce accurate
predictions in the higher price ranges ($4,000,000 to ~$5,500,000), which wasn't the case earlier, before
hyperparameter tuning.
Thus, we can conclude that the polynomial lasso regression model is the best model for our house sales
dataset. We can now develop the final model with the whole dataset, give it the optimal hyperparameters
obtained from the previous evaluations, and use the model to perform predictive pricing, generating novel
predictions from new data.
Part Five: Model Prediction
For this section, I will create a custom function that takes a set of data comprising different house
characteristics, and based on it the final model will be used to produce a price prediction that best suits
each set of characteristics. But first, I will develop the final model again with the whole dataset, before using
it to generate predictions. Again, to automate the process of model building and training, I will create a
pipeline. The pipeline will take the data and (i) standardize the predictors' scales; (ii) perform a polynomial
transform (with a polynomial order of 3) on the predictors to turn them into polynomial features; and lastly,
(iii) build a polynomial lasso regression model (setting alpha at 10000) with the standardized and
transformed data.
In [30]: #Specifying the pipeline process
pipeline_steps = [('Scaler', StandardScaler()),
('Polynomial', PolynomialFeatures(degree=3)),
('Model', Lasso(alpha=10000))]
#Building the pipeline for the lasso regression model
Model = Pipeline(pipeline_steps)
#Fitting the model with the entire dataset
Model.fit(x_data, y_data)
Out[30]:
Pipeline(steps=[('Scaler', StandardScaler()),
('Polynomial', PolynomialFeatures(degree=3)),
('Model', Lasso(alpha=10000))])
We are now ready to use the model to generate predictions...
Generating Predictions
Now I will create a custom function, MakePrediction(), that takes novel data of different house
characteristics and uses the model to generate predictions that best suit the new house characteristics
passed to the function.
In [31]: #Defining the function
def MakePrediction(model, X_vars):
    """This function takes two inputs: 'model', which specifies the model to be used to make the predictions,
    and 'X_vars', which specifies the house characteristics for each house to make the predictions from. It
    runs the prediction-making process and returns a table with the predicted prices for each house."""
    Y_pred = model.predict(X_vars)
    #formatting the predictions as currency and returning them as a one-column table
    Y_pred_df = pd.Series(Y_pred, name='Predicted Prices').to_frame().apply(lambda series: series.apply(lambda x: '${:,.2f}'.format(x)))
    return Y_pred_df
To test the function, I will extract a random sample from the original dataset, given that there's no new data
available. Thus, I will create a new dataframe with a sample of 20 data points taken at random from the
original dataframe, and then pass it to the MakePrediction() function along with the final model
to generate novel price predictions based on these characteristics.
In [32]: #Extracting a random sample from the data and assigning it to 'X_new'
X_new = x_data.sample(20)  #number of samples = 20
#Previewing the sample
X_new
Out[32]:

| | sqft_living | grade | sqft_above | sqft_living15 | bathrooms | view | sqft_basement | bedrooms | lat | waterfront |
|---|---|---|---|---|---|---|---|---|---|---|
| 5118 | 2300 | 7 | 1650 | 2300 | 2.500 | 3 | 650 | 4.000 | 47.413 | 0 |
| 15394 | 2460 | 8 | 2460 | 2760 | 1.750 | 0 | 0 | 4.000 | 47.726 | 0 |
| 789 | 1800 | 7 | 1800 | 1800 | 2.500 | 0 | 0 | 3.000 | 47.529 | 0 |
| 12119 | 2180 | 8 | 1480 | 2180 | 1.750 | 0 | 700 | 3.000 | 47.759 | 0 |
| 19071 | 1800 | 7 | 1320 | 1890 | 2.750 | 0 | 480 | 3.000 | 47.568 | 0 |
| 1453 | 1330 | 5 | 1330 | 1150 | 2.000 | 0 | 0 | 4.000 | 47.496 | 0 |
| 7710 | 3920 | 9 | 2900 | 2540 | 4.250 | 0 | 1020 | 5.000 | 47.587 | 0 |
| 11598 | 4170 | 11 | 4170 | 4560 | 3.500 | 0 | 0 | 4.000 | 47.528 | 0 |
| 5215 | 1990 | 7 | 1340 | 1750 | 2.750 | 0 | 650 | 3.000 | 47.743 | 0 |
| 11337 | 1170 | 7 | 1170 | 1180 | 1.750 | 0 | 0 | 3.000 | 47.368 | 0 |
| 21504 | 2700 | 9 | 2700 | 2680 | 2.750 | 0 | 0 | 2.000 | 47.724 | 0 |
| 12763 | 1700 | 7 | 1700 | 1370 | 2.500 | 0 | 0 | 4.000 | 47.731 | 0 |
| 75 | 3430 | 10 | 2390 | 3240 | 4.000 | 0 | 1040 | 4.000 | 47.582 | 0 |
| 2856 | 3400 | 9 | 3400 | 2970 | 2.500 | 0 | 0 | 4.000 | 47.707 | 0 |
| 2807 | 2835 | 8 | 2835 | 2770 | 2.500 | 0 | 0 | 4.000 | 47.388 | 0 |
| 13676 | 2230 | 7 | 1230 | 2380 | 1.000 | 0 | 1000 | 3.000 | 47.522 | 0 |
| 4676 | 2290 | 9 | 2290 | 2290 | 2.250 | 0 | 0 | 4.000 | 47.588 | 0 |
| 11292 | 2300 | 8 | 1150 | 2300 | 2.000 | 0 | 1150 | 3.000 | 47.652 | 0 |
| 8329 | 3080 | 10 | 2300 | 2910 | 3.500 | 3 | 780 | 4.000 | 47.642 | 0 |
| 20576 | 1820 | 9 | 1820 | 1710 | 2.000 | 0 | 0 | 3.000 | 47.543 | 0 |
In [33]: #Now using the custom function to generate price predictions from the sample, X_new
MakePrediction(Model, X_new)
Out[33]:

| | Predicted Prices |
|---|---|
| 0 | $482,473.99 |
| 1 | $618,138.86 |
| 2 | $457,170.59 |
| 3 | $518,713.89 |
| 4 | $490,163.24 |
| 5 | $291,892.08 |
| 6 | $989,396.43 |
| 7 | $1,170,895.82 |
| 8 | $472,381.58 |
| 9 | $213,971.22 |
| 10 | $727,637.28 |
| 11 | $449,708.37 |
| 12 | $1,088,827.59 |
| 13 | $837,215.75 |
| 14 | $462,672.64 |
| 15 | $497,428.93 |
| 16 | $682,755.74 |
| 17 | $653,522.46 |
| 18 | $1,218,489.78 |
| 19 | $575,339.85 |
In [34]: #Showing the house characteristics and the corresponding predicted prices together
sample_and_prediction, sample_and_prediction['Predicted Prices'] = X_new.reset_index(drop=True), MakePrediction(Model, X_new)
sample_and_prediction
Out[34]:

| | sqft_living | grade | sqft_above | sqft_living15 | bathrooms | view | sqft_basement | bedrooms | lat | waterfront | Predicted Prices |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2300 | 7 | 1650 | 2300 | 2.500 | 3 | 650 | 4.000 | 47.413 | 0 | $482,473.99 |
| 1 | 2460 | 8 | 2460 | 2760 | 1.750 | 0 | 0 | 4.000 | 47.726 | 0 | $618,138.86 |
| 2 | 1800 | 7 | 1800 | 1800 | 2.500 | 0 | 0 | 3.000 | 47.529 | 0 | $457,170.59 |
| 3 | 2180 | 8 | 1480 | 2180 | 1.750 | 0 | 700 | 3.000 | 47.759 | 0 | $518,713.89 |
| 4 | 1800 | 7 | 1320 | 1890 | 2.750 | 0 | 480 | 3.000 | 47.568 | 0 | $490,163.24 |
| 5 | 1330 | 5 | 1330 | 1150 | 2.000 | 0 | 0 | 4.000 | 47.496 | 0 | $291,892.08 |
| 6 | 3920 | 9 | 2900 | 2540 | 4.250 | 0 | 1020 | 5.000 | 47.587 | 0 | $989,396.43 |
| 7 | 4170 | 11 | 4170 | 4560 | 3.500 | 0 | 0 | 4.000 | 47.528 | 0 | $1,170,895.82 |
| 8 | 1990 | 7 | 1340 | 1750 | 2.750 | 0 | 650 | 3.000 | 47.743 | 0 | $472,381.58 |
| 9 | 1170 | 7 | 1170 | 1180 | 1.750 | 0 | 0 | 3.000 | 47.368 | 0 | $213,971.22 |
| 10 | 2700 | 9 | 2700 | 2680 | 2.750 | 0 | 0 | 2.000 | 47.724 | 0 | $727,637.28 |
| 11 | 1700 | 7 | 1700 | 1370 | 2.500 | 0 | 0 | 4.000 | 47.731 | 0 | $449,708.37 |
| 12 | 3430 | 10 | 2390 | 3240 | 4.000 | 0 | 1040 | 4.000 | 47.582 | 0 | $1,088,827.59 |
| 13 | 3400 | 9 | 3400 | 2970 | 2.500 | 0 | 0 | 4.000 | 47.707 | 0 | $837,215.75 |
| 14 | 2835 | 8 | 2835 | 2770 | 2.500 | 0 | 0 | 4.000 | 47.388 | 0 | $462,672.64 |
| 15 | 2230 | 7 | 1230 | 2380 | 1.000 | 0 | 1000 | 3.000 | 47.522 | 0 | $497,428.93 |
| 16 | 2290 | 9 | 2290 | 2290 | 2.250 | 0 | 0 | 4.000 | 47.588 | 0 | $682,755.74 |
| 17 | 2300 | 8 | 1150 | 2300 | 2.000 | 0 | 1150 | 3.000 | 47.652 | 0 | $653,522.46 |
| 18 | 3080 | 10 | 2300 | 2910 | 3.500 | 3 | 780 | 4.000 | 47.642 | 0 | $1,218,489.78 |
| 19 | 1820 | 9 | 1820 | 1710 | 2.000 | 0 | 0 | 3.000 | 47.543 | 0 | $575,339.85 |
In [35]: #END