Data Analysis & Machine Learning for Predictive Pricing
(Predicting car prices)
This project utilizes Python for data analysis and machine learning. It covers core data science aspects,
from exploratory data analysis and data wrangling to advanced statistical analysis and the application of
machine learning algorithms for predictive pricing. The data analyzed here is an automobile dataset
comprising a variety of car characteristics, including key attributes such as car brand, horsepower,
engine type, and original price. The dataset is analyzed thoroughly and prepared for developing a
machine learning model which should, based on the data given to it, produce reliable price predictions.
The project was originally completed as part of my IBM course, 'Data Analysis with Python', but was
expanded and built upon to cover a wider range of data science methods and skills learned within and
outside the course.
You can access the automobile dataset from the attached Excel file or by clicking here. As mentioned,
it is comprised of a variety of car attributes and their corresponding prices. You can view each
column in the set and its description in the table below:
| Variable | Description |
| --- | --- |
| symboling | Car's insurance risk level (continuous from -3 to 3). |
| normalized-losses | Relative average loss payment per insured vehicle year (continuous from 65 to 256). |
| make | Car's brand or manufacturer name (alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo). |
| fuel-type | Car's fuel type (diesel or gas). |
| aspiration | Car's aspiration engine type (std or turbo). |
| num-of-doors | Number of doors (two or four). |
| body-style | Car's body style (hardtop, wagon, sedan, hatchback, convertible). |
| drive-wheels | Type of drive wheels (4wd, fwd, rwd). |
| engine-location | Car's engine location (front or rear). |
| wheel-base | Car's wheelbase distance (continuous from 86.6 to 120.9). |
| length | Car's length (continuous from 141.1 to 208.1). |
| width | Car's width (continuous from 60.3 to 72.3). |
| height | Car's height (continuous from 47.8 to 59.8). |
| curb-weight | Car's curb weight (continuous from 1488 to 4066). |
| engine-type | The engine type (dohc, dohcv, ohc, ohcf, ohcv, l, rotor). |
| num-of-cylinders | Number of cylinders (two, three, four, five, six, eight, twelve). |
| engine-size | Car's engine size (continuous from 61 to 326). |
| fuel-system | Car's fuel system (1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi). |
| bore | Car's bore size (continuous from 2.54 to 3.94). |
| stroke | Engine's stroke length (continuous from 2.07 to 4.17). |
| compression-ratio | Ratio between the cylinder's volume and the combustion chamber's volume in a combustion engine (continuous from 7 to 23). |
| horsepower | Car's horsepower (continuous from 48 to 288). |
| peak-rpm | Peak revolutions per minute (continuous from 4150 to 6600). |
| city-mpg | Car's average miles per gallon in the city (continuous from 13 to 49). |
| highway-mpg | Car's average miles per gallon on highways (continuous from 16 to 54). |
| price | Car price (continuous from 5118 to 45400). |
To develop a model that can accurately estimate car prices, the dataset is filtered, statistically
analyzed, and only the most relevant attributes are selected to train the model. To obtain the best
performing model for the current dataset, different models are developed, fine-tuned, and evaluated
through in-sample and out-of-sample evaluations, and their performances compared. The final aim is
to find the model that simultaneously performs best on the data with which it was trained (in-sample)
and in the real world with novel, previously unseen data (out-of-sample). As such, the model
undergoes a process of parameter fine-tuning to reduce its estimated generalization error and
thereby improve its overall performance in the real world. The selected model is then used to
generate price predictions. The final section also provides a function that takes input from the user
with all the car attributes they have in mind and employs the model to return a price prediction
that best corresponds to these given attributes. Feel free to try it yourself.
Overall, the project is broken down into six parts:
1) Reading and Inspecting Data
2) Updating and Cleaning Data
3) Data Selection and Preprocessing
4) Model Development and Evaluation
5) Hyperparameter Tuning
6) Model Prediction
In [ ]: #If you're using the executable notebook version, please run this cell first
#to install the necessary Python libraries for the task
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install statsmodels
!pip install scikit-learn
In [2]: #Importing the Python packages to be used
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler, OneHotEncoder
from sklearn.metrics import mean_squared_error
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import warnings
warnings.simplefilter("ignore")
#Adjusting data display options
pd.set_option('display.max_columns', 100)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
Part One: Reading and Inspecting Data
1. Loading and reading excel file
In [3]: #Loading the dataset onto a dataframe
df = pd.read_excel('Automobile Dataset.xlsx')
#Previewing the first 5 entries
df.head()
Out[3]: (first 5 rows; preview truncated at the right)

|   | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.600 | 168.800 | 64.100 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.600 | 168.800 | 64.100 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.500 | 171.200 | 65.500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.800 | 176.600 | 66.200 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.400 | 176.600 | 66.400 |
2. Inspecting the data
In [4]: #Inspecting the shape of the dataframe
shape = df.shape
print('Number of columns:', shape[1])
print('Number of rows:', shape[0])
Number of columns: 26
Number of rows: 205
In [5]: #Inspecting the column headers, data types, and number of entries
df.info()
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   symboling          205 non-null    int64
 1   normalized-losses  205 non-null    object
 2   make               205 non-null    object
 3   fuel-type          205 non-null    object
 4   aspiration         205 non-null    object
 5   num-of-doors       205 non-null    object
 6   body-style         205 non-null    object
 7   drive-wheels       205 non-null    object
 8   engine-location    205 non-null    object
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64
 14  engine-type        205 non-null    object
 15  num-of-cylinders   205 non-null    object
 16  engine-size        205 non-null    int64
 17  fuel-system        205 non-null    object
 18  bore               205 non-null    object
 19  stroke             205 non-null    object
 20  compression-ratio  205 non-null    float64
 21  horsepower         205 non-null    object
 22  peak-rpm           205 non-null    object
 23  city-mpg           205 non-null    int64
 24  highway-mpg        205 non-null    int64
 25  price              205 non-null    object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB
We can see from the data inspections that some columns have an incorrect or inappropriate data type,
some entries are null and consist only of the special character '?', and many of the variables are
categorical and thus in need of conversion to continuous, numerical values before analysis. It's time to do
some data cleaning and updating before subjecting the data to analysis and machine learning.
Part Two: Updating and Cleaning Data
1. Identifying and handling missing values
We can see in the dataset some entries containing the special character '?' instead of real values. I'll identify
them and either remove them or replace them with appropriate values.
In [6]: #First, reporting columns with inappropriate entries and their total count
print('Number of inappropriate entries per column:')
for col in df.columns:
    if any(df[col].astype('str').str.contains(r'\?')):
        print(f'{col}:', df[col].astype('str').str.contains(r'\?').sum())
Number of inappropriate entries per column:
normalized-losses: 41
num-of-doors: 2
bore: 4
stroke: 4
horsepower: 2
peak-rpm: 2
price: 4
In [7]: #Now replacing the inappropriate values (i.e., '?') with NaN (Not-a-Number) values
# before dealing with them in the most optimal way for a given column
df.replace('?', np.nan, inplace=True)
#Checking the number of NaN values for each column
print('Number of null/NaN values per column:')
for col in df.columns:
    if df[col].isna().sum() > 0:
        print(f'{col}:', df[col].isna().sum())
Number of null/NaN values per column:
normalized-losses: 41
num-of-doors: 2
bore: 4
stroke: 4
horsepower: 2
peak-rpm: 2
price: 4
The numbers match; all special characters have now been replaced by NaN values. Time to deal with them as best
fits.
Dealing with missing values
First, given that price is ultimately the most important variable and the only one we are trying to predict, I
will remove the entire rows with missing prices. For each of the columns 'normalized-losses', 'bore', 'stroke',
'horsepower', and 'peak-rpm', I will replace the NaN values with the average value of the given column.
Finally, for the last column with null entries, 'num-of-doors', I will replace its NaN values with the
highest-frequency value (namely, four doors). This is because the dataset is quite small for developing the model,
and thus it wouldn't be the best decision to dismiss the rows with missing/NaN values entirely.
Dropping rows with missing prices
In [8]: #check number of rows before
print('Number of rows before removal:', len(df))
#Dropping the rows with missing prices and resetting the index
df.dropna(subset=["price"], axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)
#Check number of rows after
print('Number of rows after removal:', len(df))
Number of rows before removal: 205
Number of rows after removal: 201
Replacing missing values by mode
In [9]: #For the missing values in the 'num-of-doors' column, I will replace them with the mode
# value ('four') since it is the most frequent and thus most likely to occur
mode_val = df['num-of-doors'].value_counts().idxmax()
df['num-of-doors'].replace(np.nan, mode_val, inplace=True)
Replacing missing values by mean
For the rest of the columns with missing values, I will replace the missing values with the mean value of the
given column.
In [10]: #Iterating over each column and updating its missing values
for col in df.columns:
    if df[col].isna().sum() > 0:
        #for a given column, get the mean value
        mean_val = df[col].astype('float64').mean()
        #replace NaN values with the mean
        df[col].replace(np.nan, mean_val, inplace=True)
In [11]: #Rechecking the number of missing entries in the dataset
print('Number of null/NaN values per column:')
print(df.isna().sum())
Number of null/NaN values per column:
symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
num-of-doors         0
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-ratio    0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64
Now the dataset is all cleaned up and free of any missing or null entries. Next I will assign the correct data
types to the columns that require correcting.
2. Correcting data format
As seen earlier, some of the columns are assigned the wrong data type, a problem that would hinder the
analysis. For instance, each of the columns 'normalized-losses', 'bore', 'stroke', 'horsepower', 'peak-rpm',
and even 'price' is assigned the data type 'object', when clearly their values are either integers (dtype='int')
or floating-point numbers (dtype='float'). I will identify such columns and correct their data format.
In [12]: #Previewing the dataframe again
df.head()
Out[12]: (first 5 rows; preview truncated at the right)

|   | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | 122.000 | alfa-romero | gas | std | two | convertible | rwd | front | 88.600 | 168.800 | 64.100 |
| 1 | 3 | 122.000 | alfa-romero | gas | std | two | convertible | rwd | front | 88.600 | 168.800 | 64.100 |
| 2 | 1 | 122.000 | alfa-romero | gas | std | two | hatchback | rwd | front | 94.500 | 171.200 | 65.500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.800 | 176.600 | 66.200 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.400 | 176.600 | 66.400 |
In [13]: #We can check each column again and identify the ones with incorrect data types
print('Columns with incorrect data types:')
for col in df.columns:
    #return the column name only if it's convertible to a number but is assigned
    # the wrong data type (i.e., 'object')
    if df[col].dtype == 'object':
        try:
            df[col].astype('float')
            print(' ', col)
        except:
            continue
Columns with incorrect data types:
normalized-losses
bore
stroke
horsepower
peak-rpm
price
As seen from the output, six columns were identified. For each of the columns 'normalized-losses',
'horsepower', 'peak-rpm', and 'price', I will change the data type from object to integer; meanwhile, for the
columns 'bore' and 'stroke', I will convert the data type from object to float, as best suits their values.
Converting data types to proper format
In [14]: #Converting data from object to integer
df[['normalized-losses', 'horsepower', 'peak-rpm', 'price']] = df[['normalized-losses', 'horsepower', 'peak-rpm', 'price']].astype('int64')
#Converting data from object to floating-point number
df[['bore', 'stroke']] = df[['bore', 'stroke']].astype('float64')
#We can check the columns' data types again
print(df.dtypes)
symboling              int64
normalized-losses      int64
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower             int64
peak-rpm               int64
city-mpg               int64
highway-mpg            int64
price                  int64
dtype: object
Now each column is assigned the correct data format. Next I will prepare the categorical variables for analysis
by converting them to continuous, numerical variables. To do so, I will perform 'one hot encoding' (also called
'dummy encoding'), which transforms categorical values into numerical ones.
3. Dealing with categorical variables (One Hot Encoding)
To render categorical variables viable for numerical and statistical analysis, we first have to transform them into
numerical variables. We can see that many columns in the dataset consist of categorical values. For instance,
the engine type is specified as a nominal label (e.g., 'dohc', 'ohc', or 'rotor'), the fuel type is specified by either
of the categories 'gas' or 'diesel', and the number of cylinders is specified by name ('three', 'four', 'five',
etc.). In order to numerically analyze these variables, and use them later to build a machine
learning model that can predict car prices, they must first be converted to numerical variables. To do so, I
will use one hot encoding. This technique creates a new, unique binary column for each of the unique
values in a given categorical variable, assigning 1 to flag its presence or 0 for its absence. In other words,
once new columns are created for each categorical feature, numerical labels are assigned, in this case
binary labels (1 or 0), to stand as proxies or 'dummies' for the actual nominal values, allowing numerical
analyses to be performed as needed. One hot encoding is particularly useful in the current case, as
there is no inherent rank, hierarchy, or order amongst the values of each of the categorical variables.
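To make the idea concrete, here is a minimal, self-contained sketch (toy data, not the project dataset) of what the encoding produces for a single categorical column; it uses pandas' get_dummies for brevity, whereas the project itself uses scikit-learn's OneHotEncoder in the cell below:

import pandas as pd

#Toy example: one categorical column with two unique values
toy = pd.DataFrame({'fuel-type': ['gas', 'diesel', 'gas']})

#get_dummies creates one binary column per unique category value,
#flagging its presence or absence in each row
encoded = pd.get_dummies(toy, columns=['fuel-type'])
print(encoded)   #columns: 'fuel-type_diesel', 'fuel-type_gas'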
In [15]: #First, identifying the columns with categorical variables
categorical_cols = []
for col in df.columns:
    if df[col].dtype == 'object':
        print(col)
        categorical_cols.append(col)
make
fuel-type
aspiration
num-of-doors
body-style
drive-wheels
engine-location
engine-type
num-of-cylinders
fuel-system
In [16]: #Now performing one hot encoding on these columns
#Get encoder object
encoder = OneHotEncoder(handle_unknown='ignore')
#Perform one hot encoding and assign feature names
df_encoded_vars = pd.DataFrame(encoder.fit_transform(df[categorical_cols]).toarray())
df_encoded_vars.columns = encoder.get_feature_names(categorical_cols)
#Get new dataframe with the new encoded categories
df_new = df.join(df_encoded_vars)
#Previewing the new dataframe
df_new.head()
Out[16]: (first 5 rows; preview truncated at the right)

|   | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | 122 | alfa-romero | gas | std | two | convertible | rwd | front | 88.600 | 168.800 | 64.100 |
| 1 | 3 | 122 | alfa-romero | gas | std | two | convertible | rwd | front | 88.600 | 168.800 | 64.100 |
| 2 | 1 | 122 | alfa-romero | gas | std | two | hatchback | rwd | front | 94.500 | 171.200 | 65.500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.800 | 176.600 | 66.200 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.400 | 176.600 | 66.400 |
We can see that the number of columns proliferated to accommodate the new encoded categories, totalling 85
columns. Each encoded column consists of binary values flagging one of the values of the encoded
categorical variables.
Part Three: Data Selection and Preprocessing
In this section, I will identify the subset of data that will be used for developing the model and
make the necessary preparations and adjustments to optimize the model's capacity for predictive
pricing.
1. Identifying the variables
First, we need to identify the car attributes that are most relevant for predicting the final car prices. These
will be the features from which the model will generate its price predictions. One quick way to identify
the most relevant characteristics is to apply correlational analysis and assess the relationships between each
of the features and car price. Note, however, that the dataset contains both numerical and categorical
variables: I will apply the standard Pearson correlation analysis to the numerical variables only, and
assess the relationship between the categorical variables and car price using a One-Way ANOVA test.
1.1 Correlational analysis for numerical variables
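For reference, the Pearson coefficient that df.corr() computes for each numerical feature x against the target y is

$$ r_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}, $$

which ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with values near 0 indicating little or no linear association.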
In [17]: #Performing pearson correlation on the numerical variables
correlations_ByPrice = df.corr()['price'].sort_values(ascending=False)
correlations_ByPrice
Out[17]: Pearson correlation of each numerical feature with price, sorted in descending order: price (1.000), followed by engine-size, curb-weight, horsepower, width, length, wheel-base, and bore with moderate to strong positive correlations; height, normalized-losses, stroke, and compression-ratio with weak positive correlations; and symboling (-0.082), peak-rpm (-0.102), city-mpg (-0.687), and highway-mpg (-0.705) with negative correlations.
In [18]: #Visualizing correlations using a heatmap
plt.figure(figsize=(12,8))
mask = np.triu(np.ones_like(df.corr(), dtype=bool))
sns.heatmap(df.corr(), annot=True, mask=mask, cmap='Blues', vmin=-1, vmax=1)
plt.show()
From the table and heatmap we can conclude that 9 out of 16 numerical features exhibit moderate to
strong correlations with the target variable, car price. Features with correlation coefficients above 0.5 or
below -0.5 will be selected for developing the model. These are engine size, curb weight, horsepower, car
width and length, highway miles/gallon, city miles/gallon, wheel base, and bore. Now analyzing the
relationship between the remaining categorical variables and price.
1.2 One Way ANOVA for categorical variables
In [19]: #Performing One Way ANOVA on the categorical variables
sig_ByVariable = {}
for col in df.columns:
    if df[col].dtype == 'object':
        if "-" in str(col):
            col_renamed = '_'.join(col.split('-'))
            df_copy = df.copy()
            df_copy.rename(columns={f'{col}': f'{col_renamed}'}, inplace=True)
            anova_model = ols(f'price ~ C({col_renamed})', data=df_copy).fit()
            anova_table = sm.stats.anova_lm(anova_model, typ=2)
            print(anova_table, '\n')
        else:
            anova_model = ols(f'price ~ C({col})', data=df).fit()
            anova_table = sm.stats.anova_lm(anova_model, typ=2)
            print(anova_table, '\n')
        sig_ByVariable[f'{col}'] = anova_table['PR(>F)'].values[0]
[Output: ten One-Way ANOVA tables, one per categorical variable, each reporting sum_sq, df, F, and PR(>F) for the category term and the residual. The resulting p-values are summarized in the sorted list below.]
Based on the resulting tables, features with a p-value (PR(>F)) below 0.05 can be said to have a significant
relationship with car price. I will now sort the categorical variables by their resulting p-values to rank them
from most to least strongly associated with car price.
In [20]: #Now, sorting the ANOVA results from most significant to least significant
sig_ByVariable_sorted = sorted([(val, key) for (key, val) in sig_ByVariable.items()])
for val, key in sig_ByVariable_sorted:
    print('{}: {:,.3f}'.format(key, val))
make: 0.000
num-of-cylinders: 0.000
drive-wheels: 0.000
fuel-system: 0.000
engine-type: 0.000
body-style: 0.000
engine-location: 0.000
aspiration: 0.011
fuel-type: 0.119
num-of-doors: 0.550
As demonstrated, all the tested categorical variables, with the exception of fuel type and number of doors, show
a significant relationship with price; that is, they likely affect the final car pricing. The ANOVA test doesn't tell
us exactly which group within a given categorical variable affects price the most, but it is enough to tell us
which variables are likely to be good predictors and can therefore be used to build the model.
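As an optional follow-up (not part of the original analysis), a post-hoc test such as Tukey's HSD could identify which groups within a significant categorical variable differ in mean price. A minimal sketch using statsmodels, applied here to 'drive-wheels' as an example:

from statsmodels.stats.multicomp import pairwise_tukeyhsd

#Pairwise comparison of mean price across the drive-wheel groups (4wd, fwd, rwd);
#rows with reject=True mark pairs whose mean prices differ significantly at alpha=0.05
tukey = pairwise_tukeyhsd(endog=df['price'], groups=df['drive-wheels'], alpha=0.05)
print(tukey.summary())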
Updating the dataframe
Before selecting the variables that will be used to train the model, we first need to get a clean
dataframe with only the relevant data. As such, I will remove all the variables or columns that proved
unnecessary or won't be needed during analysis and model training.
In [21]: #Identifying and removing the unnecessary columns
#(the original categorical columns, the encoded columns for the two non-significant variables,
# and the numerical columns with a weak correlation with price)
unnecessary_cols = (categorical_cols
                    + [col for col in df_new.columns if re.findall('fuel-type|num-of-doors', col)]
                    + [col for col in correlations_ByPrice.index if abs(correlations_ByPrice.loc[col]) < 0.5])
#removing unnecessary columns and obtaining a new, updated dataframe
df_updated = df_new.drop(unnecessary_cols, axis=1)
#Previewing the final dataframe
df_updated.head()
Out[21]: (first 5 rows; preview truncated at the right)

|   | wheel-base | length | width | curb-weight | engine-size | bore | horsepower | city-mpg | highway-mpg | price | make_alfa-romero | make_audi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 88.600 | 168.800 | 64.100 | 2548 | 130 | 3.470 | 111 | 21 | 27 | 13495 | 1.000 | 0.000 |
| 1 | 88.600 | 168.800 | 64.100 | 2548 | 130 | 3.470 | 111 | 21 | 27 | 16500 | 1.000 | 0.000 |
| 2 | 94.500 | 171.200 | 65.500 | 2823 | 152 | 2.680 | 154 | 19 | 26 | 16500 | 1.000 | 0.000 |
| 3 | 99.800 | 176.600 | 66.200 | 2337 | 109 | 3.190 | 102 | 24 | 30 | 13950 | 0.000 | 1.000 |
| 4 | 99.400 | 176.600 | 66.400 | 2824 | 136 | 3.190 | 115 | 18 | 22 | 17450 | 0.000 | 1.000 |
Selecting the predictor and target variables
Now selecting the variables for training the model. We will have a total of 17 predictor attributes, the car
characteristics that proved most relevant (and their encoded columns), and 1 target or dependent variable, price.
In [22]: #specifying the predictor variables and assigning them to 'x_data'
x_data = df_updated.drop('price', axis=1)
#specifying the target variable, price, and assigning it to 'y_data'
y_data = df_updated['price']
2. Data Splitting
Now that the data is ready, the next step is to split it into a training set and a testing set. The training set will
be used to develop and fit the model as well as for performing in-sample evaluations, while the testing
set will be used for testing the model and estimating its generalization error. Accordingly, I will
perform a standard 80/20 split such that 80% of the data will be used for training and the remaining 20% for
testing. The goal of data splitting is to get an appropriate estimate of how the model is likely to perform in
the real world with novel data, which the held-out testing sample is meant to emulate.
In [23]: #Performing data splitting to obtain a training set (80%) and testing set (20%)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, train_size=0.80, random_state=1)  #seed value assumed
#Check the size of both sets
print('Number of training samples:', x_train.shape[0])
print('Number of testing samples:', x_test.shape[0])
Number of training samples: 160
Number of testing samples: 41
3. Feature Scaling: Normalizing the scales
As a last step in data preprocessing, it is best practice to perform feature scaling on the numerical variables
to normalize their scales. This ensures that the varying scales of the different variables, and their likely
varying distributions, don't affect the analysis and/or the model's validity. Feature normalization controls
for the diversity of scales and score distributions by rescaling all the variables such that their values fall
within the same range of 0 to 1. Note, however, that feature scaling will be performed only on the original,
non-encoded numerical variables, given that the categorical variables were already transformed during one
hot encoding to fall in the range of 0 to 1. Thus, I will first isolate the numerical variables from the training
and testing sets before scaling them and then merging them back with the rest of the data.
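For reference, MinMaxScaler rescales each feature using the minimum and maximum learned from the training set,

$$ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, $$

so the training values fall within [0, 1]; the testing set is then transformed with the same training-set minimum and maximum, which is why the scaler is fit on the training data only.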
In [24]: #First, identify the numerical variables and assign them to 'num_vars'
num_vars = [col for col in correlations_ByPrice.index if 1 > abs(correlations_ByPrice.loc[col]) > 0.5]
#Now normalizing only those variables
#get a scaler object
Scaler = MinMaxScaler()
#fitting and scaling the training set
num_train_scaled = Scaler.fit_transform(x_train[num_vars])
#scaling the testing set
num_test_scaled = Scaler.transform(x_test[num_vars])
#replacing unscaled columns with their scaled counterparts
x_train.drop(num_vars, axis=1, inplace=True)
x_test.drop(num_vars, axis=1, inplace=True)
x_train, x_test = np.concatenate([num_train_scaled, x_train], axis=1), np.concatenate([num_test_scaled, x_test], axis=1)
Now the data is ready for analysis and model development...
Part Four: Model Development and Evaluation
In this section I will develop and evaluate multiple models in order to find the best one for the
current data, i.e., the model with the best predictive power. First, I will develop two models: the first a
multiple regression model that assumes a linear relationship between the predictor variables selected
above and the target variable, car price; the second a multivariate polynomial model, which
assumes a non-linear or 'curvilinear' relationship between the predictors and the target. Both models will
be evaluated, and the one that proves to be the better fit for the data will be selected for further
adjustments and fine-tuning before developing the final model.
Each model will be trained using the training set obtained earlier and then evaluated one last time
using the testing set, in order to get an approximate estimate of how it is likely to perform in the real
world, that is, an estimate of its generalization error. Two metrics are used for evaluation: the
R-squared metric, which tells us how much of the variance in price is accounted for by the model
(i.e., the predictors), and the root mean squared error, which tells us how much, on average, the
predicted prices generated by the model deviate from the actual prices.
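In terms of the actual prices $y_i$, the predicted prices $\hat{y}_i$, and the mean actual price $\bar{y}$, the two metrics are

$$ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}. $$

An R-squared close to 1 means the predictors explain most of the variance in price, while a lower RMSE means the predictions deviate less, on average, from the actual prices (in the same dollar units as price).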
MODEL ONE: Multiple Linear Regression Model
A multiple regression model is a type of regression model that captures a linear relationship between
multiple predictors and the target. In this case, the model will capture the relationship between the relevant
car attributes selected earlier and price.
In [25]: #Create a regression object
multireg_model = LinearRegression()
#Fitting the model using the training data
multireg_model.fit(x_train, y_train)
#Evaluating the model with the testing set using the R-Squared metric
R2_test = multireg_model.score(x_test, y_test)
print(f'The R-squared score for the multiple regression model is: r2={round(R2_test,3)}')
The R-squared score for the multiple regression model is: r2=0.921
As demonstrated in the output, we can interpret the resulting R-squared score as indicating that up to 92%
of the variance in car prices is in fact accounted for and explained by the model! This is a great result for a
first test run.
Now we can use the first model to generate price predictions and compare them to the actual prices
In [26]: #Generating predictions using the testing set
Y_pred = multireg_model.predict(x_test)
#We can compare the actual prices vs. predicted prices
Actual_vs_Predicted = pd.concat([pd.Series(y_test.values), pd.Series(Y_pred)], axis=1, ignore_index=True)
Actual_vs_Predicted.columns = ['Actual Prices', 'Predicted Prices']
Actual_vs_Predicted['Actual Prices'] = Actual_vs_Predicted['Actual Prices'].apply(lambda x: '${:,.2f}'.format(x))
Actual_vs_Predicted['Predicted Prices'] = Actual_vs_Predicted['Predicted Prices'].apply(lambda x: '${:,.2f}'.format(x))
#Previewing the first 10 price comparisons
Actual_vs_Predicted.head(10)
Out[26]:

|   | Actual Prices | Predicted Prices |
| --- | --- | --- |
| 0 | $6,295.00 | $6,309.81 |
| 1 | $10,698.00 | $9,722.09 |
| 2 | $13,860.00 | $16,577.85 |
| 3 | $13,499.00 | $15,714.88 |
| 4 | $15,750.00 | $16,368.64 |
| 5 | $8,495.00 | $9,892.58 |
| 6 | $15,250.00 | $15,653.94 |
| 7 | $5,348.00 | $5,646.74 |
| 8 | $21,105.00 | $24,833.45 |
| 9 | $6,938.00 | $6,229.93 |
We can see from the table that most of the predicted prices match the actual ones quite well. To get the
exact price deviation on average, I will employ the root mean squared error metric.
Model Evaluation: Root Mean Squared Error
In [27]: #Calculate the mean squared error (MSE)
MSE = mean_squared_error(y_test, Y_pred)
#Get square root of MSE to obtain root mean squared error (RMSE)
RMSE = np.sqrt(MSE)
print(f'The root mean squared error is: RMSE={round(RMSE,3)}')
The root mean squared error is: RMSE=2440.58
The obtained RMSE value indicates that the predicted prices deviate from the actual ones, on average, by
approximately $2,440, which suggests that the model is a very good fit for the data. I will
also use a distribution plot to visualize the discrepancy between the actual and predicted car prices.
Model Evaluation: Distribution Plot
In [28]: #Visualizing the distribution of actual vs. predicted prices
#Creating the distribution plot
ax1 = sns.distplot(y_test, hist=False, label='Actual Values')
sns.distplot(Y_pred, ax=ax1, hist=False, label='Predicted Values')
#Adding a title and labeling the axes
plt.title('Actual vs. Predicted Values for Car Prices\n(Multiple Regression Model)')
plt.xlabel('Car Price (in USD)', fontsize=12)
plt.ylabel('Distribution density of price values', fontsize=12)
plt.legend(loc='best')
#Adjusting the x-axis to display the prices in a reader-friendly format
plt.gcf().axes[0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.xticks(rotation=90)
#Displaying the distribution plot
plt.show()
We can see from the distribution plot that the model is able to track the actual prices very well, especially in
the lower price range ($0–$20,000). It seems that the model needs more improvement in the upper price
range. Although the model seems a good fit to the data, let's now see how a non-linear regression model
would perform; it may or may not improve our price predictions.
MODEL TWO: Multivariate Polynomial Regression Model
A polynomial regression model, unlike a standard linear one, assumes a non-linear or 'curvilinear'
relationship between the predictor variables and the target; in this case, between the selected car attributes
and price. It tries to account for the variance in the target, price, using a curvilinear function. Note, however,
that a polynomial model can have different polynomial 'degrees', which specify the degree of curvature of the
model's regression line; the degree of curvature in turn determines how closely the model can follow the
variance in the data, i.e., the target variable. Some data have very high variance and thereby require a higher
polynomial degree for the regression line to weave through and capture most of the data points; other data
have low variance and require only a small degree for the polynomial function to capture and account for it.
In order to determine which polynomial degree is most appropriate for the data, and thereby provides
reliable predictions, I will use k-fold cross-validation and loop over different polynomial degrees to compare
them. I will once again use the training set for training and cross-validation, before assessing the model's
performance separately using the testing set. The model with the most optimal polynomial degree, as
measured by the R-squared metric, will be selected.
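For instance, with just two predictors $x_1$ and $x_2$, a degree-2 polynomial model includes the linear terms, the squared terms, and their interaction,

$$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1^2 + b_4 x_1 x_2 + b_5 x_2^2, $$

which is exactly the expanded feature set that PolynomialFeatures(degree=2) generates before the linear regression is fitted.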
In [29]: #First, specifying the polynomial degrees to test out
poly_orders = [2,3,4]   #up to four polynomial degrees
#Now looping through the different polynomials and using cross validation to determine the most optimal degree
cv_scores = {}
for order in poly_orders:
    #creating polynomial features object
    poly_features = PolynomialFeatures(degree=order)
    #transforming predictor variables to polynomial features
    x_train_poly = poly_features.fit_transform(x_train)
    #creating a regression object
    polyreg_model = LinearRegression()
    #Now using 5-fold cross validation to obtain the best polynomial degree
    r2_scores = cross_val_score(polyreg_model, x_train_poly, y_train, cv=5)
    #Get mean R-squared for a given polynomial degree
    cv_scores[order] = np.mean(r2_scores)
#Selecting the best polynomial order
best_order, best_score = None, None
for order, score in cv_scores.items():
    if best_score is None or abs(best_score) < abs(score):
        best_score = score
        best_order = order
#Reporting the best model with the most optimal polynomial
print(f'The best model for the data has a polynomial degree of {best_order}, and R-squared score of: r2={best_score}')
The best model for the data has a polynomial degree of 4, and R-squared score of: r2=-e+22
Based on the cross-validation results, the selected polynomial model yields a hugely negative cross-validated
R-squared score. This is a sign of overfitting! The model tracks the training data so closely that it fails to
generalize. Overfitting is particularly problematic because, while the model may be very good at
tracking and generating predictions from the dataset fed into it, it will very likely falter when confronted with,
and/or trying to generate predictions from, new, previously unseen data. Put differently, overfitting on the
training set prevents learning from transferring well onto new datasets; the model simply becomes too
rigid and inflexible to accommodate and explain new data, and thus overfitting often leads to high
generalization error. To control for and prevent overfitting, we need to perform hyperparameter tuning. As
such, I will perform L1 regularization using Lasso regression.
Part Five: Hyperparameter Tuning
In this section I will perform hyperparameter tuning to control for the overfitting problem
encountered above and also check whether, in doing so, the model's predictions can be improved even further. For
hyperparameter tuning, as mentioned, I will be using Lasso regression regularization. This technique
introduces a new hyperparameter, 'alpha', into the model's objective function in order to regularize its
coefficients. Specifically, by introducing alpha, lasso regression imposes a penalty on the coefficients
that shrinks their values. Alpha can take different values to control the degree of shrinkage: the
higher the alpha value, the greater the shrinkage of the coefficients. Further, lasso regression not
only prevents overfitting, but also reduces model complexity and mitigates the potential risk of
multicollinearity, a problem that arises when the predictor variables used to train the model
exhibit high dependency among themselves (see the correlations heatmap above), which can also
hinder the model's performance.
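Concretely, for coefficients $b_j$ and an (unpenalized) intercept $b_0$, scikit-learn's Lasso minimizes the penalized least-squares objective

$$ \frac{1}{2n}\sum_{i=1}^{n}\Big(y_i - b_0 - \sum_{j} b_j x_{ij}\Big)^2 + \alpha \sum_{j} |b_j|, $$

so a larger alpha shrinks the coefficients more aggressively and drives some of them exactly to zero, which is what reduces overfitting and model complexity.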
Model Development: Polynomial Lasso Regression Model
As described, lasso regression can regularize coefficients to different degrees using different alpha values.
Thus, the task here is to identify the most optimal alpha. To do so, I will employ a cross-validation
technique known as 'grid search', which iterates over different alpha values and identifies the best
one. This will help us identify the model with the best fit. I will use this technique to iterate once again
over different polynomial degrees as well and identify the most optimal one. This time, however, to streamline
processing, I'll employ a pipeline to automate the workflow. The pipeline first performs a
polynomial transform to convert the model features to polynomial features, and then fits the data to
a lasso regression model. Note, finally, that I will perform the training and cross-validation only with the
training set and then run a final evaluation of the best obtained model using the testing set, in order to get
an estimate of how the model will perform with novel, unseen data and thereby estimate its generalization
error.
In [30]: #Creating a pipeline to automate model development
#Specifying the pipeline steps
pipe_steps = [
    ('Polynomial', PolynomialFeatures()),   #performs a polynomial transform on the features
    ('Model', Lasso())                      #fits the data to a lasso regression model
]
#Creating the pipeline
lasso_model = Pipeline(pipe_steps)
#Grid Search
#Now performing grid search to obtain the best polynomial order and alpha value
#specifying the hyperparameters to test out (polynomial degrees & alpha values)
parameters = {'Polynomial__degree': [2,3,4],
              'Model__alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]}
#Creating a grid object and specifying the cross-validation characteristics
Grid = GridSearchCV(lasso_model, parameters, scoring='r2', cv=5)
#Fitting the model with the training data for cross validation
Grid.fit(x_train, y_train)
#Reporting the results of the cross validation (best polynomial order, alpha, and r2 score)
best_order = Grid.best_params_['Polynomial__degree']
best_alpha = Grid.best_params_['Model__alpha']
best_r2 = Grid.best_score_
print(f'The best model has a polynomial degree of {best_order}, alpha value of: alpha={best_alpha}, and r-squared score of r2={round(best_r2,3)}')
The best model has a polynomial degree of 2, alpha value of: alpha=10, and r-squared score of r2=0.881
As demonstrated in the output, the best model has an alpha of 10 and a polynomial degree of 2.
Further, as indicated by the R-squared metric, the model accounts for approximately 88% of the variance in
car price during cross-validation. Let's see how it performs when evaluated with the testing data, which should
give us the best estimate of its real-world performance.
Model Testing
Now we can test the model one final time using the testing set.
In [31]: #First, extracting the model with the best hyperparameters
Lasso_Model = Grid.best_estimator_
#Calculating the R-squared score for the model using the testing set
R2_test = Lasso_Model.score(x_test, y_test)
print(f'The r-squared score for the testing set is: r2={round(R2_test,3)}')
The r-squared score for the testing set is: r2=0.957
As demonstrated by the resulting r-squared score, the current polynomial model outperformed the earlier,
multiple linear regression one on testing, accounting now for approximately 95% of the variance in the
target. That is, 95% of the car price variance can be traced back to the car attributes selected for model
training! As such, we can conclude that the association between the predictor car attributes and car price is
better captured with a non-linear relationship. With some tuning, the polynomial lasso regression model
thus seems the best fitted model for the data, and likely the better performing one in the real world with
novel data sets. I will now perform further evaluations employing the root mean squared error metric and
visualizations to get a better idea of how well the model performed and how it compares to the multiple
regression one.
Model Evaluation: Root Mean Squared Error
In [32]: #Generating price predictions using the testing set
Y_pred_lasso = Lasso_Model.predict(x_test)
#calculating root mean squared error for the testing set
MSE = mean_squared_error(y_test, Y_pred_lasso)
RMSE = np.sqrt(MSE)
#Report the resulting RMSE value
print(f'The root mean squared error is: RMSE={round(RMSE,3)}')
The root mean squared error is: RMSE≈1,809
The RMSE score improved as well, indicating that the discrepancy between the actual prices
and the prices predicted by the model decreased to approximately $1,809 on average. This is a fairly
good result considering the range of car prices. It indicates that the model can produce fairly
reliable predictions not only from the data on which it was trained but also with new, previously unseen data. I
will again use a density distribution plot to examine the discrepancy between the predicted and actual prices
and get better insight into the model's current performance.
Model Evaluation: Distribution Plot
In [33]: #Setting the characteristics of the plots
ax1 = sns.distplot(y_test, hist=False, label='Actual Values')
sns.distplot(Y_pred_lasso, ax=ax1, hist=False, label='Predicted Values')
#Adding a title and labeling the axes
plt.title('Actual vs. Predicted Values for Car Prices\n(Polynomial Lasso Regression Model)')
plt.xlabel('Car Price (in USD)', fontsize=12)
plt.ylabel('Distribution density of price values', fontsize=12)
plt.legend(loc='best')
#Adjusting the x-axis to display the prices in a reader-friendly format
plt.gcf().axes[0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.xticks(rotation=90)
#Displaying the distribution plot
plt.show()
Indeed, as demonstrated in the plot, the current model is able to generate reliable predictions that
almost perfectly match the actual prices. Further, unlike the linear regression model, which performed well
only in the lower price ranges (below $20,000), this model reliably predicts car
prices across the board, modeling prices in the lower and higher price ranges alike. We can compare both
plots side by side to see how their performances differ more clearly.
Model Comparison
In [34]: #Setting the characteristics of the plots
fig, axes = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(13,6))
#Visualizing the multiple regression model's fit on the testing set
ax1 = sns.distplot(y_test, hist=False, ax=axes[0], label='Actual Values')
sns.distplot(Y_pred, hist=False, ax=ax1, label='Predicted Values')
#Visualizing the polynomial lasso model's fit on the testing set
ax2 = sns.distplot(y_test, hist=False, ax=axes[1], label='Actual Values')
sns.distplot(Y_pred_lasso, hist=False, ax=ax2, label='Predicted Values')
#Adding titles and labeling the axes
fig.suptitle('Multiple Regression vs. Polynomial Regression')
axes[0].set_title('Multiple regression model fitting on testing')
axes[0].set_xlabel('Car Price (in USD)')
axes[0].set_ylabel('Distribution density of price values')
axes[0].legend(loc='best')
axes[1].set_title('Polynomial regression model fitting on testing')
axes[1].set_xlabel('Car Price (in USD)')
axes[1].set_ylabel('Distribution density of price values')
axes[1].legend(loc='best')
#Adjusting the x-axis to display the prices in a reader-friendly format
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=90)
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=90)
plt.gcf().axes[0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.gcf().axes[1].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
#show plot
plt.show()
As shown in the figure, the polynomial model (on the right) outperformed the multiple linear regression one
(on the left), both in predicting prices at the peak of the distribution where the most frequent prices occur
and also at the upper price range (post-$20,000) where the linear model fell short. Thus, we can conclude
that the polynomial model with lasso regularization is the best fitted model for the present dataset.
Now that the best model has been decided on, I will proceed to develop the final model with the whole dataset,
applying the best parameters determined by the earlier evaluations, and finally use the model to
generate predictions.
Part Six: Model Prediction
In this section, as mentioned, I will build the final model again with the whole dataset and use it to
generate predictions. Further, to automate data processing and model development I will once again
use a pipeline, except this time the pipeline will take the raw data and perform the following: (i) one
hot encoding of the categorical variables; (ii) feature normalization to rescale the data
appropriately; (iii) conversion of the model features to polynomial features by applying a polynomial
transform; and finally, (iv) fitting a polynomial lasso regression model with the best obtained
parameters (polynomial degree=2, alpha=10). Finally, I will create a custom function that takes a
dataset consisting of different car attributes and employs the model to generate the price predictions
that best correspond to these attributes. I will also define a second function that takes input from the
user regarding all the different car characteristics and returns a price prediction that best suits these
specified characteristics.
Final Model Development
In [35]: #First, preparing the data for training the final model
#specifying predictor variables
x_data = df[['engine-size', 'curb-weight', 'horsepower', 'width', 'length', 'wheel-base', 'bore', 'city-mpg', 'highway-mpg',
             'make', 'num-of-cylinders', 'drive-wheels', 'fuel-system', 'engine-type', 'body-style', 'engine-location', 'aspiration']]
#specifying the target variable
y_data = df['price']
#extracting categorical and numerical variables and storing them in separate objects
# for processing them later separately
numerical_vars = x_data.select_dtypes(exclude='object').columns.tolist()
categorical_vars = x_data.select_dtypes(include='object').columns.tolist()
Now building the pipeline
In [36]: #Creating the first part of the pipeline for normalizing numerical variables
pipeline_pt1 = Pipeline([('Scaler', MinMaxScaler())])
#Creating the second part of the pipeline for encoding categorical variables
pipeline_pt2 = Pipeline([('Encoder', OneHotEncoder(handle_unknown='ignore', sparse=False))])
#Combining both pipelines
pipeline_pt3 = ColumnTransformer([
    ('NumScaler', pipeline_pt1, numerical_vars),
    ('CatEncoder', pipeline_pt2, categorical_vars)
])
#Adding a polynomial transform to the pipeline
pipeline_prep = Pipeline([('ColoumnTransformer', pipeline_pt3),
                          ('Polynomial', PolynomialFeatures(degree=2))])
In [37]: #Building the final pipeline for developing the polynomial lasso regression model
Model = Pipeline([('Preprocessing', pipeline_prep),
('Model', Lasso(alpha=10))])
#Training the model with the entire dataset
Model.fit(x_data, y_data)
Out[37]:
Pipeline(steps=[('Preprocessing',
                 Pipeline(steps=[('ColoumnTransformer',
                                  ColumnTransformer(transformers=[('NumScaler',
                                                                   Pipeline(steps=[('Scaler', MinMaxScaler())]),
                                                                   ['engine-size', 'curb-weight', 'horsepower',
                                                                    'width', 'length', 'wheel-base', 'bore',
                                                                    'city-mpg', 'highway-mpg']),
                                                                  ('CatEncoder',
                                                                   Pipeline(steps=[('Encoder', OneHotEncoder(handle_unknown='ignore', sparse=False))]),
                                                                   ['make', 'num-of-cylinders', 'drive-wheels',
                                                                    'fuel-system', 'engine-type', 'body-style',
                                                                    'engine-location', 'aspiration'])])),
                                 ('Polynomial', PolynomialFeatures())])),
                ('Model', Lasso(alpha=10))])
Now the model is ready and can be deployed for predictive pricing...
Generating price predictions from novel data
In this part, I will define a custom function, MakePrediction(), which takes new data describing different cars
and employs the model to return the most suitable price prediction for each, based on its characteristics.
In [38]: #Defining the function
def MakePrediction(model, X_vars):
    """This function takes two inputs: 'model', which specifies the model to be used to generate the predictions,
    and 'X_vars', which specifies the car characteristics of each car for which to make a price prediction. It runs
    the prediction-making process and returns a table with the predicted prices for each car."""
    Y_pred = model.predict(X_vars)
    Y_pred_df = pd.Series(Y_pred, name='Predicted Prices').to_frame().apply(lambda series: series.apply(lambda x: '${:,.2f}'.format(x)))
    return Y_pred_df
For a quick test of the function, I will extract a random sample of 10 data points from the original dataset
and pass them to the function, along with the final model developed above. The function should return 10
price predictions best suited to these 10 data points.
In [39]: #Extracting a random sample from the dataset and assigning it to 'X_new'
X_new = x_data.sample(10)
#Previewing the sample
X_new
Out[39]: (preview truncated at the right)

|     | engine-size | curb-weight | horsepower | width | length | wheel-base | bore | city-mpg | highway-mpg | make | num-of-cylinders | drive-wheels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 19  | 90  | 1909 | 70  | 63.600 | 158.800 | 94.500  | 3.030 | 38 | 43 | chevrolet | four | fwd |
| 125 | 194 | 2800 | 207 | 65.000 | 168.900 | 89.500  | 3.740 | 17 | 25 | porsche | six | rwd |
| 68  | 234 | 3740 | 155 | 71.700 | 202.600 | 115.600 | 3.460 | 16 | 18 | mercedes-benz | eight | rwd |
| 3   | 109 | 2337 | 102 | 66.200 | 176.600 | 99.800  | 3.190 | 24 | 30 | audi | four | fwd |
| 5   | 136 | 2507 | 110 | 66.300 | 177.300 | 99.800  | 3.190 | 19 | 25 | audi | five | fwd |
| 51  | 91  | 1950 | 68  | 64.200 | 166.800 | 93.100  | 3.080 | 31 | 38 | mazda | four | fwd |
| 31  | 79  | 1837 | 60  | 64.000 | 150.000 | 93.700  | 2.910 | 38 | 42 | honda | four | fwd |
| 181 | 109 | 2212 | 85  | 65.500 | 171.700 | 97.300  | 3.190 | 27 | 34 | volkswagen | four | fwd |
| 4   | 136 | 2824 | 115 | 66.400 | 176.600 | 99.400  | 3.190 | 18 | 22 | audi | five | 4wd |
| 146 | 92  | 1985 | 62  | 63.600 | 158.700 | 95.700  | 3.050 | 35 | 39 | toyota | four | fwd |
In [40]: #Now passing the data to the MakePrediction() function to get price predictions
MakePrediction(Model, X_new)
Out[40]:

|   | Predicted Prices |
| --- | --- |
| 0 | $6,469.27 |
| 1 | $35,824.33 |
| 2 | $34,631.89 |
| 3 | $12,219.99 |
| 4 | $14,449.95 |
| 5 | $6,937.93 |
| 6 | $6,354.12 |
| 7 | $9,012.38 |
| 8 | $16,749.58 |
| 9 | $6,437.99 |
Showing the car characteristics and the corresponding predicted prices together
In [41]: #Reindexing the sample and adding the predicted prices to the dataframe
sample_and_prediction, sample_and_prediction['Predicted Prices'] = X_new.reset_index(drop=True), MakePrediction(Model, X_new)['Predicted Prices'].values
sample_and_prediction
Out[41]: (preview truncated at the right)

|   | engine-size | curb-weight | horsepower | width | length | wheel-base | bore | city-mpg | highway-mpg | make | num-of-cylinders | drive-wheels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 90  | 1909 | 70  | 63.600 | 158.800 | 94.500  | 3.030 | 38 | 43 | chevrolet | four | fwd |
| 1 | 194 | 2800 | 207 | 65.000 | 168.900 | 89.500  | 3.740 | 17 | 25 | porsche | six | rwd |
| 2 | 234 | 3740 | 155 | 71.700 | 202.600 | 115.600 | 3.460 | 16 | 18 | mercedes-benz | eight | rwd |
| 3 | 109 | 2337 | 102 | 66.200 | 176.600 | 99.800  | 3.190 | 24 | 30 | audi | four | fwd |
| 4 | 136 | 2507 | 110 | 66.300 | 177.300 | 99.800  | 3.190 | 19 | 25 | audi | five | fwd |
| 5 | 91  | 1950 | 68  | 64.200 | 166.800 | 93.100  | 3.080 | 31 | 38 | mazda | four | fwd |
| 6 | 79  | 1837 | 60  | 64.000 | 150.000 | 93.700  | 2.910 | 38 | 42 | honda | four | fwd |
| 7 | 109 | 2212 | 85  | 65.500 | 171.700 | 97.300  | 3.190 | 27 | 34 | volkswagen | four | fwd |
| 8 | 136 | 2824 | 115 | 66.400 | 176.600 | 99.400  | 3.190 | 18 | 22 | audi | five | 4wd |
| 9 | 92  | 1985 | 62  | 63.600 | 158.700 | 95.700  | 3.050 | 35 | 39 | toyota | four | fwd |
Generating price predictions from user input
Finally, for this last part I will define another custom function, MakePrediction_forUser(), which takes user
input with all the characteristics of the car whose price they wish to predict and returns a price prediction
that best suits these given characteristics.
In [42]: #Defining the function
def MakePrediction_forUser(model, x_data):
    """This function asks the user for 17 inputs for 17 different car attributes:
    car brand - engine size - horsepower - fuel system - city MPG - highway MPG - engine type - engine location -
    number of cylinders - curb weight - length - width - body style - drive wheels type - wheel base - bore -
    aspiration engine type.
    After taking user input, the function returns a price prediction best suited to the given attributes."""
    #create empty dictionary for input values
    X_vars = dict()
    #Take user input for car characteristics
    while True:
        make = input('Enter car brand: ')
        if make not in x_data['make'].unique().tolist():
            print('''Invalid car brand. This function only supports the list below:
                  [alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz,
                  mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota,
                  volkswagen, volvo].
                  Please make sure to select a car brand featured on this list.\n''')
            continue
        break
    while True:
        try:
            engine_size = float(input('Enter engine size: '))
            break
        except:
            print('Invalid input. Engine size must be a numerical value. Try again...\n')
            continue
    while True:
        try:
            horsepower = float(input('Enter horsepower: '))
            break
        except:
            print('Invalid input. Horsepower must be a numerical value. Try again...\n')
            continue
    while True:
        fuel_system = input('Enter type of fuel system: ')
        if fuel_system not in x_data['fuel-system'].unique().tolist():
            print('''Invalid fuel system. This function only supports the list below:
                  [1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi].
                  Please make sure to select a fuel system featured on this list.\n''')
            continue
        break
    while True:
        try:
            city_mpg = float(input('Enter city mpg: '))
            break
        except:
            print('Invalid input. City mpg must be a numerical value. Try again...\n')
            continue
    while True:
        try:
            highway_mpg = float(input('Enter highway mpg: '))
            break
        except:
            print('Invalid input. Highway mpg must be a numerical value. Try again...')
            continue
    while True:
        engine_type = input('Enter engine type: ')
        if engine_type not in x_data['engine-type'].unique().tolist():
            print('''Invalid engine type. This function only supports the list below:
                  [l, dohc, ohc, ohcf, ohcv, rotor].
                  Please make sure to select an engine type featured on this list.\n''')
            continue
        break
    while True:
        engine_location = input('Enter engine location: ')
        if engine_location not in x_data['engine-location'].unique().tolist():
            print("Invalid engine location. Engine location must be either 'front' or 'rear'. Try again...\n")
            continue
        break
    while True:
        num_of_cylinders = input('Enter number of cylinders (as written number): ')
        if num_of_cylinders not in x_data['num-of-cylinders'].unique().tolist():
            print('''Invalid number of cylinders. This function only supports the values below:
                  [two, three, four, five, six, eight, twelve].
                  Please make sure to select a number featured on this list.\n''')
            continue
        break
    while True:
        try:
            curb_weight = float(input('Enter car curb weight: '))
            break
        except:
            print('Invalid input. Curb weight must be a numerical value. Try again...\n')
            continue
    while True:
        try:
            length = float(input('Enter car length: '))
            break
        except:
            print('Invalid input. Car length must be a numerical value. Try again...\n')
            continue
    while True:
        try:
            width = float(input('Enter car width: '))
            break
        except:
            print('Invalid input. Car width must be a numerical value. Try again...\n')
            continue
    while True:
        body_style = input('Enter body style: ')
        if body_style not in x_data['body-style'].unique().tolist():
            print('''Invalid car body style. This function only supports the values below:
                  [convertible, hardtop, hatchback, sedan, wagon].
                  Please make sure to select a body style featured on this list.\n''')
            continue
        break
    while True:
        drive_wheels = input('Enter type of drive wheels: ')
        if drive_wheels not in x_data['drive-wheels'].unique().tolist():
            print('''Invalid drive wheels type. This function only supports the values below:
                  [4wd, fwd, rwd].
                  Please make sure to select a drive wheel type featured on this list.\n''')
            continue
        break
    while True:
        try:
            wheel_base = float(input('Enter wheel base distance: '))
            break
        except:
            print('Invalid input. Wheel base distance must be a numerical value. Try again...\n')
            continue
    while True:
        try:
            bore = float(input('Enter bore size: '))
            break
        except:
            print('Invalid input. Bore size must be a numerical value. Try again...\n')
            continue
    while True:
        aspiration = input('Enter aspiration engine type: ')
        if aspiration not in x_data['aspiration'].unique().tolist():
            print("Invalid aspiration engine type. This function only supports 'std' and 'turbo'. Try again...\n")
            continue
        break
    print('\n\n')
    #Adding the input values to the dictionary (as single-element lists so they can form a one-row dataframe)
    X_vars['engine-size'], X_vars['horsepower'], X_vars['city-mpg'], X_vars['highway-mpg'] = [engine_size], [horsepower], [city_mpg], [highway_mpg]
    X_vars['length'], X_vars['width'], X_vars['curb-weight'], X_vars['wheel-base'], X_vars['bore'] = [length], [width], [curb_weight], [wheel_base], [bore]
    X_vars['make'], X_vars['fuel-system'], X_vars['engine-type'], X_vars['engine-location'] = [make], [fuel_system], [engine_type], [engine_location]
    X_vars['num-of-cylinders'], X_vars['body-style'], X_vars['drive-wheels'], X_vars['aspiration'] = [num_of_cylinders], [body_style], [drive_wheels], [aspiration]
    #convert dictionary to dataframe
    df_X_vars = pd.DataFrame(X_vars)
    #Generate and return price prediction
    Y_pred = model.predict(df_X_vars)
    return 'For a car with the given characteristics, the predicted price is: ${:,.2f}.'.format(Y_pred[0])
In [43]: #Now we can use the function to produce a prediction from user input
MakePrediction_forUser(Model, x_data)
Enter car brand: honda
Enter engine size: 92
Enter horsepower: 76
Enter type of fuel system: 1bbl
Enter city mpg: 31
Enter highway mpg: 38
Enter engine type: ohc
Enter engine location: front
Enter number of cylinders (as written number): four
Enter car curb weight: 1819
Enter car length: 144.6
Enter car width: 63.9
Enter body style: hatchback
Enter type of drive wheels: fwd
Enter wheel base distance: 86.6
Enter bore size: 2.91
Enter aspiration engine type: std
Out[43]:
'For a car with the given characteristics, the predicted price is: $6,312.93.'
In [44]: #END