Comparing Ridge and LASSO models to find the best accuracy for house price prediction.
House prices have been skyrocketing lately, which is why I think it's interesting to predict them using regularized regression. This repo predicts house prices with regularized regression and compares the accuracy of Ridge and LASSO. The target of the model is 'medv' (house price); the input is a dataframe and the output is an accuracy value.
Requirements: numpy, pandas, matplotlib, seaborn, sklearn, statsmodels
The data has 14 columns: the target 'medv' plus the 13 features below:
- Per capita crime rate (crim)
- Residential land zoned proportion (zn)
- Non-retail business acres proportion (indus)
- Whether the tract bounds the river (chas)
- Nitrogen oxides concentration (nox)
- Average number of rooms per dwelling (rm)
- Proportion of owner-occupied units built before 1940 (age)
- Weighted distance to employment centres (dis)
- Accessibility index (rad)
- Tax rate (tax)
- Pupil-teacher ratio (ptratio)
- Black proportion (black)
- Percentage of lower-status population (lstat)
I'm using 'boston.csv' as my main data. After importing it, I define the target and the features: the target is 'medv' and the features are all of the 'boston.csv' columns except 'medv'. Since we want to fit a regularized linear regression and find the best lambda, I divide the data into train, validation, and test sets using train_test_split from sklearn.model_selection.
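The split described above could look roughly like the sketch below. It is only an illustration: the synthetic dataframe stands in for 'boston.csv', and the 60/20/20 proportions, the random_state, and the variable names are my assumptions, not necessarily what the notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for boston.csv (assumption: the real data would be
# loaded with pd.read_csv("boston.csv") instead).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["rm", "lstat", "tax"])
df["medv"] = 5 * df["rm"] - 3 * df["lstat"] + rng.normal(size=100)

# Target is 'medv'; features are every other column.
X = df.drop(columns="medv")
y = df["medv"]

# Hold out 20% as the test set, then split the remainder into
# train (60% of the total) and validation (20% of the total).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Calling train_test_split twice is one simple way to get three sets, since the function only produces two parts per call.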
After that, I check for multicollinearity among the features using VIF scores and correlation. For the VIF score I use variance_inflation_factor from statsmodels.stats.outliers_influence (imported as vif). Based on the VIF scores and correlation, I decided to drop the 'tax' column to avoid multicollinearity.
The next step is to fit Ridge and Lasso (both from sklearn.linear_model) on the training data, and then find the best lambda for each model using the validation data, based on RMSE.
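The lambda search can be sketched as below. Note that sklearn calls the regularization strength alpha rather than lambda. The synthetic data, the lambda grid, and the helper name validation_rmse are hypothetical choices for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption: stands in for the Boston features).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(size=300)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

lambdas = [0.01, 0.1, 1, 10]  # hypothetical grid of candidate lambdas


def validation_rmse(model_cls):
    """Fit one model per lambda on the training set, score RMSE on validation."""
    scores = {}
    for lam in lambdas:
        model = model_cls(alpha=lam).fit(X_train, y_train)
        mse = mean_squared_error(y_val, model.predict(X_val))
        scores[lam] = float(np.sqrt(mse))
    return scores


ridge_rmse = validation_rmse(Ridge)
lasso_rmse = validation_rmse(Lasso)

# The best lambda is the one with the lowest validation RMSE.
best_ridge = min(ridge_rmse, key=ridge_rmse.get)
best_lasso = min(lasso_rmse, key=lasso_rmse.get)
print("Ridge:", ridge_rmse, "best lambda:", best_ridge)
print("LASSO:", lasso_rmse, "best lambda:", best_lasso)
```

Choosing lambda on the validation set keeps the test set untouched, so the final test error is still an honest estimate.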
Based on the picture above, the best model is Ridge with lambda = 1.
After that, I compute the coefficients of the Ridge model with lambda = 1. The last step is calculating the test error using mean_absolute_error (MAE), mean_absolute_percentage_error (MAPE), and mean_squared_error (MSE), all from sklearn.metrics.
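A minimal sketch of this last step, again on synthetic data, might look like this. The feature names and the data are assumptions for the example; only the choice of Ridge with alpha = 1 and the three metrics come from the text above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)
from sklearn.model_selection import train_test_split

# Synthetic data (assumption). The offset of 20 keeps the target away from
# zero, where MAPE becomes unstable.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 20 + X @ np.array([3.0, -2.0, 1.0]) + rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Final model: Ridge with lambda (alpha in sklearn) = 1.
ridge = Ridge(alpha=1).fit(X_train, y_train)

# Fitted coefficients, keyed by hypothetical feature names.
coefficients = dict(zip(["x1", "x2", "x3"], ridge.coef_))
print(coefficients, ridge.intercept_)

# Test errors on data the model has never seen.
y_pred = ridge.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"MAE={mae:.3f}  MAPE={mape:.3f}  MSE={mse:.3f}")
```

Keep in mind that mean_absolute_percentage_error returns a fraction, so multiply it by 100 if you want to report a percentage.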
Based on the picture above, we can see that the best model for this dataset is Ridge with lambda = 1, judged by MAE (mean absolute error). For further information and code, you can see my file here.