We are data science consultants contracted by property management investors in New York City. Their company, backed by investors, wants to buy residential real estate in NYC at the lowest possible price, renovate it, and resell it within a year. The renovation analysis is outside the scope of this project, but they want a baseline model that can predict the price of residential real estate in order to:
- Identify potentially undervalued listed properties to buy
- Predict the market price when it's time to sell, in order to sell quickly while maximizing return on investment
Because they want to renovate and sell the properties quickly, they are only interested in properties with fewer than 10 residential units that are priced at less than 5 million but at least ten thousand each.
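The buying criteria above amount to a simple row filter. The following is a minimal sketch, not the notebook's actual code. It assumes the sales data is loaded into a pandas DataFrame, and the column names `RESIDENTIAL UNITS` and `SALE PRICE` are hypothetical placeholders that may differ from the real dataset.

```python
import pandas as pd

# Hypothetical column names; adjust them to match the actual dataset.
UNITS_COL = "RESIDENTIAL UNITS"
PRICE_COL = "SALE PRICE"


def filter_target_properties(df, max_units=10, min_price=10_000, max_price=5_000_000):
    """Keep rows with fewer than max_units units and min_price <= price < max_price."""
    mask = (
        (df[UNITS_COL] < max_units)
        & (df[PRICE_COL] >= min_price)
        & (df[PRICE_COL] < max_price)
    )
    return df.loc[mask].reset_index(drop=True)


# Toy rows for illustration only (not taken from the real dataset).
toy = pd.DataFrame(
    {
        UNITS_COL: [1, 12, 3, 2],
        PRICE_COL: [450_000, 900_000, 7_500_000, 5_000],
    }
)
filtered = filter_target_properties(toy)
print(filtered)
```

In the toy data, only the first row passes all three conditions: the second has too many units, the third is too expensive, and the fourth is priced below the lower bound.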
-
pandas
import pandas as pd
-
numpy
import numpy as np
-
matplotlib.pyplot
import matplotlib.pyplot as plt
-
joblib
import joblib
-
seaborn
import seaborn as sns
-
scipy.stats.randint
from scipy.stats import randint
-
sklearn:
-
sklearn.metrics:
- mean_squared_error
- mean_absolute_error
- r2_score
- confusion_matrix
-
sklearn.ensemble:
- RandomForestRegressor
- BaggingRegressor
-
sklearn.model_selection:
- train_test_split
- GridSearchCV
- RandomizedSearchCV
- cross_validate
- KFold
-
sklearn.preprocessing:
- StandardScaler
- OneHotEncoder
- RobustScaler
-
sklearn.linear_model: LinearRegression
-
sklearn.pipeline: Pipeline
-
sklearn.compose: ColumnTransformer
-
sklearn.decomposition: PCA
-
sklearn.dummy: DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from scipy.stats import randint
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.dummy import DummyRegressor
from sklearn.tree import ExtraTreeRegressor
from sklearn.model_selection import cross_validate, KFold
-
random_SCV(pipe = [], grid_param = [], n_iter = 10, cv = 5, scoring = 'neg_mean_squared_error', rnd_state = 42, file_name = "", training = [])
Runs RandomizedSearchCV on the estimator "pipe" according to grid_param and the other parameters. The "training" argument is a list containing x_training and y_training. The results are saved in the param_tuning folder, in the file named file_name.
-
grid_SCV(pipe = [], grid_param = [], cv = 5, scoring = 'neg_mean_squared_error', file_name = "", training = [])
Similar to random_SCV, but runs GridSearchCV on the estimator "pipe" instead.
-
wr_pkl_file(file_name = "",content = "", read = False)
Reads or writes a pkl file that contains different machine learning pipelines with their corresponding results.
-
print_results(labels = [], est = [], plt_num = 50, log = False, testing = [])
Predicts sale prices and prints the results (R-squared, MAE, and RMSE) for different estimators (est).
-
validation(models = [], estimators = [], training = [], cv = 5, train_score = False)
Performs cross-validation for different models using their estimators and the training set.
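To make the search helpers above more concrete, here is a minimal sketch of what a function like random_SCV might do. It is not the notebook's implementation: the function name `search_and_save`, its body, the synthetic data, and the use of joblib to save the fitted search are all assumptions based only on the signature and description above. It also writes to a temporary directory rather than the param_tuning folder.

```python
import os
import tempfile

import joblib
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def search_and_save(pipe, grid_param, training, file_name, n_iter=10, cv=5,
                    scoring="neg_mean_squared_error", rnd_state=42):
    """Hypothetical stand-in for random_SCV: run a randomized search, then save it."""
    x_train, y_train = training
    search = RandomizedSearchCV(
        pipe,
        param_distributions=grid_param,
        n_iter=n_iter,
        cv=cv,
        scoring=scoring,
        random_state=rnd_state,
    )
    search.fit(x_train, y_train)
    # Persist the fitted search object (best estimator, cv_results_, and so on).
    joblib.dump(search, file_name)
    return search


# Synthetic regression data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=60)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(random_state=42)),
])
# "<step name>__<parameter>" addresses a parameter of a step inside the pipeline.
grid_param = {
    "model__n_estimators": randint(5, 20),
    "model__max_depth": randint(2, 6),
}

out_path = os.path.join(tempfile.mkdtemp(), "rf_search.pkl")
search = search_and_save(pipe, grid_param, [X, y], out_path, n_iter=3, cv=3)
print(search.best_params_)
```

With `scoring='neg_mean_squared_error'`, `search.best_score_` is the negated mean squared error, so values closer to zero are better. A saved search can be reloaded later with `joblib.load(out_path)`.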
Clone the repo by running the following command in a terminal:
git clone https://github.com/avivfaraj/DSCI631-project.git
After cloning the repo, open Final_project.ipynb and run each cell one at a time in the order that they are presented. You can run the whole notebook in a single step by clicking on the menu Cell -> Run All.
The first two sections are packages and functions which are required for the code to run. Make sure to run those two sections before running the program.
The dataset was found on Kaggle.
The data in this dataset originates from the NYC Department of Finance Rolling Sales.