- Gabby and Kwame want to present their findings and prediction models about the drivers of single unit property values to the Zillow team.
- Kwame and Gabby want to produce the deliverables acquire.py, prepare.py, explore.py, and model.py so that anyone interested in the data can explore it and verify the validity of their findings.
- acquire data from a CSV exported from SQL
- address missing data
- address outliers
- scale numeric data
- split into train, validate, and test (see the wrangle sketch after this list)
- plot a correlation matrix of all variables (see the heatmap sketch after this list)
- test each hypothesis
- engineer new features
- try different algorithms: Recursive Feature Elimination, Multiple Linear Regression Model, Polynomial Regression Model, Baseline Model (see the modeling sketch after this list)
- which features are most influential?
- evaluate on train
- select the top ~3 models to evaluate on validate
- select the top model
- run the top model on test to verify performance
- conclusion
- summarize findings
- make recommendations
- next steps
- how to run with new data.
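A minimal sketch of what acquire.py and prepare.py might cover, assuming Zillow column names like `bathroomcnt`, `calculatedfinishedsquarefeet`, and `taxvaluedollarcnt`, a 1st/99th-percentile outlier cut, and a 56/24/20 split (all assumptions here, not the project's exact choices):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def wrangle_zillow(csv_path="zillow.csv"):
    # Acquire: read the CSV previously exported from the SQL server.
    df = pd.read_csv(csv_path)

    # Address missing data: drop rows missing the target or key features.
    df = df.dropna(subset=["taxvaluedollarcnt", "calculatedfinishedsquarefeet", "bathroomcnt"])

    # Address outliers: keep the middle 98% of the target distribution.
    low, high = df.taxvaluedollarcnt.quantile([0.01, 0.99])
    return df[df.taxvaluedollarcnt.between(low, high)]

def split_and_scale(df, features):
    # Split into train (~56%), validate (~24%), and test (20%).
    train_val, test = train_test_split(df, test_size=0.2, random_state=123)
    train, validate = train_test_split(train_val, test_size=0.3, random_state=123)

    # Min-Max scale numeric features to [0, 1]; fit on train only to avoid leakage.
    scaler = MinMaxScaler().fit(train[features])
    train, validate, test = (part.copy() for part in (train, validate, test))
    for part in (train, validate, test):
        part[features] = scaler.transform(part[features])
    return train, validate, test
```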
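The correlation-matrix step might be as simple as a seaborn heatmap over the `train` split from the sketch above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the correlation matrix of all numeric variables in train.
sns.heatmap(train.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation matrix of all variables")
plt.show()
```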
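And a sketch of the modeling loop: a baseline from the train mean, multiple linear regression, polynomial regression, and RFE to surface the most influential features, all scored with RMSE on validate. Function and variable names are illustrative only:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

def rmse(y_true, y_pred):
    # Root mean squared error, in the units of the target.
    return mean_squared_error(y_true, y_pred) ** 0.5

def evaluate_models(X_train, y_train, X_val, y_val, n_features=3):
    scores = {}

    # Baseline: predict the train mean for every observation.
    scores["baseline"] = rmse(y_val, np.full(len(y_val), y_train.mean()))

    # Multiple linear regression.
    lm = LinearRegression().fit(X_train, y_train)
    scores["linear"] = rmse(y_val, lm.predict(X_val))

    # Polynomial regression: expand features to degree 2, then fit OLS.
    poly = PolynomialFeatures(degree=2)
    lm2 = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    scores["polynomial"] = rmse(y_val, lm2.predict(poly.transform(X_val)))

    # RFE: recursively drop the weakest feature down to n_features,
    # which also answers "which features are most influential?"
    rfe = RFE(LinearRegression(), n_features_to_select=n_features).fit(X_train, y_train)
    return scores, X_train.columns[rfe.support_]
```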
Term | Definition |
---|---|
Co-Op | A unit of a housing co-operative; a purchased apartment where the apartment owners collectively are responsible for maintenance of common areas and upkeep. |
Single Unit Property | The term housing unit refers to a single unit within a larger structure that can be used by an individual or household to eat, sleep, and live. The unit can be in any type of residence, such as a house, apartment, or mobile home, or it may be a single unit in a group of rooms. |
Logistic Regression | A regression algorithm used to predict discrete outcomes. |
Decision Tree | A sequence of rules that can be used to classify observations into two or more classes using supervised machine learning. |
Random Forest | An ensemble learning method that constructs a multitude of decision trees at training time and outputs the class chosen by the most trees. |
K-Nearest Neighbor (KNN) | A lazy algorithm: it does not attempt to construct a general internal model but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbors of each point, so predictions are based on how close a new data point is to known data points. |
Precision | The higher this number, the more of the predicted positives were actually positive; a low score means many positives were predicted where there were none. tp / (tp + fp) |
Recall | If this score is high, few actual positives were missed; as it gets lower, more of the positives that are actually there go unpredicted. tp / (tp + fn) |
f1-score | The balanced harmonic mean of recall and precision, giving both metrics equal weight. The higher the F-measure, the better. |
Support | The number of occurrences of each class in the true values of y. |
Min-Max Scaling | A linear scaling method that transforms our features such that the range is between 0 and 1. |
Standardization | A linear transformation of our data such that it looks like the standard normal distribution; that is, it will have a mean of 0 and a standard deviation of 1. |
Regression Line | Also known as a line of best fit, a linear regression algorithm that returns the slope and y-intercept of the line that most accurately predicts y, given the x and y provided to the algorithm. |
Baseline | Predicting the target variable without using any features. Take the mean or median value and predict all future values to be that constant value. |
Residual | The difference between the observed value and the estimated value. |
Sum of the Squared Errors (SSE) | Also known as RSS, the Residual Sum of Squares; used as the basis of the final evaluation metric. The SSE is derived by squaring each residual and summing them all together. |
Mean Squared Error (MSE) | The average of the squared errors. |
Root Mean Squared Error (RMSE) | The square root of the MSE; used to see the error in the actual units of the y variable. |
K-Best | Selects the k best features using a statistical test that compares each X with y and finds which X's have the strongest relationship with y. |
Recursive Feature Elimination (RFE) | Creates a model with all the features and evaluates the performance metrics; finds the weakest feature and removes it; then builds a new model with the remaining features and repeats, until only the number of features specified when creating the RFE object remains. |
Regression | A supervised machine learning technique used to model the relationships between one (simple) or more (multiple) features (independent) and how they contribute to one (univariate) or more (multivariate) target variables (dependent), represented by a continuous variable. |
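To make the error metrics above concrete, a tiny worked example with invented numbers:

```python
import numpy as np

y = np.array([100_000, 200_000, 300_000])     # observed tax values (invented)
yhat = np.array([110_000, 190_000, 320_000])  # model predictions (invented)

residuals = y - yhat          # observed minus estimated
sse = (residuals ** 2).sum()  # SSE/RSS: 1e8 + 1e8 + 4e8 = 6e8
mse = sse / len(y)            # MSE: 2e8
rmse = mse ** 0.5             # RMSE: ~14,142, in dollars
```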
First Hypothesis
𝐻0: The number of bathrooms has no correlation with the tax estimated property value.
𝐻𝑎: The number of bathrooms is correlated with the tax estimated property value.
Second Hypothesis
𝐻0: There is no correlation between finished square feet and tax estimated property value.
𝐻𝑎: There is a correlation between finished square feet and tax estimated property value.
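Both hypotheses are correlation claims, so each can be tested with a Pearson correlation test; a sketch, assuming a `train` DataFrame and the Zillow column names used earlier:

```python
from scipy import stats

alpha = 0.05

for feature in ["bathroomcnt", "calculatedfinishedsquarefeet"]:
    r, p = stats.pearsonr(train[feature], train.taxvaluedollarcnt)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{feature}: r = {r:.3f}, p = {p:.4f} -> {decision}")
```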
acquire.py
prepare.py
explore.py
model.py
The conclusions of the hypothesis tests: since p is less than α, we reject the null hypothesis that bathroom count has no correlation with the tax estimated property value, and likewise, since p is less than α, we reject the null hypothesis that there is no correlation between finished square feet and tax estimated property value.
Initially we found that bathroom count has a slightly stronger correlation with tax property value assessments, but further modeling showed that bedroom count and bath/bedroom count have a higher correlation.
Further exploration and modeling can be done to assess differences in estimated prices across counties, between lot sizes, and between May and June 2017.
You may download acquire.py and prepare.py. You will need your own env.py file with your SQL credentials in order to access the SQL server.
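For reference, env.py usually holds just the SQL connection credentials; the variable names below are a common convention, not something this repo prescribes:

```python
# env.py -- keep out of version control (add it to .gitignore).
host = "your.sql.host"      # placeholder
user = "your_username"      # placeholder
password = "your_password"  # placeholder
```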