- Gabby and Kwame want to present their findings and prediction models about the drivers of single unit property values to the Zillow team.
- Kwame and Gabby want to produce the deliverables acquire.py, prepare.py, explore.py, and model.py so that anyone interested in the data can explore it and verify the validity of their findings.
- acquire data from a CSV exported from SQL
- address missing data
- address outliers
- scale numeric data
- split into train, validate, and test (see the wrangle sketch after this list)
- plot a correlation matrix of all variables (see the heatmap sketch after this list)
- test each hypothesis
- engineer new features
- try different algorithms: Recursive Feature Elimination, Multiple Linear Regression Model, Polynomial Regression Model, Baseline Model (see the modeling sketch after this list)
- which features are most influential?
- evaluate on train
- select the top ~3 models to evaluate on validate
- select the top model
- run the top model on test to verify performance
- conclusion
- summarize findings
- make recommendations
- next steps
- how to run with new data.
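A minimal sketch of what acquire.py and prepare.py might cover, assuming Zillow column names like `bathroomcnt`, `calculatedfinishedsquarefeet`, and `taxvaluedollarcnt`, a 1st/99th-percentile outlier cut, and a 56/24/20 split (all assumptions here, not the project's exact choices):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def wrangle_zillow(csv_path="zillow.csv"):
    # Acquire: read the CSV previously exported from the SQL server.
    df = pd.read_csv(csv_path)

    # Address missing data: drop rows missing the target or key features.
    df = df.dropna(subset=["taxvaluedollarcnt", "calculatedfinishedsquarefeet", "bathroomcnt"])

    # Address outliers: keep the middle 98% of the target distribution.
    low, high = df.taxvaluedollarcnt.quantile([0.01, 0.99])
    return df[df.taxvaluedollarcnt.between(low, high)]

def split_and_scale(df, features):
    # Split into train (~56%), validate (~24%), and test (20%).
    train_val, test = train_test_split(df, test_size=0.2, random_state=123)
    train, validate = train_test_split(train_val, test_size=0.3, random_state=123)

    # Min-Max scale numeric features to [0, 1]; fit on train only to avoid leakage.
    scaler = MinMaxScaler().fit(train[features])
    train, validate, test = (part.copy() for part in (train, validate, test))
    for part in (train, validate, test):
        part[features] = scaler.transform(part[features])
    return train, validate, test
```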
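The correlation-matrix step might be as simple as a seaborn heatmap over the `train` split from the sketch above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the correlation matrix of all numeric variables in train.
sns.heatmap(train.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation matrix of all variables")
plt.show()
```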
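And a sketch of the modeling loop: a baseline from the train mean, multiple linear regression, polynomial regression, and RFE to surface the most influential features, all scored with RMSE on validate. Function and variable names are illustrative only:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

def rmse(y_true, y_pred):
    # Root mean squared error, in the units of the target.
    return mean_squared_error(y_true, y_pred) ** 0.5

def evaluate_models(X_train, y_train, X_val, y_val, n_features=3):
    scores = {}

    # Baseline: predict the train mean for every observation.
    scores["baseline"] = rmse(y_val, np.full(len(y_val), y_train.mean()))

    # Multiple linear regression.
    lm = LinearRegression().fit(X_train, y_train)
    scores["linear"] = rmse(y_val, lm.predict(X_val))

    # Polynomial regression: expand features to degree 2, then fit OLS.
    poly = PolynomialFeatures(degree=2)
    lm2 = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    scores["polynomial"] = rmse(y_val, lm2.predict(poly.transform(X_val)))

    # RFE: recursively drop the weakest feature down to n_features,
    # which also answers "which features are most influential?"
    rfe = RFE(LinearRegression(), n_features_to_select=n_features).fit(X_train, y_train)
    return scores, X_train.columns[rfe.support_]
```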
Term | Definition |
---|---|
Co-Op | A unit of a housing co-operative; a purchased apartment where the apartment owners collectively are responsible for maintenance of common areas and upkeep. |
Single Unit Property | The term housing unit refers to a single unit within a larger structure that can be used by an individual or household to eat, sleep, and live. The unit can be in any type of residence, such as a house, apartment, or mobile home, or it may be a single unit in a group of rooms. |
Logistic Regression | A regression algorithm used to predict discrete outcomes. |
Decision Tree | A sequence of rules that can be used to classify observations into two or more classes using supervised machine learning. |
Random Forest | An ensemble learning method that constructs a multitude of decision trees at training time and outputs the class chosen by the most trees. |
K-Nearest Neighbor (KNN) | A lazy algorithm: it does not attempt to construct a general internal model but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbors of each point, so predictions are based on how close a new data point is to known data points. |
Precision | The higher this number, the more of the predicted positives were actually positive; a low score means many positives were predicted where there were none. tp / (tp + fp) |
Recall | If this score is high, few actual positives were missed; as it gets lower, more of the positives that are actually there go unpredicted. tp / (tp + fn) |
f1-score | The balanced harmonic mean of recall and precision, giving both metrics equal weight. The higher the F-measure, the better. |
Support | The number of occurrences of each class in the true values of y. |
Min-Max Scaling | A linear scaling method that transforms our features such that the range is between 0 and 1. |
Standardization | A linear transformation of our data such that it looks like the standard normal distribution; that is, it will have a mean of 0 and a standard deviation of 1. |
Regression Line | Also known as a line of best fit, a linear regression algorithm that returns the slope and y-intercept of the line that most accurately predicts y, given the x and y provided to the algorithm. |
Baseline | Predicting the target variable without using any features. Take the mean or median value and predict all future values to be that constant value. |
Residual | The difference between the observed value and the estimated value. |
Sum of the Squared Errors (SSE) | Also known as RSS, the Residual Sum of Squares; used as the basis of the final evaluation metric. The SSE is derived by squaring each residual and summing them all together. |
Mean Squared Error (MSE) | The average of the squared errors. |
Root Mean Squared Error (RMSE) | The square root of the MSE; used to see the error in the actual units of the y variable. |
K-Best | Selects the k best features using a statistical test that compares each X with y and finds which X's have the strongest relationship with y. |
Recursive Feature Elimination (RFE) | Creates a model with all the features and evaluates the performance metrics; finds the weakest feature and removes it; then builds a new model with the remaining features and repeats, until only the number of features specified when creating the RFE object remains. |
Regression | A supervised machine learning technique used to model the relationships between one (simple) or more (multiple) features (independent) and how they contribute to one (univariate) or more (multivariate) target variables (dependent), represented by a continuous variable. |
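To make the error metrics above concrete, a tiny worked example with invented numbers:

```python
import numpy as np

y = np.array([100_000, 200_000, 300_000])     # observed tax values (invented)
yhat = np.array([110_000, 190_000, 320_000])  # model predictions (invented)

residuals = y - yhat          # observed minus estimated
sse = (residuals ** 2).sum()  # SSE/RSS: 1e8 + 1e8 + 4e8 = 6e8
mse = sse / len(y)            # MSE: 2e8
rmse = mse ** 0.5             # RMSE: ~14,142, in dollars
```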
First Hypothesis
𝐻0: The number of bathrooms has no correlation with the tax estimated property value.
𝐻𝑎: The number of bathrooms is correlated with the tax estimated property value.
Second Hypothesis
𝐻0: There is no correlation between finished square feet and tax estimated property value.
𝐻𝑎: There is a correlation between finished square feet and tax estimated property value.
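Both hypotheses are correlation claims, so each can be tested with a Pearson correlation test; a sketch, assuming a `train` DataFrame and the Zillow column names used earlier:

```python
from scipy import stats

alpha = 0.05

for feature in ["bathroomcnt", "calculatedfinishedsquarefeet"]:
    r, p = stats.pearsonr(train[feature], train.taxvaluedollarcnt)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{feature}: r = {r:.3f}, p = {p:.4f} -> {decision}")
```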
acquire.py
prepare.py
explore.py
model.py
The conclusions of the hypothesis tests: since p is less than α, we reject the null hypothesis that bathroom count has no correlation with the tax estimated property value, and likewise, since p is less than α, we reject the null hypothesis that there is no correlation between finished square feet and tax estimated property value.
Initially we found that bathroom count has a slightly stronger correlation with tax property value assessments, but further modeling showed that bedroom count and bath/bedroom count have a higher correlation.
Further exploration and modeling can be done to assess differences in estimated prices across counties, between lot sizes, and between May and June 2017.
You may download acquire.py and prepare.py. You will need your own env.py file with your SQL credentials in order to access the SQL server.
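For reference, env.py usually holds just the SQL connection credentials; the variable names below are a common convention, not something this repo prescribes:

```python
# env.py -- keep out of version control (add it to .gitignore).
host = "your.sql.host"      # placeholder
user = "your_username"      # placeholder
password = "your_password"  # placeholder
```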