Predict salary based on multiple features.
- We have 2 files: a train file and a test file.
- The train file has 100k observations with 7 features.
- 4 categorical and 2 numerical features
We ran through data processing to look for the following:
- Nulls
- Data types, to see whether any numerical columns are stored as object
- How many categorical and numerical columns are in the dataframe
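
The checks listed above can be sketched with pandas roughly as follows. This is a minimal illustration only: the small `train` frame and its column names are hypothetical stand-ins, not the actual train file.

```python
import pandas as pd

# Hypothetical stand-in for the train file; column names are assumptions.
train = pd.DataFrame({
    "companyId": ["C1", "C2", "C1", "C3"],
    "jobType": ["JUNIOR", "SENIOR", None, "MANAGER"],
    "yearsExperience": [1, 7, 3, 10],
    "salary": [60, 120, 80, 150],
})

# Nulls: count missing values per column.
null_counts = train.isnull().sum()

# Data types: a numeric column showing up as object would need conversion.
dtypes = train.dtypes

# Split the columns into categorical (non-numeric) and numerical.
categorical_cols = train.select_dtypes(exclude="number").columns.tolist()
numerical_cols = train.select_dtypes(include="number").columns.tolist()

print(null_counts)
print(dtypes)
print(len(categorical_cols), "categorical,", len(numerical_cols), "numerical")
```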
Used one-hot encoding to convert the categorical values to numerical values, since the models only work on numerical columns.
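
One common way to do this encoding is `pandas.get_dummies`. The sketch below assumes that approach and uses a hypothetical `jobType` column; the notebook may have used a different encoder.

```python
import pandas as pd

# Hypothetical sample; "jobType" is an assumed categorical column.
train = pd.DataFrame({
    "jobType": ["JUNIOR", "SENIOR", "JUNIOR"],
    "yearsExperience": [1, 7, 3],
    "salary": [60, 120, 80],
})

# Each category becomes its own 0/1 column; the original column is dropped.
encoded = pd.get_dummies(train, columns=["jobType"], dtype=int)

print(encoded.columns.tolist())
```

Every distinct category adds a column, so high-cardinality features can widen the dataframe considerably.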
Evaluated the correlation to see which features need to be considered.
From the correlation, Company ID doesn't have an impact on Salary, so it will be ignored.
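
A correlation check of this kind might look like the sketch below. The data is synthetic and built so that the hypothetical `companyId` column is independent of `salary`. It illustrates the technique only and does not reproduce the project's numbers.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): salary depends on
# experience, while companyId is assigned at random.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "companyId": rng.choice(["C1", "C2", "C3"], size=n),
    "yearsExperience": rng.integers(0, 25, size=n),
})
df["salary"] = 50 + 2 * df["yearsExperience"] + rng.normal(0, 5, size=n)

# Encode the categorical column so it can take part in the correlation.
encoded = pd.get_dummies(df, columns=["companyId"], dtype=int)

# Pearson correlation of every feature with the target.
corr_with_salary = encoded.corr()["salary"].drop("salary")
print(corr_with_salary.sort_values(key=abs, ascending=False))
```

Features whose correlation with the target stays near zero, like the encoded company columns here, are candidates for dropping.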
Evaluated 2 models: Linear Regression and Gradient Boosting Regressor (GBR).
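
Fitting both models with scikit-learn could be sketched as below. The feature matrix is synthetic, and the 80/20 split and default hyperparameters are assumptions rather than the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic features and target, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 3 * X[:, 0] + 5 * np.sin(X[:, 1]) + rng.normal(0, 1, size=500)

# Hold out part of the data for evaluation (assumed 80/20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

# Fit each model on the training split and predict on the held-out split.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```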
For each model:
- Predicted vs. Real plot
- MSE evaluation
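
The plot and MSE steps above can be sketched with a small helper. The name `evaluate` and the sample values are hypothetical. The non-interactive `Agg` backend is chosen only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error


def evaluate(y_true, y_pred, title):
    """Return the MSE and a Predicted vs. Real scatter plot (hypothetical helper)."""
    mse = mean_squared_error(y_true, y_pred)

    fig, ax = plt.subplots()
    ax.scatter(y_true, y_pred, alpha=0.5)

    # Diagonal reference line: points on it are perfect predictions.
    lo = min(np.min(y_true), np.min(y_pred))
    hi = max(np.max(y_true), np.max(y_pred))
    ax.plot([lo, hi], [lo, hi], linestyle="--", color="gray")

    ax.set_xlabel("Real salary")
    ax.set_ylabel("Predicted salary")
    ax.set_title(f"{title} (MSE = {mse:.2f})")
    return mse, fig


# Made-up values, for illustration only.
y_true = np.array([50.0, 80.0, 120.0])
y_pred = np.array([55.0, 75.0, 120.0])
mse, fig = evaluate(y_true, y_pred, "Hypothetical model")
print(mse)
```

Calling the helper once per model with the same held-out targets keeps the plots and MSE values directly comparable.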
Although the Predicted vs. Real plots look similar, the MSE numbers from further evaluation suggest that GBR is the better model.
Using GBR, evaluated the features to see which have the most impact.
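
One way to rank features with a fitted GBR is scikit-learn's `feature_importances_` attribute. The sketch below assumes that approach and uses synthetic data with hypothetical column names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data with hypothetical column names: the target depends strongly
# on years_experience, weakly on distance_to_city, and not on random_noise.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "years_experience": rng.integers(0, 25, size=n),
    "distance_to_city": rng.integers(0, 100, size=n),
    "random_noise": rng.normal(size=n),
})
y = (
    50
    + 2 * X["years_experience"]
    - 0.2 * X["distance_to_city"]
    + rng.normal(0, 2, size=n)
)

gbr = GradientBoostingRegressor(random_state=0).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importances = pd.Series(
    gbr.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```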