Scikit-Learn Data Processing and Model Evaluation

This notebook shows how you can:

run a processing job that uses a Scikit-Learn script to clean and pre-process the input data, perform feature engineering, and split the data into train and test sets
run a training job on the pre-processed training data to train a model
run a processing job on the pre-processed test data to evaluate the trained model's performance
use your own custom container to run processing jobs with your own Python libraries and dependencies

The dataset used is the Census-Income KDD Dataset. We will select features from this dataset, clean the data, turn it into features that our training algorithm can use to train a binary classification model, and split it into train and test sets.
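
The pre-processing steps described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual script: the data is a small synthetic stand-in for the census data, and the column names (`age`, `education`, `income`), the choice of transformers, and the 80/20 split are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the census data; column names are hypothetical.
rng = np.random.default_rng(0)
n_rows = 200
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=n_rows),
    "education": rng.choice(["high school", "bachelors", "masters"], size=n_rows),
    "income": rng.choice([0, 1], size=n_rows, p=[0.9, 0.1]),  # imbalanced label
})

# Clean: drop rows with missing values and exact duplicate rows.
df = df.dropna().drop_duplicates()

# Separate the label from the selected features.
X = df.drop(columns="income")
y = df["income"]

# Split first so the transformers are fitted on training data only.
# stratify=y keeps the class ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Feature engineering: scale the numeric column, one-hot encode the categorical one.
preprocess = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(handle_unknown="ignore"), ["education"]),
)
train_features = preprocess.fit_transform(X_train)
test_features = preprocess.transform(X_test)
```

Fitting the transformers on the training split and only applying them to the test split avoids leaking test-set statistics into the features.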

The task is to predict whether each row, which represents a census responder, has an income greater than $50K or less than $50K. The dataset is heavily class-imbalanced, with most records labeled as earning less than $50K. After training a logistic regression model, we will evaluate it against a hold-out test dataset and save the classification evaluation metrics: precision, recall, and F1 score for each label, plus accuracy and ROC AUC for the model.
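
As a rough sketch of the training and evaluation steps, the example below fits a logistic regression model on synthetic imbalanced data and collects the metrics named above. The data from `make_classification`, the `class_weight="balanced"` setting, and the JSON layout are assumptions for illustration; the notebook's own scripts may differ.

```python
import json

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data with roughly a 90/10 class imbalance (illustrative only).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" is an assumption here: one common way to
# compensate for an imbalanced label distribution.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Hard predictions feed the per-label metrics; probabilities feed ROC AUC.
predictions = model.predict(X_test)
positive_scores = model.predict_proba(X_test)[:, 1]

# Per-label precision, recall, and F1 score, as a nested dictionary.
report = classification_report(y_test, predictions, output_dict=True)
report["accuracy"] = float(accuracy_score(y_test, predictions))
report["roc_auc"] = float(roc_auc_score(y_test, positive_scores))

# Serialize the metrics so they can be saved alongside the model.
evaluation_json = json.dumps(report, indent=2)
print(evaluation_json)
```

On imbalanced data, accuracy alone can look high even when the minority class is predicted poorly, which is why the per-label metrics and ROC AUC are reported together.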
