[Objectives] [Project Description] [Project Planning] [Data Dictionary] [Data Acquire and Prep] [Data Exploration] [Modeling] [Conclusion] [Steps to Reproduce]
- Document the data science pipeline, presenting the findings clearly and writing documentation thorough enough for independent reproduction.
- Create modules that can be downloaded for the sake of reproducibility.
- The goal is to use data to find and explore predictive factors of churn.
- Ultimately we hope to use these factors to drive actions which help to maintain a strong customer base and drive profits.
Generally, we ask: what relationships might affect churn?
Is there a distinguishable relationship between household size and churn?
This required some feature engineering (see the sketch after this list).
Is there a relationship between churn and paperless billing?
Is there a correlation between customer duration and monthly charges?
- If so, how strong is it?
Do total charges have a relationship with churn?
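As a minimal sketch of that feature engineering, household size might be approximated from the encoded partner and dependents flags; the helper name and the 1 + partner + dependents proxy are assumptions, not the exact code used:

```python
import pandas as pd

def add_household_size(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical proxy: the customer, plus a partner, plus dependents."""
    df = df.copy()
    df["household_size"] = 1 + df["partner_encoded"] + df["dependents_encoded"]
    return df
```

Since dependents_encoded is a 0/1 flag, this gives a coarse three-level proxy rather than a true household count.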
The target variable is churn_encoded. All variables are defined below.
| Variable Name | Data Type | Categorical/Numerical |
| --- | --- | --- |
| tenure | int64 | Numerical |
| monthly_charges | float64 | Numerical |
| total_charges | float64 | Numerical |
| gender_encoded | int64 | Categorical |
| partner_encoded | int64 | Categorical |
| dependents_encoded | int64 | Categorical |
| phone_service_encoded | int64 | Categorical |
| paperless_billing_encoded | int64 | Categorical |
| churn_encoded | int64 | Categorical |
| multiple_lines_No_phone_service | uint8 | Categorical |
| multiple_lines_Yes | uint8 | Categorical |
| online_security_No_internet_service | uint8 | Categorical |
| online_security_Yes | uint8 | Categorical |
| online_backup_No_internet_service | uint8 | Categorical |
| online_backup_Yes | uint8 | Categorical |
| device_protection_No_internet_service | uint8 | Categorical |
| device_protection_Yes | uint8 | Categorical |
| tech_support_No_internet_service | uint8 | Categorical |
| tech_support_Yes | uint8 | Categorical |
| streaming_tv_No_internet_service | uint8 | Categorical |
| streaming_tv_Yes | uint8 | Categorical |
| streaming_movies_No_internet_service | uint8 | Categorical |
| streaming_movies_Yes | uint8 | Categorical |
| contract_type_One_year | uint8 | Categorical |
| contract_type_Two_year | uint8 | Categorical |
| internet_service_type_Fiber_optic | uint8 | Categorical |
| internet_service_type_None | uint8 | Categorical |
| payment_type_Credit_card_(automatic) | uint8 | Categorical |
| payment_type_Electronic_check | uint8 | Categorical |
| payment_type_Mailed_check | uint8 | Categorical |
Our plan is to follow data science pipeline best practices. The steps are included below. Ultimately, we are building models to predict customer churn.
An acquire.py file is created and used. It acquires the data from the database, then saves it locally as a .csv file (telco.csv). It also outputs simple plots of the counts of unique values per variable, giving a quick visual check of whether each variable is categorical.
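Below is a minimal sketch of what acquire.py does, assuming a MySQL source; the database name (telco_churn), table name (customers), and function name are assumptions, not the exact code:

```python
import os
import pandas as pd
from env import host, username, password  # your local credentials file

def get_telco_data() -> pd.DataFrame:
    """Return the telco data, reading the cached csv if it already exists."""
    if os.path.exists("telco.csv"):
        return pd.read_csv("telco.csv")
    url = f"mysql+pymysql://{username}:{password}@{host}/telco_churn"
    df = pd.read_sql("SELECT * FROM customers", url)
    df.to_csv("telco.csv", index=False)  # cache locally for later runs
    return df
```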
A prepare.py file is created and used. Here the data is cleaned: categorical columns are encoded, and numerical columns are cast to floats when they contain no nulls. Columns with nulls are noted so they can be treated in the next step. The results of this step are saved into a csv file (telco_clean.csv).
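A minimal sketch of the cleaning step, assuming the raw column names match the source table; the gender encoding direction and function name are assumptions, though the drop_first dummies do line up with the column names in the data dictionary above:

```python
import pandas as pd

def prep_telco(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # total_charges arrives as text; blank strings become the nulls
    # that are handled in the preprocessing step
    df["total_charges"] = pd.to_numeric(df["total_charges"], errors="coerce")
    # binary columns -> 0/1 encodings (Female = 1 is an assumption)
    df["gender_encoded"] = (df["gender"] == "Female").astype(int)
    for col in ["partner", "dependents", "phone_service",
                "paperless_billing", "churn"]:
        df[f"{col}_encoded"] = (df[col] == "Yes").astype(int)
    # multi-valued categoricals -> one-hot dummy columns
    multi = ["multiple_lines", "online_security", "online_backup",
             "device_protection", "tech_support", "streaming_tv",
             "streaming_movies", "contract_type", "internet_service_type",
             "payment_type"]
    dummies = pd.get_dummies(df[multi], drop_first=True)
    dummies.columns = dummies.columns.str.replace(" ", "_")  # match the dictionary names
    return pd.concat([df, dummies], axis=1)
```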
A preprocess.py file is created and used. Here we split our data into train, validate, and test subsets. We then address the columns with null values: for each such column, we take the mean of the non-null values and impute it in place of the nulls. This is done independently for train, validate, and test in order to avoid data leakage. From here we do a deep dive into exploration on the train dataset, asking questions of the data and creating graphs to understand it better and ask better questions. We then formulate those questions into hypotheses and run statistical tests to answer them.
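A minimal sketch of the split-and-impute logic, plus one of the hypothesis tests; the 60/20/20 proportions, the seed, the function names, and the choice of a Pearson test are illustrative assumptions:

```python
from scipy import stats
from sklearn.model_selection import train_test_split

def split_data(df, target="churn_encoded", seed=123):
    """60/20/20 train/validate/test split, stratified on the target."""
    train, rest = train_test_split(df, test_size=0.4,
                                   random_state=seed, stratify=df[target])
    validate, test = train_test_split(rest, test_size=0.5,
                                      random_state=seed, stratify=rest[target])
    return train, validate, test

def impute_means(train, validate, test, cols=("total_charges",)):
    """Fill nulls in each split with that split's own column mean."""
    for split in (train, validate, test):
        for col in cols:
            split[col] = split[col].fillna(split[col].mean())
    return train, validate, test

def tenure_vs_monthly_charges(train):
    """Pearson test for the duration vs. monthly charges question above."""
    r, p = stats.pearsonr(train["tenure"], train["monthly_charges"])
    return r, p
```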
Here we select various machine learning algorithms from the sklearn library to create models. Once we have our models, we vary the hyperparameters of each. From here we evaluate each model on the validate set and carry the best performer forward to the test set.
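A minimal sketch of the modeling loop, assuming the splits from the previous step; the candidate algorithms, hyperparameters, and feature list shown are illustrative assumptions, not the final models:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def compare_models(train, validate, features, target="churn_encoded"):
    X_train, y_train = train[features], train[target]
    X_val, y_val = validate[features], validate[target]
    # majority-class baseline that each model must beat
    baseline = (y_val == y_train.mode()[0]).mean()
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=123),
        "random_forest": RandomForestClassifier(max_depth=5, random_state=123),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(f"{name}: {model.score(X_val, y_val):.3f} "
              f"(baseline {baseline:.3f})")
```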
A final report is created which gives a high-level overview of the process.
In order to reproduce these results you will need an env.py file containing the host, username, and password credentials to access the SQL server. The remaining files are available within my GitHub repo. If you clone the repo and add an env.py file in the format shown below, you will be able to reproduce the outcome. As an aside, the random state is set in the file; if you change it, your results may differ slightly.
```python
host='xxxxx'
username='xxxxxx'
password='xxxxxx'
# where the strings are your respective credentials
```
Our models beat the baseline; hence, they provide useful predictive power.
More data might highlight some interesting relationships.
It would be nice to obtain more quantitative data related to the projected disposable income of each household.
Offer a Telco credit card. In doing so, we can collect more quantitative information on credit ratings and household income, which could give us insight into each household's projected disposable income. The goal is to use the data gained to maximize profits, perhaps by offering incentives that reduce churn.
“Errors using inadequate data are much less than those using no data at all.” (Charles Babbage, English Mathematician)