
Telco Project



[Objectives] [Project Description] [Project Planning] [Data Dictionary] [Data Acquire and Prep] [Data Exploration] [Modeling] [Conclusion] [Steps to Reproduce]


Objectives:

  • Document the data science pipeline, presenting findings clearly and keeping the documentation thorough enough for independent reproduction.
  • Create modules that can be downloaded and reused for reproducibility.

Project Description and Goals:

  • The goal is to use data to find and explore predictive factors of churn.
  • Ultimately we hope to use these factors to drive actions that help maintain a strong customer base and drive profits.

Questions:

Broadly: which relationships in the data might affect churn? More specifically:

  • Is there a distinguishable relationship between household size and churn? (This required some feature engineering; a sketch follows this list.)
  • Is there a relationship between churn and paperless billing?
  • Is there a correlation between customer tenure and monthly charges? If so, how strong is it?
  • Do total charges have a relationship with churn?
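The raw data has no household-size column, so one has to be derived. Below is a minimal sketch of the idea, assuming the raw partner and dependents columns are Yes/No strings; the function name and the exact proxy are illustrative rather than the repo's exact implementation:

```python
import pandas as pd


def add_household_size(df: pd.DataFrame) -> pd.DataFrame:
    """Add a rough household_size proxy for each customer."""
    df = df.copy()
    # 1 for the customer, +1 for a partner, +1 if there are dependents.
    # The raw dependents column is Yes/No rather than a count, so larger
    # families are under-counted; this is a proxy, not an exact size.
    df["household_size"] = (
        1
        + (df["partner"] == "Yes").astype(int)
        + (df["dependents"] == "Yes").astype(int)
    )
    return df
```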

Data Dictionary:

Target: churn_encoded (1 if the customer churned, 0 otherwise). The remaining variables are candidate features.

| Variable Name | Data Type | Categorical/Numerical |
| --- | --- | --- |
| tenure | int64 | Numerical |
| monthly_charges | float64 | Numerical |
| total_charges | float64 | Numerical |
| gender_encoded | int64 | Categorical |
| partner_encoded | int64 | Categorical |
| dependents_encoded | int64 | Categorical |
| phone_service_encoded | int64 | Categorical |
| paperless_billing_encoded | int64 | Categorical |
| churn_encoded | int64 | Categorical |
| multiple_lines_No_phone_service | uint8 | Categorical |
| multiple_lines_Yes | uint8 | Categorical |
| online_security_No_internet_service | uint8 | Categorical |
| online_security_Yes | uint8 | Categorical |
| online_backup_No_internet_service | uint8 | Categorical |
| online_backup_Yes | uint8 | Categorical |
| device_protection_No_internet_service | uint8 | Categorical |
| device_protection_Yes | uint8 | Categorical |
| tech_support_No_internet_service | uint8 | Categorical |
| tech_support_Yes | uint8 | Categorical |
| streaming_tv_No_internet_service | uint8 | Categorical |
| streaming_tv_Yes | uint8 | Categorical |
| streaming_movies_No_internet_service | uint8 | Categorical |
| streaming_movies_Yes | uint8 | Categorical |
| contract_type_One_year | uint8 | Categorical |
| contract_type_Two_year | uint8 | Categorical |
| internet_service_type_Fiber_optic | uint8 | Categorical |
| internet_service_type_None | uint8 | Categorical |
| payment_type_Credit_card_(automatic) | uint8 | Categorical |
| payment_type_Electronic_check | uint8 | Categorical |
| payment_type_Mailed_check | uint8 | Categorical |

Procedure:

Planning:

Our plan is to follow data science pipeline best practices; the steps are outlined below. Ultimately we are building classification models that predict customer churn.

Acquisition:

An aquire.py file is created and used. It acquires the data from the database and saves it locally as a .csv file (telco.csv). It also outputs simple plots of the counts of unique values per variable, giving a quick visual check of whether each variable is categorical.
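As a rough illustration of what such a module might contain (the schema name telco_churn, the table names, and the function name here are assumptions, and the SQLAlchemy/pymysql stack is one common choice, not necessarily the repo's):

```python
import os

import pandas as pd
from env import host, username, password  # your credentials; never commit env.py


def get_telco_data() -> pd.DataFrame:
    """Read the joined telco tables from MySQL, caching the result locally."""
    if os.path.exists("telco.csv"):
        return pd.read_csv("telco.csv")
    # assumed schema/table names; requires sqlalchemy and pymysql installed
    url = f"mysql+pymysql://{username}:{password}@{host}/telco_churn"
    query = """
        SELECT *
        FROM customers
        JOIN contract_types USING (contract_type_id)
        JOIN internet_service_types USING (internet_service_type_id)
        JOIN payment_types USING (payment_type_id)
        """
    df = pd.read_sql(query, url)
    df.to_csv("telco.csv", index=False)
    return df
```

On the first call this hits the database and writes telco.csv; later calls read the cached file, so the credentials are only needed once.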

Preparation:

A prepare.py file is created and used. Here the data is cleaned: categorical columns are encoded, and numerical columns are cast to floats when they contain no nulls. Columns with nulls are flagged to be handled in the next step. The results of this step are saved to a csv file (telco_clean.csv).
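A minimal sketch of the encoding step, with column names taken from the data dictionary above; the gender encoding direction and the exact renaming of the dummy columns are assumptions:

```python
import pandas as pd


def prep_telco(df: pd.DataFrame) -> pd.DataFrame:
    """Encode the raw telco frame and save it as telco_clean.csv."""
    df = df.copy()
    # total_charges arrives as text with blanks for brand-new customers;
    # coerce to float and leave the resulting nulls for the next step
    df["total_charges"] = pd.to_numeric(df["total_charges"], errors="coerce")
    # two-valued columns -> 0/1 "_encoded" columns
    df["gender_encoded"] = (df["gender"] == "Female").astype(int)  # assumed direction
    for col in ["partner", "dependents", "phone_service",
                "paperless_billing", "churn"]:
        df[f"{col}_encoded"] = (df[col] == "Yes").astype(int)
    # multi-valued columns -> dummy columns, dropping the first level
    multi = ["multiple_lines", "online_security", "online_backup",
             "device_protection", "tech_support", "streaming_tv",
             "streaming_movies", "contract_type",
             "internet_service_type", "payment_type"]
    dummies = pd.get_dummies(df[multi], drop_first=True)
    dummies.columns = [c.replace(" ", "_") for c in dummies.columns]
    df = pd.concat([df, dummies], axis=1)
    df.to_csv("telco_clean.csv", index=False)
    return df
```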

Exploration and Pre-processing:

A preprocess.py file is created and used. Here we split the data into train, validate, and test subsets. We then address the columns with null values: within each split, nulls are imputed with the mean of that column's non-null values, and this is done independently for train, validate, and test so no information leaks between the splits. From there we do a deep exploratory dive into the train dataset, asking questions of the data and creating graphs to understand it better and to ask sharper questions. We then formalize those questions into hypotheses and run statistical tests to answer them.
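A sketch of the split-and-impute logic plus one of the hypothesis tests (the 60/20/20 ratios, the seed, and which columns get imputed are assumptions; the repo fixes its own random state):

```python
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split


def split_telco(df, target="churn_encoded", seed=123):
    """60/20/20 train/validate/test split, stratified on the target."""
    train_val, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target])
    train, validate = train_test_split(
        train_val, test_size=0.25, random_state=seed,
        stratify=train_val[target])
    return train, validate, test


def impute_means(*splits, cols=("total_charges",)):
    """Fill nulls with each split's own column mean, so no information
    flows from train into validate or test (or vice versa)."""
    out = []
    for split in splits:
        split = split.copy()
        for col in cols:
            split[col] = split[col].fillna(split[col].mean())
        out.append(split)
    return out


def churn_chi2(train, feature="paperless_billing_encoded"):
    """Chi-squared test of independence between a categorical feature
    and churn, run on the train split only."""
    observed = pd.crosstab(train[feature], train["churn_encoded"])
    chi2, p, dof, expected = stats.chi2_contingency(observed)
    return chi2, p
```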

Modeling:

Here we select various machine learning algorithms from the sklearn library to create models, then vary the hyperparameters within each model. From there we compare each model's accuracy on train and validate against a baseline and carry the best performer forward to the test set.
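A minimal sketch of the comparison loop, assuming encoded feature matrices and the churn_encoded target; the particular algorithms and hyperparameter values shown here are illustrative, not the repo's final choices:

```python
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def compare_models(X_train, y_train, X_val, y_val):
    """Fit candidate classifiers and print accuracy on train and validate
    next to a most-frequent-class baseline."""
    models = {
        "baseline": DummyClassifier(strategy="most_frequent"),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(max_depth=5),
        "random_forest": RandomForestClassifier(n_estimators=100, max_depth=7),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(f"{name:>20}: train {model.score(X_train, y_train):.3f}, "
              f"validate {model.score(X_val, y_val):.3f}")
```

Only the single best model on validate is then scored once on test, so the test set stays an honest estimate of out-of-sample performance.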

Delivery:

A final report is created which gives a high-level overview of the process.

Explanations for Reproducibility:

To reproduce this work you will need an env.py file containing the host, username, and password credentials for the SQL server. The remaining files are available in my github repo. If you clone this repo and add an env.py file in the format shown below, you will be able to reproduce the outcome. As an aside, the random state is set in the file; if you change it, your results may differ slightly.

```python
host = 'xxxxx'
username = 'xxxxxx'
password = 'xxxxxx'
# where the strings are your respective credentials
```

Executive Summary:

Conclusion:

Our models beat the baseline, so their predictive power is useful. More data might highlight additional interesting relationships.

Specific Recommendations:

Obtain more quantitative data on the projected disposable income of each household.

Actionable Example:

Offer a Telco credit card. This would let us collect more quantitative information on credit ratings and household income, giving insight into each household's projected disposable income. The goal is to use that data to maximize profits, perhaps by offering incentives that reduce churn.

Closing Quote:

“Errors using inadequate data are much less than those using no data at all.” (Charles Babbage, English Mathematician)
