🐍 Utilising Python w/ Pandas and Jupyter to create Dataframes to read in 1994 US Census data to compile an Analytical Base Table (ABT) which displays a Categorical and Continuous Data Quality Report. Dealing with Data Quality Issues (DQIs) such as Cardinality Issues, Outliers and Missing Values.

Stephen2697/MachineLearning

DataAnalyser.py [v1.0]


Creator: Stephen Alger
Machine Learning Introduction - Modelling Data, Deriving Modal Values, Measures of Central Tendency, Variation & Standard Deviation.
Last Update: 29 October 2019

Program Descriptor:

Uses Pandas and Jupyter to read 1994 US Census data into DataFrames and compile an Analytical Base Table (ABT) that presents a Categorical and a Continuous Data Quality Report. The reports support dealing with Data Quality Issues (DQIs) such as cardinality issues, outliers and missing values.
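The three DQIs named above can each be checked with a few lines of pandas. The following is a minimal sketch, not the script's own code: the function names `dqi_summary` and `iqr_outlier_counts`, the sample column names, the `"?"` missing-value marker and the 1.5 x IQR outlier rule are all assumptions chosen for illustration.

```python
import pandas as pd


def dqi_summary(df: pd.DataFrame, missing_token: str = "?") -> pd.DataFrame:
    """Per-column missing-value percentage and cardinality.

    Assumption: missing entries are encoded as `missing_token`.
    """
    cleaned = df.replace(missing_token, pd.NA)
    return pd.DataFrame({
        # Share of rows with no usable value in each column.
        "Missing %": cleaned.isna().mean() * 100,
        # Cardinality = number of distinct non-missing values.
        "Cardinality": cleaned.nunique(),
    })


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] per numeric column.

    Assumption: the IQR rule is just one common outlier convention.
    """
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Hypothetical sample rows, not taken from the real data set.
sample = pd.DataFrame({
    "age": [30, 32, 34, 36, 38, 95],
    "workclass": ["Private", "?", "Private", "State-gov", "?", "Private"],
})

if __name__ == "__main__":
    print(dqi_summary(sample))
    print(iqr_outlier_counts(sample))
```

A column with very high cardinality, a large missing percentage or many flagged outliers is a candidate for closer inspection before modelling.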

How to Run DataAnalyser Script:

--Run This Script as Follows
$ cd /Github/ML/ML_DataQuality
$ chmod 755 DataAnalyser.py     #-- if needed 
$ python3 DataAnalyser.py

Data Set In Use: 1994 US Census Data - see files below:

  • https://archive.ics.uci.edu/ml/datasets/Adult
  • http://cseweb.ucsd.edu/classes/sp15/cse190-c/reports/sp15/048.pdf

Design Notes:

  • Read in US Census Data from .csv into a pandas DataFrame
  • Get the Column Headings from .txt file
  • Separate Continuous and Categorical Columns
  • Process both types of data in their respective ways:
      * Continuous - [1st Quartile, Mean, Median, 3rd Quartile, Max, Standard Deviation]
      * Categorical - [Modal Value, Mode Freq, Mode %, 2nd Modal Value, 2nd Mode Freq, 2nd Mode %]
  • Build respective output dataframes with input data columns as features
  • Output both Categorical and Continuous Data to separate CSV files in Analytical Base Table Format
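
The steps above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the contents of DataAnalyser.py: the function names, the in-memory sample rows, the four column names, and the assumption that the headings file holds whitespace-separated names are all hypothetical. Columns are split by dtype (numeric means continuous), which is one possible way to separate them.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the .csv data file and the .txt headings file.
SAMPLE_CSV = """\
39, State-gov, 77516, Bachelors
50, Self-emp, 83311, Bachelors
38, Private, 215646, HS-grad
53, Private, 234721, HS-grad
28, Private, 338409, Bachelors
"""
SAMPLE_HEADINGS = "age\nworkclass\nfnlwgt\neducation\n"


def load_census(csv_source, headings):
    """Read a header-less CSV and attach the supplied column headings."""
    return pd.read_csv(csv_source, header=None, names=headings,
                       skipinitialspace=True)


def continuous_report(df: pd.DataFrame) -> pd.DataFrame:
    """One row per numeric feature with the continuous statistics."""
    cont = df.select_dtypes(include="number")
    report = pd.DataFrame({
        "1st Quartile": cont.quantile(0.25),
        "Mean": cont.mean(),
        "Median": cont.median(),
        "3rd Quartile": cont.quantile(0.75),
        "Max": cont.max(),
        "Standard Deviation": cont.std(),  # sample std (ddof=1)
    })
    report.index.name = "Feature"
    return report


def categorical_report(df: pd.DataFrame) -> pd.DataFrame:
    """One row per non-numeric feature with 1st and 2nd mode details."""
    cat = df.select_dtypes(exclude="number")
    rows = []
    for name in cat.columns:
        counts = cat[name].value_counts()  # descending, NaN excluded
        total = counts.sum()
        row = {}
        for prefix, rank in (("", 0), ("2nd ", 1)):
            present = len(counts) > rank
            row[f"{prefix}Modal Value"] = counts.index[rank] if present else None
            row[f"{prefix}Mode Freq"] = int(counts.iloc[rank]) if present else 0
            row[f"{prefix}Mode %"] = (
                100 * counts.iloc[rank] / total if present else 0.0
            )
        rows.append(row)
    report = pd.DataFrame(rows, index=list(cat.columns))
    report.index.name = "Feature"
    return report


def save_reports(df: pd.DataFrame) -> None:
    """Write both reports to the CSV file names listed under Output Files."""
    continuous_report(df).to_csv("ContinuousDataQualityReportABT.csv")
    categorical_report(df).to_csv("CategoricalDataQualityReportABT.csv")


if __name__ == "__main__":
    census = load_census(io.StringIO(SAMPLE_CSV), SAMPLE_HEADINGS.split())
    print(continuous_report(census))
    print(categorical_report(census))
```

With real files, `load_census` would be given the path to the .csv file and the headings read from the .txt file, and `save_reports` would then write the two ABT files to the working directory.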

Compatibility:

  • Tested on Python 3.7.2 on macOS Catalina (not compatible with Python 2.7)
  • Also Tested in Jupyter Notebook

Sample Output Graphs

[Image: Output Graph]

Output Files:

  • ContinuousDataQualityReportABT.csv
  • CategoricalDataQualityReportABT.csv

Known Issues / Points of Improvement (Non-Exhaustive):

  • Yet to Be Added...
