Akshi Chaudhary
: @akshi8Tina Qian
: @TinaQian2017Xinbin Huang
: @xinbinhuang
- Date : March 18, 2018
- Release : v4
The tidyplusPy
package is an essential data cleaning package with features like missing value treatment, data type cleansing and displaying data as markdown table for documents. The package adds a few additional functionalities on the existing data wrangling packages (i.e. Pandas). The objective of this package is to provide a few specific functions to solve some of the pressing issues in data cleaning.
The package needs to be installed from GitHub. Open your Anaconda or Terminal, and type in:
# if you have git installed in your computer already, try:
pip install git+https://github.com/UBC-MDS/tidyplusPy.git
# if you do not have git installed, try:
pip install https://github.com/UBC-MDS/tidyplusPy/zipball/master
There are 5 functions, typemix
, cleanmix
, EM
, md
and mmm
, in the package tidyplusPy. To import the package:
import tidyplusPy
Three main parts including different functions in tidyplusPy
-
Data Type Cleansing :
-
typemix
- The function helps to find the columns containing different types of data, like character and numeric. The input of the function is a data frame, and the output of the function will be a list of 3 data frames reporting details about the mixture of data types. The first data frame in the list is the same as the input data frame, the second one tells you the location and types of data in the columns where there is type mixture. The third data frame is a summary of the second data frame.
-
cleanmix
- The function helps to clean our data frame. After knowing where the mixture of data types is, one can use this function to keep/delete a type of data in certain columns. Here, the input will be an output by the
typemix
function, ID of the column(s) (the ID is the numbering of the column(s)) that they want to clean, the type of data they want to work on, and if they want to keep or delete the certain type. The output will be a data frame like the original one but with specified data type in the certain columns deleted.
- The function helps to clean our data frame. After knowing where the mixture of data types is, one can use this function to keep/delete a type of data in certain columns. Here, the input will be an output by the
-
-
Missing Value Treatment : Basic Imputation and EM Imputation -
em
andmmm
-
Basic Imputation: function used
mmm
replace missing values in a column of a dataframe, or multiple columns of dataframe based on themethod
of imputation- (Method = 'Mean') replace using mean
- (Method = 'Median') replace using median
- (Method = 'Mode') replace using mode
-
EM Imputation: Bonus function used
em
(method = "EM")- Uses EM(Expectation- Maximization) algorithm to predict the closest value to the missing value
- Can be used for both numeric and categorical predictions
-
-
Markdown Table:
md_new()
: This function creates a bare bone for generating a markdown.
This is a basic example which shows you how to solve a common problem:
# prepare data frame
import pandas as pd
from tidyplusPy.typemix import typemix
d={'x1':[1,2,3,"1.2.3"],
'x2':["test","test",1,True],
'x3':[True,True,False,False]}
dat=pd.DataFrame(data=d)
# run the function
typemix(dat)
# prepare data frame
import pandas as pd
from tidyplusPy.typemix import typemix
from tidyplusPy.cleanmix improt cleanmix
d={'x1':[1,2,3,"1.2.3"],
'x2':["test","test",1,True],
'x3':[True,True,False,False]}
dat=pd.DataFrame(data=d)
result=typemix(dat) # need result from typemix function as input
# run the function
cleanmix(result,column=c(1,2),type=c("number","character"))
- Works on pandas dataframe
from tidyplusPy import mmm as tm
NaN = float('nan')
ID = [1, 2, 3, 4, 5, 6, 7]
A = [NaN, NaN, NaN, 0.1, 0.3, 0.3, 0.4]
B = [0.2, NaN, 0.2, 0.7, 0.9, NaN, NaN]
C = [NaN, 'A', 'B', NaN, 'C', 'D', 'D']
D = [NaN, 'C', 'E', NaN, 'C', 'H', 'D']
columns = {'A':A, 'B':B, 'C':C, 'D':D}
df = pd.DataFrame(columns, index=ID)
df.index.name = 'ID'
tm.mmm(df,method = "mode") ### method can be changed to mean and median as well
- Works on ONLY on nd-array for now
from tidyplusPy import EM as te
matrix= np.random.rand(5,4)
matrix[1,0] = np.nan
matrix[2,1] = np.nan
matrix[4,2] = np.nan
matrix[3,3] = np.nan
te.em(matrix)
from tidyplusPy import md
tbl1 = md.md_new()
# print out the table in markdown syntax
print(tbl1)
# create table of size 3x3; alignment to center
tbl2 = md.md_new(nrow = 3, ncol = 3, align = "c")
print(tbl2)
# provide header
tbl3 = md.md_new(header = ["foo","boo"])
print(tbl3)
Using Data Manipulation functionalities
- Users can use the package when they want to clean and wrangle their data. For example, if the data has not been cleaned yet, users can use function
typemix
to check where data is not clean and usecleanmix
to clean data. Based on personal work experience, the mix of number and character is usually seen in the data collected from the survey. After clean data is ready, one can use theMissing Value Treatment
to deal with missing data by EM algorithm. You can usemd_new()
to create a empty markdown table.
- Data Type Cleansing
- Pandas:string processing function and Pandas:string processing. Brief Version The existing pandas version doesn't have a functionality to explicitly perform string processing/datatype conversion without affecting the overall column type (which is inconvenient when you have really messed up data with mix of strings and numbers)
- Missing Value treatment
- Python doesn't have imputation methods which use
EM algorithm
for missing value treatment, which in fact is very efficient and accurate Imputation methods in python
- Python doesn't have imputation methods which use
- Markdown table in Python
- Python doesn't have a package or library which can output a dataset in the form of a markdown table (User defined)
This is an open source project. Please follow the guidelines below for contribution.
- Open an issue for any feedback and suggestions.
- For contributing to the project, please refer to Contributing for details.