closure.data.pipe is a simple machine learning data pipeline based on functional programming. Data transformations are functions, and pipelines are functions that return data transformations.
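To make the idea concrete, here is a minimal base R sketch of a pipeline function that returns a data transformation. The name createScaler and the toy vectors are hypothetical and only serve as illustration; they are not part of this package.

```r
# Pipeline constructor: fits on training data and returns a transformation.
# createScaler is a hypothetical name used only for this illustration.
createScaler <- function(fit_x) {
  # fitted parameters are stored in the closure environment
  center <- mean(fit_x)
  spread <- sd(fit_x)
  # the returned function is the data transformation
  function(x) (x - center) / spread
}

train_x <- c(1, 2, 3, 4, 5)
test_x <- c(3, 6)

scaleX <- createScaler(train_x)
scaled_train <- scaleX(train_x)
# the test data is scaled with the mean and sd of train_x, not of test_x
scaled_test <- scaleX(test_x)
```

The point of the pattern is that everything learned from the training data is captured once, so the same function can later be applied to any new data without refitting.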
This repository contains:
- Examples of how to structure your code so that data processing for machine learning is done correctly.
- An R package with a few predefined data transformations.
Use closure.data.pipe to transform input data before calling the R functions train() and predict().
I got frustrated that R did not have good data pipelines similar to Python's sklearn.pipeline.Pipeline, but luckily R function closures enable pipelines almost out of the box.
This repository uses the data.table package to represent and process data, but you can implement a similar approach with dplyr or with a plain data.frame.
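As a rough illustration of the plain data.frame variant, the sketch below fits a median on the training data and uses it to impute missing values. The function name createTransformFunctionDF, the column names and the toy data are assumptions made for this example only.

```r
# Hypothetical data.frame version of the closure pattern (illustration only).
createTransformFunctionDF <- function(fit_df) {
  # fit: remember the training median of x for imputation
  x_median <- median(fit_df$x, na.rm = TRUE)
  function(input_df) {
    # flag missing values before they are overwritten
    input_df$x_na <- as.integer(is.na(input_df$x))
    # impute with the median learned from the training data
    input_df$x[is.na(input_df$x)] <- x_median
    input_df
  }
}

train_df <- data.frame(x = c(1, 2, NA, 4), y = c(0, 1, 0, 1))
test_df <- data.frame(x = c(NA, 10), y = c(1, 0))

transformDF <- createTransformFunctionDF(train_df)
transformed_test_df <- transformDF(test_df)
```

One difference worth keeping in mind: a data.frame is copied when it is modified inside a function, so test_df stays unchanged here, whereas data.table's := operator modifies the input by reference.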
#' Create a function that transforms input data for a model.
#'
#' @param fit_dt training data used to fit the data transformations
#'
#' @return transformation function
createTransformFunction <- function(fit_dt) {
  # fit any transformation function on the training data
  fitted_transformation <- createFittedTransformation(fit_dt)

  transformData <- function(input_dt) {
    # define the label
    input_dt[, label := as.character(y)]
    # validate binary label
    stopifnot(input_dt[, all(label %in% c("0", "1"))])
    # call a fitted transformation
    input_dt[, transformed_x := fitted_transformation(x)]
    return(input_dt)
  }
  return(transformData)
}
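# fit the transformation on the training data only
transformData <- createTransformFunction(train_data)
# serialise the fitted transformation so it can be reused at prediction time
save(transformData, file="serialised_transformation.Rdata")
# apply the same fitted transformation to both train and test data
transformed_train_data <- transformData(train_data)
transformed_test_data <- transformData(test_data)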
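The following self-contained sketch shows one way the saved transformation could be reloaded and combined with a model. It is an illustration under assumptions: createTransformFunction is a simplified stand-in that works on a data.frame, the data is made up, and base R lm() is used in place of a train() call.

```r
# Simplified stand-in for createTransformFunction (illustration only).
createTransformFunction <- function(fit_df) {
  x_mean <- mean(fit_df$x)
  function(input_df) {
    input_df$centered_x <- input_df$x - x_mean
    input_df
  }
}

train_data <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))
test_data <- data.frame(x = c(5, 6))

transformData <- createTransformFunction(train_data)

# training time: persist the fitted transformation and fit a model
transform_file <- tempfile(fileext = ".Rdata")
save(transformData, file = transform_file)
model <- lm(y ~ centered_x, data = transformData(train_data))

# prediction time (for example in a fresh R session): restore and reuse it
rm(transformData)
load(transform_file)  # recreates transformData together with its x_mean
predictions <- predict(model, newdata = transformData(test_data))
```

Because save() stores a closure together with its enclosing environment, the restored function still carries the values fitted on the training data.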
This R package implements the following predefined transformations:
- getCategTransform processes a categorical input column:
  - unseen values map to a separate category
  - values that occur fewer than N times, where N is an argument, are pooled together
- getDiscretizeTransform discretizes a continuous input column. Bin boundaries are set with k-means.
- processNAColumn transforms a continuous input column into two columns:
  - the input column with NA values mapped to zero
  - a new column <input column name>_na that has value one for NA values and zero otherwise
- getAveragingTransform calculates the average of a continuous input column over a given group. For example, it can add a new column that has the average sales by country.
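To give a feel for what such transformations do, here is a rough base R sketch of the categorical pooling and NA handling described above. The functions sketchCategTransform and sketchNAColumns, the min_count argument and the "pooled" and "unseen" labels are all hypothetical; this is not the package's implementation, which works on data.table columns.

```r
# Hypothetical stand-in for a categorical transformation (not the package code).
sketchCategTransform <- function(fit_values, min_count) {
  counts <- table(fit_values)
  # values seen at least min_count times keep their own category
  frequent <- names(counts)[counts >= min_count]
  # values seen in training but too rarely are pooled together
  rare <- setdiff(names(counts), frequent)
  function(values) {
    values <- as.character(values)
    # anything not seen during fitting falls into a separate category
    out <- rep("unseen", length(values))
    out[values %in% rare] <- "pooled"
    keep <- values %in% frequent
    out[keep] <- values[keep]
    out
  }
}

# Hypothetical stand-in for NA handling: zero-fill plus an indicator column.
sketchNAColumns <- function(df, col) {
  df[[paste0(col, "_na")]] <- as.integer(is.na(df[[col]]))
  df[[col]][is.na(df[[col]])] <- 0
  df
}

categ <- sketchCategTransform(c("a", "a", "a", "b", "b", "c"), min_count = 2)
categ_result <- categ(c("a", "c", "z"))  # "a" "pooled" "unseen"

na_result <- sketchNAColumns(data.frame(ctr = c(0.5, NA)), "ctr")
```

Note that the categorical sketch is a closure fitted on training values, while the NA sketch needs no fitting and is applied directly, which mirrors how the two kinds of functions are named in the list above.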
This example shows how to use the predefined transformations:
createTransformFunction <- function(fit_dt) {
  fit_dt <- copy(fit_dt)
  # create column recency as difference of input columns ts and ts_old
  fit_dt[, recency := as.numeric(difftime(ts, ts_old, units="mins"))]
  # change data type
  fit_dt[, x := as.character(x)]

  # define fitted transformations as a list of functions
  column_transforms <- list(
    # map values of x that occur more than 1000 times to separate values
    getCategTransform("x", fit_dt, threshold=1000),
    # discretise recency column to 10 categorical bins with k-means
    getDiscretizeTransform("recency", fit_dt, n=10),
    # add new column avg_sales that has the average sales of the country
    getAveragingTransform(
      group_cols="country",
      avg_col="sales",
      out_col="avg_sales",
      fit_dt
    )
  )
  # each element of column_transforms is a
  # list(col="<output column name>", fun=<transformation function>)

  # remove fit_dt from environment to prevent serialisation of fit_dt
  rm(fit_dt)

  transformData <- function(input_dt) {
    # add click-through-rate based on view and click columns
    input_dt[views > 20, ctr := clicks / views]
    input_dt[, x := as.character(x)]
    input_dt[, recency := as.numeric(difftime(ts, ts_old, units="mins"))]

    # apply transformations defined in outer scope
    for (transform in column_transforms) {
      input_dt[, (transform$col) := transform$fun(.SD)]
    }

    # handle missing numeric values
    processNAColumn(input_dt, "avg_sales")
    processNAColumn(input_dt, "ctr")

    # create a binary label
    input_dt[, label := ifelse(sales > 1000, "Y", "N")]

    # validate that there are no NA values in output
    stopifnot(all(!is.na(input_dt)))
    return(input_dt)
  }
  return(transformData)
}
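The rm(fit_dt) step matters because a serialised closure carries its whole enclosing environment. The base R sketch below tries to make that visible by comparing serialised sizes; the function names and the random data are made up for this illustration.

```r
# Closure that keeps the training data in its environment (illustration only).
createWithData <- function(fit_x) {
  x_mean <- mean(fit_x)
  function(x) x - x_mean
}

# Same closure, but the training data is removed before returning.
createWithoutData <- function(fit_x) {
  x_mean <- mean(fit_x)
  # drop the training data from the closure environment
  rm(fit_x)
  function(x) x - x_mean
}

big_x <- rnorm(1e5)

# serialize() returns a raw vector, so its length is the size in bytes
size_with <- length(serialize(createWithData(big_x), NULL))
size_without <- length(serialize(createWithoutData(big_x), NULL))
```

In this sketch size_with should be larger than size_without by roughly the size of big_x (about 800 kB for 1e5 doubles), while both closures return identical results.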
You can use any functions inside the transform function, for example the preprocessing functions of the caret package.
# clone this repo
# start an R shell in the repository root folder
source("install_package.R")
source("build_package.R")