This pipeline focuses on data preprocessing, standardization, and cleaning, with additional features to identify univariate anomalies.
- I used scikit-learn's Pipeline and Transformer concepts to build this preprocessing pipeline.
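As a rough illustration of that concept (not AutoPrep's actual internals), a custom transformer only needs `fit` and `transform` to be chained in a scikit-learn `Pipeline`. The `MedianImputer` class and the `demo_pipeline` name below are hypothetical and exist only for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class MedianImputer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: fills NaNs with medians learned in fit()."""

    def fit(self, X, y=None):
        # Learn per-column medians from the training data only.
        self.medians_ = pd.DataFrame(X).median()
        return self

    def transform(self, X):
        # Apply the training medians to any new data.
        return pd.DataFrame(X).fillna(self.medians_)


# Steps run in order: impute first, then scale.
demo_pipeline = Pipeline(steps=[
    ("impute", MedianImputer()),
    ("scale", StandardScaler()),
])

train = pd.DataFrame({"Age": [25.0, 30.0, 35.0, 40.0]})
test = pd.DataFrame({"Age": [25.0, np.nan]})

demo_pipeline.fit(train)
scaled = demo_pipeline.transform(test)  # NumPy array of shape (2, 1)
```

Because the medians and scaling parameters are learned in `fit`, the missing test value is filled with the training median (32.5) and therefore maps to 0 after scaling.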
pip install AutoPrep
Dependencies:
- scikit-learn
- category_encoders
- bitstring
To use the pipeline, import the required libraries and initialize AutoPrep. Here is a basic example:
import numpy as np
import pandas as pd

from AutoPrep import AutoPrep

# Training data used to fit the pipeline
X_train = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Alice', 'Alice', 'Alice'],
    'Rank': ['A', 'B', 'C', 'D'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000.00, 60000.50, 75000.75, 8_000],
    'Hire Date': pd.to_datetime(['2020-01-15', '2019-05-22', '2018-08-30', '2021-04-12']),
    'Is Manager': [False, True, False, ""],  # last entry is an empty string
})

# Test data: differs from the training data by a new name ('Bob'),
# a missing age (np.nan), and a much larger last salary (8_000_000)
X_test = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Alice', 'Alice', 'Bob'],
    'Rank': ['A', 'B', 'C', 'D'],
    'Age': [25, 30, 35, np.nan],
    'Salary': [50000.00, 60000.50, 75000.75, 8_000_000],
    'Hire Date': pd.to_datetime(['2020-01-15', '2019-05-22', '2018-08-30', '2021-04-12']),
    'Is Manager': [False, True, False, ""],
})

# Fit on the training data, then transform the test data
pipeline = AutoPrep(remove_columns_no_variance=False)
pipeline.fit(X=X_train)
X_output = pipeline.transform(X=X_test)
X_output
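Both methods (the modified Z-score, MOD Z-Value, and Tukey's method) are robust to outliers: they rely on the median, the median absolute deviation, and quartiles rather than the mean and standard deviation, so the location estimate is not biased by extreme values. They also complement multivariate anomaly detection algorithms by flagging univariate anomalies.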
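The two methods can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not AutoPrep's own code: the function names are hypothetical, and the thresholds (3.5 for the modified Z-score, 1.5 times the IQR for Tukey's fences) are the conventional textbook values, which may differ from the ones the pipeline uses.

```python
import numpy as np


def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD."""
    x = np.asarray(values, dtype=float)
    median = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - median))  # median absolute deviation
    if mad == 0:
        # The score is undefined when MAD is zero; this sketch
        # simply returns zeros (an arbitrary choice for illustration).
        return np.zeros_like(x)
    return 0.6745 * (x - median) / mad


def tukey_outliers(values, k=1.5):
    """Flag values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


# Salary column of X_test from the example above
salaries = [50000.00, 60000.50, 75000.75, 8_000_000]

z_flags = np.abs(modified_z_scores(salaries)) > 3.5  # assumed threshold
tukey_flags = tukey_outliers(salaries)               # assumed k = 1.5
```

With these assumed thresholds, both methods flag only the last salary (8,000,000), while the three ordinary values stay within bounds.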
Recent research reports similar results when nominal columns are encoded with significantly fewer dimensions:
- John T. Hancock and Taghi M. Khoshgoftaar. "Survey on categorical data for neural networks." In: Journal of Big Data 7.1 (2020), pp. 1–41. Tables 2 and 4.
- Diogo Seca and João Mendes-Moreira. "Benchmark of Encoders of Nominal Features for Regression." In: World Conference on Information Systems and Technologies. 2021, pp. 146–155. See p. 151.
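To make the dimensionality argument concrete, the sketch below compares one-hot encoding with a binary encoding of the same nominal column. Binary encoding is used here only as an assumed example of a lower-dimensional encoder; the text above does not state which encoder is meant, and `binary_encode` is a hypothetical helper rather than part of AutoPrep or category_encoders.

```python
import math

import pandas as pd


def binary_encode(series):
    """Hypothetical helper: give each category an integer code and spread
    the bits of that code over ceil(log2(k + 1)) columns."""
    categories = sorted(series.dropna().unique())
    # Codes start at 1 so that 0 stays free for missing values.
    codes = series.map({cat: i + 1 for i, cat in enumerate(categories)})
    codes = codes.fillna(0).astype(int)
    n_bits = max(1, math.ceil(math.log2(len(categories) + 1)))
    # Column b holds bit b of the code, most significant bit first.
    bits = {
        f"{series.name}_{b}": (codes // 2 ** (n_bits - 1 - b)) % 2
        for b in range(n_bits)
    }
    return pd.DataFrame(bits, index=series.index)


rank = pd.Series(list("ABCDEFGH"), name="Rank")

one_hot = pd.get_dummies(rank)  # one column per category: 8 columns
binary = binary_encode(rank)    # ceil(log2(9)) = 4 columns
```

For eight categories, one-hot encoding needs eight columns while this binary scheme needs four, and the gap widens as the number of categories grows.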