Skip to content
/ synloc Public

A Python Package to Create Synthetic Tabular Data

License

Notifications You must be signed in to change notification settings

alfurka/synloc

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

69 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

synloc: An Algorithm to Create Synthetic Tabular Data

synloc

Overview | Installation | A Quick Example | Documentation | How to cite?

PyPI Downloads

Overview

synloc is an algorithm to sequentially and locally estimate distributions to create synthetic versions of a tabular data. The proposed methodology can be combined with parametric and nonparametric distributions.

Installation

synloc can be installed through PyPI:

pip install synloc

A Quick Example

Assume that we have a sample with three variables with the following distributions:

$$x \sim Beta(0.1,,0.1)$$ $$y \sim Beta(0.1,, 0.5)$$ $$z \sim 10 y + Normal(0,,1)$$

The distribution can be generated by tools module in synloc:

from synloc.tools import sample_trivariate_xyz
data = sample_trivariate_xyz() # Generates a sample with size 1000 by default. 

Initializing the resampler:

from synloc import LocalCov
resampler = LocalCov(data = data, K = 30)

Subsample size is defined as K=30. Now, we locally estimate the multivariate normal distribution and from each estimated distributions we draw "synthetic values."

syn_data = resampler.fit() 
100%|██████████| 1000/1000 [00:01<00:00, 687.53it/s]

syn_data is a pandas.DataFrame where all variables are synthesized. Comparing the original sample using a 3-D Scatter:

resampler.comparePlots(['x','y','z'])

How to cite?

If you use synloc in your research, please cite the following paper:

@article{kalay2022generating,
  title={Generating Synthetic Data with The Nearest Neighbors Algorithm},
  author={Kalay, Ali Furkan},
  journal={arXiv preprint arXiv:2210.00884},
  year={2022}
}