This is an implementation of the following paper:
Arnab Bhattacharyya, Constantinos Daskalakis, Themis Gouleakis, Thanh Vinh Vo, Yuhao Wang.
"Learning High-dimensional Gaussians from Censored Data." arXiv preprint, 2022.
The missingness mechanisms are as follows (see the sketch after the table below):
- Missing Completely At Random (MCAR): a value is missing with some probability $\alpha$;
- Missing At Random (MAR): a fully observed variable leads to the missingness of another variable;
- Missing Not At Random (MNAR): hidden variable(s) lead to the missingness of a fully observed variable.
MCAR | MAR
---|---
Self-masking MNAR | General MNAR
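For concreteness, here is a minimal numpy sketch (not part of the repository) of how the three mechanisms mask an n x d data matrix; thresholds and probabilities are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha = 1000, 5, 0.2
X = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n)

# MCAR: every entry is dropped independently with probability alpha.
X_mcar = np.where(rng.random(X.shape) < alpha, np.nan, X)

# MAR: a fully observed column (here column 0) drives missingness of column 1.
X_mar = X.copy()
X_mar[X[:, 0] > 0, 1] = np.nan

# Self-masking MNAR: a value censors itself when it crosses a threshold.
X_mnar = X.copy()
X_mnar[X[:, 1] > 1.0, 1] = np.nan
```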
Assuming the censoring model is MNAR, we study two settings:
- [Self-censoring]: Under the self-censoring mechanism, we develop a distribution learning algorithm (Algorithm 1 below) that learns $N(\mu^*, \Sigma^*)$ up to TV distance $\varepsilon$.
- [Convex masking]: When the missingness mechanism is general, we design an efficient mean estimation algorithm for a d-dimensional Gaussian $N(\mu^*, \Sigma)$, assuming that the observed missingness pattern is not very rare conditioned on the values of the observed coordinates, and that any small subset of coordinates is observed with sufficiently high probability.
- Recent Advances in Algorithmic High-Dimensional Robust Statistics
- Robustly Learning a Gaussian: Getting Optimal Error, Efficiently
- Workshop https://github.com/YohannaWANG/Missing-Data-Literature
- Python 3.6+
- seaborn
- argparse
- numpy
- pandas
- scipy
- sklearn
- matplotlib
- torch
- cvxpylayers
- tqdm
- `data.py` - generate synthetic data; load real data.
- `config.py` - simulation parameters.
- `utils.py` - different missingness mechanisms, such as self-censoring MNAR, general MNAR, MAR, and MCAR.
- `truncationPSGD` - the implementation of Algorithm 1 in our paper.
- `main.py` - main algorithm.
- `demo.ipynb` - demo of our implementation.
Parameter | Type | Description | Options
---|---|---|---
`n` | int | number of samples | -
`d` | int | number of variables | -
`plot` | bool | whether to plot the chain graph | -
`algorithm` | str | which algorithm to run | `self-censoring`, `convex-masking`
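For illustration, a hypothetical argparse wiring consistent with this table (the actual flag handling lives in `main.py`/`config.py`) might look like:

```python
import argparse

# Hypothetical sketch: flag names and defaults mirror the table above,
# not necessarily the repository's exact definitions.
parser = argparse.ArgumentParser(description="MissingDescent")
parser.add_argument("--n", type=int, default=1000, help="number of samples")
parser.add_argument("--d", type=int, default=50, help="number of variables")
parser.add_argument("--plot", action="store_true", help="plot chain graph or not")
parser.add_argument("--algorithm", type=str, default="self-censoring",
                    choices=["self-censoring", "convex-masking"],
                    help="which algorithm to run")
args = parser.parse_args()
```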
The simplest way to try out MissingDescent is to run a simple example:
$ git clone https://github.com/YohannaWANG/MissingDescent.git
$ cd MissingDescent/
$ python main.py
Alternatively, if you have a CSV data file `X.csv`, you can install the package and run the algorithm as a command:
$ pip install git+git://github.com/YohannaWANG/MissingDescent
$ cd MissingDescent
$ python main.py --algorithm self-censoring --d 50 --n 1000
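The repository's loaders live in `data.py`; if you are preparing your own `X.csv`, a minimal (hypothetical) loading step that marks empty cells as missing would be:

```python
import pandas as pd

# Empty cells become np.nan, matching how missing entries are marked
# in the synthetic generators; shape is (n, d).
X = pd.read_csv("X.csv").to_numpy()
```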
- Algorithm 1 [Truncation_PSGD] Distribution recovery given access to an oracle that generates samples with incomplete data;
- Algorithm 2 [MissingDescent] Mean recovery given access to an oracle that generates samples with incomplete data.
- Algorithm 3 [Initialize] Initialization for the main algorithm.
- Algorithm 4 [SampleGradient] Sampler for $\nabla \ell(\bm{\mu})$.
- Algorithm 5 [ProjectToDomain] The function that projects the current guess back onto the $\mathcal{B}_{\bm{\Sigma}}$ ball.
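For intuition, here is a minimal numpy sketch of the projected-SGD loop that Algorithms 2-5 describe; `sample_gradient` stands in for the paper's SampleGradient oracle, and the step-size schedule and radius are assumptions, not the repository's exact implementation:

```python
import numpy as np

def project_to_domain(mu, mu0, Sigma, radius):
    """ProjectToDomain: pull mu back onto the Sigma-ball around mu0."""
    diff = mu - mu0
    dist = np.sqrt(diff @ np.linalg.solve(Sigma, diff))  # Mahalanobis norm
    return mu if dist <= radius else mu0 + diff * (radius / dist)

def missing_descent(sample_gradient, mu0, Sigma, radius, steps=5000, eta0=0.1):
    """MissingDescent skeleton: PSGD started from Algorithm 3's initializer mu0."""
    mu = mu0.copy()
    for t in range(1, steps + 1):
        g = sample_gradient(mu)            # unbiased estimate of grad l(mu)
        mu = mu - (eta0 / np.sqrt(t)) * g  # SGD step with decaying rate
        mu = project_to_domain(mu, mu0, Sigma, radius)
    return mu
```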
[Truncation_PSGD] Mean absolute percentage error (MAPE) and KL divergence. We fixed N=20,000 and varied the percentage of missing data from 10% to 80%.
[Truncation_PSGD] Running time on synthetic data.
[Truncation_PSGD] Semi-synthetic dataset.
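For reference, the two reported metrics can be computed as follows; this is a standard-formula sketch (function names are ours, not the repository's):

```python
import numpy as np

def mape(true, est):
    """Mean absolute percentage error; assumes nonzero true entries."""
    return np.mean(np.abs((true - est) / true)) * 100

def gaussian_kl(mu1, S1, mu2, S2):
    """Closed-form KL( N(mu1,S1) || N(mu2,S2) ) between two Gaussians."""
    d = mu1.shape[0]
    S2_inv = np.linalg.inv(S2)
    diff = mu2 - mu1
    logdet = np.linalg.slogdet(S2)[1] - np.linalg.slogdet(S1)[1]
    return 0.5 * (np.trace(S2_inv @ S1) + diff @ S2_inv @ diff - d + logdet)
```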
One paragraph in our related work section gives an almost complete history of the work done on these problems. We summarize most of the related work below; the list will be updated accordingly.
https://github.com/YohannaWANG/Missing-Data-Literature
Please feel free to contact us if you encounter any problems when using this code. We are glad to hear advice and update our work. We are also open to collaboration if you think you are working on a problem we might be interested in. Please do not hesitate to contact us!