YohannaWANG/MissingDescent

License: GPL v3

Learning High-dimensional Gaussians from Censored Data

This is an implementation of the following paper:

Arnab Bhattacharyya, Constantinos Daskalakis, Themis Gouleakis, Thanh Vinh Vo, Wang Yuhao

"Learning High-dimensional Gaussians from Censored Data" arXiv preprint arXiv (2022).

Background

The missingness mechanisms are as follows:

  1. Missing Completely At Random (MCAR): a value is missing with some probability $\alpha$.
  2. Missing At Random (MAR): one fully observed variable leads to the missingness of another variable.
  3. Missing Not At Random (MNAR): hidden variable(s) lead to the missingness of a variable.
[Figures: characterizations of MCAR, MAR, self-masking MNAR, and general MNAR]
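
To make the mechanisms listed above concrete, here is a minimal NumPy sketch that builds a missingness mask for each of them. The function names, the median rule used for MAR, and the threshold used for self-masking MNAR are illustrative assumptions; they are not taken from utils.py.

```python
import numpy as np


def mcar_mask(X, alpha, rng):
    """MCAR: every entry is missing independently with probability alpha."""
    return rng.random(X.shape) < alpha


def mar_mask(X, driver=0, target=1):
    """MAR (illustrative rule): column `target` is missing whenever the
    fully observed column `driver` lies above its median."""
    mask = np.zeros(X.shape, dtype=bool)
    mask[:, target] = X[:, driver] > np.median(X[:, driver])
    return mask


def self_masking_mnar_mask(X, threshold=1.0):
    """Self-masking MNAR (illustrative rule): an entry is missing when its
    own value exceeds `threshold`."""
    return X > threshold


def apply_mask(X, mask):
    """Return a copy of X with masked entries replaced by NaN."""
    X_obs = X.copy()
    X_obs[mask] = np.nan
    return X_obs


rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))

X_mcar = apply_mask(X, mcar_mask(X, alpha=0.2, rng=rng))
X_mar = apply_mask(X, mar_mask(X))
X_mnar = apply_mask(X, self_masking_mnar_mask(X))
```

In the MAR case the driver column stays fully observed, while in the self-masking case the missingness of an entry depends only on the value that was hidden.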

Introduction

Assuming the censoring model is MNAR, we study two settings:

  1. [Self-censoring]: Assuming a self-censoring mechanism, we develop a distribution learning algorithm (Algorithm 1 below) that learns $N(\mu^*, \Sigma^*)$ up to TV distance $\varepsilon$.
  2. [Convex masking]: For general missingness mechanisms, we design an efficient mean estimation algorithm for a d-dimensional Gaussian $N(\mu^*, \Sigma)$, assuming that the observed missingness pattern is not very rare conditioned on the values of the observed coordinates, and that any small subset of coordinates is observed with sufficiently high probability.

Related work

  1. Recent Advances in Algorithmic High-Dimensional Robust Statistics
  2. Robustly Learning a Gaussian: Getting Optimal Error, Efficiently
  3. Workshop https://github.com/YohannaWANG/Missing-Data-Literature

Prerequisites

  • Python 3.6+
    • seaborn
    • argparse
    • numpy
    • pandas
    • scipy
    • sklearn
    • matplotlib
    • torch
    • cvxpylayers
    • tqdm

Contents

  • data.py - generate synthetic data. Load real data.
  • config.py - simulation parameters.
  • utils.py - different missingness mechanisms, such as self-censoring MNAR, general MNAR, MAR, and MCAR.
  • truncationPSGD - implementation of Algorithm 1 in our paper.
  • main.py - main algorithm.
  • demo.ipynb - demo of our implementation.

Parameters

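| Parameter | Type | Description             | Options                        |
|-----------|------|-------------------------|--------------------------------|
| n         | int  | number of samples       | -                              |
| d         | int  | number of variables     | -                              |
| plot      | bool | plot chain graph or not | -                              |
| algorithm | str  | which algorithm to run  | self-censoring, convex-masking |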
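
As a rough illustration of how these parameters could be exposed on the command line, the sketch below uses argparse. The `--n`, `--d`, and `--algorithm` flags match the command shown in this README; the `--plot` flag form, the default values, and the `build_parser` helper are assumptions and may differ from what config.py and main.py actually do.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the parameter table above."""
    parser = argparse.ArgumentParser(
        description="Learning Gaussians from censored data (sketch)"
    )
    # Defaults below are assumed for illustration only.
    parser.add_argument("--n", type=int, default=1000,
                        help="number of samples")
    parser.add_argument("--d", type=int, default=50,
                        help="number of variables")
    parser.add_argument("--plot", action="store_true",
                        help="plot chain graph")
    parser.add_argument("--algorithm", type=str, default="self-censoring",
                        choices=["self-censoring", "convex-masking"],
                        help="which algorithm to run")
    return parser


# Parse an explicit argument list so the example runs without a shell.
args = build_parser().parse_args(
    ["--algorithm", "self-censoring", "--d", "50", "--n", "1000"]
)
```

Restricting `--algorithm` with `choices` makes a mistyped algorithm name fail at parse time rather than deep inside the run.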

Running a simple demo

The simplest way to try out MissingDescent is to run a simple example:

$ git clone https://github.com/YohannaWANG/MissingDescent.git
$ cd MissingDescent/
$ python main.py

Running as a command

Alternatively, if you have a CSV data file X.csv, you can install the package and run the algorithm as a command:

$ pip install git+https://github.com/YohannaWANG/MissingDescent
$ cd MissingDescent
$ python main.py --algorithm self-censoring --d 50 --n 1000 

Algorithms

  • Algorithm 1 [Truncation_PSGD] Distribution recovery given access to an oracle that generates samples with incomplete data. [Figure]
  • Algorithm 2 [MissingDescent] Mean recovery given access to an oracle that generates samples with incomplete data. [Figure]
  • Algorithm 3 [Initialize] Initialization for the main algorithm. [Figure]
  • Algorithm 4 [SampleGradient] Sampler for $\nabla \ell(\bm{\mu})$. [Figure]
  • Algorithm 5 [ProjectToDomain] Projects the current guess back onto the domain, the $\ball_{\bm{\Sigma}}$ ball. [Figure]
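
The interplay between a gradient sampler and a projection step can be sketched as a generic projected gradient loop. This is not the paper's algorithm: the ball is assumed here to be a Mahalanobis ball $\{\mu : (\mu - c)^\top \Sigma^{-1} (\mu - c) \le r^2\}$, the projection is the radial one (exact in the Mahalanobis geometry), and all function names, the step-size schedule, and the toy gradient are hypothetical.

```python
import numpy as np


def mahalanobis_norm(v, sigma):
    """sqrt(v^T sigma^{-1} v), computed with a linear solve."""
    return float(np.sqrt(v @ np.linalg.solve(sigma, v)))


def project_to_ball(mu, center, sigma, radius):
    """Radially pull mu back into the assumed Mahalanobis ball."""
    diff = mu - center
    dist = mahalanobis_norm(diff, sigma)
    if dist <= radius:
        return mu
    return center + diff * (radius / dist)


def projected_sgd(grad_fn, mu0, center, sigma, radius, steps=200, lr=0.1):
    """Generic projected gradient descent with an assumed lr / sqrt(t) step."""
    mu = project_to_ball(mu0, center, sigma, radius)
    for t in range(1, steps + 1):
        mu = mu - (lr / np.sqrt(t)) * grad_fn(mu)
        mu = project_to_ball(mu, center, sigma, radius)
    return mu


# Toy objective 0.5 * ||mu - target||^2 whose minimizer lies outside the ball,
# so the constrained solution sits on the boundary.
target = np.array([3.0, 4.0])
mu_hat = projected_sgd(
    grad_fn=lambda mu: mu - target,
    mu0=np.zeros(2),
    center=np.zeros(2),
    sigma=np.eye(2),
    radius=1.0,
)
```

In the actual algorithms the gradient would come from a stochastic sampler rather than a closed-form expression; the toy gradient only keeps the sketch deterministic.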

Performance

[Truncation_PSGD] Mean absolute percentage error (MAPE) and KL divergence. [Figure]

[Truncation_PSGD] We fixed N=20,000 and varied the percentage of missing data from 10% to 80%. [Figure]

[Truncation_PSGD] Running time on synthetic data. [Figure]

[Truncation_PSGD] Semi-synthetic dataset. [Figure]
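
For readers who want to reproduce metrics of this kind, the sketch below computes MAPE between an estimate and the ground truth, and the closed-form KL divergence between two Gaussians. Which parameters the MAPE is taken over and the direction of the KL divergence used in the plots are not stated here, so both choices, along with the function names, are assumptions.

```python
import numpy as np


def mape(estimate, truth):
    """Mean absolute percentage error, returned as a fraction
    (multiply by 100 for percent). Assumes no zero entries in `truth`."""
    estimate = np.asarray(estimate, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.mean(np.abs((estimate - truth) / truth)))


def gaussian_kl(mu0, sigma0, mu1, sigma1):
    """KL( N(mu0, sigma0) || N(mu1, sigma1) ) for non-degenerate Gaussians:
    0.5 * [ tr(S1^-1 S0) + (mu1-mu0)^T S1^-1 (mu1-mu0) - d + ln(det S1 / det S0) ].
    """
    d = mu0.shape[0]
    diff = mu1 - mu0
    trace_term = np.trace(np.linalg.solve(sigma1, sigma0))
    quad_term = diff @ np.linalg.solve(sigma1, diff)
    # slogdet avoids overflow/underflow of determinants in high dimension.
    _, logdet0 = np.linalg.slogdet(sigma0)
    _, logdet1 = np.linalg.slogdet(sigma1)
    return float(0.5 * (trace_term + quad_term - d + logdet1 - logdet0))


mu_true, sigma_true = np.zeros(2), np.eye(2)
mu_est, sigma_est = np.ones(2), np.eye(2)
kl = gaussian_kl(mu_true, sigma_true, mu_est, sigma_est)
err = mape([1.1, 1.8], [1.0, 2.0])
```

With identity covariances the KL divergence reduces to half the squared distance between the means, which gives a quick sanity check for the implementation.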

Related Works

One paragraph in the related work section of our paper gives an almost complete history of prior work on this problem. We summarize most of the related works at the link below, which will be updated over time.

https://github.com/YohannaWANG/Missing-Data-Literature

Citation

Contacts

Please feel free to contact us if you run into any problems when using this code. We are glad to hear your advice and to update our work. We are also open to collaboration if you are working on a problem that we might be interested in. Please do not hesitate to contact us!
