Make small changes to the documentation (#209)
qubixes authored Nov 14, 2023
1 parent 63f83e4 commit 37d325c
Showing 5 changed files with 48 additions and 42 deletions.
docs/source/about/metasyn_in_detail.rst
Detailed overview of metasyn
==============================

``Metasyn`` is a Python package for generating synthetic data with a focus on maintaining privacy. It is aimed at owners of sensitive datasets, such as public organisations, research groups, and individual researchers, who want to improve the accessibility, reproducibility and reusability of their data. The goal of ``metasyn`` is to make it easy for data owners to share the structure and an approximation of the content of their data with others with fewer privacy concerns.

With this goal in mind, ``metasyn`` restricts itself to the `"synthetically-augmented plausible" <https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsmethodologyworkingpaperseriesnumber16syntheticdatapilot>`__ category of synthetic data, as categorized by the Office for National Statistics (ONS).

* To be used for extended code testing, minimal analytical value, non-negligible disclosure risk.


The ``metasyn`` package also incorporates a **plug-in** system, which enables implementations to generate synthetic data with stricter, formal privacy and disclosure guarantees within the same framework. Moreover, our system provides an **auditable and editable intermediate representation** in the form of a human- and machine-readable ``.json`` metadata file from which new data can be synthesized.

Through our focus on privacy and transparency, ``metasyn`` explicitly avoids generating synthetic data with high analytical validity. The data generated by our system is realistic in terms of data structure and plausible in terms of values for each variable, but any multivariate relations or conditional patterns are excluded. This has implications for how this synthetic data can be used: not for statistical analysis and inference, but rather for initial exploration, analysis script development, and communication outside the data owner’s institution. In the intended use case, external researchers can make use of the synthetic data to assess the feasibility of their intended research before making the (often time-consuming) step of requesting access to the sensitive source data for the final analysis.

The Metasyn Pipeline
----------------------
The three key stages of the ``metasyn`` pipeline include the **estimation** of the MetaFrame from the original data, the **serialization** of the MetaFrame into an auditable and editable intermediate representation, and the **generation** of the synthetic data from the model represented by the MetaFrame or its serialized representation. This section provides a walkthrough of these steps.

.. image:: /images/pipeline_basic.png
   :alt: Metasyn Pipeline

The following illustrates a simple example of how metasyn can be used, and where the steps in the pipeline (estimation, serialization/deserialization, and generation) occur.

A public health researcher aiming to conduct research on a sensitive dataset could:

#. Conduct statistical research on a sensitive dataset of medical records.
#. Fit a MetaFrame to the sensitive dataset. (Estimation)
#. Export the MetaFrame to a JSON file following the GMF standard. (Serialization)
#. Check the GMF file to ensure that no private information remains.
#. Use the MetaFrame to generate a synthetic dataset. (Generation)
#. Share the research report alongside the synthetic dataset, the GMF file, the script(s) used for the analysis, and the outcomes of those scripts on both the real and synthetic datasets.

Other researchers can then:

#. Load the MetaFrame by importing the GMF file. (Deserialization)
#. Use the MetaFrame to generate a synthetic dataset. (Generation)

#. Check that the outcomes of the analysis scripts are reproducible on the synthetic data.

This approach builds confidence that the results and conclusions are accurate, without the need to release sensitive data.

Note that this is just one of many possible ways in which metasyn can be used.


Estimation
^^^^^^^^^^^

The generative model for multivariate datasets in ``metasyn`` makes the simplifying assumption that all variables are independent.
There are many advantages to this naïve approach when compared to more advanced generative models: it is transparent and explainable, it is able to flexibly handle data of mixed types, and it is computationally scalable to high-dimensional datasets. As mentioned before, the tradeoff is the limited analytical validity when the independence assumption does
not hold: in the synthetic data, the expected value of correlations, regression parameters, and other measures of association is 0.

Model estimation starts with an appropriately pre-processed data frame. For ``metasyn``, this means the data frame is `tidy <https://www.jstatsoft.org/article/view/v059i10>`_, each column has the correct data type, and missing data are represented by a missing value. Internally, our software uses the `Polars <https://www.pola.rs/>`_ data frame library, as it is fast, has consistent data types, and has native support for missing data (``null``). A simple example source table could look like this (note that categorical data has the appropriate ``cat`` data type, not ``str``):

.. list-table::
   :widths: 10 20 10 20 20
   :header-rows: 1

   * - ID
     - fruits
     - B
     - cars
     - optional
   * - 1
     - banana
     - 5
     - beetle
     - 28
   * - 2
     - banana
     - 4
     - audi
     - 300
   * - 3
     - apple
     - 3
     - beetle
     -
   * - 4
     - apple
     - 2
     - beetle
     - 2
   * - 5
     - banana
     - 1
     - beetle
     - -30


For each data type supported by ``metasyn``, there is a set of candidate distributions that can be fitted to that data type (see the table below). For each variable, the software fits all available distributions of the same variable type and, from those fits, selects the distribution with the lowest `AIC <https://springer.com/chapter/10.1007/978-1-4612-1694-0_15>`_.
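The selection step can be sketched in plain Python. The code below is a toy stand-in for metasyn's actual estimation code: it fits a normal and a uniform distribution to a continuous column by maximum likelihood and picks the candidate with the lowest AIC (``AIC = 2k - 2 ln L``):

```python
import math
import random

def normal_aic(xs):
    # MLE for a normal distribution: sample mean and (biased) variance.
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    log_lik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return 2 * 2 - 2 * log_lik  # k = 2 parameters (mu, sigma)

def uniform_aic(xs):
    # MLE for a uniform distribution: the sample minimum and maximum.
    lo, hi = min(xs), max(xs)
    log_lik = -len(xs) * math.log(hi - lo)
    return 2 * 2 - 2 * log_lik  # k = 2 parameters (lo, hi)

rng = random.Random(0)
column = [rng.gauss(5.0, 1.0) for _ in range(1000)]
fits = {"normal": normal_aic(column), "uniform": uniform_aic(column)}
best = min(fits, key=fits.get)
print(best)  # for clearly bell-shaped data, the normal fit has the lower AIC
```

The same idea generalizes to the full candidate sets listed below: every compatible distribution is fitted, and the AIC trades off goodness of fit against the number of parameters.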

.. list-table::
   :header-rows: 1

   * - Variable type
     - Data type
     - Example values
     - Example distribution
   * - continuous
     - float
     - 1.0, 2.1, ...
     - UniformDistribution
   * - discrete
     - int
     - 1, 2, ...
     - DiscreteUniformDistribution
   * - categorical
     - pl.Categorical
     - Yes, No, Maybe, No
     - MultinoulliDistribution
   * - string
     - str
     - A108, C122, B312
     - RegexDistribution
   * - unstructured string
     - str
     - Names, open answers
     - FakerDistribution, LLMDistribution
   * - time
     - time
     - 01:40:12
     - UniformTimeDistribution
   * - date
     - date
     - 1937-10-28
     - UniformDateDistribution
   * - datetime
     - datetime
     - 2022-07-23 08:04:22
     - UniformDateTimeDistribution

.. note::
   See the :doc:`/usage/generating_metaframes` page for information on *how* to generate a MetaFrame.

Serialization and deserialization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. image:: /images/pipeline_serialization_simple.png
   :alt: Metasyn Serialization Step in Pipeline
   :align: center

After a ``MetaFrame`` object is created, ``metasyn`` allows it to be stored in a human- and machine-readable ``.json`` file. This file contains all the (statistical) metadata needed as input for the generation step.
Exported :obj:`MetaFrames <metasyn.metaframe.MetaFrame>` follow the `Generative Metadata Format (GMF) <https://github.com/sodascience/generative_metadata_format>`__, a standard designed to be easy to read and understand.
This allows for manual and automatic editing, as well as sharing.
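Because the intermediate representation is plain JSON, it can be inspected and modified with standard tools before any data is generated. The snippet below edits a GMF-style structure programmatically; the field names are a simplified illustrative sketch, not a verbatim copy of the GMF standard:

```python
import json

# A GMF-style metadata sketch (field names simplified for illustration).
gmf = {
    "n_rows": 891,
    "vars": [
        {
            "name": "Age",
            "type": "discrete",
            "prop_missing": 0.2,
            "distribution": {
                "implements": "core.discrete_uniform",
                "parameters": {"lower": 20, "upper": 40},
            },
        }
    ],
}

# Audit/edit step: widen the age range before synthesizing new data.
gmf["vars"][0]["distribution"]["parameters"]["upper"] = 60
serialized = json.dumps(gmf, indent=2)
print(serialized)
```

Any edit made to the file in this way changes the model from which synthetic data is later generated, which is what makes the representation auditable and editable.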

docs/source/faq.rst

**What is a MetaFrame?**
-------------------------
A MetaFrame is a fitted model that describes the aggregate structure and characteristics of a dataset. It functions like (statistical) metadata for the dataset, providing information about the dataset without revealing the actual data itself. When metasyn is fed a dataset (as a DataFrame), it generates this MetaFrame to capture certain key aspects of the data.

Key elements encapsulated in a MetaFrame include variable names, their data types, the proportion of missing values, and the parameters of the distributions that these variables follow in the dataset. This information is sufficient to understand the overall structure and attributes of the data, without divulging the exact data points.

docs/source/usage/generating_metaframes.rst

This function requires a :obj:`DataFrame` to be specified as a parameter.
.. admonition:: Note on Pandas and Polars DataFrames

   Internally, metasyn uses Polars (instead of Pandas) mainly because typing and the handling of non-existing data is more consistent. It is possible to supply a Pandas DataFrame instead of a Polars DataFrame to the ``MetaFrame.from_dataframe`` method. However, this uses the automatic Polars conversion functionality, which for some edge cases results in problems. Therefore, we recommend that users create Polars DataFrames. The resulting synthetic dataset is always a Polars DataFrame, but this can be easily converted back to a Pandas DataFrame by using ``df_pandas = df_polars.to_pandas()``.


It is possible to print the (statistical metadata contained in the) :obj:`MetaFrame <metasyn.metaframe.MetaFrame>` to the console/output log. This can be done by calling the Python built-in ``print`` function on a :obj:`MetaFrame <metasyn.metaframe.MetaFrame>`:
.. code:: python

   print(mf)

spec
^^^^

It is safe to ignore this warning; however, be aware that without setting the column as unique, metasyn may generate duplicate values for that column when synthesizing data.

To remove the warning and ensure the values in the synthesized column are unique, set the column to be unique (``"column": {"unique": True}``) in the ``spec`` dictionary.

- ``description``: Includes a description for each column in the DataFrame.

.. code:: python

   spec = {
       # Fit `Age` to a discrete uniform distribution ranging from 20 to 40
       "Age": {"distribution": DiscreteUniformDistribution(20, 40)},
       # Use a regex-based distribution to generate `Cabin` values following [A-F][0-9]{2,3}
       "Cabin": {"distribution": RegexDistribution(r"[A-F][0-9]{2,3}")},
   }
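As an illustration of the kind of values a regex-based distribution produces, the following is a hand-rolled sampler for the pattern ``[A-F][0-9]{2,3}`` (this is not metasyn's ``RegexDistribution`` implementation, just a sketch of its behaviour):

```python
import random
import re

def sample_cabin(rng):
    # One uppercase letter A-F followed by two or three digits.
    letter = rng.choice("ABCDEF")
    n_digits = rng.randint(2, 3)
    digits = "".join(rng.choice("0123456789") for _ in range(n_digits))
    return letter + digits

rng = random.Random(1)
cabins = [sample_cabin(rng) for _ in range(5)]
print(cabins)
assert all(re.fullmatch(r"[A-F][0-9]{2,3}", c) for c in cabins)
```

Every sampled value matches the pattern, so the synthetic column keeps the same structure as the original without copying any real cabin numbers.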
privacy
^^^^^^^^^
**privacy** allows you to set the global privacy level for synthetic data generation. If not provided, it defaults to ``None``.

.. warning::
   Privacy features (such as differential privacy or other forms of disclosure control) are currently under active development. More information on currently available extensions can be found in the :doc:`/usage/extensions` section.
docs/source/usage/generating_synthetic_data.rst

Metasyn can **generate synthetic data** from any :obj:`MetaFrame <metasyn.metaframe.MetaFrame>` object.

The generated synthetic data emulates the original data's format and plausibility at the individual record level and attempts to reproduce marginal (univariate) distributions where possible. Generated values are based on the observed distributions. The frequency of missing values is also maintained in the synthetically-augmented dataset.

The generated data does **not** preserve any relationships between variables.
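Conceptually, each column is synthesized on its own: values are drawn from the fitted marginal distribution, and missing values are inserted at the observed rate. A minimal sketch of this per-column generation (not metasyn's actual code; the fitted model below is hypothetical):

```python
import random

def synthesize_column(n, sampler, prop_missing, rng):
    # Draw each value from the fitted marginal distribution; emit None at the
    # observed missing-value rate so missingness is preserved on average.
    return [None if rng.random() < prop_missing else sampler(rng) for _ in range(n)]

rng = random.Random(42)
# Hypothetical fitted model: Age ~ DiscreteUniform(20, 40) with 20% missing.
ages = synthesize_column(100, lambda r: r.randint(20, 40), prop_missing=0.2, rng=rng)
print(ages[:10])
```

Because every column is sampled independently in this fashion, any correlation between synthesized columns is zero in expectation.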

.. warning::

   Before synthetic data can be generated, a :obj:`MetaFrame <metasyn.metaframe.MetaFrame>` object must be :doc:`created </usage/generating_metaframes>` or :doc:`loaded </usage/exporting_metaframes>`.

To generate a synthetic dataset, simply call the :meth:`MetaFrame.synthesize(n) <metasyn.metaframe.MetaFrame.synthesize>` method on a :obj:`MetaFrame <metasyn.metaframe.MetaFrame>` object. This method takes an integer parameter `n` which represents the number of rows of data that should be generated.

.. image:: /images/pipeline_generation_code.png
   :alt: Synthetic Data Generation With Code Snippet
The following code generates 5 rows of data based on a :obj:`MetaFrame <metasyn.metaframe.MetaFrame>` object:

.. code:: python

   mf.synthesize(5)

Upon successful execution of the :meth:`MetaFrame.synthesize(n) <metasyn.metaframe.MetaFrame.synthesize>` method, a `Polars DataFrame <https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/index.html>`_ will be returned.



docs/source/usage/quick_start.rst

Quick start guide
=================
Get started quickly with metasyn using the following example. In this concise demonstration, you'll learn the basic functionality of metasyn by generating synthetic data from the `titanic <https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv>`_ dataset.

.. note::
   A more elaborate version of this page is available as an interactive tutorial on the :doc:`/usage/interactive_tutorials` page.

Importing Libraries
-------------------

The first step is to import the required Python libraries. For this example, we will need Polars and metasyn.


.. code:: python

   import polars as pl

   from metasyn import MetaFrame
