Revarspec #275

Merged
merged 22 commits into from
Mar 8, 2024
Changes from 20 commits
6 changes: 5 additions & 1 deletion .github/workflows/python-package.yml
@@ -40,7 +40,7 @@ jobs:
pylint metasyn
- name: Lint with Ruff
run: |
ruff metasyn
ruff check metasyn
- name: Check docstrings with pydocstyle
run: |
pydocstyle metasyn --convention=numpy --add-select=D417 --add-ignore="D102,D105"
@@ -57,3 +57,7 @@ jobs:
if: ${{ matrix.os != 'macos-latest' }}
run: |
pytest --nbval-lax examples

- name: Test basic example
run: |
python examples/basic_example.py
10 changes: 6 additions & 4 deletions docs/source/faq.rst
@@ -52,13 +52,15 @@ This warning occurs when ``metasyn`` detects a column, that seems to have unique

.. code-block:: python

from metasyn import VarSpec

# Create a list of variable specifications, and specify the column as unique:
var_spec = {
"PassengerId": {"unique": True}
}
var_specs = [
VarSpec("PassengerId", unique=True)
]

# Call the fit_dataframe() function, passing in the `var_specs` list as the `var_specs` argument
mf = MetaFrame.fit_dataframe(df, spec=var_spec)
mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)

More information on how to use the optional parameters in the :meth:`metasyn.MetaFrame.fit_dataframe() <metasyn.metaframe.MetaFrame.fit_dataframe>` function can be found in :doc:`/usage/generating_metaframes` under :ref:`optionalparams`.
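
To illustrate what the warning is about, here is a minimal sketch that checks whether all non-missing values in a column are distinct. It works on plain Python lists rather than a DataFrame, and the helper name ``seems_unique`` is hypothetical; this is not how ``metasyn`` implements its check.

```python
from typing import Hashable, Optional, Sequence


def seems_unique(values: Sequence[Optional[Hashable]]) -> bool:
    """Return True if all non-missing values are distinct.

    Hypothetical helper for illustration only; not part of metasyn.
    """
    # Missing values (None) are ignored when checking for duplicates.
    present = [v for v in values if v is not None]
    # With fewer than two observed values, uniqueness is not informative.
    if len(present) < 2:
        return False
    return len(set(present)) == len(present)


passenger_ids = [1, 2, 3, 4, 5]
embarked = ["S", "C", "S", None, "Q"]

print(seems_unique(passenger_ids))  # True
print(seems_unique(embarked))       # False
```

A column such as ``PassengerId`` passes this check, which is the situation in which declaring it unique in ``var_specs`` makes sense.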

5 changes: 2 additions & 3 deletions docs/source/usage/cli.rst
@@ -85,7 +85,7 @@ The ``create-meta`` command can be used as follows:

.. code-block:: bash

metasyn create-meta --input [input] --output [output]
metasyn create-meta [input] --output [output]

This will:

@@ -154,7 +154,6 @@ column is ``data_free``. It is also required to set the number of rows under the

name = "PassengerId"
data_free = true
unique = true
prop_missing = 0.0
description = "ID of the unfortunate passenger."
var_type = "discrete"
@@ -176,7 +175,7 @@ The ``synthesize`` command can be used as follows:

.. code-block:: bash

metasyn synthesize [input] [output]
metasyn synthesize [input] --output [output]

This will:

33 changes: 12 additions & 21 deletions docs/source/usage/generating_metaframes.rst
@@ -54,20 +54,15 @@ allows you to have more control over how your synthetic dataset is generated wit
parameters:

Besides the required `df` parameter, :meth:`metasyn.MetaFrame.fit_dataframe() <metasyn.metaframe.MetaFrame.fit_dataframe>`
accepts four parameters: ``meta_config``, ``var_specs``, ``dist_providers`` and ``privacy``.
accepts three parameters: ``var_specs``, ``dist_providers`` and ``privacy``.

Let's take a look at each optional parameter individually:

meta_config
^^^^^^^^^^^
**meta_config** is an optional parameter that encompasses all the other parameters; it contains information on the
``var_specs``, ``dist_providers`` and ``privacy``. This parameter is generally used when the configuration is loaded
from a .toml file. Otherwise it is recommended to leave ``meta_config`` at its default value (None) and specify
the other optional parameters.

var_specs
^^^^^^^^^
**var_specs** is an optional list that outlines specific directives for columns (variables) in the DataFrame.
This list can also be generated from a .toml file. In that case, provide the path to the file as a string
instead of a list.
The potential directives include:

- ``name``: This specifies the column name and is mandatory.
@@ -79,12 +74,12 @@ The potential directives include:
.. admonition:: Detection of unique variables

When generating a MetaFrame, ``metasyn`` will automatically analyze the columns of the input DataFrame to detect ones that contain only unique values.
If such a column is found, and it has not manually been set to unique in the ``var_specs`` dictionary, the user will be notified with the following warning:
If such a column is found, and it has not manually been set to unique in the ``var_specs`` list, the user will be notified with the following warning:
``Warning: Variable [column_name] seems unique, but not set to be unique. Set the variable to be either unique or not unique to remove this warning``

It is safe to ignore this warning - however, be aware that without setting the column as unique, ``metasyn`` may generate duplicate values for that column when synthesizing data.

To remove the warning and ensure the values in the synthesized column are unique, set the column to be unique (``"column" = {"unique": True}``) in the ``var_specs`` list.
To remove the warning and ensure the values in the synthesized column are unique, set the column to be unique (``unique = True``) in the ``var_specs`` list.

- ``description``: Includes a description for each column in the DataFrame.

@@ -101,31 +96,27 @@ The potential directives include:
- The ``Name`` column should be populated with realistic fake names using the `Faker <https://faker.readthedocs.io/en/master/>`_ library.
- In the ``Fare`` column, we aim for an exponential distribution.
- Age values in the ``Age`` column should follow a discrete uniform distribution, ranging between 20 and 40.
- The ``Cabin`` column should adhere to a predefined structure: a letter between A and F, followed by 2 to 3 digits (e.g., A40, B721).

The code to achieve this could look like:

.. code-block:: python

from metasyn.distribution import FakerDistribution, DiscreteUniformDistribution, RegexDistribution
from metasyn.config import VarConfig, DistributionSpec
from metasyn.distribution import FakerDistribution, DiscreteUniformDistribution
from metasyn.config import VarSpec

# Create a specification dictionary for generating synthetic data
# Create a specification list for generating synthetic data
var_specs = [
# Ensure unique values for the `PassengerId` column
VarConfig(name="PassengerId", dist_spec=DistributionSpec(unique=True)),
VarSpec("PassengerId", unique=True),

# Utilize the Faker library to synthesize realistic names for the `Name` column
VarConfig(name="Name", dist_spec=FakerDistribution("name")),
VarSpec("Name", distribution=FakerDistribution("name")),

# Fit `Fare` to a log-normal distribution, but base the parameters on the data
VarConfig(name="Name", dist_spec="LogNormalDistribution"),
VarSpec("Fare", distribution="lognormal"),

# Set the `Age` column to a discrete uniform distribution ranging from 20 to 40
VarConfig(name="Age", dist_spec=DiscreteUniformDistribution(20, 40)),

# Use a regex-based distribution to generate `Cabin` values following [A-F][0-9]{2,3}
VarConfig(name="Cabin", dist_spec=cabin_distribution, description="The cabin number of the passenger."),
VarSpec("Age", distribution=DiscreteUniformDistribution(20, 40)),
]

mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)
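
As background for fitting a log-normal distribution with parameters based on the data, the following minimal sketch estimates the two log-normal parameters by maximum likelihood, using only the standard library. The helper ``fit_lognormal`` is hypothetical, the input values are made up for illustration, and this is not ``metasyn``'s fitting code.

```python
import math
import statistics


def fit_lognormal(values):
    """Estimate (mu, sigma) of a log-normal distribution by maximum likelihood.

    Hypothetical helper for illustration; all values must be positive.
    """
    logs = [math.log(v) for v in values]
    mu = statistics.fmean(logs)
    # pstdev divides by n, which matches the maximum-likelihood estimate.
    sigma = statistics.pstdev(logs, mu=mu)
    return mu, sigma


# Made-up positive values standing in for a fare-like column.
example_values = [7.5, 70.0, 8.0, 50.0, 8.5, 20.0]
mu, sigma = fit_lognormal(example_values)
print(f"mu={mu:.3f}, sigma={sigma:.3f}")
```

The estimates are the mean and standard deviation of the logarithms of the data, which is why the values must be strictly positive.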
7 changes: 3 additions & 4 deletions examples/basic_example.py
@@ -1,14 +1,13 @@
from metasyn import MetaFrame, demo_dataframe
from metasyn.config import VarConfig
from metasyn.util import DistributionSpec
from metasyn.config import VarSpec

# example dataframe from polars website
df = demo_dataframe("fruit")

# set ID to unique and B to not unique
specs = [
VarConfig(name="ID", dist_spec=DistributionSpec(unique=True)),
VarConfig(name="B", dist_spec=DistributionSpec(unique=True)),
VarSpec("ID", unique=True),
VarSpec("B", unique=False),
]

# create MetaFrame
16 changes: 8 additions & 8 deletions examples/example_gmf_titanic.json
@@ -4,9 +4,9 @@
"provenance": {
"created by": {
"name": "metasyn",
"version": "0.7.1.dev1+g1f601ea.d20240226"
"version": "0.7.1.dev15+g2ce8291.d20240308"
},
"creation time": "2024-02-27T14:10:08.278961"
"creation time": "2024-03-08T10:54:42.702163"
},
"vars": [
{
@@ -29,7 +29,7 @@
{
"name": "Name",
"type": "string",
"dtype": "Utf8",
"dtype": "String",
"prop_missing": 0.0,
"distribution": {
"implements": "core.freetext",
@@ -47,7 +47,7 @@
{
"name": "Sex",
"type": "categorical",
"dtype": "Categorical",
"dtype": "Categorical(ordering='physical')",
"prop_missing": 0.0,
"distribution": {
"implements": "core.multinoulli",
@@ -108,7 +108,7 @@
{
"name": "Ticket",
"type": "string",
"dtype": "Utf8",
"dtype": "String",
"prop_missing": 0.0,
"distribution": {
"implements": "core.regex",
@@ -177,7 +177,7 @@
{
"name": "Cabin",
"type": "string",
"dtype": "Utf8",
"dtype": "String",
"prop_missing": 0.7710437710437711,
"distribution": {
"implements": "core.regex",
@@ -226,7 +226,7 @@
{
"name": "Embarked",
"type": "categorical",
"dtype": "Categorical",
"dtype": "Categorical(ordering='physical')",
[Review comment, Member] hmm, this is new from polars.

"prop_missing": 0.002244668911335578,
"distribution": {
"implements": "core.multinoulli",
@@ -304,7 +304,7 @@
{
"name": "all_NA",
"type": "string",
"dtype": "Utf8",
"dtype": "String",
"prop_missing": 1.0,
"distribution": {
"implements": "core.na",