sodascience · qubixes · Mar 8, 2024 · Mar 6, 2024 · Mar 6, 2024 · Mar 8, 2024
diff --git a/.github/workflows/python-package.yml b/.github/workflows/python-package.yml
@@ -40,7 +40,7 @@ jobs:
         pylint metasyn
     - name: Lint with Ruff
       run: |
-        ruff metasyn
+        ruff check metasyn
     - name: Check docstrings with pydocstyle
       run: |
         pydocstyle metasyn --convention=numpy --add-select=D417 --add-ignore="D102,D105"
@@ -57,3 +57,7 @@ jobs:
       if: ${{ matrix.os != 'macos-latest' }}
       run: |
         pytest --nbval-lax examples
+
+    - name: Test basic example
+      run: |
+        python examples/basic_example.py
diff --git a/docs/source/faq.rst b/docs/source/faq.rst
@@ -52,13 +52,15 @@ This warning occurs when ``metasyn`` detects a column, that seems to have unique
 
 .. code-block:: python
 
+   from metasyn import VarSpec
+
-   # Create a specification dictionary, and specify the column as unique:
+   # Create a list of variable specifications, and specify the column as unique:
-   var_spec = {
-      "PassengerId": {"unique": True}
-   }
+   var_specs = [
+      VarSpec("PassengerId", unique=True)
+   ]
 
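-   # Call the fit_dataframe() function, passing in the `var_spec` dictionary as the `spec` argument
+   # Call the fit_dataframe() function, passing in the `var_specs` list as the `var_specs` argument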
-   mf = MetaFrame.fit_dataframe(df, spec=var_spec)
+   mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)
 
 More information on how to use the optional parameters in the :meth:`metasyn.MetaFrame.fit_dataframe() <metasyn.metaframe.MetaFrame.fit_dataframe>` function can be found in :doc:`/usage/generating_metaframes` under :ref:`optionalparams`.
 
diff --git a/docs/source/usage/cli.rst b/docs/source/usage/cli.rst
@@ -85,7 +85,7 @@ The ``create-meta`` command can be used as follows:
 
 .. code-block:: bash
 
-   metasyn create-meta --input [input] --output [output]
+   metasyn create-meta [input] --output [output]
 
 This will:
 
@@ -154,7 +154,6 @@ column is ``data_free``. It is also required to set the number of rows under the
 
       name = "PassengerId"
       data_free = true
-      unique = true
       prop_missing = 0.0
       description = "ID of the unfortunate passenger."
       var_type = "discrete"
@@ -176,7 +175,7 @@ The ``synthesize`` command can be used as follows:
 
 .. code-block:: bash
 
-   metasyn synthesize [input] [output]
+   metasyn synthesize [input] --output [output]
 
 This will:
 

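Reviewer note: after this change, both ``create-meta`` and ``synthesize`` are documented with the same shape, a positional input followed by an ``--output`` option. The sketch below reproduces that shape with Python's standard ``argparse`` module. It is a hypothetical stand-in for illustration only, not metasyn's actual CLI code; the program name ``metasyn-sketch``, the file names, and the choice to leave ``--output`` optional are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with the argument shape shown in the updated docs."""
    # "metasyn-sketch" is a made-up program name; this is not metasyn's parser.
    parser = argparse.ArgumentParser(prog="metasyn-sketch")
    subcommands = parser.add_subparsers(dest="command", required=True)
    for name in ("create-meta", "synthesize"):
        command = subcommands.add_parser(name)
        # Positional argument: the input file comes first, without a flag.
        command.add_argument("input", help="path to the input file")
        # Assumption: the output is optional here and defaults to None.
        command.add_argument("--output", default=None, help="path to the output file")
    return parser


# Hypothetical file names, used only to exercise the parser.
args = build_parser().parse_args(["create-meta", "titanic.csv", "--output", "titanic.json"])
print(args.command, args.input, args.output)
```

With this shape, the old ``--input [input]`` form and the old two-positional ``synthesize [input] [output]`` form would both be rejected by the parser, which is why the usage lines in the docs had to change together.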
diff --git a/docs/source/usage/generating_metaframes.rst b/docs/source/usage/generating_metaframes.rst
@@ -54,20 +54,15 @@ allows you to have more control over how your synthetic dataset is generated wit
 parameters:
 
 Besides the required `df` parameter, :meth:`metasyn.MetaFrame.fit_dataframe() <metasyn.metaframe.MetaFrame.fit_dataframe>`
-accepts four parameters: ``meta_config``, ``var_specs``, ``dist_providers`` and ``privacy``.
+accepts three parameters: ``var_specs``, ``dist_providers`` and ``privacy``.
 
 Let's take a look at each optional parameter individually:
 
-meta_config
-^^^^^^^^^^^
-**meta_config** is an optional parameter that encompasses all the other parameters; it contains information on the
-``var_specs``, ``dist_providers`` and ``privacy``. This parameter is generally used when the configuration is loaded
-from a .toml file. Otherwise it is recommended to leave ``meta_config`` at its default value (None) and specify
-the other optional parameters.
-
 var_specs
 ^^^^^^^^^
 **var_specs** is an optional list that outlines specific directives for columns (variables) in the DataFrame.
+This list can also be generated from a .toml file. In that case, pass the path to the file as a string instead of
+a list.
 The potential directives include:
 
     - ``name``: This specifies the column name and is mandatory.
@@ -79,12 +74,12 @@ The potential directives include:
     .. admonition:: Detection of unique variables
 
         When generating a MetaFrame, ``metasyn`` will automatically analyze the columns of the input DataFrame to detect ones that contain only unique values.
-        If such a column is found, and it has not manually been set to unique in the ``var_specs`` dictionary, the user will be notified with the following warning:
+        If such a column is found, and it has not manually been set to unique in the ``var_specs`` list, the user will be notified with the following warning:
         ``Warning: Variable [column_name] seems unique, but not set to be unique. Set the variable to be either unique or not unique to remove this warning``
 
         It is safe to ignore this warning - however, be aware that without setting the column as unique, ``metasyn`` may generate duplicate values for that column when synthesizing data.
 
-        To remove the warning and ensure the values in the synthesized column are unique, set the column to be unique (``"column" = {"unique": True}``) in the ``var_specs`` list.    
+        To remove the warning and ensure the values in the synthesized column are unique, set the column to be unique (``unique=True``) in the ``var_specs`` list.
 
     - ``description``: Includes a description for each column in the DataFrame.
 
@@ -101,31 +96,27 @@ The potential directives include:
     - The ``Name`` column should be populated with realistic fake names using the `Faker <https://faker.readthedocs.io/en/master/>`_ library.
     - In the ``Fare`` column, we aim for an exponential distribution.
     - Age values in the ``Age`` column should follow a discrete uniform distribution, ranging between 20 and 40.
-    - The ``Cabin`` column should adhere to a predefined structure: a letter between A and F, followed by 2 to 3 digits (e.g., A40, B721).
 
     The following code to achieve this would look like:
 
     .. code-block:: python
 
-        from metasyn.distribution import FakerDistribution, DiscreteUniformDistribution, RegexDistribution
-        from metasyn.config import VarConfig, DistributionSpec
+        from metasyn.distribution import FakerDistribution, DiscreteUniformDistribution
+        from metasyn.config import VarSpec
 
-        # Create a specification dictionary for generating synthetic data
+        # Create a specification list for generating synthetic data
         var_specs = [
             # Ensure unique values for the `PassengerId` column
-            VarConfig(name="PassengerId", dist_spec=DistributionSpec(unique=True)),
+            VarSpec("PassengerId", unique=True),
 
             # Utilize the Faker library to synthesize realistic names for the `Name` column
-            VarConfig(name="Name", dist_spec=FakerDistribution("name")),
+            VarSpec("Name", distribution=FakerDistribution("name")),
 
-            # Fit `Fare` to an log-normal distribution, but base the parameters on the data
+            # Fit `Fare` to a log-normal distribution, but base the parameters on the data
-            VarConfig(name="Name", dist_spec="LogNormalDistribution"),
+            VarSpec("Fare", distribution="lognormal"),
 
             # Set the `Age` column to a discrete uniform distribution ranging from 20 to 40
-            VarConfig(name="Age", dist_spec=DiscreteUniformDistribution(20, 40)),
-
-            # Use a regex-based distribution to generate `Cabin` values following [A-F][0-9]{2,3}
-            VarConfig(name="Cabin", dist_spec=cabin_distribution, description="The cabin number of the passenger."),
+            VarSpec("Age", distribution=DiscreteUniformDistribution(20, 40)),
         ]
 
         mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)
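
Reviewer note: the docs in this diff move from a dictionary keyed by column name to a list with one spec per column. For anyone migrating existing code, the conversion might look roughly like the sketch below. ``VarSpecSketch`` is a hypothetical stand-in dataclass written for this note, not metasyn's ``VarSpec``, and ``specs_from_legacy_dict`` is likewise an illustrative helper that does not exist in the library. Only the ``unique`` and ``description`` directives mentioned in the docs are modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VarSpecSketch:
    """Hypothetical stand-in for a per-column spec (not metasyn's VarSpec)."""

    name: str
    unique: Optional[bool] = None
    description: Optional[str] = None


def specs_from_legacy_dict(legacy: dict) -> list:
    """Turn an old-style {"column": {"unique": True}} dict into a list of specs."""
    # Unknown option names raise TypeError, which surfaces typos early.
    return [VarSpecSketch(name=column, **options) for column, options in legacy.items()]


# Old style: one dictionary entry per column.
legacy = {"PassengerId": {"unique": True}, "B": {"unique": False}}

# New style: a list with one spec object per column.
var_specs = specs_from_legacy_dict(legacy)
for spec in var_specs:
    print(spec)
```

The list form keeps the column name inside each spec instead of in a dictionary key, which matches the ``VarSpec("PassengerId", unique=True)`` calls shown in the updated documentation.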

diff --git a/examples/basic_example.py b/examples/basic_example.py
@@ -1,14 +1,13 @@
 from metasyn import MetaFrame, demo_dataframe
-from metasyn.config import VarConfig
-from metasyn.util import DistributionSpec
+from metasyn.config import VarSpec
 
 # example dataframe from polars website
 df = demo_dataframe("fruit")
 
-# set A to unique and B to not unique
+# set ID to unique and B to not unique
 specs = [
-    VarConfig(name="ID", dist_spec=DistributionSpec(unique=True)),
-    VarConfig(name="B", dist_spec=DistributionSpec(unique=True)),
+    VarSpec("ID", unique=True),
+    VarSpec("B", unique=False),
 ]
 
 # create MetaFrame

diff --git a/examples/example_gmf_titanic.json b/examples/example_gmf_titanic.json
@@ -4,9 +4,9 @@
     "provenance": {
         "created by": {
             "name": "metasyn",
-            "version": "0.7.1.dev1+g1f601ea.d20240226"
+            "version": "0.7.1.dev15+g2ce8291.d20240308"
         },
-        "creation time": "2024-02-27T14:10:08.278961"
+        "creation time": "2024-03-08T10:54:42.702163"
     },
     "vars": [
         {
@@ -29,7 +29,7 @@
         {
             "name": "Name",
             "type": "string",
-            "dtype": "Utf8",
+            "dtype": "String",
             "prop_missing": 0.0,
             "distribution": {
                 "implements": "core.freetext",
@@ -47,7 +47,7 @@
         {
             "name": "Sex",
             "type": "categorical",
-            "dtype": "Categorical",
+            "dtype": "Categorical(ordering='physical')",
             "prop_missing": 0.0,
             "distribution": {
                 "implements": "core.multinoulli",
@@ -108,7 +108,7 @@
         {
             "name": "Ticket",
             "type": "string",
-            "dtype": "Utf8",
+            "dtype": "String",
             "prop_missing": 0.0,
             "distribution": {
                 "implements": "core.regex",
@@ -177,7 +177,7 @@
         {
             "name": "Cabin",
             "type": "string",
-            "dtype": "Utf8",
+            "dtype": "String",
             "prop_missing": 0.7710437710437711,
             "distribution": {
                 "implements": "core.regex",
@@ -226,7 +226,7 @@
         {
             "name": "Embarked",
             "type": "categorical",
-            "dtype": "Categorical",
+            "dtype": "Categorical(ordering='physical')",
             "prop_missing": 0.002244668911335578,
             "distribution": {
                 "implements": "core.multinoulli",
@@ -304,7 +304,7 @@
         {
             "name": "all_NA",
             "type": "string",
-            "dtype": "Utf8",
+            "dtype": "String",
             "prop_missing": 1.0,
             "distribution": {
                 "implements": "core.na",