Custom data synthesis strategy for dataframe #1088

francesco086 · 2023-02-10T12:28:34Z

I can't find any hints anywhere on solving this issue, so I am asking for help here!
I promise that if someone helps me out here I will create a PR to the documentation that will help other people in the future :)

I will keep it simple with a small example. I want to write a schema for a df with columns a, b, and c, where c should be a + b.
I want to be able to generate examples too, for testing purposes.

So, I wrote:

from typing import Optional

import pandera as pa
from pandera import extensions
from pandera import strategies as st

def a_plus_b_equals_c_strategy(
    pandera_dtype: pa.DataType,
    strategy: Optional[st.SearchStrategy] = None,
    *,
    a: str,
    b: str,
    c: str,
):
    acs = st.column_strategy(pa.dtypes.Int32)
    bcs = st.column_strategy(pa.dtypes.Int32)
    ccs = st.column_strategy(pa.dtypes.Int32)
    dfs = st.dataframe_strategy(columns={a: acs, b: bcs, c: ccs})
    return dfs

@extensions.register_check_method(
    statistics=["a", "b", "c"],
    check_type="groupby",
    strategy=a_plus_b_equals_c_strategy,
)
def a_plus_b_equals_c(df, *, a: str, b: str, c: str) -> bool:
    return df[c] == df[a] + df[b]

schema = pa.DataFrameSchema(
    {
        "a": pa.Column(int),
        "b": pa.Column(int),
        "c": pa.Column(int),
    },
    checks=[pa.Check.a_plus_b_equals_c(a="a", b="b", c="c")],
    coerce=True,
)

schema.example()

I get this output:

AttributeError: 'column' object has no attribute 'regex'

My venv in python 3.10.6:

hypothesis           6.68.0
pandas               1.5.3
pandera              0.13.4

Bonus question: why isn't it possible to use a plain hypothesis strategy? That was easy:

import numpy as np
from hypothesis.extra.pandas import column, data_frames

a = column("a", dtype=np.int32)
b = column("b", dtype=np.int32)

df = data_frames([a, b]).map(lambda x: x.assign(c=x.a + x.b))
df.example()

I find the documentation very confusing, as it states: strategy (Optional[SearchStrategy]) – if specified, this will raise a BaseStrategyOnlyError, since it cannot be chained to a prior strategy. Then why allow this argument to be passed at all?

cosmicBboy (Maintainer) · 2023-02-14T18:44:04Z

Hi @francesco086, the BaseStrategyOnlyError is raised only for specific strategies, including dataframe_strategy. The argument exists so that the function adheres to pandera's strategy API.

To answer your question, unfortunately pandera's support for cross-column strategies is not great. Here's a working solution:

from typing import Optional
import pandera as pa
from pandera import strategies as st
from pandera import extensions

def pass_through_strat(
    pandera_dtype: pa.DataType,
    strategy: Optional[st.SearchStrategy] = None,
    **kwargs,
):
    return st.pandas_dtype_strategy(pandera_dtype, strategy)

@extensions.register_check_method(
    statistics=["a", "b", "c"],
    check_type="vectorized",
    strategy=pass_through_strat,
)
def a_plus_b_equals_c(df, *, a: str, b: str, c: str) -> bool:
    return df[c] == df[a] + df[b]

schema = pa.DataFrameSchema(
    {
        "a": pa.Column(int),
        "b": pa.Column(int),
        "c": pa.Column(int),
    },
    checks=[pa.Check.a_plus_b_equals_c(a="a", b="b", c="c")],
    coerce=True,
)
strategy = schema.strategy(size=3).map(lambda x: x.assign(c=x.a + x.b))
print(strategy.example())

For reasons I'll chalk up to quirks in pandera's handling of hypothesis strategies, you'll need to create a "pass-through" strategy that just returns the strategy for the column's data type. The reason is that strategies registered alongside checks operate only at the atomic level, i.e., they produce values for a single column.

Then basically we extract the strategy with schema.strategy(size=3) and then take your suggested approach of mapping additional changes to the synthesized dataframe.
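The mapping step can also be exercised on its own in a property-based test. Below is a minimal sketch that uses plain hypothesis rather than schema.strategy; the helper add_c, the value bounds 0 to 10, and the row limits are assumptions made for illustration and are not part of pandera or of the schema above.

```python
import numpy as np
import pandas as pd
from hypothesis import given, settings
from hypothesis import strategies as hst
from hypothesis.extra.pandas import column, data_frames, range_indexes


def add_c(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive the dependent column c from a and b."""
    return df.assign(c=df["a"] + df["b"])


# Base strategy: two independent int32 columns. The bounds are an
# assumption, chosen so that a + b stays far away from int32 overflow.
base_frames = data_frames(
    columns=[
        column("a", elements=hst.integers(min_value=0, max_value=10), dtype=np.int32),
        column("b", elements=hst.integers(min_value=0, max_value=10), dtype=np.int32),
    ],
    index=range_indexes(min_size=1, max_size=5),
)

# Derived strategy: every generated frame gets c computed from a and b.
frames_with_c = base_frames.map(add_c)


@settings(max_examples=25, database=None, deadline=None)
@given(frames_with_c)
def test_c_is_a_plus_b(df: pd.DataFrame) -> None:
    # The cross-column constraint holds by construction for every example.
    assert list(df.columns) == ["a", "b", "c"]
    assert (df["c"] == df["a"] + df["b"]).all()


# Calling the decorated function runs the property outside of pytest too.
test_c_is_a_plus_b()
```

Using @given instead of .example() keeps the generated data reproducible and lets hypothesis shrink failing cases. Inside the test body, the manual assertions could presumably be replaced by a call to schema.validate(df) with the schema defined above.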

See previous discussions of this issue:

How to define a strategy function for a custom wide check #538: make custom strategies more flexible
Pandera strategy re-write: improve base implementation and add API for custom strategies and global schema-level override strategy #561: provide a dataframe-level strategy override

2 replies

francesco086 Feb 15, 2023
Author

So many thanks for the detailed answer! I will try it out asap

francesco086 Feb 21, 2023
Author

Hi @cosmicBboy, I finally had time to go through your answer in detail. It is a little inconvenient, and indeed #561 would be desirable IMHO. I will see if I can help there.

francesco086 (Author) · 2023-02-21T16:05:13Z

I have a suggestion for an alternative approach: what about overriding the example class method? (I switched to the pydantic-inspired syntax.) In principle, the same could be done with strategy.

import numpy as np
import pandas as pd
import pandera as pa
from pandera.typing import Series


class Schema(pa.SchemaModel):
    a: Series[np.int32] = pa.Field(ge=0, le=10, nullable=False)
    b: Series[np.int32] = pa.Field(ge=0, le=10, nullable=False)
    c: Series[np.int32] = pa.Field(nullable=False)

    @pa.dataframe_check
    def a_plus_b_equal_c(cls, df: pd.DataFrame) -> bool:
        return (df[cls.c] == df[cls.a] + df[cls.b]).all()

    @classmethod
    def example(cls) -> pd.DataFrame:
        from hypothesis.extra.numpy import from_dtype
        from hypothesis.extra.pandas import column, data_frames

        a = column(
            cls.a,
            elements=from_dtype(dtype=np.dtype(np.int32), min_value=0, max_value=10),
        )
        b = column(
            cls.b,
            elements=from_dtype(dtype=np.dtype(np.int32), min_value=0, max_value=10),
        )

        df = data_frames([a, b]).map(lambda x: x.assign(c=x.a + x.b))
        return df.example()
