Custom data synthesis strategy for dataframe #1088
Replies: 2 comments 2 replies
Hi @francesco086, to answer your question: unfortunately, pandera's support for cross-column strategies is not great. Here's a working solution:

    from typing import Optional

    import pandera as pa
    from pandera import strategies as st
    from pandera import extensions


    # "Pass through" strategy: ignores the check statistics and only
    # generates values of the column's data type.
    def pass_through_strat(
        pandera_dtype: pa.DataType,
        strategy: Optional[st.SearchStrategy] = None,
        **kwargs,
    ):
        return st.pandas_dtype_strategy(pandera_dtype, strategy)


    # Custom dataframe-level check, registered together with the strategy above.
    @extensions.register_check_method(
        statistics=["a", "b", "c"],
        check_type="vectorized",
        strategy=pass_through_strat,
    )
    def a_plus_b_equals_c(df, *, a: str, b: str, c: str) -> bool:
        return df[c] == df[a] + df[b]


    schema = pa.DataFrameSchema(
        {
            "a": pa.Column(int),
            "b": pa.Column(int),
            "c": pa.Column(int),
        },
        checks=[pa.Check.a_plus_b_equals_c(a="a", b="b", c="c")],
        coerce=True,
    )

    # Generate the columns, then overwrite c so that the check holds.
    strategy = schema.strategy(size=3).map(lambda x: x.assign(c=x.a + x.b))
    print(strategy.example())

For reasons I'll chalk up to quirks around pandera's handling of hypothesis strategies, you'll need to create a "pass through" strategy that just generates values of the column's data type. The reason for this is that strategies registered alongside checks only operate at the atomic level, i.e. they are strategies to produce values of a specific column only. Then basically we extract the strategy from the schema and map over it to assign c = a + b, as the last two lines of the code do.

See previous discussions of this issue.
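To illustrate why the map step is needed, here is a minimal library-free sketch. It does not use pandera or hypothesis; `column_strategy`, `independent_row`, `derived_row` and `satisfies_check` are hypothetical names made up for this illustration. It mimics column-level generation, where every column is drawn without knowledge of the others, and compares it with drawing the same way and then deriving c afterwards.

```python
import random


def column_strategy(rng, low=0, high=10):
    """Hypothetical stand-in for a per-column strategy: one value, no cross-column knowledge."""
    return rng.randint(low, high)


def independent_row(rng):
    # Each column is drawn on its own, so nothing forces c == a + b.
    return {
        "a": column_strategy(rng),
        "b": column_strategy(rng),
        "c": column_strategy(rng, 0, 20),
    }


def derived_row(rng):
    # Same column-level draws, followed by a "map" step that overwrites c.
    row = independent_row(rng)
    row["c"] = row["a"] + row["b"]
    return row


def satisfies_check(row):
    # The cross-column constraint from the schema: c must equal a + b.
    return row["c"] == row["a"] + row["b"]


rng = random.Random(0)  # fixed seed, only for reproducibility
independent = [independent_row(rng) for _ in range(1000)]
derived = [derived_row(rng) for _ in range(1000)]

independent_pass_rate = sum(map(satisfies_check, independent)) / len(independent)
derived_pass_rate = sum(map(satisfies_check, derived)) / len(derived)

print(f"independent columns pass rate: {independent_pass_rate:.3f}")
print(f"derived c pass rate:           {derived_pass_rate:.3f}")
```

Rows whose columns are drawn independently satisfy the constraint only by chance, while rows with a derived c always satisfy it. That is the same role the `.map(lambda x: x.assign(c=x.a + x.b))` call plays in the pandera solution.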
I have a suggestion for an alternative approach: what about overwriting the classmethod that generates examples?

    import numpy as np
    import pandas as pd
    import pandera as pa
    from pandera.typing import Series


    class Schema(pa.SchemaModel):
        a: Series[np.int32] = pa.Field(ge=0, le=10, nullable=False)
        b: Series[np.int32] = pa.Field(ge=0, le=10, nullable=False)
        c: Series[np.int32] = pa.Field(nullable=False)

        # Cross-column constraint: c must equal a + b in every row.
        @pa.dataframe_check
        def a_plus_b_equal_c(cls, df: pd.DataFrame) -> bool:
            return (df[cls.c] == df[cls.a] + df[cls.b]).all()

        @classmethod
        def example(cls) -> pd.DataFrame:
            from hypothesis.extra.numpy import from_dtype
            from hypothesis.extra.pandas import column, data_frames

            # Generate a and b independently within their bounds...
            a = column(
                cls.a,
                elements=from_dtype(dtype=np.dtype(np.int32), min_value=0, max_value=10),
            )
            b = column(
                cls.b,
                elements=from_dtype(dtype=np.dtype(np.int32), min_value=0, max_value=10),
            )
            # ...then derive c from them.
            df = data_frames([a, b]).map(lambda x: x.assign(c=x.a + x.b))
            return df.example()
I can't seem to find any hints anywhere on solving this issue, so I am asking for help here!
I promise that if someone helps me out here I will create a PR to the documentation that will help other people in the future :)
I will keep it simple with a small example. I want to write a schema for a df with columns a, b, and c, where c should be a + b.
I want to be able to generate examples too, for testing purposes.
So, I wrote:
I get this output:
My venv in python 3.10.6:
Bonus question: why isn't it possible to use a hypothesis strategy? That was easy:
I find the documentation very confusing, as it states:
strategy (Optional[SearchStrategy]) – if specified, this will raise a BaseStrategyOnlyError, since it cannot be chained to a prior strategy.
Then why give the option of passing this argument?