How to define a strategy function for a custom wide check #538

suned · 2021-07-02T08:14:56Z

I'm attempting to implement a dataframe level custom check for one-hot encoding for a subset of the columns in a dataframe. I'm interested in using the resulting schema to create a hypothesis strategy, so to that end I'm writing a strategy function as described here. A minimal example:

import pandas as pd
import pandera as pa
import pandera.extensions as extensions


def is_one_hot_strategy(dtype, strategy, *, exclude_cols):
    ...   # How can I implement this?


@extensions.register_check_method(statistics=['exclude_cols'], strategy=is_one_hot_strategy)
def is_one_hot(df: pd.DataFrame, *, exclude_cols) -> pd.Series:
    return df.drop(exclude_cols, axis=1).sum(axis=1) == 1


schema = pa.DataFrameSchema(
    columns={
        'id': pa.Column(pa.Int, allow_duplicates=False),
        'x': pa.Column(pa.Bool),
        'y': pa.Column(pa.Bool)
    },
    checks=[pa.Check.is_one_hot(exclude_cols=['id'])]
)

Since is_one_hot_strategy is called once per column in the schema, it's quite hard to implement a strategy for a wide check: by definition, a wide check is a constraint between columns, not on a single column.

My use case is further complicated by the fact that the is_one_hot check only operates on a subset of the columns, and it's tricky to tell from the dtype alone which columns should be constrained in is_one_hot_strategy.

My questions:

Is there a suggested way to implement strategies for custom wide checks?
Is there a suggested way to implement strategies for dataframe level checks that operate on a subset of columns?
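For reference, one possible workaround is to skip the per-column strategy hook and write a hypothesis strategy that builds the whole dataframe at once. The sketch below is only that: it does not go through register_check_method, the names one_hot_frames and test_one_hot_frames are made up for illustration, and the column names and id range are assumptions taken from the minimal example above.

```python
import pandas as pd
from hypothesis import given, settings, strategies as st


@st.composite
def one_hot_frames(draw, hot_cols=("x", "y"), max_rows=10):
    """Draw a dataframe with a unique 'id' column and one-hot bool columns."""
    n_rows = draw(st.integers(min_value=1, max_value=max_rows))
    # Unique ids, mirroring allow_duplicates=False in the schema.
    ids = draw(
        st.lists(st.integers(-1000, 1000), min_size=n_rows, max_size=n_rows, unique=True)
    )
    # For every row, pick the single column that is "hot".
    hot = draw(st.lists(st.sampled_from(hot_cols), min_size=n_rows, max_size=n_rows))
    data = {"id": pd.Series(ids, dtype="int64")}
    for col in hot_cols:
        data[col] = pd.Series([h == col for h in hot], dtype="bool")
    return pd.DataFrame(data)


def is_one_hot(df, *, exclude_cols):
    # Same logic as the wide check, without the pandera registration.
    return df.drop(exclude_cols, axis=1).sum(axis=1) == 1


@settings(max_examples=25, deadline=None, database=None)
@given(one_hot_frames())
def test_one_hot_frames(df):
    assert is_one_hot(df, exclude_cols=["id"]).all()
```

The constraint is satisfied by construction (one hot column is chosen per row) rather than by filtering, which avoids rejecting most generated examples.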


cosmicBboy · 2021-12-06T14:25:43Z

hi @suned sorry for the long reply-time! currently this isn't quite possible, but we're working on adding more flexibility to the data synthesis strategy functionality:

Pandera strategy re-write: improve base implementation and add API for custom strategies and global schema-level override strategy #561
Custom Hypothesis Strategies #648

Will keep you updated on when support for your use case will be added!

RNKuhns · 2022-05-17T17:55:09Z

I've got a different use case, but similarly want to be able to provide a strategy for data synthesis for multiple column checks.

@cosmicBboy is this currently a work in progress (I know there are a couple related PRs)?

If not, I'm new to pandera's code base, but I am willing to do what I can to help contribute this functionality.

def check_maturity(df):
    # Maturity_Date minus Orig_Date should equal Age for every row
    is_okay = df["Maturity_Date"] - df["Orig_Date"] == df["Age"]
    return is_okay
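A rough sketch of what a hand-written hypothesis strategy for this kind of multi-column constraint could look like: draw two of the columns freely and derive the third. It assumes Orig_Date and Maturity_Date are datetime columns and Age is a timedelta column (the thread does not say), and maturity_frames is a hypothetical name, not part of pandera.

```python
from datetime import date

import pandas as pd
from hypothesis import given, settings, strategies as st


@st.composite
def maturity_frames(draw, max_rows=10):
    """Draw frames where Maturity_Date - Orig_Date == Age holds by construction."""
    n_rows = draw(st.integers(min_value=1, max_value=max_rows))
    orig = draw(
        st.lists(
            st.dates(min_value=date(2000, 1, 1), max_value=date(2030, 12, 31)),
            min_size=n_rows,
            max_size=n_rows,
        )
    )
    # Assumption: Age is measured in whole days, up to roughly 30 years.
    age_days = draw(st.lists(st.integers(0, 365 * 30), min_size=n_rows, max_size=n_rows))
    df = pd.DataFrame(
        {
            "Orig_Date": pd.to_datetime(orig),
            "Age": pd.to_timedelta(age_days, unit="D"),
        }
    )
    # Derive the dependent column instead of generating it independently.
    df["Maturity_Date"] = df["Orig_Date"] + df["Age"]
    return df


@settings(max_examples=25, deadline=None, database=None)
@given(maturity_frames())
def test_maturity_frames(df):
    assert (df["Maturity_Date"] - df["Orig_Date"] == df["Age"]).all()
```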

cosmicBboy · 2022-05-18T02:01:23Z

hi @RNKuhns, can you elaborate on the column checks for which you want to generate data?

It'll help with the design and implementation of the data synthesis strategy customization effort.

No one's currently working on this, so would be happy to work with you to get started.

At a high level, we want to do the following:

- add a strategy kwarg to the Check (and Hypothesis) classes, which is a function that adheres to the pandera strategy function signature (see eq_strategy for example). Supplying a strategy here should override the built-in strategy, which is registered at the Check method level (https://github.com/pandera-dev/pandera/blob/master/pandera/checks.py#L525-L528)
- add a strategy kwarg to all the schema and schema component classes, which serves as an override of the default strategy method; see "Pandera strategy re-write: improve base implementation and add API for custom strategies and global schema-level override strategy" #561 (e.g. https://github.com/pandera-dev/pandera/blob/master/pandera/schemas.py#L887-L904)
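To make the intended precedence concrete, here is a toy model of the override rule described above. Nothing in it is pandera code: ToyCheck, BUILTIN_STRATEGIES, and boundary_only are hypothetical names, and plain functions over random.Random stand in for hypothesis strategies so the sketch runs without extra dependencies.

```python
import random
from typing import Any, Callable, Dict, Optional

Strategy = Callable[..., Any]

# Hypothetical stand-in for the strategies registered per check method.
BUILTIN_STRATEGIES: Dict[str, Strategy] = {
    "in_range": lambda rng, *, min_value, max_value: rng.randint(min_value, max_value),
}


class ToyCheck:
    """Toy check whose `strategy` kwarg overrides the registered strategy."""

    def __init__(self, name: str, strategy: Optional[Strategy] = None, **statistics: Any):
        self.name = name
        self.statistics = statistics
        self._override = strategy

    def resolve_strategy(self) -> Strategy:
        # A user-supplied strategy wins; otherwise fall back to the registry.
        if self._override is not None:
            return self._override
        return BUILTIN_STRATEGIES[self.name]

    def generate(self, seed: int = 0) -> Any:
        rng = random.Random(seed)
        return self.resolve_strategy()(rng, **self.statistics)


def boundary_only(rng: random.Random, *, min_value: int, max_value: int) -> int:
    """User-defined override: only ever produce the two boundary values."""
    return rng.choice([min_value, max_value])


default_check = ToyCheck("in_range", min_value=0, max_value=10)
custom_check = ToyCheck("in_range", strategy=boundary_only, min_value=0, max_value=10)
```

Both checks receive the same statistics; only the function that turns them into data differs, which is the behavior the strategy kwarg is meant to expose.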

This is pretty down in the weeds of the pandera codebase, let me know if you'd like to tackle it!

RNKuhns · 2022-05-19T14:59:11Z

@cosmicBboy I would definitely like to help out.

Do you have some time to have a quick meeting? I can explain some of what I'm hoping to accomplish and you can give me some quick pointers on the code base.

cosmicBboy · 2022-05-24T15:24:11Z

hi @RNKuhns feel free to find a time here! https://calendly.com/niels-bantilan/30min

cosmicBboy · 2022-06-14T14:49:41Z

high-level approach:

- Generate "multi-column" data strategy
- Generate single column data strategy
- Somehow merge the two dataframes

tactically:

segment the dataframe schemas into two separate schemas using schema transformations:
- one segment is columns with isolated checks: the .strategy method should work fine here
- one segment is columns with multi-dependency checks: for this, BYO strategy
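The segment-and-merge idea could look roughly like the sketch below, using the one-hot example from the top of this issue. It is an illustration under assumptions, not the planned implementation: hypothesis.extra.pandas.data_frames stands in for the built-in strategy of the isolated segment, and dependent_segment, isolated_segment, and merged_frames are hypothetical names.

```python
import pandas as pd
from hypothesis import given, settings, strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

HOT_COLS = ("x", "y")


@st.composite
def dependent_segment(draw, n_rows):
    """BYO strategy for the columns tied together by the one-hot check."""
    hot = draw(st.lists(st.sampled_from(HOT_COLS), min_size=n_rows, max_size=n_rows))
    return pd.DataFrame({c: [h == c for h in hot] for c in HOT_COLS}, dtype="bool")


def isolated_segment(n_rows):
    """Stand-in for the built-in strategy of columns with isolated checks."""
    return data_frames(
        columns=[column("id", dtype="int64", unique=True)],
        index=range_indexes(min_size=n_rows, max_size=n_rows),
    )


@st.composite
def merged_frames(draw, max_rows=10):
    # Draw the row count once so both segments have the same length.
    n_rows = draw(st.integers(min_value=1, max_value=max_rows))
    left = draw(isolated_segment(n_rows))
    right = draw(dependent_segment(n_rows))
    # Both segments share a 0..n-1 RangeIndex, so a column-wise concat lines up.
    return pd.concat([left, right], axis=1)


@settings(max_examples=25, deadline=None, database=None)
@given(merged_frames())
def test_merged_frames(df):
    assert list(df.columns) == ["id", "x", "y"]
    assert df["id"].is_unique
    assert (df[["x", "y"]].sum(axis=1) == 1).all()
```

Sharing the row count between the two segments is what makes the "somehow merge" step trivial here; segments generated with independent sizes would need truncation or resampling first.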

contribution as per:

add a strategy kwarg to all the schema and schema component classes, which serves as an override of the default strategy method. #561 (e.g. https://github.com/pandera-dev/pandera/blob/master/pandera/schemas.py#L887-L904)

- update DataFrameSchema.__init__ to take a strategy kwarg, which serves as the override strategy
- if self.strategy is not None use the user-defined strategy in the strategy method
- explore idea: override just a subset of columns that uses schema transformation to only override certain columns

for example, the make_strategy function might look something like:

def make_strategy(schema, **kwargs):  # other args TBD
    # return a hypothesis strategy built from the schema metadata
    ...

rbudnar · 2022-09-21T15:29:46Z

If I'm understanding correctly, I am trying to accomplish essentially what the OP is. I want to make sure that I'm in the right issue and that this is still not currently possible.

My example - I am applying the following dataframe check to schemas:

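from typing import List

import numpy as np
import pandas as pd
import pandera as pa
import pandera.extensions as extensions
from pandera.typing import Series


@extensions.register_check_method(statistics=["columns"], strategy=??)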
def no_duplicates_in_each_column(df: pd.DataFrame, columns: List[str]) -> bool:
    """
    Checks for duplicates in each individual dataframe column provided.
    Note that this is not the same as checking across all columns with df.duplicated(subset=columns).
    """
    dups = np.repeat([False], len(df))
    for col in columns:
        col_dups = df[col].duplicated()
        if col_dups.any():
            print(f"Duplicates found in column: {col}")
        dups |= col_dups

    return len(df[dups]) == 0

class MySchema(pa.SchemaModel):
    id: Series[int]

    class Config:
        no_duplicates_in_each_column = ["id"]

How would I go about defining a strategy so that we can correctly generate data for testing purposes? Or is this not possible at this time? I'm fairly new to hypothesis and strategy generation so I apologize if I'm missing something.
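Outside of pandera's strategy hook, data that satisfies a per-column uniqueness constraint can be generated directly with hypothesis. The following is a minimal sketch under assumptions: integer columns only, and unique_column_frames is a hypothetical helper rather than anything pandera provides.

```python
import pandas as pd
from hypothesis import given, settings, strategies as st


@st.composite
def unique_column_frames(draw, columns=("id",), max_rows=10):
    """Draw integer frames in which every listed column has no duplicates."""
    n_rows = draw(st.integers(min_value=1, max_value=max_rows))
    data = {}
    for col in columns:
        # unique=True applies within one column; values may repeat across columns.
        data[col] = draw(
            st.lists(
                st.integers(-10**6, 10**6),
                min_size=n_rows,
                max_size=n_rows,
                unique=True,
            )
        )
    return pd.DataFrame(data, dtype="int64")


@settings(max_examples=25, deadline=None, database=None)
@given(unique_column_frames(columns=("id", "code")))
def test_unique_column_frames(df):
    for col in ("id", "code"):
        assert not df[col].duplicated().any()
```

This matches the check's semantics (uniqueness within each column), not uniqueness of whole rows as with df.duplicated(subset=columns).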

suned added the question Further information is requested label Jul 2, 2021
