How to define a strategy function for a custom wide check #538

Open
suned opened this issue Jul 2, 2021 · 7 comments
Labels
question Further information is requested

Comments

@suned

suned commented Jul 2, 2021

I'm attempting to implement a dataframe level custom check for one-hot encoding for a subset of the columns in a dataframe. I'm interested in using the resulting schema to create a hypothesis strategy, so to that end I'm writing a strategy function as described here. A minimal example:

import pandas as pd
import pandera as pa
import pandera.extensions as extensions  # needed for register_check_method


def is_one_hot_strategy(dtype, strategy, *, exclude_cols):
    ...   # How can I implement this?


@extensions.register_check_method(statistics=['exclude_cols'], strategy=is_one_hot_strategy)
def is_one_hot(df: pd.DataFrame, *, exclude_cols) -> pd.Series:
    return df.drop(exclude_cols, axis=1).sum(axis=1) == 1


schema = pa.DataFrameSchema(
    columns={
        'id': pa.Column(pa.Int, allow_duplicates=False),
        'x': pa.Column(pa.Bool),
        'y': pa.Column(pa.Bool)
    },
    checks=[pa.Check.is_one_hot(exclude_cols=['id'])]
)

Since is_one_hot_strategy is called once per column in the schema, it's quite hard to implement a strategy for a wide check: by definition, a wide check is a constraint between columns, not on a single column.

My use case is further complicated by the fact that the is_one_hot check only operates on a subset of the columns, and it's tricky to tell from the dtype alone which columns should be constrained in is_one_hot_strategy.

My questions:

  • Is there a suggested way to implement strategies for custom wide checks?
  • Is there a suggested way to implement strategies for dataframe level checks that operate on a subset of columns?
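
To make the goal concrete, here is a minimal sketch of the kind of data I want, built with plain hypothesis and bypassing pandera's strategy machinery entirely. The helper names (frame_from_hot_indices, one_hot_frames, check_one_hot) are hypothetical and not part of pandera. The idea is to draw, for each row, the index of the single column that is True, instead of drawing each column independently.

```python
import pandas as pd
from hypothesis import given, settings
from hypothesis import strategies as st


def frame_from_hot_indices(hot_indices, one_hot_cols):
    """Build a frame with a unique `id` and exactly one True per row."""
    data = {"id": list(range(len(hot_indices)))}
    for j, col in enumerate(one_hot_cols):
        # Column j is True only in rows whose drawn index equals j.
        data[col] = [i == j for i in hot_indices]
    return pd.DataFrame(data)


def one_hot_frames(one_hot_cols, max_rows=10):
    """Hypothetical helper (not pandera API): strategy of one-hot frames."""
    hot_indices = st.lists(
        st.integers(min_value=0, max_value=len(one_hot_cols) - 1),
        min_size=1,
        max_size=max_rows,
    )
    return hot_indices.map(lambda idx: frame_from_hot_indices(idx, one_hot_cols))


@settings(max_examples=25, deadline=None)
@given(one_hot_frames(["x", "y"]))
def check_one_hot(df):
    # Same predicate as the is_one_hot check with exclude_cols=["id"].
    assert (df.drop(["id"], axis=1).sum(axis=1) == 1).all()
    assert df["id"].is_unique
```

Calling check_one_hot() runs the property over generated frames. What I can't see is how to express the same thing through the per-column strategy function that register_check_method expects.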

suned added the question (Further information is requested) label Jul 2, 2021

@cosmicBboy
Collaborator

hi @suned sorry for the long reply-time! currently this isn't quite possible, but we're working on adding more flexibility to the data synthesis strategy functionality:

Will keep you updated on when support for your use case will be added!

@RNKuhns

RNKuhns commented May 17, 2022

I've got a different use case, but similarly want to be able to provide a strategy for data synthesis for multiple column checks.

@cosmicBboy is this currently a work in progress (I know there are a couple related PRs)?

If not, I'm new to pandera's code base, but I am willing to do what I can to help contribute this functionality.

def check_maturity(df):
    # Row-wise comparison (== rather than assignment): maturity minus origination must equal age
    is_okay = df["Maturity_Date"] - df["Orig_Date"] == df["Age"]
    return is_okay
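
For a check like this, one way to synthesize valid data outside pandera is to draw the free columns and derive the dependent one. This is only a sketch under assumptions: I'm treating Orig_Date as a datetime and Age as a timedelta (the real dtypes may differ), and rows_to_frame, maturity_frames and check_maturity_holds are hypothetical names.

```python
import datetime as dt

import pandas as pd
from hypothesis import given, settings
from hypothesis import strategies as st

# Assumption: dates are datetimes and Age is a timedelta.
row = st.tuples(
    st.datetimes(min_value=dt.datetime(2000, 1, 1), max_value=dt.datetime(2030, 1, 1)),
    st.timedeltas(min_value=dt.timedelta(0), max_value=dt.timedelta(days=365 * 30)),
)


def rows_to_frame(rows):
    """Derive Maturity_Date so the cross-column constraint holds by construction."""
    df = pd.DataFrame(rows, columns=["Orig_Date", "Age"])
    df["Maturity_Date"] = df["Orig_Date"] + df["Age"]
    return df


maturity_frames = st.lists(row, min_size=1, max_size=10).map(rows_to_frame)


@settings(max_examples=25, deadline=None)
@given(maturity_frames)
def check_maturity_holds(df):
    # Same predicate as check_maturity.
    assert (df["Maturity_Date"] - df["Orig_Date"] == df["Age"]).all()
```

Deriving the third column avoids rejection sampling, which would almost never hit the equality by chance.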

@cosmicBboy
Collaborator

hi @RNKuhns, can you elaborate on the column checks for which you want to generate data?

It'll help with the design and implementation of the data synthesis strategy customization effort.

No one's currently working on this, so would be happy to work with you to get started.

At a high level, we want to do the following:

  1. add a strategy kwarg to the Check (and Hypothesis) classes, which is a function that adheres to the pandera strategy function signature (see eq_strategy for example). Supplying a strategy here should override the built-in strategy, which is registered at the Check method-level (https://github.com/pandera-dev/pandera/blob/master/pandera/checks.py#L525-L528)
  2. add a strategy kwarg to all the schema and schema component classes, which serves as an override of the default strategy method (see #561, "Pandera strategy re-write: improve base implementation and add API for custom strategies and global schema-level override strategy"; e.g. https://github.com/pandera-dev/pandera/blob/master/pandera/schemas.py#L887-L904)

This is pretty down in the weeds of the pandera codebase, let me know if you'd like to tackle it!

@RNKuhns

RNKuhns commented May 19, 2022

@cosmicBboy I would definitely like to help out.

Do you have some time to have a quick meeting? I can explain some of what I'm hoping to accomplish and you can give me some quick pointers on the code base.

@cosmicBboy
Collaborator

hi @RNKuhns feel free to find a time here! https://calendly.com/niels-bantilan/30min

@cosmicBboy
Collaborator

cosmicBboy commented Jun 14, 2022

high-level approach:

  1. Generate "multi-column" data strategy
  2. Generate single column data strategy
  3. Somehow merge the two dataframes

tactically:

  1. segment the dataframe schemas into two separate schemas using schema transformations:
    • one segment is columns with isolated checks: the .strategy method should work fine here
    • one segment is columns with multi-dependency checks: for this, BYO strategy
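
A rough sketch of the segment-and-merge idea, written with plain hypothesis rather than pandera so it runs today. Everything here is illustrative: isolated_part stands in for what the .strategy method would return for the independent columns, dependent_part is the bring-your-own strategy, and merge_parts is a hypothetical name for the merge step.

```python
import pandas as pd
from hypothesis import given, settings
from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes


def isolated_part(n_rows):
    """Columns with isolated checks (stand-in for the built-in strategy)."""
    return data_frames(
        columns=[column("id", dtype="int64", unique=True)],
        index=range_indexes(min_size=n_rows, max_size=n_rows),
    )


def dependent_part(n_rows):
    """BYO strategy for a multi-column constraint: exactly one of x, y is True."""
    flags = st.lists(st.booleans(), min_size=n_rows, max_size=n_rows)
    return flags.map(lambda xs: pd.DataFrame({"x": xs, "y": [not v for v in xs]}))


def merge_parts(parts):
    # Both parts carry a 0..n-1 range index of the same length,
    # so a column-wise concat lines them up row by row.
    return pd.concat(parts, axis=1)


# Draw the row count once, then generate both segments at that size.
merged_frames = (
    st.integers(min_value=1, max_value=10)
    .flatmap(lambda n: st.tuples(isolated_part(n), dependent_part(n)))
    .map(merge_parts)
)


@settings(max_examples=25, deadline=None)
@given(merged_frames)
def check_merged(df):
    assert list(df.columns) == ["id", "x", "y"]
    assert df["id"].is_unique
    assert (df["x"] != df["y"]).all()
```

The main design point is that the row count has to be shared between the two segments, which is why it is drawn first and passed to both.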

contribution as per:

add a strategy kwarg to all the schema and schema component classes, which serves as an override of the default strategy method. #561 (e.g. https://github.com/pandera-dev/pandera/blob/master/pandera/schemas.py#L887-L904)

  • update DataFrameSchema.__init__ to take a strategy kwarg, which serves as the override strategy
  • if self.strategy is not None use the user-defined strategy in the strategy method
  • explore idea: override just a subset of columns that uses schema transformation to only override certain columns

for example, the make_strategy function might look something like:

def make_strategy(schema, *args, **kwargs):  # other args TBD
    # return a hypothesis strategy built from the schema metadata
    raise NotImplementedError
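
To illustrate the override behaviour described in the bullets above, here is a toy sketch. ToySchema is a hypothetical stand-in and NOT pandera's DataFrameSchema; the size keyword and the column-to-type mapping are assumptions made only so the example runs.

```python
import pandas as pd
from hypothesis import given, settings
from hypothesis import strategies as st


class ToySchema:
    """Hypothetical stand-in for a schema class; not pandera's implementation."""

    def __init__(self, columns, strategy=None):
        self.columns = columns      # e.g. {"x": bool, "y": bool}, standing in for schema metadata
        self._override = strategy   # user-supplied make_strategy(schema, size=...)

    def strategy(self, *, size=3):
        # A user-defined strategy takes precedence over the default.
        if self._override is not None:
            return self._override(self, size=size)
        # Default: every column is drawn independently from its type.
        cols = {
            name: st.lists(st.from_type(tp), min_size=size, max_size=size)
            for name, tp in self.columns.items()
        }
        return st.fixed_dictionaries(cols).map(pd.DataFrame)


def make_strategy(schema, *, size):
    """Override that uses schema metadata (column names) to build one-hot rows."""
    names = list(schema.columns)
    hot = st.lists(
        st.integers(min_value=0, max_value=len(names) - 1),
        min_size=size,
        max_size=size,
    )
    return hot.map(
        lambda idx: pd.DataFrame({n: [i == j for i in idx] for j, n in enumerate(names)})
    )


@settings(max_examples=20, deadline=None)
@given(ToySchema({"x": bool, "y": bool}, strategy=make_strategy).strategy(size=4))
def check_override(df):
    assert df.shape == (4, 2)
    assert (df.sum(axis=1) == 1).all()
```

Without the strategy argument, ToySchema falls back to independent columns, which is what makes multi-column constraints hard to satisfy.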

@rbudnar

rbudnar commented Sep 21, 2022

If I'm understanding correctly, I am trying to accomplish essentially what the OP is. I want to make sure that I'm in the right issue and that this is still not currently possible.

My example - I am applying the following dataframe check to schemas:

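from typing import List

import numpy as np
import pandas as pd
import pandera as pa
import pandera.extensions as extensions
from pandera.typing import Series

@extensions.register_check_method(statistics=["columns"], strategy=??)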
def no_duplicates_in_each_column(df: pd.DataFrame, columns: List[str]) -> bool:
    """
    Checks for duplicates in each individual dataframe column provided.
    Note that this is not the same as checking across all columns with df.duplicated(subset=columns).
    """
    dups = np.repeat([False], len(df))
    for col in columns:
        col_dups = df[col].duplicated()
        if col_dups.any():
            print(f"Duplicates found in column: {col}")
        dups |= col_dups

    return len(df[dups]) == 0

class MySchema(pa.SchemaModel):
    id: Series[int]

    class Config:
        no_duplicates_in_each_column = ["id"]

How would I go about defining a strategy so that we can correctly generate data for testing purposes? Or is this not possible at this time? I'm fairly new to hypothesis and strategy generation so I apologize if I'm missing something.
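
For reference, the data I'm after can be generated with plain hypothesis, outside pandera, roughly like this. It is only a sketch: frames_without_column_duplicates and check_no_duplicates are hypothetical names, and I'm assuming int64 columns.

```python
from hypothesis import given, settings
from hypothesis.extra.pandas import column, data_frames


def frames_without_column_duplicates(columns):
    """Hypothetical helper: each listed int64 column is unique within itself."""
    return data_frames(
        columns=[column(name, dtype="int64", unique=True) for name in columns]
    )


@settings(max_examples=20, deadline=None)
@given(frames_without_column_duplicates(["id", "other_id"]))
def check_no_duplicates(df):
    # Uniqueness is per column, not across the combination of columns.
    for name in df.columns:
        assert not df[name].duplicated().any()
```

What I don't know is how to hook something like this into the strategy argument of register_check_method.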


4 participants