Pandera strategy re-write: improve base implementation and add API for custom strategies and global schema-level override strategy #561
Comments
Following up on the discussion in #648. There are often use cases where it would be useful to override the base strategy for a column. The following hypothesis strategies clearly express the shape of the data, but cannot be easily represented using the pandera check API.

# uuids
st.uuids().map(str)
# dictionaries
st.fixed_dictionaries(
    {
        'symbol': st.text(string.ascii_uppercase),
        'cusip': st.text(string.ascii_uppercase + string.digits),
    },
)

A workaround described in #648 uses custom check methods to store a strategy override for later use and accesses it during strategy generation in a subclass of [...].

As suggested by @cosmicBboy in #648, first-class support for this use case could be added by adding a [...]. This would allow for the following schema specification, while still supporting additional checks on the column (unlike the workaround described above).

class Schema(SchemaModel):
    uuids: Series[object] = pa.Field(strategy=st.uuids().map(str))
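For reference, here is a minimal sketch of what a column-level override amounts to in plain Hypothesis, without pandera: the element strategy of a column is swapped for a user-supplied one. `column`, `data_frames` and `range_indexes` come from `hypothesis.extra.pandas`; the `OVERRIDES` mapping and the `frames_with_overrides` and `is_uuid` helpers are hypothetical names used only for illustration and are not part of pandera.

```python
import string
import uuid

from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

# Hypothetical per-column overrides, keyed by column name.
OVERRIDES = {
    "uuids": st.uuids().map(str),
    "symbol": st.text(string.ascii_uppercase, min_size=1, max_size=5),
}


def frames_with_overrides(overrides, max_rows=5):
    """Build a dataframe strategy whose cells come from the given element strategies."""
    cols = [
        column(name, elements=elements, dtype=object)
        for name, elements in overrides.items()
    ]
    return data_frames(columns=cols, index=range_indexes(max_size=max_rows))


def is_uuid(value):
    """Return True if `value` is a string that parses as a UUID."""
    try:
        uuid.UUID(value)
    except (TypeError, ValueError, AttributeError):
        return False
    return True
```

Drawing from `frames_with_overrides(OVERRIDES)` gives frames whose `uuids` column only holds UUID strings, which is the behaviour the proposed field-level override is meant to provide.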
Following from #1088. Perhaps not exactly what you had in mind, but here is a rather simple brute-force approach: create a strategy with hypothesis that generates the whole dataframe, and feed it into the schema as the one to use to generate examples. What do you think of this, @cosmicBboy? It could be something relatively simple to implement (if it fits your design choices). If so, I volunteer to create a PR for this.
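A rough sketch of that brute-force idea, under stated assumptions: `price_frames` is a hand-written whole-dataframe strategy, `validate_prices` stands in for schema validation, and `example_with_override` is a hypothetical hook showing where such a strategy could be plugged in. None of these names exist in pandera; only `st.composite`, `st.lists`, `st.integers` and `.example()` are real Hypothesis APIs.

```python
import warnings

import pandas as pd
from hypothesis import strategies as st


@st.composite
def price_frames(draw, max_rows=5):
    """Whole-dataframe strategy with a cross-column invariant: low <= high."""
    n_rows = draw(st.integers(min_value=0, max_value=max_rows))
    lows = draw(st.lists(st.integers(0, 100), min_size=n_rows, max_size=n_rows))
    spreads = draw(st.lists(st.integers(0, 50), min_size=n_rows, max_size=n_rows))
    highs = [low + spread for low, spread in zip(lows, spreads)]
    return pd.DataFrame({"low": lows, "high": highs}, dtype="int64")


def validate_prices(frame):
    """Stand-in for schema validation; raises if the invariant is broken."""
    if not (frame["low"] <= frame["high"]).all():
        raise ValueError("low must not exceed high")
    return frame


def example_with_override(strategy, validate):
    """Hypothetical hook: draw one frame from the override, then validate it."""
    with warnings.catch_warnings():
        # .example() warns when called outside an interactive session.
        warnings.simplefilter("ignore")
        frame = strategy.example()
    return validate(frame)
```

Calling `example_with_override(price_frames(), validate_prices)` returns one frame that already satisfies the invariant by construction, so validation acts as a safety net rather than a filter.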
@NowanIlfideme I re-wrote this issue to encapsulate a broader re-write of the pandera pandas strategy module. Please chime in here with your thoughts on how this might work!
Hi, this turned into quite a big comment, so I added sections. I should also note that I am quite new to Hypothesis specifically, though not to data generation in general.

I see several cases that are very relevant to my day-to-day work that would be great to support in Pandera; they would let me do API contract testing with "I need this data schema as input" and generate more complex data from that schema.

Columns with dependencies within the column

The example in #1605 was for generating time series data. Here I would want to create timestamps with a particular frequency, such as [...]

Another example would be generating monotonically increasing, but not necessarily contiguous, IDs. You need to know the length of the series, and you can generate the series with [...]

In terms of user API, the [...]

from datetime import date
from typing import Optional
import pandas as pd
import pandera as pa
from hypothesis import strategies as st

def _make_freq_series(start_date: date, periods: int, freq: str) -> pd.Series:
    # pd.date_range takes `start`, not `start_date`
    return pd.Series(pd.date_range(start=start_date, freq=freq, periods=periods))

def freq_strategy(
    pandera_dtype: pa.DataType,
    strategy: Optional[st.SearchStrategy] = None,  # would you even support a base strategy?
    *,
    size: int,  # keyword-only: a required parameter cannot follow a defaulted one
    freq: str,
) -> st.SearchStrategy:  # creates series/arrays/lists of size SIZE instead of a single element
    # upper bound kept inside pandas' default nanosecond Timestamp range, which ends in 2262
    date_gen = st.dates(min_value=date(1800, 1, 1), max_value=date(2200, 1, 1))
    # st.builds draws its arguments from strategies, so constants are wrapped in st.just
    return st.builds(_make_freq_series, start_date=date_gen, periods=st.just(size), freq=st.just(freq))

One major potential issue with this is that only Pandas has a real defined ordering of elements. Other dataframes can be generated in Pandas and then converted, but that isn't useful for things like performance testing. (Though, I guess that use case is limited enough that custom generators could be made...)

Columns that depend on other columns

A totally different issue is generating columns that depend on the values of other columns. That would be valuable for all sorts of things. For example, hierarchical relationships can be generated like this (if the value is "A", you can create "A1" - "A9"; for "B" you create "B1" - "B5", etc.).

This would be difficult to implement as a column-based strategy, since (as far as I understand) Pandera doesn't support cross-column checks except as entire dataframes. So, one way to "fix" this would be to use a custom entire-dataframe strategy; however, that means you lose out on generating the other columns using Pandera.

From the user API side, you could consider:

def cond_strategy(
    pandera_dtype: pa.DataType,
    base_df: pd.DataFrame,  # moved ahead of `strategy`: a required parameter cannot follow a defaulted one
    strategy: Optional[st.SearchStrategy] = None,  # would you even support a base strategy?
) -> st.SearchStrategy:  # creates series/arrays/lists of size LEN(base_df) instead of a single element
    def inner(func: str) -> pd.Series:  # not quite sure how to make this
        return base_df.apply(func, axis='columns')
    return st.builds(inner, func=st.sampled_from(['sum', 'mean', 'median']))

It's not entirely clear to me how to actually use Hypothesis to generate different elements for every column, though.

Generating from existing dataframes (e.g. for grouped dataframes)

Another use case that is very common for me is generating dataframes that are grouped somehow. For example, I have a composite primary key that consists of IDs and timestamps, and I want an outer join of these dataframes. Here, I guess the best case would be to just generate the individual dataframes and merge them. However, what if I want other columns in the dataframe to be filled as well? Here, I think the example generation API could work to "complete" the example. Naming is tough, but:

value_schema = ...  # the 'values' part of your schema
df_ids = id_schema.example()
df_timestamps = ts_schema.example()
df_index = pd.merge(df_ids, df_timestamps, how='cross')
full_schema = value_schema.add_columns(id_schema.columns).add_columns(ts_schema.columns)
df_all = full_schema.example(base_df=df_index)  # base_df is the proposed new argument

I hope some of the above makes sense - even after going through it again it seems a bit ramble-y.
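On the open question of how to make Hypothesis generate values that depend on other values, one possible approach is `st.composite`, sketched here outside of pandera. The sketch covers two of the cases above: strictly increasing but non-contiguous IDs (a dependency within a column) and child labels that depend on the parent value (a dependency across columns). `CHILD_COUNTS` and `hierarchy_frames` are hypothetical names chosen for this illustration.

```python
from itertools import accumulate

import pandas as pd
from hypothesis import strategies as st

# Number of children per parent value: "A" -> A1..A9, "B" -> B1..B5.
CHILD_COUNTS = {"A": 9, "B": 5}


@st.composite
def hierarchy_frames(draw, max_rows=6):
    """Generate a frame with increasing IDs and parent-dependent child labels."""
    n_rows = draw(st.integers(min_value=0, max_value=max_rows))
    # Within-column dependency: strictly increasing IDs built from positive gaps.
    gaps = draw(st.lists(st.integers(1, 10), min_size=n_rows, max_size=n_rows))
    ids = list(accumulate(gaps))
    # Cross-column dependency: draw the parent first, then a child valid for it.
    parents = draw(
        st.lists(st.sampled_from(sorted(CHILD_COUNTS)), min_size=n_rows, max_size=n_rows)
    )
    children = [
        f"{parent}{draw(st.integers(1, CHILD_COUNTS[parent]))}" for parent in parents
    ]
    return pd.DataFrame({"id": ids, "parent": parents, "child": children})
```

Because every value is constructed from earlier draws, no generated frame has to be rejected, and Hypothesis can still shrink the individual gaps and child numbers.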
Is your feature request related to a problem? Please describe.
Currently, strategies are limited by the hypothesis.extra.pandas convention of how to define a dataframe. Namely, the strategies used to generate data values operate at the element level. This makes it hard to create strategies for a whole column or strategies that model the dependencies between columns.

For previous context on the problem with strategies, see #1605, #1220, #1275.
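To make the element-level limitation concrete, here is a small sketch using `hypothesis.extra.pandas` directly (not pandera's internals). Each cell of column `x` is drawn independently, so a column-level property such as "values are sorted" cannot be stated in the column definition and has to be added afterwards; the variable names are illustrative assumptions.

```python
from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

# Element-level definition: every cell of "x" is drawn independently.
element_level = data_frames(
    columns=[column("x", elements=st.integers(0, 100), dtype="int64")],
    index=range_indexes(max_size=5),
)

# A column-level property (sorted values) is bolted on after generation
# by rewriting the whole frame with .map.
sorted_column = element_level.map(
    lambda df: df.sort_values("x", ignore_index=True)
)
```

The post-hoc `.map` works for this simple case, but it illustrates why properties that span a whole column, or several columns, do not fit the element-level model.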
Describe the solution you'd like
We need a re-write! 🔥
As described in #1605, the requirements for a pandera pandas strategy rewrite are:
More context on the current state
At a high level, this is how pandera currently translates a schema to a hypothesis strategy:
- [...] column. This contains the datatypes, elements, and other properties of the column.
- [...] pa.Column dtype, properties (e.g. unique), and first check in the list of check, forward them to the hypothesis column. This creates an element strategy for a single value in that column.
- [...] Check in the list, get their check stats (constraint values) and chain them to the element strategy with filter (this really sucks, i.e. it slows down performance).
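The cost of the last step can be sketched in plain Hypothesis. This is an illustration of the general pattern only, not pandera's actual code: the first constraint is encoded in the base strategy, later constraints are chained with `.filter`, and every rejected draw is wasted work. The constraints and variable names below are assumptions made for the example.

```python
from hypothesis import strategies as st

# Filter chaining: the base strategy encodes only the first constraint,
# and each later constraint rejects values after they have been drawn.
filtered = (
    st.integers(min_value=0, max_value=100)
    .filter(lambda x: x <= 20)
    .filter(lambda x: x % 2 == 0)
)

# Constructive alternative: encode all constraints in the strategy itself,
# so every draw is valid and nothing is rejected.
constructed = st.integers(min_value=0, max_value=10).map(lambda x: 2 * x)
```

Both strategies describe the even integers from 0 to 20, but only the second one avoids rejection sampling, which is the direction a rewrite would presumably want to move in.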