Add computation and retrieval of batch feature statistics #612

zhilingc · 2020-04-13T07:47:03Z

What this PR does / why we need it:
This PR adds support for retrieval of batch statistics over data ingested into a feast warehouse store, as proposed in M2 of the Feature Validation RFC.

Note that it deviates from the RFC in the following ways:

Statistics are computed using SQL. This is because TFDV is unfortunately, only available in python, and Multi-SDK connectors for Beam is still a work in progress. Computing the statistics using SQL will be the compromise until either TFDV is available in Java, or cross-language execution is supported.
Statistics can only be computed over a single feature-set at a time. This is mostly to reduce complexity in implementation. Since datasets are unable to span multiple feature sets, it makes sense to have this restriction in place.

This is a bit of a chonky PR, and the code itself requires a bit of cleaning up, hence the WIP status, but refer to the attached notebook for how this implementation looks like for a user of Feast.

Does this PR introduce a user-facing change?:

- Adds GetFeatureStatistics to the CoreService
- Adds get_statistics method to the client in the python SDK

feast-ci-bot · 2020-04-13T07:47:07Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zhilingc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [zhilingc]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

core/src/main/java/feast/core/model/FieldId.java

woop · 2020-04-13T08:14:39Z

Will try to review it tomorrow morning @zhilingc

core/src/main/java/feast/core/grpc/CoreServiceImpl.java

core/src/main/java/feast/core/model/Entity.java

sdk/python/feast/client.py

woop · 2020-04-16T07:40:24Z

sdk/python/feast/client.py

@@ -908,6 +987,20 @@ def _build_feature_references(
    return features


+def _generate_dataset_id(feature_set: FeatureSet) -> str:


The term dataset still seems a bit off to me.

The id that we are returning is a job_id or an ingestion_id since it is unique to every invocation of ingest(). If we refer to it as dataset_id then I think users would expect it to be associated with their dataset, which means they provide the value or at the very least the value doesnt change in subsequent runs.

I was wary of letting users set this value since there is a chance they would assign the same value twice. And running the ingestion twice SHOULD have distinct ids because users probably don't want to treat the data as belonging to the same set... I can add the option to override the dataset id, but it really opens up the possibility of users breaking the system.

I don't think it should be a job_id, which is already an extremely overloaded term, and retrieval of data by ingestion_id doesn't seem as intuitive as something like a dataset_id

The reason I brought this up is because the current implementation does not fit the typical SOP for our users where they build idempotent data systems. They expect to be able to create a dataset and ingest it based on a name, and have the data system upsert. So basically they just run ingestion until it succeeds and then they are done. They don't care, and often don't know, that we are running Kafka.

And our users definitely don't want to maintain a list of UUIDs. They probably just want to query their datasets by the names they are using like "2020417-mydataset`

Regardless, I think your current approach is fine for now because nothing prevents us from eventually allowing them to set this Id if we allow idempotency. I would probably reuse this same dataset_id though.

ok. I've added the option for people to supply their own dataset id, and fixed the sql so that it only retrieves the latest value for each dataset/entity/event timestamp. This should ensure idempotency.

...rc/main/java/feast/storage/connectors/bigquery/statistics/FeatureSetStatisticsQueryInfo.java

...ery/src/main/java/feast/storage/connectors/bigquery/statistics/FieldStatisticsQueryInfo.java

tests/e2e/feature-validation.py

woop · 2020-04-16T08:55:43Z

I made a first run through.

The only big thing that stands out is the duplication between entity level statistics and feature level statistics. I'll think about how we can simplify this implementation because as written it is a lot of code to merge in and a bit scary.

Other than that I am finding it a bit slow going to review the PR because some of the methods require comments. I'd rather not ding the methods and ask for JavaDocs or comments, so would you mind just giving the PR a scan through and adding them? It would be especially useful for StatsService.java as well as the model fields.

woop · 2020-04-18T07:42:46Z

I'll talk about the data model first because that is the most important part to get right now.

Assumptions

Entities are primarily used as keys
Features are primarily used as values in the context of models. Our users care more about statistics on features than on entities.
Features need references (like feature references).
Entities need references, but these references are different from features.
Entities will also be treated differently in the future (See Entity types as a higher-level concept #405). Constraints on features and constraints on entities are different. Entities will be reused and features must be unique by name.
Nothing prevents a user from ingesting entity data as a feature.

Thoughts

Although there is a use case for having entity level statistics, I don't think it warrants the amount of code we'd need to maintain here.
There are pre-existing problems with the data model that should be solved now before 0.5 is released. Normally that would happen in a separate PR, but we can also reuse this PR if that is easier for you.
One of the problems that I see is the fact that Field and its inheritance adds complexity without any real value.
The second is that there is a coupling between the references of features and entities that I don't think should exist.
Lastly, and this is definitely out of scope for your work, is that the source table is hard-coded to Kafka (thanks for the reminder @mrzzy . Will add a separate issue for that. Edit: Source data model in Feast Core should be generalized #632 )

Recommendation

Remove Field and its inheritance
Keep FieldId but rename it to FeatureReference
Have Feature reference FeatureReference as its primary key
Create an EntityReference (for the time being this can be the same as FeatureReference). This will change as part of Entity types as a higher-level concept #405.
Have Entity reference EntityReference.
Keep FeatureStatistics and have it reference FeatureReference
Remove EntityStatistics completely as well as statistics methods on Entity.

Alternatively we could also add support for entity statistics by making all data in a row a "feature, which would then map a bit better the Tensorflow schema. Essentially Field is replaced by Feature. This might be too much work for the 0.5 release.

cc @ches @Yanson

zhilingc · 2020-04-20T08:10:31Z

@woop

Features are primarily used as values in the context of models. Our users care more about statistics on features than on entities.
Although there is a use case for having entity level statistics, I don't think it warrants the amount of code we'd need to maintain here.

I think that’s a very strong assumption to make.

I’ve been back and forth about entity statistics the past few weeks; my first iteration only involved feature statistics, but i ended up adding support for entity statistics later because i think it is something that is truly valuable to users, for checking id distributions and bounds, etc. I don’t think it’s

To your point:

Nothing prevents a user from ingesting entity data as a feature.

This doesn’t make sense when you consider the user experience. Do they double-up on entity columns so that they can be ingested as features JUST to retrieve statistics? Do they ingest it as a separate dataset? Why do users have to perform hacks to transform a logical unit (the dataset) into something else just to do something that is expectedly a basic requirement?

One of the problems that I see is the fact that Field and its inheritance adds complexity without any real value.

I’m not sure what sort of complexity you mean here. I think if we do have statistics for entities it’s a lot easier to maintain compatibility with TFDV fields with inheritance, which is the main reason why the class was introduced in the first place. Additional fields that differ between features and entities can be defined in their concrete classes.

The second is that there is a coupling between the references of features and entities that I don't think should exist.

Do you mean the fact that they use the same object? I guess it’s fine to separate them for the sake of forward compatibility…

Have Feature reference FeatureReference as its primary key

Do you mean have FeatureReference as a distinct entity and have feature sets, features and statistics reference that entity? I’m not sure if that’s the best idea.

Keep FeatureStatistics and have it reference FeatureReference

FeatureReference is part of the definition of Feature and FeatureStatistics cannot form a relation directly to it. To be exact, with the current implementation, the table definition for Features looks like:

-- Table Definition ----------------------------------------------

CREATE TABLE feature (
    feature_set character varying(255),
    name character varying(255),
    project character varying(255),
    version integer,
    bool_domain bytea,
    domain character varying(255),
    float_domain bytea,
    group_presence bytea,
    image_domain bytea,
    int_domain bytea,
    mid_domain bytea,
    natural_language_domain bytea,
    presence bytea,
    shape bytea,
    string_domain bytea,
    struct_domain bytea,
    time_domain bytea,
    time_of_day_domain bytea,
    type character varying(255),
    url_domain bytea,
    value_count bytea,
    CONSTRAINT feature_pkey PRIMARY KEY (feature_set, name, project, version)
);

And FeatureStatistics has the following constraint:

    CONSTRAINT fksrjnxx9em2xg2cqgd6ifm4lll FOREIGN KEY (feature_set, name, project, version) REFERENCES feature(feature_set, name, project, version)

zhilingc · 2020-04-20T11:06:42Z

Ok, I polled a few data scientists and they acknowledged that there is no need to have entity statistics except for occasional once-off checks, which can be done directly to the db anyway. I'll remove the entity statistics - but does that mean i remove the entity field schemas introduced in #438 as well?

zhilingc · 2020-04-25T11:07:34Z

/test test-end-to-end-batch

zhilingc · 2020-04-26T02:33:58Z

/test test-end-to-end-batch

zhilingc · 2020-04-26T02:52:51Z

/test test-end-to-end-batch

zhilingc · 2020-04-26T04:25:13Z

/test test-end-to-end-batch

zhilingc · 2020-04-26T04:25:22Z

/test test-end-to-end

zhilingc · 2020-04-26T05:16:03Z

/test test-end-to-end

zhilingc · 2020-04-26T07:43:40Z

/test test-end-to-end-batch

zhilingc · 2020-04-26T09:07:05Z

/test test-end-to-end-batch

zhilingc · 2020-04-26T14:48:06Z

/test test-end-to-end-batch

feast-ci-bot · 2020-04-26T15:02:09Z

@zhilingc: The following tests failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
test-end-to-end	af37bab50b6b0a1297037d1d3d15a071df757023	link	`/test test-end-to-end`
test-end-to-end-batch	`baf4b88`	link	`/test test-end-to-end-batch`

Full PR test history

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

ches · 2020-04-26T15:06:18Z

Would it be feasible to break this pull request up into a few somehow? At a first glance there's a lot that I think is generally beneficial independent of the feature addition of batch statistics, like entity/domain model additions and updates like Entity, feature and entity references, etc.

I have little doubt that the batch statistics feature addition in itself will necessarily be a large PR no matter how we slice it, but I see the 5,000 line diffstat here and my spirits get deflated that I will need to reserve the majority of a weekend to it sometime in the next few weeks, by which point it will be merged (which is totally fine, just that it feels like a drag to bring people back to it later if there are questions & comments).

ches · 2020-04-26T15:26:58Z

Some products alternate technical and feature releases, starting to feel Feast 0.5 could be split on its two roadmap categories that way… speaking from the perspective of someone who will upgrade a deployment at some point (i.e. merging down a branch, and rolling it out to operation), the scope is getting pretty huge (and that's not yet including this, #502, #554, #533…). It's great work, it's just creating some anxiety for me about the effort and risks of upgrading.

Anyway, I'm off topic for this PR, I'll take this elsewhere…

zhilingc · 2020-04-26T15:59:59Z

@ches That's fair. I'll do what I can.

woop · 2020-04-27T06:12:56Z

Some products alternate technical and feature releases, starting to feel Feast 0.5 could be split on its two roadmap categories that way

Yea the scope of 0.5 has increased a lot, which is partly good and bad. By splitting do you mean based on the existing master branch versus the additional functionality we are introducing with open PRs, or would you move out some functionality also merged into master?

My intuition is to put our efforts towards making 0.5 as stable and well tested as possible, so we would probably deploy it side by side and migrate folks over slowly. I would want to include all breaking changes for that matter. I think its possible to have 0.5 and 0.6 be technical and feature releases, but we probably will roll up to 0.6 quite quickly, which means we won't have much time to maintain the 0.5 branch.

If you feel strongly about splitting this then let me know. Otherwise we will try to merge these into 0.5 and take extra time in manual functional testing prior to rolling it out.

zhilingc added do-not-merge/work-in-progress area/core labels Apr 13, 2020

zhilingc requested review from davidheryanto, khorshuheng, pradithya and woop as code owners April 13, 2020 07:47

feast-ci-bot removed the do-not-merge/work-in-progress label Apr 13, 2020

feast-ci-bot added approved size/XXL labels Apr 13, 2020

zhilingc changed the title ~~Batch query get stats api~~ Add computation and retrieval of batch feature statistics Apr 13, 2020

woop reviewed Apr 13, 2020

View reviewed changes

core/src/main/java/feast/core/model/FieldId.java Outdated Show resolved Hide resolved

zhilingc added the do-not-merge/work-in-progress label Apr 13, 2020

zhilingc force-pushed the batch-query-get-stats-api branch from 1ef6767 to d714016 Compare April 15, 2020 02:04