Optimize historical retrieval with BigQuery offline store #1602

MattDelac · 2021-05-31T16:57:04Z

What this PR does / why we need it:
The current SQL template to perform historical retrieval of BigQuery datasets could be more efficient.

Which issue(s) this PR fixes:
As a former data scientist, I often realized that big data engines (Spark, BigQuery, Presto, etc.) are often much more efficient with a series of GROUP BY rather than a Window function. Moreover, we should avoid ORDER BY operations as much as possible.

So this PR comes with a new approach to compute the point in time join. This is solely compose of JOINs and GROUP BYs

Here are the results of my benchmark:
Context
The api call is the following

result = store.get_historical_features(
    entity_df="""
    SELECT user_id, TIMESTAMP '2021-01-01' AS observed_at
    FROM `my_dataset`
    LIMIT 100000000 -- 100M
    """,
    feature_refs=[
        "feature_view_A:feature_1",
        "feature_view_A:feature_2",
        "feature_view_B:feature_3",
        "feature_view_B:feature_4",
        "feature_view_B:feature_5",
        "feature_view_C:feature_6",
        "feature_view_D:feature_7",
    ]
)

And some idea of the scale of this historical retrieval

feature_view_A contains ~5B rows and ~3.6B unique "user_id"
feature_view_B contains ~5B rows and ~3.6B unique "user_id"
feature_view_C contains ~1.7B rows and ~1.1B unique "user_id"
feature_view_D contains ~42B rows and ~3.5B unique "user_id"

Results
On the original SQL template

Elapsed time: 5 min 59 sec
Slot time consumed: 5 days 5 hr
Bytes shuffled: 5.89 TB
Bytes spilled to disk: 0 B

With the SQL template of this PR

Elapsed time: 3 min 21 sec (-44%)
Slot time consumed: 2 days 12 hr (-52%)
Bytes shuffled: 3.55 TB (-40%)
Bytes spilled to disk: 0 B

So as we can see, the proposed SQL template consume half the resources that the one currently implemented when we asked for 100M "user_id".

Also, because this new SQL template is only composed of JOINs and GROUP BY, it should scale "indefinitely" except if the data is highly skewed (eg: a single "user_id" represents 20% of the dataset).
Does this PR introduce a user-facing change?:

Optimized how we perform historical retrieval with BigQuery offline store

feast-ci-bot · 2021-05-31T16:57:13Z

Hi @MattDelac. Thanks for your PR.

I'm waiting for a feast-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sdk/python/feast/infra/offline_stores/bigquery.py

woop · 2021-05-31T18:05:45Z

This is great, thanks @MattDelac

Results
On the original SQL template

Elapsed time: 5 min 59 sec

Slot time consumed: 5 days 5 hr

Bytes shuffled: 5.89 TB

Bytes spilled to disk: 0 B

With the SQL template of this PR

Elapsed time: 3 min 21 sec (-44%)

Slot time consumed: 2 days 12 hr (-52%)

Bytes shuffled: 3.55 TB (-40%)

Bytes spilled to disk: 0 B

These results are very promising. One thing that isn't clear from your test is whether all your entities occur on a single timestamp, or if a range of times are being used. The pathological case that I would want to test for is when there is a table with super granular events (like real-time features being updated every second) being joined onto a table with comparatively static data (like daily customer data).

I think it may make sense to

Make the BQ historical retrieval tests more representative (larger volume of data and larger differences in granularity of rows between views)
Print timing as part of our existing BQ historical retrieval tests

At that point it would be trivial to assess whether an optimization in our BQ template has led to improvements. We'd be happy to extend the tests if you don't have bandwidth, but it might take a week or two given our backlog.

MattDelac · 2021-05-31T19:38:40Z

One thing that isn't clear from your test is whether all your entities occur on a single timestamp, or if a range of times are being used

The tests with my data were on a single timestamp. But the integration tests are on a range of time I believe.

MattDelac · 2021-05-31T19:39:40Z

The pathological case that I would want to test for is when there is a table with super granular events (like real-time features being updated every second) being joined onto a table with comparatively static data (like daily customer data).

Yes I don't think why it should not be supported here ... 🤔

Definitely, if you think that it's not covered by the integration tests, we should add those first.

woop · 2021-06-01T05:33:04Z

The pathological case that I would want to test for is when there is a table with super granular events (like real-time features being updated every second) being joined onto a table with comparatively static data (like daily customer data).

Yes I don't think while it should not be supported here ...

Definitely, if you think that it's not covered by the integration tests, we should add those first.

I think they are covered to some degree, but the problem is we don't use a very large amount of data, nor do we have proper benchmarking. So its more a matter of being able to see the performance difference between tests than a functional test.

codecov-commenter · 2021-06-08T22:23:30Z

Codecov Report

Merging #1602 (5430c46) into master (3129dc8) will decrease coverage by 6.37%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #1602      +/-   ##
==========================================
- Coverage   83.94%   77.56%   -6.38%     
==========================================
  Files          67       66       -1     
  Lines        5911     5791     -120     
==========================================
- Hits         4962     4492     -470     
- Misses        949     1299     +350

Flag	Coverage Δ
integrationtests	`?`
unittests	`77.56% <0.00%> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
sdk/python/feast/infra/offline_stores/bigquery.py	`29.47% <ø> (-63.01%)`	⬇️
sdk/python/tests/test_historical_retrieval.py	`56.61% <0.00%> (-42.32%)`	⬇️
sdk/python/tests/test_cli_gcp.py	`40.74% <0.00%> (-59.26%)`	⬇️
sdk/python/tests/utils/data_source_utils.py	`51.85% <0.00%> (-48.15%)`	⬇️
sdk/python/tests/test_feature_store.py	`68.29% <0.00%> (-31.71%)`	⬇️
sdk/python/tests/test_inference.py	`73.91% <0.00%> (-26.09%)`	⬇️
...hon/tests/test_offline_online_store_consistency.py	`74.72% <0.00%> (-25.28%)`	⬇️
sdk/python/feast/registry.py	`61.66% <0.00%> (-18.34%)`	⬇️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3129dc8...5430c46. Read the comment docs.

MattDelac · 2021-06-08T22:26:07Z

/kind housekeeping

woop · 2021-06-08T22:32:09Z

/ok-to-test

Signed-off-by: Matt Delacour <matt.delacour@shopify.com>

MattDelac · 2021-06-09T14:46:15Z

PR ready for a final round of reviews (and for approval 😉 )

feast-ci-bot · 2021-06-09T19:48:15Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MattDelac, woop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [woop]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

woop · 2021-06-09T19:48:19Z

/lgtm

Signed-off-by: Matt Delacour <matt.delacour@shopify.com>

MattDelac requested review from jklegar, tsotnet, woop and a team as code owners May 31, 2021 16:57

feast-ci-bot added release-note needs-kind labels May 31, 2021

feast-ci-bot added needs-ok-to-test size/L labels May 31, 2021

MattDelac commented May 31, 2021

View reviewed changes

sdk/python/feast/infra/offline_stores/bigquery.py Outdated Show resolved Hide resolved

MattDelac force-pushed the sql_template_bq branch 3 times, most recently from dfeb89a to e08ae6b Compare May 31, 2021 17:01

woop changed the title ~~Sql template bq~~ Optimize historical retrieval with BigQuery offline store Jun 3, 2021

MattDelac mentioned this pull request Jun 8, 2021

Make test historical retrieval longer #1630

Merged

MattDelac force-pushed the sql_template_bq branch from e03d519 to fecae33 Compare June 8, 2021 22:20

MattDelac requested a review from achals as a code owner June 8, 2021 22:20

MattDelac force-pushed the sql_template_bq branch from fecae33 to 7951926 Compare June 8, 2021 22:20

feast-ci-bot added kind/housekeeping and removed needs-kind labels Jun 8, 2021

feast-ci-bot added ok-to-test and removed needs-ok-to-test labels Jun 8, 2021

MattDelac force-pushed the sql_template_bq branch from 7951926 to 33dcc63 Compare June 8, 2021 23:01

MattDelac force-pushed the sql_template_bq branch 2 times, most recently from 69d28c2 to 5d47795 Compare June 8, 2021 23:16

New SQL template for BQ historical retrieval

5430c46

Signed-off-by: Matt Delacour <matt.delacour@shopify.com>

MattDelac force-pushed the sql_template_bq branch from 5d47795 to 5430c46 Compare June 8, 2021 23:35

woop approved these changes Jun 9, 2021

View reviewed changes

feast-ci-bot added the approved label Jun 9, 2021

feast-ci-bot assigned woop Jun 9, 2021

feast-ci-bot added the lgtm label Jun 9, 2021

feast-ci-bot merged commit 563d9cf into feast-dev:master Jun 9, 2021

MattDelac deleted the sql_template_bq branch June 9, 2021 20:06

tsotnet pushed a commit that referenced this pull request Jun 17, 2021

New SQL template for BQ historical retrieval (#1602)

8289d1b

Signed-off-by: Matt Delacour <matt.delacour@shopify.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize historical retrieval with BigQuery offline store #1602

Optimize historical retrieval with BigQuery offline store #1602

MattDelac commented May 31, 2021 •

edited

Loading

feast-ci-bot commented May 31, 2021

woop commented May 31, 2021 •

edited

Loading

MattDelac commented May 31, 2021

MattDelac commented May 31, 2021 •

edited

Loading

woop commented Jun 1, 2021 •

edited

Loading

codecov-commenter commented Jun 8, 2021 •

edited

Loading

MattDelac commented Jun 8, 2021

woop commented Jun 8, 2021

MattDelac commented Jun 9, 2021

feast-ci-bot commented Jun 9, 2021

woop commented Jun 9, 2021

Optimize historical retrieval with BigQuery offline store #1602

Optimize historical retrieval with BigQuery offline store #1602

Conversation

MattDelac commented May 31, 2021 • edited Loading

feast-ci-bot commented May 31, 2021

woop commented May 31, 2021 • edited Loading

MattDelac commented May 31, 2021

MattDelac commented May 31, 2021 • edited Loading

woop commented Jun 1, 2021 • edited Loading

codecov-commenter commented Jun 8, 2021 • edited Loading

Codecov Report

MattDelac commented Jun 8, 2021

woop commented Jun 8, 2021

MattDelac commented Jun 9, 2021

feast-ci-bot commented Jun 9, 2021

woop commented Jun 9, 2021

MattDelac commented May 31, 2021 •

edited

Loading

woop commented May 31, 2021 •

edited

Loading

MattDelac commented May 31, 2021 •

edited

Loading

woop commented Jun 1, 2021 •

edited

Loading

codecov-commenter commented Jun 8, 2021 •

edited

Loading