
Optimize historical retrieval with BigQuery offline store #1602

Merged
merged 1 commit into feast-dev:master from MattDelac:sql_template_bq on Jun 9, 2021

Conversation

MattDelac
Collaborator

@MattDelac MattDelac commented May 31, 2021

What this PR does / why we need it:
The current SQL template to perform historical retrieval of BigQuery datasets could be more efficient.

Which issue(s) this PR fixes:
As a former data scientist, I often observed that big data engines (Spark, BigQuery, Presto, etc.) are much more efficient with a series of GROUP BYs than with a window function. Moreover, we should avoid ORDER BY operations as much as possible.

So this PR comes with a new approach to computing the point-in-time join, composed solely of JOINs and GROUP BYs; the idea is sketched below.
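
To illustrate the difference, here is a minimal sketch of the two approaches for a single feature view. The table and column names (entity_df, feature_view_A, event_timestamp) are illustrative, not the exact template in this PR.

-- Window-function approach (roughly the old template's idea): rank feature
-- rows per entity/timestamp pair and keep the most recent one. The ORDER BY
-- inside the window forces a sort.
SELECT user_id, observed_at, feature_1
FROM (
    SELECT
        e.user_id,
        e.observed_at,
        f.feature_1,
        ROW_NUMBER() OVER (
            PARTITION BY e.user_id, e.observed_at
            ORDER BY f.event_timestamp DESC
        ) AS rn
    FROM entity_df e
    JOIN feature_view_A f
        ON f.user_id = e.user_id
        AND f.event_timestamp <= e.observed_at
)
WHERE rn = 1;

-- JOIN + GROUP BY approach (this PR's idea): first aggregate to find the
-- latest event timestamp per entity/timestamp pair, then join back to fetch
-- the feature values. No ORDER BY is needed anywhere.
WITH latest AS (
    SELECT
        e.user_id,
        e.observed_at,
        MAX(f.event_timestamp) AS max_ts
    FROM entity_df e
    JOIN feature_view_A f
        ON f.user_id = e.user_id
        AND f.event_timestamp <= e.observed_at
    GROUP BY e.user_id, e.observed_at
)
SELECT
    l.user_id,
    l.observed_at,
    f.feature_1
FROM latest l
JOIN feature_view_A f
    ON f.user_id = l.user_id
    AND f.event_timestamp = l.max_ts;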

Here are the results of my benchmark:
Context
The API call is the following:

result = store.get_historical_features(
    entity_df="""
    SELECT user_id, TIMESTAMP '2021-01-01' AS observed_at
    FROM `my_dataset`
    LIMIT 100000000 -- 100M
    """,
    feature_refs=[
        "feature_view_A:feature_1",
        "feature_view_A:feature_2",
        "feature_view_B:feature_3",
        "feature_view_B:feature_4",
        "feature_view_B:feature_5",
        "feature_view_C:feature_6",
        "feature_view_D:feature_7",
    ]
)

And to give an idea of the scale of this historical retrieval:

  • feature_view_A contains ~5B rows and ~3.6B unique "user_id"
  • feature_view_B contains ~5B rows and ~3.6B unique "user_id"
  • feature_view_C contains ~1.7B rows and ~1.1B unique "user_id"
  • feature_view_D contains ~42B rows and ~3.5B unique "user_id"

Results
On the original SQL template

  • Elapsed time: 5 min 59 sec
  • Slot time consumed: 5 days 5 hr
  • Bytes shuffled: 5.89 TB
  • Bytes spilled to disk: 0 B

With the SQL template of this PR

  • Elapsed time: 3 min 21 sec (-44%)
  • Slot time consumed: 2 days 12 hr (-52%)
  • Bytes shuffled: 3.55 TB (-40%)
  • Bytes spilled to disk: 0 B

As we can see, the proposed SQL template consumes about half the resources of the current one when we ask for 100M "user_id" rows.

Also, because this new SQL template is composed only of JOINs and GROUP BYs, it should scale "indefinitely" unless the data is highly skewed (e.g., a single "user_id" represents 20% of the dataset).
Does this PR introduce a user-facing change?:

Optimized how we perform historical retrieval with BigQuery offline store

@feast-ci-bot
Collaborator

Hi @MattDelac. Thanks for your PR.

I'm waiting for a feast-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@MattDelac MattDelac force-pushed the sql_template_bq branch 3 times, most recently from dfeb89a to e08ae6b on May 31, 2021 17:01
@woop
Member

woop commented May 31, 2021

This is great, thanks @MattDelac

[Quotes the benchmark results from the PR description above.]

These results are very promising. One thing that isn't clear from your test is whether all your entities occur on a single timestamp, or whether a range of times is used. The pathological case that I would want to test for is when a table with super granular events (like real-time features being updated every second) is joined onto a table with comparatively static data (like daily customer data).

I think it may make sense to

  • Make the BQ historical retrieval tests more representative (larger volume of data and larger differences in granularity of rows between views)
  • Print timing as part of our existing BQ historical retrieval tests

At that point it would be trivial to assess whether an optimization in our BQ template has led to improvements. We'd be happy to extend the tests if you don't have bandwidth, but it might take a week or two given our backlog.
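
For reference, a rough sketch of what printing timing around a BQ retrieval could look like; this assumes the feast Python SDK, and the repo path, entity query, and feature refs are placeholders borrowed from the PR description rather than the actual test fixtures.

import time

from feast import FeatureStore

# Placeholder feature store; the real tests construct their own repo.
store = FeatureStore(repo_path=".")

# Entity query reused from the PR description; `my_dataset` is illustrative.
entity_df_sql = """
SELECT user_id, TIMESTAMP '2021-01-01' AS observed_at
FROM `my_dataset`
LIMIT 100000000 -- 100M
"""

start = time.perf_counter()
job = store.get_historical_features(
    entity_df=entity_df_sql,
    feature_refs=[
        "feature_view_A:feature_1",
        "feature_view_B:feature_3",
    ],
)
df = job.to_df()  # materializes the BigQuery result as a DataFrame
elapsed = time.perf_counter() - start
print(f"BQ historical retrieval took {elapsed:.1f}s for {len(df)} rows")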

@MattDelac
Collaborator Author

One thing that isn't clear from your test is whether all your entities occur on a single timestamp, or if a range of times are being used

The tests with my data used a single timestamp. But the integration tests cover a range of timestamps, I believe.

@MattDelac
Collaborator Author

MattDelac commented May 31, 2021

The pathological case that I would want to test for is when there is a table with super granular events (like real-time features being updated every second) being joined onto a table with comparatively static data (like daily customer data).

Yes, I don't see why it should not be supported here ... 🤔

Definitely, if you think that it's not covered by the integration tests, we should add those first.

@woop
Member

woop commented Jun 1, 2021

The pathological case that I would want to test for is when there is a table with super granular events (like real-time features being updated every second) being joined onto a table with comparatively static data (like daily customer data).

Yes, I don't see why it should not be supported here ...

Definitely, if you think that it's not covered by the integration tests, we should add those first.

I think they are covered to some degree, but the problem is we don't use a very large amount of data, nor do we have proper benchmarking. So it's more a matter of being able to see the performance difference between tests than a functional test.

@woop woop changed the title Sql template bq Optimize historical retrieval with BigQuery offline store Jun 3, 2021
@MattDelac MattDelac requested a review from achals as a code owner June 8, 2021 22:20
@codecov-commenter

codecov-commenter commented Jun 8, 2021

Codecov Report

Merging #1602 (5430c46) into master (3129dc8) will decrease coverage by 6.37%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1602      +/-   ##
==========================================
- Coverage   83.94%   77.56%   -6.38%     
==========================================
  Files          67       66       -1     
  Lines        5911     5791     -120     
==========================================
- Hits         4962     4492     -470     
- Misses        949     1299     +350     
Flag               Coverage Δ
integrationtests   ?
unittests          77.56% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                                           Coverage Δ
sdk/python/feast/infra/offline_stores/bigquery.py        29.47% <ø> (-63.01%) ⬇️
sdk/python/tests/test_historical_retrieval.py            56.61% <0.00%> (-42.32%) ⬇️
sdk/python/tests/test_cli_gcp.py                         40.74% <0.00%> (-59.26%) ⬇️
sdk/python/tests/utils/data_source_utils.py              51.85% <0.00%> (-48.15%) ⬇️
sdk/python/tests/test_feature_store.py                   68.29% <0.00%> (-31.71%) ⬇️
sdk/python/tests/test_inference.py                       73.91% <0.00%> (-26.09%) ⬇️
...hon/tests/test_offline_online_store_consistency.py    74.72% <0.00%> (-25.28%) ⬇️
sdk/python/feast/registry.py                             61.66% <0.00%> (-18.34%) ⬇️
... and 8 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3129dc8...5430c46. Read the comment docs.

@MattDelac
Collaborator Author

/kind housekeeping

@woop
Member

woop commented Jun 8, 2021

/ok-to-test

@MattDelac MattDelac force-pushed the sql_template_bq branch 2 times, most recently from 69d28c2 to 5d47795 on June 8, 2021 23:16
Signed-off-by: Matt Delacour <matt.delacour@shopify.com>
@MattDelac
Collaborator Author

PR ready for a final round of reviews (and for approval 😉 )

@feast-ci-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MattDelac, woop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@woop
Member

woop commented Jun 9, 2021

/lgtm

@feast-ci-bot feast-ci-bot merged commit 563d9cf into feast-dev:master Jun 9, 2021
@MattDelac MattDelac deleted the sql_template_bq branch June 9, 2021 20:06
tsotnet pushed a commit that referenced this pull request Jun 17, 2021
Signed-off-by: Matt Delacour <matt.delacour@shopify.com>