Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix BQ historical retrieval with rows that got backfilled #1744

Merged

Conversation

MattDelac
Copy link
Collaborator

@MattDelac MattDelac commented Jul 29, 2021

What this PR does / why we need it:
There is bug about finding the "latest" values when the rows have been backfilled (when created_timestamp and event_timestamp don't follow the same order).

The problem is that the ANY_VALUE strategy does not work here because it will take the first occurrence it sees.

For example, the following example would fail

| driver_id | value | event_timestamp | created_at |
|-----------|-------|-----------------|------------|
| 1         | 10    | today           | tomorrow   |
| 1         | 20    | tomorrow        | today      |

The JOIN happening later would consist of the entity_dataframe with the __latest CTE where created_at and event_timestamp got mixed

| driver_id | value | event_timestamp | created_at |
|-----------|-------|-----------------|------------|
| 1         | 20    | tomorrow        | tomorrow   |

Because the JOIN operation happens on USING(driver_id, event_timestamp, created_at), it cannot find a match and will return NULL for the desired feature.

The goal of this PR is to make this step work and deterministic by using a Window function. Then we preserve the information between event_timestamp and created_at on a given row

Tophat

# Logs of the example I ran (on billions of records)
Job started at 2021-08-05 11:07:28
Job started at 2021-08-05 11:42:00 # Thus it took 35 minutes

Metrics (compared to current SQL template)
Elapsed time: 24 min 49 sec (vs 26 min 35 sec)
Slot time consumed: 21 days 6 hr (vs 26 days 0 hr)
Bytes shuffled: 23.11 TB (vs 22.86 TB)
Bytes spilled to disk: 7.87 TB (10.09 TB)

I also confirm that it fixes the bug we had on a specific feature view

The new test properly catches the problem encountered with the backfill rows (as the feature of the second driver is Null instead of 40)
image

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:

[Bug] Fixed how we retrieve the latest value of a feature in the BigQuery SQL template

@feast-ci-bot
Copy link
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@codecov-commenter
Copy link

codecov-commenter commented Jul 29, 2021

Codecov Report

Merging #1744 (0fb046e) into master (7977a53) will increase coverage by 0.08%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1744      +/-   ##
==========================================
+ Coverage   84.65%   84.73%   +0.08%     
==========================================
  Files          85       85              
  Lines        6268     6297      +29     
==========================================
+ Hits         5306     5336      +30     
+ Misses        962      961       -1     
Flag Coverage Δ
integrationtests 84.65% <100.00%> (+0.08%) ⬆️
unittests 64.25% <6.89%> (-0.27%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sdk/python/feast/infra/offline_stores/bigquery.py 80.30% <ø> (ø)
...gration/offline_store/test_historical_retrieval.py 99.05% <100.00%> (+0.09%) ⬆️
...python/feast/infra/offline_stores/offline_utils.py 92.22% <0.00%> (+1.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7977a53...0fb046e. Read the comment docs.

@MattDelac MattDelac force-pushed the fix_bq_historical_retrieval_backfill branch 3 times, most recently from a61f9d2 to 2655af7 Compare July 29, 2021 14:31
@MattDelac MattDelac force-pushed the fix_bq_historical_retrieval_backfill branch from 930519b to 7df8899 Compare July 30, 2021 05:50
@feast-ci-bot feast-ci-bot added size/S and removed size/M labels Aug 5, 2021
@MattDelac MattDelac force-pushed the fix_bq_historical_retrieval_backfill branch from 18bbfc7 to 7b9cc71 Compare August 5, 2021 08:57
Signed-off-by: Matt Delacour <matt.delacour@shopify.com>
Signed-off-by: Matt Delacour <matt.delacour@shopify.com>
@feast-ci-bot feast-ci-bot added size/L and removed size/S labels Aug 6, 2021
@feast-ci-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MattDelac, woop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@woop
Copy link
Member

woop commented Aug 6, 2021

/lgtm

@feast-ci-bot feast-ci-bot merged commit 54bbe5f into feast-dev:master Aug 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants