Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use CONCAT() instead of ROW_NUMBER() #1601

Merged
merged 1 commit into from
Jun 1, 2021

Conversation

MattDelac
Copy link
Collaborator

@MattDelac MattDelac commented May 31, 2021

What this PR does / why we need it:
In order to calculate a unique ID for each row of the entity dataframe, we compute the following

SELECT ROW_NUMBER() OVER() AS row_number, edf.* FROM {{ left_table_query_string }} as edf

The problem is that BigQuery will need to send all the data to a single worker in order to properly calculate the row number of each row. For our use case, we end up with a OOM error
image

Which issue(s) this PR fixes:
The solution is to calculate a deterministic funcyion that will act as a unique identifier.
Because the entity_dataframe should contain all entity keys, I use CONCAT() that will compute a deterministic string for a given input. This hash is computed in a distributed fashion as it only needs the datapoints of a given row.

WITH entity_dataframe AS (
    SELECT
        *,
        CONCAT(
            {% for entity_key in unique_entity_keys %}
                CAST({{entity_key}} AS STRING),
            {% endfor %}
            CAST({{entity_df_event_timestamp_col}} AS STRING)
        ) AS entity_row_unique_id
    FROM {{ left_table_query_string }}
),

Alternative
First the result of the CONCAT() was passed to a Hash function (FARM_FINGERPRINT()). The problem with a hash function is that we introduce the possibility of a collision between different keys resulting to the same Hash.

I tried GENERATE_UUID() that is non deterministic and the query got wrong because i suspect that it computed it multiple times (depending on how the SQL query gets optimized & parsed). So we ended up with all features being always Null

TODO: Matt can look at how the query is interpreted and see if GENERATE_UUID() is called multiple times

Does this PR introduce a user-facing change?:

Use CONCAT() instead of ROW_NUMBER() to generate the unique identifiers of `entity_dataframe`

@feast-ci-bot
Copy link
Collaborator

Hi @MattDelac. Thanks for your PR.

I'm waiting for a feast-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@codecov-commenter
Copy link

codecov-commenter commented May 31, 2021

Codecov Report

Merging #1601 (ee51904) into master (3460e58) will decrease coverage by 6.22%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1601      +/-   ##
==========================================
- Coverage   83.61%   77.39%   -6.23%     
==========================================
  Files          65       64       -1     
  Lines        5761     5635     -126     
==========================================
- Hits         4817     4361     -456     
- Misses        944     1274     +330     
Flag Coverage Δ
integrationtests ?
unittests 77.39% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sdk/python/feast/infra/offline_stores/bigquery.py 32.41% <ø> (-58.63%) ⬇️
sdk/python/tests/fixtures/data_source_fixtures.py 38.88% <0.00%> (-61.12%) ⬇️
sdk/python/tests/test_cli_gcp.py 40.74% <0.00%> (-59.26%) ⬇️
sdk/python/tests/test_feature_store.py 60.52% <0.00%> (-39.48%) ⬇️
sdk/python/tests/test_historical_retrieval.py 61.49% <0.00%> (-37.36%) ⬇️
...hon/tests/test_offline_online_store_consistency.py 74.72% <0.00%> (-25.28%) ⬇️
sdk/python/feast/registry.py 59.55% <0.00%> (-19.56%) ⬇️
sdk/python/feast/data_source.py 67.94% <0.00%> (-10.90%) ⬇️
sdk/python/feast/feature_view.py 81.63% <0.00%> (-6.13%) ⬇️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3460e58...ee51904. Read the comment docs.

@MattDelac MattDelac changed the title Remove row number Use FARM_FINGERPRINT() instead of ROW_NUMBER() May 31, 2021
@woop
Copy link
Member

woop commented May 31, 2021

/ok-to-test

@MattDelac
Copy link
Collaborator Author

Base on a conversation with one of my coworker, using a hash function introduce a potential collision due to the randomness of it.

That makes me realize that I don't even need it as a CONCAT(col1, col2, col3) is also deterministic and will make each row unique.

Thus I am making changes in this direction

@MattDelac MattDelac changed the title Use FARM_FINGERPRINT() instead of ROW_NUMBER() Use CONCAT() instead of ROW_NUMBER() May 31, 2021
Signed-off-by: Matt Delacour <matt.delacour@shopify.com>
@woop
Copy link
Member

woop commented Jun 1, 2021

/lgtm

@feast-ci-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MattDelac, woop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@woop
Copy link
Member

woop commented Jun 1, 2021

/kind housekeeping

@feast-ci-bot feast-ci-bot merged commit 63db728 into feast-dev:master Jun 1, 2021
woop pushed a commit that referenced this pull request Jun 7, 2021
Signed-off-by: Matt Delacour <matt.delacour@shopify.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants