Use CONCAT() instead of ROW_NUMBER() #1601
Hi @MattDelac. Thanks for your PR. I'm waiting for a feast-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Codecov Report
@@ Coverage Diff @@
## master #1601 +/- ##
==========================================
- Coverage 83.61% 77.39% -6.23%
==========================================
Files 65 64 -1
Lines 5761 5635 -126
==========================================
- Hits 4817 4361 -456
- Misses 944 1274 +330
/ok-to-test
Based on a conversation with one of my coworkers, using a hash function introduces the possibility of a collision. That made me realize that I don't even need the hash, so I am making changes in this direction.
Signed-off-by: Matt Delacour <matt.delacour@shopify.com>
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: MattDelac, woop
/kind housekeeping
What this PR does / why we need it:
To calculate a unique ID for each row of the entity dataframe, the query currently uses ROW_NUMBER().
The problem is that BigQuery needs to send all the data to a single worker in order to calculate the row number of each row. For our use case, this ends in an out-of-memory (OOM) error.
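The contrast between a global row number and a per-row value can be sketched in plain Python. This is only an illustration, not BigQuery's execution model: the partition layout, the sample rows, and the helper names `global_row_numbers` and `per_row_keys` are all hypothetical.

```python
from itertools import accumulate

# Hypothetical partitions of an entity dataframe, as if spread over three workers.
partitions = [
    [("driver_1", "2021-01-01"), ("driver_2", "2021-01-01")],
    [("driver_3", "2021-01-02")],
    [("driver_1", "2021-01-03"), ("driver_4", "2021-01-03")],
]


def global_row_numbers(parts):
    """Number rows 1..N across all partitions.

    Each partition's starting offset depends on the sizes of every
    earlier partition, so the partitions cannot be numbered independently.
    """
    offsets = [0] + list(accumulate(len(p) for p in parts))[:-1]
    return [
        [offset + i + 1 for i in range(len(p))]
        for offset, p in zip(offsets, parts)
    ]


def per_row_keys(part):
    """Build a key from each row's own values; no other partition is needed."""
    return ["|".join(row) for row in part]


row_numbers = global_row_numbers(partitions)
keys = [per_row_keys(p) for p in partitions]  # each call is independent
```

The numbering step needs information about the whole dataset, while each `per_row_keys` call only touches the rows it is given.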
Which issue(s) this PR fixes:
The solution is to compute a deterministic function that acts as a unique identifier.
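To make "deterministic" concrete, here is a small Python sketch rather than the PR's SQL. The function name `row_key`, the column names, and the `|` separator are assumptions made for illustration only. The same row yields the same key however many times it is evaluated, whereas a random identifier such as a UUID changes on every evaluation.

```python
import uuid


def row_key(row, key_columns, sep="|"):
    """Concatenate the entity key values of one row into a single string."""
    return sep.join(str(row[col]) for col in key_columns)


# Hypothetical entity row and key columns.
row = {"driver_id": 1001, "event_timestamp": "2021-06-01T00:00:00"}
columns = ["driver_id", "event_timestamp"]

# Evaluating the key twice gives the same value.
first = row_key(row, columns)
second = row_key(row, columns)
stable = first == second

# A random identifier differs between evaluations of the same row.
random_first = uuid.uuid4()
random_second = uuid.uuid4()
```

The separator is a choice of this sketch: without one, the value pairs ("a", "bc") and ("ab", "c") would concatenate to the same string.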
Because the entity_dataframe should contain all entity keys, I use CONCAT(), which computes a deterministic string for a given input. This string is computed in a distributed fashion, as it only needs the data points of a given row.
Alternative
At first, the result of CONCAT() was passed to a hash function (FARM_FINGERPRINT()). The problem with a hash function is that it introduces the possibility of a collision, where different keys result in the same hash.
I also tried GENERATE_UUID(), which is non-deterministic, and the query returned wrong results. I suspect that it was computed multiple times (depending on how the SQL query gets parsed and optimized), so we ended up with all features always being NULL.
TODO: Matt can look at how the query is interpreted and see if GENERATE_UUID() is called multiple times.
Does this PR introduce a user-facing change?: