feat: Cassandra online store, concurrent fetching for multiple entities #3356
minimal handling of exceptions in concurrent query execution
read_concurrency parameter in Cassandra online store config yaml
Signed-off-by: Stefano Lottini <stefano.lottini@datastax.com>
ce4a0eb to e9c04f9
/lgtm
@hemidactylus: you cannot LGTM your own PR.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: adchia, hemidactylus
# [0.27.0](v0.26.0...v0.27.0) (2022-12-05)

### Bug Fixes

* Changing Snowflake template code to avoid query not implemented … ([#3319](#3319)) ([1590d6b](1590d6b))
* Dask zero division error if parquet dataset has only one partition ([#3236](#3236)) ([69e4a7d](69e4a7d))
* Enable Spark materialization on Yarn ([#3370](#3370)) ([0c20a4e](0c20a4e))
* Ensure that Snowflake accounts for number columns that overspecify precision ([#3306](#3306)) ([0ad0ace](0ad0ace))
* Fix memory leak from usage.py not properly cleaning up call stack ([#3371](#3371)) ([a0c6fde](a0c6fde))
* Fix workflow to contain env vars ([#3379](#3379)) ([548bed9](548bed9))
* Update bytewax materialization ([#3368](#3368)) ([4ebe00f](4ebe00f))
* Update the version counts ([#3378](#3378)) ([8112db5](8112db5))
* Updated AWS Athena template ([#3322](#3322)) ([5956981](5956981))
* Wrong UI data source type display ([#3276](#3276)) ([8f28062](8f28062))

### Features

* Cassandra online store, concurrency in bulk write operations ([#3367](#3367)) ([eaf354c](eaf354c))
* Cassandra online store, concurrent fetching for multiple entities ([#3356](#3356)) ([00fa21f](00fa21f))
* Get Snowflake Query Output As Pyspark Dataframe ([#2504](#2504)) ([#3358](#3358)) ([2f18957](2f18957))
This PR changes how features are retrieved from the Cassandra online store by leveraging the
Cassandra driver's native concurrency capabilities.
When several entities are requested, the reads are no longer executed sequentially, one entity after another.
Instead they are executed concurrently: the driver keeps the results in the same order as the requests, and the call
returns once all results are available.
In measurements on realistic environments, this yields a 2-3x speedup when retrieving 20 to 100 entities at once.
Using the Cassandra driver's `execute_concurrent_with_args` function requires a new parameter that controls the maximum concurrency to use (somewhat bounded by the number of vCPUs at hand). For transparency, this is exposed in the feature store configuration YAML as a new parameter, `read_concurrency`, which is documented and correctly handled by the guided procedure of `feast init -t cassandra`.
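As an illustration of the approach described above, here is a minimal sketch of a concurrent multi-entity read with the driver's `execute_concurrent_with_args`. It is not the code from this PR: the helper name `fetch_entities`, its `executor` argument (added only so the sketch can be exercised without a live cluster), the single-bind-marker statement, and the default of 100 are assumptions made for the example.

```python
from typing import Any, Callable, Iterable, List, Optional


def fetch_entities(
    session: Any,
    statement: Any,
    entity_keys: Iterable[str],
    read_concurrency: int = 100,  # assumed default, for illustration only
    executor: Optional[Callable[..., Any]] = None,  # hypothetical test hook
) -> List[list]:
    """Read the rows for each entity key concurrently, preserving input order."""
    if executor is None:
        # Imported lazily so the sketch can be tried without the driver installed.
        from cassandra.concurrent import execute_concurrent_with_args

        executor = execute_concurrent_with_args

    # One parameter tuple per entity; the statement is assumed to have one bind marker.
    parameters = [(key,) for key in entity_keys]

    # The driver returns one (success, result_or_exc) pair per parameter tuple,
    # in the same order as the parameters.
    results = executor(
        session,
        statement,
        parameters,
        concurrency=read_concurrency,
        raise_on_first_error=False,
    )

    rows_per_entity: List[list] = []
    for success, result_or_exc in results:
        if not success:
            # Minimal error handling: surface the first failed query to the caller.
            raise result_or_exc
        rows_per_entity.append(list(result_or_exc))
    return rows_per_entity
```

With a real cluster, `session` would be a connected `cassandra.cluster.Session` and `statement` a prepared `SELECT` with one bind marker for the entity key; `read_concurrency` would then be fed from the value set in the feature store configuration.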