Add Spark materialization engine for parallel, distributed materialization of large datasets. #3167
Comments
Hey! Thanks for the FR. Are you in the Slack? There have been recent discussions in the #feast-development Slack channel about this topic.
Hi @ckarwicki, I'm taking a look at potential designs for a Spark materialization engine:
(1) The default […]
(2) Alternatively, data can be written to remote storage (like S3) and the […]
(3) I don't know if it's possible to parallelise the code from the […]
Any thoughts or ideas?
@niklasvm We shouldn't rely on parquet files. This should all be done in executors, which will make materialization distributed. Initially we can parallelize […]. We basically have […].
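As a rough, hypothetical sketch (not Feast's actual implementation) of pushing writes into executors with PySpark's foreachPartition; the parquet path is a placeholder and the online-store write is stubbed out with a print:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def write_partition(rows):
    # Runs on an executor: each task handles only its own partition of the
    # DataFrame, so the driver never has to hold the full dataset in memory.
    # A real engine would call the online store's client here instead of print.
    batch = list(rows)
    print(f"would write {len(batch)} rows to the online store")

df = spark.read.parquet("s3://example-bucket/feature-data/")  # placeholder path
df.foreachPartition(write_partition)
```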
Thanks @ckarwicki for the explanation. I have actually gone ahead and implemented something quite similar in #3184, but instead of using […].
@niklasvm Not sure if we should use […]. It works in three steps, split-apply-combine; check this: https://docs.databricks.com/spark/latest/spark-sql/pandas-function-apis.html#grouped-map
We should use […].
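For reference, a minimal sketch of the grouped-map (split-apply-combine) pattern linked above, using PySpark's applyInPandas; the toy DataFrame and the pass-through push_group function are illustrative placeholders, not Feast code:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy feature data keyed by entity.
df = spark.createDataFrame(
    [("driver_1", 1, 0.5), ("driver_1", 2, 0.7), ("driver_2", 3, 0.1)],
    ["entity_id", "event_ts", "feature_value"],
)

def push_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Split-apply-combine: Spark splits the DataFrame by entity_id, applies
    # this function to each group on an executor, then combines the results.
    # A materialization engine would write `pdf` to the online store here.
    return pdf

result = df.groupBy("entity_id").applyInPandas(push_group, schema=df.schema)
result.show()
```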
That is very useful, thank you. I'll update the PR.
@niklasvm You should also use repartition() on df before calling […].
I believe […]. And without […], is the intention only to repartition to a number that Redis can handle? Unfortunately […]. Could we use coalesce()?
@niklasvm coalesce() should be used when partitioning down. It would be useful to call repartition() since it would allow control over the parallelism of materialization and simultaneous writes. They are saying it should be possible to call it without arguments: […]
Thanks @ckarwicki. I have tested running this without a parameter and it fails.
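For illustration, a small sketch of the repartition()/coalesce() trade-off discussed here; the partition counts are arbitrary examples, not values suggested in the thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # toy stand-in for a feature DataFrame

# In PySpark, repartition() requires an explicit partition count (or columns);
# calling it with no arguments raises a TypeError. An explicit count bounds
# how many concurrent tasks, and therefore simultaneous online-store writes,
# run during materialization.
bounded = df.repartition(100)

# coalesce() only merges partitions downward and avoids a full shuffle, so it
# is the cheaper option when the goal is purely to reduce partition count.
narrowed = df.coalesce(10)

print(bounded.rdd.getNumPartitions(), narrowed.rdd.getNumPartitions())
```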
Is your feature request related to a problem? Please describe.
The current implementation of the Spark offline store doesn't have a Spark-based materialization engine. This makes materialization slow and inefficient, and it makes the Spark offline store not very useful, since materialization still happens in the driver node and is limited by its resources.
Describe the solution you'd like
A Spark-based materialization engine.
Describe alternatives you've considered
BytewaxMaterializationEngine - it relies on offline_job.to_remote_storage(), but SparkRetrievalJob doesn't support to_remote_storage(). Also, would rather use one stack for job execution (preferably Spark) instead of two.

Additional context
A spark_materialization_engine would make Feast highly scalable and leverage Spark's full potential. Right now it is very limited.