Add Spark materialization engine for parallel, distributed materialization of large datasets. #3167

Closed
ckarwicki opened this issue Sep 1, 2022 · 10 comments · Fixed by #3184
Labels
kind/feature New feature or request

Comments

ckarwicki (Contributor) commented Sep 1, 2022

Is your feature request related to a problem? Please describe.

The current implementation of the Spark offline store doesn't have a Spark-based materialization engine. This makes materialization slow and inefficient, and it makes the Spark offline store much less useful, since materialization still happens on the driver node and is limited by its resources.

Describe the solution you'd like
A Spark-based materialization engine.

Describe alternatives you've considered
BytewaxMaterializationEngine - it relies on offline_job.to_remote_storage(), but SparkRetrievalJob doesn't support to_remote_storage(). Also, we would rather use one stack for job execution (preferably Spark) instead of two.

Additional context
A spark_materialization_engine would make Feast highly scalable and let it leverage Spark's full potential. Right now it is very limited.

adchia (Collaborator) commented Sep 1, 2022

Hey! Thanks for the FR. Are you in the Slack? There have been recent discussions about this topic in the #feast-development Slack channel.

niklasvm (Collaborator) commented Sep 4, 2022

Hi @ckarwicki

I'm taking a look at potential designs for a SparkBatchMaterializationEngine. In the meantime, I've got an open PR that implements the to_remote_storage method for SparkRetrievalJob. I currently see 3 solutions:

(1) The default LocalMaterializationEngine writes the data to the local file system on the driver node and then loops over each parquet file in series. One possible solution is to parallelise this process; however, it will still be limited by the size of the driver node.

(2) Alternatively, data can be written to remote storage (like S3) and the LambdaMaterializationEngine can be used; however, this requires the use of AWS Lambda.

(3) I don't know if it's possible to parallelise the code from the LocalMaterializationEngine to run over the executor nodes.

Any thoughts or ideas?

ckarwicki (Contributor, Author) commented Sep 5, 2022

@niklasvm We shouldn't rely on parquet files. This should all be done in the executors, which will make materialization distributed. Initially we can parallelize _materialize_one(), and after that materialize(). We can start very simple: take the code from LocalMaterializationEngine and parallelize it in SparkBatchMaterializationEngine. Take out table = offline_job.to_arrow(); it is inefficient. Then take the code from https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/materialization/local_engine.py#L158-L178 and put it inside https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.foreachPartition.html#pyspark.sql.DataFrame.foreachPartition. Using foreachPartition is recommended when writing to external stores.

We basically have a SparkRetrievalJob after the call to self.offline_store.pull_latest_from_table_or_query(), so we execute that and get a sql.DataFrame, which is just a df. We call foreachPartition() on that df with the modified code from LocalMaterializationEngine, which will write the rows of the sql.DataFrame to the online store from the executors. foreachPartition is an action, so calling it from the driver triggers the job, which is fine. With this approach materialization will be handled by the executors in a parallel, distributed way.
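
A minimal sketch of this approach, assuming spark_df is the pyspark.sql.DataFrame produced by executing the SparkRetrievalJob; connect_to_online_store and write_rows_to_online_store are hypothetical stand-ins for the per-row write logic adapted from local_engine.py, not actual Feast internals:

```python
# Hypothetical sketch: materialize from the executors with foreachPartition.
# `spark_df` is the DataFrame from the offline store query; the two helpers
# below are placeholders for code adapted from LocalMaterializationEngine.

def _materialize_partition(rows):
    # Runs on an executor; `rows` is an iterator of pyspark.sql.Row objects.
    # Opening one connection per partition and reusing it for all rows is the
    # main reason to prefer foreachPartition over foreach here.
    conn = connect_to_online_store()            # hypothetical connection helper
    for row in rows:
        write_rows_to_online_store(conn, row)   # hypothetical, adapted from local_engine.py
    conn.close()

# foreachPartition is an action, so this call from the driver triggers the job;
# the function itself executes on the executors, one invocation per partition.
spark_df.foreachPartition(_materialize_partition)
```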

niklasvm (Collaborator) commented Sep 5, 2022

Thanks @ckarwicki for the explanation. I have actually gone ahead and implemented something quite similar in #3184, but instead of using foreachPartition I've used applyInPandas. I'm not sure which is superior.

ckarwicki (Contributor, Author) commented Sep 5, 2022

@niklasvm I'm not sure we should use applyInPandas. It is memory-intensive and requires a full shuffle: https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.GroupedData.applyInPandas.html

This function requires a full shuffle. All the data of a group will be loaded into memory, so the user should be aware of the potential OOM risk if data is skewed and certain groups are too large to fit in memory.

It works in three steps (split-apply-combine); see: https://docs.databricks.com/spark/latest/spark-sql/pandas-function-apis.html#grouped-map

We should use foreachPartition() instead.
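
For contrast, a hedged sketch of the applyInPandas pattern (the grouping column "entity_id" and the write helper are illustrative, not taken from #3184): it requires a groupBy, so Spark must shuffle every group to a single executor and load it into memory as a pandas DataFrame before the function runs.

```python
import pandas as pd

def write_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical: push this group's rows to the online store here.
    # applyInPandas must return a pandas DataFrame matching the declared
    # schema, so return an empty frame with the same columns.
    return pdf.head(0)

# The groupBy is what forces the full shuffle and the per-group memory
# pressure described in the docs quoted above.
spark_df.groupBy("entity_id").applyInPandas(write_group, schema=spark_df.schema)
```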

niklasvm (Collaborator) commented Sep 5, 2022

That is very useful, thank you. I'll update the PR.

ckarwicki (Contributor, Author) commented Sep 5, 2022

@niklasvm You should also call repartition() on the df before calling foreachPartition(), where the number of partitions is the number of simultaneous connections (writes) to the online store. We can call it without arguments; it will create 200 partitions (the number is taken from spark.sql.shuffle.partitions). That will be the number of simultaneous connections (parallel writes) to Redis. The Redis default is 10K connections; you can change it on the Redis side if you need more.
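
A short sketch of that, reusing spark_df and _materialize_partition from the earlier example (the partition count is illustrative):

```python
# Control write parallelism: each partition becomes one concurrent connection
# to the online store (e.g. Redis). 200 mirrors the default value of
# spark.sql.shuffle.partitions; tune it to what the online store can handle.
num_parallel_writes = 200

spark_df.repartition(num_parallel_writes).foreachPartition(_materialize_partition)
```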

niklasvm (Collaborator) commented Sep 6, 2022

I believe repartition will cause a shuffle operation. Here's the DAG that uses repartition:

[screenshot: Spark DAG with the repartition stage]

And without repartition:

[screenshot: Spark DAG without repartition]

Is the intention to only repartition to a number that Redis can handle? Unfortunately, numPartitions is a required parameter of repartition.

Could we use coalesce()?

ckarwicki (Contributor, Author) commented Sep 7, 2022

@niklasvm coalesce() should only be used when reducing the number of partitions. It would be useful to call repartition() since it allows control over the parallelism of materialization and the number of simultaneous writes. The docs say it should be possible to call it without arguments:

numPartitions : int
can be an int to specify the target number of partitions or a Column. If it is a Column, it will be used as the first partitioning column. If not specified, the default number of partitions is used.

niklasvm (Collaborator) commented Sep 8, 2022

Thanks @ckarwicki. I have tested running this without a parameter and it fails.
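
A possible workaround, assuming spark is the active SparkSession: read the shuffle-partition setting explicitly and pass it, since PySpark's DataFrame.repartition() does not allow omitting numPartitions entirely.

```python
# Read the configured default (falling back to 200) and repartition explicitly.
num_partitions = int(spark.conf.get("spark.sql.shuffle.partitions", "200"))
spark_df = spark_df.repartition(num_partitions)
```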
