Ray-2.10.0

Release Highlights

The Ray 2.10 release brings important stability improvements and enhancements to Ray Data, with Ray Data becoming generally available (GA).

  • [Data] Ray Data becomes generally available, with stability improvements in streaming execution, reading and writing data, better task concurrency control, and improved debuggability via the dashboard, logging, and metrics visualization.
  • [RLlib] The “New API Stack” is officially announced as alpha for PPO and SAC.
  • [Serve] Added a default autoscaling policy, set via num_replicas="auto" (#42613).
  • [Serve] Added support for active load shedding via max_queued_requests (#42950).
  • [Serve] Added replica queue length caching to the DeploymentHandle scheduler (#42943).
    • This should reduce overhead in the Serve proxy and handles.
    • max_ongoing_requests (max_concurrent_queries) is also now strictly enforced (#42947).
    • If you see any issues, please report them on GitHub; you can disable this behavior by setting RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0.
  • [Serve] Renamed the following parameters. Each of the old names will be supported for another release before removal.
    • max_concurrent_queries -> max_ongoing_requests
    • target_num_ongoing_requests_per_replica -> target_ongoing_requests
    • downscale_smoothing_factor -> downscaling_factor
    • upscale_smoothing_factor -> upscaling_factor
  • [Core] Autoscaler v2 is in alpha and can be tried out with KubeRay. It has improved observability and stability compared to v1.
  • [Train] Added support for accelerator types via ScalingConfig(accelerator_type).
  • [Train] Revamped the XGBoostTrainer and LightGBMTrainer to no longer depend on xgboost_ray and lightgbm_ray. A new, more flexible API will arrive in a future release.
  • [Train/Tune] Refactored local staging directory to remove the need for local_dir and RAY_AIR_LOCAL_CACHE_DIR.

Ray Libraries

Ray Data

🎉 New Features:

  • Streaming execution stability improvements to avoid memory issues, including per-operator resource reservation, streaming generator output buffer management, and better runtime resource estimation (#43026, #43171, #43298, #43299, #42930, #42504)
  • Metadata read stability improvements to avoid transient AWS errors, including retrying on application-level exceptions, spreading tasks across multiple nodes, and configurable retry intervals (#42044, #43216, #42922, #42759)
  • Allow task concurrency control for the read, map, and write APIs (#42849, #43113, #43177, #42637); see the sketch after this list
  • Data dashboard and statistics improvements with more runtime metrics for each component (#43790, #43628, #43241, #43477, #43110, #43112)
  • Allow specifying application-level errors to retry for actor tasks (#42492)
  • Add a num_rows_per_file parameter to file-based writes (#42694)
  • Add DataIterator.materialize (#43210)
  • Skip schema call in DataIterator.to_tf if tf.TypeSpec is provided (#42917)
  • Add an option to append to Dataset.write_bigquery (#42584)
  • Deprecate legacy components and classes (#43575, #43178, #43347, #43349, #43342, #43341, #42936, #43144, #43022, #43023)
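
A minimal sketch of the new concurrency and file-size controls, assuming placeholder S3 paths; concurrency and num_rows_per_file are the arguments introduced above:

```python
import ray

# Cap the number of concurrent read tasks (#42849).
ds = ray.data.read_parquet("s3://my-bucket/input/", concurrency=8)

# Bound task concurrency for transformations as well (#43113).
ds = ds.map_batches(lambda batch: batch, concurrency=4)

# Target roughly fixed-size output files with num_rows_per_file (#42694).
ds.write_parquet("s3://my-bucket/output/", num_rows_per_file=100_000)
```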

💫 Enhancements:

  • Restructure stdout logging for better readability (#43360)
  • Add a more performant way to read large TFRecord datasets (#42277)
  • Modify ImageDatasource to use Image.BILINEAR as the default image resampling filter (#43484)
  • Reduce internal stack trace output by default (#43251)
  • Perform incremental writes to Parquet files (#43563)
  • Warn on excessive driver memory usage during shuffle ops (#42574)
  • Distributed reads for ray.data.from_huggingface (#42599); see the sketch after this list
  • Remove Stage class and related usages (#42685)
  • Improve stability of reading JSON files to avoid PyArrow errors (#42558, #42357)
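
A minimal sketch of a distributed Hugging Face read, assuming the datasets package is installed and using the public imdb dataset as a stand-in:

```python
import ray
from datasets import load_dataset

# With #42599, converting a Hugging Face dataset no longer materializes
# everything on the driver; supported datasets are read in a distributed way.
hf_dataset = load_dataset("imdb", split="train")
ds = ray.data.from_huggingface(hf_dataset)
print(ds.schema())
```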

🔨 Fixes:

  • Turn off actor locality by default (#44124)
  • Normalize block types before internal multi-block operations (#43764)
  • Fix memory metrics for OutputSplitter (#43740)
  • Fix race condition issue in OpBufferQueue (#43015)
  • Fix early stop for multiple Limit operators (#42958)
  • Fix deadlocks caused by Dataset.streaming_split that led to job hangs (#42601)

Ray Train

🎉 New Features:

  • Add support for accelerator types via ScalingConfig(accelerator_type) for improved worker scheduling (#43090)
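
A minimal sketch of accelerator-aware scheduling, assuming a PyTorch setup and a cluster with A10G GPUs (the trainer choice and training body are placeholders):

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    ...  # placeholder training loop

# accelerator_type restricts workers to nodes with the requested accelerator (#43090).
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True, accelerator_type="A10G"),
)
result = trainer.fit()
```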

💫 Enhancements:

  • Add a backend-specific context manager for train_func for setup/teardown logic (#43209)
  • Remove DEFAULT_NCCL_SOCKET_IFNAME to simplify network configuration (#42808)
  • Colocate the Trainer with the rank 0 Worker to improve scheduling behavior (#43115)

🔨 Fixes:

  • Enable scheduling workers with memory resource requirements (#42999)
  • Make path behavior OS-agnostic by using Path.as_posix over os.path.join (#42037)
  • [Lightning] Fix resuming from checkpoint when using RayFSDPStrategy (#43594)
  • [Lightning] Fix deadlock in RayTrainReportCallback (#42751)
  • [Transformers] Fix checkpoint reporting behavior when get_latest_checkpoint returns None (#42953)

📖 Documentation:

  • Enhance docstring and user guides for train_loop_config (#43691)
  • Clarify in ray.train.report docstring that it is not a barrier (#42422)
  • Improve documentation for prepare_data_loader shuffle behavior and set_epoch (#41807)

🏗 Architecture refactoring:

  • Simplify XGBoost and LightGBM Trainer integrations. Implemented XGBoostTrainer and LightGBMTrainer as DataParallelTrainer. Removed dependency on xgboost_ray and lightgbm_ray. (#42111, #42767, #43244, #43424)
  • Refactor local staging directory to remove the need for local_dir and RAY_AIR_LOCAL_CACHE_DIR. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to storage_path, rather than having another copy in the user’s home directory (~/ray_results). (#43369, #43403, #43689)
  • Split the overloaded ray.train.torch.get_device into a separate get_devices API for multi-GPU worker setups (#42314); see the sketch after this list
  • Refactor restoration configuration to be centered around storage_path (#42853, #43179)
  • Deprecations related to SyncConfig (#42909)
  • Remove deprecated preprocessor argument from Trainers (#43146, #43234)
  • Hard-deprecate MosaicTrainer and remove SklearnTrainer (#42814)
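
A minimal sketch of the split device APIs inside a training function, assuming workers that may own multiple GPUs:

```python
from ray.train.torch import get_device, get_devices

def train_func():
    device = get_device()    # the single device assigned to this worker
    devices = get_devices()  # all assigned devices, for multi-GPU workers (#42314)
    print(device, devices)
```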

Ray Tune

💫 Enhancements:

  • Increase the minimum number of allowed pending trials for faster scale-up during autoscaling (#43455)
  • Add support to TBXLogger for logging images (#37822)
  • Improve validation of Experiment(config) to handle RLlib AlgorithmConfig (#42816, #42116)

🔨 Fixes:

  • Fix reuse_actors error on actor cleanup for function trainables (#42951)
  • Make path behavior OS-agnostic by using Path.as_posix over os.path.join (#42037)

🏗 Architecture refactoring:

  • Refactor local staging directory to remove the need for local_dir and RAY_AIR_LOCAL_CACHE_DIR. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to storage_path, rather than having another copy in the user’s home directory (~/ray_results); see the sketch after this list. (#43369, #43403, #43689)
  • Deprecations related to SyncConfig and chdir_to_trial_dir (#42909)
  • Refactor restoration configuration to be centered around storage_path (#42853, #43179)
  • Add back NevergradSearch (#42305)
  • Clean up invalid checkpoint_dir and reporter deprecation notices (#42698)
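
A minimal sketch of the storage_path-centric setup, assuming a toy trainable and a placeholder S3 bucket for persisted results:

```python
from ray import train, tune

def trainable(config):
    train.report({"score": config["x"] ** 2})

# Results are written only to storage_path; there is no second copy under
# ~/ray_results and no separate local_dir or RAY_AIR_LOCAL_CACHE_DIR.
tuner = tune.Tuner(
    trainable,
    param_space={"x": tune.grid_search([1, 2, 3])},
    run_config=train.RunConfig(storage_path="s3://my-bucket/tune-results"),
)
results = tuner.fit()
```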

Ray Serve

🎉 New Features:

  • Added support for active load shedding via max_queued_requests (#42950).
  • Added a default autoscaling policy, set via num_replicas="auto" (#42613). See the sketch below.
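
A minimal sketch combining both features on a toy deployment; the Greeter class and the queue limit value are illustrative:

```python
from ray import serve

@serve.deployment(
    num_replicas="auto",     # default autoscaling policy (#42613)
    max_queued_requests=50,  # actively shed load beyond 50 queued requests (#42950)
)
class Greeter:
    def __call__(self, request) -> str:
        return "hello"

serve.run(Greeter.bind())
```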

🏗 API Changes:

  • Renamed the following parameters. Each of the old names will be supported for another release before removal. See the sketch after this list.
    • max_concurrent_queries to max_ongoing_requests
    • target_num_ongoing_requests_per_replica to target_ongoing_requests
    • downscale_smoothing_factor to downscaling_factor
    • upscale_smoothing_factor to upscaling_factor
  • WARNING: the following default values will change in Ray 2.11:
    • Default for max_ongoing_requests will change from 100 to 5.
    • Default for target_ongoing_requests will change from 1 to 2.
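
A minimal sketch using the new parameter names on a toy deployment (the numeric values are illustrative, not recommendations):

```python
from ray import serve

@serve.deployment(
    max_ongoing_requests=5,  # formerly max_concurrent_queries
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_ongoing_requests": 2,  # formerly target_num_ongoing_requests_per_replica
        "upscaling_factor": 1.0,       # formerly upscale_smoothing_factor
        "downscaling_factor": 0.5,     # formerly downscale_smoothing_factor
    },
)
class Model:
    def __call__(self, request) -> str:
        return "ok"
```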

💫 Enhancements:

  • Add the RAY_SERVE_LOG_ENCODING environment variable to set the global logging behavior for Serve (#42781).
  • Configure Serve's gRPC proxy to allow large payloads (#43114).
  • Add a blocking flag to serve.run() (#43227); see the sketch after this list.
  • Add actor ID and worker ID to Serve structured logs (#43725).
  • Added replica queue length caching to the DeploymentHandle scheduler (#42943).
    • This should reduce overhead in the Serve proxy and handles.
    • max_ongoing_requests (max_concurrent_queries) is also now strictly enforced (#42947).
    • If you see any issues, please report them on GitHub; you can disable this behavior by setting RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0.
  • Autoscaling metrics (tracking ongoing and queued requests) are now collected at deployment handles by default instead of at the Serve replicas (#42578).
    • This means you can now set max_ongoing_requests=1 for autoscaling deployments and still upscale properly, because requests queued at handles are properly taken into account for autoscaling.
    • You should expect deployments to upscale more aggressively during bursty traffic, because requests will likely queue up at handles during bursts of traffic.
    • If you see any issues, please report them on GitHub; you can switch back to the old method of collecting metrics by setting the environment variable RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=0.
  • Improved the downscaling behavior of smoothing_factor for low numbers of replicas (#42612).
  • Various logging improvements (#43707, #43708, #43629, #43557).
  • During in-place upgrades or when replicas become unhealthy, Serve will no longer wait for old replicas to gracefully terminate before starting new ones (#43187). New replicas will be eagerly started to satisfy the target number of healthy replicas.
    • This new behavior is on by default and can be turned off by setting RAY_SERVE_EAGERLY_START_REPLACEMENT_REPLICAS=0.
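
A minimal sketch of the blocking flag and the logging env var; treating "JSON" as a supported RAY_SERVE_LOG_ENCODING value is an assumption here:

```python
import os

# Assumption: "JSON" is one of the supported encodings for RAY_SERVE_LOG_ENCODING (#42781).
# Set it before Serve starts so it applies globally.
os.environ["RAY_SERVE_LOG_ENCODING"] = "JSON"

from ray import serve

@serve.deployment
class Echo:
    def __call__(self, request) -> str:
        return "echo"

# blocking=True keeps the driver process serving traffic until interrupted (#43227).
serve.run(Echo.bind(), blocking=True)
```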

🔨 Fixes:

  • Fix the deployment route prefix being overridden by the default route prefix from the serve run CLI (#43805).
  • Fixed a bug causing batch methods to hang upon cancellation (#42593).
  • Unpinned FastAPI dependency version (#42711).
  • Delay proxy marking itself as healthy until it has routes from the controller (#43076).
  • Fixed an issue where multiplexed deployments could go into infinite backoff (#43965).
  • Silence noisy KeyError on disconnects (#43713).
  • Fixed a bug where Prometheus counter metrics were emitted as gauges (#43795, #43901).
    • All Serve counter metrics are now emitted as counters with a _total suffix. The old gauge metrics are still emitted for backwards compatibility.

📖 Documentation:

  • Update serve logging config docs (#43483).
  • Added documentation for max_replicas_per_node (#42743).

RLlib

💫 Enhancements:

  • Old API Stack cleanups:
    • Move SampleBatch column names (e.g. SampleBatch.OBS) into a new Columns class (#43665)
    • Remove old exec_plan API code. (#41585)
    • Introduce OldAPIStack decorator (#43657)
    • RLModule API: Add functionality to define kernel and bias initializers via config. (#42137)
  • Learner/LearnerGroup APIs:
    • Replace Learner/LearnerGroup specific config classes (e.g. LearnerHyperparameters) with AlgorithmConfig. (#41296)
    • Learner/LearnerGroup: Allow updating from Episodes. (#41235)
  • In preparation for DQN on the new API stack (#43199, #43196)
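
As highlighted at the top of these notes, the new API stack is available as alpha for PPO and SAC. A minimal opt-in sketch, assuming the experimental _enable_new_api_stack flag is the switch in this release:

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Assumption: in this release the new API stack is toggled via this experimental flag.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .experimental(_enable_new_api_stack=True)
)
algo = config.build()
result = algo.train()
```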

🔨 Fixes:

  • New API Stack bug fixes: fix policy_to_train logic (#41529), fix multi-GPU for PPO on the new API stack (#44001), Issue 40347 (#42090)
  • Other fixes: MultiAgentEnv not calling env.close() on a failed sub-env (#43664), Issue 42152 (#43317), Issue 42396 (#43316), Issue 41518 (#42011), Issue 42385 (#43313)

📖 Documentation:

  • New API Stack examples: Self-play and league-based self-play (#43276), MeanStdFilter (for both single-agent and multi-agent) (#43274), Prev-actions/prev-rewards for multi-agent (#43491)
  • Other docs fixes and enhancements: (#43438, #41472, #42117, #43458)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

  • Autoscaler v2 is in alpha and can be tried out with KubeRay.
  • Introduced a subreaper to prevent leaks of subprocesses created by user code (#42992).

💫 Enhancements:

  • The Ray State API get_task() now accepts an ObjectRef (#43507); see the sketch after this list
  • Add an option to disable task tracing for tasks/actors (#42431)
  • Improved object transfer throughput (#43434)
  • The Ray Client now compares the Ray and Python versions for compatibility with the remote Ray cluster (#42760)
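
A minimal sketch of looking up task state from an ObjectRef, assuming a running local cluster:

```python
import ray
from ray.util.state import get_task

@ray.remote
def f():
    return 1

ref = f.remote()
ray.get(ref)

# get_task() now accepts the ObjectRef directly instead of a task ID string (#43507).
task_state = get_task(ref)
print(task_state)
```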

🔨 Fixes:

  • Fixed several bugs in the streaming generator (#43775, #43772, #43413)
  • Fixed a bug where Ray counter metrics were emitted as gauges (#43795)
  • Fixed a bug where tasks with empty resource requirements did not work with placement groups (#43448)
  • Fixed a bug where CPU resources were not released for a blocked worker inside a placement group (#43270)
  • Fixed GCS crashes when the placement group commit phase failed due to node failure (#43405)
  • Fixed a bug where the Ray memory monitor prematurely killed tasks (#43071)
  • Fixed a placement group resource leak (#42942)
  • Upgraded cloudpickle to 3.0, which fixes an incompatibility with dataclasses (#42730)

📖 Documentation:

  • Updated the docs for Ray accelerator support (#41849)

Ray Clusters

💫 Enhancements:

  • [Spark] Add a heap_memory parameter to the setup_ray_cluster API, and change the default values of the per-worker-node and head-node configs for global Ray clusters (#42604)
  • [Spark] Add a global mode for Ray-on-Spark clusters (#41153)

🔨 Fixes:

  • [vSphere] Only deploy the OVF to the first host of the cluster (#42258)

Thanks

Many thanks to all those who contributed to this release!

@ronyw7, @xsqian, @justinvyu, @matthewdeng, @sven1977, @thomasdesr, @veryhannibal, @klebster2, @can-anyscale, @simran-2797, @stephanie-wang, @simonsays1980, @kouroshHakha, @Zandew, @akshay-anyscale, @matschaffer-roblox, @WeichenXu123, @matthew29tang, @vitsai, @Hank0626, @anmyachev, @kira-lin, @ericl, @zcin, @sihanwang41, @peytondmurray, @raulchen, @aslonnie, @ruisearch42, @vszal, @pcmoritz, @rickyyx, @chrislevn, @brycehuang30, @alexeykudinkin, @vonsago, @shrekris-anyscale, @andrewsykim, @c21, @mattip, @hongchaodeng, @dabauxi, @fishbone, @scottjlee, @justina777, @surenyufuz, @robertnishihara, @nikitavemuri, @Yard1, @huchen2021, @shomilj, @architkulkarni, @liuxsh9, @Jocn2020, @liuyang-my, @rkooo567, @alanwguo, @KPostOffice, @woshiyyya, @n30111, @edoakes, @y-abe, @martinbomio, @jiwq, @arunppsg, @ArturNiederfahrenhorst, @kevin85421, @khluu, @JingChen23, @masariello, @angelinalg, @jjyao, @omatthew98, @jonathan-anyscale, @sjoshi6, @gaborgsomogyi, @rynewang, @ratnopamc, @chris-ray-zhang, @ijrsvt, @scottsun94, @raychen911, @franklsf95, @GeneDer, @madhuri-rai07, @scv119, @bveeramani, @anyscalesam, @zen-xu, @npuichigo