Ray-2.10.0
Release Highlights
The Ray 2.10 release brings important stability improvements and enhancements to Ray Data, with Ray Data becoming generally available (GA).
- [Data] Ray Data becomes generally available, with stability improvements in streaming execution, reading and writing data, better task concurrency control, and debuggability improvements in the dashboard, logging, and metrics visualization.
- [RLlib] “New API Stack” officially announced as alpha for PPO and SAC.
- [Serve] Added a default autoscaling policy set via `num_replicas="auto"` (#42613).
- [Serve] Added support for active load shedding via `max_queued_requests` (#42950).
- [Serve] Added replica queue length caching to the DeploymentHandle scheduler (#42943).
  - This should reduce overhead in the Serve proxy and handles. `max_ongoing_requests` (`max_concurrent_queries`) is also now strictly enforced (#42947).
  - If you see any issues, please report them on GitHub. You can disable this behavior by setting `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0`.
- [Serve] Renamed the following parameters. Each of the old names will be supported for another release before removal.
  - `max_concurrent_queries` -> `max_ongoing_requests`
  - `target_num_ongoing_requests_per_replica` -> `target_ongoing_requests`
  - `downscale_smoothing_factor` -> `downscaling_factor`
  - `upscale_smoothing_factor` -> `upscaling_factor`
- [Core] Autoscaler v2 is in alpha and can be tried out with KubeRay. It has improved observability and stability compared to v1.
- [Train] Added support for accelerator types via `ScalingConfig(accelerator_type)`.
- [Train] Revamped the `XGBoostTrainer` and `LightGBMTrainer` to no longer depend on `xgboost_ray` and `lightgbm_ray`. A new, more flexible API will be released in a future release.
- [Train/Tune] Refactored the local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`.
Ray Libraries
Ray Data
🎉 New Features:
- Streaming execution stability improvements to avoid memory issues, including per-operator resource reservation, streaming generator output buffer management, and better runtime resource estimation (#43026, #43171, #43298, #43299, #42930, #42504)
- Metadata read stability improvements to avoid transient AWS errors, including retrying on application-level exceptions, spreading tasks across multiple nodes, and configurable retry intervals (#42044, #43216, #42922, #42759).
- Allow task concurrency control for read, map, and write APIs (#42849, #43113, #43177, #42637); see the sketch after this list.
- Data dashboard and statistics improvements with more runtime metrics for each component (#43790, #43628, #43241, #43477, #43110, #43112)
- Allow specifying application-level errors to retry for actor tasks (#42492)
- Add `num_rows_per_file` parameter to file-based writes (#42694)
- Add `DataIterator.materialize` (#43210)
- Skip schema call in `DataIterator.to_tf` if `tf.TypeSpec` is provided (#42917)
- Add option to append for `Dataset.write_bigquery` (#42584)
- Deprecate legacy components and classes (#43575, #43178, #43347, #43349, #43342, #43341, #42936, #43144, #43022, #43023)
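A minimal sketch of the new task concurrency controls and the `num_rows_per_file` write parameter; the paths, column name, and transform below are illustrative placeholders, not part of this release.

```python
import ray

# Limit the number of concurrent read tasks (illustrative S3 path).
ds = ray.data.read_parquet("s3://example-bucket/input/", concurrency=8)

def add_one(batch):
    # Assumes the dataset has a numeric "value" column (placeholder).
    batch["value"] = batch["value"] + 1
    return batch

# Cap the number of concurrent map tasks.
ds = ds.map_batches(add_one, concurrency=4)

# Control output file sizes with the new num_rows_per_file parameter.
ds.write_parquet("s3://example-bucket/output/", num_rows_per_file=100_000)
```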
💫 Enhancements:
- Restructure stdout logging for better readability (#43360)
- Add a more performant way to read large TFRecord datasets (#42277)
- Modify `ImageDatasource` to use `Image.BILINEAR` as the default image resampling filter (#43484)
- Reduce internal stack trace output by default (#43251)
- Perform incremental writes to Parquet files (#43563)
- Warn on excessive driver memory usage during shuffle ops (#42574)
- Distributed reads for `ray.data.from_huggingface` (#42599)
- Remove `Stage` class and related usages (#42685)
- Improve stability of reading JSON files to avoid PyArrow errors (#42558, #42357)
🔨 Fixes:
- Turn off actor locality by default (#44124)
- Normalize block types before internal multi-block operations (#43764)
- Fix memory metrics for `OutputSplitter` (#43740)
- Fix race condition issue in `OpBufferQueue` (#43015)
- Fix early stop for multiple `Limit` operators (#42958)
- Fix deadlocks caused by `Dataset.streaming_split` that could hang jobs (#42601)
📖 Documentation:
Ray Train
🎉 New Features:
- Add support for accelerator types via `ScalingConfig(accelerator_type)` for improved worker scheduling (#43090)
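A sketch of requesting a specific accelerator type for Train workers; the trainer choice, worker count, and accelerator name are placeholders for illustration.

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    # Placeholder training loop.
    pass

trainer = TorchTrainer(
    train_func,
    # Request that each worker lands on a node with the given accelerator type.
    # "A10G" is only an example; use a type available in your cluster.
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True, accelerator_type="A10G"),
)
result = trainer.fit()
```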
💫 Enhancements:
- Add a backend-specific context manager for `train_func` for setup/teardown logic (#43209)
- Remove `DEFAULT_NCCL_SOCKET_IFNAME` to simplify network configuration (#42808)
- Colocate the Trainer with the rank 0 Worker to improve scheduling behavior (#43115)
🔨 Fixes:
- Enable scheduling workers with `memory` resource requirements (#42999)
- Make path behavior OS-agnostic by using `Path.as_posix` over `os.path.join` (#42037)
- [Lightning] Fix resuming from checkpoint when using `RayFSDPStrategy` (#43594)
- [Lightning] Fix deadlock in `RayTrainReportCallback` (#42751)
- [Transformers] Fix checkpoint reporting behavior when `get_latest_checkpoint` returns None (#42953)
📖 Documentation:
- Enhance docstring and user guides for `train_loop_config` (#43691)
- Clarify in the `ray.train.report` docstring that it is not a barrier (#42422)
- Improve documentation for `prepare_data_loader` shuffle behavior and `set_epoch` (#41807)
🏗 Architecture refactoring:
- Simplify the XGBoost and LightGBM Trainer integrations. Implemented `XGBoostTrainer` and `LightGBMTrainer` as `DataParallelTrainer`s and removed the dependency on `xgboost_ray` and `lightgbm_ray`. (#42111, #42767, #43244, #43424)
- Refactor the local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to `storage_path`, rather than having another copy in the user's home directory (`~/ray_results`). (#43369, #43403, #43689)
- Split the overloaded `ray.train.torch.get_device` into a separate `get_devices` API for multi-GPU worker setups (#42314); see the sketch after this list.
- Refactor restoration configuration to be centered around `storage_path` (#42853, #43179)
- Deprecations related to `SyncConfig` (#42909)
- Remove the deprecated `preprocessor` argument from Trainers (#43146, #43234)
- Hard-deprecate `MosaicTrainer` and remove `SklearnTrainer` (#42814)
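A minimal sketch of the `get_device` / `get_devices` split inside a training function; the surrounding `TorchTrainer` setup is omitted, and a multi-GPU-per-worker configuration is assumed.

```python
from ray.train.torch import get_device, get_devices

def train_func():
    # The first/default device assigned to this worker (previous get_device behavior).
    device = get_device()
    # New in 2.10: all devices assigned to this worker, for multi-GPU worker setups.
    devices = get_devices()
    print(f"primary device: {device}, all devices: {devices}")
```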
Ray Tune
💫 Enhancements:
- Increase the minimum number of allowed pending trials for faster auto-scaleup (#43455)
- Add support to `TBXLogger` for logging images (#37822)
- Improve validation of `Experiment(config)` to handle RLlib `AlgorithmConfig` (#42816, #42116)
🔨 Fixes:
- Fix `reuse_actors` error on actor cleanup for function trainables (#42951)
- Make path behavior OS-agnostic by using `Path.as_posix` over `os.path.join` (#42037)
📖 Documentation:
🏗 Architecture refactoring:
- Refactor the local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to `storage_path`, rather than having another copy in the user's home directory (`~/ray_results`). (#43369, #43403, #43689)
- Deprecations related to `SyncConfig` and `chdir_to_trial_dir` (#42909)
- Refactor restoration configuration to be centered around `storage_path` (#42853, #43179)
- Add back `NevergradSearch` (#42305)
- Clean up invalid `checkpoint_dir` and `reporter` deprecation notices (#42698)
Ray Serve
🎉 New Features:
- Added support for active load shedding via `max_queued_requests` (#42950).
- Added a default autoscaling policy set via `num_replicas="auto"` (#42613).
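A sketch combining both features; the deployment body and the queue limit value are placeholders.

```python
from ray import serve

@serve.deployment(
    num_replicas="auto",      # opt into the new default autoscaling policy
    max_queued_requests=100,  # requests beyond this per-caller queue length are rejected
)
class MyModel:
    async def __call__(self, request):
        # Placeholder inference logic.
        return "ok"

app = MyModel.bind()
serve.run(app)
```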
🏗 API Changes:
- Renamed the following parameters (see the sketch after this list). Each of the old names will be supported for another release before removal.
  - `max_concurrent_queries` to `max_ongoing_requests`
  - `target_num_ongoing_requests_per_replica` to `target_ongoing_requests`
  - `downscale_smoothing_factor` to `downscaling_factor`
  - `upscale_smoothing_factor` to `upscaling_factor`
- WARNING: the following default values will change in Ray 2.11:
  - Default for `max_ongoing_requests` will change from 100 to 5.
  - Default for `target_ongoing_requests` will change from 1 to 2.
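A sketch of a deployment using the new parameter names; the specific values are illustrative only (note the default changes coming in 2.11).

```python
from ray import serve

@serve.deployment(
    max_ongoing_requests=5,  # formerly max_concurrent_queries
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_ongoing_requests": 2,  # formerly target_num_ongoing_requests_per_replica
        "upscaling_factor": 1.0,       # formerly upscale_smoothing_factor
        "downscaling_factor": 0.5,     # formerly downscale_smoothing_factor
    },
)
class AutoscalingDeployment:
    async def __call__(self, request):
        return "ok"
```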
💫 Enhancements:
- Add `RAY_SERVE_LOG_ENCODING` env var to set the global logging behavior for Serve (#42781).
- Configure Serve's gRPC proxy to allow large payloads (#43114).
- Add a blocking flag to `serve.run()` (#43227); see the sketch after this list.
- Add actor ID and worker ID to Serve structured logs (#43725).
- Added replica queue length caching to the DeploymentHandle scheduler (#42943).
  - This should reduce overhead in the Serve proxy and handles. `max_ongoing_requests` (`max_concurrent_queries`) is also now strictly enforced (#42947).
  - If you see any issues, please report them on GitHub. You can disable this behavior by setting `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0`.
- Autoscaling metrics (tracking ongoing and queued requests) are now collected at deployment handles by default instead of at the Serve replicas (#42578).
  - This means you can now set `max_ongoing_requests=1` for autoscaling deployments and still upscale properly, because requests queued at handles are taken into account for autoscaling. Expect deployments to upscale more aggressively during bursty traffic, because requests will likely queue up at handles during bursts.
  - If you see any issues, please report them on GitHub. You can switch back to the old method of collecting metrics by setting the environment variable `RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=0`.
- Improved the downscaling behavior of `smoothing_factor` for low numbers of replicas (#42612).
- Various logging improvements (#43707, #43708, #43629, #43557).
- During in-place upgrades or when replicas become unhealthy, Serve will no longer wait for old replicas to gracefully terminate before starting new ones (#43187). New replicas are eagerly started to satisfy the target number of healthy replicas.
  - This new behavior is on by default and can be turned off by setting `RAY_SERVE_EAGERLY_START_REPLACEMENT_REPLICAS=0`.
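A short sketch of the blocking flag together with the log encoding env var; the "JSON" value is an assumption about the accepted encodings, so check the Serve logging docs for your version.

```python
import os

# Assumed value; RAY_SERVE_LOG_ENCODING sets Serve's global log format.
os.environ["RAY_SERVE_LOG_ENCODING"] = "JSON"

from ray import serve

@serve.deployment
class Echo:
    async def __call__(self, request):
        return "hello"

# New blocking flag: keep the driver process alive while serving the app.
serve.run(Echo.bind(), blocking=True)
```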
🔨 Fixes:
- Fix the deployment route prefix being overridden by the default route prefix from the `serve run` CLI (#43805).
- Fixed a bug causing batch methods to hang upon cancellation (#42593).
- Unpinned FastAPI dependency version (#42711).
- Delay proxy marking itself as healthy until it has routes from the controller (#43076).
- Fixed an issue where multiplexed deployments could go into infinite backoff (#43965).
- Silence noisy `KeyError` on disconnects (#43713).
- Fixed a bug where Prometheus counter metrics were emitted as gauges (#43795, #43901).
  - All Serve counter metrics are now emitted as counters with a `_total` suffix. The old gauge metrics are still emitted for compatibility.
📖 Documentation:
RLlib
🎉 New Features:
- The “new API stack” is now in alpha stage and available for PPO single-agent (#42272) and multi-agent, and for SAC single-agent (#42571, #42570, #42568); see the sketch after this list.
- In preparation for DQN on the new API stack: `PrioritizedEpisodeReplayBuffer` (#43258, #42832)
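A sketch of opting a PPO config into the alpha stack; the `_enable_new_api_stack` experimental flag is how the new stack was toggled around this release, but treat the exact flag name as an assumption and consult the RLlib docs for your version.

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    # Opt into the alpha "new API stack" (RLModule + Learner APIs).
    .experimental(_enable_new_api_stack=True)
)
algo = config.build()
print(algo.train())
```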
💫 Enhancements:
- Old API Stack cleanups:
- Learner/LearnerGroup APIs:
- In preparation of DQN on the new API stack: (#43199, #43196)
🔨 Fixes:
- New API Stack bug fixes: fix `policy_to_train` logic (#41529), fix multi-GPU for PPO on the new API stack (#44001), Issue 40347 (#42090)
- Other fixes: `MultiAgentEnv` would NOT call `env.close()` on a failed sub-env (#43664), Issue 42152 (#43317), Issue 42396 (#43316), Issue 41518 (#42011), Issue 42385 (#43313)
📖 Documentation:
- New API Stack examples: Self-play and league-based self-play (#43276), MeanStdFilter (for both single-agent and multi-agent) (#43274), Prev-actions/prev-rewards for multi-agent (#43491)
- Other docs fixes and enhancements: (#43438, #41472, #42117, #43458)
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
- Autoscaler v2 is in alpha and can be tried out with KubeRay.
- Introduced a subreaper to prevent leaking sub-processes created by user code. (#42992)
💫 Enhancements:
- The Ray state API `get_task()` now accepts an ObjectRef (#43507); see the sketch after this list.
- Add an option to disable task tracing for tasks/actors (#42431)
- Improved object transfer throughput. (#43434)
- Ray Client now compares the Ray and Python versions for compatibility with the remote Ray cluster. (#42760)
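A small sketch of the state API change, assuming a running local cluster; the exact shape of the returned task record may differ by version.

```python
import ray
from ray.util.state import get_task

ray.init()

@ray.remote
def f():
    return 1

ref = f.remote()
ray.get(ref)

# get_task() can now be called with the ObjectRef directly,
# instead of first extracting the task ID string.
print(get_task(ref))
```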
🔨 Fixes:
- Fixed several bugs in the streaming generator (#43775, #43772, #43413)
- Fixed a bug where Ray counter metrics were emitted as gauges (#43795)
- Fixed a bug where tasks with empty resources didn't work with placement groups (#43448)
- Fixed a bug where CPU resources were not released for a blocked worker inside a placement group (#43270)
- Fixed GCS crashes when the placement group commit phase failed due to node failure (#43405)
- Fixed a bug where the Ray memory monitor prematurely killed tasks (#43071)
- Fixed placement group resource leak (#42942)
- Upgraded cloudpickle to 3.0 which fixes the incompatibility with dataclasses (#42730)
📖 Documentation:
- Updated the documentation for Ray accelerator support (#41849)
Ray Clusters
💫 Enhancements:
- [spark] Add `heap_memory` param for the `setup_ray_cluster` API, and change the default per-worker-node and head-node configuration values for the global Ray cluster (#42604)
- [spark] Add global mode for Ray on Spark clusters (#41153)
🔨 Fixes:
- [vSphere] Only deploy OVF to the first host of the cluster (#42258)
Thanks
Many thanks to all those who contributed to this release!
@ronyw7, @xsqian, @justinvyu, @matthewdeng, @sven1977, @thomasdesr, @veryhannibal, @klebster2, @can-anyscale, @simran-2797, @stephanie-wang, @simonsays1980, @kouroshHakha, @Zandew, @akshay-anyscale, @matschaffer-roblox, @WeichenXu123, @matthew29tang, @vitsai, @Hank0626, @anmyachev, @kira-lin, @ericl, @zcin, @sihanwang41, @peytondmurray, @raulchen, @aslonnie, @ruisearch42, @vszal, @pcmoritz, @rickyyx, @chrislevn, @brycehuang30, @alexeykudinkin, @vonsago, @shrekris-anyscale, @andrewsykim, @c21, @mattip, @hongchaodeng, @dabauxi, @fishbone, @scottjlee, @justina777, @surenyufuz, @robertnishihara, @nikitavemuri, @Yard1, @huchen2021, @shomilj, @architkulkarni, @liuxsh9, @Jocn2020, @liuyang-my, @rkooo567, @alanwguo, @KPostOffice, @woshiyyya, @n30111, @edoakes, @y-abe, @martinbomio, @jiwq, @arunppsg, @ArturNiederfahrenhorst, @kevin85421, @khluu, @JingChen23, @masariello, @angelinalg, @jjyao, @omatthew98, @jonathan-anyscale, @sjoshi6, @gaborgsomogyi, @rynewang, @ratnopamc, @chris-ray-zhang, @ijrsvt, @scottsun94, @raychen911, @franklsf95, @GeneDer, @madhuri-rai07, @scv119, @bveeramani, @anyscalesam, @zen-xu, @npuichigo