Release v0.10.1

New Features

Admins can configure MaxCapacity through the REST API for a given virtual cluster, so that the virtual cluster can use idle resources as a bonus. #2147
Support Azure RDMA. #2091; how-to doc
New Disk Cleaner for abnormal disk usage: The disk cleaner checks disk usage every 60 seconds (configurable). If the disk usage is above 94% (configurable), it kills the container that uses the most disk space with signal 10; the container exits with code 1 and the related job fails. Admins and users can find the reason in the job logs. #2119
Web portal: add "My jobs" filter button. #2111
"Submit Simple Job" web portal plugin. #2131 Document
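The Disk Cleaner's decision rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from #2119: the function names, the constant names, and the shape of the per-container usage mapping are all hypothetical. Only the 60-second interval, the 94% threshold, and signal number 10 come from the release note.

```python
import shutil
from typing import Dict, Optional

# Defaults taken from the release note; both are described there as configurable.
CHECK_INTERVAL_SECONDS = 60
USAGE_THRESHOLD_PERCENT = 94.0
KILL_SIGNAL = 10  # signal number named in the release note


def disk_usage_percent(path: str = "/") -> float:
    """Return the used share of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def pick_container_to_kill(
    usage_percent: float,
    container_disk_bytes: Dict[str, int],
    threshold: float = USAGE_THRESHOLD_PERCENT,
) -> Optional[str]:
    """Return the id of the container using the most disk space, or None.

    `container_disk_bytes` is a hypothetical mapping of container id to
    bytes used; how the real cleaner measures this is not stated in the note.
    """
    # Act only when usage is strictly above the threshold.
    if usage_percent <= threshold or not container_disk_bytes:
        return None
    return max(container_disk_bytes, key=container_disk_bytes.get)
```

A caller would evaluate this once per CHECK_INTERVAL_SECONDS and, when a container id is returned, send KILL_SIGNAL to that container.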

Improvements

Service

Hadoop: Improved log readability by disabling an unused HDFS short-circuit setting. #2027
Extended the job log retention time from 7 days to 30 days, and made the retention time configurable by admins. #2034
Optimized the RM and Yarn's default configurations for PAI to reduce the resource usage by AM. #2072
Pylon: WebHDFS library compatibility. #2134
Extended the NM expiry time from 15 minutes to 60 minutes to better tolerate NM downtime. #2142
Alert Manager: Made alerts clearer when a service is not up. #2105
Web Portal: Allow jsonc in job submission. #2084
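JSONC is JSON that also permits `//` and `/* */` comments. As a rough illustration of what accepting it in job submission involves, the sketch below strips comments and then parses the result with the standard `json` module. It is not the web portal's implementation from #2084; the function names are hypothetical, and it handles comments only (not trailing commas).

```python
import json


def strip_jsonc_comments(text: str) -> str:
    """Remove // line comments and /* block */ comments outside of strings."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Copy the escaped character so an escaped quote
                # does not end the string.
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept.
            while i < n and text[i] != "\n":
                i += 1
        elif text.startswith("/*", i):
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def load_jsonc(text: str):
    """Parse a JSONC document into Python objects."""
    return json.loads(strip_jsonc_comments(text))
```

Tracking whether the scanner is inside a string matters here because values such as URLs contain `//`, which must not be treated as the start of a comment.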

Deployment

Only restart the docker daemon if the configuration is updated. #2138

Documentation

Update document about docker data root's configuration. #2052
Improved how-to-setup-dev-box.md with more details. #2087
Improved hdfs_service.md with more details. #2096

Examples

Add an example of Horovod with RDMA & Intel MPI. #2112

Others

Build: Add error message when image build failed. #2133

Bug Fixes

Issue #2099 is fixed by:
- Launcher: Revise the definition of Framework running state. #2135
- REST server: Classify two states to WAITING. #2154
Kubernetes: Disable kubernetes's pod eviction. #2124
Grafana: Use yarn's metrics in cluster view. #2148
Add /usr/local/cuda/extras/CUPTI/lib64 to LD_LIBRARY_PATH. #2043

Upgrading from Earlier Release

Known Issue

Issue: The v0.10.1 upgrade has a known issue, #2433, that some users might hit. When it occurs, deploying the Kubernetes cluster with OpenPAI hangs.
Resolution: We have provided a hotfix, #2441. If your organization has no urgent need to upgrade to v0.10.1 by the end of March 2019, you can postpone the upgrade by a week, by which time we will release v0.11.0 (#2307), in which the known issue is officially fixed.

Please follow Upgrading to v0.10 for detailed instructions.


v0.10.1: Mar. 2019 Release