v0.10.1: Mar. 2019 Release
Release v0.10.1
New Features
- Admin can configure MaxCapacity through the REST API for a given Virtual Cluster so that the virtual cluster can use idle resources as a bonus. #2147
- Support Azure RDMA. #2091; how-to doc
- New Disk Cleaner for abnormal disk usage: the disk cleaner checks disk usage every 60 seconds (configurable). If disk usage is above 94% (configurable), it kills the container that uses the most disk space with signal 10; the container exits with code 1 and the related job fails. Admins and users can find the reason in the job logs. #2119
- Web portal: add "My jobs" filter button. #2111
- "Submit Simple Job" web portal plugin. #2131 Document
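The Disk Cleaner policy described in the list above can be sketched roughly as follows. This is a minimal illustration of the stated behavior (threshold check, pick the largest consumer, send signal 10), not the actual OpenPAI implementation. The function names, the `container_sizes` mapping, and the `kill` callback are all hypothetical; only the 60-second interval, the 94% threshold, and the signal number come from the release note.

```python
import shutil

# Values taken from the release note; both are described as configurable.
CHECK_INTERVAL_SECONDS = 60
USAGE_THRESHOLD_PERCENT = 94
KILL_SIGNAL = 10  # signal number stated in the release note


def disk_usage_percent(path="/"):
    """Return used space on the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def pick_victim(container_sizes):
    """Return the ID of the container using the most disk, or None if empty.

    `container_sizes` is a hypothetical mapping of container ID -> bytes used.
    """
    if not container_sizes:
        return None
    return max(container_sizes, key=container_sizes.get)


def check_once(usage_percent, container_sizes, kill):
    """Run one cleaner pass and return the killed container ID, or None.

    `kill` is a hypothetical callback taking (container_id, signal_number).
    """
    if usage_percent <= USAGE_THRESHOLD_PERCENT:
        return None  # usage is not above the threshold; nothing to do
    victim = pick_victim(container_sizes)
    if victim is not None:
        kill(victim, KILL_SIGNAL)
    return victim
```

In a real service, `check_once` would presumably run in a loop that sleeps for `CHECK_INTERVAL_SECONDS` between passes, with `container_sizes` measured from the container runtime.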
Improvements
Service
- Hadoop: Improved log readability by disabling an unused HDFS short-circuit setting. #2027
- Extended the job log retention time from 7 days to 30 days and made it configurable by admins. #2034
- Optimized the RM and Yarn's default configurations for PAI to reduce the resource usage by AM. #2072
- Pylon: WebHDFS library compatibility. #2134
- Extended the NM expiry time from 15 minutes to 60 minutes to better tolerate NM downtime. #2142
- Alert Manager: Make the alert clearer when a service is not up. #2105
- Web Portal: Allow jsonc in job submission. #2084
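To illustrate what accepting JSONC in job submission involves, here is a minimal sketch that strips `//` and `/* */` comments before handing the text to a standard JSON parser. It is an assumption about one way to do this, not the web portal's actual code. The helper names are hypothetical, the field names in the usage example are illustrative only, and trailing commas are not handled.

```python
import json


def strip_jsonc_comments(text):
    """Remove // and /* */ comments from JSONC text, leaving strings intact."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                out.append(text[i + 1])  # keep the escaped character as-is
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Line comment: skip to end of line, keeping the newline itself.
            while i < n and text[i] != "\n":
                i += 1
        elif text.startswith("/*", i):
            # Block comment: skip past the closing */ (or to end of input).
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def load_jsonc(text):
    """Parse JSONC text into Python objects."""
    return json.loads(strip_jsonc_comments(text))
```

For example, `load_jsonc('{ "jobName": "demo" // a comment\n}')` parses the same way as the plain JSON without the comment, and slashes inside string values such as URLs are left untouched.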
Deployment
- Only restart the Docker daemon if the configuration is updated. #2138
Documentation
- Update document about docker data root's configuration. #2052
- Improved how-to-setup-dev-box.md with more details. #2087
- Improved hdfs_service.md with more details. #2096
Examples
- Add an example of Horovod with RDMA & Intel MPI. #2112
Others
- Build: Add error message when image build failed. #2133
Bug Fixes
- Issue #2099 is fixed by
- Kubernetes: Disable Kubernetes pod eviction. #2124
- Grafana: Use yarn's metrics in cluster view. #2148
- Add /usr/local/cuda/extras/CUPTI/lib64 to LD_LIBRARY_PATH. #2043
Upgrading from Earlier Release
Known Issue
Issue: Upgrading to v0.10.1 has a known issue, #2433, that some users might hit. When it occurs, deploying the Kubernetes cluster with OpenPAI hangs.
Resolution: We have provided a hotfix, #2441. If your organization has no urgent need to upgrade to v0.10.1 by the end of March 2019, you can postpone the upgrade by a week, by which time we will release v0.11.0 (#2307), in which the known issue is officially fixed.
Please follow Upgrading to v0.10 for detailed instructions.