This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

v0.10.1: Mar. 2019 Release

@hao1939 hao1939 released this 20 Mar 05:05
· 1 commit to pai-0.10.y since this release

Release v0.10.1

New Features

  • Admins can configure MaxCapacity through the REST API for a given Virtual Cluster so that the virtual cluster can use idle resources as a bonus. #2147
  • Support Azure RDMA. #2091; how-to doc
  • New Disk Cleaner for abnormal disk usage: the disk cleaner checks disk usage every 60 seconds (configurable). If usage is above 94% (configurable), it kills the container that uses the most disk space with signal 10; the container exits with code 1 and the related job fails. Admins and users can find the reason in the job logs. #2119
  • Web portal: add "My jobs" filter button. #2111
  • "Submit Simple Job" web portal plugin. #2131 Document
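
The Disk Cleaner's decision rule described above can be sketched as follows. This is only an illustration of the rule as stated in these notes, not the actual implementation from #2119: the `ContainerUsage` type, the `pick_container_to_kill` function, and the container identifiers are hypothetical names, and how disk usage is measured or how the signal is delivered is not covered.

```python
from dataclasses import dataclass
from typing import List, Optional

# Defaults taken from the release note; both are described as configurable.
CHECK_INTERVAL_SECONDS = 60
USAGE_THRESHOLD_PERCENT = 94
# Signal number from the release note (10 is SIGUSR1 on x86 Linux).
KILL_SIGNAL = 10


@dataclass
class ContainerUsage:
    """Hypothetical record of one container's disk footprint."""
    container_id: str
    disk_bytes: int


def pick_container_to_kill(
    disk_usage_percent: float,
    containers: List[ContainerUsage],
    threshold: float = USAGE_THRESHOLD_PERCENT,
) -> Optional[ContainerUsage]:
    """Return the container using the most disk if usage is above the threshold.

    Returns None when usage is at or below the threshold, or when there
    are no containers to consider.
    """
    if disk_usage_percent <= threshold or not containers:
        return None
    return max(containers, key=lambda c: c.disk_bytes)
```

A caller would run this check once per `CHECK_INTERVAL_SECONDS` and, when a container is returned, send it `KILL_SIGNAL`; per the note, that container then exits with code 1 and its job fails.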

Improvements

Service

  • Hadoop: Improved log readability by disabling an unused HDFS short-circuit setting. #2027
  • Extended the job log retention time from 7 days to 30 days, and made the retention time configurable by admins. #2034
  • Optimized the RM and Yarn's default configurations for PAI to reduce the resource usage by AM. #2072
  • Pylon: WebHDFS library compatibility. #2134
  • Extended the NM expiry time from 15 minutes to 60 minutes to better tolerate NM downtime. #2142
  • Alert Manager: Made alerts clearer when a service is not up. #2105
  • Web Portal: Allow jsonc in job submission. #2084
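
To illustrate what accepting JSONC means for a job submission, here is a minimal sketch that strips `//` and `/* */` comments outside string literals and then parses the result as ordinary JSON. It is an assumption-laden illustration, not the web portal's parser from #2084: the function names are hypothetical, and other JSONC extensions such as trailing commas are not handled.

```python
import json


def strip_jsonc_comments(text: str) -> str:
    """Remove // and /* */ comments that sit outside string literals."""
    out = []
    i = 0
    n = len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Copy the escaped character verbatim so \" does not end the string.
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line, keeping the newline itself.
            while i < n and text[i] != "\n":
                i += 1
        elif text.startswith("/*", i):
            # Skip to the closing marker, or to the end if it is missing.
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def load_jsonc(text: str):
    """Parse JSONC text by stripping comments, then using the standard JSON parser."""
    return json.loads(strip_jsonc_comments(text))
```

For example, a commented job config (field names here are illustrative) parses to a plain dictionary, and a `//` inside a string value such as a URL is left untouched.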

Deployment

  • Only restart the Docker daemon if the configuration is updated. #2138

Documentation

  • Update document about docker data root's configuration. #2052
  • Improved how-to-setup-dev-box.md with more details. #2087
  • Improved hdfs_service.md with more details. #2096

Examples

  • Add an example of Horovod with RDMA and Intel MPI. #2112

Others

  • Build: Add error message when image build failed. #2133

Bug Fixes

  • Issue #2099 is fixed by
    • Launcher: Revise the definition of Framework running state. #2135
    • REST server: Classify two states to WAITING. #2154
  • Kubernetes: Disable Kubernetes pod eviction. #2124
  • Grafana: Use YARN's metrics in cluster view. #2148
  • Add /usr/local/cuda/extras/CUPTI/lib64 to LD_LIBRARY_PATH. #2043

Upgrading from Earlier Release

Known Issue

Issue: There is a known issue, #2433, that some users might hit when upgrading to v0.10.1. When it occurs, deploying the Kubernetes cluster with OpenPAI hangs.
Resolution: A hotfix, #2441, is available. If your organization does not urgently need to upgrade to v0.10.1 by the end of March 2019, you can postpone the upgrade by a week, by which time we will release v0.11.0 (#2307), in which this issue is officially fixed.

Please follow Upgrading to v0.10 for detailed instructions.