
storage manager crashed for unknown reason #4081

Closed
2 of 5 tasks
fanyangCS opened this issue Dec 28, 2019 · 3 comments

Comments

@fanyangCS
Contributor

fanyangCS commented Dec 28, 2019

The storage manager restarted for an unknown reason, and the restart did not succeed because the apt update/install step in the entrypoint failed: https://github.com/Microsoft/pai/blob/master/src/storage-manager/deploy/scripts/entrypoint.sh#L38

  • we need the watchdog to monitor the storage manager and send alerts when it is unhealthy
  • fix the apt install issue: we should pin the samba version and avoid apt update at container start by including the package in the docker image (see the sketch after this list) [Storage-Manager] move apt-get to docker image #4086
  • after the pod restarts, the original mount point cannot be used; it seems the previous resource did not get released. We need a way to fix this.
  • set the pod QoS class of the storage manager to high and assign the right amount of resources (memory, CPU, etc.) to it
  • persist the log to the host file system [Storage Manager] persist storage logs #4088
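
A minimal sketch of what the apt item above could look like when moved to image build time (not the actual change in #4086); it assumes a Debian/Ubuntu base image, and the pinned version is left as a placeholder:

# Run once at image build time (e.g. in a Dockerfile RUN step) instead of in entrypoint.sh,
# so container startup no longer depends on apt-get update succeeding.
apt-get update
# Pin samba to a known-good version where possible, e.g. apt-get install -y samba=<version>.
apt-get install -y --no-install-recommends samba
# Trim the apt cache to keep the image small.
rm -rf /var/lib/apt/lists/*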
@fanyangCS fanyangCS assigned fanyangCS and Binyang2014 and unassigned fanyangCS Dec 28, 2019
@scarlett2018 scarlett2018 added this to the Pure K8S Beta Release milestone Dec 30, 2019
@scarlett2018 scarlett2018 changed the title from "storage manager crashed for unknown reason" to "P0.5 - storage manager crashed for unknown reason" Dec 30, 2019
@Binyang2014
Contributor

Binyang2014 commented Dec 30, 2019

For issue 2: already fixed in PR #4086.
For issue 3: the user can try the following commands:

  1. umount /data and umount /home. If both succeed, redo the mount command and the NFS share will mount into the container.
  2. If you hit the error umount.nfs4: /home: device is busy, run:
    apt-get install lsof
    lsof | grep /home
    Find the process that holds the handle (for example: bash 1114 root cwd unknown /home/), kill it, then run umount /home again.
  3. Retry the mount command. The NFS share will then mount into the container successfully.

If the previous mount point is not unmounted, remounting NFS will fail with the error: Reason given by server: No such file or directory.

If the user cannot umount the previous mount point, they can still mount the NFS share, but they need to notice that the NFS path has changed.
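
A consolidated sketch of the recovery steps above, run inside the container; the NFS server address and export path are placeholders, since the actual values depend on the cluster configuration:

# Recovery sketch for a stale mount after the storage-manager pod restarts.
# <nfs-server> and <export-path> are placeholders for cluster-specific values.
umount /data /home || {
    # If umount reports "device is busy", find the process holding the mount point.
    apt-get install -y lsof
    lsof | grep -E '/data|/home'
    # kill <pid reported by lsof>, then retry: umount /data /home
}
# Once the old mount points are released, remount the NFS exports.
mount -t nfs4 <nfs-server>:<export-path>/data /data
mount -t nfs4 <nfs-server>:<export-path>/home /home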

@Binyang2014
Contributor

For issue 5: will persist the samba log for future debugging: #4088
After checking the source code, the k8s livenessProbe may be what causes the pod restart. Here is the service health check, which runs every 10 seconds:

# check nfs
# cannot use service nfs-kernel-server status directly
# because nfsd is running in host
ps -aux | grep -v grep | grep rpc.mountd &> /dev/null
nfsstatus=$?
# check smb
service smbd status &> /dev/null
smbstatus=$?
exit `expr $nfsstatus + $smbstatus`
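
To see which of the two services is failing when the probe trips, the same checks can be run by hand inside the storage-manager pod; the pod name and namespace below are placeholders:

# Placeholders: replace <namespace> and <storage-manager-pod> with the real values.
kubectl exec -n <namespace> <storage-manager-pod> -- bash -c \
  'ps -aux | grep -v grep | grep rpc.mountd; echo "nfs check: $?"; service smbd status; echo "smb check: $?"'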

Maybe the samba service went down for a while and caused the pod to restart, so we persist the samba log here.

Persisting stdout/stderr will not help us debug this problem, since the main process just sleeps forever after the pod starts:

# sleep
echo "sleep infinity ----------"
sleep infinity
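
As a sketch of an alternative (not what #4088 does), the entrypoint could stream the samba log to stdout instead of sleeping silently, so kubectl logs would show it; the log path assumes the default Debian/Ubuntu samba layout and may differ in this image:

# Sketch only: tail the samba log to the container's stdout instead of a bare sleep.
# /var/log/samba/log.smbd is the default Debian/Ubuntu location.
echo "tailing samba log ----------"
exec tail -F /var/log/samba/log.smbd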

@fanyangCS fanyangCS changed the title from "P0.5 - storage manager crashed for unknown reason" to "storage manager crashed for unknown reason" Feb 27, 2020
@scarlett2018
Member

  • we need the watchdog to monitor the storage manager and send alerts when it is unhealthy

~~(workaround available, won't fix now) after the pod restart, the original mount point cannot be used; it seems the previous resource did not get released. Need a way to fix this.~~
(workaround available, won't fix now) set the pod QoS class of the storage manager to high and assign the right amount of resources (memory, CPU, etc.) to it
