
storage manager crashed for unknown reason #4081

Closed
2 of 5 tasks
fanyangCS opened this issue Dec 28, 2019 · 3 comments

Comments

@fanyangCS
Contributor

fanyangCS commented Dec 28, 2019

The storage manager restarted for an unknown reason, and the restart did not succeed because the apt update/install step in the entrypoint failed: https://github.com/Microsoft/pai/blob/master/src/storage-manager/deploy/scripts/entrypoint.sh#L38

  • we need the watchdog to monitor the storage manager and send alerts when it is unhealthy
  • fix the apt install issue: we should pin the samba version and avoid apt update at container start by including the package in the docker image (see the sketch after this list) [Storage-Manager] move apt-get to docker image #4086
  • after the pod restarts, the original mount point cannot be used; it seems the previous resource did not get released. We need a way to fix this.
  • set the pod QoS class of the storage manager to high and assign the right amount of resources (memory, CPU, etc.) to it
  • persist the log to the host file system [Storage Manager] persist storage logs #4088
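
A minimal sketch of what the apt item above could look like when moved to image build time (not the actual change in #4086); it assumes a Debian/Ubuntu base image, and the pinned version is left as a placeholder:

# Run once at image build time (e.g. in a Dockerfile RUN step) instead of in entrypoint.sh,
# so container startup no longer depends on apt-get update succeeding.
apt-get update
# Pin samba to a known-good version where possible, e.g. apt-get install -y samba=<version>.
apt-get install -y --no-install-recommends samba
# Trim the apt cache to keep the image small.
rm -rf /var/lib/apt/lists/*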
@fanyangCS fanyangCS assigned fanyangCS and Binyang2014 and unassigned fanyangCS Dec 28, 2019
@scarlett2018 scarlett2018 added this to the Pure K8S Beta Release milestone Dec 30, 2019
@scarlett2018 scarlett2018 changed the title from "storage manager crashed for unknown reason" to "P0.5 - storage manager crashed for unknown reason" Dec 30, 2019
@Binyang2014
Contributor

Binyang2014 commented Dec 30, 2019

For issue 2: already fixed in PR #4086.
For issue 3: the user can try the following commands:

  1. umount /data and umount /home. If both succeed, redo the mount command and the NFS share will mount into the container.
  2. If you hit the error umount.nfs4: /home: device is busy, run:
    apt-get install lsof
    lsof | grep /home
    Find the process that holds the handle (for example: bash 1114 root cwd unknown /home/), kill it, then run umount /home again.
  3. Retry the mount command. The NFS share will then mount into the container successfully.

If the previous mount point is not unmounted, remounting NFS will fail with the error: Reason given by server: No such file or directory.

If the user cannot umount the previous mount point, they can still mount the NFS share, but they need to notice that the NFS path has changed.
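
A consolidated sketch of the recovery steps above, run inside the container; the NFS server address and export path are placeholders, since the actual values depend on the cluster configuration:

# Recovery sketch for a stale mount after the storage-manager pod restarts.
# <nfs-server> and <export-path> are placeholders for cluster-specific values.
umount /data /home || {
    # If umount reports "device is busy", find the process holding the mount point.
    apt-get install -y lsof
    lsof | grep -E '/data|/home'
    # kill <pid reported by lsof>, then retry: umount /data /home
}
# Once the old mount points are released, remount the NFS exports.
mount -t nfs4 <nfs-server>:<export-path>/data /data
mount -t nfs4 <nfs-server>:<export-path>/home /home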

@Binyang2014
Contributor

For issue 5: will persist the samba log for future debugging: #4088
After checking the source code, the k8s livenessProbe may be what causes the pod restart. Here is the service health check, which runs every 10 seconds:

# check nfs
# cannot use service nfs-kernel-server status directly
# because nfsd is running in host
ps -aux | grep -v grep | grep rpc.mountd &> /dev/null
nfsstatus=$?
# check smb
service smbd status &> /dev/null
smbstatus=$?
exit `expr $nfsstatus + $smbstatus`
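
To see which of the two services is failing when the probe trips, the same checks can be run by hand inside the storage-manager pod; the pod name and namespace below are placeholders:

# Placeholders: replace <namespace> and <storage-manager-pod> with the real values.
kubectl exec -n <namespace> <storage-manager-pod> -- bash -c \
  'ps -aux | grep -v grep | grep rpc.mountd; echo "nfs check: $?"; service smbd status; echo "smb check: $?"'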

Maybe the samba service went down for a while and caused the pod to restart, so we persist the samba log here.

Persisting stdout/stderr will not help us debug this problem, since the main process just sleeps forever after the pod starts:

# sleep
echo "sleep infinity ----------"
sleep infinity
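
As a sketch of an alternative (not what #4088 does), the entrypoint could stream the samba log to stdout instead of sleeping silently, so kubectl logs would show it; the log path assumes the default Debian/Ubuntu samba layout and may differ in this image:

# Sketch only: tail the samba log to the container's stdout instead of a bare sleep.
# /var/log/samba/log.smbd is the default Debian/Ubuntu location.
echo "tailing samba log ----------"
exec tail -F /var/log/samba/log.smbd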

@fanyangCS fanyangCS changed the title from "P0.5 - storage manager crashed for unknown reason" to "storage manager crashed for unknown reason" Feb 27, 2020
@scarlett2018
Member

  • we need the watchdog to monitor the storage manager and send alerts when it is unhealthy

~~(workaround available, won't fix now) after the pod restart, the original mount point cannot be used; it seems the previous resource did not get released. Need a way to fix this.~~
(workaround available, won't fix now) set the pod QoS class of the storage manager to high and assign the right amount of resources (memory, CPU, etc.) to it
