This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

[Job Debugging] Provide detail information when the job container exits. #2218

Closed
ydye opened this issue Feb 26, 2019 · 3 comments

@ydye
Contributor

ydye commented Feb 26, 2019

What would you like to be added:

Sometimes a user's job is killed by OpenPAI itself rather than failing because of a user error.
We can classify these failures as system errors. A container that fails due to a system error
can't be reserved, so more detailed logs should be provided. That way users can see why their job container exited and why it can't be reserved for job debugging.

An example of a system error that will make your job fail:
    - Disk pressure

Why is this needed:

If a job container can't be reserved because it failed due to a system error, users may be confused.

Without this feature, how does the current module work

Users have to investigate the job log themselves.

Components that may involve changes:

TBD
@ydye changed the title from "[Job Debugging] provide detail information when the job container is exit." to "[Job Debugging] Provide detail information when the job container exits." Feb 26, 2019
@Binyang2014 self-assigned this Aug 5, 2019
@Binyang2014
Contributor

refer to kubernetes/kubernetes#140

@Binyang2014
Contributor

This item will be done for the K8s version. Several tasks:

  • Map the Docker exit-info mechanism from dockerScript to the kube runtime
  • Reorganize the error spec: remove the YARN errors and add new error types for the K8s version
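
One possible way to carry exit information over to the kube runtime is a small wrapper that runs the user command and writes a summary to the container's termination message file. This is only a sketch under assumptions, not the OpenPAI implementation: `run_and_report` and the JSON field names are hypothetical. `/dev/termination-log` is the default Kubernetes `terminationMessagePath`, and the `128 + signal` exit code follows the usual shell convention (SIGILL is signal 4, hence 132).

```python
import json
import subprocess
import sys


def run_and_report(cmd, message_path="/dev/termination-log"):
    """Run cmd, then write a short JSON exit summary to message_path.

    Hypothetical helper: the function name and field names are assumptions.
    """
    result = subprocess.run(cmd)
    code = result.returncode
    # On POSIX a negative return code means the child was killed by that signal.
    signal_number = -code if code < 0 else None
    summary = {
        # Shell-style exit code: 128 + signal number when killed by a signal.
        "exitCode": 128 + signal_number if signal_number else code,
        "signal": signal_number,
    }
    with open(message_path, "w") as f:
        json.dump(summary, f)
    print(f"exit summary: {summary}", file=sys.stderr)
    return summary
```

For example, `run_and_report(["python", "train.py"])` would leave the summary where `kubectl describe pod` can show it as the container's termination message.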

@Binyang2014
Contributor

Binyang2014 commented Aug 13, 2019

Add failurePattern for runtime:

- errorType: user/system
  patterns:
    pattenExitCode:
      exitCode: 132
      userLog: ""
      runtimeLog: ""
      # more can be added here
    pattenUserLog:
      userLog: "This is an error"
  reason: 'User program terminated by SIGILL'
  solution: 'Please check the log and retry again'
  containerExitCode: 132
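
To illustrate how a spec like the one above might be used, here is a minimal Python sketch of the matching step. The semantics are assumptions, since the proposal does not define them: a spec applies if any of its patterns matches, every non-empty field within a pattern must match, and the log fields are treated as regular expressions. The names `Pattern`, `FailureSpec`, and `classify` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Pattern:
    # Unset or empty fields mean "do not care" (assumed semantics).
    exit_code: Optional[int] = None
    user_log: Optional[str] = None     # regex searched in the user log
    runtime_log: Optional[str] = None  # regex searched in the runtime log

    def matches(self, exit_code, user_log, runtime_log):
        if self.exit_code is not None and self.exit_code != exit_code:
            return False
        if self.user_log and not re.search(self.user_log, user_log):
            return False
        if self.runtime_log and not re.search(self.runtime_log, runtime_log):
            return False
        return True


@dataclass
class FailureSpec:
    error_type: str  # "user" or "system"
    reason: str
    solution: str
    container_exit_code: int
    patterns: list = field(default_factory=list)


def classify(specs, exit_code, user_log="", runtime_log=""):
    """Return the first spec with at least one matching pattern, else None."""
    for spec in specs:
        if any(p.matches(exit_code, user_log, runtime_log) for p in spec.patterns):
            return spec
    return None


# The example entry from the proposal, expressed with the classes above.
SIGILL = FailureSpec(
    error_type="user",
    reason="User program terminated by SIGILL",
    solution="Please check the log and retry again",
    container_exit_code=132,
    patterns=[Pattern(exit_code=132), Pattern(user_log="This is an error")],
)
```

Calling `classify([SIGILL], 132)` returns the SIGILL entry, whose `reason` and `solution` could then be shown to the user next to the exit code.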
