Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFE] - Error Reporting and Normalization #126

Open
shawn-hurley opened this issue Jun 13, 2023 · 11 comments
Open

[RFE] - Error Reporting and Normalization #126

shawn-hurley opened this issue Jun 13, 2023 · 11 comments

Comments

@shawn-hurley
Copy link
Contributor

As a user, understanding what high-level errors mean can help you quickly debug and solve issues. While there will always be a need for deep dive debugging, expecting users to search through hundreds of log lines that are not for their application, I believe, is an unrealistic expectation.

We need a cascading set of error/warning information to display to users to give them a complete picture of the unhappy path scenarios.

Full Stop Errors (sometimes referred to as "fatal")

The most apparent set of errors is when one of the following occurs:

  1. The addon image is unable to be run.
  2. The addon code itself fails if there is any validation of the inputs that could cause this failure.
  3. The code the addon calls fails with validation error (rule invalid to be parsed, providers are not configured correctly or fail to start. Timeouts fall into this category as well.

For this set of errors, we want to consider that:

  1. this means the user will get 0 meaningful data from this run.
  2. That we probably want to have some defined types of errors, like but not matching "ImagePullBackoff" vs. the "ContainerFailed" or the "StartUpProbeFailed" like Kube will get you.
  3. The common "heuristic" for this type of error is that the user gets no data, and most likely, something must be fixed by the user in the input to the analysis run.

Fall Through Errors or "Soft Errors"

There is a second set of Errors that can occur. These errors will mostly happen when running the code the addon will call. The following are examples.

The clear case for this is:

  1. The Addon runs successfully
  2. The code the addon calls runs successfully but has "errors" in the output as well as data that complete
  3. The User gets some set of "good" data, but some rules have errored, and there is only a subset of the conclusions you can make from the analysis run.
  4. The rules could fail for a couple of different reasons, the rule is malformed for the capability, the provider fails for that particular condition or instance, the code location's/custom variables are unfindable, and therefore the message may be malformed.

These particular scenarios should be captured and displayed. They basically say that some information may be useful, but you can not say all the data that is returned is correct.

Warnings

We have the concept of warnings here that we should consider and categorize. I see this happening when something like the following occurs.

  1. The task is taking too long
  2. The addon completes with no issues being reported but has no violations to save.

Another concern in this vein is that it would be nice, to have some way to show the skipped and unmatched rules. This information will be vital for a user with " new-custom-rule-x " not being matched. If the only option is assuming this is the case because it is not in the issues screen, it seems like it would be a less desirable UX/UI. We need some way to note this information.

Skipped rules are filtered out by the user in the input. These should fall into the above category of Unmatched.

Because the searching for violations is not, AFAIK, tied to a specific analysis but rather to the latest run, we need to have a way for users to see why maybe specific violations/rules are disappeared. This could be our most significant source of customer cases.

@jwmatthews
Copy link
Member

Related to #123

@jortel
Copy link
Contributor

jortel commented Jun 15, 2023

For this set of errors, we want to consider that:

  1. this means the user will get 0 meaningful data from this run.

Can you elaborate on this?

@shawn-hurley
Copy link
Contributor Author

@jortel if an analysis run happens, and the thing is fatal, then nothing will be reported back right, and therefore nothing meaningful in terms of analysis was my thought process.

does that help or are we still missing something?

@jortel
Copy link
Contributor

jortel commented Jun 20, 2023

@shawn-hurley There are 3 levels of fatal.

  1. The task pod has fatal error (image pull error etc)
  2. The addon (adapter) has fatal error. (validation, git error etc)
  3. A command: analyzer or analyzer-dep exit != 0.

Though 2 & 3 are kind of the same thing.

All errors/warnings currently are (or can be) reported in Task.Error (with further details in Task.Activity).

Related:

Seems the UI can simply enable an icon and show the Task.Errors when not empty.

(Crude) Example of TaskReport (addon) errors:

Severity Description
Error Rule: Rule-001 has invalid syntax.
Warning Rule: Rule-002 label selector not matched.

(Crude) Example of Task errors:

Severity Description
Error Pod failed: Image quay.io/addon-XXX not found
Severity Description
Error Git: clone failed: fatal: repository 'http://github.com/xxxxx/' not found.

@shawn-hurley
Copy link
Contributor Author

shawn-hurley commented Jun 20, 2023

I agree this makes sense for the fatal errors, and if we can just show that error, along with some text, "Cluster was unable to pull the image " or "git credentials were unauthorized" or "git repository not found" the more we can bucket it and give user exact statements of what wrong the better, especially for the fatal errors.

Starting with the fatal errors, my goals are twofold:

  1. Make it obvious when simple things fail (git creds, repos not found, rules being malformed, etc.) and how to remedy the situation.
  2. Make it possible to debug more into things when hard/complicated things fail. Think of this as the escape hatch.

I would imagine the data we would want is something like this:

  1. The place in the process the error occurred, this would likely help us, if/when we get support cases
  2. The error bucketed in our predetermined bucket. This will allow us to write documentation that a user can easily find to help them on their way if we standardize this correctly. (Think ImagePullBackoff, or NodePressure statuses).
  3. A section for more information says something specific about this particular failure, "git failed authentication" or "git repo was not found at ." For the analyzer, it could be something like "rules are unparsable, here is your error" or "no provider found for cap X."
  4. Severity, probably one of fatal error, warning. It seems like we are starting to use this terminology already. Maybe we just stick with it.
  5. We want "more information, where we can show the full task log". I would imagine this would be useful for those cases we can not bucket. I think taking the task actions (sorry if this is the wrong word but the full accounting of everything that was run).
  6. I would love to ensure that we have some option to run things at a higher verbosity for the entire run; doing this will help on those calls w/ a customer for us or support to help diagnose the issue.

Note: we may only have a few buckets right now, but I think this is ok. With the extra escape hatches, users can still get to the hard debugging answers, while hopefully the easy issues are immediately diagonsable.

Adding lots of folks that also had to do this exercise with MTC, please tell me if I am missing a huge step that y'all had to take, or if this is over kill.

@jortel @jwmatthews @pranavgaikwad @fabianvf @mansam @djzager @jmontleon @JustinXHale

@JustinXHale
Copy link
Member

What currently happens when a user runs into an error?

@shawn-hurley
Copy link
Contributor Author

@JustinXHale I might be the worst person to ask, but my understanding as of today is there is a log of actions that a task took and that if the task fails, there is an error field, IIUC correctly based on the issues above I think it is a single string right now.

@jortel @mansam is this close to correct?

@jortel
Copy link
Contributor

jortel commented Jun 20, 2023

Correct.
An Error reported by the addon can be viewed though the UI kebab Analysis Details.

    status: Failed
    error: fatal: repository '[http://github.com/xxxxx/](https://github.com/xxxxx/)' not found.
    activity:
        - Fetching application.
        - '[CMD] Running: /usr/bin/ssh-agent -a /tmp/agent.1'
        - '[CMD] succeeded.'
        - '[SSH] Agent started.'
        - '[GIT] Cloning: https://github.com/WASdev/sample.daytrader7.git'
        - '[FILE] Created /addon/.gitconfig.'
        - '[CMD] Running: /usr/bin/git clone https://github.com/WASdev/sample.daytrader7.git /addon/source/sample'
        - '[CMD] failed: fatal: repository '[http://github.com/xxxxx/](https://github.com/xxxxx/)' not found..'
    taskid: 6

There is an open issue to propagate k8s errors in
Task.Error

There is another issue aligned to addon-analyzer to propagate soft errors reported (not sure how) by the analyzer as non-fatal severity errors in Task.Error.

@jortel
Copy link
Contributor

jortel commented Jun 20, 2023

Another thing to keep in mind is the quality of error information surfaced to the user depends on the quality of the the information reported by the error origin. For example: for credentials, (in many cases) both Git and Subversion (and SSH) can report almost nothing useful.
Further, they don't report in any structured way. This would require parsing of stderr which is notoriously brittle and non-deterministic. IMHO, it's better to just show the user what is reported by the command. Humans are better at interpreting the text. We should be cautious about doing this. This is why the current approach is to show the user the raw error in context of the Activity log.

@shawn-hurley
Copy link
Contributor Author

shawn-hurley commented Jun 21, 2023

Hmm, looking for a string like "Authentication Failed" when using git doesn't seem overly burdensome or brittle to me. Am I missing something? There are two things that I can think of to be worried about:

  1. False Positive - something else failed, but we determined it was an authentication failed problem.
  2. False Negative- it was an authentication problem, but we were unable to determine it.

for 2, I think it is reasonable to assume this will happen, and that is why we do have the escape hatches.

for 1, I think that we need to be careful here, but if it is a simple check like above, I have a tough time seeing how this could be overly brittle.

What I worry about is that because the user has no direct access to debug things, we control the way that all of these tools are called, that if we don't have actual steps for them to take, it could be very frustrating for users.

I think it is reasonable to not try and boil the ocean, but I do think we can take some steps for common situations to try and make this better for users.

@jortel I agree on a fallback to just show the information to the user FYI. I just think that we can try and provide a better experience for certain cases, especially for things that we are putting together. (setting up git/svn creds that are stored in some other place in the hub) or for the rules that are coming for X place.

@jortel
Copy link
Contributor

jortel commented Jun 22, 2023

So far, I have not read any comments that indicate opposition to the iteration proposed in the linked, hub, ui and addon issues. This would be an low investment evolution of how/what is reported currently and does not seem to paint us into a corner.
Is there any objection?

@shawn-hurley

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants