Provide more helpful error messages to the GitHub user #146

Closed
LilithHafner opened this issue Dec 3, 2022 · 8 comments · Fixed by #154

Comments

@LilithHafner (Contributor) commented Dec 3, 2022

Edit by @maleadt:

Nanosoldier currently replies with a pretty unhelpful message when anything goes wrong: JuliaLang/julia#47788 (comment). This behavior was introduced in #114, because the error logging code had inadvertently logged environmental details, including AWS tokens. That's really bad, so we ripped out the functionality that reports errors back to the user.

Some of that functionality should be brought back, or a safe version of it at least. For example, we should be able to safely report when a failure happened (during parsing of the invocation, during test execution, etc).
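
For concreteness, here is a minimal sketch of what such a safe report could look like (the names JobPhase, SAFE_MESSAGES, and safe_reply below are illustrative, not Nanosoldier's actual API): the bot picks one of a few pre-written messages keyed on the phase that failed, and never interpolates exception text or environment state into the reply.

# Illustrative sketch only, not Nanosoldier's actual code.
@enum JobPhase ParsingInvocation PreparingBuild RunningTests UploadingResults

const SAFE_MESSAGES = Dict(
    ParsingInvocation => "Your job failed while parsing the trigger comment; please check the invocation syntax.",
    PreparingBuild    => "Your job failed while building Julia or preparing the benchmark environment.",
    RunningTests      => "Your job failed while executing the requested tests or benchmarks.",
    UploadingResults  => "Your job failed while uploading the results.",
)

# Only a phase value chosen by our own code ever reaches the reply comment;
# exception messages and environment state are never interpolated, so nothing
# sensitive (AWS tokens, etc.) can leak.
safe_reply(phase::JobPhase) = SAFE_MESSAGES[phase] * " An admin can check the detailed logs."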

@maleadt (Member) commented Dec 3, 2022

There is an error message? It clearly says your job failed 🙂 We intentionally removed additional details, see #114, because otherwise we risk leaking environmental details into the reply comment (as has happened before). Easiest solution is to have an admin take a look.

Of course, I'm inferring that this is what you're complaining about. It doesn't hurt to include a word or two about what you're actually filing an issue about.

@maleadt closed this as completed Dec 3, 2022
@LilithHafner (Contributor, Author)

Is there documentation anywhere on what to do when you receive the message "Your job failed."? How can I figure out why the job failed? How can I get it to not fail? It seems to me that having jobs fail without explanation is an issue.

@maleadt (Member) commented Dec 3, 2022

> How can I figure out why the job failed?

You can't; that's what I linked to above. If the job fails, an admin should be pinged to investigate (that ping is what's missing from the comment here). We could probably improve this, but the case is so rare that I don't think it's worth the development effort.

@maleadt (Member) commented Dec 3, 2022

> We could probably improve this, but the case is so rare that I don't think it's worth the development effort.

Might be worth keeping the issue open though, in case anybody wants to help out.

@maleadt reopened this Dec 3, 2022
@maleadt changed the title from "runtests job failed without error message" to "Provide more helpful error messages to the GitHub user" on Dec 3, 2022
@LilithHafner (Contributor, Author)

One workaround for folks without access to the nanosoldier machines is to use the CI at BaseBenchmarks.jl, which provides nice stack traces (I suspect that the errors fixed here are at the root of my use case).

@vtjnash (Member) commented Dec 7, 2022

Fwiw, I think the driver script segfaulted, rather than just the tests failing on nanosoldier. The stacktrace generated on nanosoldier seemed pretty useless (I have posted it somewhere else).

@LilithHafner (Contributor, Author)

Yes. The bug is an out-of-bounds array access inside an @inbounds block. The job needs to run with --checkbounds=yes to get a usable stacktrace.
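
To make that concrete, here is a minimal standalone illustration of the failure mode (not the actual BaseBenchmarks code): @inbounds removes the bounds check, so the bad access is undefined behavior by default, while --checkbounds=yes forces the checks back on and produces a normal BoundsError with a stacktrace.

# demo.jl -- illustrative only
function unsafe_sum(v::Vector{Float64}, n::Integer)
    s = 0.0
    @inbounds for i in 1:n   # caller may pass n > length(v); @inbounds skips the check
        s += v[i]            # out-of-bounds read: undefined behavior by default
    end
    return s
end

unsafe_sum(rand(3), 10)

# julia demo.jl                    -> garbage result or a segfault, no useful trace
# julia --checkbounds=yes demo.jl  -> BoundsError with a usable stacktrace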

@vtjnash (Member) commented Dec 7, 2022

This line is throwing errors, which is preventing us from finishing cleanup and uploading the intended logs:

run(sudo(`$cset shield -e -- sudo -n -u $(cfg.user) -- $(shscriptpath)`))

While we should avoid looking for the comparison data if that line fails (and abort the run), we also should not throw a Julia exception there when the script fails, since that bypasses all cleanup.

It shouldn't be possible for any secret data to leak into our log files, since the user task now runs as a separate unprivileged user, so we should try hard to always upload them.
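
A minimal sketch of that change, reusing sudo, cset, cfg, and shscriptpath from the snippet above (assuming sudo returns a Cmd; upload_logs and cleanup are hypothetical placeholders for the existing cleanup steps): wrap the command in ignorestatus so a non-zero exit doesn't throw, check the exit status to decide whether to abort before looking for comparison data, and keep log upload and cleanup in a finally block so they run regardless.

script_ok = false
try
    # ignorestatus: run() no longer throws when the script exits non-zero
    proc = run(ignorestatus(sudo(`$cset shield -e -- sudo -n -u $(cfg.user) -- $(shscriptpath)`)))
    script_ok = success(proc)
finally
    # always runs, even if the script failed or something above threw
    upload_logs(cfg)   # hypothetical helper; logs contain no secrets, so always upload
    cleanup(cfg)       # hypothetical helper; tear down the shielded cpuset, temp dirs, etc.
end

script_ok || error("benchmark script failed; skipping comparison, see uploaded logs")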
