Console doesn't connect when server is busy #3041
Comments
Log:
Hmm. Interesting problem. The server has a multithreaded executor pool that spawns four executors. Each executor here is stuck doing one of the sorts, leaving the pool drained of resources. The pool is limited so that it is not easy for users to unintentionally steal all of the CPU from update graph processing. Perhaps we should add a high-priority single-threaded executor for export work that is guaranteed to have a short runtime (possibly focused on UI interactions like auto-complete, config service lookups, and creating new sessions). @niloc132 opinions?
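To make the starvation concrete, here is a minimal sketch in Python using `concurrent.futures`. It is not the server's actual code: `general_pool`, `fast_lane`, `long_sort`, and `start_console` are hypothetical stand-ins. It only illustrates how a bounded pool of four workers stalls a short request, while a separate single-threaded executor still answers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: a bounded general pool and a single-threaded "fast lane".
general_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="general")
fast_lane = ThreadPoolExecutor(max_workers=1, thread_name_prefix="fast-lane")

release = threading.Event()

def long_sort(table_id):
    # Simulates a sort that holds its executor until released.
    release.wait()
    return f"sorted-{table_id}"

def start_console():
    # Simulates a short request that never blocks.
    return "console-ready"

# Saturate the general pool with four long-running sorts.
sorts = [general_pool.submit(long_sort, i) for i in range(4)]

# Queued behind the sorts: it cannot run until a worker frees up.
starved = general_pool.submit(start_console)

# The same request on the dedicated thread completes right away.
console_result = fast_lane.submit(start_console).result(timeout=5)
starved_done_while_busy = starved.done()

# Let the sorts finish so the queued request can finally run.
release.set()
sort_results = [f.result(timeout=5) for f in sorts]
starved_result = starved.result(timeout=5)

general_pool.shutdown()
fast_lane.shutdown()
```

The fast lane only helps if everything submitted to it really is short and non-blocking; one blocking task on that single thread recreates the same stall.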
Nate and I discussed this briefly the other day, and the problem is defining "guaranteed to have a short runtime". For example, the Python autocomplete operation requires calling into Python, which means taking the GIL, so it is uninterruptibly blocked while a long script is running (and as such would block this high-priority single thread). We would have to define these operations in a way that they cannot block for any reason. Start console and config service calls are probably safe; making a new auth session could be unsafe, depending on how the authentication provider is implemented. And since an auth session can be created for any HTTP call, technically any call might be unsafe... That isn't all that likely though (metrics/timing logging could help here for warnings/diagnosis).

Looking specifically at the thread dump... This sort is blocked on an earlier sort, on the same table, with the same sort params. This is a "good thing" in that we don't cancel the existing work and restart it when the second call comes in, but it is pretty far from ideal that we have two concurrent threads blocked waiting for one unit of work to complete.
The actual sort work is busy here: it is either attempting to take the GIL, or has taken it and is momentarily busy running interpreted Python code.
Next thread (out of 4 total by default) is also working on sorting a table, but either not the same originating table, or not the same sort (perhaps the user clicked sort twice, and so both ascending and descending are running at the same time?). This code is also apparently in Python, and either has the GIL (and is blocking the other thread), or is waiting for the GIL (so cannot proceed and free up this thread until it has had the chance to take it).
The last thread is blocked on the previous one, waiting for it to finish its work applying the same sort to the same table.
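One way to avoid parking a second pool thread on an identical request is to hand later callers the `Future` of the work already in flight. The sketch below shows that idea in Python under stated assumptions: `InFlightSorts`, its `(table_id, sort_spec)` key, and `slow_sort` are hypothetical names, and this is not how the server's memoization is actually implemented.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class InFlightSorts:
    """Hypothetical helper: identical sort requests share one Future."""

    def __init__(self, pool):
        self._pool = pool
        self._lock = threading.Lock()
        self._in_flight = {}

    def submit(self, table_id, sort_spec, sort_fn):
        key = (table_id, sort_spec)
        with self._lock:
            existing = self._in_flight.get(key)
            if existing is not None:
                # Same table and same sort: reuse the running work, no new thread.
                return existing
            future = self._pool.submit(sort_fn, table_id, sort_spec)
            self._in_flight[key] = future
        # Registered outside the lock, since the callback takes the lock itself.
        future.add_done_callback(lambda _f: self._forget(key, future))
        return future

    def _forget(self, key, future):
        with self._lock:
            if self._in_flight.get(key) is future:
                del self._in_flight[key]

calls = []
gate = threading.Event()

def slow_sort(table_id, sort_spec):
    calls.append((table_id, sort_spec))
    gate.wait()  # stands in for a long-running sort
    return f"{table_id} sorted by {sort_spec}"

pool = ThreadPoolExecutor(max_workers=4)
sorts = InFlightSorts(pool)
first = sorts.submit("t1", "price ASC", slow_sort)
second = sorts.submit("t1", "price ASC", slow_sort)
same_future = first is second

gate.set()
result = second.result(timeout=5)
pool.shutdown()
```

In this sketch the duplicate request costs a dictionary lookup instead of a worker, so two identical sorts occupy one thread rather than two.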
The serial thread is doing nothing.

Perhaps this implies the existence of some kind of "assertNoTableOperations"/"assertNoGil": if you want to use the "fast lane" thread, you opt out of those, possibly have a time budget, and can't join/wait on other threads? Alternatively, just add more threads, and try to mitigate threads one and two? Or try to permit a growing thread pool to handle "everything is blocked on something else"... and correctly implement canceling work?
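As a rough illustration of the "time budget" idea, the Python sketch below wraps fast-lane tasks with a timer and records any that overrun, which is the kind of timing diagnostic mentioned earlier in the thread. Everything here is an assumption for illustration: `budgeted`, `BUDGET_SECONDS`, and the two sample tasks are hypothetical, and the wrapper only detects an overrun after the fact; it cannot interrupt a blocked task.

```python
import time
from concurrent.futures import ThreadPoolExecutor

BUDGET_SECONDS = 0.1   # assumed budget, purely illustrative
over_budget = []       # (task name, elapsed seconds) for tasks that overran

def budgeted(name, fn):
    """Wrap fn so that exceeding the budget is recorded for diagnosis."""
    def wrapper(*args, **kwargs):
        started = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.monotonic() - started
            if elapsed > BUDGET_SECONDS:
                over_budget.append((name, elapsed))
    return wrapper

fast_lane = ThreadPoolExecutor(max_workers=1, thread_name_prefix="fast-lane")

def config_lookup():
    # A short task that belongs on the fast lane.
    return {"theme": "dark"}

def slow_autocomplete():
    # Stands in for a call that ends up waiting on the GIL.
    time.sleep(0.3)
    return ["suggestion"]

quick = fast_lane.submit(budgeted("config_lookup", config_lookup)).result(timeout=5)
slow = fast_lane.submit(budgeted("autocomplete", slow_autocomplete)).result(timeout=5)
fast_lane.shutdown()

offenders = [name for name, _ in over_budget]
```

A record like this could feed a warning log, making it visible when an operation assumed to be short turns out to block.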
It probably doesn't matter too much what those threads are doing. Obviously, it's a shame if they are blocking each other's access to the GIL. I was able to reproduce this using a Groovy REPL and forcing the UI to sort four distinct, large tables. I don't like the answer of adding more threads. Four parallel sorts is probably pretty heavy usage; remember, we don't really want the client to so easily bring a server to its knees. Should we identify the set of calls that the UI needs to start up a new connection? Alternatively, we could isolate table instantiation (maybe also Python auto-complete?) to a separate thread pool. I feel like a rough division is helpful for the typical user experience, and that there is no need to plan for odd scenarios like an auth handler blocking forever.
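The "rough division" could look something like the following Python sketch, which routes a small, named set of connection-startup calls to their own pool. The call names in `STARTUP_CALLS` and the `Dispatcher` class are hypothetical; the real set would have to come from auditing what the UI actually invokes when it connects.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical call names, assumed for illustration only.
STARTUP_CALLS = frozenset({"start_console", "get_config", "new_session"})

class Dispatcher:
    """Route connection-startup calls away from the general work pool."""

    def __init__(self, general_workers=4, startup_workers=1):
        self.general = ThreadPoolExecutor(general_workers, thread_name_prefix="general")
        self.startup = ThreadPoolExecutor(startup_workers, thread_name_prefix="startup")

    def pool_for(self, call_name):
        return self.startup if call_name in STARTUP_CALLS else self.general

    def submit(self, call_name, fn, *args):
        return self.pool_for(call_name).submit(fn, *args)

    def shutdown(self):
        self.general.shutdown()
        self.startup.shutdown()

def which_thread():
    # Reports the worker thread that ran the call.
    return threading.current_thread().name

dispatcher = Dispatcher()
console_thread = dispatcher.submit("start_console", which_thread).result(timeout=5)
sort_thread = dispatcher.submit("sort", which_thread).result(timeout=5)
dispatcher.shutdown()
```

Keeping the allow-list explicit makes the division easy to audit: anything not named falls back to the general pool.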
@nbauernfeind and I discussed this today briefly, and are planning to add a method to |
I had an intentionally terrible sort that I hammered a few times, and it consumed all the threads the Docker container had available. Then I refreshed the page. Unexpectedly, the page never loads: it is waiting on the startConsole promise to return, and it never does (as the sorts are still consuming all the threads). I don't know what we can do to make this better, but at least better messaging would be nice.
If I wait the 5 minutes for those sorts to return, the page does eventually load.
Steps to reproduce
Page is stuck waiting for console to connect on refresh: