Query for Spark application state failed #762
Thanks for opening this issue. Clearly, we should add protection around dereferencing the response. It looks like this may be necessary (along with some better handling) in a few places. This query is used by the framework to determine if the kernel is still alive - so the 3-4 second recurrence makes sense.
How do you determine this from the stack trace? Also, I'm not following this portion: "because of which the query is failing" - what exactly are you saying is failing? It's very disturbing, but it looks like the root of the issue may be that there are no current connections in the pool (this is being performed by the yarn-api-client package). In these cases, I wonder if we should track the last known state as a member variable and, when exceptions occur retrieving the state (which I assume would be a temporary thing), return that last-known state instead. This would allow EG to continue running and not respond to false (or temporary) failures. If we fixed the blatant bug but did nothing about the last known state, the framework would think the kernel is no longer running and try to auto-restart it. Yet, in reality, the kernel is fine and our communication with YARN is what is failing. An auto-restart in this case would exacerbate the situation. Therefore, I think returning the last known state in this particular case (from this particular method) is probably the right thing to do on failure. I don't mind spending time on this, or I'm happy to help if you want to do this.
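To sketch what I mean by the last-known-state fallback (class and method names here are illustrative, not the actual EG code, and I'm assuming yarn-api-client's cluster_application_state call):

```python
# Sketch: return the last known state when the RM query fails, so a transient
# communication problem doesn't look like a dead kernel. Names are illustrative,
# not the actual EG implementation.

class YarnAppStatePoller:
    def __init__(self, resource_manager, application_id):
        self.resource_manager = resource_manager  # yarn-api-client ResourceManager
        self.application_id = application_id
        self.last_known_state = None

    def query_app_state_by_id(self):
        try:
            response = self.resource_manager.cluster_application_state(self.application_id)
            data = getattr(response, 'data', None) or {}
            state = data.get('state')
            if state:
                self.last_known_state = state
        except Exception:
            # Temporary failure talking to YARN: fall back to the last known
            # state rather than failing (and triggering a false auto-restart).
            pass
        return self.last_known_state
```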
Hey Kevin, Today I faced the same issue again on one of the EG instances, where it was repeatedly querying for the state of the Spark application. I was able to hit the same API from the machine where EG was running, and from my local machine, repeatedly without any failures, while the EG instance was still querying for the application state and failing. So I believe it's not a YARN server-side issue preventing the new connection from being established. I started probing whether there's some other reason the EG could not create a new HTTP connection. I have multiple EG instances running on the same machine behind a load balancer, and the other EG instances were running fine. Based on ...
OK, glad to hear it. It might be interesting to monitor your EG instance file descriptors across all instances and see if they increase at the moment the failure starts occurring in one of the instances.
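If it helps, something like the following could log per-instance FD counts over time (a rough sketch using psutil; matching on "enterprise_gateway" in the command line is an assumption about how your EG processes are launched):

```python
# Rough FD monitor: periodically prints open-FD counts for processes whose
# command line mentions "enterprise_gateway". num_fds() is Unix-only.
import time
import psutil

def eg_fd_counts():
    counts = {}
    for proc in psutil.process_iter(['pid', 'cmdline']):
        cmdline = ' '.join(proc.info['cmdline'] or [])
        if 'enterprise_gateway' in cmdline:
            try:
                counts[proc.info['pid']] = proc.num_fds()
            except psutil.Error:
                pass  # process exited or access denied
    return counts

if __name__ == '__main__':
    while True:
        print(time.strftime('%H:%M:%S'), eg_fd_counts())
        time.sleep(60)
```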
I understand this might be difficult, but it might be worth trying this with EG 2.0. In that release, we use yarn-api-client 1.x vs. the 0.3.x version used in EG 1.x. In the updated yarn client, requests are issued from a session that is associated with the resource manager instance; in the older version, requests appear to be submitted individually. As a result, you might see more consistent behavior with EG 2.0. That said, the error-handling issues still exist in EG 2.x, but the underlying yarn client package is significantly different.
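Purely to illustrate the difference (this is not the yarn-api-client source, and the RM URL is a placeholder): one-off requests build and tear down their own connection each call, while a Session bound to the RM keeps connections alive and reuses them.

```python
# Illustration only - not yarn-api-client code. RM_URL is a placeholder.
import requests

RM_URL = "http://rm-host:8088"

# Older style: every call creates (and discards) its own connection.
def app_state_per_request(app_id):
    return requests.get(f"{RM_URL}/ws/v1/cluster/apps/{app_id}/state").json()

# Newer style: one session per RM instance; connections are pooled and reused.
session = requests.Session()

def app_state_via_session(app_id):
    return session.get(f"{RM_URL}/ws/v1/cluster/apps/{app_id}/state").json()
```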
Yes, I agree that we should firm up the error handling. It might stop the resource leak. We have already started migrating our systems to EG 2.0, which should be done by the end of next week. I will monitor the issue and update this thread if there's more consistent behaviour.
Update: Today, I encountered the issue on another EG instance. The number of open file descriptors is 286, which hasn't increased in the past 20 minutes.
@amangarg96 - have you been able to make any headway on the resource issue after deploying the patch I included in this comment? Also, do you see the query_app_state_by_id issues in the logs of the other EG instances? If so, is the number of occurrences anywhere near that in the log of the problematic instance? I suspect that when this occurs, the restarter ioloop (which is created for each kernel and watches for its unexpected death) is essentially getting leaked. I also suspect that, given we've cleaned up the error handling, you may not encounter this issue (at least as frequently). That's the hope at least. 😄
Hey Kevin, I haven't deployed that patch yet, as we are moving towards JEG 2.0.0 and are almost done with the deployment. I'll have to do some testing before I deploy the patch, which I will do if the release gets delayed. I'll update you on this thread if I deploy it on the JEG 1.2.0 setup and whether the issue is fixed. Regarding the query_app_state_by_id issue - it's not specific to a particular JEG instance. It is observed on many JEG instances and seems to be random (we have 30 JEG instances running, and I have seen it on at least 6-7 different instances).
Yes
No. I'm suggesting that when there was no error handling, the failure was essentially terminating the ioloop restarter for that kernel and, because of that, the appropriate measures to clean up resources were perhaps not occurring, so a leak would result. This implies that you'd only see resource issues in instances where the unhandled query_app_state_by_id failure occurs, and I was suggesting that it might be helpful to apply some fancy log analysis to correlate the two. Of course, if we don't see resource issues after applying the patch, then that would also lend credence to this theory - since the restarter loop is able to "gracefully" stop.
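Roughly, the guard I have in mind looks like this (a generic sketch with a Tornado PeriodicCallback, not the actual restarter code; poll_kernel_state and on_kernel_died are hypothetical callables). The point is that a transient query failure gets absorbed inside the poll instead of propagating out of it:

```python
# Sketch: a restarter-style poll that treats query failures as "unknown"
# rather than "dead", so the loop (and its cleanup paths) keeps working.
from tornado.ioloop import PeriodicCallback

def start_restarter(poll_kernel_state, on_kernel_died, interval_ms=3000):
    def _poll():
        try:
            alive = poll_kernel_state()
        except Exception as exc:
            # Transient failure (e.g., RM unreachable): log and try again on
            # the next tick instead of treating the kernel as dead.
            print(f"state query failed, will retry: {exc}")
            return
        if not alive:
            on_kernel_died()

    callback = PeriodicCallback(_poll, interval_ms)
    callback.start()
    return callback
```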
Hey @amangarg96 - since last Wednesday (January 29) I've been running an experiment with 3 kernels in my (small) YARN/Spark cluster on the 1.x branch. Here's what I've experienced...
I believe this implies that something is causing your environment to lose its connection to the YARN RM. Perhaps the cluster admins perform periodic (every 3-4 days) maintenance such that the RM needs to come down? This also implies that handling the exception in a better manner won't have any effect on the general issue, but it should make the log files much cleaner - at least while the kernels are running. If you were to try to shut down the kernel, and EG needed to resort to its fallback of killing the YARN application via the RM API, I suspect you'd get some noise. However, that fallback shouldn't happen by default, since we use a message-based shutdown request sent to the kernel running directly on the worker node (and not via the YARN RM API). Of course, attempting to start a kernel while the RM API is disconnected won't go very well either. I hope you find this update helpful.
Hey @kevin-bates, I'm skeptical that the YARN RM's downtime is the reason, because in my setup we have multiple EGs behind an nginx and all of them use the same YARN RM, yet only some (mostly just one) of the instances run into the errored state at a given time. So far, I haven't found a way to reproduce this errored state.
@amangarg96 - thanks for the update. It's interesting that my 1.x is behaving differently from yours with respect to steps 4 and 5. At any rate, I get the impression that EG 2.0 might be working a bit better. I'm hoping you're using EG from the tip (master branch), since the released EG 2.0 essentially uses the same yarn-api-client that EG 1.x used. I would recommend upgrading yarn-api-client to 1.x for any comparisons. In the meantime, we should cut a 2.1 release.
@kevin-bates - Thanks for the callout about yarn-api-client, Kevin. I hadn't noticed that.
Source changes are required to move to yarn-api-client >= 1.0. We have no plans to update EG 2.0 with yarn-api-client >= 1.0. Instead, we will be creating EG 2.1 (from master), and EG >= 2.1 will use yarn-api-client >= 1.0. So the answer to your question is yes. 😄 Sorry for not mentioning this earlier, but we should move forward, and if potential issues arise in the yarn area, those should be investigated against yarn-api-client >= 1.0.
Yes, I agree with moving forward with EG 2.x and investigating the issue there (if it arises at all). I'll keep updating this thread with my observations.
EG 2.1 is now available.
Hey Kevin, Looks like the issue isn't fixed yet. I'm using EG 2.1.0, and ran into this issue (which I used to see in EG 1.2.0 too):
Could there be other parts of the code (besides yarn-api-client) where too many sockets are created?
Hi @amangarg96 - I guess I'm not surprised, since we haven't had anything to really grab onto for this. This is unfortunate. Based on the latest traceback, I'm unfamiliar with where this might be occurring in the jupyter stack, since it contains zero references to any jupyter package. Is this scenario the same as previous occurrences - 3-4 days? How active is the system? Previously, the symptom was the failing query_app_state_by_id call. I'm running out of ideas and don't have the time to start up another 7-10 day test. On the bright side, the dependent PRs for async kernel management have been merged. Once a new version of notebook is available, we plan on cutting an EG release, and you should then be able to support simultaneous kernel launches. As for this issue, I think we're going to have to just keep plugging away and figure this out. Your patience has been appreciated. Thank you.
The system currently has fewer users, and we have an nginx proxy in front of the EG instances, which we had scaled up. Yes, the number of open file descriptors was high (78) compared to the other EG instances on the same machine (14-28). My guess is that, rather than yarn-api-client, there is some other part of the code that is opening a lot of sockets over time. What is the timeline for the EG release with async kernel management? Roughly how much time would the release take?
This is kind of what I'm thinking as well. I just can't tell you where to look. Since many resources are system-wide resources, and it sounds like you have multiple instances, the situation gets a little muddier. Are you running kernels on the same hosts as well?
We will generate a new EG release with async kernel management as soon as the underlying Notebook release is available. I'm hoping we can begin work on the NB release shortly, but wouldn't expect anything for a couple of weeks (sorry). Once that release is available, I would expect an EG release within a day or two. Because we need to build images and test things out, EG releases can be heavy.
If by same hosts you meant the same cluster/node managers - no, it's on a different Hadoop cluster. The async kernel management (PR #580) looks really fascinating; can't wait to take it for a spin 😄
Hey Kevin, I have found a way to reproduce the issue on my system (and have possibly come close to finding the root cause). The number of file descriptors for the EG process increases whenever EG is busy handling some other request and a websocket reconnection request is sent to it.
I observed an increase of (n-1)*10 FDs (besides the kernel launch FDs, which were ~30-40), where n is the number of times 'Reconnect to Kernel' was triggered. These extra FDs do not get closed, even after shutting down the kernel. If n=1, no increase in FDs was observed - maybe it's an issue only when there is a pending websocket reconnection request? If EG was not busy handling a request, there was no increase in the number of FDs. For n=3, these are the logs from the Jupyter Notebook server:
The 'Websocket connection has been cancelled' message appears only when the previous ws reconnection is still pending and a new ws reconnection request is sent. Finally, when the number of FDs reaches the ulimit, the 'Too many open files' error pops up.
I was asking whether you're running kernels on the same hosts as the EG instances, or whether the hosts running EG instances are dedicated to EG (and other non-kernel apps). Your findings are very encouraging. I have found that Lab handles the WebSocket connection differently, and I see you're using Lab. I tend to run the Notebook front-end. I'm curious whether you're in a position to switch to Notebook to see if you observe similar growth? The code in the notebook server is in the gateway package; we made changes in this area some time ago. Hmm, I'm finding that this message...
implies you're using older code, since we changed the message to "Websocket connection has been closed via client disconnect...". This change was made about 9 months ago via this commit. A similar change was made to NB2KG at that time. What version of notebook are you running with?
EGs have dedicated hosts; kernels run remotely with respect to the EG host. Yes, your observations are spot on! I was using an older JupyterLab (0.33.12) and Jupyter Notebook (5.7.8); EG is 2.1.0.
Thanks for the update. I'm seeing that the change we need is in master for NB2KG and not in 0.7 (bummer). Are you able to build NB2KG from master and try that out? I can work on getting a 0.7 release built, but it would be good to know if any additional changes might be necessary. Generally speaking, it would be good if you could move to notebook 6.0+, where NB2KG is embedded. However, I understand that's always easier said than done.
Yes! The commit that you mentioned above is not part of the 0.7.0 release. nb2kg 0.7.0 added a keepalive ping on websockets, and we had a similar feature of our own which interferes with 0.7.0; that's why my current system is still on nb2kg 0.6.0. What did the commit you were referring to change? Do you think it would solve this issue?
I think these changes are probably more related to your comment here:
but I think it's in the right area and makes sense. Given that you're experiencing connection issues, this change addresses them by appropriately handling the previous requests. Perhaps @esevan could provide some insights - although I think the comment states things well.
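For what it's worth, the pattern being described looks roughly like this (a sketch, not the NB2KG/gateway code; connect_coro stands in for whatever actually opens the kernel websocket): keep a handle to the in-flight connection attempt and cancel it when a newer reconnection supersedes it.

```python
# Sketch: cancel a still-pending upstream websocket connection when the client
# reconnects, so abandoned connection attempts don't accumulate open sockets.
import asyncio

class KernelWSProxy:
    def __init__(self, connect_coro):
        self._connect_coro = connect_coro  # hypothetical coroutine opening the kernel ws
        self._ws_future = None

    async def on_client_reconnect(self, kernel_id):
        if self._ws_future is not None and not self._ws_future.done():
            # The previous reconnection is still pending; cancel it rather than
            # leaving its half-open connection dangling.
            self._ws_future.cancel()
        self._ws_future = asyncio.ensure_future(self._connect_coro(kernel_id))
        return await self._ws_future
```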
This comment was more about getting you up on the latest than necessarily fixing this issue; I didn't mean to imply otherwise. When I get a chance, I'll try to spend time on your scenario that reproduces the open fd growth, but, since you know how to reproduce the scenario now, I encourage you to try things out with an updated NB2KG, if that's possible.
Hi Kevin,
a. When the JEG server is not busy and I click on "Reconnect to Kernel", Goreplay is able to capture 2 requests. The first one appears to be the common case of upgrading an HTTP connection to a WebSocket, and the second one appears to be a connection close request. b. But when the JEG server is busy and I click on "Reconnect to Kernel", Goreplay captures only the first request; the second request arrives only once JEG finishes its job and becomes available to handle it, and the number of sockets increases only after that. I am not sure how this helps figure out the issue, but this is all the data I could find. Thanks a lot
@vikasgarg1996 Thank you for spending the time and providing this detail - it's greatly appreciated! A couple of comments/questions.
Again, thank you for digging into this. Please continue if possible. 😄
Hi Kevin,
Thanks
Thank you for the updates. The 2-minute startup time is quite long and, from all the prior analysis we've performed on that, seems to be something in the YARN interaction layer or YARN itself. I wish I had more to suggest other than to continue digging. Websocket connection handling (and web layers in general) are things I just don't have very deep experience with. It would not surprise me, though, if there were a leak in the framework. I had identified others previously, and the framework just doesn't have the "airtime" with the kinds of things that EG is capable of introducing.
@kevin-bates Observations (in my screenshots, I am ignoring the FDs corresponding to .so and .log files, just to identify the FDs associated with kernels):
- Started an EG instance: FD count 96.
- During kernel launch: (screenshot)
- After 1st kernel launch: (screenshot) Out of these 26 FDs, the breakdown is in the screenshot.
- After 2nd kernel launch: (screenshot) Out of these 25 FDs, the breakdown is in the screenshot.
- After 3rd kernel launch: (screenshot) Out of these 25 FDs, the breakdown is in the screenshot.
This behaviour was consistent when we started from scratch (start EG instance -> connect client -> launch 3 kernels one by one). To check whether FDs are closed when shutting down a kernel, two kernels were launched from a client on a fresh EG instance, and the FD count reached 149 (consistent with the above observations). On shutdown of the 2nd kernel, the orphan FDs were:
One thing I observed here was that the TCP connection is in CLOSE_WAIT state. After shutting down the client (and both kernels), the FD count is 117 (+19 from the start of JEG).
Another connection leak source: kernel reconnection (as discussed above). There are 10 FDs which do not get closed; with 3 kernel reconnect requests, the leaked FDs number 20, and with 4, 30. These do not come down when the kernel is shut down or when the notebook client is closed.
Wow - this is excellent information. I think we're going to need to dig deeper into the Jupyter stack (in notebook and jupyter_client) and perhaps the yarn-api-client. I'm not saying that something in the EG repo isn't the thing holding these resources, just that we need to identify the code corresponding to the leaked FDs in order to figure out where it's getting held onto. Have you tried performing these same experiments using only ...? Any idea where the 16 STREAM connections/kernel are coming from? Here are some PRs from when I had looked into some leaks previously; I include these to demonstrate the kinds of changes I needed to make. However, I later found that part of 361 had to be reverted, and I'm wondering if this might be the source of at least one of the event poll leaks. I would also note the versions of the underlying packages, primarily notebook and jupyter_client (and perhaps tornado). Ideally, you should be running current versions of those. Thanks for the great work so far!
@kevin-bates Two more observations -
I will keep posting on this thread as I make progress.
All 9 FD leaks on kernel shutdown are coming from the control socket. The following FDs are created when a control socket is created; after shutdown, only the last 3 connections are released. FDs remaining after kernel shutdown: python 23588 naresh.sankapelly 15u unix 0xffff9b86c185c000 0t0 1274634682 type=STREAM
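For reference, the kind of cleanup in question, written as a generic pyzmq sketch (the real fix belongs in jupyter_client; this is not its code): FDs stay open until the socket is closed and, for a per-kernel context, until the context is destroyed.

```python
# Generic pyzmq sketch, not the jupyter_client fix itself: a socket that is
# never close()d, or a context that is never destroyed, keeps its FDs open
# even after the kernel on the other end has gone away.
import zmq

def use_control_channel(ip="127.0.0.1", port=54321):
    ctx = zmq.Context()             # per-kernel context in this sketch
    sock = ctx.socket(zmq.DEALER)   # client side of the control channel
    sock.linger = 0                 # don't hold the FD open waiting to flush
    sock.connect(f"tcp://{ip}:{port}")
    try:
        pass  # ... send/receive control messages ...
    finally:
        sock.close()                # releases the socket's file descriptors
        ctx.destroy(linger=0)       # releases the context's internal FDs
```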
Thank you for all the great work here! Would you mind sharing how you determine this information? How many iterations are performed, etc.? We need to move to the phase of finding these leaks, and the more datapoints we can gather, the "easier" that will be. I would also suggest we focus on the NB leaks first since a) those are more critical to the community at large, but also b) they will likely shed some insight into the others. It sounds like the control socket is one clue.
Interesting and encouraging: jupyter-server/jupyter_server#234, jupyter/notebook#3748 (comment). Also, please provide the output of pip freeze.
pip freeze output:
Thanks - I would recommend moving to notebook == 6.0.3 - just so you're current there.
Can you please provide the tools/commands used to determine the leaks? That way others can help.
The issue persists with 6.0.3 as well. I have run EG on a server and connected to it from PyCharm using remote debugging to figure out where the FDs are being incremented and cleaned up. This has helped in identifying that the 9 FDs of the control socket connection are not being released on shutdown.
Great - thank you!
Would it be possible for you to move to the current jupyter_client release? The 6.0 release contains the two PRs I referenced above, 360 and 361 - although, as noted in #762 (comment), a portion of 361 had to be backed out because a notebook test was failing. (If we see some success with jupyter_client 6, it might be worth experimenting with adding that one line back in.)
@nareshsankapelly, @amangarg96 Take this change! jupyter/jupyter_client#548 I'm showing all my open fds returning to the count at kernel startup! I was reproducing the leak with these kinds of results - which very much matches your results...
After applying the fix...
@kevin-bates Thanks. I will try these changes.
@kevin-bates The FD leaks are gone after taking those changes. However, I see the following errors in the EG logs:
The cache_ports stuff is something that was added to jupyter_client relatively recently and should be disabled in EG: #790. Please try to run with EG master if possible.
@nareshsankapelly @amangarg96 @vikasgarg1996 - please see PR #820. Since the discussions on jupyter/jupyter_client#548 have hit a snag and we've learned that the original code (from a year ago) seems to prevent leaks, I've added a means of enabling the global ZMQ context. I'm hoping to get that back-ported to 1.x as well, and I've just realized we should get a 1.2.1 release published asap. cc: @lresende
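To make the trade-off concrete, here's a hedged sketch (not the PR's code) of the two approaches: a process-wide shared context that never needs per-kernel teardown, versus a per-kernel context that is leak-free only if it is reliably destroyed.

```python
# Illustrative comparison of the two ZMQ context strategies discussed here.
import zmq

# Option A (the "original"/global behavior): one shared context for the whole
# process; nothing has to be destroyed when an individual kernel goes away.
shared_ctx = zmq.Context.instance()

# Option B: a context per kernel; its sockets and the context itself must be
# torn down when the kernel's channels are closed, or FDs leak.
def per_kernel_channels():
    ctx = zmq.Context()
    sock = ctx.socket(zmq.DEALER)
    try:
        pass  # ... use the channel ...
    finally:
        sock.close(linger=0)
        ctx.destroy(linger=0)
```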
We have noted down our observations in this sheet. There are still leaks on the first kernel launch (11 sockets) and on subsequent kernel launches (2 for every new kernel). These sockets are not closed even after closing the jupyterlab (client) server.
Hmm - can you please produce the output of pip freeze? I don't think we should spend time worrying about the "first kernel", because there's probably some amount of "warmup" required within the various frameworks; however, following that, I would expect the number of additional open FDs to remain 0 once all kernels have stopped. Frankly, given the focus on this and the fact that I see instances where there are no increases, I'm fairly confident this is not an EG issue. I'm not saying I won't be tracking changes and ensuring any fixes don't break EG, but I believe the issue to be in the framework itself. If you try the proposed changes from the jupyter_client PR - where the zmq context is not shared AND is destroyed - do you still see leaks there? (I don't see leaks with either a global context or using the local context with destroy, per the PR.)
pip freeze: In the pip freeze shared above (comment), the jupyter_client version should be 5.3.4 and the notebook version should be 5.7.8 (the same as in my pip freeze). We had made some internal (non-functional) changes to these packages, which created the confusion in the version numbers.
Thanks @amangarg96 - as this is the older version of things, I probably won't spend much time looking at it, but I will revisit the open fds once I have a chance with the EG release candidate. If I find that the newer versions still show no leaking, then I'll build an environment using your versions and try to reproduce your issue. I assume your latest observations are with the enablement of the global ZMQ context option from #820?
Yes, the latest observations are with the above PR applied. Meanwhile, we'll try to repeat the above experiments with more recent versions of nb2kg, jupyterlab, notebook and jupyter_client. @kevin-bates - could you share the pip freeze of the environment in which there were no leaks, for reference?
Here's my current env. I haven't revisited the open fd leak issue in a few days but I think the versions are relatively the same:
I've backed out #820 since this has been addressed more appropriately in jupyter_client 6.1.5. As a result, I'm going to close this issue, since it seems to have evolved across different scenarios and I believe the various issues have been resolved. If opinions differ, please open specific issues, citing the appropriate comments from this issue, and we'll strive to resolve those.
Description
The EG instance becomes unresponsive, returning HTTP 500: Internal Server Error to the client. In the EG logs, EG is repeatedly querying for the state of the Spark application (every 3-4 seconds), and the query is failing.
I have not found a way to reproduce the issue. Restarting the EG instance works, but this is a frequently occurring issue (once every 3-4 days).
The YARN API being used to query the state of the application (http://:/ws/v1/cluster/apps/application_1573339086213_974645/state) is working and returning the state of the Spark application.
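For reference, that is YARN's ResourceManager application state REST endpoint; checking it by hand looks roughly like the following (host, port and timeout are placeholders):

```python
# Manual check of the same ResourceManager endpoint that EG polls.
import requests

def yarn_app_state(rm_host, rm_port, app_id):
    url = f"http://{rm_host}:{rm_port}/ws/v1/cluster/apps/{app_id}/state"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json()["state"]  # e.g. "RUNNING", "FINISHED", "KILLED"

# Example (placeholders):
# print(yarn_app_state("rm-host", 8088, "application_1573339086213_974645"))
```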
Going through the stack trace, I observed that the resource_manager object of YarnClusterProcessProxy becomes None, because of which the query is failing.
Screenshots / Logs
Stack trace of EG:
Environment