Query for Spark application state failed #762
Thanks for opening this issue. Clearly, we should add protection around dereferencing the response. It looks like this may be necessary (along with some better handling) in a few places. This query is used by the framework to determine if the kernel is still alive - so the 3-4 second recurrence makes sense.
How do you determine this from the stack trace? Also, I'm not following this portion: "because of which the query is failing" - what exactly are you saying is failing? It's very disturbing, but it looks like the root of the issue may be that there are no current connections in the pool (this is being performed by the yarn-api-client package). In these cases, I wonder if we should track the last known state as a member variable and, when exceptions occur retrieving the state (which I assume would be a temporary thing), return that last-known state instead. This would allow EG to continue running and not respond to false (or temporary) failures. If we fixed the blatant bug but did nothing about the last known state, the framework would think the kernel is no longer running and try to auto-restart it. Yet, in reality, the kernel is fine and our communication with YARN is what is failing. An auto-restart in this case would exacerbate the situation. Therefore, I think returning the last known state in this particular case (from this particular method) is probably the right thing to do on failure. I don't mind spending time on this, or I'm happy to help if you want to do this.
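To sketch what I mean by the last-known-state fallback (class and method names here are illustrative, not the actual EG code, and I'm assuming yarn-api-client's cluster_application_state call):

```python
# Sketch: return the last known state when the RM query fails, so a transient
# communication problem doesn't look like a dead kernel. Names are illustrative,
# not the actual EG implementation.

class YarnAppStatePoller:
    def __init__(self, resource_manager, application_id):
        self.resource_manager = resource_manager  # yarn-api-client ResourceManager
        self.application_id = application_id
        self.last_known_state = None

    def query_app_state_by_id(self):
        try:
            response = self.resource_manager.cluster_application_state(self.application_id)
            data = getattr(response, 'data', None) or {}
            state = data.get('state')
            if state:
                self.last_known_state = state
        except Exception:
            # Temporary failure talking to YARN: fall back to the last known
            # state rather than failing (and triggering a false auto-restart).
            pass
        return self.last_known_state
```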
Hey Kevin, Today I faced the same issue again on one of the EG instances, where it was repeatedly querying for the state of the Spark application. I was able to hit the same API from the machine where EG was running, and from my local machine, repeatedly without any failures, while the EG instance was still querying for the application state and failing. So I believe it's not a YARN server-side issue preventing the new connection from being established. I started probing whether there's some other reason the EG could not create a new HTTP connection. I have multiple EG instances running on the same machine behind a load balancer, and the other EG instances were running fine. Based on ...
OK, glad to hear it. It might be interesting to monitor your EG instance file descriptors across all instances and see if they increase at the moment the failure starts occurring in one of the instances.
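If it helps, something like the following could log per-instance FD counts over time (a rough sketch using psutil; matching on "enterprise_gateway" in the command line is an assumption about how your EG processes are launched):

```python
# Rough FD monitor: periodically prints open-FD counts for processes whose
# command line mentions "enterprise_gateway". num_fds() is Unix-only.
import time
import psutil

def eg_fd_counts():
    counts = {}
    for proc in psutil.process_iter(['pid', 'cmdline']):
        cmdline = ' '.join(proc.info['cmdline'] or [])
        if 'enterprise_gateway' in cmdline:
            try:
                counts[proc.info['pid']] = proc.num_fds()
            except psutil.Error:
                pass  # process exited or access denied
    return counts

if __name__ == '__main__':
    while True:
        print(time.strftime('%H:%M:%S'), eg_fd_counts())
        time.sleep(60)
```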
I understand this might be difficult, but it might be worth trying this with EG 2.0. In that release, we use yarn-api-client 1.x vs. the 0.3.x version used in EG 1.x. In the updated yarn client, requests are issued from a session that is associated with the resource manager instance; in the older version, requests appear to be submitted individually. As a result, you might see more consistent behavior with EG 2.0. That said, the error-handling issues still exist in EG 2.x, but the underlying yarn client package is significantly different.
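Purely to illustrate the difference (this is not the yarn-api-client source, and the RM URL is a placeholder): one-off requests build and tear down their own connection each call, while a Session bound to the RM keeps connections alive and reuses them.

```python
# Illustration only - not yarn-api-client code. RM_URL is a placeholder.
import requests

RM_URL = "http://rm-host:8088"

# Older style: every call creates (and discards) its own connection.
def app_state_per_request(app_id):
    return requests.get(f"{RM_URL}/ws/v1/cluster/apps/{app_id}/state").json()

# Newer style: one session per RM instance; connections are pooled and reused.
session = requests.Session()

def app_state_via_session(app_id):
    return session.get(f"{RM_URL}/ws/v1/cluster/apps/{app_id}/state").json()
```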
Yes, I agree that we should firm up the error handling. It might stop the resource leak. We have already started migrating our systems to EG 2.0, which should be done by the end of next week. I will monitor the issue and update this thread if there's more consistent behaviour.
Update: Today, I encountered the issue on another EG instance. The number of open file descriptors is 286, which hasn't increased in the past 20 minutes.
@amangarg96 - have you been able to make any headway on the resource issue after deploying the patch I included in this comment? Also, do you see the query_app_state_by_id issues in the logs of the other EG instances? If so, is the number of occurrences anywhere near that in the log of the problematic instance? I suspect that when this occurs, the restarter ioloop (which is created for each kernel and watches for its unexpected death) is essentially getting leaked. I also suspect that, given we've cleaned up the error handling, you may not encounter this issue (at least as frequently). That's the hope at least. 😄
Hey Kevin, I haven't deployed that patch yet, as we are moving towards JEG 2.0.0 and are almost done with the deployment. I'll have to do some testing before I deploy the patch, which I will do if the release gets delayed. I'll update you on this thread if I deploy it on the JEG 1.2.0 setup and whether the issue is fixed. Regarding the query_app_state_by_id issue - it's not specific to a particular JEG instance. It is observed on many JEG instances and seems to be random (we have 30 JEG instances running, and I have seen it on at least 6-7 different instances).
Yes
No. I'm suggesting that when there was no error handling, the failure was essentially terminating the ioloop restarter for that kernel and, because of that, the appropriate measures to clean up resources were perhaps not occurring, so a leak would result. This implies that you'd only see resource issues in instances where the unhandled query_app_state_by_id failure occurs, and I was suggesting that it might be helpful to apply some fancy log analysis to correlate the two. Of course, if we don't see resource issues after applying the patch, then that would also lend credence to this theory - since the restarter loop is able to "gracefully" stop.
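Roughly, the guard I have in mind looks like this (a generic sketch with a Tornado PeriodicCallback, not the actual restarter code; poll_kernel_state and on_kernel_died are hypothetical callables). The point is that a transient query failure gets absorbed inside the poll instead of propagating out of it:

```python
# Sketch: a restarter-style poll that treats query failures as "unknown"
# rather than "dead", so the loop (and its cleanup paths) keeps working.
from tornado.ioloop import PeriodicCallback

def start_restarter(poll_kernel_state, on_kernel_died, interval_ms=3000):
    def _poll():
        try:
            alive = poll_kernel_state()
        except Exception as exc:
            # Transient failure (e.g., RM unreachable): log and try again on
            # the next tick instead of treating the kernel as dead.
            print(f"state query failed, will retry: {exc}")
            return
        if not alive:
            on_kernel_died()

    callback = PeriodicCallback(_poll, interval_ms)
    callback.start()
    return callback
```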
Hey @amangarg96 - since last Wednesday (January 29) I've been running an experiment with 3 kernels in my (small) YARN/Spark cluster on the 1.x branch. Here's what I've experienced...
I believe this implies that something is causing your environment to lose its connection to the YARN RM. Perhaps the cluster admins perform periodic (every 3-4 days) maintenance such that the RM needs to come down? This also implies that handling the exception in a better manner won't have any effect on the general issue, but it should make the log files much cleaner - at least while the kernels are running. If you were to try to shut down the kernel, and EG needed to resort to its fallback of killing the YARN application via the RM API, I suspect you'd get some noise. However, that fallback shouldn't happen by default, since we use a message-based shutdown request sent to the kernel running directly on the worker node (and not via the YARN RM API). Of course, attempting to start a kernel while the RM API is disconnected won't go very well either. I hope you find this update helpful.
Hey @kevin-bates, I'm skeptical that the YARN RM's downtime is the reason, because in my setup we have multiple EGs behind an nginx and all of them use the same YARN RM, yet only some (mostly just one) of the instances run into the errored state at a given time. So far, I haven't found a way to reproduce this errored state.
@amangarg96 - thanks for the update. It's interesting that my 1.x is behaving differently from yours with respect to steps 4 and 5. At any rate, I get the impression that EG 2.0 might be working a bit better. I'm hoping you're using EG from the tip (master branch), since the released EG 2.0 essentially uses the same yarn-api-client that EG 1.x used. I would recommend upgrading yarn-api-client to 1.x for any comparisons. In the meantime, we should cut a 2.1 release.
@kevin-bates - Thanks for the callout about yarn-api-client, Kevin. I hadn't noticed that.
Source changes are required to move to yarn-api-client >= 1.0. We have no plans to update EG 2.0 with yarn-api-client >= 1.0. Instead, we will be creating EG 2.1 (from master), and EG >= 2.1 will use yarn-api-client >= 1.0. So the answer to your question is yes. 😄 Sorry for not mentioning this earlier, but we should move forward, and if potential issues arise in the yarn area, those should be investigated against yarn-api-client >= 1.0.
Yes, I agree with moving forward with EG 2.x and investigating the issue there (if it arises at all). I'll keep updating this thread with my observations.
EG 2.1 is now available.
Hey Kevin, Looks like the issue isn't fixed yet. I'm using EG 2.1.0, and ran into this issue (which I used to see in EG 1.2.0 too):
Could there be other parts of the code (besides yarn-api-client) where too many sockets are created?
Hi @amangarg96 - I guess I'm not surprised, since we haven't had anything to really grab onto for this. This is unfortunate. Based on the latest traceback, I'm unfamiliar with where this might be occurring in the jupyter stack, since it contains zero references to any jupyter package. Is this scenario the same as previous occurrences - 3-4 days? How active is the system? Previously, the symptom was the failing query_app_state_by_id call. I'm running out of ideas and don't have the time to start up another 7-10 day test. On the bright side, the dependent PRs for async kernel management have been merged. Once a new version of notebook is available, we plan on cutting an EG release, and you should then be able to support simultaneous kernel launches. As for this issue, I think we're going to have to just keep plugging away and figure this out. Your patience has been appreciated. Thank you.
The system currently has fewer users, and we have an nginx proxy in front of the EG instances, which we had scaled up. Yes, the number of open file descriptors was high (78) compared to the other EG instances on the same machine (14-28). My guess is that, rather than yarn-api-client, there is some other part of the code that is opening a lot of sockets over time. What is the timeline for the EG release with async kernel management? Roughly how much time would the release take?
This is kind of what I'm thinking as well. I just can't tell you where to look. Since many resources are system-wide resources, and it sounds like you have multiple instances, the situation gets a little muddier. Are you running kernels on the same hosts as well?
We will generate a new EG release with async kernel management as soon as the underlying Notebook release is available. I'm hoping we can begin work on the NB release shortly, but wouldn't expect anything for a couple of weeks (sorry). Once that release is available, I would expect an EG release within a day or two. Because we need to build images and test things out, EG releases can be heavy.
If by same hosts you meant the same cluster/node managers - no, it's on a different Hadoop cluster. The async kernel management (PR #580) looks really fascinating; can't wait to take it for a spin 😄
Hey Kevin, I have found a way to reproduce the issue on my system (and have possibly come close to finding the root cause). The number of file descriptors for the EG process increases whenever EG is busy handling some other request and a websocket reconnection request is sent to it.
I observed an increase of (n-1)*10 FDs (besides the kernel launch FDs, which were ~30-40), where n is the number of times 'Reconnect to Kernel' was triggered. These extra FDs do not get closed, even after shutting down the kernel. If n=1, no increase in FDs was observed - maybe it's an issue only when there is a pending websocket reconnection request? If EG was not busy handling a request, there was no increase in the number of FDs. For n=3, these are the logs from the Jupyter Notebook server:
The 'Websocket connection has been cancelled' message appears only when the previous ws reconnection is still pending and a new ws reconnection request is sent. Finally, when the number of FDs reaches the ulimit, the 'Too many open files' error pops up.
I was asking whether you're running kernels on the same hosts as the EG instances, or whether the hosts running EG instances are dedicated to EG (and other non-kernel apps). Your findings are very encouraging. I have found that Lab handles the WebSocket connection differently, and I see you're using Lab. I tend to run the Notebook front-end. I'm curious whether you're in a position to switch to Notebook to see if you observe similar growth? The code in the notebook server is in the gateway package; we made changes in this area some time ago. Hmm, I'm finding that this message...
implies you're using older code, since we changed the message to "Websocket connection has been closed via client disconnect...". This change was made about 9 months ago via this commit. A similar change was made to NB2KG at that time. What version of notebook are you running with?
EGs have dedicated hosts; kernels run remotely with respect to the EG host. Yes, your observations are spot on! I was using an older JupyterLab (0.33.12) and Jupyter Notebook (5.7.8); EG is 2.1.0.
Thanks for the update. I'm seeing that the change we need is in master for NB2KG and not in 0.7 (bummer). Are you able to build NB2KG from master and try that out? I can work on getting a 0.7 release built, but it would be good to know if any additional changes might be necessary. Generally speaking, it would be good if you could move to notebook 6.0+, where NB2KG is embedded. However, I understand that's always easier said than done.
Yes! The commit that you mentioned above is not part of the 0.7.0 release. nb2kg 0.7.0 added a keepalive ping on websockets, and we had a similar feature of our own which interferes with 0.7.0; that's why my current system is still on nb2kg 0.6.0. What did the commit you were referring to change? Do you think it would solve this issue?
I think these changes are probably more related to your comment here:
but I think it's in the right area and makes sense. Given that you're experiencing connection issues, this change addresses them by appropriately handling the previous requests. Perhaps @esevan could provide some insights - although I think the comment states things well.
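For what it's worth, the pattern being described looks roughly like this (a sketch, not the NB2KG/gateway code; connect_coro stands in for whatever actually opens the kernel websocket): keep a handle to the in-flight connection attempt and cancel it when a newer reconnection supersedes it.

```python
# Sketch: cancel a still-pending upstream websocket connection when the client
# reconnects, so abandoned connection attempts don't accumulate open sockets.
import asyncio

class KernelWSProxy:
    def __init__(self, connect_coro):
        self._connect_coro = connect_coro  # hypothetical coroutine opening the kernel ws
        self._ws_future = None

    async def on_client_reconnect(self, kernel_id):
        if self._ws_future is not None and not self._ws_future.done():
            # The previous reconnection is still pending; cancel it rather than
            # leaving its half-open connection dangling.
            self._ws_future.cancel()
        self._ws_future = asyncio.ensure_future(self._connect_coro(kernel_id))
        return await self._ws_future
```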
This comment was more about getting you up on the latest than necessarily fixing this issue; I didn't mean to imply otherwise. When I get a chance, I'll try to spend time on your scenario that reproduces the open fd growth, but, since you know how to reproduce the scenario now, I encourage you to try things out with an updated NB2KG, if that's possible.
Hi Kevin,
a. When the JEG server is not busy and I click on "Reconnect to Kernel", Goreplay is able to capture 2 requests. The first one appears to be the common case of upgrading an HTTP connection to a WebSocket, and the second one appears to be a connection close request. b. But when the JEG server is busy and I click on "Reconnect to Kernel", Goreplay captures only the first request; the second request arrives only once JEG finishes its job and becomes available to handle it, and the number of sockets increases only after that. I am not sure how this helps figure out the issue, but this is all the data I could find. Thanks a lot
@vikasgarg1996 Thank you for spending the time and providing this detail - it's greatly appreciated! A couple of comments/questions.
Again, thank you for digging into this. Please continue if possible. 😄
Hi Kevin,
Thanks
Thank you for the updates. The 2-minute startup time is quite long and, from all the prior analysis we've performed on that, seems to be something in the YARN interaction layer or YARN itself. I wish I had more to suggest other than to continue digging. Websocket connection handling (and web layers in general) are things I just don't have very deep experience with. It would not surprise me, though, if there were a leak in the framework. I had identified others previously, and the framework just doesn't have the "airtime" with the kinds of things that EG is capable of introducing.
@kevin-bates Observations (in my screenshots, I am ignoring the FDs corresponding to .so and .log files, just to identify the FDs associated with kernels):
- Started an EG instance: FD count 96.
- During kernel launch: (screenshot)
- After 1st kernel launch: (screenshot) Out of these 26 FDs, the breakdown is in the screenshot.
- After 2nd kernel launch: (screenshot) Out of these 25 FDs, the breakdown is in the screenshot.
- After 3rd kernel launch: (screenshot) Out of these 25 FDs, the breakdown is in the screenshot.
This behaviour was consistent when we started from scratch (start EG instance -> connect client -> launch 3 kernels one by one). To check whether FDs are closed when shutting down a kernel, two kernels were launched from a client on a fresh EG instance, and the FD count reached 149 (consistent with the above observations). On shutdown of the 2nd kernel, the orphan FDs were:
One thing I observed here was that the TCP connection is in CLOSE_WAIT state. After shutting down the client (and both kernels), the FD count is 117 (+19 from the start of JEG).
Another connection leak source: kernel reconnection (as discussed above). There are 10 FDs which do not get closed; with 3 kernel reconnect requests, the leaked FDs number 20, and with 4, 30. These do not come down when the kernel is shut down or when the notebook client is closed.
Wow - this is excellent information. I think we're going to need to dig deeper into the Jupyter stack (in notebook and jupyter_client) and perhaps the yarn-api-client. I'm not saying that something in the EG repo isn't the thing holding these resources, just that we need to identify the code corresponding to the leaked FDs in order to figure out where it's getting held onto. Have you tried performing these same experiments using only ...? Any idea where the 16 STREAM connections/kernel are coming from? Here are some PRs from when I had looked into some leaks previously; I include these to demonstrate the kinds of changes I needed to make. However, I later found that part of 361 had to be reverted, and I'm wondering if this might be the source of at least one of the event poll leaks. I would also note the versions of the underlying packages, primarily notebook and jupyter_client (and perhaps tornado). Ideally, you should be running current versions of those. Thanks for the great work so far!
@kevin-bates Two more observations -
I will keep posting on this thread as I make progress.
All 9 FD leaks on kernel shutdown are coming from the control socket. The following FDs are created when a control socket is created; after shutdown, only the last 3 connections are released. FDs remaining after kernel shutdown: python 23588 naresh.sankapelly 15u unix 0xffff9b86c185c000 0t0 1274634682 type=STREAM
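For reference, the kind of cleanup in question, written as a generic pyzmq sketch (the real fix belongs in jupyter_client; this is not its code): FDs stay open until the socket is closed and, for a per-kernel context, until the context is destroyed.

```python
# Generic pyzmq sketch, not the jupyter_client fix itself: a socket that is
# never close()d, or a context that is never destroyed, keeps its FDs open
# even after the kernel on the other end has gone away.
import zmq

def use_control_channel(ip="127.0.0.1", port=54321):
    ctx = zmq.Context()             # per-kernel context in this sketch
    sock = ctx.socket(zmq.DEALER)   # client side of the control channel
    sock.linger = 0                 # don't hold the FD open waiting to flush
    sock.connect(f"tcp://{ip}:{port}")
    try:
        pass  # ... send/receive control messages ...
    finally:
        sock.close()                # releases the socket's file descriptors
        ctx.destroy(linger=0)       # releases the context's internal FDs
```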
Thank you for all the great work here! Would you mind sharing how you determine this information? How many iterations are performed, etc.? We need to move to the phase of finding these leaks, and the more datapoints we can gather, the "easier" that will be. I would also suggest we focus on the NB leaks first since a) those are more critical to the community at large, but also b) they will likely shed some insight into the others. It sounds like the control socket is one clue.
Interesting and encouraging: jupyter-server/jupyter_server#234, jupyter/notebook#3748 (comment). Also, please provide the output of pip freeze.
pip freeze output:
Thanks - I would recommend moving to notebook == 6.0.3 - just so you're current there.
Can you please provide the tools/commands used to determine the leaks? That way others can help.
The issue persists with 6.0.3 as well. I have run EG on a server and connected to it from PyCharm using remote debugging to figure out where the FDs are being incremented and cleaned up. This has helped in identifying that the 9 FDs of the control socket connection are not being released on shutdown.
Great - thank you!
Would it be possible for you to move to the current jupyter_client release? The 6.0 release contains the two PRs I referenced above, 360 and 361 - although, as noted in #762 (comment), a portion of 361 had to be backed out because a notebook test was failing. (If we see some success with jupyter_client 6, it might be worth experimenting with adding that one line back in.)
@nareshsankapelly, @amangarg96 Take this change! jupyter/jupyter_client#548 I'm showing all my open fds returning to the count at kernel startup! I was reproducing the leak with these kinds of results - which very much matches your results...
After applying the fix...
@kevin-bates Thanks. I will try these changes.
@kevin-bates The FD leaks are gone after taking those changes. However, I see the following errors in the EG logs:
The cache_ports stuff is something that was added to jupyter_client relatively recently and should be disabled in EG: #790. Please try to run with EG master if possible.
@nareshsankapelly @amangarg96 @vikasgarg1996 - please see PR #820. Since the discussions on jupyter/jupyter_client#548 have hit a snag and we've learned that the original code (from a year ago) seems to prevent leaks, I've added a means of enabling the global ZMQ context. I'm hoping to get that back-ported to 1.x as well, and I've just realized we should get a 1.2.1 release published asap. cc: @lresende
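To make the trade-off concrete, here's a hedged sketch (not the PR's code) of the two approaches: a process-wide shared context that never needs per-kernel teardown, versus a per-kernel context that is leak-free only if it is reliably destroyed.

```python
# Illustrative comparison of the two ZMQ context strategies discussed here.
import zmq

# Option A (the "original"/global behavior): one shared context for the whole
# process; nothing has to be destroyed when an individual kernel goes away.
shared_ctx = zmq.Context.instance()

# Option B: a context per kernel; its sockets and the context itself must be
# torn down when the kernel's channels are closed, or FDs leak.
def per_kernel_channels():
    ctx = zmq.Context()
    sock = ctx.socket(zmq.DEALER)
    try:
        pass  # ... use the channel ...
    finally:
        sock.close(linger=0)
        ctx.destroy(linger=0)
```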
We have noted down our observations in this sheet. There are still leaks on the first kernel launch (11 sockets) and on subsequent kernel launches (2 for every new kernel). These sockets are not closed even after closing the jupyterlab (client) server.
Hmm - can you please produce the output of pip freeze? I don't think we should spend time worrying about the "first kernel", because there's probably some amount of "warmup" required within the various frameworks; however, following that, I would expect the number of additional open FDs to remain 0 once all kernels have stopped. Frankly, given the focus on this and the fact that I see instances where there are no increases, I'm fairly confident this is not an EG issue. I'm not saying I won't be tracking changes and ensuring any fixes don't break EG, but I believe the issue to be in the framework itself. If you try the proposed changes from the jupyter_client PR - where the zmq context is not shared AND is destroyed - do you still see leaks there? (I don't see leaks with either a global context or using the local context with destroy, per the PR.)
pip freeze: In the pip freeze shared above (comment), the jupyter_client version should be 5.3.4 and the notebook version should be 5.7.8 (the same as in my pip freeze). We had made some internal (non-functional) changes to these packages, which created the confusion in the version numbers.
Thanks @amangarg96 - as this is the older version of things, I probably won't spend much time looking at it, but I will revisit the open fds once I have a chance with the EG release candidate. If I find that the newer versions still show no leaking, then I'll build an environment using your versions and try to reproduce your issue. I assume your latest observations are with the enablement of the global ZMQ context option from #820?
Yes, the latest observations are with the above PR applied. Meanwhile, we'll try to repeat the above experiments with more recent versions of nb2kg, jupyterlab, notebook and jupyter_client. @kevin-bates - could you share the pip freeze of the environment in which there were no leaks, for reference?
Here's my current env. I haven't revisited the open fd leak issue in a few days but I think the versions are relatively the same:
I've backed out #820 since this has been addressed more appropriately in jupyter_client 6.1.5. As a result, I'm going to close this issue, since it seems to have evolved across different scenarios and I believe the various issues have been resolved. If opinions differ, please open specific issues, citing the appropriate comments from this issue, and we'll strive to resolve those.
Description
The EG instance becomes unresponsive, returning HTTP 500: Internal Server Error to the client. In the EG logs, EG is repeatedly querying for the state of the Spark application (every 3-4 seconds), and the query is failing.
I have not found a way to reproduce the issue. Restarting the EG instance works, but this is a frequently occurring issue (once every 3-4 days).
The YARN API being used to query the state of the application (http://:/ws/v1/cluster/apps/application_1573339086213_974645/state) is working and returning the state of the Spark application.
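For reference, that is YARN's ResourceManager application state REST endpoint; checking it by hand looks roughly like the following (host, port and timeout are placeholders):

```python
# Manual check of the same ResourceManager endpoint that EG polls.
import requests

def yarn_app_state(rm_host, rm_port, app_id):
    url = f"http://{rm_host}:{rm_port}/ws/v1/cluster/apps/{app_id}/state"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json()["state"]  # e.g. "RUNNING", "FINISHED", "KILLED"

# Example (placeholders):
# print(yarn_app_state("rm-host", 8088, "application_1573339086213_974645"))
```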
Going through the stack trace, I observed that the resource_manager object of YarnClusterProcessProxy becomes None, because of which the query is failing.
Screenshots / Logs
Stack trace of EG:
Environment