client.pods().inNamespace("x").list(); timeouts in version 6.6.0 #5117
Comments
Same issue with version 6.6.1.
What is your setting for the Config.requestTimeout? If you haven't explicitly set it then it should be 10 seconds. What about the retry backoff settings? This information and a debug log from 6.6.1 would help us determine what is occurring in your case.
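For reference, a minimal sketch (not from the original report; the values are illustrative) of how the request timeout and retry backoff settings can be made explicit through the ConfigBuilder:

```java
import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.ConfigBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class TimeoutConfigSketch {
  public static void main(String[] args) {
    // Illustrative values only; the defaults apply when these are not set.
    Config config = new ConfigBuilder()
        .withRequestTimeout(10_000)            // request timeout in milliseconds
        .withRequestRetryBackoffLimit(3)       // number of retries for failed requests
        .withRequestRetryBackoffInterval(1000) // base retry backoff interval in milliseconds
        .build();

    try (KubernetesClient client = new KubernetesClientBuilder().withConfig(config).build()) {
      System.out.println(client.pods().inNamespace("x").list().getItems().size());
    }
  }
}
```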
on 6.6.1.
No logging at all with trace enabled for the client.pods().inNamespace("x").list(); command after log.info(marker, "getPodList 1");. The log only prints "getPodList 1", then it is quiet, and I aborted it after 15 minutes.
2023-05-11 14:14:32,373 TRACE io.fabric8.kubernetes.client.http.HttpLoggingInterceptor -HTTP START-
For 6.6.1 what is your requestTimeout set to? At the very least I would have expected the request to time out. If it did not, presumably this means that the initial response has been received, but that the full response has not been processed. The logic is simply building up a byte[] of the response, so there's not really a reason why this won't complete. What is your httpclient - there will probably be additional logging there that's helpful. If it's vertx and some other operation is blocking the io thread needed to process these results, then you'll see behavior like this. Can you provide or examine a thread dump to confirm if that is the case?
OkHttp is the httpclient. Here is the thread list:
Thread [main] (Suspended)
Thread [ReaderThread] (Suspended)
Thread [Timer-0] (Suspended)
Daemon Thread [OkHttp TaskRunner] (Suspended)
Added trace logging for
Still no good indication of what you are seeing unfortunately. If there were a stuck thread you'd expect to see something in OkHttpClientImpl.doConsume - that is responsible for pulling results from the okhttp source and delivering them downstream for building the byte[] that will get parsed as the response. Other than trying a different httpclient to compare behavior, this will take a reproducer or the use of a debugger to investigate further.
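If it helps, one way to compare HTTP client implementations is to pin the factory explicitly when building the client. A sketch, assuming the matching kubernetes-httpclient-okhttp module (or -vertx/-jdk/-jetty for comparison) is on the classpath:

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.okhttp.OkHttpClientFactory;

public class HttpClientComparisonSketch {
  public static void main(String[] args) {
    // Pin the HTTP client implementation; swap the factory (and the matching
    // kubernetes-httpclient-* dependency) to compare behavior between clients.
    try (KubernetesClient client = new KubernetesClientBuilder()
        .withHttpClientFactory(new OkHttpClientFactory())
        .build()) {
      System.out.println(client.pods().inNamespace("x").list().getItems().size());
    }
  }
}
```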
If I send you a flight recording (JFR file), will that help?
I tested it now against a namespace containing a lot fewer pods and then it worked. |
I'm not aware of a size-related limitation with building the result. I'll try a similar reproducer locally with okhttp and see if it shows the same behavior.
Without additional factors, size isn't an issue locally. I can transfer/parse a list of 3,000,000 small configmaps totaling over 300MB without a problem. For even larger sizes I am seeing some thrashing when jackson is parsing the result - but in the thread dump I do see a thread doing that processing, which doesn't match your scenario.
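For context, the reproducer described above is roughly of this shape (a sketch; the namespace name and count are placeholders, not the exact test that was run):

```java
import io.fabric8.kubernetes.api.model.ConfigMapBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

import java.util.Collections;

public class LargeListReproducerSketch {
  public static void main(String[] args) {
    String ns = "reproducer";  // placeholder namespace
    int count = 10_000;        // raise this to push the list response past the suspect size
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      // Create many small ConfigMaps so the list response becomes large.
      for (int i = 0; i < count; i++) {
        client.configMaps().inNamespace(ns).resource(new ConfigMapBuilder()
            .withNewMetadata().withName("cm-" + i).endMetadata()
            .withData(Collections.singletonMap("key", "small-value"))
            .build()).create();
      }
      // The list call is where the reported hang occurs.
      System.out.println(client.configMaps().inNamespace(ns).list().getItems().size());
    }
  }
}
```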
Probably. The next thing I would look at is what the ByteArrayBodyHandler is doing - is onBodyDone ever called for this request? If it has been, then your problem is happening downstream with parsing. If it has not, then the problem is with the transfer of the response.
onBodyDone is never called. This is the thread dump:
"main" #1 prio=5 os_prio=0 cpu=2758.14ms elapsed=248.09s tid=0x00007f2580026f70 nid=0x20cc waiting on condition [0x00007f258aa72000]
"Reference Handler" #2 daemon prio=10 os_prio=0 cpu=0.85ms elapsed=248.07s tid=0x00007f258013d290 nid=0x20d4 waiting on condition [0x00007f25492fc000]
"Finalizer" #3 daemon prio=8 os_prio=0 cpu=0.15ms elapsed=248.07s tid=0x00007f258013e670 nid=0x20d5 in Object.wait() [0x00007f25491fb000]
"Signal Dispatcher" #4 daemon prio=9 os_prio=0 cpu=0.21ms elapsed=248.06s tid=0x00007f2580144be0 nid=0x20d6 waiting on condition [0x0000000000000000]
"Service Thread" #5 daemon prio=9 os_prio=0 cpu=0.83ms elapsed=248.06s tid=0x00007f2580145ea0 nid=0x20d7 runnable [0x0000000000000000]
"Monitor Deflation Thread" #6 daemon prio=9 os_prio=0 cpu=4.18ms elapsed=248.06s tid=0x00007f25801472b0 nid=0x20d8 runnable [0x0000000000000000]
"C2 CompilerThread0" #7 daemon prio=9 os_prio=0 cpu=4509.13ms elapsed=248.06s tid=0x00007f2580148bb0 nid=0x20d9 waiting on condition [0x0000000000000000]
"C1 CompilerThread0" #9 daemon prio=9 os_prio=0 cpu=1320.18ms elapsed=248.06s tid=0x00007f258014a0e0 nid=0x20da waiting on condition [0x0000000000000000]
"Sweeper thread" #10 daemon prio=9 os_prio=0 cpu=19.82ms elapsed=248.06s tid=0x00007f258014b550 nid=0x20db runnable [0x0000000000000000]
"Common-Cleaner" #11 daemon prio=8 os_prio=0 cpu=1.53ms elapsed=247.98s tid=0x00007f2580157cc0 nid=0x20de in Object.wait() [0x00007f25487cb000]
"JDWP Transport Listener: dt_socket" #12 daemon prio=10 os_prio=0 cpu=23.83ms elapsed=247.95s tid=0x00007f258019c640 nid=0x20e0 runnable [0x0000000000000000]
"JDWP Event Helper Thread" #13 daemon prio=10 os_prio=0 cpu=3.05ms elapsed=247.94s tid=0x00007f25801afef0 nid=0x20e3 runnable [0x0000000000000000]
"JDWP Command Reader" #14 daemon prio=10 os_prio=0 cpu=7.22ms elapsed=247.94s tid=0x00007f2518000f60 nid=0x20e4 runnable [0x0000000000000000]
"Notification Thread" #15 daemon prio=9 os_prio=0 cpu=2.93ms elapsed=247.25s tid=0x00007f2581a08ce0 nid=0x20ec runnable [0x0000000000000000]
"ReaderThread" #17 prio=5 os_prio=0 cpu=0.75ms elapsed=247.07s tid=0x00007f2581a964b0 nid=0x20ee runnable [0x00007f250b1b9000]
"Attach Listener" #18 daemon prio=9 os_prio=0 cpu=365.45ms elapsed=246.69s tid=0x00007f252c000e20 nid=0x20f3 runnable [0x0000000000000000]
"RMI TCP Accept-0" #19 daemon prio=9 os_prio=0 cpu=1.80ms elapsed=246.38s tid=0x00007f24e80671a0 nid=0x20f8 runnable [0x00007f250abac000]
"JFR Recorder Thread" #20 daemon prio=5 os_prio=0 cpu=23.97ms elapsed=180.24s tid=0x00007f24e8f40cd0 nid=0x2174 waiting on condition [0x0000000000000000]
"JFR Periodic Tasks" #21 daemon prio=9 os_prio=0 cpu=279.39ms elapsed=180.04s tid=0x00007f24e8fda340 nid=0x2175 waiting for monitor entry [0x00007f250a9aa000]
"JFR Recording Scheduler" #24 daemon prio=9 os_prio=0 cpu=15.37ms elapsed=180.00s tid=0x00007f24d40031e0 nid=0x2178 waiting on condition [0x00007f2509dfb000]
"RMI TCP Connection(1)-10.120.185.4" #25 daemon prio=9 os_prio=0 cpu=112.66ms elapsed=154.46s tid=0x00007f24dc002b10 nid=0x21aa in Object.wait() [0x00007f25481c0000]
"RMI Scheduler(0)" #26 daemon prio=9 os_prio=0 cpu=0.41ms elapsed=154.43s tid=0x00007f25041d9870 nid=0x21ab waiting on condition [0x00007f2509efc000]
"JMX server connection timeout 27" #27 daemon prio=9 os_prio=0 cpu=4.69ms elapsed=154.43s tid=0x00007f2504154eb0 nid=0x21ac in Object.wait() [0x00007f2509cfa000]
"RMI TCP Connection(2)-10.120.185.4" #28 daemon prio=9 os_prio=0 cpu=490.97ms elapsed=154.30s tid=0x00007f24dc007690 nid=0x21b0 runnable [0x00007f2509af7000]
"OkHttp TaskRunner" #31 daemon prio=5 os_prio=0 cpu=79.20ms elapsed=146.01s tid=0x00007f24fc0a20d0 nid=0x21cf waiting on condition [0x00007f2508307000]
"OkHttp TaskRunner" #33 daemon prio=5 os_prio=0 cpu=3.05ms elapsed=146.00s tid=0x00007f250c007010 nid=0x21d1 waiting on condition [0x00007f25085fc000]
"OkHttp TaskRunner" #35 daemon prio=5 os_prio=0 cpu=1.13ms elapsed=145.98s tid=0x00007f2518010250 nid=0x21d3 in Object.wait() [0x00007f2508206000]
"VM Thread" os_prio=0 cpu=36.04ms elapsed=248.07s tid=0x00007f25801391e0 nid=0x20d3 runnable
"GC Thread#0" os_prio=0 cpu=38.15ms elapsed=248.09s tid=0x00007f2580053be0 nid=0x20ce runnable
"GC Thread#1" os_prio=0 cpu=35.95ms elapsed=247.49s tid=0x00007f2540005120 nid=0x20e8 runnable
"GC Thread#2" os_prio=0 cpu=45.39ms elapsed=247.49s tid=0x00007f2540005b50 nid=0x20e9 runnable
"GC Thread#3" os_prio=0 cpu=28.77ms elapsed=247.49s tid=0x00007f2540006580 nid=0x20ea runnable
"G1 Main Marker" os_prio=0 cpu=0.58ms elapsed=248.08s tid=0x00007f2580064d10 nid=0x20cf runnable
"G1 Conc#0" os_prio=0 cpu=32.97ms elapsed=248.08s tid=0x00007f2580065c70 nid=0x20d0 runnable
"G1 Refine#0" os_prio=0 cpu=1.14ms elapsed=248.08s tid=0x00007f258010a8a0 nid=0x20d1 runnable
"G1 Refine#1" os_prio=0 cpu=0.03ms elapsed=247.48s tid=0x00007f2544000ce0 nid=0x20eb runnable
"G1 Service" os_prio=0 cpu=32.46ms elapsed=248.08s tid=0x00007f258010b790 nid=0x20d2 runnable
"VM Periodic Task Thread" os_prio=0 cpu=102.13ms elapsed=247.25s tid=0x00007f2580028360 nid=0x20ed waiting on condition
JNI global refs: 90, weak refs: 135
In the same namespace it fails to get secrets (about 9 MB), but it manages to get smaller ones. It actually freezes on client.raw("https://x.x.x.x:yyy/api/v1/namespaces/xxx/pods"); after I reduced the code involved to that call. I'm running the tests using Java 17.0.6 and JUnit 5 on a Linux environment, if it matters.
The next thought is that somewhere OkHttpClientImpl consume (a problem with the executor) or doConsume (a problem reading the okhttp source) is throwing an Error rather than an Exception - those methods are only catching Exception. That would give the appearance of things hanging from the client side - but you'd expect that to be logged by the default uncaught exception handling. Are you customizing the okhttp dispatcher executor service?
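For clarity, "customizing the dispatcher executor service" refers to the plain OkHttp API sketched below (not the fabric8 wiring); a misbehaving executor supplied here could stall response processing:

```java
import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DispatcherCustomizationSketch {
  public static void main(String[] args) {
    // OkHttp's Dispatcher can be built around a caller-supplied ExecutorService.
    ExecutorService executor = Executors.newFixedThreadPool(2);
    OkHttpClient okHttp = new OkHttpClient.Builder()
        .dispatcher(new Dispatcher(executor))
        .build();
    System.out.println(okHttp.dispatcher().executorService() == executor); // prints true
  }
}
```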
At the moment I'm only running a minimalistic test case that only imports the kube client, so no modifications that I can think of.
I found the problem: it is a dependency conflict. When I created a new empty project and ran the test from there it worked, so I added one dependency at a time and ran the test until it froze, then added some excludes and continued. Thank you for your help!
Interesting that there's some extremely subtle behavioral change rather than a more noticeable error. If possible, can you provide a pom or the dependencies that caused this, so that we can see if there's something more defensive that would be easy to include?
Sorry, I can not provide you with a pom file as it is using internal libs, but here are the parts of the dependency tree and pom that in the end helped me find the problem. I have renamed the library to library:library and removed all non-relevant deps from it here, as it is an internal one.
So the solution was to set the pom file to something like this (these are only the relevant parts):
I think that this is enough for you to recreate the problem. The strange part is that it worked as intended as long as the response from Kubernetes was below a certain size; the limit was somewhere between 1.5 MB and 6.8 MB.
Quite strange. Tried to reproduce a couple of times with 10,000 configmaps making for over 7 MB in a single list, but no luck. My best guess is still a Throwable that is not being logged by OkHttpClientImpl nor the default uncaught exception handler. If we see something like this again, we should definitely go ahead and change those to catch Throwable just in case.
I think that I found another thing that can be related: we use log4j in our project, and when I enabled that in my test project it stopped working.
Found the problem with the logging: we have the log level set to trace globally and save the logs to a database, and when the large response body is logged at trace level it gets stuck there.
Thank you for the follow-up, it's good to have a definitive diagnosis. We'll probably want to be more conservative with the size of message we're attempting to log.
Still do not know why io.fabric8.kubernetes.client.http.HttpLoggingInterceptor fails to log a ~7 MB log row.
It's a problem with the interceptor logic. It's skipping the call to the actual consumer once the response goes over 2MiB - this will need to be fixed. For now you should disable the trace logging, at least for the http logger.
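If Log4j 2 is the backing implementation (as mentioned earlier in the thread), one way to silence just the HTTP trace logging programmatically is a sketch like the following; the equivalent can also be done in the logging configuration file:

```java
import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;

public class QuietHttpLoggingSketch {
  public static void main(String[] args) {
    // Keep global trace logging but stop the HTTP interceptor logger from emitting huge bodies.
    Configurator.setLevel("io.fabric8.kubernetes.client.http.HttpLoggingInterceptor", Level.INFO);
  }
}
```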
Describe the bug
I can not get the list of pods in version 6.6.0, but it works in 6.5.0; on 6.5.1 it happens on a few percent of the attempts.
I have tried to set the log level to trace to get more information, but that does not add any additional information regarding what is happening.
This is the request that times out:
client.pods().inNamespace("x").list();
Kubernetes version not in list:
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.1", GitCommit:"e01e27ba1fcd49adae44ef0984abfc6ee999c99a", GitTreeState:"clean", BuildDate:"2023-03-13T18:01:50Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Fabric8 Kubernetes Client version
6.6.0
Steps to reproduce
create a client then run:
client.pods().inNamespace("x").list();
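Spelled out as a self-contained sketch (the namespace name is the placeholder from the report; the client picks up the usual kubeconfig or in-cluster configuration by default):

```java
import io.fabric8.kubernetes.api.model.PodList;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class ListPodsReproducer {
  public static void main(String[] args) {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      // This is the call that hangs in the reported environment.
      PodList pods = client.pods().inNamespace("x").list();
      System.out.println("Pods: " + pods.getItems().size());
    }
  }
}
```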
Expected behavior
Get a PodList object.
Runtime
Kubernetes (vanilla)
Kubernetes API Server version
next (development version)
Environment
Linux
Fabric8 Kubernetes Client Logs
Additional context
The namespace contains 104 pods.
Pod Security Standards is enabled in the namespace at baseline level.