1.0 stabilization #124
As discussed in the last SIG instrumentation meeting, we plan to do a first stable release of kube-state-metrics. As we have mostly been adding functionality for a while, rather than changing existing functionality, there is nothing fundamental to change here.

Comments
I think we'd probably recommend documenting what the compatibility expectations of that feature are going forward (in a doc in that repo), making sure there is a process for API changes reasonably consistent with those goals, and making sure the feature repo contains an issue for the graduated feature.
@loburm will help with scalability tests
@brancz for the scalability testing we need to have some scenario to test against. I think we should concentrate mostly on testing metrics related to nodes and pods (I assume the other parts consume a significantly smaller amount of resources). How many nodes and pods should be present in the test scenario?
@loburm I'm completely new to the load tests, so I suggest starting with whatever seems reasonable to you. My thoughts are the same as yours: the number of pod metrics is expected to increase linearly with the number of other objects, so focusing on those and on nodes sounds perfect for our load scenarios. Testing with the recommended upper bound of pods/nodes in a single cluster would be best to see if we can actually handle it, but I'm not sure that's reasonable given that we have never performed load tests before.
We had a chat offline and we will try to test the following scenarios:
@loburm will verify:
One issue (that we can only do so much about) is the size of `/metrics` and the time it takes Prometheus to scrape it. Putting some bound on that could inform future decisions on adding metrics.
Thanks for the heads up @matthiasr! Yes, that's one of the bottlenecks I can see happening. We may have to start thinking of sharding strategies for kube-state-metrics.
Do you think it would be possible to have some kind of pagination support for `/metrics`?
How about we split the scrape endpoints according to the different collectors, i.e., pod state metrics would be available on their own endpoint. Or, how about we support both the combined `/metrics` endpoint and the per-collector endpoints? Cons: this will make Prometheus configuration more complicated.
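For illustration only, a rough sketch of this per-collector split using the Prometheus Go client; this is not how kube-state-metrics is actually wired, and the metric names and labels are simplified stand-ins. Each path would become its own scrape target, which is exactly the extra Prometheus configuration cost mentioned above:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// One registry per collector group, each exposed on its own scrape path.
	podRegistry := prometheus.NewRegistry()
	nodeRegistry := prometheus.NewRegistry()

	// Simplified stand-ins for the real pod and node collectors.
	podInfo := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "kube_pod_info", Help: "Information about pods."},
		[]string{"namespace", "pod"},
	)
	nodeInfo := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "kube_node_info", Help: "Information about nodes."},
		[]string{"node"},
	)
	podRegistry.MustRegister(podInfo)
	nodeRegistry.MustRegister(nodeInfo)

	// Each path has to be configured as a separate scrape target in Prometheus.
	http.Handle("/metrics/pods", promhttp.HandlerFor(podRegistry, promhttp.HandlerOpts{}))
	http.Handle("/metrics/nodes", promhttp.HandlerFor(nodeRegistry, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```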
What @andyxning is suggesting is certainly possible, but is likely to just postpone the problem. @piosz I'm not aware of any precedent for that, but paging within the same instance of kube-state-metrics would also just postpone the problem, as I can imagine that memory consumption is also very large in cases where response timeouts are hit.
@andyxning I think that will add unnecessary complexity, and as I understand it the common convention is to expose all metrics on `/metrics`.

And let me first perform some tests; once we have some real numbers, we can start thinking about possible issues and how they can be solved.
Completely agree with @loburm, measure first.
Yeah, I didn't mean this as "needs immediate changes", but it would be good to measure and monitor for regressions. Our cluster is fairly sizable, and the response is big, but not unmanageable. For now I'd just like to have a rough idea of what to expect as we grow the cluster further :) Even "if your cluster has >10k pods, raise the scrape timeout to at least 20s" is something to work with.
Agreed with @loburm. Btw, it still means adding more Prometheus configuration for a single cluster. :)
@loburm any updates on how the scalability tests are coming along?
Yesterday I finished testing kube-state-metrics on 100 and 500 node clusters. Today I'm trying to run it on a 1000 node cluster, but I'm having small problems with the density test. Based on the first numbers I can say that memory, CPU, and latency depend on the number of nodes almost linearly. I'll prepare a small report soon and share it with you.
Sorry that it took so much time; running the scalability test on a 1000 node cluster was a bit tricky. I have written all the numbers down in this doc: https://docs.google.com/document/d/1hm5XrM9dYYY085yOnmMDXu074E4RxjM7R5FS4-WOflo/edit?usp=sharing
Thank you very much @loburm. Overall I see no concerns around scalability. In fact, we are quite surprised the memory usage stays that low. That should make us good to go for 1.0 soon.
I am curious about the three stages in the doc ("Empty cluster - cluster without pods (only a system one present)", "Loaded - 30 pods per node in average", "After request - cpu and memory usage during metrics fetching"). Can you please explain them in more detail? :) Does "only a system one present" mean only one system pod? And what is the difference between "Loaded" and "After request"?
Sweet! Should we distill this into a recommendation for resources? 2MB per node (minimum 200MB) + 0.001 cores per node (minimum 0.01)?
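Purely as a sketch of the rule of thumb above (the numbers come from this comment, not from any official recommendation), the heuristic could be expressed as:

```go
package main

import "fmt"

// recommend returns a rough memory (MB) and CPU (cores) request for
// kube-state-metrics given the cluster's node count, using the per-node
// heuristic from the comment above: 2 MB and 0.001 cores per node, with
// floors of 200 MB and 0.01 cores.
func recommend(nodes int) (memoryMB, cpuCores float64) {
	memoryMB = 2 * float64(nodes)
	if memoryMB < 200 {
		memoryMB = 200
	}
	cpuCores = 0.001 * float64(nodes)
	if cpuCores < 0.01 {
		cpuCores = 0.01
	}
	return memoryMB, cpuCores
}

func main() {
	for _, n := range []int{100, 500, 1000} {
		mem, cpu := recommend(n)
		fmt.Printf("%d nodes: ~%.0f MB memory, ~%.3f cores\n", n, mem, cpu)
	}
}
```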
@andyxning an empty cluster has only the pods that belong to the kube-system namespace and those created by the scalability test at the beginning: on average that's around 4-5 pods per node to start with, so at the end we really have 34-35 pods per node. "Loaded" was measured once the cluster had stabilized after all pods were created. "After request" means after fetching metrics from "/metrics" - that really increases memory usage and gives a short peak in CPU usage.
@loburm Got it. Thanks for the detailed explanation.
Thanks @loburm. It seems that from a scalability point of view kube-state-metrics is ready for 1.0.
Let's do RCs.
rc.1 is out: I published quay.io/coreos/kube-state-metrics:v1.0.0-rc.1 for testing, and @loburm will publish the image on gcr.io within the next half hour.
@loburm has now published the image on gcr: gcr.io/google_containers/kube-state-metrics:v1.0.0-rc.1
Some additional metrics from a reasonably large production cluster (on 1.0.0 plus the fix for the owner NPE).
@smarterclayton good to know
OK, I confirm that the last unchecked item, scaling with cluster size using the pod nanny, is done.
#200 has added support for providing a deployment manifest that scales with cluster size using the pod nanny, so I'm closing this now.