ML autoscaling decider always reports a storage capacity of 0 #72452
Comments
Pinging @elastic/ml-core (Team:ML)
Pinging @elastic/es-distributed (Team:Distributed)
Does returning 0 for storage cause a problem today? The ML nodes are dedicated ML nodes, so they store no indices. We need to carefully consider whether having ML return some arbitrary value for storage is going to cause more problems in the future than it solves today. For example, suppose we change ML to return the current amount of disk space in that field. Then in the future somebody thinks that means ML actually needs that amount of disk space and changes the Cloud code to take it into account. That then restricts which nodes ML can scale onto, because some of them don't have enough storage space even though ML doesn't actually need it. Would a good solution be that the …
I think I'll have to rely on this value for ECK because the observed capacity is not always the one claimed on a K8S cluster.
@barkbay, ML doesn't reliably get the storage information all the time, and ML doesn't require storage (at least not for scaling). I believe there were discussions in the past of having that value be …
I do think the better "fix" is for the ML decider to not return storage at all. But the overall decision returned (the top-level object) may still include storage information due to other deciders.
I created this issue on the ECK side to document various problems we have when handling autoscaling and storage capacity on K8S. The problem I was referring to is mostly the first one:
👍 I think I can just ignore the observed capacity if the required storage capacity is 0; that would solve the problem in the context of ML.
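A minimal sketch of that rule, assuming hypothetical types and function names rather than ECK's actual API:

```go
// Hypothetical ECK-side guard (illustrative only): when a policy's required
// storage capacity is zero, the observed (current) storage capacity is
// disregarded rather than treated as authoritative.
package autoscaling

// Capacity holds resource quantities in bytes.
type Capacity struct {
	Storage int64
	Memory  int64
}

// PolicyCapacity pairs what a decider requires with what Elasticsearch reports.
type PolicyCapacity struct {
	Required Capacity // required_capacity.total from the autoscaling response
	Observed Capacity // current_capacity.total from the autoscaling response
}

// effectiveStorage returns the storage value to act on and whether storage
// should be reconciled at all. A required storage of 0 (as for the ML decider)
// means the observed value is ignored.
func effectiveStorage(p PolicyCapacity) (int64, bool) {
	if p.Required.Storage == 0 {
		return 0, false // no storage requirement: ignore observed capacity
	}
	return p.Observed.Storage, true
}
```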
We'd think that signalling "0b" for storage should mean that no storage is needed. If so, I think "0b" is the right response here, and perhaps ECK should interpret that as no data path being needed?
ML does need a data path. It's used as disk space to temporarily overflow forecast requests. This means it could go in the temp directory, but I believe there was a discussion between Hendrik and Simon about 5 years ago where they decided to put it in the data directory instead of the temp directory, as the temp directory might not have very much space at all and it was expected at the time that the data directory would always have several gigabytes free. There doesn't need to be much space in the data directory for this purpose. The ML forecasting code will just fail a forecast if it won't fit in memory or on disk. But it's better to have some space than none. If supplying …
That makes sense to me. ML seems to be the right place to make such a decision on the required size (and can base it on the jobs etc. if necessary). In the same change, it would be nice to clarify the need for a …
Elasticsearch version (bin/elasticsearch --version): v7.12.1
OS version (uname -a if on a Unix-like system): Elastic Cloud
Description of the problem including expected versus actual behavior:
The machine learning decider always reports a current storage capacity of 0 which, I think, is wrong. GET _nodes/<node_id>/stats/fs reports the actual capacity, so I guess it's a bug in the decider. I understand that storage is not a resource for which the ML decider may request capacity; nevertheless, I'm wondering whether this should be fixed.
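For context, the behaviour is visible through the autoscaling capacity API. The sketch below is illustrative only: the policy name, the byte values, and the exact response shape are assumptions, and the response is abbreviated.

```
GET _autoscaling/capacity
```

An abbreviated response for a hypothetical ml policy might look like this, with storage reported as 0 even though the nodes' filesystems have real capacity:

```json
{
  "policies": {
    "ml": {
      "required_capacity": {
        "node":  { "memory": 1073741824 },
        "total": { "memory": 1073741824 }
      },
      "current_capacity": {
        "node":  { "storage": 0, "memory": 2147483648 },
        "total": { "storage": 0, "memory": 2147483648 }
      }
    }
  }
}
```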