
ML autoscaling decider always reports a storage capacity of 0 #72452

Open
barkbay opened this issue Apr 29, 2021 · 9 comments
Labels
>bug, :Distributed Coordination/Autoscaling, :ml, Team:Distributed (Obsolete), Team:ML

Comments

@barkbay
Contributor

barkbay commented Apr 29, 2021

Elasticsearch version (bin/elasticsearch --version): v7.12.1

OS version (uname -a if on a Unix-like system): Elastic Cloud

Description of the problem including expected versus actual behavior:

The machine learning decider always reports a current storage capacity of 0, which I think is wrong:

    "ml" : {
      "required_capacity" : {
        "node" : {
          "storage" : 0,
          "memory" : 101373116416
        },
        "total" : {
          "storage" : 0,
          "memory" : 101373116416
        }
      },
      "current_capacity" : {
        "node" : {
          "storage" : 0,
          "memory" : 101373116416
        },
        "total" : {
          "storage" : 0,
          "memory" : 101373116416
        }
      },
      "current_nodes" : [
        {
          "name" : "instance-0000000003"
        }
      ]
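
(For context, the snippet above is an excerpt of the response from the autoscaling capacity API, presumably obtained with a request along these lines, "ml" being the policy name:)

GET /_autoscaling/capacity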

GET _nodes/<node_id>/stats/fs reports the actual capacity, so I assume this is a bug in the decider:

{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "...",
  "nodes" : {
    "..." : {
      "name" : "instance-0000000003",
      "roles" : [
        "ml",
        "remote_cluster_client"
      ],
      "attributes" : {...},
      "fs" : {
        "timestamp" : 1619687039503,
        "total" : {
          "total_in_bytes" : 8589934592,
          "free_in_bytes" : 8589746176,
          "available_in_bytes" : 8589746176
        },
        "data" : [
          {
            "path" : "/app/data/nodes/0",
            "mount" : "QuotaAwareFileStore(/app (/dev/mapper/lxc-data))",
            "type" : "xfs",
            "total_in_bytes" : 8589934592,
            "free_in_bytes" : 8589746176,
            "available_in_bytes" : 8589746176
          }
        ],
        "io_stats" : { }
      }
    }
  }
}

I understand that storage is not a resource for which the ML decider may request capacity; nevertheless, I'm wondering whether this should be fixed.

barkbay added the >bug, :ml, and needs:triage labels on Apr 29, 2021
elasticmachine added the Team:ML label on Apr 29, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

elasticmachine added the Team:Distributed (Obsolete) label on Apr 29, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@droberts195
Contributor

Does returning 0 for storage cause a problem today?

The ML nodes are dedicated ML nodes, so they store no indices.

We need to carefully consider whether having ML return some arbitrary value for storage is going to cause more problems in the future than it solves today. For example, suppose we change ML to return the current amount of disk space in that field. Then, in the future, somebody thinks that means ML actually needs that amount of disk space and changes the Cloud code to take it into account. That would then restrict which nodes ML can scale onto, because some of them wouldn't have enough storage space even though ML doesn't actually need it.

Would a good solution be that the storage field doesn't exist at all in ML responses?

@barkbay
Contributor Author

barkbay commented Apr 29, 2021

Does returning 0 for storage cause a problem today?

I think I'll have to rely on this value for ECK, because on a K8s cluster the observed capacity is not always the capacity that was claimed.
For correctness, I think it would be safe to either:

  • Not return a request and a capacity for a resource that is not used by the decider
  • Return the correct value

@benwtrent
Member

@henningandersen ^

@barkbay, ML doesn't reliably get the storage information, and ML doesn't require storage (at least not for scaling).

I believe there were discussions in the past about having that value be null if not set at all, but it was decided that was not the best course of action.

I do think the better "fix" is for the ML decider to not return storage at all. But the overall decision returned (the top-level object) may still include storage information due to other deciders.
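
As a purely illustrative sketch (not an existing response format, just the fields from the excerpt above with storage dropped), the ML capacity entry would then only carry memory figures:

      "required_capacity" : {
        "node" : {
          "memory" : 101373116416
        },
        "total" : {
          "memory" : 101373116416
        }
      }

while the overall, aggregated capacity could still report a non-zero storage value contributed by a storage-aware decider.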

gwbrown removed the needs:triage label on Apr 29, 2021
@barkbay
Contributor Author

barkbay commented May 3, 2021

I created this issue on the ECK side to document various problems we have when handling autoscaling and storage capacity on K8S. The problem I was referring to is mostly the first one:
ECK doesn't know, or can't predict, the available capacity of a volume because some filesystems reserve space (mostly the case for ext4, 5% by default; only a few MB for XFS). Because of this reserved capacity, the capacity available to Elasticsearch might be smaller than the one in the K8s claim, which may delay a scale-up event. We should compare the required capacity to the "observed" capacity as reported by the autoscaling API to understand when a scale-up must be triggered.
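As a rough illustration (made-up numbers): a 10 GiB ext4 claim with the default 5% reservation leaves only about 9.5 GiB visible to Elasticsearch, so if a decider requires 10 GiB and ECK checks that against the 10 GiB claim, the requirement looks satisfied even though Elasticsearch only observes ~9.5 GiB, and the scale-up is delayed.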

I do think the better "fix" is for the ML decider to not return storage at all. But, the overall decision returned (top level object), may still return storage information due to other deciders.

👍

I think I can just ignore the observed storage capacity if the required storage capacity is 0; that would solve the problem in the context of ML.

@henningandersen
Contributor

We'd think that signalling "0b" for storage should mean that no path.data is needed. I think ML can run without a data path, but I am not 100% certain. Can someone from the ML team confirm?

If so, I think "0b" is the right response here and perhaps ECK should interpret that as no data path needed?

@droberts195
Contributor

ML does need a data path. It's used as disk space to temporarily overflow forecast requests. That data could instead go in the temp directory, but I believe there was a discussion between Hendrik and Simon about 5 years ago where they decided to put it in the data directory rather than the temp directory, as the temp directory might not have very much space at all, while it was expected at the time that the data directory would always have several gigabytes free.

There doesn't need to be much space in the data directory for this purpose. The ML forecasting code will just fail a forecast if it won't fit in memory or on disk. But it's better to have some space than none.

If supplying 0b is going to cause a problem because zero will have a special meaning, then we could have ML supply some simple number that is higher than zero.
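
Purely as an illustration (made-up value, reusing the shape of the excerpt above), that might surface as something like:

          "storage" : 1073741824,
          "memory" : 101373116416

i.e. a small fixed storage figure (here ~1GB) instead of 0, just so that zero never acquires a special meaning.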

@henningandersen
Contributor

That makes sense to me. ML seems to be the right place to make such a decision on the required size (and can base it on the jobs etc if necessary).

In the same change, it would be nice to clarify the need for path.data in the node documentation, which currently sort of hints that no data path is needed for ML.
