
ML autoscaling decider always reports a storage capacity of 0 #72452

Open
barkbay opened this issue Apr 29, 2021 · 9 comments
Labels
>bug, :Distributed Coordination/Autoscaling, :ml, Team:Distributed (Obsolete), Team:ML

Comments

@barkbay
Contributor

barkbay commented Apr 29, 2021

Elasticsearch version (bin/elasticsearch --version): v7.12.1

OS version (uname -a if on a Unix-like system): Elastic Cloud

Description of the problem including expected versus actual behavior:

The machine learning decider always reports a current storage capacity of 0, which I think is wrong:

    "ml" : {
      "required_capacity" : {
        "node" : {
          "storage" : 0,
          "memory" : 101373116416
        },
        "total" : {
          "storage" : 0,
          "memory" : 101373116416
        }
      },
      "current_capacity" : {
        "node" : {
          "storage" : 0,
          "memory" : 101373116416
        },
        "total" : {
          "storage" : 0,
          "memory" : 101373116416
        }
      },
      "current_nodes" : [
        {
          "name" : "instance-0000000003"
        }
      ]
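
(For context, the snippet above is an excerpt of the response from the autoscaling capacity API, presumably obtained with a request along these lines, "ml" being the policy name:)

GET /_autoscaling/capacity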

GET _nodes/<node_id>/stats/fs reports the actual capacity, so I assume this is a bug in the decider:

{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "...",
  "nodes" : {
    "..." : {
      "name" : "instance-0000000003",
      "roles" : [
        "ml",
        "remote_cluster_client"
      ],
      "attributes" : {...},
      "fs" : {
        "timestamp" : 1619687039503,
        "total" : {
          "total_in_bytes" : 8589934592,
          "free_in_bytes" : 8589746176,
          "available_in_bytes" : 8589746176
        },
        "data" : [
          {
            "path" : "/app/data/nodes/0",
            "mount" : "QuotaAwareFileStore(/app (/dev/mapper/lxc-data))",
            "type" : "xfs",
            "total_in_bytes" : 8589934592,
            "free_in_bytes" : 8589746176,
            "available_in_bytes" : 8589746176
          }
        ],
        "io_stats" : { }
      }
    }
  }
}

I understand that storage is not a resource for which the ML decider may request capacity; nevertheless, I'm wondering whether this should be fixed.

barkbay added the >bug, :ml, and needs:triage labels on Apr 29, 2021
elasticmachine added the Team:ML label on Apr 29, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

elasticmachine added the Team:Distributed (Obsolete) label on Apr 29, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@droberts195
Contributor

Does returning 0 for storage cause a problem today?

The ML nodes are dedicated ML nodes, so they store no indices.

We need to carefully consider whether having ML return some arbitrary value for storage is going to cause more problems in the future than it solves today. For example, suppose we change ML to return the current amount of disk space in that field. Then, in the future, somebody thinks that means ML actually needs that amount of disk space and changes the Cloud code to take it into account. That would then restrict which nodes ML can scale onto, because some of them wouldn't have enough storage space even though ML doesn't actually need it.

Would a good solution be that the storage field doesn't exist at all in ML responses?

@barkbay
Contributor Author

barkbay commented Apr 29, 2021

Does returning 0 for storage cause a problem today?

I think I'll have to rely on this value for ECK, because on a K8s cluster the observed capacity is not always the capacity that was claimed.
For correctness, I think it would be safe to either:

  • Not return a request and a capacity for a resource that is not used by the decider
  • Return the correct value

@benwtrent
Member

@henningandersen ^

@barkbay, ML doesn't reliably get the storage information, and ML doesn't require storage (at least not for scaling).

I believe there were discussions in the past about having that value be null if not set at all, but it was decided that was not the best course of action.

I do think the better "fix" is for the ML decider to not return storage at all. But the overall decision returned (the top-level object) may still include storage information due to other deciders.
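
As a purely illustrative sketch (not an existing response format, just the fields from the excerpt above with storage dropped), the ML capacity entry would then only carry memory figures:

      "required_capacity" : {
        "node" : {
          "memory" : 101373116416
        },
        "total" : {
          "memory" : 101373116416
        }
      }

while the overall, aggregated capacity could still report a non-zero storage value contributed by a storage-aware decider.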

gwbrown removed the needs:triage label on Apr 29, 2021
@barkbay
Contributor Author

barkbay commented May 3, 2021

I created this issue on the ECK side to document various problems we have when handling autoscaling and storage capacity on K8S. The problem I was referring to is mostly the first one:
ECK doesn't know, or can't predict, the available capacity of a volume because some filesystems reserve space (mostly the case for ext4, 5% by default; only a few MB for XFS). Because of this reserved capacity, the capacity available to Elasticsearch might be smaller than the one in the K8s claim, which may delay a scale-up event. We should compare the required capacity to the "observed" capacity as reported by the autoscaling API to understand when a scale-up must be triggered.
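As a rough illustration (made-up numbers): a 10 GiB ext4 claim with the default 5% reservation leaves only about 9.5 GiB visible to Elasticsearch, so if a decider requires 10 GiB and ECK checks that against the 10 GiB claim, the requirement looks satisfied even though Elasticsearch only observes ~9.5 GiB, and the scale-up is delayed.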

I do think the better "fix" is for the ML decider to not return storage at all. But, the overall decision returned (top level object), may still return storage information due to other deciders.

👍

I think I can just ignore the observed storage capacity if the required storage capacity is 0; that would solve the problem in the context of ML.

@henningandersen
Contributor

We'd think that signalling "0b" for storage should mean that no path.data is needed. I think ML can run without a data path, but I am not 100% certain. Can someone from the ML team confirm?

If so, I think "0b" is the right response here and perhaps ECK should interpret that as no data path needed?

@droberts195
Contributor

ML does need a data path. It's used as disk space to temporarily overflow forecast requests. That data could instead go in the temp directory, but I believe there was a discussion between Hendrik and Simon about 5 years ago where they decided to put it in the data directory rather than the temp directory, as the temp directory might not have very much space at all, while it was expected at the time that the data directory would always have several gigabytes free.

There doesn't need to be much space in the data directory for this purpose. The ML forecasting code will just fail a forecast if it won't fit in memory or on disk. But it's better to have some space than none.

If supplying 0b is going to cause a problem because zero will have a special meaning, then we could have ML supply some simple number that is higher than zero.
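
Purely as an illustration (made-up value, reusing the shape of the excerpt above), that might surface as something like:

          "storage" : 1073741824,
          "memory" : 101373116416

i.e. a small fixed storage figure (here ~1GB) instead of 0, just so that zero never acquires a special meaning.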

@henningandersen
Contributor

That makes sense to me. ML seems to be the right place to make such a decision on the required size (and can base it on the jobs etc if necessary).

In the same change, it would be nice to clarify the need for path.data in the node documentation, which currently sort of hints that no data path is needed for ML.
