Is your feature request related to a problem?
We released the ml-commons plugin in OpenSearch 1.3. It supports model training and prediction. ML models generally consume more resources, especially during training. The community wants support for bigger ML models, which may require more resources and special hardware like GPUs.
As OpenSearch doesn't support an ML node role, we dispatch ML tasks to data nodes only. That means if a user wants to train a large model, they need to scale up all data nodes, which can be costly. ML tasks also consume shared resources on data nodes, which may impact the core search/indexing functions.
What solution would you like?
Support a dedicated ML node so that users don't need to scale up their data nodes at all. Instead, they can configure a new ML node (with different settings and a more powerful instance type) and add it to the cluster via the YAML file (requires a cluster restart). By doing so, users can separate resource usage better by running ML tasks on dedicated nodes, reducing the impact on other critical tasks like search/ingestion.
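A minimal sketch of what the YAML configuration for such a node might look like. This assumes the `ml` role name and reuses the standard `node.roles` setting; the exact setting accepted once dynamic node roles land may differ.

```yaml
# opensearch.yml on the dedicated ML node (illustrative; requires the
# dynamic-node-role change in core so a non-built-in role is accepted)
node.name: ml-node-1
node.roles: [ ml ]
```

Because node roles are read from the YAML file at startup, adding such a node to an existing cluster requires a restart of that node, as noted above.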
OpenSearch core checks the node role when a node starts. If the role is not a built-in role like the data role, it throws an exception and the node can't start. To support a dedicated ML node, we have to remove this limitation in OpenSearch core. That is done in this PR, which adds dynamic node role support: opensearch-project/OpenSearch#3436.
With that, we can enhance the ml-commons code to dispatch tasks to ml nodes first, falling back to data nodes if no ml nodes exist.
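The dispatch-with-fallback idea above could be sketched as follows. This is a hypothetical illustration, not the actual ml-commons API; the `Node` record, the `eligibleNodes` method, and the role strings are all assumptions made for the example.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: prefer nodes with the "ml" role when dispatching an
// ML task, and fall back to data nodes when no ml nodes are present.
public class MlTaskDispatcher {
    // Simplified stand-in for a cluster node and its roles.
    public record Node(String id, List<String> roles) {}

    public static List<Node> eligibleNodes(List<Node> clusterNodes) {
        List<Node> mlNodes = clusterNodes.stream()
                .filter(n -> n.roles().contains("ml"))
                .collect(Collectors.toList());
        if (!mlNodes.isEmpty()) {
            return mlNodes; // dedicated ML nodes take priority
        }
        // Fallback: dispatch to data nodes as ml-commons does today.
        return clusterNodes.stream()
                .filter(n -> n.roles().contains("data"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("n1", List.of("data")),
                new Node("n2", List.of("ml")));
        // With an ml node present, only it is eligible.
        System.out.println(eligibleNodes(cluster).get(0).id()); // n2
    }
}
```

The real task dispatcher would also need to consider node load and task type, but the priority order (ml nodes first, then data nodes) is the core of the proposal.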
Do you have any additional context?
Original Proposal