Support dynamic node role #2877

ylwu-amzn · 2022-04-12T20:30:32Z

OpenSearch doesn't support ML node role currently, check DiscoveryNodeRole.java, no ML role defined.

We need dedicated ML node for ML plugin

We released a new ML plugin, ml-commons, in 1.3, and we are planning to add more models.

ML models generally consume more resources, especially during training. We are going to support bigger ML models, which might require more resources and special hardware such as GPUs. As OpenSearch doesn't support ML nodes, we dispatch ML tasks to data nodes only. That means if users want to train a big model, they need to scale up all data nodes, which is costly and unreasonable. If we support a dedicated ML node, users don't need to scale up their data nodes at all; they just need to configure a new ML node (with different settings and a more powerful instance type) and add it to the cluster. Running ML tasks on dedicated nodes also separates resource usage better, which reduces the impact on other tasks like search and ingestion.

What changes we need to make

OpenSearch

Add new node type ml in DiscoveryNodeRole.java
Support configuring ML node role in opensearch.yml
Security: should be OK, as ML nodes communicate with other nodes via transport actions.

[Updated solution]: We plan to let users configure any node role name in the opensearch.yml file. A node will read unknown roles as dynamic roles rather than failing to start. A dynamic role's abbreviation will be the same as its full name. To avoid confusion, role full names and abbreviations are treated as case-insensitive and must be lower-case. We will return the full names of all roles, both built-in and dynamic, in a new field node.roles when users call the _cat/nodes API. Check more details in the discussion below.
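
A minimal sketch of the role-resolution behaviour described above, written in Python for brevity rather than in the Java of OpenSearch core. The `NodeRole` class, the `resolve_roles` function and the three-entry built-in table are assumptions made for illustration; the authoritative list of built-in roles lives in DiscoveryNodeRole.java.

```python
from dataclasses import dataclass

# Illustrative subset of built-in roles (full name -> abbreviation).
# This table is an assumption for the sketch, not the complete core list.
BUILT_IN_ROLES = {"master": "m", "data": "d", "ingest": "i"}


@dataclass(frozen=True)
class NodeRole:
    name: str
    abbreviation: str
    dynamic: bool


def resolve_roles(configured):
    """Turn a node.roles list (as read from opensearch.yml) into NodeRole objects."""
    roles = []
    for name in configured:
        if name != name.lower():
            # Role names must be lower-case, per the rule stated above.
            raise ValueError(f"node role [{name}] must be lower-case")
        if name in BUILT_IN_ROLES:
            roles.append(NodeRole(name, BUILT_IN_ROLES[name], dynamic=False))
        else:
            # Unknown role: keep it as a dynamic role instead of failing startup.
            # Its abbreviation is the same as its full name.
            roles.append(NodeRole(name, name, dynamic=True))
    return roles
```

With this sketch, `resolve_roles(["data", "ml"])` yields a built-in `data` role and a dynamic `ml` role, while `resolve_roles(["ML"])` is rejected.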

ML plugin

If ML node exists, dispatch ML task to ML node; otherwise, dispatch to data node. Support dedicated ml node ml-commons#79
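
The dispatch rule above could look roughly like the following sketch. It is not the ml-commons implementation: the node representation (a dict with `name` and `roles`) and the function name `eligible_nodes` are hypothetical.

```python
def eligible_nodes(nodes, preferred_role="ml", fallback_role="data"):
    """Return the nodes holding preferred_role; if none exist, fall back to fallback_role.

    Each node is a dict such as {"name": "node-1", "roles": ["data", "ingest"]}.
    """
    preferred = [n for n in nodes if preferred_role in n["roles"]]
    if preferred:
        return preferred
    # No dedicated node in the cluster: keep today's behaviour and use data nodes.
    return [n for n in nodes if fallback_role in n["roles"]]
```

A cluster without any ml node therefore behaves exactly as before, which keeps the change optional for existing users.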


dblock · 2022-04-13T18:39:25Z

I like the problem that is being solved here (some tasks may compete/work better on specialized hardware and provisioning only some nodes to be GPU-enabled vs. others not be GPU enabled is cheaper) and understand what we're trying to accomplish (send certain tasks to certain nodes).

My suggestion is to generalize this to avoid solving this problem for each type of workload.

This can be treated as a request routing problem. For example, I'd like to be able to direct all search requests to search-only nodes (separate read/write). In #2859 I'd like to route requests in a weighted round robin. In #2447 I'd like not to have to install plugins on all nodes, and thus isolate execution between plugins across different hardware nodes (same as only some nodes are "ml" nodes because they have the "ml" plugin installed).

I think the generic version of this is to 1) make node roles dynamic (vs. hard-coded in Java), 2) make nodes advertise roles at runtime and be able to drop/add roles without restarts, 3) allow plugins to register and advertise roles at runtime, 4) enable routing to nodes that support a set of roles.

It doesn't have to be an existing "role" concept, it could be a new "capability" or "tag" concept.

If we implement this generically, a node that has the ML plugin installed would advertise a "plugin:ml" role and requests can be routed to it. A node that has a specific kind of GPU could advertise "os:gpu" and "os:gpu:NVIDIA T4". Round robin routing could then be policy-based with potential fallback (e.g. prefer nodes with "os:gpu" at 2x weight). Will need a way to say that an ML workload must be routed to an ML-capable node, but some other workload has a preference of ML-capable node and can fallback to a non-ML-capable data node.
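
To make the "must have" versus "prefers" distinction concrete, here is a rough sketch under stated assumptions: nodes are dicts with a `tags` set, `required` tags filter candidates, and `preferred` maps a tag to a weight multiplier. All names are hypothetical, and weighted random selection is used as a simple stand-in for weighted round robin.

```python
import random


def node_weight(node, preferred):
    """Multiply the base weight 1.0 by the multiplier of every preferred tag the node has."""
    weight = 1.0
    for tag, multiplier in preferred.items():
        if tag in node["tags"]:
            weight *= multiplier
    return weight


def pick_node(nodes, required=(), preferred=None, rng=random):
    """Pick one node that advertises every required tag, favouring preferred tags."""
    preferred = preferred or {}
    candidates = [n for n in nodes if set(required) <= n["tags"]]
    if not candidates:
        # A hard requirement cannot be met, so there is no fallback.
        raise LookupError(f"no node advertises all of {sorted(required)}")
    weights = [node_weight(n, preferred) for n in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

An ML workload that must run on an ML-capable node would call `pick_node(nodes, required=["plugin:ml"])`, while a workload that merely prefers GPUs would call `pick_node(nodes, preferred={"os:gpu": 2.0})` and could still land on a plain data node.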

penghuo · 2022-04-14T00:49:41Z

Two questions,

Does an ML node contain index data?
Do we consider the term compute node instead of ml node? In my opinion, ML algorithms are a type of computation task, but we could schedule any computation task on a compute node, e.g. an OpenSearch pipeline aggregation could execute in a distributed fashion on compute nodes.

jainankitk · 2022-04-14T23:54:42Z

Can you add some more details on ML node interaction with storage layer? More specifically:

How is the ML node handled from shard allocation perspective? If ML nodes don't have any shards like master, do they fetch data from other nodes while processing ML task? Or we temporarily relocate the shard to ML nodes?

reta · 2022-04-18T15:18:03Z

It doesn't have to be an existing "role" concept, it could be a new "capability" or "tag" concept.

I like the idea of introducing the capabilities (fe search, ingest, data, ...). In this case the built-in roles could be assembled out of capabilities (to support existing features) but the routing / placement policies could rely on capabilities instead. For some capabilities we still need the support in core (fe could node store data or not), but others could be added dynamically (like gpu, ssd, ...) at any time.

ylwu-amzn · 2022-04-20T21:55:44Z

@penghuo Thanks for the question/suggestion

Does an ML node contain index data?

Node can have 1 or multiple node roles. If a node has only "ml" node role, it won't store index data (no shard on it). If a node has both "data" and "ml" node roles, shard will be allocated to it. Basically only allocate shard to node which has "data" node role.

Do we consider the term compute node instead of ml node? In my opinion, ML algorithms are a type of computation task, but we could schedule any computation task on a compute node, e.g. an OpenSearch pipeline aggregation could execute in a distributed fashion on compute nodes.

Good point. I think it's good to use a more generic role name like "computation", so any computation-intensive tasks can run on "computation" nodes.

ylwu-amzn · 2022-04-20T22:02:53Z

Can you add some more details on ML node interaction with storage layer? More specifically:

How is the ML node handled from shard allocation perspective? If ML nodes don't have any shards like master, do they fetch data from other nodes while processing ML task? Or we temporarily relocate the shard to ML nodes?

@jainankitk thanks, this is a good question.
We are not going to change the shard allocation part, at least for the short-term view. It's possible that we meet data transfer bottleneck in future, then we may make some improvement like move ML model closer to data(like run model on shard level) or move data closer to ML model (transfer data to ML node). But this is not one-way-door. We can make continuous improvement and the improvement could be in ML plugin rather than OpenSearch core, unless that improvement is general and will benefit other use cases.

ylwu-amzn · 2022-04-20T22:36:42Z

@reta @dblock thanks for your suggestion. Like this

It doesn't have to be an existing "role" concept, it could be a new "capability" or "tag" concept.

After reading all comments and brainstorming possible solutions, I think we have multiple options to support dedicated node to run specific tasks, and ML task is just one example, maybe we can support other special tasks like image processing.

The main reasons why we need dedicated nodes for specific tasks like ML are

ML tasks use the shared resources of data nodes, which may impact core functions. We can limit the memory usage and thread pool of ML tasks, but it is still possible to impact core functions like searching and indexing.
Costly to scale. If some ML tasks need special hardware like GPUs, we need to scale up all data nodes, which is costly and unreasonable.

We may have these options to solve the problems. Welcome any more suggestions/options.

Option1: add new node role

OpenSearch doesn't support ML node role currently, check DiscoveryNodeRole.java , no ML role defined.
Add a new role for ML/computation-intensive tasks, like “ml” or more generally “computation”. Prototype code change link. I started a multi-node cluster (1 ML node, 2 data nodes, 3 master nodes) and it works well: all ML tasks route to the ML node and no shards are allocated to the ML node.

Pros:

Less effort. Just leverage current node role framework.
ML node doesn't have shards. ML node failure has no impact on cluster state.

Cons:

Node roles are configured in opensearch.yml file. User can’t change node role on the fly.
If we have a new task type which needs to run on its own dedicated node in future, we need to add a new node role. But that depends on how possible it looks that we are going to add many new task types and if the new task types can use the general "computation" node role.

Option 2: enhance current node role framework

We keep some reserved/predefined roles like master/data/ingest. Enhance current node role framework to support any custom role and can assign/change node role on the fly.

Need to consider some challenges:

If change ML node role to data node role, shard will be allocated to it. If change data node role to ML, shard will be relocated to other data nodes. Shard moving between nodes may cause cluster performance down or unstable state. We can add some limitation like user can’t change role between data role with other roles, or user can’t change from/to reserved roles like master/data.
Assume one node role is dedicated for some kind of task, for example “ML” node role for ML tasks, “image_processing” role is for processing images tasks. If we change a node role from “ML” to “image_processing”, the ML tasks on it must stop immediately and maybe migrate to other ML node. This flexibility is good to let user repurpose node efficiently, but also risky to stop some running tasks especially if the task doesn’t support migration to other node. So user should be careful about changing node roles.
We need to persist the node role information somewhere (like OpenSearch index, or some local state file), so node will pick up correct latest node role after restarting. Compared with the static opensearch.yml file, we need to pay more attention on data consistency.

Pros:

Support assigning/changing node role on the fly. User can repurpose node easily.

Cons:

There are some challenges to solve.
User needs to pay attention when assign/change node role on the fly. More flexibility, also more risk.

Option 3: keep current node role framework and add new node tag/capability

We build a brand new concept: "capability" or "tag". For easier discussion, here I just use "tag".

This option doesn’t change node role, so it’s safe for shard allocation and cluster management. But still have same second&third challenge of Option2.

Pros:

No impact to current node role
Support assigning/changing node tags on the fly.

Cons:

If a cluster only has master or data nodes. Adding new node tag can’t separate ML task from data node, it still uses shared resources of data nodes.

Option 4: get rid of current node role framework, build brand new node tag/capability framework

This option looks similar to Option2, just naming the role a "tag"/"capability". We may think about what other features we can build in this new framework.

Pros:

Flexible. We can build any new features in this new framework.

Cons:

BWC: support old node roles in previous versions.
Same challenges of Option2

My conclusion by comparing these options

After comparing these options, I think Option1 is the easiest, and Option2 and Option3 will support more use case like assigning/changing node role/tag on the fly, but also bring more challenges. Option4 is something similar to Option2, but we may have more flexibility to build new features.

So, I would suggest following Option1 as the short-term goal. We can support Option2 later so users have the flexibility to change node roles easily, with some limitations (e.g., can't change data roles). I think Option3 and Option4 are long-term solutions; we can follow these options if we have use cases that can't be solved by the node role framework.

dblock · 2022-04-21T20:26:37Z

Option 1 solves a super narrow use-case, and its main advantage is "less effort". Is the prototype misleading? It doesn't say how calls will be routed to that node. I imagine you'll modify ml-commons/k-nn to say "route this to an ML node" - do you have that code? What happens if I don't have any nodes in my cluster with such role? Will that node sit idle when it's not busy? Does introducing a new node type have other impact, such as upgrades?

Is there a simple version of option 2 where there's no API to change node roles on the fly, but you can define custom node roles in the node's opensearch.yml and have the ML plugin prefer routing requests to a node that has the "ml" role?

node.roles: [ data, ingest, ml, image_processing, fizzbuzz ]

reta · 2022-04-21T21:39:41Z

Is there a simple version of option 2 where there's no API to change node roles on the fly, but you can define custom node roles in the node's opensearch.yml and have the ML plugin prefer routing requests to a node that has the "ml" role?

I think this would be the simplest starting point, we could also bake in the "dynamic" nature of roles (on the fly) for all custom ones but not for builtin ones (those deemed "static").

ylwu-amzn · 2022-04-22T16:35:54Z

Thanks @dblock .

I imagine you'll modify ml-commons/k-nn to say "route this to an ML node" - do you have that code?

Yes, in my prototype we need to filter out ml nodes, then route request. This is the code change ylwu-amzn/ml-commons@6883925.

What happens if I don't have any nodes in my cluster with such role?

That depends on the plugin's logic. In my prototype, if no ml node exists, requests are routed to data nodes. This prototype just demonstrates that we can route requests to ml nodes only when they exist. We may change this later, for example to route requests only to ml nodes and throw an exception if there is no ml node.

Will that node sit idle when it's not busy?

Yes, that's true. We should explain this in our documentation. If user don't run heavy ML tasks, maybe just run it on data node is enough. We need more experiments to find out the best configuration.

Does introducing a new node type have other impact, such as upgrades?

When upgrading, we need to consider BWC for the data layer and the code logic. For the data layer, shards are only allocated on data nodes, so it should be OK. For code logic, each plugin should consider BWC for its own case, especially for mixed clusters. For ml-commons, ML tasks on the old version route to data nodes, so old tasks still consume shared resources on data nodes, but new tasks only route to ml nodes. Both kinds of tasks can run independently, as we don't have distributed models running on both data and ml nodes.

Is there a simple version of option 2 where there's no API to change node roles on the fly, but you can define custom node roles in the node's opensearch.yml and have the ML plugin prefer routing requests to a node that has the "ml" role?

Agree with @reta, this is the simplest and we can add the "dynamic" nature for custom roles later (this is not a one-way door). Actually this is what I meant in Option1. In my prototype, I configured ml roles in the opensearch.yml file, like this

node.name: ml-node-1
node.roles: [ ml ]

And add ml node role in OpenSearch core, check code change in my repo ylwu-amzn@b0915f4

dblock · 2022-04-22T18:00:03Z

Both @reta and I are saying that there's no public static final DiscoveryNodeRole ML_ROLE = in code, which is not option 1. Want to try to code the slightly more dynamic version @ylwu-amzn?

ylwu-amzn · 2022-04-22T19:26:25Z

there's no public static final DiscoveryNodeRole ML_ROLE = in code

Got it. If we don't add public static final DiscoveryNodeRole ML_ROLE = in code, we need to read node role from opensearch.yml dynamically and set it when node start. Looks more dynamic than option1 and less than option2. Sounds good to me. Thanks @dblock @reta . Let's keep this topic open for a while to check if others have any concern.

xinlamzn · 2022-04-22T23:33:12Z

Looks like we are aligned on the modified option #2. Could we summarize the proposal?

Also, what is the impact of an ML node failure on the cluster status? If a node is configured to be ML-only and fails, is there a signal on the cluster state? We need a generic way to detect any node failure.

dblock · 2022-04-25T18:22:20Z

Also, what is the impact of an ML node failure on the cluster status? If a node is configured to be ML-only and fails, is there a signal on the cluster state? We need a generic way to detect any node failure.

I think cluster state needs to get smarter. Naively, I would do this (bold would be new):

RED: some or all of (primary) shards are not ready, no available ml nodes
YELLOW: all of the primary shards allocated, but some/all of the replicas have not been allocated, some ml nodes down
GREEN: fully operational, all ml nodes up

There's a chicken and egg problem with this approach because the cluster doesn't know anything about ml nodes until a node dynamically publishes such a type. It doesn't know the total number of ml nodes, nor does it know whether the number is sufficient to execute tasks. So I think this needs to queue off actual tasks that require an ml node.

RED: some or all of (primary) shards are not ready, there are tasks that require ml nodes, but these cannot be executed at all because they cannot find an ml node and new tasks are being dropped/rejected
YELLOW: all of the primary shards allocated, but some/all of the replicas have not been allocated, tasks that prefer ml nodes are queued excessively, starving for more ml nodes
GREEN: fully operational, tasks that prefer ml nodes don't queue up

I feel like there must be more holes with this approach, so needs more thinking.

ylwu-amzn · 2022-04-27T17:50:29Z

Thanks @xinlamzn, good question! Thanks @dblock for the proposal, that's definitely an option.

I brainstormed possible options and listed them below, please help review:

Option1: user monitor state of custom role nodes by themselves

Keep the current cluster state as is. ML node failure doesn’t impact cluster state. Users should know how many ML nodes are expected to be running in the cluster, and they can monitor ML nodes by calling GET _cat/nodes. For example, if a cluster has 10 ML nodes in total, a user may treat 10 running ML nodes as “GREEN”, 5 running ML nodes as “YELLOW”, and no running ML nodes as “RED”.

Pros:

No impact to current cluster state.

Cons:

User needs to define state rules and do some calculation to tell the state

Option2: add new cluster state for custom role node

Keep current cluster state as is. Define new cluster state for different custom node roles. If the custom node role is defined in code like DiscoveryNodeRole.java , we can define corresponding state for each role. If we don’t define roles in code, node just pick up custom node roles from opensearch.yml dynamically, we should generate custom cluster state for the custom node role dynamically. And user can define rules for each custom role.

For example, node picks up ml role, then we can generate ml_nodes_status . User can define state rules for ml nodes like :

"green": 10, // >= 10
"yellow": 5, // <= 5 and >0
"red": 0, // <=0

If all data nodes are running correctly and all primary/replica shards are good, but only 3 ml nodes are running in the cluster when GET _cluster/health is called, we will return "ml_nodes_status":"yellow" together with the current cluster state "status":"green".

Pros:

No impact to current cluster state.
Compared with Option1, user don’t need to calculate cluster states

Cons:

User needs to define state rules
More cluster states.

UPDATE: We can enhance this option to avoid the con "More cluster states."; listed as Option2.1.

Option2.1: add new cluster status for node, keep current cluster status for shard

As the current cluster status from _cluster/health only reflects shard health (for example, if a data node doesn't have any shards on it, killing this data node won't impact cluster status), we can build a general node status for node failures. This node status will cover all node roles. For example, a user can configure min_nodes for the ml role as 5; then if the cluster has >= 5 running ml nodes, we set node_status to green; if it has fewer than 5 but more than 0, we set node_status to yellow; if there is no running ml node, we set node_status to red. If we have multiple node roles, we can calculate each node role's status, then pick the worst one. For example, if ml nodes are red and image_processing nodes are yellow, the final node_status will be red. We don't return a separate node status for each node role. Users can deep dive into each node role by calling _cat/nodes.
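
The min_nodes rule and the "pick the worst one" step can be sketched as below. This is only an illustration of the proposal: `role_status`, `node_status` and the dict-based inputs are hypothetical names, and nothing like this exists in the _cluster/health API today.

```python
# Higher number means worse status.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def role_status(running, min_nodes):
    """Status for one role given its running node count and configured min_nodes."""
    if running >= min_nodes:
        return "green"
    if running > 0:
        return "yellow"
    return "red"


def node_status(running_by_role, min_nodes_by_role):
    """Overall node_status: the worst status among all roles that have a min_nodes rule."""
    statuses = [
        role_status(running_by_role.get(role, 0), min_nodes)
        for role, min_nodes in min_nodes_by_role.items()
    ]
    # With no rules configured there is nothing to report, so default to green.
    return max(statuses, key=SEVERITY.__getitem__, default="green")
```

For example, with `min_nodes_by_role = {"ml": 5, "image_processing": 2}`, zero running ml nodes and one running image_processing node give per-role statuses of red and yellow, so the combined `node_status` is red.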

Pros:

No impact to current cluster state, still means shard health status
Compared with Option2, user just need to monitor one node_status for all nodes. No need to monitor multiple node status for node roles.

Cons:

Compared with Option3 and Option4, user need to monitor 1 more status for node. But this option keeps the shard health status, will be easier for user if they need to monitor shard health status and node status separately.

Option3: enrich current cluster state to include failure of nodes with custom role, but user define rules for custom role node

User define the rule for each custom role like Option2. Then we can combine all status of custom role nodes and original cluster status and choose the worst one. For example, ml nodes status is yellow , image_processing nodes status is red and general cluster status is green (all data nodes running correctly). Then the final status will be red

Pros:

Just one cluster status returned

Cons:

This will break current cluster state semantic, will break the BWC if user only needs to monitor primary/replica shards.
User get a combined status, need to dive deep to know what’s the root cause. So maybe we should also add separate status for each custom role finally to avoid this.
User need to define cluster state rules

Option4: enrich current cluster state to include failure of nodes with custom role, user don’t need to define rules for custom role node

This is to improve Option3 to reduce the effort to define state rules for custom roles. This is what db mentioned :

RED: some or all of (primary) shards are not ready, no available ml nodes
YELLOW: all of the primary shards allocated, but some/all of the replicas have not been allocated, some ml nodes down
GREEN: fully operational, all ml nodes up

Like Db said, this is chicken and egg problem as we don’t know how many ml nodes are expected in a cluster.
or

RED: some or all of (primary) shards are not ready, there are tasks that require ml nodes, but these cannot be executed at all because they cannot find an ml node and new tasks are being dropped/rejected
YELLOW: all of the primary shards allocated, but some/all of the replicas have not been allocated, tasks that prefer ml nodes are queued excessively, starving for more ml nodes
GREEN: fully operational, tasks that prefer ml nodes don't queue up

This seems to be monitoring the ml node load, rather than node failure. If there are too many requests and ML tasks queue up or rejected, that may be by design, rather than caused by some ML node failure.

Pros:

Just one cluster status returned
No need to define rules for custom roles

Cons:

This will break current cluster state semantic, will break the BWC if user only needs to monitor primary/replica shards.
User get a combined status, need to dive deep to know what’s the root cause. So maybe we should also add separate status for each custom role finally to avoid this.
There are some challenges like how do we know how many ml nodes are expected to be running in cluster.

My conclusion by comparing these options

I think we should not break the current cluster state semantic, which seems breaking the BWC if user only needs to monitor if primary/replica shards looks good or not. So I prefer not to choose option3 and 4.

For Option1, it needs some manual effort from the user side to calculate and monitor, but less development effort.
As Option2 will add a new status for each node role, it seems not so scalable. So I prefer Option2.1, which just returns one node_status for all nodes. Option2.1 needs more development effort than Option1, but will reduce some calculation effort for users. Option1 is not a one-way door. For agile development, I think it's OK to choose Option1 to speed up development and feature release. But I'm not against Option2.1 at all if we have enough bandwidth/time.

dblock · 2022-04-27T21:08:13Z

Two questions re:pros/cons of the last option.

This will break current cluster state semantic, will break the BWC if user only needs to monitor primary/replica shards.

Regarding backwards compatibility, those who don't have ml/custom nodes are 100% BWC, so is this valid?

User get a combined status, need to dive deep to know what’s the root cause.

Isn't this the case today where you have to dive deep under "yellow" to see, for example, that some master and some data nodes are down? So is this a new con?

ylwu-amzn · 2022-04-28T00:23:01Z

Thanks @dblock .

Regarding backwards compatibility, those who don't have ml/custom nodes are 100% BWC, so is this valid?

Agree, if cluster doesn't have ML node, there is no BWC issue.
I'm thinking this case: User may upgrade their cluster, then add one ML node to their existing cluster. Then the "red" cluster state doesn't mean bad primary shard any more. It may be caused by ML node failure. That means existing metrics monitoring shard health status will be impacted by ML node failure. But maybe it's ok or won't hurt too much if we document this clearly and users can have some other way to monitor shard status.

I added some updates under option2, just list as Option2.1. Also paste here

As the current cluster status from _cluster/health only reflects shard health (for example, if a data node doesn't have any shards on it, killing this data node won't impact cluster status), we can build a general node status for node failures. This node status will cover all node roles. For example, a user can configure min_nodes for the ml role as 5; then if the cluster has >= 5 running ml nodes, we set node_status to green; if it has fewer than 5 but more than 0, we set node_status to yellow; if there is no running ml node, we set node_status to red. If we have multiple node roles, we can calculate each node role's status, then pick the worst one. For example, if ml nodes are red and image_processing nodes are yellow, the final node_status will be red. We don't return a separate node status for each node role. Users can deep dive into each node role by calling _cat/nodes.

Isn't this the case today where you have to dive deep under "yellow" to see, for example, that some master and some data nodes are down? So is this a new con?

I think that depends on which level user are more interested to monitor. Current cluster status means the shard health status. So I think the shard level health info should be kept as this is important to current user.

dblock · 2022-04-28T16:42:07Z

Isn't this the case today where you have to dive deep under "yellow" to see, for example, that some master and some data nodes are down? So is this a new con?

I think that depends on which level user are more interested to monitor. Current cluster status means the shard health status. So I think the shard level health info should be kept as this is important to current user.

We do document this status as shard health in https://opensearch.org/docs/latest/opensearch/rest-api/cluster-health/. I do think a major release can absolutely expand on this definition if it makes sense.

ylwu-amzn · 2022-04-28T18:43:48Z

Isn't this the case today where you have to dive deep under "yellow" to see, for example, that some master and some data nodes are down? So is this a new con?

I think that depends on which level user are more interested to monitor. Current cluster status means the shard health status. So I think the shard level health info should be kept as this is important to current user.

We do document this status as shard health in https://opensearch.org/docs/latest/opensearch/rest-api/cluster-health/. I do think a major release can absolutely expand on this definition if it makes sense.

I'm ok to have some breaking part in a major release. I prefer to listen to community user's feedback to check if it's ok to

change the cluster status semantic to cover both shard health status and node status.
don't return shard health status in _cluster/health response

@reta do you have any suggestion?

andrross · 2022-04-28T23:21:45Z

Regarding the cluster health discussion, are these new ml nodes comparable to ingest nodes? Like ml nodes, I don't think the cluster knows how many ingest nodes should be present until they register themselves and it also doesn't know how many are actually needed to perform the ingest work. Is this an apt comparison? If so, how does a failure of a dedicated ingest node impact cluster health?

reta · 2022-04-29T07:34:43Z

@ylwu-amzn thanks for the excellent summary. Somewhat in the same vein as what @andrross said, I think we should not alter the cluster health semantics from what they are right now (there are a lot of operational mitigation procedures which would be invalidated). However, I like the Option2 where we could extend the cluster health response to include the nodes state by custom roles.

ylwu-amzn · 2022-04-29T17:39:01Z

Regarding the cluster health discussion, are these new ml nodes comparable to ingest nodes? Like ml nodes, I don't think the cluster knows how many ingest nodes should be present until they register themselves and it also doesn't know how many are actually needed to perform the ingest work. Is this an apt comparison? If so, how does a failure of a dedicated ingest node impact cluster health?

@andrross Thanks, agree that we actually don't know how many ingest or ml node should be present in a cluster due to the distributed nature (fe, 10 ml nodes configured, 5 failed to start, then we have 5 ml nodes running after cluster started, but later user can fix the failed ml node, then the running ml node will increase to 6, 7, to 10).
One option is to let users configure a threshold like in Option2 or Option2.1, so the cluster knows which status the running node count maps to.

how does a failure of a dedicated ingest node impact cluster health?

If we keep current cluster health semantics (it means shard health status), the dedicated ingest node failure should not impact cluster health as it doesn't contain any shards. Like @reta said, we can keep cluster health semantics.

@reta , thanks for your opinion, for

I like the Option2 where we could extend the cluster health response to include the nodes state by custom roles.

Do you mean we should add a new node status for each custom role, like ml_node_status for ml nodes and image_processing_node_status for image_processing nodes? If users add more and more custom roles, the cluster health response will grow bigger. Do you think this will cause any scalability problem? If yes, maybe we should go with Option2.1 and just add one general node status for all nodes.

reta · 2022-05-02T13:04:27Z

Do you mean we should add a new node status for each custom role, like ml_node_status for ml nodes and image_processing_node_status for image_processing nodes? If users add more and more custom roles, the cluster health response will grow bigger. Do you think this will cause any scalability problem? If yes, maybe we should go with Option2.1 and just add one general node status for all nodes.

This is a good point. My perspective was/is primarily on what we are looking from health API with respect to custom roles. Fe. it should be fairly trivial to spot there are no nodes with specific roles in the cluster, may be something like a new JSON element:

"nodes_by_roles": {
    "ml": 1,
    "ingest": 1,
    "...": ..
}

It should be fairly easy to troubleshoot. I am not sure it is going to be a scalability problem, since cluster discovery collects these details anyway. Having dozens of roles is realistic, but hundreds or more does not look like it. @ylwu-amzn @dblock @xinlamzn what do you think guys?
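
As a rough illustration of how such a nodes_by_roles element could be derived from the node list the cluster already knows about, consider the sketch below. The function name and the dict-based node representation are assumptions; this is not an existing health API field.

```python
from collections import Counter


def nodes_by_roles(nodes):
    """Count how many nodes carry each role.

    Each node is a dict such as {"name": "node-1", "roles": ["data", "ml"]}.
    """
    counts = Counter()
    for node in nodes:
        # set() guards against a role being listed twice on the same node.
        counts.update(set(node["roles"]))
    return dict(counts)
```

A role with no running node is simply absent from the result, so a caller checking for missing ml nodes would test `nodes_by_roles(nodes).get("ml", 0) == 0`.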

ylwu-amzn · 2022-05-10T18:20:36Z

@andrross Thanks, good point. Built-in roles are mainly for OpenSearch core functions; they are hard-coded and can't be changed. A wrong built-in role will fail cluster startup or impact core functions. Custom roles are mainly for plugin/extension functions, and a wrong custom role only impacts the specific plugin, not the core functions. But maybe the end user should treat built-in and custom/extra roles the same, since they see OpenSearch and its plugins as a whole.

Is the difference between "built in" roles and these "extra" roles important for a user?

I think this is a valid concern and may make it confusing or difficult for users to understand. I missed this con of adding a new field.

creating a new node.roles field (without single-letter abbreviations) and deprecating node.role

I think this is an option. As @dblock said, the current design of using one character as the abbreviation does not seem reasonable. It may be good to deprecate node.role to avoid the confusion of having two role columns.

reta · 2022-05-10T18:23:18Z

@ylwu-amzn I also agree, deprecation + node.roles is a good tradeoff

ylwu-amzn · 2022-05-10T18:28:10Z

Thanks @reta for sharing your opinion. What do you think @dblock about "deprecation + node.roles"? I guess you are good with this option, just confirming.

dblock · 2022-05-10T18:35:54Z

Thanks @reta for sharing your opinion. What do you think @dblock about "deprecation + node.roles"? I guess you are good with this option, just confirming.

That would be my favorite option and the one I'm most looking forward to.

ylwu-amzn · 2022-05-10T19:02:15Z

So now we have a conclusion for the open question "how to extend the _cat/nodes API?":

We will add a new field node.roles and put the full names of all roles (both built-in and custom) there. And we will deprecate node.role.

If anyone has new concerns, please don't hesitate to call them out. We are open to tuning or changing direction.
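The difference between the two columns can be sketched as follows. This is an illustration under stated assumptions: the abbreviation table lists only a few built-in roles from memory (check DiscoveryNodeRole.java for the real set), and the comma separator, the `-` placeholder for a node without roles, and the choice to leave custom roles out of the legacy column are assumptions, not confirmed behavior.

```python
# Assumed single-letter abbreviations for a few built-in roles (not an exhaustive list).
BUILT_IN_ABBREVIATIONS = {
    "data": "d",
    "ingest": "i",
    "cluster_manager": "m",
    "remote_cluster_client": "r",
}


def node_role_column(roles):
    """Legacy `node.role` style: concatenated abbreviations of built-in roles only."""
    abbreviations = sorted(
        BUILT_IN_ABBREVIATIONS[r] for r in roles if r in BUILT_IN_ABBREVIATIONS
    )
    return "".join(abbreviations) or "-"


def node_roles_column(roles):
    """Proposed `node.roles` style: full names of all roles, built-in and custom."""
    return ",".join(sorted(roles)) or "-"
```

A custom role such as `ml` would collide with `cluster_manager` if it also had to fit in one letter, which is the problem full names avoid.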

Currently OpenSearch only supports several built-in nodes like data node role. If specify unknown node role, OpenSearch node will fail to start. This limit how to extend OpenSearch to support some extension function. For example, user may prefer to run ML tasks on some dedicated node which doesn't serve as any built-in node roles. So the ML tasks won't impact OpenSearch core function. This PR removed the limitation and user can specify any node role and OpenSearch will start node correctly with that unknown role. This opens the door for plugin developer to run specific tasks on dedicated nodes. Issue: opensearch-project#2877 Signed-off-by: Yaliang Wu <ylwu@amazon.com>

dtaivpp · 2022-06-10T19:31:30Z

@reta or @dblock can we tag this for 2.1 release? I talked with @ylwu-amzn and it is ready to roll.

reta · 2022-06-10T20:19:32Z

@reta or @dblock can we tag this for 2.1 release? I talked with @ylwu-amzn and it is ready to roll.

Sorry about that @dtaivpp, I lost track of this one somehow. I have now looked and commented.

dtaivpp · 2022-06-10T20:22:13Z

@reta all good, I can't keep track of any of it 😅

brijos · 2022-06-13T17:50:26Z

Adding a meta issue to track the engineering work as well as future documentation and blog posts.

dblock · 2022-06-14T18:18:07Z

I CRed #3436, just waiting on another pair of eyes from @reta and it's good to go.

* Support unknown node role

  Currently OpenSearch only supports several built-in nodes like data node role. If specify unknown node role, OpenSearch node will fail to start. This limit how to extend OpenSearch to support some extension function. For example, user may prefer to run ML tasks on some dedicated node which doesn't serve as any built-in node roles. So the ML tasks won't impact OpenSearch core function. This PR removed the limitation and user can specify any node role and OpenSearch will start node correctly with that unknown role. This opens the door for plugin developer to run specific tasks on dedicated nodes.

  Issue: #2877

  Signed-off-by: Yaliang Wu <ylwu@amazon.com>
* fix cat nodes rest API spec Signed-off-by: Yaliang Wu <ylwu@amazon.com>
* fix mixed cluster IT failure Signed-off-by: Yaliang Wu <ylwu@amazon.com>
* add DynamicRole Signed-off-by: Yaliang Wu <ylwu@amazon.com>
* change generator method name Signed-off-by: Yaliang Wu <ylwu@amazon.com>
* fix failed docker test Signed-off-by: Yaliang Wu <ylwu@amazon.com>
* transform role name to lower case to avoid confusion Signed-off-by: Yaliang Wu <ylwu@amazon.com>
* transform the node role abbreviation to lower case Signed-off-by: Yaliang Wu <ylwu@amazon.com>
* fix checkstyle Signed-off-by: Yaliang Wu <ylwu@amazon.com>
* add test for case-insensitive role name change Signed-off-by: Yaliang Wu <ylwu@amazon.com>
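The behavior these commits describe can be sketched roughly as below. This is not the Java implementation from the PR: the `Role` class, `parse_roles`, and the abbreviation table are hypothetical stand-ins, and the sketch only reflects what the thread states (unknown roles are accepted instead of failing startup, names are lower-cased, and a dynamic role's abbreviation equals its full name).

```python
from dataclasses import dataclass

# Assumed subset of built-in roles and abbreviations, for illustration only.
BUILT_IN_ABBREVIATIONS = {"data": "d", "ingest": "i", "cluster_manager": "m"}


@dataclass(frozen=True)
class Role:
    name: str
    abbreviation: str
    dynamic: bool


def parse_roles(configured):
    """Turn configured role names into Role objects without rejecting unknown names."""
    roles = []
    seen = set()
    for raw in configured:
        name = raw.strip().lower()  # role names are treated case-insensitively
        if name in seen:
            continue                # "ML" and "ml" are the same role
        seen.add(name)
        if name in BUILT_IN_ABBREVIATIONS:
            roles.append(Role(name, BUILT_IN_ABBREVIATIONS[name], dynamic=False))
        else:
            # Unknown role: keep it as a dynamic role whose abbreviation is its name.
            roles.append(Role(name, name, dynamic=True))
    return roles
```

A plugin could then look for its own role name among a node's roles to decide where to run its tasks.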

ylwu-amzn · 2022-06-15T19:11:37Z

PR merged. Close this issue. Thanks a lot for your help!

* Support dynamic node role (#3436) * Support unknown node role Currently OpenSearch only supports several built-in nodes like data node role. If specify unknown node role, OpenSearch node will fail to start. This limit how to extend OpenSearch to support some extension function.
For example, user may prefer to run ML tasks on some dedicated node which doesn't serve as any built-in node roles. So the ML tasks won't impact OpenSearch core function. This PR removed the limitation and user can specify any node role and OpenSearch will start node correctly with that unknown role. This opens the door for plugin developer to run specific tasks on dedicated nodes. Issue: #2877 Signed-off-by: Yaliang Wu <ylwu@amazon.com> * fix cat nodes rest API spec Signed-off-by: Yaliang Wu <ylwu@amazon.com> * fix mixed cluster IT failure Signed-off-by: Yaliang Wu <ylwu@amazon.com> * add DynamicRole Signed-off-by: Yaliang Wu <ylwu@amazon.com> * change generator method name Signed-off-by: Yaliang Wu <ylwu@amazon.com> * fix failed docker test Signed-off-by: Yaliang Wu <ylwu@amazon.com> * transform role name to lower case to avoid confusion Signed-off-by: Yaliang Wu <ylwu@amazon.com> * transform the node role abbreviation to lower case Signed-off-by: Yaliang Wu <ylwu@amazon.com> * fix checkstyle Signed-off-by: Yaliang Wu <ylwu@amazon.com> * add test for case-insensitive role name change Signed-off-by: Yaliang Wu <ylwu@amazon.com> * Rename package 'o.o.action.support.master' to 'o.o.action.support.clustermanager' (#3556) * Rename package org.opensearch.action.support.master to org.opensearch.action.support.clustermanager Signed-off-by: Tianli Feng <ftianli@amazon.com> * Rename classes with master term in the package org.opensearch.action.support.master Signed-off-by: Tianli Feng <ftianli@amazon.com> * Deprecate classes in org.opensearch.action.support.master Signed-off-by: Tianli Feng <ftianli@amazon.com> * Remove pakcage o.o.action.support.master Signed-off-by: Tianli Feng <ftianli@amazon.com> * Move package-info back Signed-off-by: Tianli Feng <ftianli@amazon.com> * Move package-info to new folder Signed-off-by: Tianli Feng <ftianli@amazon.com> * Correct the package-info Signed-off-by: Tianli Feng <ftianli@amazon.com> * Fixing flakiness of 
ShuffleForcedMergePolicyTests (#3591) Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Deprecate classes in org.opensearch.action.support.master (#3593) Signed-off-by: Tianli Feng <ftianli@amazon.com> * Add release notes for version 2.0.1 (#3595) Signed-off-by: Kunal Kotwani <kkotwani@amazon.com> * Fix NPE when minBound/maxBound is not set before being called. (#3605) Signed-off-by: George Apaaboah <george.apaaboah@gmail.com> * Added bwc version 2.0.2 (#3613) Co-authored-by: opensearch-ci-bot <opensearch-ci-bot@users.noreply.github.com> * Fix false positive query timeouts due to using cached time (#3454) * Fix false positive query timeouts due to using cached time Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> * delegate nanoTime call to SearchContext Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> * add override to SearchContext getRelativeTimeInMillis to force non cached time Signed-off-by: Ahmad AbuKhalil <abukhali@amazon.com> * Fix random gradle check failure issue 3584. (#3627) * [Segment Replication] Add components for segment replication to perform file copy. (#3525) * Add components for segment replication to perform file copy. This change adds the required components to SegmentReplicationSourceService to initiate copy and react to lifecycle events. Along with new components it refactors common file copy code from RecoverySourceHandler into reusable pieces. 
Signed-off-by: Marc Handalian <handalm@amazon.com> * Deprecate public methods and variables with master term in package 'org.opensearch.action.support.master' (#3617) Signed-off-by: Tianli Feng <ftianli@amazon.com> * Add replication orchestration for a single shard (#3533) * implement segment replication target Signed-off-by: Poojita Raj <poojiraj@amazon.com> * test added Signed-off-by: Poojita Raj <poojiraj@amazon.com> * changes to tests + finalizeReplication Signed-off-by: Poojita Raj <poojiraj@amazon.com> * fix style check Signed-off-by: Poojita Raj <poojiraj@amazon.com> * addressing comments + fix gradle check Signed-off-by: Poojita Raj <poojiraj@amazon.com> * added test + addressed review comments Signed-off-by: Poojita Raj <poojiraj@amazon.com> * [BUG] opensearch crashes on closed client connection before search reply (#3626) * [BUG] opensearch crashes on closed client connection before search reply Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Addressing code review comments Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Add all deprecated method in the package with new name 'org.opensearch.action.support.clustermanager' (#3644) Signed-off-by: Tianli Feng <ftianli@amazon.com> * Introduce TranslogManager implementations decoupled from the Engine (#3638) * Introduce decoupled translog manager interfaces Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com> * Adding onNewCheckpoint to Start Replication on Replica Shard when Segment Replication is turned on (#3540) * Adding onNewCheckpoint and it's test to start replication. 
SCheck for latestcheckpoint and replaying logic is removed from this commit and will be added in a different PR Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Changing binding/inject logic and addressing comments from PR Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Applying spotless check Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Moving shouldProcessCheckpoint() to IndexShard, and removing some trace logs Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * applying spotlessApply Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Adding more info to log statement in targetservice class Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * applying spotlessApply Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Addressing comments on PR Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Adding teardown() in SegmentReplicationTargetServiceTests. Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * fixing testShouldProcessCheckpoint() in SegmentReplicationTargetServiceTests Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Removing CheckpointPublisherProvider in IndicesModule Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * spotless check apply Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Remove class org.opensearch.action.support.master.AcknowledgedResponse (#3662) * Remove class org.opensearch.action.support.master.AcknowledgedResponse Signed-off-by: Tianli Feng <ftianli@amazon.com> * Remove class org.opensearch.action.support.master.AcknowledgedRequest RequestBuilder ShardsAcknowledgedResponse Signed-off-by: Tianli Feng <ftianli@amazon.com> * Restore AcknowledgedResponse and AcknowledgedRequest to package org.opensearch.action.support.master (#3669) Signed-off-by: Tianli Feng <ftianli@amazon.com> * [BUG] Custom POM configuration for ZIP publication produces duplicit tags (url, scm) (#3656) * [BUG] Custom POM configuration for ZIP publication produces duplicit 
tags (url, scm) Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Added test case for pluginZip with POM Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Support both Gradle 6.8.x and Gradle 7.4.x Signed-off-by: Andriy Redko <andriy.redko@aiven.io> * Adding 2.2.0 Bwc version to main (#3673) * Upgraded to t-digest 3.3. (#3634) * Revert renaming method onMaster() and offMaster() in interface LocalNodeMasterListener (#3686) Signed-off-by: Tianli Feng <ftianli@amazon.com> * Upgrading AWS SDK dependency for native plugins (#3694) * Merge branch 'feature/point_in_time' of https://github.com/opensearch-project/OpenSearch into fb Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> Co-authored-by: Suraj Singh <surajrider@gmail.com> Co-authored-by: Marc Handalian <handalm@amazon.com> Co-authored-by: Tianli Feng <ftianli@amazon.com> Co-authored-by: Andriy Redko <andriy.redko@aiven.io> Co-authored-by: Rabi Panda <adnapibar@gmail.com> Co-authored-by: Nick Knize <nknize@apache.org> Co-authored-by: Poojita Raj <poojiraj@amazon.com> Co-authored-by: Rishikesh Pasham <62345295+Rishikesh1159@users.noreply.github.com> Co-authored-by: Ankit Jain <jain.ankitk@gmail.com> Co-authored-by: vpehkone <101240162+vpehkone@users.noreply.github.com> Co-authored-by: sdp <sdp@9049fa06826d.jf.intel.com> Co-authored-by: Kartik Ganesh <gkart@amazon.com> Co-authored-by: Cole White <42356806+shdubsh@users.noreply.github.com> Co-authored-by: opensearch-trigger-bot[bot] <98922864+opensearch-trigger-bot[bot]@users.noreply.github.com> Co-authored-by: opensearch-ci-bot <opensearch-ci-bot@users.noreply.github.com> Co-authored-by: Xue Zhou <85715413+xuezhou25@users.noreply.github.com> Co-authored-by: Sachin Kale <sachinpkale@gmail.com> Co-authored-by: Sachin Kale <kalsac@amazon.com> Co-authored-by: Xue Zhou <xuezhou@amazon.com> Co-authored-by: Rishab Nahata <rishabnahata07@gmail.com> 
Co-authored-by: Anshu Agarwal <anshuagarwal11@gmail.com> Co-authored-by: Yaliang Wu <ylwu@amazon.com> Co-authored-by: Kunal Kotwani <kkotwani@amazon.com> Co-authored-by: George Apaaboah <35894485+GeorgeAp@users.noreply.github.com> Co-authored-by: Ahmad AbuKhalil <105249973+aabukhalil@users.noreply.github.com> Co-authored-by: Bukhtawar Khan <bukhtawa@amazon.com> Co-authored-by: Sarat Vemulapalli <vemulapallisarat@gmail.com> Co-authored-by: Daniel (dB.) Doubrovkine <dblock@dblock.org>
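
The "Support dynamic node role (#3436)" entry in the commit log above says that an unknown role name no longer prevents a node from starting, and that role names and abbreviations are lower-cased. The issue description adds that a dynamic role's abbreviation is the same as its full name. A minimal Python sketch of that resolution rule follows. It is an illustration only, not the OpenSearch implementation (which is Java): `NodeRole`, `resolve_role`, `resolve_roles`, and the `BUILT_IN_ROLES` table are hypothetical names, and the table lists just a few built-in roles for the example.

```python
from dataclasses import dataclass

# Illustrative subset of built-in roles (full name -> abbreviation).
# Assumption: this table is a stand-in for the roles defined in DiscoveryNodeRole.java.
BUILT_IN_ROLES = {
    "data": "d",
    "ingest": "i",
    "cluster_manager": "m",
}


@dataclass(frozen=True)
class NodeRole:
    name: str
    abbreviation: str
    dynamic: bool


def resolve_role(configured_name: str) -> NodeRole:
    """Map a configured role name to a built-in or dynamic role."""
    # Lower-case first so "ML" and "ml" resolve to the same role.
    name = configured_name.lower()
    if name in BUILT_IN_ROLES:
        return NodeRole(name, BUILT_IN_ROLES[name], dynamic=False)
    # Unknown names become dynamic roles instead of failing node startup;
    # the abbreviation of a dynamic role is its full name.
    return NodeRole(name, name, dynamic=True)


def resolve_roles(configured_names):
    """Resolve a list of configured names, dropping case-only duplicates."""
    unique = {resolve_role(n) for n in configured_names}
    return sorted(unique, key=lambda role: role.name)
```

With this sketch, a node configured with the roles `ML` and `data` would carry one dynamic role named `ml` and the built-in `data` role.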

ylwu-amzn added the enhancement (Enhancement or improvement to existing feature or request) and untriaged labels Apr 12, 2022

nknize added the feature (New feature or request) and discuss (Issues intended to help drive brainstorming and decision making) labels Apr 12, 2022

dblock mentioned this issue Apr 13, 2022

[Feature] Support for weighted zonal search request routing policy #2859

Closed

kartg added distributed framework and removed untriaged labels Apr 27, 2022

ylwu-amzn mentioned this issue May 24, 2022

Support dynamic node role #3436

Merged

5 tasks

This was referenced Jun 14, 2022

[META] Support ML dynamic node role opensearch-project/anomaly-detection#571

Open

[Documentation] Support ML node role opensearch-project/documentation-website#672

Closed

dblock added the v2.1.0 (Issues and PRs related to version 2.1.0) label Jun 14, 2022

ylwu-amzn closed this as completed Jun 15, 2022

This was referenced Jun 15, 2022

dispatch ML task to ML node first opensearch-project/ml-commons#346

Merged

Support dedicated ml node opensearch-project/ml-commons#79

Closed

ylwu-amzn changed the title ~~Support ML node role~~ Support dynamic node role Jun 20, 2022

ylwu-amzn mentioned this issue Jul 7, 2022

Add 2.1.0 release notes opensearch-project/opensearch-build#2302

Merged

Comments

ylwu-amzn commented Apr 12, 2022 (edited)

We need dedicated ML node for ML plugin

What changes we need to make

OpenSearch

ML plugin

dblock commented Apr 13, 2022 (edited)

penghuo commented Apr 14, 2022

jainankitk commented Apr 14, 2022

reta commented Apr 18, 2022

ylwu-amzn commented Apr 20, 2022

ylwu-amzn commented Apr 20, 2022

ylwu-amzn commented Apr 20, 2022 (edited)

Option 1: add new node role

Option 2: enhance current node role framework

Option 3: keep current node role framework and add new node tag/capability

Option 4: get rid of current node role framework, build brand new node tag/capability framework

My conclusion by comparing these options

dblock commented Apr 21, 2022 (edited)

reta commented Apr 21, 2022

ylwu-amzn commented Apr 22, 2022 (edited)

dblock commented Apr 22, 2022

ylwu-amzn commented Apr 22, 2022

xinlamzn commented Apr 22, 2022

dblock commented Apr 25, 2022 (edited)

ylwu-amzn commented Apr 27, 2022 (edited)

Option 1: user monitor state of custom role nodes by themselves

Option 2: add new cluster state for custom role node

Option 2.1: add new cluster status for node, keep current cluster status for shard

Option 3: enrich current cluster state to include failure of nodes with custom role, but user define rules for custom role node

Option 4: enrich current cluster state to include failure of nodes with custom role, user don’t need to define rules for custom role node

My conclusion by comparing these options

dblock commented Apr 27, 2022

ylwu-amzn commented Apr 28, 2022 (edited)

dblock commented Apr 28, 2022

ylwu-amzn commented Apr 28, 2022

andrross commented Apr 28, 2022

reta commented Apr 29, 2022 (edited)

ylwu-amzn commented Apr 29, 2022 (edited)

reta commented May 2, 2022

ylwu-amzn commented May 10, 2022 (edited)

reta commented May 10, 2022

ylwu-amzn commented May 10, 2022

dblock commented May 10, 2022

ylwu-amzn commented May 10, 2022

dtaivpp commented Jun 10, 2022

reta commented Jun 10, 2022

dtaivpp commented Jun 10, 2022

brijos commented Jun 13, 2022

dblock commented Jun 14, 2022

ylwu-amzn commented Jun 15, 2022
