enhance: declarative resource group api #31930
Conversation
@chyezh ut workflow job failed, comment
Force-pushed from 0aeb5c2 to 5e5f394
@chyezh ut workflow job failed, comment
Force-pushed from ec7c37c to 1a479ea
@chyezh E2e jenkins job failed, comment
Force-pushed from 068087d to a905127
@chyezh E2e jenkins job failed, comment
Force-pushed from 5a8ba03 to 37e1994
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #31930 +/- ##
==========================================
+ Coverage 81.69% 81.75% +0.06%
==========================================
Files 991 993 +2
Lines 122136 122555 +419
==========================================
+ Hits 99777 100197 +420
+ Misses 18530 18519 -11
- Partials 3829 3839 +10
@@ -315,6 +315,17 @@ func (c *Client) CreateResourceGroup(ctx context.Context, req *milvuspb.CreateRe
	})
}

func (c *Client) UpdateResourceGroups(ctx context.Context, req *milvuspb.UpdateResourceGroupsRequest, opts ...grpc.CallOption) (*commonpb.Status, error) {
It's not recommended to use milvus.pb messages between Milvus internal RPCs; they should only be used between the SDK and the proxy.
Got it
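For illustration, a minimal sketch of the boundary being discussed, using hypothetical stand-in types rather than the real milvuspb/internal protos: the proxy translates the public request into an internal one before calling the coordinator, so public messages stay between the SDK and the proxy.

```go
package main

import "fmt"

// Hypothetical stand-ins: the real types would be the public milvuspb request
// (SDK <-> proxy) and an internal proto used between proxy and QueryCoord.
type publicUpdateRGRequest struct {
	Configs map[string]int32 // resource group name -> requested node count
}

type internalUpdateRGRequest struct {
	Configs  map[string]int32
	SourceID int64 // internal routing metadata the public message never carries
}

// toInternal is where the proxy would translate the public request into the
// internal one, keeping public proto types out of internal RPCs.
func toInternal(req *publicUpdateRGRequest, sourceID int64) *internalUpdateRGRequest {
	return &internalUpdateRGRequest{Configs: req.Configs, SourceID: sourceID}
}

func main() {
	pub := &publicUpdateRGRequest{Configs: map[string]int32{"rg1": 3}}
	fmt.Printf("%+v\n", toInternal(pub, 42))
}
```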
req = typeutil.Clone(req)
commonpbutil.UpdateMsgBase(
	req.GetBase(),
	commonpbutil.FillMsgBaseFromClient(paramtable.GetNodeID(), commonpbutil.WithTargetID(c.grpcClient.GetNodeID())),
No need to set target_id here; the server ID check is already done in the server_id_interceptor.
All APIs in the internal gRPC client add it. I will keep the coding convention until some PR fixes them all.
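A minimal sketch of the server-side check being referred to, assuming every internal request exposes its target node ID; the interceptor and interface names are illustrative, not the actual Milvus code.

```go
package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Hypothetical accessor; real internal requests expose a MsgBase carrying a target ID.
type withTarget interface{ GetTargetID() int64 }

// serverIDInterceptor sketches the server-side check: any request addressed to
// a different node ID is rejected before it reaches the handler, which is why
// setting target_id on the client is redundant for correctness.
func serverIDInterceptor(selfID int64) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		if r, ok := req.(withTarget); ok && r.GetTargetID() != 0 && r.GetTargetID() != selfID {
			return nil, status.Errorf(codes.FailedPrecondition,
				"request targets node %d, but this is node %d", r.GetTargetID(), selfID)
		}
		return handler(ctx, req)
	}
}

func main() {
	fmt.Println(serverIDInterceptor(1) != nil)
}
```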
log.Warn("UpdateResourceGroups failed",
	zap.Error(err),
)
return getErrResponse(err, method, "", ""), nil
It's weird to pass two empty string values here.
The extra parameters of getErrResponse exist because of the database- and collection-level metrics. It's a bad implementation; all the related code should be moved somewhere like a gRPC interceptor. But I still keep it in this PR; the refactor of getErrResponse should be done in another PR.
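As a sketch of that refactor direction (moving the metric labeling into an interceptor so getErrResponse no longer needs extra parameters), with hypothetical accessor interfaces and a stand-in counter rather than the real Milvus metrics code:

```go
package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc"
)

// Hypothetical: requests carrying database/collection info would expose it via
// accessors like these instead of threading the names through every handler.
type dbAware interface{ GetDbName() string }
type collectionAware interface{ GetCollectionName() string }

// failCounts stands in for a real metrics registry (e.g. Prometheus counters).
var failCounts = map[string]int{}

// metricsInterceptor extracts the labels once per RPC, so error helpers such
// as getErrResponse would not need database/collection parameters at all.
func metricsInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler) (interface{}, error) {
	db, coll := "", ""
	if r, ok := req.(dbAware); ok {
		db = r.GetDbName()
	}
	if r, ok := req.(collectionAware); ok {
		coll = r.GetCollectionName()
	}
	resp, err := handler(ctx, req)
	if err != nil {
		failCounts[info.FullMethod+"|"+db+"|"+coll]++
	}
	return resp, err
}

func main() {
	_, _ = metricsInterceptor(context.Background(), struct{}{},
		&grpc.UnaryServerInfo{FullMethod: "/UpdateResourceGroups"},
		func(ctx context.Context, req interface{}) (interface{}, error) { return nil, fmt.Errorf("boom") },
	)
	fmt.Println(failCounts)
}
```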
log.Info("UpdateResourceGroups received")

if err := node.sched.ddQueue.Enqueue(t); err != nil {
Tasks in ddQueue are processed one by one, so UpdateResourceGroups may be blocked by CreateCollection/LoadCollection.
CreateResourceGroup and DropResourceGroup still use the queue to execute tasks, so I will keep using the queue in this PR. We can remove the related APIs from ddQueue in another PR if necessary.
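A minimal sketch of why a single dd queue serializes these operations: each task runs only after the previous one completes, so a cheap UpdateResourceGroups can wait behind a slow CreateCollection/LoadCollection. Names here are illustrative, not the actual scheduler code.

```go
package main

import (
	"fmt"
	"time"
)

type task struct {
	name string
	run  func()
}

func main() {
	queue := make(chan task, 16)
	done := make(chan struct{})

	// single worker: tasks are processed strictly one by one
	go func() {
		for t := range queue {
			t.run()
		}
		close(done)
	}()

	queue <- task{"CreateCollection", func() {
		time.Sleep(200 * time.Millisecond) // simulate a slow DDL task
		fmt.Println("CreateCollection done")
	}}
	// blocked until the task above finishes
	queue <- task{"UpdateResourceGroups", func() { fmt.Println("UpdateResourceGroups done") }}

	close(queue)
	<-done
}
```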
nodeMgr: nodeMgr,
for _, meta := range rgs {
	rg := NewResourceGroupFromMeta(meta)
	rm.groups[rg.GetName()] = rg
We should update the new-version RG meta to the catalog, to avoid the old-version RG meta existing forever.
The compatibility problem between old- and new-version meta is resolved by resource_group.go. We can always recover the old-version meta in meta storage into the new-version meta in memory, so I don't think writing the new version back into meta storage is needed here.
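A hedged sketch of that compatibility idea: when the stored meta predates the declarative config, build an equivalent config in memory instead of rewriting the stored meta. The types below are simplified stand-ins for the real resource group proto, not the actual code.

```go
package main

import "fmt"

type RGConfig struct{ Requests, Limits int32 }

type RGMeta struct {
	Name     string
	Capacity int32     // legacy field used by old-version meta
	Config   *RGConfig // present only in new-version meta
}

// newResourceGroupFromMeta recovers either meta version into the in-memory
// representation; old-version meta gets a config derived from its capacity.
func newResourceGroupFromMeta(m *RGMeta) *RGConfig {
	if m.Config != nil {
		return m.Config
	}
	return &RGConfig{Requests: m.Capacity, Limits: m.Capacity}
}

func main() {
	old := &RGMeta{Name: "rg1", Capacity: 3}
	fmt.Printf("%+v\n", *newResourceGroupFromMeta(old))
}
```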
}
// transfer node from source resource group to target resource group at high priority.
targetCfg.TransferFrom = append(targetCfg.TransferFrom, &rgpb.ResourceGroupTransfer{
	ResourceGroup: sourceRGName,
any details about the high priority?
In the current implementation, there's no high priority here. We may add an attribute like boost in the future to enable priority.
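A sketch of the declarative translation in that snippet: a transfer-node operation is expressed by appending the source group to the target group's TransferFrom list, and there is currently no priority/boost attribute. Types here are simplified stand-ins for rgpb.

```go
package main

import "fmt"

type ResourceGroupTransfer struct{ ResourceGroup string }

type ResourceGroupConfig struct {
	Requests     int32
	TransferFrom []*ResourceGroupTransfer
}

// transferNode raises the target group's requested node count and records
// that missing nodes should preferentially come from sourceRG.
func transferNode(target *ResourceGroupConfig, sourceRG string, numNode int32) {
	target.Requests += numNode
	target.TransferFrom = append(target.TransferFrom, &ResourceGroupTransfer{ResourceGroup: sourceRG})
}

func main() {
	cfg := &ResourceGroupConfig{Requests: 2}
	transferNode(cfg, "rg_source", 1)
	fmt.Printf("requests=%d transferFrom=%+v\n", cfg.Requests, cfg.TransferFrom[0])
}
```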
}
// After recovering, all node assigned to these rg has been removed.
// no secondary index need to be removed.
delete(rm.groups, rgName)
Should we clear nodeIDMap here?
After the last recovery step, all nodes assigned to this RG have already been removed, so no secondary index needs to be removed.
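A small sketch of that invariant: DropResourceGroup first recovers (unassigns) every node in the group, so by the time the group is deleted the node-to-group secondary index no longer references it. Names are illustrative, not the real resource manager.

```go
package main

import "fmt"

type manager struct {
	groups    map[string][]int64 // group -> node IDs
	nodeIDMap map[int64]string   // node ID -> group (secondary index)
}

// recoverRedundant unassigns every node, removing its secondary-index entry.
func (m *manager) recoverRedundant(rg string) {
	for _, n := range m.groups[rg] {
		delete(m.nodeIDMap, n)
	}
	m.groups[rg] = nil
}

// dropResourceGroup deletes the group; no nodeIDMap cleanup is needed here
// because recovery already removed every entry pointing at this group.
func (m *manager) dropResourceGroup(rg string) {
	m.recoverRedundant(rg)
	delete(m.groups, rg)
}

func main() {
	m := &manager{
		groups:    map[string][]int64{"rg1": {1, 2}},
		nodeIDMap: map[int64]string{1: "rg1", 2: "rg1"},
	}
	m.dropResourceGroup("rg1")
	fmt.Println(len(m.groups), len(m.nodeIDMap)) // 0 0
}
```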
newNodes = append(newNodes, nid)
for nodeID := range rm.nodeIDMap {
	if node := rm.nodeMgr.Get(nodeID); node == nil || node.IsStoppingState() {
		// unassignNode failure can be skip.
Removing a stopping node from rm may not be safe here. If a stopping node is executing a balance, what will happen?
In the current implementation, it's safe. In the replica manager, the node will be set as an RO node, and the balancer will start to remove all segments and channels on it.
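A hedged sketch of that safety argument: when a stopping node is removed from the resource manager, the replica manager demotes it to a read-only (RO) node and the balancer then drains its segments and channels. Types and method names are illustrative, not the real Milvus replica manager.

```go
package main

import "fmt"

// Hypothetical replica model: RW nodes serve traffic, RO nodes only hold data
// until the balancer drains them.
type replica struct {
	rwNodes map[int64]struct{}
	roNodes map[int64]struct{}
}

// demoteToRO is what conceptually happens when a stopping node is removed from
// the resource manager: it stops receiving new load, and the balancer then
// moves its segments and channels to RW nodes.
func (r *replica) demoteToRO(nodeID int64) {
	if _, ok := r.rwNodes[nodeID]; ok {
		delete(r.rwNodes, nodeID)
		r.roNodes[nodeID] = struct{}{}
	}
}

func main() {
	r := &replica{rwNodes: map[int64]struct{}{1: {}, 2: {}}, roNodes: map[int64]struct{}{}}
	r.demoteToRO(2) // node 2 is stopping
	fmt.Println(len(r.rwNodes), len(r.roNodes)) // 1 1
}
```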
func (rm *ResourceManager) recoverMissingNodeRG(rgName string) error {
	for rm.groups[rgName].MissingNumOfNodes() > 0 {
		rg := rm.groups[rgName]
		sourceRG := rm.selectMissingRecoverSourceRG(rg)
Instead of selecting the sourceRG again and again, it's recommended to sort the sourceRGs by priority.
Got it
I tried to fix it with a weight-order-based selection policy. It works for recoverMissingNodeRG, but it is difficult to implement in recoverRedundant, and the resulting code is hard to read. Most of the time a recover operation is applied to a small set of query nodes, probably never more than 10 for a Milvus cluster, so I prefer to keep the current implementation for readability.
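For reference, a simplified sketch of the recovery loop being discussed: while the group is short of nodes, a donor group is re-selected on every iteration (the style kept in this PR) rather than chosen from a pre-sorted priority list. Names and selection logic are illustrative.

```go
package main

import "fmt"

// Simplified resource group: how many nodes it wants vs. currently has.
type group struct {
	name      string
	requested int
	nodes     []int64
}

func (g *group) missing() int { return g.requested - len(g.nodes) }

// pickSource re-selects a donor group each iteration; with at most a handful
// of nodes moved per recovery, the repeated scans are negligible.
func pickSource(groups map[string]*group, target string) *group {
	for _, g := range groups {
		if g.name != target && len(g.nodes) > g.requested {
			return g
		}
	}
	return nil
}

func recoverMissingNodeRG(groups map[string]*group, target string) {
	for groups[target].missing() > 0 {
		src := pickSource(groups, target)
		if src == nil {
			return // nothing left to take; stay degraded until more nodes join
		}
		// move one node from the donor to the target group
		n := src.nodes[len(src.nodes)-1]
		src.nodes = src.nodes[:len(src.nodes)-1]
		groups[target].nodes = append(groups[target].nodes, n)
	}
}

func main() {
	groups := map[string]*group{
		"rg1":       {name: "rg1", requested: 2},
		"__default": {name: "__default", requested: 0, nodes: []int64{1, 2, 3}},
	}
	recoverMissingNodeRG(groups, "rg1")
	fmt.Println(len(groups["rg1"].nodes), len(groups["__default"].nodes)) // 2 1
}
```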
for _, rg := range rm.groups {
	for _, transferCfg := range rg.GetConfig().GetTransferFrom() {
		if transferCfg.GetResourceGroup() == rgName {
			return errors.Wrapf(ErrDeleteInUsedRG, "resource group %s is used by %s's `from`, remove that configuration first", rgName, rg.name)
Recommend defining an error in merr.
Will fix it soon.
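A generic sketch of the suggestion: define a reusable sentinel error once (in Milvus this would live in the merr package) and wrap it at the call site, so callers can match it with errors.Is. The names below are assumptions, not the real merr API.

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/errors"
)

// Would be declared once in a shared errors package such as merr.
var ErrResourceGroupInUse = errors.New("resource group is in use by another group's transfer config")

// checkDeletable wraps the shared error with call-site context, mirroring the
// errors.Wrapf usage in the snippet above.
func checkDeletable(rgName, usedBy string) error {
	return errors.Wrapf(ErrResourceGroupInUse,
		"resource group %s is used by %s's `from`, remove that configuration first", rgName, usedBy)
}

func main() {
	err := checkDeletable("rg1", "rg2")
	fmt.Println(errors.Is(err, ErrResourceGroupInUse), err)
}
```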
Force-pushed from 37e1994 to e4f97c0
Force-pushed from e4f97c0 to 511f897
Rebased onto master and resolved conflicts.
@chyezh E2e jenkins job failed, comment
Force-pushed from 511f897 to 35343b4
Fixed conflicts with master. Updated the commit for comments from @weiliu1031:
- Add declarative resource group api
- Add config for resource group management
- Resource group recovery enhancement

Signed-off-by: chyezh <chyezh@outlook.com>
…over Signed-off-by: chyezh <chyezh@outlook.com>
Force-pushed from 35343b4 to b061234
Fix conflict at
@chyezh E2e jenkins job failed, comment
/run-cpu-e2e
/lgtm
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: chyezh, xiaofan-luan
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
issue: #31930 Signed-off-by: chyezh <chyezh@outlook.com>
issue: #30647
- Add declarative resource group api
- Add config for resource group management
- Resource group recovery enhancement
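To illustrate what "declarative" means for this API, a hedged sketch: instead of a sequence of imperative TransferNode calls, the client describes the desired node counts and transfer preferences for every group in a single update, and QueryCoord's recovery loop moves nodes until the cluster matches that state. The field names below mirror the config seen in the diff but are simplified stand-ins, not the exact proto.

```go
package main

import "fmt"

type ResourceGroupTransfer struct{ ResourceGroup string }

type ResourceGroupConfig struct {
	RequestsNodeNum int32
	LimitsNodeNum   int32
	TransferFrom    []*ResourceGroupTransfer
	TransferTo      []*ResourceGroupTransfer
}

func main() {
	// One declarative request describes the desired state of every group.
	desired := map[string]*ResourceGroupConfig{
		"rg_search": {
			RequestsNodeNum: 4,
			LimitsNodeNum:   6,
			TransferFrom:    []*ResourceGroupTransfer{{ResourceGroup: "__default"}},
		},
		"rg_ingest": {
			RequestsNodeNum: 2,
			LimitsNodeNum:   2,
			TransferTo:      []*ResourceGroupTransfer{{ResourceGroup: "rg_search"}},
		},
	}
	for name, cfg := range desired {
		fmt.Printf("%s -> %+v\n", name, *cfg)
	}
}
```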