-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Goroutine memory leak #397
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -119,24 +119,34 @@ func (r *ServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct | |
namespace = req.Name | ||
n := &corev1.Namespace{} | ||
if err := r.Client.Get(ctx, req.NamespacedName, n); err != nil { | ||
if k8serr.IsNotFound(err) { | ||
err = nil | ||
} | ||
return ctrl.Result{}, err | ||
} | ||
if !modelMeshEnabled(n, r.ControllerDeployment.Namespace) { | ||
sl := &corev1.ServiceList{} | ||
err := r.List(ctx, sl, client.HasLabels{"modelmesh-service"}, client.InNamespace(namespace)) | ||
if err == nil { | ||
if err := r.List(ctx, sl, client.HasLabels{"modelmesh-service"}, client.InNamespace(namespace)); err != nil { | ||
return ctrl.Result{}, err | ||
} else { | ||
for i := range sl.Items { | ||
s := &sl.Items[i] | ||
if err2 := r.Delete(ctx, s); err2 != nil && err == nil { | ||
err = err2 | ||
if err := r.Delete(ctx, s); err != nil { | ||
return ctrl.Result{}, err | ||
Comment on lines
+134
to
+140
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is the same logic but I tried to separate error handling.
|
||
} | ||
} | ||
} | ||
|
||
if mms := r.MMServices.Get(namespace); mms != nil { | ||
mms.Disconnect() | ||
r.MMServices.Delete(namespace) | ||
//requeue is never expected here | ||
if _, err, _ := r.reconcileService(ctx, mms, namespace, owner); err != nil { | ||
return ctrl.Result{}, err | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If the namespace is not for modelmesh anymore, it should trigger reconcileService for MMService list that manages the goroutines. |
||
} | ||
return ctrl.Result{}, err | ||
|
||
return ctrl.Result{}, nil | ||
} | ||
owner = n | ||
} else { | ||
|
@@ -226,6 +236,18 @@ func (r *ServiceReconciler) reconcileService(ctx context.Context, mms *mmesh.MMS | |
if err := r.List(ctx, sl, client.HasLabels{"modelmesh-service"}, client.InNamespace(namespace)); err != nil { | ||
return nil, err, false | ||
} | ||
|
||
n := &corev1.Namespace{} | ||
if err := r.Client.Get(ctx, types.NamespacedName{Name: namespace}, n); err != nil { | ||
return nil, err, false | ||
} | ||
|
||
if !modelMeshEnabled(n, r.ControllerDeployment.Namespace) { | ||
r.ModelEventStream.RemoveWatchedService(serviceName, namespace) | ||
r.Log.Info("Deleted Watched Service", "name", serviceName, "namespace", namespace) | ||
return nil, nil, false | ||
} | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This will remove the goroutine when modelmesh is not enabled for a namespace.
|
||
var s *corev1.Service | ||
for i := range sl.Items { | ||
ss := &sl.Items[i] | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,6 +24,10 @@ import ( | |
|
||
func modelMeshEnabled(n *corev1.Namespace, controllerNamespace string) bool { | ||
if v, ok := n.Labels["modelmesh-enabled"]; ok { | ||
// Returns false if the namespace state is terminating even though the namespace have the 'modelmesh-enabled=true' label. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This comment is helpful. Could you add comments to each of the changed code blocks above to describe what the code does and why (what problem it is solving). |
||
if n.Status.Phase == corev1.NamespaceTerminating { | ||
return false | ||
} | ||
return v == "true" | ||
} | ||
return n.Name == controllerNamespace | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, controller keep checking namespaces even though the namespaces do not exist anymore.
As a result, a lot of misleading error messages showed up in the log. (I don't find the snippet of the message now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I think the ServingRuntime controller also has this problem (erroring when reconciling after a namespace is deleted). I was getting similar error logs from it. But that can be resolved in a separate PR.
I noticed that there is a difference in logic in this PR whether the namespace is terminating (cleanup resources created for the namespace) or if the namespace doesn't exist (skip reconciliation). I think there is a chance that a reconciliation might not see that a namespace is in the terminating state by the time it reconciles. If the namespace has been deleted, we'd still want to ensure disconnect/cleanup of the MMService and run
ModelEventStream.RemoveWatchedService
. If the cleanup is moved to the namespace-deleted case then we would not even need the check to see if the namespace is terminatingThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tjohnson31415 Thanks for the review. From my 2 cents, the operator keeps watching the namespace so I don't think there is a possibility of missing the terminated status of the Namespace object. When any namespace changes its status, this operator detects it. WDYT?
Regarding the
IsNotFound
part, do you want me to update this pr without it? and send another pr to update it when this original PR is merged?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one more comment about the namespace object status. The main issue was the operator does not have logic when the namespace is under
Terminating
status(source). From my understanding,Terminating
is the only problematic status in this procedureActive
+modelmesh-enabled"
label ==> modelMeshEnabled return trueTerminating
+modelmesh-enabled"
label ==> modelMeshEnabled return trueActive
+ no label ==> modelMeshEnabled return falseThe other status does not impact this logic.
The most important point here is, that when any namespace's status changes, this reconclie will be called and check the above logic. When any namespace gets
terminate
signal, the status will be changed toTerminating
.please correct me if something is wrong here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the "happy path" case, I think that reconciliation should always be able to observe the terminating state. In error/edge cases (eg. many namespaces deleted in bulk, error encountered during reconciliation and a requeue happening), I'm not so sure since "Terminating" is a temporary status.
Before this PR, the reconciliation handled neither "Terminating" nor "Not Found". This PR implements clean-up if the namespace is seen as Terminating, but, If reconciliation happens for a namespace that no longer exists, should the code:
(a) Assume that in-memory resources for that namespace do not exists and just skip reconciliation (i.e. assume that if modelmesh was enabled for that namespace that it previously reconciled while it was Terminating)
(b) Check for in-memory resources that exist and clean them up if they do
With the "level-based" reconciliation that Operators are meant to implement, I think (b) makes more sense as it reconciles based on the current state, without assumption of past reconciliation.