Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Goroutine memory leak #397

Merged
merged 4 commits into from
Sep 19, 2023
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 27 additions & 5 deletions controllers/service_controller.go
Original file line number Diff line number Diff line change
Expand Up @@ -119,24 +119,34 @@ func (r *ServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
namespace = req.Name
n := &corev1.Namespace{}
if err := r.Client.Get(ctx, req.NamespacedName, n); err != nil {
if k8serr.IsNotFound(err) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, controller keep checking namespaces even though the namespaces do not exist anymore.
As a result, a lot of misleading error messages showed up in the log. (I don't find the snippet of the message now)

Copy link
Contributor

@tjohnson31415 tjohnson31415 Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I think the ServingRuntime controller also has this problem (erroring when reconciling after a namespace is deleted). I was getting similar error logs from it. But that can be resolved in a separate PR.

I noticed that there is a difference in logic in this PR whether the namespace is terminating (cleanup resources created for the namespace) or if the namespace doesn't exist (skip reconciliation). I think there is a chance that a reconciliation might not see that a namespace is in the terminating state by the time it reconciles. If the namespace has been deleted, we'd still want to ensure disconnect/cleanup of the MMService and run ModelEventStream.RemoveWatchedService. If the cleanup is moved to the namespace-deleted case then we would not even need the check to see if the namespace is terminating

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tjohnson31415 Thanks for the review. From my 2 cents, the operator keeps watching the namespace so I don't think there is a possibility of missing the terminated status of the Namespace object. When any namespace changes its status, this operator detects it. WDYT?

Regarding the IsNotFound part, do you want me to update this pr without it? and send another pr to update it when this original PR is merged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more comment about the namespace object status. The main issue was the operator does not have logic when the namespace is under Terminating status(source). From my understanding, Terminating is the only problematic status in this procedure

  • Active + modelmesh-enabled" label ==> modelMeshEnabled return true
  • Terminating + modelmesh-enabled" label ==> modelMeshEnabled return true
  • Active + no label ==> modelMeshEnabled return false
    The other status does not impact this logic.

The most important point here is, that when any namespace's status changes, this reconclie will be called and check the above logic. When any namespace gets terminate signal, the status will be changed to Terminating.

please correct me if something is wrong here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the "happy path" case, I think that reconciliation should always be able to observe the terminating state. In error/edge cases (eg. many namespaces deleted in bulk, error encountered during reconciliation and a requeue happening), I'm not so sure since "Terminating" is a temporary status.

Before this PR, the reconciliation handled neither "Terminating" nor "Not Found". This PR implements clean-up if the namespace is seen as Terminating, but, If reconciliation happens for a namespace that no longer exists, should the code:
(a) Assume that in-memory resources for that namespace do not exists and just skip reconciliation (i.e. assume that if modelmesh was enabled for that namespace that it previously reconciled while it was Terminating)
(b) Check for in-memory resources that exist and clean them up if they do

With the "level-based" reconciliation that Operators are meant to implement, I think (b) makes more sense as it reconciles based on the current state, without assumption of past reconciliation.

err = nil
}
return ctrl.Result{}, err
}
if !modelMeshEnabled(n, r.ControllerDeployment.Namespace) {
sl := &corev1.ServiceList{}
err := r.List(ctx, sl, client.HasLabels{"modelmesh-service"}, client.InNamespace(namespace))
if err == nil {
if err := r.List(ctx, sl, client.HasLabels{"modelmesh-service"}, client.InNamespace(namespace)); err != nil {
return ctrl.Result{}, err
} else {
for i := range sl.Items {
s := &sl.Items[i]
if err2 := r.Delete(ctx, s); err2 != nil && err == nil {
err = err2
if err := r.Delete(ctx, s); err != nil {
return ctrl.Result{}, err
Comment on lines +134 to +140
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the same logic but I tried to separate error handling.
The logic is

  • If the namespace is not for modelmesh anymore, it will delete modelmesh Service when it exists.
  • If the namespace is terminated, it does not need to delete the modelmesh Service because it will be gone with the namespace

}
}
}

if mms := r.MMServices.Get(namespace); mms != nil {
mms.Disconnect()
r.MMServices.Delete(namespace)
//requeue is never expected here
if _, err, _ := r.reconcileService(ctx, mms, namespace, owner); err != nil {
return ctrl.Result{}, err
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the namespace is not for modelmesh anymore, it should trigger reconcileService for MMService list that manages the goroutines.

}
return ctrl.Result{}, err

return ctrl.Result{}, nil
}
owner = n
} else {
Expand Down Expand Up @@ -226,6 +236,18 @@ func (r *ServiceReconciler) reconcileService(ctx context.Context, mms *mmesh.MMS
if err := r.List(ctx, sl, client.HasLabels{"modelmesh-service"}, client.InNamespace(namespace)); err != nil {
return nil, err, false
}

n := &corev1.Namespace{}
if err := r.Client.Get(ctx, types.NamespacedName{Name: namespace}, n); err != nil {
return nil, err, false
}

if !modelMeshEnabled(n, r.ControllerDeployment.Namespace) {
r.ModelEventStream.RemoveWatchedService(serviceName, namespace)
r.Log.Info("Deleted Watched Service", "name", serviceName, "namespace", namespace)
return nil, nil, false
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will remove the goroutine when modelmesh is not enabled for a namespace.

  • when the namespace does not have the annotation modelmesh-enabled
  • when the namespace is under a Terminating state.

var s *corev1.Service
for i := range sl.Items {
ss := &sl.Items[i]
Expand Down
4 changes: 4 additions & 0 deletions controllers/util.go
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,10 @@ import (

func modelMeshEnabled(n *corev1.Namespace, controllerNamespace string) bool {
if v, ok := n.Labels["modelmesh-enabled"]; ok {
// Returns false if the namespace state is terminating even though the namespace have the 'modelmesh-enabled=true' label.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is helpful. Could you add comments to each of the changed code blocks above to describe what the code does and why (what problem it is solving).

if n.Status.Phase == corev1.NamespaceTerminating {
return false
}
return v == "true"
}
return n.Name == controllerNamespace
Expand Down