Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kubernetes secret provider: Caching it to avoid performance issues #3594

Closed
2 tasks done
constanca-m opened this issue Oct 12, 2023 · 16 comments
Closed
2 tasks done

Kubernetes secret provider: Caching it to avoid performance issues #3594

constanca-m opened this issue Oct 12, 2023 · 16 comments
Assignees

Comments

@constanca-m
Copy link
Contributor

constanca-m commented Oct 12, 2023

Issue

This issue was first mentioned in #3442.

When we use a kubernetes secret provider, we need to get its result every time:

secret, err := secretIntefrace.Get(ctx, secretName, metav1.GetOptions{})

This is not ideal, because we could be just fetching the result one time and using it for the future, instead of repeating the same steps every time.

We need to implement a way to reduce the amount of requests to the API server for this by caching the result.

To do

@blakerouse
Copy link
Contributor

Duplicate of elastic/beats#3442

@blakerouse blakerouse marked this as a duplicate of #3442 Oct 16, 2023
@constanca-m constanca-m reopened this Oct 31, 2023
@constanca-m constanca-m self-assigned this Nov 16, 2023
@constanca-m
Copy link
Contributor Author

constanca-m commented Nov 16, 2023

First thoughts:

Caching the result by default is not as straightforward as a better solution as it seemed.

To cache the results we need to find a way to update them. There are several ways to do that, but most ways (TTL, for example) might lead to incorrect results if the cache is not updated in time. One way that seems efficient is to use watchers, but that will lead to an impact in memory. We have seen this in other cases, like add_resource_metadata (issue for that can be found here).

So even though there is a way to cache the results, we have two solutions that don't seem ideal:

  1. Make a request to the API Server every time we need a secret (the current one).
  2. Use a cache and have watchers so we know that the cache always had the correct value for the secret - it impacts memory.

Now we need to make sure if it is less bad to use option 1 or 2.

@cmacknz
Copy link
Member

cmacknz commented Nov 22, 2023

One way that seems efficient is to use watchers, but that will lead to an impact in memory. We have seen this in other cases, like add_resource_metadata (issue for that can be found #4670).

Do we know why the watchers in that case increased memory, and can confirm for sure it will have the same effect here? If we do know why, can we fix it? Watching the secret keys seems like it is the ideal solution here.

There are an increasing number of support cases where we are making problematic amounts of calls to the k8s API so trying to reduce them seems better in the long term.

@constanca-m
Copy link
Contributor Author

Do we know why the watchers in that case increased memory, and can confirm for sure it will have the same effect here? If we do know why, can we fix it? Watching the secret keys seems like it is the ideal solution here.

We some SDHs related to that, and an issue is well. Since we would be taking the same approach for this, it seems sure that memory would be affected.

@axw Could you please share your input? Should we try to go for this solution and see the results? If they are badly received (many SDHs about memory), maybe we undo this decision?

@axw
Copy link
Member

axw commented Nov 23, 2023

@constanca-m I don't have a lot of insight to provide here, but I tend to agree with @cmacknz that watchers sound like the right approach.

I think we should first:

  • identify why using watchers has increased memory usage elsewhere
  • understand whether the cause of that also applies to this use case

Assuming there's no fundamental reason not to use watchers here, then let's go ahead with that but also perform testing & memory profiling. We should probably perform a test in a cluster with many secrets, to make sure memory usage is only proportional to the secrets that are actually referenced.

Perhaps the memory issue is related to caching things that will never be accessed? In which case, perhaps we could combine the current approach with a watcher, and only cache/update secrets that are requested. Something like synchronously fetch a secret on first use (if it's not cached), and then keep it up to date with a watcher. WDYT @gizas?

@gizas
Copy link
Contributor

gizas commented Nov 23, 2023

This is the issue we found in our previous analysis kubernetes/client-go#871

By using the watchers indeed we will see a memory increase, this is why we use the cache. But still we have not verified if the amount of increase is justified.

For this issue I am thinking we can provide an option on/off for users if they want to use it.
Indeed profiling is recommended here. I would also try to increase the number of secrtes and make some more heavy testing to see how it behaves

@constanca-m
Copy link
Contributor Author

For this issue I am thinking we can provide an option on/off for users if they want to use it.

I like this option. I will implement this approach as the default then, and add the new option if the user does not want to use it.

I would also try to increase the number of secrtes and make some more heavy testing to see how it behaves

I am unsure how the testing would be done, I need to do more research on that.

@constanca-m
Copy link
Contributor Author

constanca-m commented Nov 24, 2023

I ran some tests and I realized that we fetch the secret for multiple resources at the same time. However, we don't update the secret value immediately after it has been updated. This means that sometimes we keep using a secret value that is no longer up to date and we get incorrect values. So based on this, what approach is best?

Option 1
Keep using this logic, but we use a cache like the issue description suggests.

  • Pros:
    • We will not see significant impact in memory (likely as now)
    • The approach is still the same as now apart from minor changes
  • Cons:
    • We update the secrets periodically, which means that they are wrong some times (same as now)

Option 2
We use watchers for secrets.

  • Pros:
    • Secrets are always up to date
  • Cons:
    • Memory increase
    • The secrets might not be updated for a long time, so we have watchers running for nothing.
    • We need more permissions. With current approach we only need "get", but with this one we would need "watch" as well, and I think "list" too.

@gizas @axw Which one seems the best?
At first I was going for option 2, but since we don't receive complains on wrong secret values and there is no other issues with option 1, I am leaning towards that option.

@axw
Copy link
Member

axw commented Nov 25, 2023

I ran some tests and I realized that we fetch the secret for multiple resources at the same time. However, we don't update the secret value immediately after it has been updated. This means that sometimes we keep using a secret value that is no longer up to date and we get incorrect values. So based on this, what approach is best?

@constanca-m could you please elaborate on this? When you say it's not updated immediately, do you mean that it's never updated? Or there's a delay? If there's a delay, how long is it and what causes it?

Watchers still feel right to me, but if there's an existing consistency issue then I guess I'd be OK with a quick fix to add an LRU cache with an option to disable.

Unrelated to this issue specifically, I went to see what opentelemetry-collector-contrib is doing. I found open-telemetry/opentelemetry-collector-contrib#23067, and open-telemetry/opentelemetry-collector-contrib#23226 which was opened to help reduce memory usage. May be a good source of inspiration if we need to go down the watcher/informer route.

@constanca-m
Copy link
Contributor Author

@constanca-m could you please elaborate on this? When you say it's not updated immediately, do you mean that it's never updated? Or there's a delay? If there's a delay, how long is it and what causes it?

It is updated. I tried to see the update of a secret by adding a field to a data stream like this:

            processors:
              - add_fields:
                  target: my-new-fields
                  fields:
                    name: my-secret-field
                    value: ${kubernetes_secrets.default.my-secret-name.file}

And what I notice is that we would still receive documents for a few minutes with the old secret value, but after some time it gets updated. I thought it was something periodic, but from my tests the period was not consistent - it could be 3, 5 or 7 minutes sometimes. I tried to follow the code to see where the function to update was being called, but it leads to a dead end (it leads to the functions of this file and then nowhere else). @gizas , as you have more experience, do you know where this update is being triggered?

@gizas
Copy link
Contributor

gizas commented Nov 27, 2023

The workflow that I have in my mind is:

  1. Elastic agent controller fetches the list of providers (including the kubernetes one). Then as second step
    2)Controller manages the state of the providers current context through channels and receives the information from transpilers . One of those is the ast tree that you pointed out.
    so the 3)rd step is the transformation of configuration into data.

Probably you have it, this is the inital PR for https://github.com/elastic/beats/pull/24789/files

But to our issue now:

it could be 3, 5 or 7 minutes sometimes.

So it seems that all this is event based and we need a relevant watch event that will perform a lookup and call for eg. https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/transpiler/ast.go#L438. @constanca-m this is my theory for not having consistent times

Option 1: The approach is still the same as now apart from minor changes

@constanca-m what will be the changes for Option 1?

and open-telemetry/opentelemetry-collector-contrib#23226 which was opened to help reduce memory usage.

@axw I will try to use this in combination to this issue 3417, which can be part of the wider meta issue.

  • Currently I dont see a reason why to start this issue in general as long as we dont have complains indeed. What do you think if we prioritise 3417 above?
    -If we still think we need this issue, I would add a flag on/off to keep both solutions for secrets Option 1 and Option 2 and leave users the ability to choose with pros and cons

@constanca-m
Copy link
Contributor Author

So it seems that all this is event based and we need a relevant watch event that will perform a lookup and call for eg. https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/transpiler/ast.go#L438. @constanca-m this is my theory for not having consistent times

The old workflow was in this README file. This is a very old file that even points to the beat repository. I can find most of the new steps in the elastic agent repository, and I can see the whole workflow until kubernetes secret provider starts running. Like you said:

  1. We have a controller created.
  2. This controller goes over all dynamic and context providers and creates them.
  3. The controller starts running, which causes all these dynamic and context providers to start running as well.

Option 1: The approach is still the same as now apart from minor changes

I am thinking for option 1:

  1. Add a map for each secret value. The secret value is added to the map like this:
secretMap[secret-name] = {
     secret-value: value
     time: present time
}

We need to register the time so we can check when to update and fetch the secret again. Let's say we watch to have values that are never older than 5 minutes.

  1. For each time we try to fetch:
    1. We check if the secret name is in our map.
    2. It is present:
      1. Check if the secret has been there for longer than 5 minutes.
        1. Yes: we fetch the secret again.
        2. No: we return the secret value.
    3. It is not present: we fetch the secret and place it in the map like this:
secretsMap[secret-name] = {
   secret-value: value just fetched
   time: present time
}

So for this option we would now have a variable such as secret_time where we tell that secrets that are older than secret_time need to be updated. @gizas

The problem: where is this fetch function being called? Because like you linked, I cannot find usages for the Apply function of the AST... I can find the transpiler and everything, I just cannot tell what is triggering it.

Currently I dont see a reason why to start this issue in general as long as we dont have complains indeed. What do you think if we prioritise elastic/beats#34717 above?

I also think the watchers would carry the problem that we would be trying to check more secrets that we actually need. So if we only want to fetch one secret, and we have 10 more in the same namespace and node, we could be watching them as well for no reason.

@gizas
Copy link
Contributor

gizas commented Nov 27, 2023

+1 for secrtet_time then ! I like it better (can we just call it ttl ? :) )

@axw what do think?

@constanca-m
Copy link
Contributor Author

can we just call it ttl ?

Of course, I just called it the first time that came to mind

@constanca-m
Copy link
Contributor Author

I opened the PR for this issue: #3822

@axw
Copy link
Member

axw commented Nov 28, 2023

@axw what do think?

Sounds like a reasonable approach. One other thing to consider: what happens if the secret is not cached locally, and there's also no matching secret in Kubernetes? Should we cache that knowledge? Otherwise every attempt to fetch will trigger a request to the API server.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants