
[Fleet] Proposal: Store installed packages in cluster #81110

Closed
ruflin opened this issue Oct 20, 2020 · 18 comments
Labels
Team:Fleet Team label for Observability Data Collection Fleet team v7.11.0

Comments

@ruflin
Contributor

ruflin commented Oct 20, 2020

The package manager was built with the idea that the registry and packages are always available. The current implementation uses a local in-memory cache for package contents; whenever a package is missing from this cache, it is re-fetched from the registry. Over the last two releases, quite a few issues have shown up where it became a problem that packages are only available from the registry:

  • Packages are removed from the registry: In general, this should not happen in production, but it is common on the snapshot registry, where packages are not always stable. Removing a package puts Fleet into a state where it can't pull the assets again.
  • Registry not available: Luckily, this has not really happened yet, but with users running a local or on-prem registry it becomes more likely, and at some point our own registry will have downtime. In that case Fleet must stay operational, meaning it should still be possible to create new policies, even if new packages can't be installed.
  • Upgrade rollback: When an upgrade of a package fails, it is rolled back. If the older version of the package no longer exists, Fleet ends up stuck between two package versions.
  • Package installation by direct upload: We are working on making it possible to upload packages directly to Kibana. Currently these packages are only cached in memory and not stored anywhere, which means they are lost on restart.
  • Memory issue: As all packages are currently kept in memory, they add to Kibana's memory usage.
  • Changing registry: A user tests the staging registry and then switches to the production registry. Now an installed package might not be available anymore because it was only available on staging. The same would happen if we supported multiple registries in the future and one of them disappeared.

To solve all of the above problems, I'm proposing to not only cache the packages in memory, but also store them in a dedicated ES index. This also unifies how packages work, whether they are uploaded as a zip, fetched from the registry, or added through any other mechanism. Below is an image to visualise this:

image

One important detail here is that, for browsing packages from the registry without installing them, they should not be downloaded (see #76261). Packages which are uploaded are always installed.

How exactly packages and assets are stored in Elasticsearch should be a follow-up discussion if we decide to move forward with this.

Decision: Agreed to move forward. Follow up discussion ticket: #83426

Links

@ruflin ruflin added the Team:Fleet Team label for Observability Data Collection Fleet team label Oct 20, 2020
@elasticmachine
Contributor

Pinging @elastic/ingest-management (Team:Ingest Management)

@skh
Contributor

skh commented Oct 26, 2020

I would clarify the use of the word 'local' in the initial description:

  • "The current implementation uses a local cache and pulls down the package again if any files are missing or a package does not exist." -- The current implementation uses a local in-memory cache [...]. If Kibana runs clustered, every Kibana instance has its own in-memory cache.

  • "store locally" -- I assume what is meant is to store in Elasticsearch in a dedicated index? As Kibana can be clustered, local file storage should not be used.

@ruflin
Contributor Author

ruflin commented Oct 26, 2020

@skh Spot on. Can you directly update the issue?

@skh
Contributor

skh commented Oct 26, 2020

In addition, I see two ways to store packages in ES:

  • as the zip file, in a binary field, which may be large (see also https://github.com/elastic/dev/issues/1544 )
  • every file from the unpacked zip file in a separate document, containing
    • package name
    • package version
    • package source (upload or registry)
    • file type (e.g. screenshot, icon, field definitions, ingest pipeline, ...)
    • file path (because the folder structure in the package carries meaning, e.g. which fields.yml belongs to which data stream)
    • file content as binary field

The zip file would always need to be downloaded and unpacked as a whole.

Single files could be queried by file type or path so that we can access single assets more quickly, but there may be many of them in some packages.
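
To make the second option concrete, here is a minimal TypeScript sketch of what one per-file document could look like. The interface and field names are illustrative only, not a final mapping:

```ts
// Illustrative shape for "one file per document" storage; every name here
// (interface, fields, example values) is hypothetical, not a final design.
interface PackageAssetDoc {
  package_name: string;                    // e.g. "aws"
  package_version: string;                 // e.g. "0.2.7"
  install_source: 'registry' | 'upload';   // where the package came from
  asset_type: string;                      // e.g. "screenshot", "icon", "fields", "ingest_pipeline"
  asset_path: string;                      // folder structure carries meaning (which data stream a fields.yml belongs to)
  data_base64: string;                     // file content, base64-encoded for a binary field
}

const exampleDoc: PackageAssetDoc = {
  package_name: 'aws',
  package_version: '0.2.7',
  install_source: 'registry',
  asset_type: 'icon',
  asset_path: 'aws/0.2.7/img/logo_aws.svg',
  data_base64: Buffer.from('<svg xmlns="http://www.w3.org/2000/svg"/>').toString('base64'),
};
```

Keeping path and type as separate fields is what would make single assets queryable without unpacking a whole archive.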

@ruflin
Contributor Author

ruflin commented Oct 27, 2020

I like the idea of storing each file as a document instead of the zip file. It not only allows us to query as you mentioned, but also lets us exclude certain files from being stored and add metadata to each, like a hash, date modified / date installed, etc.

@ph ph added the v7.11.0 label Oct 28, 2020
@ph
Contributor

ph commented Oct 28, 2020

You basically have a VFS over Elasticsearch. I like the idea @skh. One thing to consider for a future improvement/release is signatures for packages / files. I am not sure of the level of risk here, but it could be possible for a user to tamper with a file by updating its document.

@jfsiii
Contributor

jfsiii commented Oct 28, 2020

👍 to the problem description and proposal. Two things that come to mind:

Number of assets

I'm curious about the difference between making 10, 100, etc. requests to ES vs. serving them from memory. The best-case scenario (few assets & a fast connection to ES) might not be noticeably affected, but the more assets or the greater the latency to ES, the slower things will feel. One option is to keep the memory cache (changing to an LRU or something less naive than now) and add values to it on their way into ES. That way we keep the durability of ES but still avoid the latency issues.
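
As a rough illustration of that cache-aside idea (the loader function here is hypothetical; it would read the asset document from the dedicated index):

```ts
// Bounded cache-aside helper: look in memory first, fall back to ES on a miss,
// and remember the result on its way back. A real implementation would likely
// use an LRU; this sketch just evicts the oldest insertion when full.
class AssetCache {
  private store = new Map<string, Buffer>();
  constructor(private maxEntries: number) {}

  async get(key: string, loadFromEs: (key: string) => Promise<Buffer>): Promise<Buffer> {
    const hit = this.store.get(key);
    if (hit) return hit;                         // served from memory, no ES round trip

    const value = await loadFromEs(key);         // only pay the ES latency on a miss
    if (this.store.size >= this.maxEntries) {
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.set(key, value);
    return value;
  }
}
```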

Dealing with binary assets (images)

We'll have to base64-encode any binary asset, which adds about 30% to the file size. There's also a CPU cost to decoding them. Again, storing the decoded Buffer in the memory cache would help.
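
For a feel of the numbers, a quick Node snippet (the file size is made up for illustration):

```ts
// Base64 turns every 3 raw bytes into 4 ASCII characters, so the encoded
// form is ~4/3 the original size; decoding it back costs CPU before serving.
const raw = Buffer.alloc(396 * 1024);                    // pretend this is a ~396 KB PNG
const encoded = raw.toString('base64');
console.log((encoded.length / raw.length).toFixed(2));   // ~1.33

const decoded = Buffer.from(encoded, 'base64');          // the CPU cost on the read path
console.log(decoded.length === raw.length);              // true
```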

Here's a quick check of the image file sizes. Remember these will be about 30% larger after base64 encoding:

image file sizes
du -h -d1 */*/img/*
396K	aws/0.2.5/img/filebeat-aws-cloudtrail.png
1.1M	aws/0.2.5/img/filebeat-aws-elb-overview.png
188K	aws/0.2.5/img/filebeat-aws-s3access-overview.png
2.5M	aws/0.2.5/img/filebeat-aws-vpcflow-overview.png
8.0K	aws/0.2.5/img/logo_aws.svg
384K	aws/0.2.5/img/metricbeat-aws-billing-overview.png
260K	aws/0.2.5/img/metricbeat-aws-dynamodb-overview.png
1.1M	aws/0.2.5/img/metricbeat-aws-ebs-overview.png
600K	aws/0.2.5/img/metricbeat-aws-ec2-overview.png
600K	aws/0.2.5/img/metricbeat-aws-elb-overview.png
496K	aws/0.2.5/img/metricbeat-aws-lambda-overview.png
792K	aws/0.2.5/img/metricbeat-aws-overview.png
640K	aws/0.2.5/img/metricbeat-aws-rds-overview.png
332K	aws/0.2.5/img/metricbeat-aws-s3-overview.png
720K	aws/0.2.5/img/metricbeat-aws-sns-overview.png
348K	aws/0.2.5/img/metricbeat-aws-sqs-overview.png
560K	aws/0.2.5/img/metricbeat-aws-usage-overview.png
396K	aws/0.2.7/img/filebeat-aws-cloudtrail.png
1.1M	aws/0.2.7/img/filebeat-aws-elb-overview.png
188K	aws/0.2.7/img/filebeat-aws-s3access-overview.png
2.5M	aws/0.2.7/img/filebeat-aws-vpcflow-overview.png
8.0K	aws/0.2.7/img/logo_aws.svg
384K	aws/0.2.7/img/metricbeat-aws-billing-overview.png
260K	aws/0.2.7/img/metricbeat-aws-dynamodb-overview.png
1.1M	aws/0.2.7/img/metricbeat-aws-ebs-overview.png
600K	aws/0.2.7/img/metricbeat-aws-ec2-overview.png
600K	aws/0.2.7/img/metricbeat-aws-elb-overview.png
496K	aws/0.2.7/img/metricbeat-aws-lambda-overview.png
792K	aws/0.2.7/img/metricbeat-aws-overview.png
640K	aws/0.2.7/img/metricbeat-aws-rds-overview.png
332K	aws/0.2.7/img/metricbeat-aws-s3-overview.png
720K	aws/0.2.7/img/metricbeat-aws-sns-overview.png
348K	aws/0.2.7/img/metricbeat-aws-sqs-overview.png
560K	aws/0.2.7/img/metricbeat-aws-usage-overview.png
8.0K	checkpoint/0.1.0/img/checkpoint-logo.svg
4.0K	cisco/0.3.0/img/cisco.svg
796K	cisco/0.3.0/img/kibana-cisco-asa.png
 12K	crowdstrike/0.1.2/img/logo-integrations-crowdstrike.svg
392K	crowdstrike/0.1.2/img/siem-alerts-cs.jpg
512K	crowdstrike/0.1.2/img/siem-events-cs.jpg
4.0K	endpoint/0.14.0/img/security-logo-color-64px.svg
4.0K	endpoint/0.15.0/img/security-logo-color-64px.svg
4.0K	fortinet/0.1.0/img/fortinet-logo.svg
4.0K	microsoft/0.1.0/img/logo.svg
424K	o365/0.1.0/img/filebeat-o365-audit.png
296K	o365/0.1.0/img/filebeat-o365-azure-permissions.png
 16K	o365/0.1.0/img/logo-integrations-microsoft-365.svg
436K	okta/0.1.0/img/filebeat-okta-dashboard.png
4.0K	okta/0.1.0/img/okta-logo.svg
476K	panw/0.1.0/img/filebeat-panw-threat.png
1.5M	panw/0.1.0/img/filebeat-panw-traffic.png
 12K	panw/0.1.0/img/logo-integrations-paloalto-networks.svg

package-storage image sizes in KB

@jfsiii
Contributor

jfsiii commented Oct 28, 2020

To clarify, I'm saying we would still put assets in ES, but use a cache to store ready-to-serve values to avoid hitting ES and doing any unnecessary work. We could add TTL or any other logic to decide when to use or invalidate cache entries.

@skh
Contributor

skh commented Oct 29, 2020

One option is to keep the memory cache (changing to an LRU or something less naive than now) and add values to it on their way into ES. That way we keep the durability of ES but still avoid the latency issues.

Would it be an option to keep the in-memory cache, but purge some files, like ES and Kibana assets from it regularly, while keeping others, like images, for longer?

@jfsiii
Contributor

jfsiii commented Oct 29, 2020

Would it be an option to keep the in-memory cache, but purge some files, like ES and Kibana assets from it regularly, while keeping others, like images, for longer?

Definitely. That's what I was getting at with

We could add TTL or any other logic to decide when to use or invalidate cache entries.

We'll have to define the rules and then see if there's an existing package that does what we want out of the box or if we need to wrap one with some code to manage it.

Seems like we want support for both TTL (different by asset class) and a max memory size for the cache.

https://github.com/isaacs/node-lru-cache is an existing dependency and my go-to, but it doesn't support per-entry TTL. I think we'd have to create multiple caches to get different expiration policies.

I did some searching and both https://github.com/node-cache/node-cache & https://github.com/thi-ng/umbrella/tree/develop/packages/cache seem like they'd work for this case.
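
If we end up with multiple caches, a sketch along these lines (using lru-cache's classic `max`/`maxAge` options; newer majors rename `maxAge` to `ttl`, so treat this as version-dependent) might be enough:

```ts
import LRU from 'lru-cache'; // classic v5/v6-style API assumed here

// One cache per asset class so each gets its own expiration policy:
// images can live for a long time, ES/Kibana assets are purged sooner.
const imageCache = new LRU<string, Buffer>({
  max: 100,                     // at most 100 images held in memory
  maxAge: 24 * 60 * 60 * 1000,  // keep images around for a day
});

const assetCache = new LRU<string, Buffer>({
  max: 500,
  maxAge: 5 * 60 * 1000,        // purge ES/Kibana assets after five minutes
});

// Hypothetical helper to pick the cache by asset class.
function cacheFor(assetType: string) {
  return assetType === 'screenshot' || assetType === 'icon' ? imageCache : assetCache;
}
```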

@ruflin
Contributor Author

ruflin commented Oct 30, 2020

Before we add a cache, we should first test whether we really need it. Having a cache will speed things up but also make things more complicated.

Quite a few of the large assets are images and are only used when viewed in the browser. I assume the browser cache will also help us here, so each user only loads them once?

@jfsiii
Contributor

jfsiii commented Oct 30, 2020

I agree we should profile. The additional work/complexity is low so we can add it later.

The browser cache will also need some work (setting headers) but we can look at that when profiling.
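
The header part is just a long-lived Cache-Control on asset responses so a browser only downloads each image once; a bare-bones sketch with Node's http module (not Kibana's actual route handler):

```ts
import { createServer } from 'http';

// Bare-bones illustration: whatever endpoint ends up serving package images
// would set a long-lived Cache-Control header so browsers re-use the asset.
createServer((req, res) => {
  res.writeHead(200, {
    'Content-Type': 'image/svg+xml',
    'Cache-Control': 'public, max-age=86400',           // let browsers keep it for a day
  });
  res.end('<svg xmlns="http://www.w3.org/2000/svg"/>'); // placeholder body
}).listen(3000);
```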

@skh
Contributor

skh commented Nov 2, 2020

When a package is uninstalled from the system, I'd propose that it be removed from the storage index as well.

That way the storage index doesn't silently turn into a secondary installation source that we need to check during package listings and installations.

@neptunian
Contributor

Upgrade rollback: When an upgrade of a package fails, it is rolled back. If the older version of the package no longer exists, Fleet ends up stuck between two package versions.

I'm not sure at what point during the package installation process we want to update the storage index, but if possible, it seems easiest to add a package to the storage index only once installation has successfully completed. Then, during rollback, we can fall back to the storage index if the previous version is not available in the registry. If we update the storage index while we are still installing, this probably won't be possible.
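
A rough sketch of that ordering, with every helper name hypothetical, just to show when the storage index would be written and read:

```ts
type Archive = Buffer;

// Hypothetical helpers standing in for the real registry / install / storage-index code paths.
declare function fetchFromRegistry(name: string, version: string): Promise<Archive>;
declare function installAssets(archive: Archive): Promise<void>;
declare function saveToStorageIndex(name: string, version: string, archive: Archive): Promise<void>;
declare function readFromStorageIndex(name: string, version: string): Promise<Archive>;

async function installPackage(name: string, version: string) {
  const archive = await fetchFromRegistry(name, version);
  await installAssets(archive);                       // may throw and trigger a rollback
  await saveToStorageIndex(name, version, archive);   // only written once installation succeeded
}

async function rollbackTo(name: string, previousVersion: string) {
  // Prefer the registry, but fall back to the storage index if the version is gone.
  const archive = await fetchFromRegistry(name, previousVersion).catch(() =>
    readFromStorageIndex(name, previousVersion)
  );
  await installAssets(archive);
}
```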

@ph
Contributor

ph commented Nov 4, 2020

I agree with @ruflin here: adding a cache seems great, but it adds a level of complexity. +1 on @jfsiii's suggestion to add it later.

@jfsiii
Contributor

jfsiii commented Nov 5, 2020

I just want to highlight that a) we already use a cache and b) the proposal specifically mentions it:

To solve all the above problems, I'm proposing to not only cache the packages in memory, but also store them in a dedicated ES index.

I don't want to pull us into the weeds re: caching. We can discuss it in the implementation ticket(s). Just highlighting that this is not an alteration to the proposal.

@jfsiii
Contributor

jfsiii commented Nov 16, 2020

Closing since we agreed on the proposal and are discussing further in #83426

@jfsiii jfsiii closed this as completed Nov 16, 2020
@ruflin
Contributor Author

ruflin commented Nov 17, 2020

@jfsiii Can you share what the final proposal is that was agreed on? What I put here is more of a high-level proposal, and I hoped the questions around storage etc. (which are also mentioned in #83426) would be answered in a detailed proposal.

@jfsiii jfsiii removed their assignment Nov 17, 2020