
[Fleet] Proposal: Store installed packages in cluster #81110

Closed
ruflin opened this issue Oct 20, 2020 · 18 comments
Labels
Team:Fleet Team label for Observability Data Collection Fleet team v7.11.0

Comments

@ruflin
Contributor

ruflin commented Oct 20, 2020

The package manager was built with the idea that the registry and packages are always available. The current implementation uses a local in-memory cache for package contents; whenever a package is missing from this cache, it is re-fetched from the registry. Over the last two releases, quite a few issues have shown up where it became a problem that packages are only available from the registry:

  • Packages are removed from the registry: In general, this should not happen in production, but it is common on the snapshot registry, where packages are not always stable. Removing a package puts Fleet into a state where it can't pull the assets again.
  • Registry not available: Luckily, this has not really happened yet, but with users running a local or on-prem registry it becomes more likely, and at some point our own registry will have downtime. In that case Fleet must stay operational, meaning it should still be possible to create new policies, even if new packages can't be installed.
  • Upgrade rollback: When an upgrade of a package fails, it is rolled back. If the older version of the package no longer exists, Fleet ends up stuck between two package versions.
  • Package installation by direct upload: We are working on making it possible to upload packages directly to Kibana. Currently these packages are only cached in memory and not stored anywhere, which means they are lost on restart.
  • Memory issue: As all packages are currently kept in memory, they add to Kibana's memory usage.
  • Changing registry: A user tests the staging registry and then switches to the production registry. Now an installed package might not be available anymore because it was only available on staging. The same would happen if we supported multiple registries in the future and one of them disappeared.

To solve all of the above problems, I'm proposing to not only cache the packages in memory, but also store them in a dedicated ES index. This also unifies how packages work, whether they are uploaded as a zip, fetched from the registry, or added through any other mechanism. Below is an image to visualise this:

image

One important detail here is that, for browsing packages from the registry without installing them, they should not be downloaded (see #76261). Packages which are uploaded are always installed.

How exactly packages and assets are stored in Elasticsearch should be a follow-up discussion if we decide to move forward with this.

Decision: Agreed to move forward. Follow up discussion ticket: #83426

Links

@ruflin ruflin added the Team:Fleet Team label for Observability Data Collection Fleet team label Oct 20, 2020
@elasticmachine
Contributor

Pinging @elastic/ingest-management (Team:Ingest Management)

@skh
Contributor

skh commented Oct 26, 2020

I would clarify the use of the word 'local' in the initial description:

  • "The current implementation uses a local cache and pulls down the package again if any files are missing or a package does not exist." -- The current implementation uses a local in-memory cache [...]. If Kibana runs clustered, every Kibana instance has its own in-memory cache.

  • "store locally" -- I assume what is meant is to store in Elasticsearch in a dedicated index? As Kibana can be clustered, local file storage should not be used.

@ruflin
Contributor Author

ruflin commented Oct 26, 2020

@skh Spot on. Can you directly update the issue?

@skh
Contributor

skh commented Oct 26, 2020

In addition, I see two ways to store packages in ES:

  • as the zip file, in a binary field, which may be large (see also https://github.com/elastic/dev/issues/1544 )
  • every file from the unpacked zip file in a separate document, containing
    • package name
    • package version
    • package source (upload or registry)
    • file type (e.g. screenshot, icon, field definitions, ingest pipeline, ...)
    • file path (because the folder structure in the package carries meaning, e.g. which fields.yml belongs to which data stream)
    • file content as binary field

The zip file would always need to be downloaded and unpacked as a whole.

Single files could be queried by file type or path so that we can access single assets more quickly, but there may be many of them in some packages.
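
To make the second option concrete, here is a minimal TypeScript sketch of what one per-file document could look like. The interface and field names are illustrative only, not a final mapping:

```ts
// Illustrative shape for "one file per document" storage; every name here
// (interface, fields, example values) is hypothetical, not a final design.
interface PackageAssetDoc {
  package_name: string;                    // e.g. "aws"
  package_version: string;                 // e.g. "0.2.7"
  install_source: 'registry' | 'upload';   // where the package came from
  asset_type: string;                      // e.g. "screenshot", "icon", "fields", "ingest_pipeline"
  asset_path: string;                      // folder structure carries meaning (which data stream a fields.yml belongs to)
  data_base64: string;                     // file content, base64-encoded for a binary field
}

const exampleDoc: PackageAssetDoc = {
  package_name: 'aws',
  package_version: '0.2.7',
  install_source: 'registry',
  asset_type: 'icon',
  asset_path: 'aws/0.2.7/img/logo_aws.svg',
  data_base64: Buffer.from('<svg xmlns="http://www.w3.org/2000/svg"/>').toString('base64'),
};
```

Keeping path and type as separate fields is what would make single assets queryable without unpacking a whole archive.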

@ruflin
Contributor Author

ruflin commented Oct 27, 2020

I like the idea of storing each file as a document instead of the zip file. It not only allows us to query as you mentioned, but also lets us exclude certain files from being stored and add metadata to each, like a hash, date modified / date installed, etc.

@ph ph added the v7.11.0 label Oct 28, 2020
@ph
Contributor

ph commented Oct 28, 2020

You basically have a VFS over Elasticsearch. I like the idea @skh. One thing to consider for a future improvement/release is signatures for packages / files. I am not sure of the level of risk here, but it could be possible for a user to tamper with a file by updating its document.

@jfsiii
Contributor

jfsiii commented Oct 28, 2020

👍 to the problem description and proposal. Two things that come to mind:

Number of assets

I'm curious about the difference between making 10, 100, etc. requests to ES vs. serving them from memory. The best-case scenario (few assets & a fast connection to ES) might not be noticeably affected, but the more assets or the greater the latency to ES, the slower things will feel. One option is to keep the memory cache (changing to an LRU or something less naive than now) and add values to it on their way into ES. That way we keep the durability of ES but still avoid the latency issues.
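
As a rough illustration of that cache-aside idea (the loader function here is hypothetical; it would read the asset document from the dedicated index):

```ts
// Bounded cache-aside helper: look in memory first, fall back to ES on a miss,
// and remember the result on its way back. A real implementation would likely
// use an LRU; this sketch just evicts the oldest insertion when full.
class AssetCache {
  private store = new Map<string, Buffer>();
  constructor(private maxEntries: number) {}

  async get(key: string, loadFromEs: (key: string) => Promise<Buffer>): Promise<Buffer> {
    const hit = this.store.get(key);
    if (hit) return hit;                         // served from memory, no ES round trip

    const value = await loadFromEs(key);         // only pay the ES latency on a miss
    if (this.store.size >= this.maxEntries) {
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.set(key, value);
    return value;
  }
}
```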

Dealing with binary assets (images)

We'll have to base64-encode any binary asset, which adds about 30% to the file size. There's also a CPU cost to decoding them. Again, storing the decoded Buffer in the memory cache would help.
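
For a feel of the numbers, a quick Node snippet (the file size is made up for illustration):

```ts
// Base64 turns every 3 raw bytes into 4 ASCII characters, so the encoded
// form is ~4/3 the original size; decoding it back costs CPU before serving.
const raw = Buffer.alloc(396 * 1024);                    // pretend this is a ~396 KB PNG
const encoded = raw.toString('base64');
console.log((encoded.length / raw.length).toFixed(2));   // ~1.33

const decoded = Buffer.from(encoded, 'base64');          // the CPU cost on the read path
console.log(decoded.length === raw.length);              // true
```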

Here's a quick check of the image file sizes. Remember these will be about 30% larger after base64 encoding:

image file sizes
du -h -d1 */*/img/*
396K	aws/0.2.5/img/filebeat-aws-cloudtrail.png
1.1M	aws/0.2.5/img/filebeat-aws-elb-overview.png
188K	aws/0.2.5/img/filebeat-aws-s3access-overview.png
2.5M	aws/0.2.5/img/filebeat-aws-vpcflow-overview.png
8.0K	aws/0.2.5/img/logo_aws.svg
384K	aws/0.2.5/img/metricbeat-aws-billing-overview.png
260K	aws/0.2.5/img/metricbeat-aws-dynamodb-overview.png
1.1M	aws/0.2.5/img/metricbeat-aws-ebs-overview.png
600K	aws/0.2.5/img/metricbeat-aws-ec2-overview.png
600K	aws/0.2.5/img/metricbeat-aws-elb-overview.png
496K	aws/0.2.5/img/metricbeat-aws-lambda-overview.png
792K	aws/0.2.5/img/metricbeat-aws-overview.png
640K	aws/0.2.5/img/metricbeat-aws-rds-overview.png
332K	aws/0.2.5/img/metricbeat-aws-s3-overview.png
720K	aws/0.2.5/img/metricbeat-aws-sns-overview.png
348K	aws/0.2.5/img/metricbeat-aws-sqs-overview.png
560K	aws/0.2.5/img/metricbeat-aws-usage-overview.png
396K	aws/0.2.7/img/filebeat-aws-cloudtrail.png
1.1M	aws/0.2.7/img/filebeat-aws-elb-overview.png
188K	aws/0.2.7/img/filebeat-aws-s3access-overview.png
2.5M	aws/0.2.7/img/filebeat-aws-vpcflow-overview.png
8.0K	aws/0.2.7/img/logo_aws.svg
384K	aws/0.2.7/img/metricbeat-aws-billing-overview.png
260K	aws/0.2.7/img/metricbeat-aws-dynamodb-overview.png
1.1M	aws/0.2.7/img/metricbeat-aws-ebs-overview.png
600K	aws/0.2.7/img/metricbeat-aws-ec2-overview.png
600K	aws/0.2.7/img/metricbeat-aws-elb-overview.png
496K	aws/0.2.7/img/metricbeat-aws-lambda-overview.png
792K	aws/0.2.7/img/metricbeat-aws-overview.png
640K	aws/0.2.7/img/metricbeat-aws-rds-overview.png
332K	aws/0.2.7/img/metricbeat-aws-s3-overview.png
720K	aws/0.2.7/img/metricbeat-aws-sns-overview.png
348K	aws/0.2.7/img/metricbeat-aws-sqs-overview.png
560K	aws/0.2.7/img/metricbeat-aws-usage-overview.png
8.0K	checkpoint/0.1.0/img/checkpoint-logo.svg
4.0K	cisco/0.3.0/img/cisco.svg
796K	cisco/0.3.0/img/kibana-cisco-asa.png
 12K	crowdstrike/0.1.2/img/logo-integrations-crowdstrike.svg
392K	crowdstrike/0.1.2/img/siem-alerts-cs.jpg
512K	crowdstrike/0.1.2/img/siem-events-cs.jpg
4.0K	endpoint/0.14.0/img/security-logo-color-64px.svg
4.0K	endpoint/0.15.0/img/security-logo-color-64px.svg
4.0K	fortinet/0.1.0/img/fortinet-logo.svg
4.0K	microsoft/0.1.0/img/logo.svg
424K	o365/0.1.0/img/filebeat-o365-audit.png
296K	o365/0.1.0/img/filebeat-o365-azure-permissions.png
 16K	o365/0.1.0/img/logo-integrations-microsoft-365.svg
436K	okta/0.1.0/img/filebeat-okta-dashboard.png
4.0K	okta/0.1.0/img/okta-logo.svg
476K	panw/0.1.0/img/filebeat-panw-threat.png
1.5M	panw/0.1.0/img/filebeat-panw-traffic.png
 12K	panw/0.1.0/img/logo-integrations-paloalto-networks.svg

package-storage image sizes in KB

@jfsiii
Contributor

jfsiii commented Oct 28, 2020

To clarify, I'm saying we would still put assets in ES, but use a cache to store ready-to-serve values to avoid hitting ES and doing any unnecessary work. We could add TTL or any other logic to decide when to use or invalidate cache entries.

@skh
Contributor

skh commented Oct 29, 2020

One option is to keep the memory cache (changing to an LRU or something less naive than now) and add values to it on their way into ES. That way we keep the durability of ES but still avoid the latency issues.

Would it be an option to keep the in-memory cache, but purge some files, like ES and Kibana assets from it regularly, while keeping others, like images, for longer?

@jfsiii
Contributor

jfsiii commented Oct 29, 2020

Would it be an option to keep the in-memory cache, but purge some files, like ES and Kibana assets from it regularly, while keeping others, like images, for longer?

Definitely. That's what I was getting at with

We could add TTL or any other logic to decide when to use or invalidate cache entries.

We'll have to define the rules and then see if there's an existing package that does what we want out of the box or if we need to wrap one with some code to manage it.

Seems like we want support for both TTL (different by asset class) and a max memory size for the cache.

https://github.com/isaacs/node-lru-cache is an existing dependency and my go-to, but it doesn't support per-entry TTL. I think we'd have to create multiple caches to get different expiration policies.

I did some searching and both https://github.com/node-cache/node-cache & https://github.com/thi-ng/umbrella/tree/develop/packages/cache seem like they'd work for this case.
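
If we end up with multiple caches, a sketch along these lines (using lru-cache's classic `max`/`maxAge` options; newer majors rename `maxAge` to `ttl`, so treat this as version-dependent) might be enough:

```ts
import LRU from 'lru-cache'; // classic v5/v6-style API assumed here

// One cache per asset class so each gets its own expiration policy:
// images can live for a long time, ES/Kibana assets are purged sooner.
const imageCache = new LRU<string, Buffer>({
  max: 100,                     // at most 100 images held in memory
  maxAge: 24 * 60 * 60 * 1000,  // keep images around for a day
});

const assetCache = new LRU<string, Buffer>({
  max: 500,
  maxAge: 5 * 60 * 1000,        // purge ES/Kibana assets after five minutes
});

// Hypothetical helper to pick the cache by asset class.
function cacheFor(assetType: string) {
  return assetType === 'screenshot' || assetType === 'icon' ? imageCache : assetCache;
}
```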

@ruflin
Contributor Author

ruflin commented Oct 30, 2020

Before we add a cache, we should first test whether we really need it. Having a cache will speed things up but also make things more complicated.

Quite a few of the large assets are images and are only used when viewed in the browser. I assume the browser cache will also help us here, so each user only loads them once?

@jfsiii
Contributor

jfsiii commented Oct 30, 2020

I agree we should profile. The additional work/complexity is low so we can add it later.

The browser cache will also need some work (setting headers) but we can look at that when profiling.
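
The header part is just a long-lived Cache-Control on asset responses so a browser only downloads each image once; a bare-bones sketch with Node's http module (not Kibana's actual route handler):

```ts
import { createServer } from 'http';

// Bare-bones illustration: whatever endpoint ends up serving package images
// would set a long-lived Cache-Control header so browsers re-use the asset.
createServer((req, res) => {
  res.writeHead(200, {
    'Content-Type': 'image/svg+xml',
    'Cache-Control': 'public, max-age=86400',           // let browsers keep it for a day
  });
  res.end('<svg xmlns="http://www.w3.org/2000/svg"/>'); // placeholder body
}).listen(3000);
```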

@skh
Contributor

skh commented Nov 2, 2020

When a package is uninstalled from the system, I'd propose that it be removed from the storage index as well.

That way the storage index doesn't silently turn into a secondary installation source that we need to check during package listings and installations.

@neptunian
Contributor

Upgrade rollback: When an upgrade of a package fails, it is rolled back. If the older version of the package no longer exists, Fleet ends up stuck between two package versions.

I'm not sure at what point during the package installation process we want to update the storage index, but if possible, it seems easiest to add a package to the storage index only once installation has successfully completed. Then, during rollback, we can fall back to the storage index if the previous version is not available in the registry. If we update the storage index while we are still installing, this probably won't be possible.
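
A rough sketch of that ordering, with every helper name hypothetical, just to show when the storage index would be written and read:

```ts
type Archive = Buffer;

// Hypothetical helpers standing in for the real registry / install / storage-index code paths.
declare function fetchFromRegistry(name: string, version: string): Promise<Archive>;
declare function installAssets(archive: Archive): Promise<void>;
declare function saveToStorageIndex(name: string, version: string, archive: Archive): Promise<void>;
declare function readFromStorageIndex(name: string, version: string): Promise<Archive>;

async function installPackage(name: string, version: string) {
  const archive = await fetchFromRegistry(name, version);
  await installAssets(archive);                       // may throw and trigger a rollback
  await saveToStorageIndex(name, version, archive);   // only written once installation succeeded
}

async function rollbackTo(name: string, previousVersion: string) {
  // Prefer the registry, but fall back to the storage index if the version is gone.
  const archive = await fetchFromRegistry(name, previousVersion).catch(() =>
    readFromStorageIndex(name, previousVersion)
  );
  await installAssets(archive);
}
```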

@ph
Contributor

ph commented Nov 4, 2020

I agree with @ruflin here: adding a cache seems great, but it adds a level of complexity. +1 on @jfsiii's suggestion to add it later.

@jfsiii
Contributor

jfsiii commented Nov 5, 2020

I just want to highlight that a) we already use a cache and b) the proposal specifically mentions it:

To solve all the above problems, I'm proposing to not only cache the packages in memory, but also store them in a dedicated ES index.

I don't want to pull us into the weeds re: caching. We can discuss it in the implementation ticket(s). Just highlighting that this is not an alteration to the proposal.

@jfsiii
Contributor

jfsiii commented Nov 16, 2020

Closing since we agreed on the proposal and are discussing further in #83426

@jfsiii jfsiii closed this as completed Nov 16, 2020
@ruflin
Contributor Author

ruflin commented Nov 17, 2020

@jfsiii Can you share what the final proposal is that was agreed on? What I put here is more of a high-level proposal, and I hoped the questions around storage etc. (which are also mentioned in #83426) would be answered in a detailed proposal.

@jfsiii jfsiii removed their assignment Nov 17, 2020