Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fleet] Implement new transform installation mechanism #134321

Closed
9 tasks
Tracked by #370
joshdover opened this issue Jun 14, 2022 · 29 comments
Closed
9 tasks
Tracked by #370

[Fleet] Implement new transform installation mechanism #134321

joshdover opened this issue Jun 14, 2022 · 29 comments
Assignees
Labels
Feature:EPM Fleet team's Elastic Package Manager (aka Integrations) project :ml Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@joshdover
Copy link
Contributor

joshdover commented Jun 14, 2022

Official support for transforms was added to the package-spec in elastic/package-spec#307. We now need to update Fleet to support installing transforms defined in the official format.

Scope

The spec defines transforms to be contained in a package with the following file structure:

  • package-version/elasticsearch/transform/<name>/transform.yml - required, contains the transform definition itself
  • package-version/elasticsearch/transform/<name>/manifest.yml - optional, contains index template settings and whether or not the transform should be started on installation
  • package-version/elasticsearch/transform/<name>/fields.yml - optional, contains index mappings using the same format as data streams mapping defintions

You can find some examples of valid transforms that match the new spec here: https://github.com/elastic/package-spec/tree/main/test/packages/good/elasticsearch/transform

The transform installation code needs to be updated to:

  • Handle creating an index template when either the fields.yml file is present or manifest.yml contains destination_index_template:
    • Generate the mappings from the fields.yml file. This can be done by leveraging loadFieldsFromYaml and generateMappings
    • Combine the mappings and other index template settings from manifest.yml into a single index template
    • Create the index template and track the template in EsAssetReferences
  • Create the destination index if it doesn't already exist
    • Open question: how will new mappings and settings get applied across upgrades if the destination index is not deleted?
  • After the index template / destination index has been created and initialized, create the transform and start it. This largely already exists, we just need a couple adjustments:
    • Update the logic to read the transform from the transform.yml file rather than a json file.
    • Do not start the transform if the manifest.yml file contains start: false (the default is to start the transform if this isn't specified).
  • Handle creating the transforms in the correct order. For instance, if one transform's src index is the dest index of another transform, the second transform and index need to be created first.

The transform uninstallation code should not need to be updated. Our existing code for deleting transforms and index templates should already work as long as the EsAssetReferences are updated correctly to track the assets that were created by the install code.

Things to ensure still work:

  • We need to ensure the old endpoint packages can be uninstalled
  • We need to be able to install the old endpoint packages which used a different specification format for the transform. Support for this can be removed in the next major version.

Out of scope

  • For the time being, we will also continue to support transforms defined in the legacy/illegal format that is currently used by the Endpoint package and supported by Fleet but not allowed by the package-spec. This will allow the Endpoint package to be updated for the new format separate from this change on their own schedule. It will also allow older versions of the package to be installed in newer Kibana versions, if necessary.
  • Changing which privileges are used to install transforms. That is a larger effort which is still being discussed in [Fleet] Support installing transforms without granting kibana_system index privileges #137278
  • Changing the upgrade strategy. This is also a larger effort that needs to be defined. For now, we'll continue to delete the existing transform and create a new one.
  • [Discuss] Support stored scripts in Fleet packages package-spec#202
  • Space-specific transforms

Blocks:

@joshdover joshdover added Feature:EPM Fleet team's Elastic Package Manager (aka Integrations) project Team:Fleet Team label for Observability Data Collection Fleet team labels Jun 14, 2022
@elasticmachine
Copy link
Contributor

Pinging @elastic/fleet (Team:Fleet)

@joshdover
Copy link
Contributor Author

@kevinlog Please provide any feedback on the requirements around the Endpoint package. Will we need to be able to install older versions of the package that define transforms in the legacy way that was not implemented in the spec or can we get away with only supporting the new format?

@kevinlog
Copy link
Contributor

@joeypoon @pzl can you take a look at this and comment?

@pzl
Copy link
Member

pzl commented Jun 14, 2022

I believe the ability to install legacy transforms is still required. There are package rollbacks and other behaviors that could still task a relatively new (8.3,8.4) kibana and fleet plugin with installing an older endpoint package (rather than just a stack upgrade scenario with a pre-existing old package). There are a few edge cases of the actual installation of an older package still needed.

@joshdover
Copy link
Contributor Author

Makes sense, but I want to be sure we don't need to support this for a long period of time. Would it be feasible to drop support for the legacy transforms in the following minor release? So if we were to implement this in 8.4 (for example) could we remove support for legacy in 8.5?

@pzl
Copy link
Member

pzl commented Jun 14, 2022

I'm sure on some timescale we can drop it, but I think it might need a few release cycles. Package upgrading still remains somewhat detached from stack upgrades, which are historically fraught with user issues and considered "difficult". I want to avoid contributing to that reputation for as long as reasonable.

@elasticmachine
Copy link
Contributor

Pinging @elastic/ml-ui (:ml)

@szeitlin
Copy link

szeitlin commented Aug 1, 2022

I hear this is slated to be done in 8.5, is that correct?

@jen-huang
Copy link
Contributor

@szeitlin Looks like this has switched hands a couple of times so would be good to clarify ownership and timeline. @peteharverson @qn895 Is the ML UI team committing to implement this for 8.5?

@sophiec20
Copy link
Contributor

We do not have plans to work directly on the endpoint package.

From the ML perspective, we want to build up knowledge and define best practices for how transforms should be included in a Fleet package, and to assess the feature gaps. During 8.5 timeframes we will be working on a spike to install and upgrade a dummy package, making sure the end to end experience makes sense for generic transforms in Fleet, and to define how we can efficiently develop and test this.

We see this as a pre-req to creating the host risk score package and to advance the investigations for managing the privileges that transforms are deployed to run with.

@kevinlog
Copy link
Contributor

kevinlog commented Aug 2, 2022

@jen-huang @joshdover

We have a ticket to make Endpoint package changes in association with the new transform spec here: https://github.com/elastic/security-team/issues/4314

@pzl investigated what was required to make changes the the Endpoint package in 8.4.0 and we believe our changes are minimal, we just need support for installation in Fleet. Dan, let me know if that sounds correct.

@szeitlin
Copy link

szeitlin commented Aug 2, 2022

Thanks for calling this out @jen-huang!

FWIW I don't know what's in the Endpoint package, or exactly how it relates to what we're doing with Host Risk Score.

Our current plan is to create and deploy packages as integrations. I'm not sure if we're going to run into blockers there, as I've heard mixed things about whether the privilege issue has to be addressed before we can release e.g. Host Risk Score as a package.

@mtojek
Copy link
Contributor

mtojek commented Aug 4, 2022

Hi,

@pzl investigated what was required to make changes the the Endpoint package in 8.4.0 and we believe our changes are minimal, we just need support for installation in Fleet. Dan, let me know if that sounds correct.

I just wanted to highlight in this thread too that currently, the Endpoint package violates the package-spec. We temporarily acknowledge the bad package format, but we will stop doing it soon. When we enforce validation, it won't be possible to publish the package (hard blocker).

Kindly please to prioritize this issue accordingly.

@kevinlog
Copy link
Contributor

kevinlog commented Aug 4, 2022

Coordinate a release

@mtojek thanks for the heads up.

It sounds like we need to agree on a release to coordinate:

  1. Transform installation changes (this ticket) (needs a team)
  2. Update Endpoint package (https://github.com/elastic/security-team/issues/4314) (OLM team)
  3. Enforce package spec (Package storage team)

@szeitlin @mtojek - is an 8.5 timeline critical for this or is there room to delay to 8.6 ? I imagine most 8.5 planning is complete across teams, so if we need to slot this in for 8.5, we may need to shift things around.

I'm happy to set up a meeting if we feel we need to quickly get in alignment. Even if we're comfortable with 8.6, it may be a good idea to start getting that plan into place.

cc @pzl @joshdover @jen-huang

Host Risk Score

@szeitlin to answer your question below

FWIW I don't know what's in the Endpoint package, or exactly how it relates to what we're doing with Host Risk Score.

It doesn't relate to your changes directly. Currently, the transforms in the Endpoint package are installed by Fleet using a legacy format that isn't in compliance with the Package spec for transforms in packages. In order for Fleet to support the new spec, we need to update the way transforms are installed. Then, we can update the Endpoint package to be in compliance with the new spec and it will still be installed correctly by Fleet.

Host Risk Score is only dependent on this ticket, to update how Fleet installs transforms from packages.

@mtojek
Copy link
Contributor

mtojek commented Aug 4, 2022

@szeitlin @mtojek - is an 8.5 timeline critical for this or is there room to delay to 8.6 ? I imagine most 8.5 planning is complete across teams, so if we need to slot this in for 8.5, we may need to shift things around.

I wouldn't pay much attention to particular timeframes, but ...

Even if we're comfortable with 8.6, it may be a good idea to start getting that plan into place.

... I wouldn't like this topic to die accidentally. It isn't a blocker for my team in terms of v2 migration, but it doesn't allow to enable enforced format validation and makes the endpoint-package error-prone.

@szeitlin
Copy link

szeitlin commented Aug 4, 2022

Ok, if I'm understanding this correctly (and thanks @sophiec20 and @mtojek and @kevinlog for explaining), we aren't blocked from packaging Host Risk Score as an integration but it would not be possible for customers on Elastic Cloud to actually install and use Host Risk Score as a module until Fleet is updated to support the new package spec for transforms? (assuming I'm right in thinking Fleet is the internal shorthand for what customers refer to as Elastic Cloud?)

Perhaps I'm misunderstanding, but it doesn't sound like this is a ton of work, since Fleet support for a package spec including transforms already exists, it's mostly a matter of making sure someone has the bandwidth to actually do it, and it's actually prioritized? Or are there technical details that still need to be decided about how to implement the required changes?

@kevinlog
Copy link
Contributor

kevinlog commented Aug 4, 2022

@szeitlin

we aren't blocked from packaging Host Risk Score as an integration but it would not be possible for customers on Elastic Cloud to actually install and use Host Risk Score as a module until Fleet is updated to support the new package spec for transforms? (assuming I'm right in thinking Fleet is the internal shorthand for what customers refer to as Elastic Cloud?)

This isn't quite right. Fleet is the part of the Kibana app that allows users to deploy and manage Agents. Users can also add different integrations to Agents, such as Endpoint Security. Part of adding this integration is installing the associated integration package, in Endpoint Security's case this is the Endpoint package. This would be just like installing the proposed Host Risk Score integration. The packages are collections ES components needed to store data shipped by that integration, for example, index templates and transforms.

Fleet has logic to install these transforms, but only based on a legacy format that isn't in compliance with the recently merged transform spec. This ticket would update the way Fleet installs transforms to be in compliance with that spec. When that is done, my team needs to update the Endpoint package in parallel to conform to the spec.

My understanding for Host Risk Score is that you will add an Integration Package to package storage which will contain a transform definition in compliance with the new transform specification. The change described in this ticket would be required to properly install your package.

Perhaps I'm misunderstanding, but it doesn't sound like this is a ton of work, since Fleet support for a package spec including transforms already exists, it's mostly a matter of making sure someone has the bandwidth to actually do it, and it's actually prioritized? Or are there technical details that still need to be decided about how to implement the required changes?

I can't speak to the effort required, but we do need to properly roadmap this ticket to unblock:

  • Transforms defined in packages which are in compliance with the new transform specification (i.e. Host Risk Score)
  • Updating the Endpoint package to be in compliance with the transform spec
  • Package storage being able to strictly enforce the package specification

At this point, I do think it's important we discuss in a meeting to make sure expectations are set correctly and we can properly roadmap this work. Our main blocker at this point is resourcing and prioritization. I'll work to set something up for early next week. I'll include representatives from OLM/Endpoint, Host Risk Score, and Fleet. Feel free to ping me if you're interested.

@joshdover
Copy link
Contributor Author

joshdover commented Aug 5, 2022

When that is done, my team needs to update the Endpoint package in parallel to conform to the spec.

It shouldn’t be strictly necessary for Endpoint to update their package at the same time since we were planning to support both the legacy and new formats simultaneously so that older versions of Endpoint could still be installed. This should allow the changes in this ticket to be decoupled from Endpoint updating their package.

I believe the reason to update the Endpoint package is more driven by the Package Storage v2 validation step, which I admit I hadn’t considered. That does make this ticket a dependency for package storage v2.

+1 to meeting in person, thanks for setting that up @kevinlog

@joshdover
Copy link
Contributor Author

Hey all, I've updated this issue with the full scope required for this initial step. Please review in detail so we can determine resourcing for this issue.

@droberts195
Copy link
Contributor

Open question: how will new mappings and settings get applied across upgrades if the destination index is not deleted?

We need to consider carefully what sort of updates to transforms will be supported, and how clever we want to try to make Fleet. Rather than support every possible upgrade scenario in the most optimal way it might be safer (as in reduced test surface area so reduced risk of bugs) to just have one mechanism for upgrading transforms that is sometimes overkill but always works eventually.

For example, I can think of three scenarios for upgrading a transform:

  1. Minor change to the transform that does not affect the destination index - upgrade can be done by a single call to the update transform endpoint
  2. Major change to the transform that is not allowed in updates, but no change to the destination index structure - upgrade can be done by deleting and recreating the transform but not the destination index
  3. Major change to the transform that also requires a different destination index structure - both the transform and the destination index should be deleted and recreated to upgrade

However, trying to support all 3 scenarios optimally would be very complex. Which scenario applies depends not only on the version you upgrade to, but also the version you upgrade from.

For this reason I think it's best that all upgrades that change a transform in any way are assumed to require deletion and recreation of both the transform and its destination index. (Obviously if the transform is exactly the same in both the old and new packages then it doesn't need to be touched.)

Consumers of the package are going to need to be aware of this and compensate for it. When the transform is deleted and recreated it will take a while to process historical data before catching up to real-time. During this period the transform destination index will be incomplete, so consumers of the transform should surface this to end users in some way (for example via a UI indicator). It's possible to tell whether the upgraded transform has processed historical data by looking at the checkpoint number in the transform stats.

@droberts195
Copy link
Contributor

Another option that would avoid the need to delete destination indices is as follows:

  1. Transforms are versioned - every transform installed by Fleet has a number at the end of its ID that is incremented each time it or the mappings/settings of its destination index is changed
  2. Transform destination indices are versioned - every transform destination index has a number at the end of its name that is incremented every time its mappings/settings are changed
  3. Packages contain a read alias that points to all versions of the destination index that are known to be compatible with the current package version
  4. Packages never delete old versions of destination indices
  5. Packages never update mappings or settings for old versions of destination indices
  6. Packages never update transforms - instead they install a new version, which may or may not have a new version of the destination index
  7. Packages delete old versions of the transform after installing the new version so that only one at a time is running

It may be desirable to go with a strategy like this in preference to the simpler suggestion in #134321 (comment) because the raw data needed to rebuild a destination index from scratch may no longer exist at the time of a package upgrade.

However, whatever we decide we should aim to have one way that transforms get upgraded within Fleet. Otherwise the complexity of trying to have multiple ways to upgrade based on what is being changed will inevitably cause bugs.

@kevinlog
Copy link
Contributor

I'm following up in this ticket after the meeting we had today to discuss the roadmap and team assignment for this ticket.

The high level conclusion:

  • In 8.5 ML team will work on an example "best practice" package which will contain a transform definition which follows the recently implemented transform spec. In addition, they will work on a simple first iteration for transform installation via Fleet, i.e. this ticket. We can more explicitly define what that means as we discover more about the problem. cc @droberts195 @sophiec20
  • The Host Risk Score team will work on creating their integration package utilizing the new transform spec. They will follow the example set by the "best practice" package created by the ML team. cc @szeitlin @ajosh0504 @SourinPaul
  • The OLM team will update the Endpoint package with the new transform spec after Fleet supports installing the new specification. However, any Endpoint package changes are decoupled from this change and we do not require any changes to keep operating as normal. The OLM team does not have any resources to dedicate to this in 8.5, but we could plan to dedicate a resource to helping with this effort in 8.6 and beyond. Especially when it comes to testing Endpoint package installation after the spec changes. cc @pzl @kevinlog

Let me know if the above sounds correct to everyone, happy to make any adjustments. I'm sending out the meeting recording and notes in separate email, just let me know if you'd like me to add you.

@szeitlin
Copy link

The Host Risk Score team will work on creating their integration package utilizing the new transform spec. They will follow the example set by the "best practice" package created by the ML team. cc @szeitlin @ajosh0504 @SourinPaul

This ticket is already under construction here: https://github.com/elastic/security-ml/issues/83
We're working with @andrewkroh and his team on that effort. If the process needs to be different, please let us know, otherwise we'll go ahead and try to create a package according to their existing process, to start with.

@kevinlog
Copy link
Contributor

kevinlog commented Sep 20, 2022

Hey all, now that #140046 is merged, I wanted to circle back on this conversation.

In the Endpoint package, we have a PR to update our transform schema. My question for the team is, are we comfortable putting a package into production that uses the new transform schema in the 8.5 release?

I tested the merged PR against a few different scenarios with both the legacy Endpoint package and one using the new schema. Details of my tests are below (context in a comment from the PR).

I tested the following scenarios and they work as expected:

  1. Install the legacy Endpoint package with the legacy transform schema ✅
  2. Install the updated Endpoint package with the new compliant transform schema ✅

Upgrade scenarios:
In this section, I tested some upgrade scenarios.

  1. (New functionality) Install Endpoint package with legacy transforms -> upgrade to an Endpoint package with compliant transforms ✅
  2. (Existing functionality) Install Endpoint package with legacy transforms -> upgrade to an Endpoint package still with legacy transforms ✅
  3. (New functionality) Install Endpoint package with compliant transforms -> upgrade to an Endpoint package with compliant transforms ✅

The above cases are the critical paths for installing new Endpoint package and updating older ones. With this, the Endpoint package using the new transform schema should be supportable in Fleet now. I would be OK with merging the package as the installation works and it gets us closer to elastic package schema compliance. We could also hold our package PR and tag it for an 8.6 release and merge it soon after 8.5 is cut. That way, I believe we could still inch closer to compliance sooner by getting a compliant package into snapshot, but remove risk from 8.5 if we wanted to test more.

Happy to hear thoughts from others. cc @joshdover @qn895 @pzl @sophiec20 @droberts195 @mtojek @szeitlin

@droberts195
Copy link
Contributor

@kevinlog I talked this through with @qn895 and @peteharverson.

Originally the idea was that upgrade was out of scope for the first iteration of the new transform installation mechanism. What's happening when you are testing upgrades is that the transform and all its associated objects (index template, destination index, etc) get completely deleted, then recreated from scratch. One thing to be aware of with this approach is that on a huge dataset rebuilding the destination index from scratch could take a while - assume several hours for your most important customer. Any downstream component that relies on the contents of the destination index will have to be written to gracefully handle this period when the original destination index has been deleted and a new one is being populated from scratch.

In the description of elastic/endpoint-package#270 it says the PR is "mostly renaming files and translating from json to yaml". Does it make any functional changes at all? Or is it entirely renaming files and translating from json to yaml? If there are no functional changes then it may be safer to hold it for 8.6. That would allow time for testing the upgrade on the biggest data set you have access to, and seeing the effect of deleting everything and restarting from scratch. It might also be that we can add more upgrade functionality for 8.6, like creating a new transform in parallel, having an alias on the destination index and only switching the alias to the new destination index once the transform has populated it from existing data. But on the other hand, if you have also made functional changes as well as changing the installation mechanism then the need to get your change into 8.5 might be greater.

@qn895
Copy link
Member

qn895 commented Sep 21, 2022

My apologies for the confusion: Double-checking the Fleet code again, and we currently don’t delete the destination indices upon uninstallation or upgrade. It will only delete the transforms, index templates, and component templates from the old version:

GET _transform/logs-transform.kibana_ecommerce_pivot-default-0.2.0 
GET _index_template/logs-transform.kibana_ecommerce_pivot-template
GET _component_template/logs-transform.kibana_ecommerce_pivot-template@custom
GET _component_template/logs-transform.kibana_ecommerce_pivot-template@package

A simplistic upgrade works now (i.e. Install Endpoint package with compliant transforms -> upgrade to an Endpoint package with compliant transforms). But it's important to note as of #140046, we currently don’t gracefully handle scenarios like if after the upgrade and the index templates or the mappings change, how would that apply to the destination indices that have already been created. Or, if the transforms change functionally (e.g. different pivot or latest configuration) but still use the same destination index name. That would need more robust testing and discussion in 8.6.

@kevinlog
Copy link
Contributor

@droberts195 @qn895 thank you for the insight.

One thing to be aware of with this approach is that on a huge dataset rebuilding the destination index from scratch could take a while

Understood. For what it's worth, I don't believe the existing logic for installation/upgrade of the legacy transform schema was any more graceful or efficient and we've relied on it for about 2 years. However, our use case is likely an outlier as we use the latest transform which pulls the latest document for each unique Agent/Endpoint ID. Our largest set of unique IDs is at around 20k right now, so we likely haven't encountered this too much.

Does it make any functional changes at all? Or is it entirely renaming files and translating from json to yaml?

No, there are no functional changes in the endpoint package PR that are not already merged in the package master branch. This PR is specifically meant to prepare us for the switch to the new compliant transform schema. It is it not critical for us to merge this change to support our 8.5 features. I was more interested in finding the right time to update the transforms to be compliant with the spec.

It sounds like we'd be more comfortable waiting for 8.6 to put a newly compliant set of transforms into production. This works for me and answers my original question - thanks!

Regarding 8.6, I'm happy to continue using our Endpoint package PR to test different scenarios and ensure our data is still correct after upgrades, etc.

cc @pzl

@joshdover
Copy link
Contributor Author

IMO there's no specific rush to transition the Endpoint package to the new mechanism, though there is some tech debt to clean up in the package registry pipelines to remove the exceptions for Endpoint's spec violations.

But I also want to clarify that we don't currently have any short-term plans to change the upgrade behavior. If we should revisit this, let's discuss. I think we may need to fix the behavior mentioned in #134321 (comment) to at least rebuild the transform destination index when required. Let's continue the discussion in the overall thread here: elastic/package-spec#370 (comment)

@qn895 Is there anything remaining to close out this issue?

@qn895
Copy link
Member

qn895 commented Oct 5, 2022

@joshdover, we have some more work to do in 8.6 around upgrades in terms of handling changes to transform destination index, although a lot of it will be best practices and working out how to enforce those best practices. I agree on closing out on this issue to continue discussion in the other thread 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:EPM Fleet team's Elastic Package Manager (aka Integrations) project :ml Team:Fleet Team label for Observability Data Collection Fleet team
Projects
None yet
Development

No branches or pull requests