Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

@kbn/config-schema is slow #78351

Open
kobelb opened this issue Sep 23, 2020 · 33 comments
Open

@kbn/config-schema is slow #78351

kobelb opened this issue Sep 23, 2020 · 33 comments
Labels
kbn/config-schema performance Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@kobelb
Copy link
Contributor

kobelb commented Sep 23, 2020

At the moment, @kbn/config-schema is slow because Joi is admittedly slow, and not focused on increasing performance in the short-term: hapijs/joi#2340 (comment).

While performing some performance profiling for Fleet, I'm seeing around 6% of the total CPU time being spend performing validation with @kbn/config-schema.

Screen Shot 2020-09-23 at 1 48 39 PM

I think we have two options here, embrace the fact that @kbn/config-schema is too slow for some purposes, and no longer use it situations like route validation. Change @kbn/config-schema to no longer rely on Joi, and work on improving its performance.

@kobelb kobelb added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Sep 23, 2020
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-platform (Team:Platform)

@kobelb kobelb added performance bug Fixes for quality problems that affect the customer experience labels Sep 23, 2020
@rudolf
Copy link
Contributor

rudolf commented Sep 24, 2020

I didn't investigate this properly, so I would have to verify it again to be sure, but I remember seeing a significant portion of startup CPU time is also spent inside @kbn/config-schema. It would be worth investigating this, if it's significant there would be even more motivation to increase the performance of @kbn/config-schema.

Having said that, I don't think it's worth for Kibana to maintain it's own validation library. Because config validation is part of the Core API we need to have control over that so it makes sense to use @kbn/config-schema, but instead of spending time making that faster and increasing the API surface, I think we should rather limit the time we spend on it and limit the API to only what's necessary for config validation.

@pgayvallet
Copy link
Contributor

pgayvallet commented Sep 28, 2020

I think we have two options here, embrace the fact that @kbn/config-schema is too slow for some purposes, and no longer use it situations like route validation. Change @kbn/config-schema to no longer rely on Joi, and work on improving its performance.

Aren't both options the same? We will still need validation of some kind for routing, so both options means either finding a replacement for joi, or implement our own as config-schema underlying validation library, don't they? From what I've seen, some of our route validations are not 'less complex' that our config file validation, so we would need almost the same features for both.

Also, sorry to nitpick, but I'm not sure a 6% overheat issue can be labeled as bug? Isn't performance enough?

@kobelb
Copy link
Contributor Author

kobelb commented Sep 28, 2020

Aren't both options the same? We will still need validation of some kind for routing, so both options means either finding a replacement for joi, or implement our own as config-schema underlying validation library, don't they? From what I've seen, some of our route validations are not 'less complex' that our config file validation, so we would need almost the same features for both.

In the context of route handling, both options are similar in that we don't want to use Joi to do route level validation. However, I still think there are two different options in regard to what we do with @kbn/config-schema as a whole.

  1. Admit @kbn/config-schema is slow, and don't use it where performance matters
  2. Fix @kbn/config-schema so it's now fast, can be used everywhere.

Also, sorry to nitpick, but I'm not sure a 6% overheat issue can be labeled as bug? Isn't performance enough?

I don't think we have a consistent way of labeling performance issues, so feel free to use whatever labels you desire :)

@pgayvallet
Copy link
Contributor

Admit @kbn/config-schema is slow, and don't use it where performance matters

(not sure if I should comment here or in #78353, but) are all route validations concerned by this, or only specific bottlenecks?

If that's only bottlenecks, core's route validation system allows to use an arbitrary validation function instead of config-schema, so the quick win should probably be to use manual validation for these endpoints.

If all routes validations are considered as a potential performance issue, we need to find a faster alternative to joi , and document it as the preferred validation system for routing. But if we go down that road, 'Fixing' config-schema to change the underlying validation library is probably the way to go, as ideally, we wouldn't be using two distinct libraries.

As a sidenote, if we keep config-schema for route validation, we might want to rename it to something like @kbn/validation-schema or something.

@mshustov
Copy link
Contributor

We will still need validation of some kind for routing

I think @rudolf meant that we shouldn't enforce @kbn/config-schema as a standard for the routing validation. However, we already allow plugins to use any custom validation library. My main concern about the topic: what if a plugin is not a hot path and called not so often as Fleet endpoints? They might be interested in using whatever tool off-the-shelf and don't spend their time investigating alternative options.

From what I've seen, some of our route validations are not 'less complex' that our config file validation, so we would need almost the same features for both.

Another concern is the tooling fragmentation. It's almost impossible to make sure all the validation library implements the same security model (for example, filtering out the raw values from error messages #58843).

While performing some performance profiling for Fleet, I'm seeing around 6% of the total CPU time being spend performing validation with @kbn/config-schema.

Is it a bottleneck for Fleet? We might need to add a benchmark setup to get absolute numbers to understand which order of numbers we are talking about - millions or thousands of operations per second. Probably, updating the Joi version helps us to increase that limit to reasonable values.

@kobelb
Copy link
Contributor Author

kobelb commented Sep 29, 2020

(not sure if I should comment here or in #78353, but) are all route validations concerned by this, or only specific bottlenecks?

This issue is meant to address all usages of @kbn/config-schema everywhere. Built-in route validation, custom route validation, config, etc.

If that's only bottlenecks, core's route validation system allows to use an arbitrary validation function instead of config-schema, so the quick win should probably be to use manual validation for these endpoints.

Agreed, this is possible for the custom validation for the routes. This is what we did for Fleet, no longer use @kbn/config-schema or Joi for the route validation.

I think @rudolf meant that we shouldn't enforce @kbn/config-schema as a standard for the routing validation. However, we already allow plugins to use any custom validation library.

If we decide that @kbn/config-schema is slow and we aren't going to invest the effort to make it fast, I think we should stop encouraging plugins to use @kbn/config-schema for their route validation. With the way the API is currently created, it leads plugin developers to default to using @kbn/config-schema

My main concern about the topic: what if a plugin is not a hot path and called not so often as Fleet endpoints? They might be interested in using whatever tool off-the-shelf and don't spend their time investigating alternative options.

If a usage of @kbn/config-schema isn't in the hot path, it's not as large of a concern. However, I don't think we should be encouraging developers to use something that is slow and won't scale.

Is it a bottleneck for Fleet? We might need to add a benchmark setup to get absolute numbers to understand which order of numbers we are talking about - millions or thousands of operations per second. Probably, updating the Joi version helps us to increase that limit to reasonable values.

Yes, it is a bottleneck. The current bottleneck is because of the use of @kbn/config-schema for the built-in route validation, which is why I created #78353 to work-around this specific bottleneck.

@mshustov
Copy link
Contributor

Yes, it is a bottleneck. The current bottleneck is because of the use of @kbn/config-schema for the built-in route validation, which is why I created #78353 to work-around this specific bottleneck.

@kobelb I'm going to create a benchmark (after FF) to compare @kbn/config-schema with joi, io-ts to have particular numbers we can discuss here.

@afharo
Copy link
Member

afharo commented Mar 24, 2021

Sharing an existing benchmark I found comparing joi, io-ts and others: https://github.com/gcanti/io-ts-benchmarks

@mshustov
Copy link
Contributor

3 years ago

ok, might require some work to actualize 😅

@afharo
Copy link
Member

afharo commented Mar 25, 2021

True 🤣 but we have the code in place to run it :)

@afharo
Copy link
Member

afharo commented Mar 25, 2021

@pgayvallet pgayvallet removed the bug Fixes for quality problems that affect the customer experience label Mar 31, 2021
@wylieconlon
Copy link
Contributor

We recently chose to remove schema validation for TSVB because of the performance issue: #97061, but I think we are still missing a clear set of recommendations to Kibana developers. Should we continue developing with schemas? Should we stop using them until a fix is made?

@delvedor
Copy link
Member

In Fastify we use ajv, it offers amazing performances, and it also allows you to pre-compile the schema, speeding up even more the start-up phase.
For generating the schemas without writing plain js objects, we have developed fluent-json-schema, while if you need a strongly typed solution, there is @sinclair/typebox (take a look here for other solutions).

@kobelb
Copy link
Contributor Author

kobelb commented Apr 21, 2021

When evaluating alternatives, we should keep in mind that we'd like to minimize, if not eliminate, the places of server-side code that do code generation from strings at runtime. Code generation from strings is a common attack vector for turning prototype pollution or other logic errors into remote code execution.

@delvedor
Copy link
Member

@kobelb if you don't generate the schemas each time at startup and the definitions don't change much, you can directly commit the generated validator code.
This is what we do in Fastify, which is generated here every time we have a config change.

@pgayvallet
Copy link
Contributor

pgayvallet commented Apr 21, 2021

When evaluating alternatives, we should also keep in mind that we'd ideally like feature parity with the parts of joi/config-schema the Kibana codebase is currently using. Else we will not be able to replace config-schema underlying lib and propose an alternative of config-schema instead.

E.g

  • support for default values in schemas (used in route validation)
  • type inference from a schema to the equivalent TS type (used in route validation)
  • sibling / context references (not sure, maybe not used in route validation)
  • custom validation methods with type inference (used in route validation)
  • data conversion (e.g json string repr to parsed json object, string repr of an number to the number) (used by our route validation for structures sent in GET parameters)
  • what else?

Now, we would also need to check which features are currently only used for config validation (not necessary for the alternative), and which are used for actual route validation.

@wylieconlon
Copy link
Contributor

I like that you're starting to evaluate alternatives, but maybe my question was lost in the discussion. Asking again:

We are still missing a clear set of recommendations to Kibana developers. Should we continue developing with schemas? Should we stop using them until a fix is made?

What is your advice to current developers, and can this be shared widely among the Kibana contributors?

@mshustov
Copy link
Contributor

mshustov commented May 5, 2021

Should we stop using them until a fix is made?

@wylieconlon Migrating from config-schema creates fragmentation in our tooling and it will be hard to switch plugins back to config-schema. Let's continue using the common library. When we address the problem, all the endpoints will benefit from it.
Is there an example where config-schema validation affects response time? We can use it as a baseline for further refactoring.

Maybe we can improve the situation by updating joi version?
According to the @afharo's benchmark, updating joi from v14 to v 17 gives almost x2 improvement for the valid dataset, x3 improvement for the invalid dataset.

Kibana is still on v13, in our case we might gain even more.

@wylieconlon
Copy link
Contributor

@mshustov I'm not sure you've addressed the concerns that I have, and that @kobelb originally raised in this issue. It appears that no validation is a superior option to the current library.

There are two main examples where config-schema is a bottleneck. This issue started as a discussion based on performance bottlenecks in Fleet APIs, and more recently we found a ~20-25% improvement in API response times for TSVB after removing schema validation, see this comment. You can also see the private issue linking to this one for more background.

@mshustov
Copy link
Contributor

mshustov commented May 5, 2021

This issue started as a discussion based on performance bottlenecks in Fleet APIs,

IIRC Fleet had quite specific requirements to support up to 50k agents polling the Kibana server every X seconds. Since fleet switched to a custom server, we abandoned the current issue as there were no other examples, where @kbn/config-schema was a bottle-neck. Now, we can consider prioritizing it again.

and more recently we found a ~20-25% improvement in API response times for TSVB after removing schema validation,

I haven't checked the size of the TSVB payload, let me see how big it is.
Anyway, disabling validation is not a viable option, it's better to switch to another library. Do you want Core to provide recommendations on a substitution? It definitely requires additional investigation.

@kobelb
Copy link
Contributor Author

kobelb commented May 5, 2021

It appears that no validation is a superior option to the current library.

What? How did you come to this conclusion? Validation is absolutely necessary and should not be removed because it incurs a performance penalty. If there's a situation where performing validation using @kbn/config-schema is too slow, we should use a different approach to validation, not remove it entirely. This is why we support custom validation for HTTP requests. We've had some teams succeed with using io-ts, but I don't have any first hand experience to comment on whether or not we should be recommending it.

Removing the airbags and seatbelts from a car will make it go faster, and also make it more likely for you to be killed/injured. Everything is a trade-off.

@wylieconlon
Copy link
Contributor

You can find context for the decision to remove schema validation in this comment and the whole issue that it's a part of. I've also just linked you to more context in a private issue. We are still using prototype pollution checking on top of the Joi validation: #85952.

@mshustov
Copy link
Contributor

mshustov commented May 6, 2021

I haven't checked the size of the TSVB payload, let me see how big it is.

@wylieconlon /api/metrics/vis/data route validation has been removed in #97091, so I cannot check validation performance on master.
I downloaded v7.12 dist, installed the sample data set, and navigated to the [Logs] Web Traffic dashboard. According to my measurements, validation takes from 1ms to 4ms. Is there a way to reproduce the performance problem you mentioned?

JFYI: Some of the frames outlined in #97061 (comment) belong to internal Core validation.

@lizozom
Copy link
Contributor

lizozom commented Jul 20, 2022

@kobelb @lukeelmers what is the status on this issue?

@kobelb
Copy link
Contributor Author

kobelb commented Jul 20, 2022

@lizozom @kbn/config-schema is still slow

@pgayvallet
Copy link
Contributor

@kbn/config-schema is still slow

How dare you.

@lukeelmers
Copy link
Member

@lizozom Yeah, it's still slow. In theory we got a 2x-3x performance improvement from the upgrade in #99899, but otherwise we haven't been actively working on this, and it isn't currently on the near-term roadmap.

If we want to prioritize this, my recommendation would be that we start by gathering a few scenarios where this is becoming a significant bottleneck in Kibana. If we had some numbers to use as a baseline that are from real-world Kibana use cases, and some repeatable benchmarks we could run, it'd give us a framework to understand how critical this is relative to other potential performance enhancements. (This would be especially useful since solving this particular issue could require a substantial investment, especially if we are trying to keep feature parity with joi)

@pgayvallet
Copy link
Contributor

This would be especially useful since solving this particular issue could require a substantial investment, especially if we are trying to keep feature parity with joi

Ihmo we unfortunately just won't be able to provide feature parity with joi, at least not if the intent is to effectively have a performance improvement using the replacement. In short, given the features covered by joi (I'm thinking default values, custom validation message per type of error, or 99 other small things), no replacement lib actually does the same as joi faster, plain and simple (at least from the comparisons we did a while back).

However, I agree that identifying bottlenecks would be the key here. If we see that some routes are having their validation as a significant perf bottleneck, we could improve our custom route validation support (which, I'd like everyone to remember, is something that we already support) to have a better type support (in a similar way of what we're doing with config schema -> request body/params type inference/conversion). We could imagine, by improving our route validation types, to have the same kind of inference, for, say, io-ts.

Note that even if we were to do this, the code / route owners would still have to migrate their validation to the new validation lib, and to adapt their code for all the sugar that is not necessarily present in the replacement/alternative validation library (e.g last time I check, there wasn't native / out of the box support for default values for io-ts)

@cnasikas
Copy link
Member

cnasikas commented Sep 2, 2022

I recently came across to zod and I think it is a library worth considering. It is heavily inspired by io-ts. io-ts is an excellent library but I think it has quite the learning curve for developers without functional programming experience.

@lizozom
Copy link
Contributor

lizozom commented Sep 6, 2022

I think it would be interesting to benchmark the impact of validation on several APIs, so we can quantify the problem.
@kobelb @pgayvallet do you have an intuition as per which APIs might be heavily impacted by validation?

@pgayvallet
Copy link
Contributor

do you have an intuition as per which APIs might be heavily impacted by validation?

By APIs, you mean HTTP APIs, right?

I don't really have any idea honestly. IIRC:

  • The validation cost depends on whether it fails or pass, with failures being significantly higher in term of execution time
  • The validation cost scale linear(-ish)-ly with the amount of properties to validate within the schema
  • By nature, the effect of validation on overall % of CPU cycles is more visible on APIs that don't do much

@rudolf
Copy link
Contributor

rudolf commented Sep 6, 2022

I don't think doing a bottom-up approach to performance is fruitful. We need to start at the business level and work down to achieve our objectives. I would recommend we start by:

  1. Ask teams to define Service Level Objectives (SLO) for each of their APIs
  2. Track and report on the measured performance vs SLO using cloud proxy logs
  3. Investigate instances where we didn't meet our SLO
  4. Reproduce the problem then use Nodejs profiling to understand what are the CPU bottlenecks and address those

Otherwise we'll be doing premature optimisation without being sure what the business impact would be. The only way we could do a large scale generic performance improvement would be if we could collect CPU profiles from across cloud https://github.com/elastic/cloud/issues/103563 https://github.com/elastic/prodfiler/issues/2508

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kbn/config-schema performance Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc
Projects
None yet
Development

No branches or pull requests