@kbn/config-schema is slow #78351

kobelb · 2020-09-23T20:51:07Z

At the moment, @kbn/config-schema is slow because Joi is admittedly slow, and not focused on increasing performance in the short-term: hapijs/joi#2340 (comment).

While performing some performance profiling for Fleet, I'm seeing around 6% of the total CPU time being spend performing validation with @kbn/config-schema.

I think we have two options here, embrace the fact that @kbn/config-schema is too slow for some purposes, and no longer use it situations like route validation. Change @kbn/config-schema to no longer rely on Joi, and work on improving its performance.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2020-09-23T20:56:42Z

Pinging @elastic/kibana-platform (Team:Platform)

rudolf · 2020-09-24T11:00:22Z

I didn't investigate this properly, so I would have to verify it again to be sure, but I remember seeing a significant portion of startup CPU time is also spent inside @kbn/config-schema. It would be worth investigating this, if it's significant there would be even more motivation to increase the performance of @kbn/config-schema.

Having said that, I don't think it's worth for Kibana to maintain it's own validation library. Because config validation is part of the Core API we need to have control over that so it makes sense to use @kbn/config-schema, but instead of spending time making that faster and increasing the API surface, I think we should rather limit the time we spend on it and limit the API to only what's necessary for config validation.

pgayvallet · 2020-09-28T14:22:33Z

I think we have two options here, embrace the fact that @kbn/config-schema is too slow for some purposes, and no longer use it situations like route validation. Change @kbn/config-schema to no longer rely on Joi, and work on improving its performance.

Aren't both options the same? We will still need validation of some kind for routing, so both options means either finding a replacement for joi, or implement our own as config-schema underlying validation library, don't they? From what I've seen, some of our route validations are not 'less complex' that our config file validation, so we would need almost the same features for both.

Also, sorry to nitpick, but I'm not sure a 6% overheat issue can be labeled as bug? Isn't performance enough?

kobelb · 2020-09-28T17:15:00Z

Aren't both options the same? We will still need validation of some kind for routing, so both options means either finding a replacement for joi, or implement our own as config-schema underlying validation library, don't they? From what I've seen, some of our route validations are not 'less complex' that our config file validation, so we would need almost the same features for both.

In the context of route handling, both options are similar in that we don't want to use Joi to do route level validation. However, I still think there are two different options in regard to what we do with @kbn/config-schema as a whole.

Admit @kbn/config-schema is slow, and don't use it where performance matters
Fix @kbn/config-schema so it's now fast, can be used everywhere.

Also, sorry to nitpick, but I'm not sure a 6% overheat issue can be labeled as bug? Isn't performance enough?

I don't think we have a consistent way of labeling performance issues, so feel free to use whatever labels you desire :)

pgayvallet · 2020-09-29T08:21:27Z

Admit @kbn/config-schema is slow, and don't use it where performance matters

(not sure if I should comment here or in #78353, but) are all route validations concerned by this, or only specific bottlenecks?

If that's only bottlenecks, core's route validation system allows to use an arbitrary validation function instead of config-schema, so the quick win should probably be to use manual validation for these endpoints.

If all routes validations are considered as a potential performance issue, we need to find a faster alternative to joi , and document it as the preferred validation system for routing. But if we go down that road, 'Fixing' config-schema to change the underlying validation library is probably the way to go, as ideally, we wouldn't be using two distinct libraries.

As a sidenote, if we keep config-schema for route validation, we might want to rename it to something like @kbn/validation-schema or something.

mshustov · 2020-09-29T08:21:49Z

We will still need validation of some kind for routing

I think @rudolf meant that we shouldn't enforce @kbn/config-schema as a standard for the routing validation. However, we already allow plugins to use any custom validation library. My main concern about the topic: what if a plugin is not a hot path and called not so often as Fleet endpoints? They might be interested in using whatever tool off-the-shelf and don't spend their time investigating alternative options.

From what I've seen, some of our route validations are not 'less complex' that our config file validation, so we would need almost the same features for both.

Another concern is the tooling fragmentation. It's almost impossible to make sure all the validation library implements the same security model (for example, filtering out the raw values from error messages #58843).

While performing some performance profiling for Fleet, I'm seeing around 6% of the total CPU time being spend performing validation with @kbn/config-schema.

Is it a bottleneck for Fleet? We might need to add a benchmark setup to get absolute numbers to understand which order of numbers we are talking about - millions or thousands of operations per second. Probably, updating the Joi version helps us to increase that limit to reasonable values.

kobelb · 2020-09-29T21:13:41Z

(not sure if I should comment here or in #78353, but) are all route validations concerned by this, or only specific bottlenecks?

This issue is meant to address all usages of @kbn/config-schema everywhere. Built-in route validation, custom route validation, config, etc.

If that's only bottlenecks, core's route validation system allows to use an arbitrary validation function instead of config-schema, so the quick win should probably be to use manual validation for these endpoints.

Agreed, this is possible for the custom validation for the routes. This is what we did for Fleet, no longer use @kbn/config-schema or Joi for the route validation.

I think @rudolf meant that we shouldn't enforce @kbn/config-schema as a standard for the routing validation. However, we already allow plugins to use any custom validation library.

If we decide that @kbn/config-schema is slow and we aren't going to invest the effort to make it fast, I think we should stop encouraging plugins to use @kbn/config-schema for their route validation. With the way the API is currently created, it leads plugin developers to default to using @kbn/config-schema

My main concern about the topic: what if a plugin is not a hot path and called not so often as Fleet endpoints? They might be interested in using whatever tool off-the-shelf and don't spend their time investigating alternative options.

If a usage of @kbn/config-schema isn't in the hot path, it's not as large of a concern. However, I don't think we should be encouraging developers to use something that is slow and won't scale.

Is it a bottleneck for Fleet? We might need to add a benchmark setup to get absolute numbers to understand which order of numbers we are talking about - millions or thousands of operations per second. Probably, updating the Joi version helps us to increase that limit to reasonable values.

Yes, it is a bottleneck. The current bottleneck is because of the use of @kbn/config-schema for the built-in route validation, which is why I created #78353 to work-around this specific bottleneck.

mshustov · 2020-09-30T06:50:19Z

Yes, it is a bottleneck. The current bottleneck is because of the use of @kbn/config-schema for the built-in route validation, which is why I created #78353 to work-around this specific bottleneck.

@kobelb I'm going to create a benchmark (after FF) to compare @kbn/config-schema with joi, io-ts to have particular numbers we can discuss here.

afharo · 2021-03-24T15:42:23Z

Sharing an existing benchmark I found comparing joi, io-ts and others: https://github.com/gcanti/io-ts-benchmarks

mshustov · 2021-03-24T16:50:35Z

3 years ago

ok, might require some work to actualize 😅

afharo · 2021-03-25T11:33:47Z

True 🤣 but we have the code in place to run it :)

afharo · 2021-03-25T11:53:47Z

Updated it in this PR: gcanti/io-ts-benchmarks#4

The updated results in here: https://github.com/afharo/io-ts-benchmarks/blob/update-deps/README.md#results

wylieconlon · 2021-04-21T15:47:51Z

We recently chose to remove schema validation for TSVB because of the performance issue: #97061, but I think we are still missing a clear set of recommendations to Kibana developers. Should we continue developing with schemas? Should we stop using them until a fix is made?

delvedor · 2021-04-21T17:29:27Z

In Fastify we use ajv, it offers amazing performances, and it also allows you to pre-compile the schema, speeding up even more the start-up phase.
For generating the schemas without writing plain js objects, we have developed fluent-json-schema, while if you need a strongly typed solution, there is @sinclair/typebox (take a look here for other solutions).

kobelb · 2021-04-21T17:32:23Z

When evaluating alternatives, we should keep in mind that we'd like to minimize, if not eliminate, the places of server-side code that do code generation from strings at runtime. Code generation from strings is a common attack vector for turning prototype pollution or other logic errors into remote code execution.

delvedor · 2021-04-21T17:40:01Z

@kobelb if you don't generate the schemas each time at startup and the definitions don't change much, you can directly commit the generated validator code.
This is what we do in Fastify, which is generated here every time we have a config change.

pgayvallet · 2021-04-21T18:27:09Z

When evaluating alternatives, we should also keep in mind that we'd ideally like feature parity with the parts of joi/config-schema the Kibana codebase is currently using. Else we will not be able to replace config-schema underlying lib and propose an alternative of config-schema instead.

E.g

support for default values in schemas (used in route validation)
type inference from a schema to the equivalent TS type (used in route validation)
sibling / context references (not sure, maybe not used in route validation)
custom validation methods with type inference (used in route validation)
data conversion (e.g json string repr to parsed json object, string repr of an number to the number) (used by our route validation for structures sent in GET parameters)
what else?

Now, we would also need to check which features are currently only used for config validation (not necessary for the alternative), and which are used for actual route validation.

wylieconlon · 2021-05-05T15:40:26Z

I like that you're starting to evaluate alternatives, but maybe my question was lost in the discussion. Asking again:

We are still missing a clear set of recommendations to Kibana developers. Should we continue developing with schemas? Should we stop using them until a fix is made?

What is your advice to current developers, and can this be shared widely among the Kibana contributors?

mshustov · 2021-05-05T17:56:48Z

Should we stop using them until a fix is made?

@wylieconlon Migrating from config-schema creates fragmentation in our tooling and it will be hard to switch plugins back to config-schema. Let's continue using the common library. When we address the problem, all the endpoints will benefit from it.
Is there an example where config-schema validation affects response time? We can use it as a baseline for further refactoring.

Maybe we can improve the situation by updating joi version?
According to the @afharo's benchmark, updating joi from v14 to v 17 gives almost x2 improvement for the valid dataset, x3 improvement for the invalid dataset.

Kibana is still on v13, in our case we might gain even more.

wylieconlon · 2021-05-05T18:09:57Z

@mshustov I'm not sure you've addressed the concerns that I have, and that @kobelb originally raised in this issue. It appears that no validation is a superior option to the current library.

There are two main examples where config-schema is a bottleneck. This issue started as a discussion based on performance bottlenecks in Fleet APIs, and more recently we found a ~20-25% improvement in API response times for TSVB after removing schema validation, see this comment. You can also see the private issue linking to this one for more background.

mshustov · 2021-05-05T18:55:30Z

This issue started as a discussion based on performance bottlenecks in Fleet APIs,

IIRC Fleet had quite specific requirements to support up to 50k agents polling the Kibana server every X seconds. Since fleet switched to a custom server, we abandoned the current issue as there were no other examples, where @kbn/config-schema was a bottle-neck. Now, we can consider prioritizing it again.

and more recently we found a ~20-25% improvement in API response times for TSVB after removing schema validation,

I haven't checked the size of the TSVB payload, let me see how big it is.
Anyway, disabling validation is not a viable option, it's better to switch to another library. Do you want Core to provide recommendations on a substitution? It definitely requires additional investigation.

kobelb · 2021-05-05T20:32:41Z

It appears that no validation is a superior option to the current library.

What? How did you come to this conclusion? Validation is absolutely necessary and should not be removed because it incurs a performance penalty. If there's a situation where performing validation using @kbn/config-schema is too slow, we should use a different approach to validation, not remove it entirely. This is why we support custom validation for HTTP requests. We've had some teams succeed with using io-ts, but I don't have any first hand experience to comment on whether or not we should be recommending it.

Removing the airbags and seatbelts from a car will make it go faster, and also make it more likely for you to be killed/injured. Everything is a trade-off.

wylieconlon · 2021-05-05T21:08:26Z

You can find context for the decision to remove schema validation in this comment and the whole issue that it's a part of. I've also just linked you to more context in a private issue. We are still using prototype pollution checking on top of the Joi validation: #85952.

mshustov · 2021-05-06T12:35:07Z

I haven't checked the size of the TSVB payload, let me see how big it is.

@wylieconlon /api/metrics/vis/data route validation has been removed in #97091, so I cannot check validation performance on master.
I downloaded v7.12 dist, installed the sample data set, and navigated to the [Logs] Web Traffic dashboard. According to my measurements, validation takes from 1ms to 4ms. Is there a way to reproduce the performance problem you mentioned?

JFYI: Some of the frames outlined in #97061 (comment) belong to internal Core validation.

lizozom · 2022-07-20T15:01:44Z

@kobelb @lukeelmers what is the status on this issue?

kobelb · 2022-07-20T16:56:38Z

@lizozom @kbn/config-schema is still slow

pgayvallet · 2022-07-21T13:16:07Z

@kbn/config-schema is still slow

How dare you.

lukeelmers · 2022-08-03T20:26:21Z

@lizozom Yeah, it's still slow. In theory we got a 2x-3x performance improvement from the upgrade in #99899, but otherwise we haven't been actively working on this, and it isn't currently on the near-term roadmap.

If we want to prioritize this, my recommendation would be that we start by gathering a few scenarios where this is becoming a significant bottleneck in Kibana. If we had some numbers to use as a baseline that are from real-world Kibana use cases, and some repeatable benchmarks we could run, it'd give us a framework to understand how critical this is relative to other potential performance enhancements. (This would be especially useful since solving this particular issue could require a substantial investment, especially if we are trying to keep feature parity with joi)

pgayvallet · 2022-08-22T15:01:04Z

This would be especially useful since solving this particular issue could require a substantial investment, especially if we are trying to keep feature parity with joi

Ihmo we unfortunately just won't be able to provide feature parity with joi, at least not if the intent is to effectively have a performance improvement using the replacement. In short, given the features covered by joi (I'm thinking default values, custom validation message per type of error, or 99 other small things), no replacement lib actually does the same as joi faster, plain and simple (at least from the comparisons we did a while back).

However, I agree that identifying bottlenecks would be the key here. If we see that some routes are having their validation as a significant perf bottleneck, we could improve our custom route validation support (which, I'd like everyone to remember, is something that we already support) to have a better type support (in a similar way of what we're doing with config schema -> request body/params type inference/conversion). We could imagine, by improving our route validation types, to have the same kind of inference, for, say, io-ts.

Note that even if we were to do this, the code / route owners would still have to migrate their validation to the new validation lib, and to adapt their code for all the sugar that is not necessarily present in the replacement/alternative validation library (e.g last time I check, there wasn't native / out of the box support for default values for io-ts)

cnasikas · 2022-09-02T13:01:57Z

I recently came across to zod and I think it is a library worth considering. It is heavily inspired by io-ts. io-ts is an excellent library but I think it has quite the learning curve for developers without functional programming experience.

lizozom · 2022-09-06T11:15:12Z

I think it would be interesting to benchmark the impact of validation on several APIs, so we can quantify the problem.
@kobelb @pgayvallet do you have an intuition as per which APIs might be heavily impacted by validation?

pgayvallet · 2022-09-06T13:28:10Z

do you have an intuition as per which APIs might be heavily impacted by validation?

By APIs, you mean HTTP APIs, right?

I don't really have any idea honestly. IIRC:

The validation cost depends on whether it fails or pass, with failures being significantly higher in term of execution time
The validation cost scale linear(-ish)-ly with the amount of properties to validate within the schema
By nature, the effect of validation on overall % of CPU cycles is more visible on APIs that don't do much

rudolf · 2022-09-06T13:50:07Z

I don't think doing a bottom-up approach to performance is fruitful. We need to start at the business level and work down to achieve our objectives. I would recommend we start by:

Ask teams to define Service Level Objectives (SLO) for each of their APIs
Track and report on the measured performance vs SLO using cloud proxy logs
Investigate instances where we didn't meet our SLO
Reproduce the problem then use Nodejs profiling to understand what are the CPU bottlenecks and address those

Otherwise we'll be doing premature optimisation without being sure what the business impact would be. The only way we could do a large scale generic performance improvement would be if we could collect CPU profiles from across cloud https://github.com/elastic/cloud/issues/103563 https://github.com/elastic/prodfiler/issues/2508

kobelb mentioned this issue Sep 23, 2020

Change the built-in HTTP request validation to no longer use @kbn/config-schema #78353

Closed

kobelb added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Sep 23, 2020

kobelb added performance bug Fixes for quality problems that affect the customer experience labels Sep 23, 2020

mshustov mentioned this issue Dec 6, 2020

[Meta] Kibana platform performance #63848

Closed

29 tasks

pgayvallet removed the bug Fixes for quality problems that affect the customer experience label Mar 31, 2021

pgayvallet mentioned this issue Apr 14, 2021

[TSVB] Improve schema validation performance #97061

Closed

pgayvallet mentioned this issue May 17, 2021

Migrate joi to 17.4.0 and adapt the codebase #99899

Merged

1 task

afharo mentioned this issue Aug 18, 2021

Improve server_preboot execution time #109059

Closed

pgayvallet added the kbn/config-schema label Jun 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

@kbn/config-schema is slow #78351

@kbn/config-schema is slow #78351

kobelb commented Sep 23, 2020 •

edited

Loading

elasticmachine commented Sep 23, 2020

rudolf commented Sep 24, 2020

pgayvallet commented Sep 28, 2020 •

edited

Loading

kobelb commented Sep 28, 2020

pgayvallet commented Sep 29, 2020

mshustov commented Sep 29, 2020

kobelb commented Sep 29, 2020

mshustov commented Sep 30, 2020

afharo commented Mar 24, 2021

mshustov commented Mar 24, 2021

afharo commented Mar 25, 2021

afharo commented Mar 25, 2021

wylieconlon commented Apr 21, 2021

delvedor commented Apr 21, 2021

kobelb commented Apr 21, 2021 •

edited

Loading

delvedor commented Apr 21, 2021

pgayvallet commented Apr 21, 2021 •

edited

Loading

wylieconlon commented May 5, 2021

mshustov commented May 5, 2021

wylieconlon commented May 5, 2021

mshustov commented May 5, 2021

kobelb commented May 5, 2021 •

edited

Loading

wylieconlon commented May 5, 2021

mshustov commented May 6, 2021 •

edited

Loading

lizozom commented Jul 20, 2022

kobelb commented Jul 20, 2022

pgayvallet commented Jul 21, 2022

lukeelmers commented Aug 3, 2022

pgayvallet commented Aug 22, 2022

cnasikas commented Sep 2, 2022 •

edited

Loading

lizozom commented Sep 6, 2022

pgayvallet commented Sep 6, 2022

rudolf commented Sep 6, 2022

@kbn/config-schema is slow #78351

@kbn/config-schema is slow #78351

Comments

kobelb commented Sep 23, 2020 • edited Loading

elasticmachine commented Sep 23, 2020

rudolf commented Sep 24, 2020

pgayvallet commented Sep 28, 2020 • edited Loading

kobelb commented Sep 28, 2020

pgayvallet commented Sep 29, 2020

mshustov commented Sep 29, 2020

kobelb commented Sep 29, 2020

mshustov commented Sep 30, 2020

afharo commented Mar 24, 2021

mshustov commented Mar 24, 2021

afharo commented Mar 25, 2021

afharo commented Mar 25, 2021

wylieconlon commented Apr 21, 2021

delvedor commented Apr 21, 2021

kobelb commented Apr 21, 2021 • edited Loading

delvedor commented Apr 21, 2021

pgayvallet commented Apr 21, 2021 • edited Loading

wylieconlon commented May 5, 2021

mshustov commented May 5, 2021

wylieconlon commented May 5, 2021

mshustov commented May 5, 2021

kobelb commented May 5, 2021 • edited Loading

wylieconlon commented May 5, 2021

mshustov commented May 6, 2021 • edited Loading

lizozom commented Jul 20, 2022

kobelb commented Jul 20, 2022

pgayvallet commented Jul 21, 2022

lukeelmers commented Aug 3, 2022

pgayvallet commented Aug 22, 2022

cnasikas commented Sep 2, 2022 • edited Loading

lizozom commented Sep 6, 2022

pgayvallet commented Sep 6, 2022

rudolf commented Sep 6, 2022

kobelb commented Sep 23, 2020 •

edited

Loading

pgayvallet commented Sep 28, 2020 •

edited

Loading

kobelb commented Apr 21, 2021 •

edited

Loading

pgayvallet commented Apr 21, 2021 •

edited

Loading

kobelb commented May 5, 2021 •

edited

Loading

mshustov commented May 6, 2021 •

edited

Loading

cnasikas commented Sep 2, 2022 •

edited

Loading