@kbn/config-schema is slow #78351
Comments
Pinging @elastic/kibana-platform (Team:Platform) |
I didn't investigate this properly, so I would have to verify it again to be sure, but I remember seeing that a significant portion of startup CPU time is also spent inside Having said that, I don't think it's worth it for Kibana to maintain its own validation library. Because config validation is part of the Core API, we need to have control over it, so it makes sense to use |
Aren't both options the same? We will still need validation of some kind for routing, so both options mean either finding a replacement for Also, sorry to nitpick, but I'm not sure a 6% overhead issue can be labeled as |
In the context of route handling, both options are similar in that we don't want to use
I don't think we have a consistent way of labeling performance issues, so feel free to use whatever labels you desire :) |
(not sure if I should comment here or in #78353, but) are all route validations affected by this, or only specific bottlenecks? If it's only bottlenecks, core's route validation system allows using an arbitrary validation function instead of If all route validations are considered a potential performance issue, we need to find a faster alternative to As a sidenote, if we keep |
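For readers unfamiliar with the "arbitrary validation function" idea mentioned above, here is a rough sketch of what a hand-written validator for a single hot route could look like. Everything in it is an assumption made for illustration: `ValidationResult`, `PollBody`, and `validatePollBody` are hypothetical names, and the result shape is a simplified stand-in, not core's actual route validation types.

```typescript
// Hypothetical, simplified result shape for a custom validation function.
// These are NOT Kibana's actual types.
type ValidationResult<T> =
  | { ok: true; value: T }
  | { ok: false; message: string; path: string[] };

// Hypothetical request body for a frequently polled route.
interface PollBody {
  agentId: string;
  events: number;
}

// Hand-written validator: plain typeof checks, no schema library involved.
function validatePollBody(data: unknown): ValidationResult<PollBody> {
  if (typeof data !== 'object' || data === null || Array.isArray(data)) {
    return { ok: false, message: 'expected a plain object', path: [] };
  }
  const record = data as Record<string, unknown>;
  const agentId = record.agentId;
  if (typeof agentId !== 'string' || agentId.length === 0) {
    return { ok: false, message: 'expected a non-empty string', path: ['agentId'] };
  }
  const events = record.events;
  if (typeof events !== 'number' || !Number.isInteger(events) || events < 0) {
    return { ok: false, message: 'expected a non-negative integer', path: ['events'] };
  }
  // Return a fresh object so unknown keys on the input are dropped.
  return { ok: true, value: { agentId, events } };
}
```

The trade-off is the one raised in this thread: a validator like this is cheap to run, but each route owner has to write and maintain it, and nothing enforces a shared security model across routes.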
I think @rudolf meant that we shouldn't enforce
Another concern is tooling fragmentation. It's almost impossible to make sure every validation library implements the same security model (for example, filtering out raw values from error messages, #58843).
Is it a bottleneck for Fleet? We might need to add a benchmark setup to get absolute numbers and understand which order of magnitude we are talking about: millions or thousands of operations per second. Updating the Joi version would probably help us raise that limit to reasonable values. |
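As a starting point for the kind of benchmark setup suggested above, a minimal harness could look like the sketch below. It is only an illustration: `opsPerSecond`, `validateByHand`, and the payload are hypothetical, and `Date.now()` timing is coarse, so the numbers are only good for order-of-magnitude comparisons.

```typescript
// Measures roughly how many times per second `fn` can run.
function opsPerSecond(fn: () => void, iterations: number): number {
  // Warm-up so the JIT has a chance to optimize before timing starts.
  for (let i = 0; i < 1000; i++) fn();
  const start = Date.now();
  for (let i = 0; i < iterations; i++) fn();
  const elapsedMs = Math.max(Date.now() - start, 1); // avoid division by zero
  return (iterations / elapsedMs) * 1000;
}

// Hypothetical payload and hand-written validator used as a baseline.
const payload: unknown = { id: 'agent-1', status: 'online', metrics: [1, 2, 3] };

function validateByHand(value: unknown): boolean {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.id === 'string' &&
    typeof v.status === 'string' &&
    Array.isArray(v.metrics) &&
    v.metrics.every((m) => typeof m === 'number')
  );
}

let sink = 0; // keeps the result observable so the call is not optimized away
const baseline = opsPerSecond(() => {
  if (validateByHand(payload)) sink++;
}, 200000);
console.log(`hand-written: ${Math.round(baseline)} ops/sec (sink=${sink})`);
```

Timing a schema-library validator on the same payload with the same harness would give the relative cost; real route payloads would make the result more representative than this toy object.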
This issue is meant to address all usages of
Agreed, this is possible for the custom validation for the routes. This is what we did for Fleet, no longer use
If we decide that
If a usage of
Yes, it is a bottleneck. The current bottleneck is because of the use of |
@kobelb I'm going to create a benchmark (after FF) to compare |
Sharing an existing benchmark I found comparing |
ok, might require some work to actualize 😅 |
True 🤣 but we have the code in place to run it :) |
Updated it in this PR: gcanti/io-ts-benchmarks#4 The updated results in here: https://github.com/afharo/io-ts-benchmarks/blob/update-deps/README.md#results |
We recently chose to remove schema validation for TSVB because of the performance issue: #97061, but I think we are still missing a clear set of recommendations to Kibana developers. Should we continue developing with schemas? Should we stop using them until a fix is made? |
In Fastify we use |
When evaluating alternatives, we should keep in mind that we'd like to minimize, if not eliminate, the places of server-side code that do code generation from strings at runtime. Code generation from strings is a common attack vector for turning prototype pollution or other logic errors into remote code execution. |
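To make the constraint above concrete: a validation library can be built entirely from composed closures, with no source text generated at runtime (no `new Function`, no `eval`). The sketch below is a hypothetical illustration of that style, not the design of any library named in this thread; `Validator`, `str`, `num`, `arrayOf`, and `object` are made-up names.

```typescript
// A validator is just a function; composing validators builds closures,
// never source text that gets compiled at runtime.
type Validator<T> = (input: unknown, path: string) => T;

const str: Validator<string> = (input, path) => {
  if (typeof input !== 'string') throw new Error(`[${path}]: expected string`);
  return input;
};

const num: Validator<number> = (input, path) => {
  if (typeof input !== 'number' || Number.isNaN(input)) {
    throw new Error(`[${path}]: expected number`);
  }
  return input;
};

function arrayOf<T>(item: Validator<T>): Validator<T[]> {
  return (input, path) => {
    if (!Array.isArray(input)) throw new Error(`[${path}]: expected array`);
    return input.map((el, i) => item(el, `${path}.${i}`));
  };
}

function object<T>(shape: { [K in keyof T]: Validator<T[K]> }): Validator<T> {
  const keys = Object.keys(shape);
  const validators = shape as unknown as Record<string, Validator<unknown>>;
  return (input, path) => {
    if (typeof input !== 'object' || input === null || Array.isArray(input)) {
      throw new Error(`[${path}]: expected object`);
    }
    const source = input as Record<string, unknown>;
    // Null-prototype target, and only keys declared in the shape are read,
    // so keys such as "__proto__" on the input are never copied.
    const out: Record<string, unknown> = Object.create(null);
    for (const key of keys) {
      out[key] = validators[key](source[key], `${path}.${key}`);
    }
    // The shape guarantees these keys and value types.
    return out as unknown as T;
  };
}
```

Note that the error messages carry only the path and the expected type, never the raw value, which is in line with the concern about raw values in error messages raised earlier in the thread.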
When evaluating alternatives, we should also keep in mind that we'd ideally like feature parity with the parts of joi/config-schema the Kibana codebase is currently using. Else we will not be able to replace config-schema underlying lib and propose an alternative of config-schema instead. E.g
Now, we would also need to check which features are currently only used for config validation (not necessary for the alternative), and which are used for actual route validation. |
I like that you're starting to evaluate alternatives, but maybe my question was lost in the discussion. Asking again:
What is your advice to current developers, and can this be shared widely among the Kibana contributors? |
@wylieconlon Migrating from Maybe we can improve the situation by updating Kibana is still on v13, in our case we might gain even more. |
@mshustov I'm not sure you've addressed the concerns that I have, and that @kobelb originally raised in this issue. It appears that no validation is a superior option to the current library. There are two main examples where config-schema is a bottleneck. This issue started as a discussion based on performance bottlenecks in Fleet APIs, and more recently we found a ~20-25% improvement in API response times for TSVB after removing schema validation, see this comment. You can also see the private issue linking to this one for more background. |
IIRC Fleet had quite specific requirements to support up to 50k agents polling the Kibana server every X seconds. Since fleet switched to a custom server, we abandoned the current issue as there were no other examples, where
I haven't checked the size of the TSVB payload, let me see how big it is. |
What? How did you come to this conclusion? Validation is absolutely necessary and should not be removed because it incurs a performance penalty. If there's a situation where performing validation using Removing the airbags and seatbelts from a car will make it go faster, and also make it more likely for you to be killed/injured. Everything is a trade-off. |
You can find context for the decision to remove schema validation in this comment and the whole issue that it's a part of. I've also just linked you to more context in a private issue. We are still using prototype pollution checking on top of the Joi validation: #85952. |
@wylieconlon JFYI: Some of the frames outlined in #97061 (comment) belong to internal Core validation. |
@kobelb @lukeelmers what is the status on this issue? |
@lizozom |
How dare you. |
@lizozom Yeah, it's still slow. In theory we got a 2x-3x performance improvement from the upgrade in #99899, but otherwise we haven't been actively working on this, and it isn't currently on the near-term roadmap. If we want to prioritize this, my recommendation would be that we start by gathering a few scenarios where this is becoming a significant bottleneck in Kibana. If we had some numbers to use as a baseline that are from real-world Kibana use cases, and some repeatable benchmarks we could run, it'd give us a framework to understand how critical this is relative to other potential performance enhancements. (This would be especially useful since solving this particular issue could require a substantial investment, especially if we are trying to keep feature parity with joi) |
IMHO we unfortunately just won't be able to provide feature parity with joi, at least not if the intent is to effectively have a performance improvement using the replacement. In short, given the features covered by joi (I'm thinking default values, custom validation message per type of error, or 99 other small things), no replacement lib actually does the same as joi faster, plain and simple (at least from the comparisons we did a while back). However, I agree that identifying bottlenecks would be the key here. If we see that some routes are having their validation as a significant perf bottleneck, we could improve our custom route validation support (which, I'd like everyone to remember, is something that we already support) to have better type support (in a similar way to what we're doing with config schema -> request body/params type inference/conversion). We could imagine, by improving our route validation types, to have the same kind of inference, for, say, Note that even if we were to do this, the code / route owners would still have to migrate their validation to the new validation lib, and to adapt their code for all the sugar that is not necessarily present in the replacement/alternative validation library (e.g. last time I checked, there wasn't native / out of the box support for default values for |
I recently came across zod and I think it is a library worth considering. It is heavily inspired by io-ts. io-ts is an excellent library but I think it has quite the learning curve for developers without functional programming experience.
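To illustrate the two pieces of "sugar" discussed above, type inference and default values, here is a small sketch of how they could be layered on top of plain function validators. This is a hypothetical illustration under assumed names (`Validator`, `withDefault`, `TypeOf`), not the API of any library mentioned in this thread.

```typescript
// A validator takes unknown input and returns a typed value or throws.
type Validator<T> = (input: unknown) => T;

const str: Validator<string> = (input) => {
  if (typeof input !== 'string') throw new Error('expected string');
  return input;
};

// Hypothetical helper: substitutes `fallback` when the input is undefined,
// otherwise defers to the wrapped validator.
function withDefault<T>(inner: Validator<T>, fallback: T): Validator<T> {
  return (input) => (input === undefined ? fallback : inner(input));
}

// Recovers the validated type from a validator, so a route handler would not
// need to declare its request types by hand.
type TypeOf<V> = V extends Validator<infer T> ? T : never;

const sortOrder = withDefault(str, 'asc');
type SortOrder = TypeOf<typeof sortOrder>; // resolves to string

const order: SortOrder = sortOrder(undefined);
console.log(order);
```

Helpers like this are simple individually; the migration cost described above comes from the number of such behaviours that existing schemas rely on.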
I think it would be interesting to benchmark the impact of validation on several APIs, so we can quantify the problem. |
By APIs, you mean HTTP APIs, right? I don't really have any idea honestly. IIRC:
|
I don't think doing a bottom-up approach to performance is fruitful. We need to start at the business level and work down to achieve our objectives. I would recommend we start by:
Otherwise we'll be doing premature optimisation without being sure what the business impact would be. The only way we could do a large scale generic performance improvement would be if we could collect CPU profiles from across cloud https://github.com/elastic/cloud/issues/103563 https://github.com/elastic/prodfiler/issues/2508 |
At the moment, @kbn/config-schema is slow because Joi is admittedly slow, and not focused on increasing performance in the short-term: hapijs/joi#2340 (comment).
While performing some performance profiling for Fleet, I'm seeing around 6% of the total CPU time being spent performing validation with @kbn/config-schema.
I think we have two options here:
Embrace the fact that @kbn/config-schema is too slow for some purposes, and no longer use it in situations like route validation.
Change @kbn/config-schema to no longer rely on Joi, and work on improving its performance.