Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove standard deviation from stats aggregation #1788

Merged
merged 1 commit into from
Jan 17, 2023

Conversation

guilload
Copy link
Member

@guilload guilload commented Jan 13, 2023

(this is actually not part of the elastic api)

@guilload guilload linked an issue Jan 13, 2023 that may be closed by this pull request
@guilload guilload requested a review from PSeitz January 13, 2023 23:52
@codecov-commenter
Copy link

codecov-commenter commented Jan 14, 2023

Codecov Report

Merging #1788 (0caaf13) into main (a59bd96) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1788      +/-   ##
==========================================
- Coverage   94.14%   94.13%   -0.01%     
==========================================
  Files         275      275              
  Lines       51916    51895      -21     
==========================================
- Hits        48875    48851      -24     
- Misses       3041     3044       +3     
Impacted Files Coverage Δ
src/aggregation/agg_req.rs 94.03% <ø> (ø)
src/aggregation/bucket/histogram/histogram.rs 99.56% <ø> (-0.01%) ⬇️
src/aggregation/metric/stats.rs 96.99% <100.00%> (-0.24%) ⬇️
src/fastfield/multivalued/mod.rs 98.45% <0.00%> (-0.78%) ⬇️
src/schema/schema.rs 98.78% <0.00%> (-0.14%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@adamreichold
Copy link
Collaborator

Why do we have to remove this just because ES does not have it? That does not seem to justify anything. Can you explain why it is necessary that Tantivy's API strictly matches that of ES instead of just being compatible? Callers of this API who are not interested in the field can just ignore it after all.

@guilload
Copy link
Member Author

I see where you're coming from. This decision seems arbitrary, but I'll give you more context. Quickwit is built on top of tantivy. Quickwit aims to be fully ES-compatible for a few endpoints. We want to use tools that automatically check whether our ES API implementation conforms to the spec. This field is not part of the spec and will break our tests. Yes, we could deal with that in Quickwit, but it's much simpler to do so in tantivy.

Today, Quickwit, Inc., a for-profit company, drives the tantivy project and invests significant resources ($$$) into it. Consequently, the project has evolved positively in the last two years. We are mindful of developing tantivy in a way that allows Quickwit to achieve its goals while ensuring it remains a coherent general-purpose search engine library. In some instances, our design and implementation decisions will be more favorable to the Quickwit users than the pure tantivy users. That's the reality of commercial open-source software. Nevertheless, it's always a tradeoff that we evaluate carefully. In this case, we determined it was perfectly reasonable to move one field from the stats aggregation to the extended_stats one.

@adamreichold
Copy link
Collaborator

Nevertheless, it's always a tradeoff that we evaluate carefully.

I think for an open-source project, it would nevertheless make sense if this evaluation was public, i.e. giving that context as part of issues and/or pull requests instead of just pushing the resulting decision. (As it is, the related issue here for example adds nothing on top of the PR.)

In this case, we determined it was perfectly reasonable to move one field from the stats aggregation to the extended_stats one.

This seems reasonable to me, but it isn't what this PR proposes. It strictly removes functionality from Tantivy instead of just moving it to another part of the API. (AFAIK, there is no ExtendedStats API at the moment?)

Today, Quickwit, Inc., a for-profit company, drives the tantivy project [..]

Finally, w.r.t. the rest of the given context, I have to admit that the comment feels somewhat aggressive to me personally. I have never questioned which organization drives Tantivy or argued about who invests how many resources.

I just questioned a change that I cannot understand using the publicly available information. I would probably not have done so, if the rationale of automatically ensuring field-for-field compatibility would have been discussed up front.

@fulmicoton
Copy link
Collaborator

@adamreichold The spec of this collector is to mimick what elasticsearch does. A deviation is a bug that needs to be solved. I agree this spec can be discussed. We considered for a long time leaving this collector in quickwit under the AGPL license, and thought it might be useful for other users.

Opensource comes in different forms and a license is not telling much about stewardship, and I understand that this might be frustrating. I'm afraid neither tantivy nor quickwit are very democratic. There are at most consultative democracy.

On tantivy, I will usually have the last word, and will always prioritize the interest of, in that order:

  • quickwit
  • actual official tantivy users weighted by the importance of their usage.
  • contributors
  • non-contributors people

I am open to shift this priority when larger users show up publicy, or if companies want to contribute financially to the development of tantivy.

@guilload guilload force-pushed the guilload/remove-std-dev-from-stats-agg branch from 99c4876 to 0caaf13 Compare January 17, 2023 03:58
@guilload guilload merged commit c9cb3d0 into main Jan 17, 2023
@guilload guilload deleted the guilload/remove-std-dev-from-stats-agg branch January 17, 2023 04:16
@adamreichold
Copy link
Collaborator

We most likely will not qualify as "large" in terms of data set size, but we do use Tantivy at an official project at the Federal Environment Agency in Germany. (Under AGPL which is planned to stay that way even when commercial contracts are awarded.) But we are still in the prototype phase without a commitment to the technical approach.

Whether that commitment happens will not depend on the presence of the standard deviation in the Stats aggregator. But it will depend on our work group making the conscious choice to rely on Tantivy for the multi-year project duration (which is admittedly the same for all non-trivial dependencies). And that depends on whether we are able to build up sufficient trust in the stewardship of the project during the prototype phase.

I do understand that in FOSS-without-a-contract, there are no guarantees beyond go-fork-yourself and that the offer of keeping Tantivy open source while building a commercial product on top of it is generous. So I can take that or leave it if I think I have a better alternative (which I do not, otherwise I would not have made the choice and tried to start contributing). But the message that got delivered here does not help me with the taking part.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove standard_deviation field from Stats
4 participants