Add scaled_float. #19264

Merged
merged 1 commit into from
Jul 18, 2016
@jpountz jpountz commented Jul 5, 2016

This is an attempt to revive #15939, motivated by elastic/beats#1941.
Half-floats are a pretty bad option for storing percentages: they would likely
require 2 bytes all the time, while percentages really don't need more than one
byte.

So this PR exposes a new scaled_float type that requires a scaling_factor
and internally indexes value*scaling_factor in a long field. Compared to the
original PR it exposes a lower-level API so that the trade-offs are clearer and
avoids any reference to fixed precision that might imply that this type is more
accurate (actually it is less accurate).
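
As a rough illustration of the scaling described above, here is a minimal sketch. It is not the actual `ScaledFloatFieldMapper` code; the class and method names are hypothetical, and it only assumes what the description states: the value is multiplied by the scaling factor and rounded to the closest long.

```java
// Illustrative sketch only: hypothetical names, not the mapper's real implementation.
final class ScaledFloatSketch {

    /** Multiply by the scaling factor and round to the closest long (the indexed value). */
    static long encode(double value, double scalingFactor) {
        return Math.round(value * scalingFactor);
    }

    /** Divide the stored long by the scaling factor to get an approximate value back. */
    static double decode(long encoded, double scalingFactor) {
        return encoded / scalingFactor;
    }

    public static void main(String[] args) {
        double scalingFactor = 100;
        long stored = encode(0.4567, scalingFactor);      // 45.67 rounds to 46
        double restored = decode(stored, scalingFactor);  // 0.46, not 0.4567
        System.out.println(stored + " -> " + restored);
    }
}
```

The round trip loses everything below the step size of `1 / scaling_factor`, which is why the type is less accurate than a plain float rather than more.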

In addition to being more space-efficient for some use-cases that Beats is
interested in, this is also faster than half_float unless we can improve the
efficiency of decoding half-float bits (which is currently done in software)
or until Java gets first-class support for half-floats.
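
To give an idea of what decoding half-float bits in software involves, here is a minimal sketch of an IEEE 754 binary16 decoder. It is not the code that Lucene or Elasticsearch uses; the class and method names are hypothetical and the implementation favors readability over speed.

```java
// Illustrative sketch only: a straightforward binary16 decoder with hypothetical names.
final class HalfFloatSketch {

    /** Decode the 16 bits of an IEEE 754 half-precision value into a float. */
    static float halfToFloat(short half) {
        int bits = half & 0xFFFF;
        int sign = (bits >>> 15) & 0x1;   // 1 sign bit
        int exponent = (bits >>> 10) & 0x1F; // 5 exponent bits, bias 15
        int mantissa = bits & 0x3FF;      // 10 mantissa bits

        float value;
        if (exponent == 0) {
            // Zero or subnormal: mantissa * 2^-24
            value = mantissa * (1.0f / (1 << 24));
        } else if (exponent == 0x1F) {
            // Infinity or NaN
            value = mantissa == 0 ? Float.POSITIVE_INFINITY : Float.NaN;
        } else {
            // Normal number: (1 + mantissa / 1024) * 2^(exponent - 15)
            value = (float) ((1.0 + mantissa / 1024.0) * Math.pow(2, exponent - 15));
        }
        return sign == 0 ? value : -value;
    }

    public static void main(String[] args) {
        System.out.println(halfToFloat((short) 0x3C00)); // 1.0
        System.out.println(halfToFloat((short) 0x3800)); // 0.5
        System.out.println(halfToFloat((short) 0xC000)); // -2.0
    }
}
```

Every read has to go through this kind of bit manipulation and branching, whereas a scaled long only needs a single division by the scaling factor.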

@jpountz jpountz added >feature :Search Foundations/Mapping Index mappings, including merging and defining field types v5.0.0-alpha5 labels Jul 5, 2016
jpountz commented Jul 5, 2016

To give more detailed information about why half floats are not enough, here is a table that gives disk usage for storing 10M random floats between 0 and 1 depending on the mapping:

| Mapping | Points disk usage (kB) | Doc values disk usage (kB) | Total (kB) |
| --- | --- | --- | --- |
| float | 49728 | 34180 | 83908 |
| half float | 26560 | 19532 | 46092 |
| scaled float (factor=4000) | 25744 | 14652 | 40396 |
| scaled float (factor=100) | 13044 | 9768 | 22812 |

I chose 4000 and 100 as scaling factors: 4000 gives 0.025% accuracy, which is better than what a half float can do for this particular use case (floats between 0 and 1) yet requires less disk space, and 100 because I suspect its 1% accuracy would be enough for many metrics like CPU utilization.
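
The arithmetic behind those two factors can be sketched as follows. This is an illustration with hypothetical names, assuming values between 0 and 1; it computes the step size between representable values and the number of bits needed for the largest scaled value, not the actual on-disk encoding.

```java
// Illustrative sketch only: hypothetical helper for reasoning about scaling factors.
final class ScalingFactorTradeoffs {

    /** Step between two representable values, as a percentage of the [0, 1] range. */
    static double stepPercent(double scalingFactor) {
        return 100.0 / scalingFactor;
    }

    /** Bits needed to hold every scaled value in [0, maxValue] as an unsigned integer. */
    static int bitsRequired(double maxValue, double scalingFactor) {
        long maxEncoded = Math.round(maxValue * scalingFactor);
        return 64 - Long.numberOfLeadingZeros(maxEncoded);
    }

    public static void main(String[] args) {
        // factor=100: 1.0% steps, scaled values 0..100 fit in 7 bits (under one byte)
        System.out.println(stepPercent(100) + "% / " + bitsRequired(1.0, 100) + " bits");
        // factor=4000: 0.025% steps, scaled values 0..4000 fit in 12 bits
        System.out.println(stepPercent(4000) + "% / " + bitsRequired(1.0, 4000) + " bits");
    }
}
```

Both cases need fewer than the 16 bits of a half float, which is consistent with the lower disk usage in the table.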

Of course this is not a good benchmark since this is fake data, but given how points and doc values work, this simulates the worst case, and real data can be expected to use even less disk space.


```java
/** A {@link FieldMapper} for scaled floats. Values are internally multiplied
 *  by a scaling factor and rounded to the closest long. */
public class ScaledFloatFieldMapper extends FieldMapper implements AllFieldMapper.IncludeInAll {
```
Member

Just a question: would it be possible to extend from LongFieldMapper? Would be nice to have some code reuse.

Contributor Author

I thought about it when working on this PR but in the end it made things more complicated since this mapper partially needs to behave as a long field and as a double field.

Member

Cool, I can see how this would complicate things. I was just hoping this code reuse would be low-hanging fruit.

jpountz commented Jul 12, 2016

Updated numbers with https://issues.apache.org/jira/browse/LUCENE-7371:

| Mapping | Points disk usage (kB) | Doc values disk usage (kB) | Total (kB) |
| --- | --- | --- | --- |
| float | 40312 | 34180 | 74492 |
| half float | 23092 | 19532 | 42624 |
| scaled float (factor=4000) | 22792 | 14652 | 37444 |
| scaled float (factor=100) | 12984 | 9768 | 22752 |

@jpountz jpountz force-pushed the feature/scaled_floats branch from 4be06bf to 398d70b Compare July 18, 2016 11:37
@jpountz jpountz merged commit 398d70b into elastic:master Jul 18, 2016
@jpountz jpountz deleted the feature/scaled_floats branch July 18, 2016 12:05
Bargs added a commit to Bargs/kibana that referenced this pull request Jul 22, 2016
Elasticsearch added a couple of new numeric datatypes, which means we
need to update our type casting list to include them. Kibana should
see them as "numbers" so they work properly in searches and aggs.

Fixes elastic#7782
Related elastic/elasticsearch#18887
Related elastic/elasticsearch#19264
tsg pushed a commit to tsg/beats that referenced this pull request Aug 2, 2016
Elasticsearch has recently added scaled_float as an option for storing floating
point numbers. The scaled floats are stored internally as longs, which means
they can take advantage of the integer compression in Lucene. See
elastic/elasticsearch#19264 for details.

The PR moves all percentages to scaled floats. In our `fields.yml` we assume a
default scaling factor of 1000, which should work well for our percentages
(values between 0 and 1). This scaling factor can also be set to a different
value in `fields.yml`.
ruflin pushed a commit to elastic/beats that referenced this pull request Aug 2, 2016
airow pushed a commit to airow/kibana that referenced this pull request Feb 16, 2017
Former-commit-id: 298ee35