Enhance optimizeDictionary to optionally optimize var-width type cols #13994
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #13994 +/- ##
============================================
+ Coverage 61.75% 65.09% +3.34%
- Complexity 207 1533 +1326
============================================
Files 2436 2564 +128
Lines 133233 140778 +7545
Branches 20636 21611 +975
============================================
+ Hits 82274 91645 +9371
+ Misses 44911 42388 -2523
- Partials 6048 6745 +697
Thanks for adding this heuristic!
Overall lgtm. Can we add a test case/example where noDictionarySizeRatioThreshold is used for fixed length when both thresholds are specified?
...c/main/java/org/apache/pinot/segment/local/segment/index/dictionary/DictionaryIndexType.java
@@ -293,8 +293,9 @@ private boolean createDictionaryForColumn(ColumnIndexCreationInfo info, SegmentG
FieldIndexConfigs fieldIndexConfigs = config.getIndexConfigsByColName().get(column);
if (DictionaryIndexType.ignoreDictionaryOverride(config.isOptimizeDictionary(),
    config.isOptimizeDictionaryForMetrics(), config.getNoDictionarySizeRatioThreshold(), spec, fieldIndexConfigs,
    info.getDistinctValueCount(), info.getTotalNumberOfEntries())) {
    config.isOptimizeDictionaryForMetrics(), config.getNoDictionarySizeRatioThreshold(),
This argument list has become quite long. Suggest adding a new config class for dictionary-tuning-related configs.
Changes
noDictionaryCardinalityRatioThreshold: if config.optimizeDictionary is true, then Pinot will override dictionary encoding with raw encoding based on the condition cardinality / numDocs > noDictionaryCardinalityRatioThreshold.
optimizeDictionary behavior is otherwise unchanged.
Motivation
When storing log data, columns often contain many repeated values. Pinot's dictionary encoding usually provides better storage and query performance for such columns, so it is useful to take advantage of it. Dictionary-encoding high-cardinality columns, however, is prohibitively expensive in cost and storage, so we'd like to avoid applying dictionary encoding unless it is safe. Since column cardinality and values can change rapidly, we'd like to make these decisions within Pinot itself.
In our experience, cardinality is a good indicator of whether to dictionary-encode or raw-encode a column. With a 0.10 threshold (10%), we see roughly 40-60% improvement in storage compared to raw encoding everything.
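For illustration, the cardinality-ratio condition described under Changes can be sketched roughly as below. This is a minimal, standalone sketch and not the code in this PR: the class name CardinalityRatioSketch, the method name preferRawEncoding, and the handling of an empty segment are assumptions made for the example. Only the comparison cardinality / numDocs > noDictionaryCardinalityRatioThreshold comes from the description.

```java
/** Hypothetical sketch of the cardinality-ratio heuristic; not Pinot's actual implementation. */
public final class CardinalityRatioSketch {
  private CardinalityRatioSketch() {
  }

  /**
   * Returns true when raw encoding should replace dictionary encoding, i.e. when
   * cardinality / numDocs > threshold (the noDictionaryCardinalityRatioThreshold value).
   */
  public static boolean preferRawEncoding(int cardinality, int numDocs, double threshold) {
    if (numDocs <= 0) {
      // Assumption for this sketch: with no documents, keep the configured encoding.
      return false;
    }
    // Cast to double so the ratio is not truncated by integer division.
    return (double) cardinality / numDocs > threshold;
  }

  public static void main(String[] args) {
    // 50 distinct values in 1000 docs is a 5% ratio, below the 10% threshold.
    System.out.println(preferRawEncoding(50, 1000, 0.10));  // prints false
    // 500 distinct values in 1000 docs is a 50% ratio, above the 10% threshold.
    System.out.println(preferRawEncoding(500, 1000, 0.10)); // prints true
  }
}
```

With a 0.10 threshold, a column whose distinct values make up more than 10% of its rows would fall back to raw encoding in this sketch, while a low-cardinality column would keep its dictionary.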