Add support for BitRound quantization by number-of-significant-bits? #2225

Closed
czender opened this issue Feb 15, 2022 · 13 comments · Fixed by #2232

Comments

@czender
Contributor

czender commented Feb 15, 2022

netCDF-C needs (IMHO) one more lossy codec: BitRound. BitRound takes the Number of Significant Bits (NSB), not digits (NSD), as input, and performs quantization with IEEE rounding on the remaining bits, aka the "keep bits". BitRound is already used as the final step in GranularBR, thus the heart of the BitRound algorithm is already in netCDF-dev. However, I think we need to make BitRound separately invocable so users can directly specify NSB (not NSD). BitRound will use the same NSB for all values of a given variable, unlike GBR, in which each value of a given variable can have a different (internally decided) NSB. I added a separately invocable BitRound to NCO and to CCR last month, in case anyone wants to try it.
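
For anyone who wants to see the mechanics, here is a minimal sketch of BitRound applied to a single float32 value. It is an illustration under stated assumptions, not the code in NCO, CCR, or netCDF-C: the function name `bitround_f32` is hypothetical, NSB is assumed to count explicit mantissa bits (1 to 23), and rounding is done by adding half of the last kept bit and then masking, so ties round away from zero rather than to even.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a netCDF-C API): keep the top `nsb` explicit
 * mantissa bits of an IEEE 754 binary32 value and round away the rest. */
static float bitround_f32(float val, int nsb)
{
    const int mantissa_bits = 23;      /* explicit mantissa bits in binary32 */
    uint32_t bits;

    assert(nsb >= 1 && nsb <= mantissa_bits);
    if (nsb == mantissa_bits)
        return val;                    /* nothing to discard */

    memcpy(&bits, &val, sizeof bits);  /* reinterpret without aliasing issues */

    /* Leave NaN and infinity untouched (exponent field all ones). */
    if ((bits & UINT32_C(0x7F800000)) == UINT32_C(0x7F800000))
        return val;

    const int drop = mantissa_bits - nsb;                  /* bits to zero */
    const uint32_t keep_mask = (uint32_t)(~UINT32_C(0) << drop);
    const uint32_t half = UINT32_C(1) << (drop - 1);       /* half of last kept bit */

    bits += half;       /* round: the carry may propagate into the exponent */
    bits &= keep_mask;  /* shave: zero every discarded bit */

    memcpy(&val, &bits, sizeof val);
    return val;
}
```

With this sketch, `bitround_f32(3.14159265f, 3)` gives 3.25, and `bitround_f32(1.75f, 1)` gives 2.0 because the rounding carry bumps the exponent. Every value of a variable gets the same mask, which is what leaves a constant-length run of trailing zero bits for the lossless compressor.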

The difference in compression ratio (CR) between losslessly compressing (e.g., with Zstd) a uniformly quantized/rounded variable (e.g., with BitRound or BitGroom) and losslessly compressing a variable quantized/rounded value-by-value (e.g., with Granular BitRound) can be significant (~5%, though this needs testing). This is because lossless compressors can more easily recognize/compress the set of trailing zero-bits in the mantissa when the number of those zero bits does not change. Setting NSB may appeal more to computer science types than to domain researchers, who might be more comfortable with setting NSD than NSB. BitRound is also used as the final (and easiest) step in the method of Klöwer et al. (2021) https://doi.org/10.1038/s43588-021-00156-2

This issue is a place to discuss any related feedback. I have started to draft a PR and would appreciate if @DennisHeimbigner might opine on the general idea, and also one specific question: if BitRound were to go into netCDF-C, should the PR re-use the structure member var->nsd to also hold the NSB, or should the PR add a new member var->nsb, or should it rename the structure member, or what?

FWIW, I would recommend everyone use either GranularBR or BR depending on their tastes. BitGroom just doesn't cut the mustard against these two, though it has been a useful starting point with a nice peer-reviewed journal reference. I'm also willing to craft this PR to replace BG with BR thereby reducing the # of quantization codecs from 3 to 2. Feedback welcome!

@DennisHeimbigner
Collaborator

Is there a universally used conversion from number-of-digits to number-of-bits?

@czender
Contributor Author

czender commented Feb 15, 2022

Not in this case, i.e., not with optimized algorithms. The conversion factor of M_LN10/M_LN2 = 3.32 is the average number of bits per digit. However, 3.32 is really the "marginal number of bits per digit", in that 3.32 mantissa bits buy about one digit of precision, after the first digit or so. In practice the NSD-preserving algorithms must use functions like ceil(), floor() and log2(), and off-by-one corrections to optimize (i.e., minimize) the number of bits retained while still keeping the guarantee that NSD digits are preserved. NSB-based algorithms, on the other hand, are relatively simple: The user specifies NSB, and everything beyond the NSB'th bit is quantized. NSB-methods are thus much more transparent than NSD-methods about what goes on "under the hood" (because not much goes on with NSB). Only the quantization method (shaving, setting, rounding...) sets one NSB algorithm apart from another. I implemented BitGroom and GranularBR with one integer (NSD) and BitRound with a different integer (NSB) because they are preserving different types of information (significant digits and kept bits, respectively). The whole point of GranularBR is that the number of bits kept depends on the unquantized value, since guaranteeing a specified NSD requires about 3 more bits for values that start with 9 than for values that start with 1. Also I think most people need or want to specify an exact (integral, not floating point) NSB based on other considerations of the data, whether those be storage constraints, information content, or shape-preservation considerations.
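
To make the 3.32 figure concrete, here is a rough sketch of the naive conversion that the optimized algorithms improve on. The function name `nsd_to_nsb_estimate` and the constant name are hypothetical, and the result is only the rounded-up product of NSD and log2(10). It is not the bit count that BitGroom or GranularBR actually retain, and it carries no guarantee that NSD digits survive for every value.

```c
#include <assert.h>

/* log2(10) = M_LN10 / M_LN2: the marginal number of bits per decimal digit. */
static const double BITS_PER_DIGIT = 3.321928094887362;

/* Hypothetical helper: naive estimate ceil(nsd * log2(10)).
 * Real NSD-preserving codecs apply per-value corrections on top of this. */
static int nsd_to_nsb_estimate(int nsd)
{
    assert(nsd >= 0);
    const double bits = nsd * BITS_PER_DIGIT;
    const int whole = (int)bits;                 /* truncates toward zero */
    return (bits > (double)whole) ? whole + 1 : whole;
}
```

Under this estimate, 3 digits map to 10 bits and 7 digits map to 24 bits, which already exceeds the 23 explicit mantissa bits of a float32. That is one reason an exact, integral NSB is the more transparent knob.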

@DennisHeimbigner
Collaborator

So use of the NSB attribute vs. the NSD attribute will be algorithm-specific?

@czender
Contributor Author

czender commented Feb 15, 2022

Yes, BitGroom and GranularBR use NSD. BitRound uses NSB.

@DennisHeimbigner
Collaborator

DennisHeimbigner commented Feb 15, 2022

If there is no way that an algorithm will use both nsb and nsd, then
reusing the nsd field seems ok to me.
At this point, I might consider replacing the three algorithm attributes with
a single attribute with a char value being the string name of the algorithm.

Never mind

@czender
Contributor Author

czender commented Feb 16, 2022

Yeah, that makes sense. It might be more extensible in the long run, though it requires two attributes not one if I understand you correctly, e.g.,

double foo1(time) ;
  foo1:QuantizeAlgorithm="GranularBitRound" ; 
  foo1:NumberOfSignificantDigits = 3 ;

double foo2(time) ;
  foo2:QuantizeAlgorithm="BitRound" ; 
  foo2:NumberOfSignificantBits = 3 ;

@edwardhartnett
Contributor

Whatever changes are going to be made, let's make them now, because @WardF is preparing a release...

@WardF
Member

WardF commented Feb 16, 2022

I'll be creating the v4.9.0-wellspring branch soon and will start sorting PRs into that branch for the 4.9.0 release. So there is, realistically, a little bit of time before the release. I certainly won't quietly mint a 4.9.0 release w/out these features :).

@edhartnett
Contributor

edhartnett commented Feb 16, 2022 via email

@czender
Contributor Author

czender commented Feb 18, 2022

Hey @DennisHeimbigner
I notice that main:libnczarr/zvar.c line 777 appears to allow only NC_QUANTIZE_BITGROOM,
not NC_QUANTIZE_GRANULARBR. Is this intentional? It does not appear to be. I am adding checks to it
for NC_QUANTIZE_GRANULARBR and NC_QUANTIZE_BITROUND in my upcoming PR. Please correct me if I'm wrong to do so.

@DennisHeimbigner
Collaborator

Thanks. Yes there appears to be some missing code in zvar.c. You can go ahead and fix it.

@czender
Contributor Author

czender commented Feb 23, 2022

FYI, PR #2232 to add the BitRound quantization is ready for review. If this is merged then netCDF will have two state-of-the-art quantization codecs (BitRound and Granular BitRound) and one last-generation codec (BitGroom). This PR adds BitRound; it does not remove BitGroom. The decision to keep or remove BitGroom is not mine to make since others have already, graciously, put effort into supporting BitGroom. I prepared some slides to explain the quantization codecs that have been proposed and their key differences, including pros/cons. You can preview the presentation here. Slides 4-9 are the key ones related to the BitGroom question. There's more to discuss than what's on the slides, and perhaps we can do that at our rescheduled meeting.

@edwardhartnett
Contributor

I do not think BitGroom should be removed. It should be faster than the other two methods, and that is a significant feature. Many users might prefer a faster, but less compressive solution.

Great work getting this in @czender and @WardF. On behalf of NOAA and the UFS: thanks.

This will be put to immediate use once the next version of netcdf-c is released, and will make it into the operational code in the next big UFS version cycle, where it will be helping compress data that are used by many members of the netCDF community!
