
Helper function for saving compact integer datatypes #1046

Closed
neurolabusc opened this issue Aug 19, 2021 · 39 comments

Comments

@neurolabusc

I aided a nibabel user who wrote a function to segment a brain into 4 tissue types (0=non-brain, 1=gray matter, 2=white matter, 3=CSF). They were surprised by the huge disk space used by the resulting NIfTI images, which were saved with the DT_INT64 datatype. Beyond being large, this is a very unusual datatype for NIfTI (it did not exist in the ancestor Analyze format).

The native NumPy datatype for integers is int64, so nibabel is preserving this datatype. I do wonder if the Nifti1Image constructor could include an optional parameter that chooses the most compact data type when the input data is of an integer type (e.g. in this example, the voxel intensities range from 0..47, so the image could be saved losslessly as DT_UINT8, requiring 1/8th the disk space).

When I look at the nibabel examples, they all show sensible use, explicitly setting the dtype:

data = np.ones((32, 32, 15, 100), dtype=np.int16)

However, the rationale for these is never described. Perhaps the documentation could note that the choice of datatype impacts file size, and that some tools only support a limited range of datatypes (the classic Analyze data types are UINT8, INT16 and FLOAT32), while some tools can act in unexpected ways with other datatypes (e.g. AFNI promotes UINT16 to FLOAT32).

import nibabel as nib
import numpy as np

data = np.arange(4*4*3).reshape(4,4,3)
#data.dtype = 'int64'
#data values are integers 0..47
img = nib.Nifti1Image(data, affine=np.eye(4))
nib.save(img, 'test.nii')

data8 = np.arange(4*4*3, dtype=np.uint8).reshape(4,4,3)
img8 = nib.Nifti1Image(data8, affine=np.eye(4))
nib.save(img8, 'test8.nii')

data32 = np.arange(4*4*3, dtype=np.float32).reshape(4,4,3)
img32 = nib.Nifti1Image(data32, affine=np.eye(4))
nib.save(img32, 'test32.nii')
@effigies
Member

Thinking a little about this, I think we can do something like:

def get_min_dtype(img):
    arr = np.asanyarray(img.dataobj)
    min_dt = arr.dtype
    if np.issubdtype(min_dt, np.integer):
        min_dt = np.promote_types(np.min_scalar_type(arr.min()), np.min_scalar_type(arr.max()))
    return min_dt

And of course we could set with img.set_data_dtype(). What would your ideal API for invoking this behavior look like?
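
For concreteness, one way this could be used today (a sketch; get_min_dtype is just the helper above, not an existing nibabel function):

import numpy as np
import nibabel as nib

img = nib.Nifti1Image(np.arange(4 * 4 * 3).reshape(4, 4, 3), np.eye(4))
img.set_data_dtype(get_min_dtype(img))  # uint8 for values 0..47
nib.save(img, 'test_small.nii')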

@neurolabusc
Author

Related to able_int_type and its usage in test_casting.py. I think someone more familiar with nibabel could suggest the best API.

@matthew-brett
Member

Just coming back to this, as @neurolabusc just prodded me (thanks Chris).

Thinking about it quickly now, we have a few options for optional support (sketches).

Starting with arr = np.arange(4*4*3).reshape(4,4,3), we could shrink down to np.int8 with any of:

  • img = nib.Nifti1Image(as_min_int(arr), np.eye(4))
  • img = nib.Nifti1Image(arr, np.eye(4)); small_img = as_min_int_img(img)
  • img = nib.Nifti1Image(arr, np.eye(4), shrink_int_dtype=True)

Then we could do the shrinking automatically - as in shrink_int_dtype=True as the default - but that seems like too much of a behavior change.
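
For reference, a minimal sketch of what the as_min_int helper in the first option might look like (only shrinking within the integer family):

import numpy as np

def as_min_int(arr):
    # Sketch: cast to the smallest integer dtype that can represent both the
    # minimum and maximum values; non-integer arrays are returned unchanged.
    if not np.issubdtype(arr.dtype, np.integer):
        return arr
    min_dt = np.promote_types(np.min_scalar_type(arr.min()),
                              np.min_scalar_type(arr.max()))
    return arr.astype(min_dt)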

Any thoughts about those options?

@neurolabusc
Author

neurolabusc commented Jan 19, 2022

My sense would be to shrink integers automatically unless the user disables the feature. My sense is that float types are used when precision is needed, while integers are used by nibabel users for segmentations (3-6 discrete levels), labels (thousands of regions at most) and masks (binary). In each case, UINT8 or INT16 would be sufficient. Saving data as INT64 means that the resulting images are incompatible with many popular tools: AFNI, FreeSurfer, Slicer 3D, SPM12, etc. INT64 also has implications for disk space and for RAM when the images are loaded.

As someone who gets feedback from nibabel users, it seems like there are clear unintended consequences to the current default of saving as INT64. I actually cannot think of a real-world situation where someone would need INT64 data, and such peculiar situations would not be compatible with the standard tools in our field anyway. To me, the default behavior should be compatible with the popular tools in our field.

@matthew-brett
Member

matthew-brett commented Jan 19, 2022

Other options I can think of, with less impact on changing default behavior:

  • Truncating int64 to int32 by default.
  • Adding dtype aliases like dtype='mask' (int8), dtype='classification' (int8) and dtype='label' (int16, unless any number is out of range).

Chris - would uint8 or uint16 cause compatibility problems do you think?

@neurolabusc
Author

neurolabusc commented Jan 20, 2022

This table shows support in AFNI, FSL, FreeSurfer (FrSr), MRIcroGL (MRGL) and SPM. For each cell I indicate whether the tool supports Read (R), Write (W) and Internal use (I) of the datatype. The datatypes marked with an asterisk (*) were supported by the ancestor Analyze format, and have historically been the most popular and generally the best supported.

| Type | AFNI | FSL | FrSr | MRGL | SPM |
| --- | --- | --- | --- | --- | --- |
| DT_INT8 | R | R | R | R | RW |
| DT_UINT8* | RWI | RW | RWI | RWI | RW |
| DT_INT16* | RWI | RW | RWI | RWI | RW |
| DT_UINT16 | R | R | R | R | RW |
| DT_INT32* | RW | RW | RWI | RW | RW |
| DT_UINT32 | R | R | R | R | RW |
| DT_INT64 | | | | R | |
| DT_UINT64 | | | | R | |
| DT_FLOAT32* | RWI | RWI | RWI | RWI | RW |
| DT_FLOAT64 | R | RWI | R | R | RWI |

In general, if a tool does not internally support a datatype, the data are converted internally to a supported FLOAT type. For AFNI and fslmaths (using the default -dt float) this rounds extreme integer values due to flintmax. For completeness, there are exceptions to this rule, e.g. MRIcroGL will adjust the scale intercept to losslessly store UINT16 internally as INT16.

This promotion means that memory demands and disk usage can be impacted by the datatype choice.

Historically, MRIs used a 12-bit ADC, and therefore the DICOMs stored 12 bits per pixel, which worked nicely with the popular INT16 datatype. Recently, many leading-edge scanners have moved to a 16-bit ADC, with the resulting magnitude images often scaled to use the full UINT16 range (0..65535, even if this precision is beyond the SNR of the raw data). This is a challenge for tools like dcm2niix: should the UINT16 data be converted to the well supported FLOAT32, saved as the poorly supported UINT16, or saved as INT16 when the maximum value is no more than 32767 and UINT16 otherwise?

These choices have had rare unintended consequences that are directly relevant to this discussion. For example, consider an fMRI study where each individual completes two series of fMRI runs. It is possible (though unlikely) that the MRI reconstructor will generate magnitude images for one run in the range 0..32767, while the other run generates some voxels that exceed 32767. In this case, it seems efficient to store one as INT16 (for better legacy support) while the other requires either UINT16 or FLOAT32. A tool like AFNI will promote UINT16 to FLOAT32, while natively supporting the INT16. Unfortunately, AFNI requires that all first-level fMRI data have the same datatype, and will otherwise create bogus statistical maps.

My own suggestion is to stick with the legacy datatypes (listed with * in the table) as these have the most widespread support.

@mrneont

mrneont commented Feb 8, 2022

@effigies: this issue came up again in a recent discussion. Is there any further thought on how nibabel will output integers? Having int64 be the default int type for outputs, regardless of input type or min/max int value, seems undesirable on many fronts. It increases users' disk and memory usage unnecessarily, creates inconsistency of default I/O types (that is, input and output types for simple operations do not match) and creates software interoperability problems.

For example, it would seem to make sense that if a dataset comes in with a particular type and the operation performed doesn't require changing that type, then the output dataset should keep the input type? *Asterisk to prove the rule: now that int64 dsets have been created, though, it would seem that in most cases propagating these unnecessarily---i.e., when the extremal values do not actually require int64---would not be desirable.

Starting by trying to have the output match the input type, but then adapting/"guessing" a necessary alternative, seems like it might be preferable and not too bad to implement?

@effigies
Member

effigies commented Feb 8, 2022

I do get the point that int64 doesn't have obvious use cases, so it does seem sensible to provide a way for users to say "cast ints down", but I feel that changing the default behavior is problematic (see below).

Matching input type is pretty straightforward with nibabel, but it is the programmer's responsibility to track inputs:

img = nb.load(fname)
new_data = ... # derive from img.dataobj or generate it somehow
new_img = nb.Nifti1Image(new_data, img.affine, img.header)
# Make any additional changes to the header, if needed
new_img.to_filename(new_fname)

Nibabel has no concept of "input type" if the header of an input image is not passed to a new image, but if it is, I think the behavior is what you would expect.

The problem is more what happens when someone doesn't provide a header that lets nibabel know how to interpret the data array. They can set_data_dtype() if they like, or have it set automatically by coercing the data array before creating a new image with it. Downcasting by default will be a significant and possibly unwelcome change in behavior to people who are used to nibabel doing what it's told.
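
Concretely, either of these explicit approaches works today (a sketch using the toy data from earlier in the thread):

import numpy as np
import nibabel as nib

data = np.arange(4 * 4 * 3).reshape(4, 4, 3)  # int64 on most platforms

# Option 1: keep the int64 array but record uint8 as the on-disk dtype
img = nib.Nifti1Image(data, np.eye(4))
img.set_data_dtype(np.uint8)
nib.save(img, 'test_uint8.nii')

# Option 2: coerce the array itself before constructing the image
img2 = nib.Nifti1Image(data.astype(np.uint8), np.eye(4))
nib.save(img2, 'test_uint8_too.nii')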

Consider if I actually did want an int64 array. Right now I would write:

>>> img = nb.Nifti1Image(np.arange(4*4*3, dtype='int64').reshape(4,4,3), np.eye(4))
>>> img.get_data_dtype()
dtype('<i8')
>>> img.set_data_dtype("int64")
>>> img.get_data_dtype()
dtype('<i8')
>>> new_img = nb.Nifti1Image.from_bytes(img.to_bytes())  # round-trip through an in-memory filehandle
>>> new_img.get_data_dtype()
dtype('<i8')

Note that being explicit does not produce a different result in the image from receiving an array of Python integers and converting them to a numpy array. So if we change the default behavior, people who have been going out of their way to be explicit may stop getting what they expect.


All this aside, @neurolabusc's earlier point that the documentation can be improved is quite true. The only example where we pass an existing header (that I was able to find quickly, anyway) is https://nipy.org/nibabel/nifti_images.html#default-sform-and-qform-codes.

@jbteves

jbteves commented Feb 8, 2022

Hi @effigies, @neurolabusc kindly brought this conversation to my attention.
I feel that in some way the nibabel API's default behavior did change, when the output of a Nifti1Image dataobj became int64 rather than the int32 it was before. We already had to deal with this in tedana (see here for when I had to add a small patch), and to some degree I'd expect similar problems in other packages. It is becoming more common for users to use tools like nilearn, which uses this API, and then go back into some other package. Having to manually track data types is probably not going to be intuitive for most of them, and I can think of at least one case where I've had to walk somebody through why this is happening (since there is no obvious indication of what happened to an end user until they receive a load failure).
Perhaps a fair API change would be to add a method that casts the data object into the same datatype as what's in the NIfTI header, something like into_ndarray? That way the user is guaranteed a type-clear way of getting their data into the commonly used ndarray format.
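
A rough sketch of the idea, written as a free function for illustration (a real implementation would be a method and would need to handle scale factors):

import numpy as np

def into_ndarray(img):
    # Return the data as an ndarray cast to the dtype recorded in the NIfTI
    # header, i.e. what a save/load round trip would hand back (ignoring
    # scale factors).
    return np.asanyarray(img.dataobj).astype(img.get_data_dtype())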

@matthew-brett
Member

I don't think the behavior of Nibabel has changed. Maybe you were using 32-bit Windows at one point, where the default NumPy integer is 32-bit? Otherwise the default is 64-bit - and Nibabel - as far as I know - has always just accepted the data type you pass it, unless the header tells it what type you have.

I think I don't get what you're suggesting with the into_ndarray API - or maybe - how it relates to the int64 default NumPy datatype. I think you mean that if the header says that the on-disk storage type is int8 for example, that there should be a way of getting the data as np.int8 datatype - I guess throwing away any scalefactors? If the image does not have scalefactors, np.array(img.dataobj) does what you want.

@effigies
Member

effigies commented Feb 8, 2022

I feel that in some way the nibabel API's default behavior did change, when the output of a Nifti1Image dataobj became int64 rather than int32 before.

The logic for np.asanyarray(img.dataobj) should not have changed, and int64 should only result if the on-disk dtype is int64 or you pass data such that np.array(data).dtype == np.dtype('int64'). (You can get floats out of on-disk ints if there are scale factors.) If you can reproduce a situation where int32 is coerced up to int64, this is a bug, not an intentional API change.

Perhaps a fair API change would be to add a method to the API that would cast the data object into the same datatype as what's in the NIFTI header, something like into_ndarray? Then this way the user is guaranteed a type-clear way of getting their data into the commonly used ndarray format.

If I'm understanding you correctly, this aims to solve common problems with chaining operations on in-memory images, which do not enforce that img.dataobj.dtype == img.get_data_dtype()? I could imagine something equivalent to (but hopefully more efficient than) the following to ensure that an in-memory image appears identical to what it would look like if freshly read from disk:

class Nifti1Image:
    def refresh(self):
        return self.from_bytes(self.to_bytes())

But I feel like we're going a bit afield of the issue at hand. These issues seem connected in that there are expectation violations. I think the basic problem is that most suites read/write to disk at every point of user intervention, while working in Python allows you to defer the serialization (and possible precision loss) until the last possible moment.

@neurolabusc
Author

Pragmatically, the current behavior of nibabel is incompatible with the most popular NIfTI tools in the field. In practice, it is causing grief to a lot of users. If the nibabel team really wants to support this data format, they should develop and validate patches for the leading tools in our field and submit pull requests (SPM, FSL, AFNI, FreeSurfer). The current behavior is not in the spirit of a popular interchange format. I have updated my software simply to stop emails from people who use my tools. However, this is fixing the symptom: my tools are small and simple, and validating solutions for them is much easier than for larger projects.

As nilearn gains popularity, its dependency on nibabel is generating a large number of images that are not compatible with other tools. NIfTI is an interchange format, and while legal, the DT_UINT64 and DT_INT64 data types are not used by other tools. In the same way, I can create legal DICOM images whose transfer syntax is unsupported by virtually every tool. While such an image is technically legal, it will frustrate users. I would argue that nibabel should never save INT64 data. A nice solution would be: if a NIfTI image is requested with dtype='int64', nibabel uses the most compact of the common integer datatypes (DT_UINT8, DT_INT16 and DT_INT32). If the data range exceeds DT_INT32, I think nibabel should generate a data dtype "int64" not supported error. In practice, this is not a supported NIfTI format, and it breaks compatibility with many other tools. There is already a precedent for this: DT_BINARY is a legal NIfTI data type, yet it is unsupported by virtually every tool, and nibabel prevents a user from creating this technically legal but pragmatically incompatible image:

import numpy as np
import nibabel as nib
data2 = np.array([1,0,1,0,1,0,1,1,1], dtype=bool).reshape(3,3,1)
img2 = nib.Nifti1Image(data2, affine=np.eye(4))

nibabel.spatialimages.HeaderDataError: data dtype "bool" not supported

I want to be clear that I cannot think of a single real-world situation in our field where a user would prefer an integer exceeding 2147483647 to a floating-point representation. Even in such an extreme edge case, one would think the FLOAT64 flintmax of 9007199254740992 would be sufficient.

The nibabel developers need to be mindful of the knock-on effects of their implementation choices. There are ripple effects for developers of other tools in situations where the user misdiagnoses the source of an error.

@jbteves

jbteves commented Feb 8, 2022

My apologies, np.asanyarray works as intended and this is a nilearn behavior in our specific implementation, which I realize must be frustrating to read since you're not nilearn (been there). That said...

If I'm understanding you correctly, this aims to solve common problems with chaining operations on in-memory images, which do not enforce that img.dataobj.dtype == img.get_data_dtype()?

I do believe it is pertinent to the issue, since what you're suggesting so far is that users must check for themselves that an int64 will not be saved. My understanding of nibabel.save is that it will select the data type of the array, rather than the image data type. I believe that this is an undue and confusing burden on the user, who may believe that checking the type of the dataobj or the header would be interchangeable in the Nifti1Image API (and, from my read, the API docs do not indicate that these are not interchangeable). A read of the Nifti1Image constructor also does not indicate that a header-array mismatch would not be reconciled. At the very least, a Warning could indicate to users that their array is not the correct datatype for their supplied header.
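
For instance, something as simple as this check at construction or save time would make the mismatch visible (a hypothetical helper, not part of nibabel):

import warnings
import numpy as np

def warn_on_dtype_mismatch(img):
    # Warn when the in-memory array dtype differs from the dtype recorded in
    # the header, which is what nibabel will write to disk.
    arr_dtype = np.asanyarray(img.dataobj).dtype
    hdr_dtype = img.get_data_dtype()
    if arr_dtype != hdr_dtype:
        warnings.warn(
            f"Array dtype {arr_dtype} does not match header dtype {hdr_dtype}; "
            f"the data will be cast to {hdr_dtype} when saved."
        )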

I agree with the above assessment by @neurolabusc with the exception that I think it may be reasonable to raise a warning rather than fail to support it outright. However, I do think that because most tools do not support this data type, and because it is very rare that such a datatype is even desirable to write to disk, it is imperative that users be made aware of the potential compatibility issues.

I would be happy to help implement either solution.

@effigies
Member

effigies commented Feb 8, 2022

I disagree with completely disabling (u)int64. I don't see it as nibabel's job to prevent people from writing data that others can't read, but I do see the value in defaulting to things they can. I would propose we add a dtype parameter to Nifti1Image.to_filename() (and all related methods) that would have the following semantics:

  • If dtype is None (default):
    • Check img.get_data_dtype(). If (U)INT64, coerce down to the smallest integer that can encode values.
  • Else use np.dtype(dtype), and the user accepts the consequences.

There might be some corner cases that need to be handled related to rescaling floats, but would this suffice? I think in practice it will maintain current behavior except for int64, and so be minimally disruptive. We could also include a "smart" option like dtype="compat" that will force into analyze-compatible or error.
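
In pseudo-code, the dtype=None branch might look roughly like this (a sketch of the proposed semantics, not the eventual implementation):

import numpy as np

def _resolve_output_dtype(img, dtype=None):
    # Sketch of the proposed to_filename(dtype=...) semantics.
    if dtype is not None:
        return np.dtype(dtype)  # the user accepts the consequences
    dt = img.get_data_dtype()
    if np.issubdtype(dt, np.integer) and dt.itemsize == 8:
        # (U)INT64: coerce down to the smallest integer type that can
        # encode the actual values
        arr = np.asanyarray(img.dataobj)
        return np.promote_types(np.min_scalar_type(arr.min()),
                                np.min_scalar_type(arr.max()))
    return dt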

@jbteves

jbteves commented Feb 8, 2022

From my point of view that's perfect.

@neurolabusc
Author

@effigies your proposal sounds reasonable to me.

@matthew-brett
Member

My understanding of nibabel.save is that it will select the data type of array, rather than image data type.

Just to clarify - if the dtype is stored in the image header, then Nibabel will save as that dtype, rather than the array dtype:

[ins] In [15]: hdr = nib.Nifti1Header()
[ins] In [16]: hdr.set_data_dtype(np.int8)
[ins] In [17]: arr = np.zeros((2, 3, 4), dtype=np.int64)
[ins] In [19]: nib.save(nib.Nifti1Image(arr, np.eye(4), hdr), 'out.nii')
[ins] In [21]: img = nib.load('out.nii')
[ins] In [22]: img.get_data_dtype()
Out[22]: dtype('int8')
[ins] In [23]: np.array(img.dataobj).dtype
Out[23]: dtype('int8')

So, if Nibabel knows about the dtype, it preserves it. The situation that's causing the trouble is when Nibabel knows nothing about the dtype except the dtype of the array.

@matthew-brett
Member

@effigies - I think the problem we need to think about here is:

arr = np.arange(24).reshape(2, 3, 4)  # Will be int64, almost always
nib.save(nib.Nifti1Image(arr, np.eye(4)), 'test.nii')
arr_back = np.array(nib.load('test.nii'))  # Will be int64, as things stand
my_vals = arr_back * 100   # Oops

As things are, all will be well, and we'd get:

Out[29]: 
array([[[   0,  100,  200,  300],
        [ 400,  500,  600,  700],
        [ 800,  900, 1000, 1100]],

       [[1200, 1300, 1400, 1500],
        [1600, 1700, 1800, 1900],
        [2000, 2100, 2200, 2300]]])

If we quietly truncate to e.g. uint8, we get:

[ins] In [31]: arr_back.astype(np.uint8) * 100
Out[31]: 
array([[[  0, 100, 200,  44],
        [144, 244,  88, 188],
        [ 32, 132, 232,  76]],

       [[176,  20, 120, 220],
        [ 64, 164,   8, 108],
        [208,  52, 152, 252]]], dtype=uint8)

Lots of integer overflow because of the working np.uint8 datatype.

So, this could easily cause some nasty and unexpected breakage, for people using img.dataobj. And I suspect that is fairly common.

@neurolabusc
Author

@matthew-brett the nibabel use cases I have seen for integers were masks and segmentation maps. For these, one might expect binary operations, but not the sort of manipulation you describe. Common tools in our field like fslmaths and SPM distinguish the data type used for computation from the data type stored on disk (the -dt and -odt options in fslmaths), using floating point for the computation.

@mrneont

mrneont commented Feb 8, 2022

It sounds analogous to reading in floats, doing intermediate calcs with doubles to not lose precision in the intermediate steps, and then outputting (as presumably desired) floats.

Sure, there will be data conversion issues, depending on operations involved. I did phrase my suggestion as:
For example, it would seem to make sense that if a dataset comes in with a particular type and the operation performed doesn't require changing that type, then the output dataset should keep the input type?

(Sidenote: with my nibabel v3.2.1, I had to do the following to get the above example to work without error:

arr = np.arange(24).reshape(2, 3, 4)  # Will be int64, almost always
nib.save(nib.Nifti1Image(arr, affine=np.eye(4)), 'test.nii')
arr_back = np.asanyarray(nib.load('test.nii').dataobj) # Will be int64, as things stand

print(arr_back * 100)
print(arr_back.astype(np.uint8) * 100)

)

Python does automatic type recasting in some cases---consider either output above if one multiplied by 100. instead of by 100; similarly, if one divides by 100, or even by 1 (in Python 3, true division yields floats). The issue in the above case is that Python won't automagically promote within the int family, even though it "should" in order not to lose values here.

It would seem fine to do intermediate calcs using int64, sure, but then to have the output type be a more appropriate int varietal.

@matthew-brett
Member

@mrneont - yes - sure - the point I was making was that the output type often becomes the type used for intermediate calculations - and that, with this backwards-incompatible change, this will appear as silent breakage from integer overflow. I do take Chris' point that this will probably be rare - but it's still a very bad thing to do, to silently change calculations.

I think we've got to find some way of preventing that happening, and in practice, I think that means we've got to find some way of warning then breaking here. I would vote strongly for avoiding silent breakage - can argue further if that's controversial.

For example - we could do something like:

# Warn
arr = np.arange(24).reshape(2, 3, 4)  # Will be int64, almost always
img = nib.Nifti1Image(arr, np.eye(4))
# Maybe
# Warning: using data type np.int64 from array, but this is poorly supported
# by other Nifti tools.  This will cause an error in a future version of Nibabel
# Silence this warning with
# Nifti1Image(arr, np.eye(4), dtype=np.int64)
# if you really want the int64 output type, or specify another datatype, e.g.
# Nifti1Image(arr, np.eye(4), dtype=np.int8)
# or
# Nifti1Image(arr, np.eye(4), dtype='smallest-int')
# See docstring for details

@matthew-brett
Member

Sorry - just to clarify - later:

img = nib.Nifti1Image(arr, np.eye(4))

would go straight to an error, with the message above. Therefore, the results from Nibabel code now will stay the same in later versions, or generate a specific error that gives the fix - and we never generate data that can silently result in different results for later versions of Nibabel.

@effigies
Member

Okay, so if I understand, @matthew-brett you would rather add a dtype parameter and have this warning emitted at image creation time, rather than at serialization time. That would be fine with me.

Would it be useful to have both? Then the equivalent of -dt double and -odt float would be:

img = nb.Nifti1Image(data, affine, dtype='float64')
final_img = some_func(img)
final_img.to_filename(fname, dtype='float32')

This way the code does not need to know in detail what some_func does to generate a new image.


Regarding using np.asanyarray(img.dataobj) instead of img.get_fdata(), this has always been a "You know what you're doing" mode, so if someone does fetch uint8s out of a label image and multiply them out of range, that seems to be on them. I'm okay with only casting int64 down to int32 by default, though.

@neurolabusc
Author

I do think casting int64 to int32 would prevent a lot of compatibility issues. The range of int32 should avoid any reasonable overflow. From my perspective, DT_BINARY and DT_INT64 both exist in the specification, but are virtually unsupported by tools. Since NIfTI is our field's lowest-common-denominator interchange format, we need to be wary of how other tools handle the data. My sense is that if an advanced user really wanted INT64, they would be best served storing it in .npy, not NIfTI.

@matthew-brett
Member

@effigies - yes - I think this is an image instantiation-time thing - where we don't have any idea of the data type, and we have gone, up until now, with the array data type.

I think I didn't follow about the extra output check. What would that solve? Is the idea that some_func in your example might accidentally make an int64 image? Wouldn't it hit the same warning in that case?

For the np.asarray(img.dataobj) - I agree that the user needs to accept some risk - but I'm really worried about the possibility that they thought they were accepting one risk, and tested against that - but another risk appeared and started breaking their scripts, silently. Here's some discussion of the rule I was thinking of for back-compatibility: https://mail.python.org/archives/list/scikit-image@python.org/message/UYARUQM5LBWXIAWBAPNHIQIDRKUUDTEK/

Summary:

Under (virtually) no circumstances should new versions of a scientific
package silently give substantially different results for the same
function / method call from a previous version of the package.

@matthew-brett
Member

We could also add the ability to do something like:

from nibabel.__future__ import int64_error

which would upgrade the warning to an error immediately.
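
One way this could work under the hood (a sketch only, relying on PEP 562 module-level __getattr__; the module name comes from the suggestion above and nothing here exists in nibabel today):

# nibabel/__future__.py (illustrative)
_flags = set()

def __getattr__(name):
    # "from nibabel.__future__ import int64_error" lands here and records
    # that the user opted in to the stricter behavior.
    if name == "int64_error":
        _flags.add(name)
        return True
    raise AttributeError(name)

def errors_on_int64():
    # Queried by Nifti1Image at instantiation time to decide warn vs. raise.
    return "int64_error" in _flags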

@effigies
Member

I think I didn't follow about the extra output check. What would that solve?

The issue here is that the on-disk and in-memory formats serve two different purposes. If I'm passing around in-memory images from function to function, I may want to ensure that I have large dtypes to avoid overflow/precision-loss issues, but then still save to some smaller dtype for space/compatibility reasons.

I think the float64-in-memory -> float32-at-output case is more likely than the int64 -> int32 one, but I'm trying to think of a consistent interface. In any event, warning that somebody is about to write an unreadable (by everybody but us) file seems reasonable, even if most of the time it can be caught at instantiation.

For the np.asarray(img.dataobj) - I agree that the user needs to accept some risk - but I'm really worried about the possibility that they thought they were accepting one risk, and tested against that - but another risk appeared and started breaking their scripts, silently.

This is a fair point. Here's a process that includes both warnings at instantiation/write:

>>> nb.Nifti1Image(np.array([[[0, 1, 2]]]), np.eye(4))
UserWarning: "Image data has type int64, which may cause incompatibilities with other tools. \
Converting to int32. To silence this warning, pass a dtype parameter to Nifti1Image()."
>>> nb.Nifti1Image(np.array([[[2147483648]]]), np.eye(4))
UserWarning: "Image data has type int64, which may cause incompatibilities with other tools. \
Converting to uint32. To silence this warning, pass a dtype parameter to Nifti1Image()."
>>> nb.Nifti1Image(np.array([[[9223372036854775807]]]), np.eye(4))
UserWarning: "Image data has type int64, which may cause incompatibilities with other tools. \
Data range exceeds 32-bit range. To silence this warning, pass a dtype parameter to Nifti1Image()."
>>> nb.Nifti1Image(np.array([[[0, 1, 2]]]), np.eye(4), dtype='int64').to_filename("/tmp/a.nii")
FutureWarning: "Image data has type int64, which may cause incompatibilities with other tools. \
In the future, this will cause an error unless dtype='int64' is passed. Pass a dtype parameter \
to set the data type and suppress this warning."

We could also add the ability to do something like:

from nibabel.__future__ import int64_error

which would upgrade the warning to an error immediately.

I like this idea.

@matthew-brett
Member

The issue here is that the on-disk and in-memory formats serve two different purposes. If I'm passing around in-memory images from function to function, I may want to ensure that I have large dtypes to avoid overflow/precision-loss issues, but then still save to some smaller dtype for space/compatibility reasons.

I'd expect a function that can accept arbitrary images to choose its working precision - otherwise it's going to run into lots of trouble for int16 etc.

I do see the case for the float32 output though.

But in any case - can we get agreement at least on the warn -> raise for image instantiation + future import combination? Maybe we can make another issue for the output dtype warn -> raise, if that's not completely resolved?

@effigies
Member

But in any case - can we get agreement at least on the warn -> raise for image instantiation + future import combination?

That's fine with me.

Maybe we can make another issue for the output dtype warn -> raise, if that's not completely resolved?

TBH I think I prefer it as an interface to set_data_dtype(), but agree that it can be treated separately.

@mrneont

mrneont commented Feb 10, 2022

So, then is the idea that nibabel will:

  1. read in a dset and forget the input datatype
  2. work in any precision internally (e.g., int64)
  3. output the dset with some appropriate datatype, independent of the input dset.
    That is, the datatype from Step 1 is never propagated?

From the discussion above, I only see int32 mentioned---will that be the default type for any integer-valued data? So many datasets people have fit the short range---most parcellations, tissue segmentations and ROI maps (if not byte, for masks). Would it be worth checking the to-be-output dset range for a smaller-footprint datatype (like short or byte)?

@matthew-brett
Member

@mrneont - no - the only thing we're talking about here is a warn then error when creating a new in-memory image.

At the moment, if the user instantiates a Nifti1 image with a default Numpy array of integers, which is now nearly always int64, they'll get an int64 on disk when they save the image, unless they do something to prevent it.

Here the current proposal is to warn when this happens, and later, error, unless they tell us they specifically want to do that.

But we weren't proposing any change in the rest of the Nibabel API, which is to read in the dataset, keeping track of its output dtype in the header, and, if the user preserves the header, return to the same output dtype when saving.

I think you're suggesting checking whether the image contains e.g. numbers > 127 when saving with a np.int8 datatype? We could do that - at the moment, I think the behavior is just to use scalefactors to scale the data to within range, but of course that will often lose precision.

@mrneont

mrneont commented Feb 11, 2022

Would something like the following be useful for integer-family arrays? For a list of datatypes in increasing order, it finds an appropriate one for an input integer-dtype array x, and outputs a new array with that simpler type. The shape+values of the output array should match those of the input, but the dtype should be most efficient (as defined by the sequence within list_dts).

Now edited to have list_dts contain uint8 instead of int8, on the sage advice of @neurolabusc, given the list of datatypes and software I/O-ability up above.

import sys
import numpy as np

list_dts = ['uint8', 'int16', 'int32', 'int64']

def cast_int_arr_to_min_type(x, verb=0):
    '''Output int-valued array with same shape and values, but using the minimal
    scalar type in Python

    Parameters
    ----------
    x : np.ndarray
        A NumPy array of integer dtype of any shape.
    verb : int
        Verbosity level

    Return
    ------
    y : np.ndarray
        A NumPy array with the same shape+values as x, but using minimal  
        integer dtype (as defined in list_dts).
    '''
    
    if type(x) != np.ndarray :
        print("** ERROR: input must be a NumPy array, not: {}".format(type(x)))
        sys.exit(2)
    elif not np.issubdtype(x.dtype, np.integer) :
        print("** ERROR: array dtype must be int, not: {}".format(x.dtype))
        sys.exit(2)

    ndts = len(list_dts)

    xmin = x.min()
    xmax = x.max()

    n = 0
    while n < ndts :
        dt  = list_dts[n]
        idt = np.iinfo(dt)
        if idt.min <= xmin and xmax <= idt.max :
            break
        n+=1
    
    if n >= ndts :
        print("** ERROR: could not find datatype to fit: {}, {}"
              "".format(xmin, xmax))
        return None

    y = x.astype(dt)

    if verb :
        print("++ Input dtype  : {}".format(x.dtype))
        print("++ Output dtype : {}".format(y.dtype))

    return y
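
For example (hypothetical usage, reusing the toy array from earlier in the thread):

x = np.arange(4 * 4 * 3).reshape(4, 4, 3)  # int64 on most platforms
y = cast_int_arr_to_min_type(x, verb=1)
# ++ Input dtype  : int64
# ++ Output dtype : uint8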

@neurolabusc
Author

@mrneont I would change

list_dts = ['int8', 'int16', 'int32', 'int64']

to read

list_dts = ['uint8', 'int16', 'int32', 'int64']

uint8 is one of the most popular datatypes for low-precision NIfTI, used by many templates, atlases and segmentations. The signed int8 is not supported internally by most tools, and will be promoted to int16.

@mrneont

mrneont commented Feb 11, 2022

@neurolabusc : NooooooooooooOOOOOOOOOOOOOOOOOoooooooooooooooooOOO!!!!!!!!!!....

Well, on second thought, OK.

(I should have looked more closely at that nice table you made up above...)

@jbteves

jbteves commented Feb 14, 2022

Here is a very bad code and test prototype, if you'd like to draw inspiration from it @effigies:

import warnings
import numpy as np
import nibabel as nib

def compat(img, fname):
    """Compatible-save a Nifti1Image

    Parameters
    ----------
    img: nib.image
        An image to save
    fname: str
        The image filename to save to

    Notes
    -----
    Uses the following strategy to save the held data in the integer case:
    1. Prefer integer data types for integer arrays, in this order:
        a. uint8
        b. int16
        c. int32
    2. If the integer range exceeds the limits of an int32, use a float64
    as a fallback if this can be done without truncation.
    3. If a float64 cannot be used as a fallback without truncation, raise
    a warning to indicate potential compatibility issues with common
    programs.
    For floating-point arrays, the strategy is to prefer a float32 to
    float64 representation if there will not be loss of precision.
    In all cases, the header data is adjusted to match the data type of the
    underlying array.
    """
    # integer types in order of preference
    PREFERRED_TYPES = (np.uint8, np.int16, np.int32)
    DANGER_TYPES = (np.int64, np.uint64)
    # dictionary of nifti codes for each type
    # see:
    # https://github.com/NIFTI-Imaging/nifti_clib/blob/master/niftilib/nifti1.h
    NIFTI_CODES = {
        np.uint8: 2,
        np.int16: 4,
        np.int32: 8,
        np.float32: 16,
        np.float64: 64,
    }
    # get the array and its type
    original_array = np.asanyarray(img.dataobj)
    dt = original_array.dtype
    # check if we're in danger
    if not (dt in DANGER_TYPES or dt == np.float64):
        # normal save
        nib.save(img, fname)
        return
    # *chuckles* I'm in danger
    # set some useful functions for quickly getting min and max of int types
    im = lambda t: np.iinfo(t).min
    iM = lambda t: np.iinfo(t).max
    # assert IEEE float integer range (see wikipedia articles below)
    # https://en.wikipedia.org/wiki/Double-precision_floating-point_format#Precision_limitations_on_integer_values
    # https://en.wikipedia.org/wiki/Single-precision_floating-point_format#Precision_limitations_on_integer_values
    f64_imin = 0 - 2**53
    f64_imax = 2**53
    f32_imax = 16777216
    f32_imin = 0 - f32_imax
    fm = lambda t: np.finfo(t).min
    fM = lambda t: np.finfo(t).max
    # get array min and max
    mm = np.amin(original_array)
    MM = np.amax(original_array)
    out_type = dt
    # Check if float
    print(f"Array minimum {mm}, maximum {MM}, orig type {dt}")
    if np.issubdtype(dt, np.floating):
        print("Floating")
        if np.array_equal(np.int64(original_array), original_array):
            # Implied integer type
            dt = np.int64
        # np.array_equal didn't work, thus this approach
        elif np.any(np.abs(original_array - np.float32(original_array)) > 0):
            # We can fit this in a float32, so let's do it
            out_type = np.float32
        else:
            # Guess we're stuck with it
            out_type = np.float64
    if np.issubdtype(dt, np.integer):
        # literal integer type, try to find a suitable integer subtype
        for t in PREFERRED_TYPES:
            print(f"Trying {im(t)} and {iM(t)}")
            if mm >= im(t) and MM <= iM(t):
                # we can un-danger ourselves
                out_type = t
                break
        if out_type == dt:
            # check to see if we can fit in the integer range float64
            if mm > f64_imin and MM < f64_imax:
                # we're solid(ish)
                out_type = np.float64
            else:
                # sob uncontrollably
                out_type = np.int64
    elif dt == bool:
        # We can stuff this in a uint8, praise be
        out_type = np.uint8
    elif dt == np.csingle or dt == np.cdouble or np.issubdtype(dt, np.floating):
        # Do nothing
        pass
    else:
        raise ValueError("Please stop using weird data")
    if dt != out_type:
        # If we're still in this function, time to make a new array, img
        arr = out_type(original_array)
        aff = img.affine
        hdr = img.header
        hdr['datatype'] = np.array(NIFTI_CODES[out_type])
        print(f"Using data type {hdr['datatype']} for {out_type}")
        # Replace the image
        img = nib.nifti1.Nifti1Image(arr, aff, header=hdr)
    # Save it
    nib.save(img, fname)
    if out_type in DANGER_TYPES:
        warnings.warn(
            "This data may not be compatible with other packages such as "
            "SPM, FSL, AFNI, freesurfer."
            f"(Data type is {out_type}; {PREFERRED_TYPES} or float32 "
            "preferred)."
        )
    if out_type == np.float64:
        warnings.warn(
            "This is a very precise float (float64), you may not be able "
            "to write out data at this precision."
        )

Test:

"""Tests for compat.py"""

import os

import pytest
import numpy as np
import nibabel as nib
from compat import compat


TEMP_FNAME = "MY_TEMPORARY_NIFTI.nii"

def default_affine():
    """Return the default affine matrix

    Returns
    -------
    np.ndarray
        The array representing the default affine
    """
    return np.diag([1, 1, 1, 1])



def default_shape():
    """Return the default shape

    Returns
    -------
    tuple(int)
        The shape of the default array
    """
    return (64, 64, 32)


def default_arr():
    """Return the default array

    Returns
    -------
    np.ndarray
        The default data array for an image
    """
    return np.zeros(default_shape())


def default_image():
    """Return the default image

    Returns
    -------
    nib.nifti1.Nifti1Image
    """
    return nib.nifti1.Nifti1Image(default_arr(), default_affine())


# I know this is dumb but I hate typing and this is just an example
def boilerplate(arr):
    """Perform boilerplate actions with the image and filename

    Parameters
    ----------
    arr: np.ndarray
        The array to test

    Returns
    -------
    np.dtype
        The data type of the read array

    Notes
    -----
    Steps are:
    1. Make a new nifti image from the array
    2. Run compat (see companion file compat.py)
    3. Load the compat-generated file with module-level temp name AND
    4. Get the data type of the compat file
    5. Remove the temporary file
    """
    im = nib.nifti1.Nifti1Image(arr, default_affine())
    compat(im, TEMP_FNAME)
    dt = np.asanyarray(nib.load(TEMP_FNAME).dataobj).dtype
    os.remove(TEMP_FNAME)
    return dt


def test_from_i64():
    """Tests to make sure integers are downcast correctly"""
    # A matrix of zeros should go to uint8
    M = np.int64(default_arr())
    assert boilerplate(M) == np.uint8

    # A matrix with a value greater than can be uint8 shu
    M = np.int64(default_arr())
    M[0, 0, 0] = np.iinfo(np.uint8).max + 1
    assert boilerplate(M) == np.int16

    # A matrix with a value greater than int16 should be int32
    M = np.int64(default_arr())
    M[0, 0, 0] = np.iinfo(np.int16).max + 1
    assert boilerplate(M) == np.int32

    # Fallback to float64 should be plausible for a range of int64s
    M = np.int64(default_arr())
    M[0, 0, 0] = np.iinfo(np.int32).max + 1
    # We want to make sure we get warned, however
    with pytest.warns(UserWarning, match=r"This is a very precise float"):
        assert boilerplate(M) == np.float64

    # We can't do float64 for HUGE numbers, should warn
    M = np.int64(default_arr())
    M[0, 0, 0] = np.iinfo(np.int64).max
    with pytest.warns(UserWarning, match=r"This data may not be"):
        assert boilerplate(M) == np.int64


def tests_from_f64():
    """Tests to make sure floats are downcast correctly"""
    # A matrix of almost-zeros should go to float32
    # NOTE: 1e-9 is too small for default np.allclose to coerce a true
    # equal check
    M = default_arr() + 1e-9
    assert boilerplate(M) == np.float32

    # We should get a warning for things that really have to be float64
    M = default_arr()
    M[0, 0, 0] = np.finfo(np.float32).max + 1e4
    with pytest.warns(UserWarning, match=r"This is a very precise float"):
        assert boilerplate(M) == np.float64

@jbteves

jbteves commented Feb 14, 2022

Basically it attempts to compress the integer or float range where possible, and otherwise warns the user that they may have issues. I adhered to the types that are widespread for read/write in PREFERRED_TYPES, and prefer float64 (since it can at least be read) to int64 (signed or not). It also forces the header to match the type of the array. It was unclear to me, from how nibabel works, whether I would also need to coerce the scaling factors to have that work properly.

@effigies
Member

xref #1089

@mrneont

mrneont commented Mar 3, 2022

@afni-rickr has now added int64 support into AFNI, via nifti_tool, to be able to convert these datatypes to usable ones. See:
https://github.com/afni/afni/releases/tag/AFNI_22.0.15

Here are a few examples from the help:

    H. convert to a different datatype (from whatever is there):
       (possibly warn or fail if conversion is not perfect)

      0. nifti_tool -copy_image -prefix copy.nii -infiles dset0.nii
      1. nifti_tool -copy_image -infiles dset0.nii    \
                    -prefix copy_f32.nii              \
                    -convert2dtype NIFTI_TYPE_FLOAT32 \
                    -convert_fail_choice warn
      2. nifti_tool -copy_image -infiles dset0.nii    \
                    -prefix copy_i32.nii              \
                    -convert2dtype NIFTI_TYPE_INT32   \
                    -convert_fail_choice fail

And see these and related help options for more description:

    -convert2dtype TYPE : convert input dset to given TYPE
       ...
    -convert_fail_choice CHOICE : set what to do on conversion failures
       ...
    -convert_verify    : verify that conversions were exact
       ...

@effigies
Member

This should have been closed by #1096.

>>> img = nb.Nifti1Image(np.arange(20).astype(int), np.eye(4), dtype='compat')
>>> rt_img = img.from_bytes(img.to_bytes())
>>> rt_img.get_data_dtype()
dtype('<i4')
>>> img.set_data_dtype('smallest')
>>> rt_img = img.from_bytes(img.to_bytes())
>>> rt_img.get_data_dtype()
dtype('uint8')
