-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: use _util.native_to_byteorder
function in ak.from_buffers
#3354
chore: use _util.native_to_byteorder
function in ak.from_buffers
#3354
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's true that Awkward Arrays (data and indexes) must always be native-endian, and it's also true that the buffers passed to ak.from_buffers
may be big or little-endian, depending on the byteorder
argument:
awkward/src/awkward/operations/ak_from_buffers.py
Lines 56 to 58 in 8285871
byteorder (`"<"`, `">"`): Endianness of buffers read from `container`. | |
If the byteorder does not match the current system byteorder, the | |
arrays will be copied. |
and this is forced to be "<"
when we pickle arrays as part of the (unwritten?) format specification:
awkward/src/awkward/highlevel.py
Lines 1690 to 1723 in fb245f1
def __setstate__(self, state): | |
form, length, container, behavior, *_ = state | |
# If length is a sequence, we have awkward1 | |
if isinstance(length, Sequence): | |
part_layouts = [ | |
# Load partition | |
ak.operations.from_buffers( | |
_awkward_1_rewrite_partition_form(form, i), | |
part_length, | |
container, | |
highlevel=False, | |
buffer_key="{form_key}-{attribute}", | |
byteorder="<", | |
) | |
for i, part_length in enumerate(length) | |
] | |
# Fuse partitions | |
layout = ak.concatenate(part_layouts, axis=0, highlevel=False) | |
# Otherwise, we have either awkward1 or awkward2 | |
else: | |
layout = ak.operations.from_buffers( | |
form, | |
length, | |
container, | |
highlevel=False, | |
buffer_key="{form_key}-{attribute}", | |
byteorder="<", | |
) | |
self._layout = layout | |
self._behavior = behavior | |
self._attrs = None | |
self._update_class() |
Digging into this, the dtype
passed to _from_buffer
(a private function) always comes from primitive_to_dtype
or index_to_dtype
. The primitive_to_dtype
is always defined to be native endian (good):
awkward/src/awkward/types/numpytype.py
Lines 67 to 83 in fb245f1
_primitive_to_dtype_dict = { | |
"bool": np.dtype(np.bool_), | |
"int8": np.dtype(np.int8), | |
"uint8": np.dtype(np.uint8), | |
"int16": np.dtype(np.int16), | |
"uint16": np.dtype(np.uint16), | |
"int32": np.dtype(np.int32), | |
"uint32": np.dtype(np.uint32), | |
"int64": np.dtype(np.int64), | |
"uint64": np.dtype(np.uint64), | |
"float32": np.dtype(np.float32), | |
"float64": np.dtype(np.float64), | |
"complex64": np.dtype(np.complex64), | |
"complex128": np.dtype(np.complex128), | |
"datetime64": np.dtype(np.datetime64), | |
"timedelta64": np.dtype(np.timedelta64), | |
} |
But index_to_dtype
is always defined to be little-endian (probably bad):
awkward/src/awkward/forms/form.py
Lines 375 to 381 in 8285871
index_to_dtype: Final[dict[str, DType]] = { | |
"i8": np.dtype("<i1"), | |
"u8": np.dtype("<u1"), | |
"i32": np.dtype("<i4"), | |
"u32": np.dtype("<u4"), | |
"i64": np.dtype("<i8"), | |
} |
The latter might be a mistake that I introduced in #2660 and it has never mattered because all machines are little-endian machines.
And now I realize that this PR wasn't about changing the byteswap behavior at all—it was just to use a utility function (native_to_byteorder) instead of reimplementing those four lines here.
But my digging into it might have revealed a mistake. What do you think about index_to_dtype
? It should be native-endian, not little-endian, right? I'm 90% sure of it (but this is always a problem because you can't test it to verify).
On its own, this PR is fine to merge.
This is not a bug fix, right? I'm tempted to rename the utility function, as it doesn't coerced to a byte order but rather just swaps. |
yes It was only about this 😅 I'd leave it like that for now, because I'm too unfamiliar with litte/big endian as I would feel confident to correctly implement what you discovered. Hope this is fine 👍 |
That's correct—this PR is not a bug-fix. But while I was thinking about it incorrectly, I probably discovered an unrelated bug. The name of the utility function seems right to me, but I'm rather neutral on it. It swaps the byte order if the given buffers have non-native order. |
_util.native_to_byteorder
function in ak.from_buffers
_util.native_to_byteorder
function in ak.from_buffers
Directly use an existing utility function for byte ordering. Stumbled over this while working on #3353.