Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: use _util.native_to_byteorder function in ak.from_buffers #3354

Merged
merged 2 commits into from
Dec 19, 2024

Conversation

pfackeldey
Copy link
Collaborator

Directly use an existing utility function for byte ordering. Stumbled over this while working on #3353.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that Awkward Arrays (data and indexes) must always be native-endian, and it's also true that the buffers passed to ak.from_buffers may be big or little-endian, depending on the byteorder argument:

byteorder (`"<"`, `">"`): Endianness of buffers read from `container`.
If the byteorder does not match the current system byteorder, the
arrays will be copied.

and this is forced to be "<" when we pickle arrays as part of the (unwritten?) format specification:

def __setstate__(self, state):
form, length, container, behavior, *_ = state
# If length is a sequence, we have awkward1
if isinstance(length, Sequence):
part_layouts = [
# Load partition
ak.operations.from_buffers(
_awkward_1_rewrite_partition_form(form, i),
part_length,
container,
highlevel=False,
buffer_key="{form_key}-{attribute}",
byteorder="<",
)
for i, part_length in enumerate(length)
]
# Fuse partitions
layout = ak.concatenate(part_layouts, axis=0, highlevel=False)
# Otherwise, we have either awkward1 or awkward2
else:
layout = ak.operations.from_buffers(
form,
length,
container,
highlevel=False,
buffer_key="{form_key}-{attribute}",
byteorder="<",
)
self._layout = layout
self._behavior = behavior
self._attrs = None
self._update_class()

Digging into this, the dtype passed to _from_buffer (a private function) always comes from primitive_to_dtype or index_to_dtype. The primitive_to_dtype is always defined to be native endian (good):

_primitive_to_dtype_dict = {
"bool": np.dtype(np.bool_),
"int8": np.dtype(np.int8),
"uint8": np.dtype(np.uint8),
"int16": np.dtype(np.int16),
"uint16": np.dtype(np.uint16),
"int32": np.dtype(np.int32),
"uint32": np.dtype(np.uint32),
"int64": np.dtype(np.int64),
"uint64": np.dtype(np.uint64),
"float32": np.dtype(np.float32),
"float64": np.dtype(np.float64),
"complex64": np.dtype(np.complex64),
"complex128": np.dtype(np.complex128),
"datetime64": np.dtype(np.datetime64),
"timedelta64": np.dtype(np.timedelta64),
}

But index_to_dtype is always defined to be little-endian (probably bad):

index_to_dtype: Final[dict[str, DType]] = {
"i8": np.dtype("<i1"),
"u8": np.dtype("<u1"),
"i32": np.dtype("<i4"),
"u32": np.dtype("<u4"),
"i64": np.dtype("<i8"),
}

The latter might be a mistake that I introduced in #2660 and it has never mattered because all machines are little-endian machines.

And now I realize that this PR wasn't about changing the byteswap behavior at all—it was just to use a utility function (native_to_byteorder) instead of reimplementing those four lines here.

But my digging into it might have revealed a mistake. What do you think about index_to_dtype? It should be native-endian, not little-endian, right? I'm 90% sure of it (but this is always a problem because you can't test it to verify).

On its own, this PR is fine to merge.

@agoose77
Copy link
Collaborator

This is not a bug fix, right?

I'm tempted to rename the utility function, as it doesn't coerced to a byte order but rather just swaps.

@pfackeldey
Copy link
Collaborator Author

it was just to use a utility function (native_to_byteorder) instead of reimplementing those four lines here.

yes It was only about this 😅 I'd leave it like that for now, because I'm too unfamiliar with litte/big endian as I would feel confident to correctly implement what you discovered. Hope this is fine 👍

@jpivarski
Copy link
Member

That's correct—this PR is not a bug-fix. But while I was thinking about it incorrectly, I probably discovered an unrelated bug.

The name of the utility function seems right to me, but I'm rather neutral on it. It swaps the byte order if the given buffers have non-native order.

@pfackeldey pfackeldey changed the title fix: use _util.native_to_byteorder function in ak.from_buffers chore: use _util.native_to_byteorder function in ak.from_buffers Dec 19, 2024
@pfackeldey pfackeldey merged commit 288851a into main Dec 19, 2024
40 checks passed
@pfackeldey pfackeldey deleted the pfackeldey/use_native_to_byteorder_util_func branch December 19, 2024 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants