chore: use `_util.native_to_byteorder` function in `ak.from_buffers` #3354

pfackeldey · 2024-12-19T18:46:46Z

Directly use an existing utility function for byte ordering. Stumbled over this while working on #3353.

jpivarski

It's true that Awkward Arrays (data and indexes) must always be native-endian, and it's also true that the buffers passed to ak.from_buffers may be big or little-endian, depending on the byteorder argument:

awkward/src/awkward/operations/ak_from_buffers.py

Lines 56 to 58 in 8285871

    
    byteorder (`"<"`, `">"`): Endianness of buffers read from `container`.
        If the byteorder does not match the current system byteorder, the
        arrays will be copied.
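To make the "arrays will be copied" remark concrete, here is a minimal NumPy-only sketch (not the `ak.from_buffers` code path, and the variable names are made up for illustration). It reads big-endian bytes and converts them to native order; the conversion only needs a copy when the buffer's order differs from the machine's.

```python
import sys
import numpy as np

# Four int32 values serialized as big-endian bytes (">i4").
raw = np.array([1, 2, 3, 4], dtype=">i4").tobytes()

# Interpreting the bytes with the matching byte order is a zero-copy view.
view = np.frombuffer(raw, dtype=">i4")

# Converting to native order only copies when the orders differ:
# with copy=False, astype returns its input if the dtype already matches.
native = view.astype(np.dtype(np.int32), copy=False)

# True on little-endian machines, False on big-endian ones.
copied = not np.shares_memory(view, native)
```

On a little-endian machine `copied` is `True`; on a big-endian machine the same code would hand back the original view.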

and this is forced to be "<" when we pickle arrays as part of the (unwritten?) format specification:

awkward/src/awkward/highlevel.py

Lines 1690 to 1723 in fb245f1

    
    def __setstate__(self, state):
        form, length, container, behavior, *_ = state
        # If length is a sequence, we have awkward1
        if isinstance(length, Sequence):
            part_layouts = [
                # Load partition
                ak.operations.from_buffers(
                    _awkward_1_rewrite_partition_form(form, i),
                    part_length,
                    container,
                    highlevel=False,
                    buffer_key="{form_key}-{attribute}",
                    byteorder="<",
                )
                for i, part_length in enumerate(length)
            ]
            # Fuse partitions
            layout = ak.concatenate(part_layouts, axis=0, highlevel=False)
        # Otherwise, we have either awkward1 or awkward2
        else:
            layout = ak.operations.from_buffers(
                form,
                length,
                container,
                highlevel=False,
                buffer_key="{form_key}-{attribute}",
                byteorder="<",
            )
        self._layout = layout
        self._behavior = behavior
        self._attrs = None
        self._update_class()
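The point of pinning `byteorder="<"` on the reading side is that the serialized bytes mean the same thing on every platform. A rough NumPy-only sketch of that idea follows; `dump_little_endian`, `load_little_endian`, and the `"node0-offsets"` key are hypothetical names chosen to mirror the `"{form_key}-{attribute}"` pattern, not Awkward's actual pickling code.

```python
import numpy as np

def dump_little_endian(array):
    """Serialize to bytes that are little-endian regardless of platform."""
    little = array.dtype.newbyteorder("<")
    return array.astype(little, copy=False).tobytes()

def load_little_endian(raw, dtype):
    """Decode little-endian bytes into a native-endian array."""
    dtype = np.dtype(dtype)
    as_stored = np.frombuffer(raw, dtype=dtype.newbyteorder("<"))
    return as_stored.astype(dtype.newbyteorder("="), copy=False)

offsets = np.array([0, 2, 2, 5], dtype=np.int64)

# Hypothetical container keyed like "{form_key}-{attribute}".
container = {"node0-offsets": dump_little_endian(offsets)}
restored = load_little_endian(container["node0-offsets"], np.int64)
```

Because both sides agree on `"<"`, the writer and the reader may run on machines with different native byte orders and still recover the same values.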

Digging into this, the dtype passed to `_from_buffer` (a private function) always comes from `primitive_to_dtype` or `index_to_dtype`. `primitive_to_dtype` is always defined to be native-endian (good):

awkward/src/awkward/types/numpytype.py

Lines 67 to 83 in fb245f1

    
    _primitive_to_dtype_dict = {
        "bool": np.dtype(np.bool_),
        "int8": np.dtype(np.int8),
        "uint8": np.dtype(np.uint8),
        "int16": np.dtype(np.int16),
        "uint16": np.dtype(np.uint16),
        "int32": np.dtype(np.int32),
        "uint32": np.dtype(np.uint32),
        "int64": np.dtype(np.int64),
        "uint64": np.dtype(np.uint64),
        "float32": np.dtype(np.float32),
        "float64": np.dtype(np.float64),
        "complex64": np.dtype(np.complex64),
        "complex128": np.dtype(np.complex128),
        "datetime64": np.dtype(np.datetime64),
        "timedelta64": np.dtype(np.timedelta64),
    }

But index_to_dtype is always defined to be little-endian (probably bad):

awkward/src/awkward/forms/form.py

Lines 375 to 381 in 8285871

    
    index_to_dtype: Final[dict[str, DType]] = {
        "i8": np.dtype("<i1"),
        "u8": np.dtype("<u1"),
        "i32": np.dtype("<i4"),
        "u32": np.dtype("<u4"),
        "i64": np.dtype("<i8"),
    }

The latter might be a mistake that I introduced in #2660, and it has never mattered because essentially all machines we run on are little-endian.
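A small NumPy check illustrates why the difference is invisible on little-endian hardware (a sketch for illustration only; the variable names are not from the codebase). NumPy normalizes an explicit `"<"` to the native marker when the machine is little-endian, so the two spellings compare equal there and would only diverge on a big-endian host.

```python
import sys
import numpy as np

native_i8 = np.dtype(np.int64)   # spelling used by the primitive table
little_i8 = np.dtype("<i8")      # spelling used by the index table
single_byte = np.dtype("<i1")    # byte order is meaningless for 1-byte types

# Equal on little-endian machines, different on big-endian ones.
same_on_this_machine = native_i8 == little_i8

print(native_i8.byteorder, little_i8.byteorder, single_byte.byteorder)
```

Only the multi-byte entries (`i32`, `u32`, `i64`) could ever be affected; the one-byte entries have no byte order to get wrong.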

And now I realize that this PR wasn't about changing the byteswap behavior at all—it was just to use a utility function (native_to_byteorder) instead of reimplementing those four lines here.

But my digging into it might have revealed a mistake. What do you think about index_to_dtype? It should be native-endian, not little-endian, right? I'm 90% sure of it (but this is always a problem because you can't test it to verify).
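One way the native-endian variant could be spelled, together with a check that at least states the expectation, is sketched below. This is an assumption about what a fix might look like, not a merged change; `native_table` and `all_native` are hypothetical names. The check passes trivially on little-endian machines for both tables, which is exactly the testing problem mentioned above.

```python
import sys
import numpy as np

# Copy of the table quoted above (explicitly little-endian).
little_endian_table = {
    "i8": np.dtype("<i1"),
    "u8": np.dtype("<u1"),
    "i32": np.dtype("<i4"),
    "u32": np.dtype("<u4"),
    "i64": np.dtype("<i8"),
}

# Hypothetical native-endian spelling of the same table.
native_table = {
    "i8": np.dtype(np.int8),
    "u8": np.dtype(np.uint8),
    "i32": np.dtype(np.int32),
    "u32": np.dtype(np.uint32),
    "i64": np.dtype(np.int64),
}

def all_native(table):
    """True if every dtype in the table uses the machine's byte order."""
    return all(dt.isnative for dt in table.values())

# Always True for native_table; for little_endian_table it is True only
# on little-endian hardware.
print(all_native(native_table), all_native(little_endian_table))
```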

On its own, this PR is fine to merge.

agoose77 · 2024-12-19T19:41:47Z

This is not a bug fix, right?

I'm tempted to rename the utility function, as it doesn't coerce to a byte order but rather just swaps.

pfackeldey · 2024-12-19T19:44:41Z

> it was just to use a utility function (native_to_byteorder) instead of reimplementing those four lines here.

Yes, it was only about this 😅 I'd leave it like that for now, because I'm too unfamiliar with little/big-endian to feel confident that I could correctly implement what you discovered. Hope this is fine 👍

jpivarski · 2024-12-19T19:46:31Z

That's correct—this PR is not a bug-fix. But while I was thinking about it incorrectly, I probably discovered an unrelated bug.

The name of the utility function seems right to me, but I'm rather neutral on it. It swaps the byte order if the given buffers have non-native order.
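The "swap" versus "coerce" distinction can be shown with a stand-in function. The sketch below is a hypothetical helper written for this discussion, not the actual `_util.native_to_byteorder`; it byteswaps whenever the stated order is non-native, so applying it twice undoes the first call instead of being a no-op.

```python
import sys
import numpy as np

NATIVE = "<" if sys.byteorder == "little" else ">"
NON_NATIVE = ">" if NATIVE == "<" else "<"

def swap_if_non_native(array, byteorder):
    """Hypothetical stand-in: swap bytes only if `byteorder` is not native."""
    assert byteorder in ("<", ">")
    if byteorder != NATIVE:
        return array.byteswap(inplace=False)
    return array

values = np.array([1, 2, 3], dtype=np.int32)

# The same bytes as they would arrive from a non-native container: the
# dtype label is still native, so the numbers look scrambled.
foreign = values.byteswap(inplace=False)

once = swap_if_non_native(foreign, NON_NATIVE)   # values recovered
twice = swap_if_non_native(once, NON_NATIVE)     # scrambled again
```

A coercing function would inspect the array's current byte order and leave already-native data alone; a swapping function relies on the caller to say truthfully what order the bytes are in.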

use

8285871

pfackeldey requested a review from jpivarski December 19, 2024 18:47

pfackeldey temporarily deployed to docs December 19, 2024 18:57 — with GitHub Actions Inactive

jpivarski approved these changes Dec 19, 2024


Merge branch 'main' into pfackeldey/use_native_to_byteorder_util_func

161faca

pfackeldey changed the title ~~fix: use _util.native_to_byteorder function in ak.from_buffers~~ chore: use _util.native_to_byteorder function in ak.from_buffers Dec 19, 2024

pfackeldey deployed to docs December 19, 2024 20:04 — with GitHub Actions View deployment

pfackeldey merged commit 288851a into main Dec 19, 2024
40 checks passed

pfackeldey deleted the pfackeldey/use_native_to_byteorder_util_func branch December 19, 2024 20:07


