Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Content.form_with_key_path() #3311

Merged
merged 8 commits into from
Nov 22, 2024
Merged

Conversation

jpivarski
Copy link
Member

As it stands, this PR lacks tests, but the new feature is:

>>> array = ak.Array([{"event": 123, "muons": [{"pt": 1.1}, {"pt": 2.2}]}])
>>> print(array.layout.form_with_key_path())
{
    "class": "RecordArray",
    "fields": [
        "event",
        "muons"
    ],
    "contents": [
        {
            "class": "NumpyArray",
            "primitive": "int64",
            "form_key": "('event',)"
        },
        {
            "class": "ListOffsetArray",
            "offsets": "i64",
            "content": {
                "class": "RecordArray",
                "fields": [
                    "pt"
                ],
                "contents": [
                    {
                        "class": "NumpyArray",
                        "primitive": "float64",
                        "form_key": "('muons', None, 'pt')"
                    }
                ],
                "form_key": "('muons', None)"
            },
            "form_key": "('muons',)"
        }
    ],
    "form_key": "()"
}

These form keys can be turned back into tuples with ast.literal_eval to get a full path from root to each node.

The new function takes root as an argument to put a prefix in the tuple, for uniqueness beyond the tree.

Maybe those None values should optionally be 0? They're needed for nodes to be unique, so that the form keys of this list are not all identical:

>>> array = ak.Array([[[[[1]]]]])
>>> print(array.layout.form_with_key_path())
{
    "class": "ListOffsetArray",
    "offsets": "i64",
    "content": {
        "class": "ListOffsetArray",
        "offsets": "i64",
        "content": {
            "class": "ListOffsetArray",
            "offsets": "i64",
            "content": {
                "class": "ListOffsetArray",
                "offsets": "i64",
                "content": {
                    "class": "NumpyArray",
                    "primitive": "int64",
                    "form_key": "(None, None, None, None)"
                },
                "form_key": "(None, None, None)"
            },
            "form_key": "(None, None)"
        },
        "form_key": "(None,)"
    },
    "form_key": "()"
}

But maybe we don't care about those extra 3 characters per nesting level. How much nesting is there going to be, anyway? Yeah, I think this is fine.

@jpivarski
Copy link
Member Author

The type hint is wrong. One moment...

@pfackeldey
Copy link
Collaborator

pfackeldey commented Nov 21, 2024

This is great 🎉, and would help me with dask-contrib/dask-awkward#551 significantly!

Btw, I think 0 instead of None could be useful, because it allows to use the ast.literal_eval'd form_key directly for indexing into the corresponding buffer:

array[ast.literal_eval("('muons', 0, 'pt')")]
# >> <Array [1.1, 2.2] type='2 * float64'>

which is something that I'd like to do in dask-contrib/dask-awkward#551 for translating touched form_keys of a type tracer report into slices that I can use for explicit touching at a later stage (i.e. in a second function pass), something like:

def render_buffer_key(form: ak.forms.Form, form_key: str, attribute: str) -> str:
  return f"{form_key}"
  
tt, report = ak.typetracer.typetracer_with_report(
  array.layout.form_with_key_path(), 
  highlevel=True,
  buffer_key=render_buffer_key,
)

def fun(x):
  return x["muons"][0]["pt"]

_ = fun(tt)

...

# use the report later to touch the columns explicitly again,
# but don't run the full `fun` pass again
# ~something like:
if ak.backend(array) == "typetracer":
  for form_key in set(report.data_touched):
    array[ast.literal_eval(form_key)].layout._touch_data(recursive=True)
  
  for form_key in set(report.shape_touched):
    array[ast.literal_eval(form_key)].layout._touch_shape(recursive=True)
else:
  fun(array)

(PS: this PR reminds me of https://jax.readthedocs.io/en/latest/working-with-pytrees.html#explicit-key-paths. )

@jpivarski
Copy link
Member Author

array[ast.literal_eval("('muons', 0, 'pt')")]
# >> <Array [1.1, 2.2] type='2 * float64'>

Careful: that only works if the lists are not empty. Also, what about the use of None for option-type and untyped indexes? Or the use of integers for UnionArrays (which are necessary, since UnionArrays branch). Similarly, tuple RecordArrays (RecordArrays in which fields is None) will need some sort of integer.

It won't be possible to use these as slices in general. But for ast.literal_eval, any constants or tuples/lists/dicts of constants are allowed:

>>> ast.literal_eval("('string', None, True, False, ['nested', {'key': 'value'}])")
('string', None, True, False, ['nested', {'key': 'value'}])

But Ellipsis/... are not. In the Python syntax, that qualifies as a ast.Name, rather than a constant known to the parser. Also, range is not allowed.

Any list could take [] (instead of None) as a slice item, since that's an empty slice. But there's no solution for IndexedArray, which needs something in the form_key to distinguish the IndexedArray from its own content, but this distinction is invisible to slicing.

@pfackeldey
Copy link
Collaborator

Hi @jpivarski,
I pushed the following updates to this PR: fixed type hints, enforced string-type path entries for records only, and added tests for each array type.
From my side this is ready to merge, what do you think?

@jpivarski jpivarski merged commit e359210 into main Nov 22, 2024
41 checks passed
@jpivarski jpivarski deleted the jpivarski/form-with-key-path branch November 22, 2024 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants