
Define data model for representing points and trajectories #12

Closed
niksirbi opened this issue Mar 23, 2023 · 6 comments · Fixed by #33
Labels
core feature a core functionality that must be implemented

Comments

@niksirbi
Member

Define custom classes for representing points (animal body parts) and series of points (animal trajectories) in space.

These could be subclasses of np.record and np.recarray respectively, so that fields (e.g. 'x', 'y', 'name', 'confidence') can be accessed as attributes.

Note
numpy.record and numpy.recarray are NumPy data structures for handling structured data with multiple fields per element. A numpy.record represents a single structured element, while a numpy.recarray is an array of such elements. The key difference between a regular structured array and a numpy.recarray is that fields in a recarray can be accessed using attribute notation (e.g., recarray.field_name) instead of indexing notation (e.g., array['field_name']). This provides more convenient syntax, but with slightly lower performance compared to structured arrays.
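
For illustration, a minimal sketch of the difference (field names as in the example above; the exact layout is just an assumption, not a proposed schema):

```python
import numpy as np

# A plain structured array: one element per tracked point
points = np.array(
    [(1.0, 2.0, "snout", 0.95), (3.5, 4.2, "tail_base", 0.88)],
    dtype=[("x", "f8"), ("y", "f8"), ("name", "U32"), ("confidence", "f8")],
)

# Structured arrays use indexing notation
print(points["x"])            # [1.  3.5]

# A recarray view of the same data adds attribute notation
rec = points.view(np.recarray)
print(rec.x)                  # [1.  3.5]
print(rec[0].confidence)      # 0.95 -- rec[0] is a single np.record
```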

This is the approach SLEAP takes. We could also directly use or subclass the SLEAP objects.

@niksirbi niksirbi added the core feature a core functionality that must be implemented label Mar 23, 2023
@niksirbi niksirbi added this to the v 0.1 (MVP) milestone Mar 23, 2023
@niksirbi niksirbi self-assigned this Mar 23, 2023
@niksirbi niksirbi mentioned this issue Mar 24, 2023
@niksirbi
Member Author

In a discussion with @lochhh, we agreed to first try using the SLEAP data model.
The main roadblock is that SLEAP does not yet support recent Python versions. The SLEAP developers suggested using sleap-io, a separate Python package that reimplements their data model and deserialization routines - see this thread.

@talmo

talmo commented Jun 28, 2023

Just adding some more thoughts on this -- we've gone back and forth a lot on the appropriate data structure for pose data.

SLEAP's object-oriented model is clean and Pythonic (it's basically a bunch of dataclasses), and maps well onto common serialization formats like JSON/YAML/HDF5. It also makes it easy to translate to standardized formats like NWB's ndx-pose. It's also flexible: you can have variable numbers of instances per frame, and link attributes like tracks/identities or skeletons to individual animal instances.

The downside is that it's not always the most efficient depending on the access pattern. When you're doing labeling, random access creation of a single point or instance is necessary since users label one animal at a time. But imagine repeated serialization/deserialization -- if you have a Python object for every point, you're going to be instantiating hundreds of thousands to millions of little objects!

When you're doing complex queries, it's super inefficient. Consider the use case where you want to ask for all the frames in which there are N animals with body part pairs A and B within distance K of each other. This now requires a full iteration over all T frames (where T >> 1e6 oftentimes), and every instance within each frame, resulting in an O(T * N) operation -- assuming the labels are stored sequentially and not hashed by something else (like in multi-video projects).

I think the best of both worlds -- and what we'd eventually like to have in sleap-io -- would be to have a thin object-oriented access layer backed by a pandas DataFrame that has good support for cythonized or otherwise vectorized operations on the backend. Libraries like sqlalchemy achieve this to some extent, allowing for different access patterns via DAO/ORM/CRUD type patterns. Alternatively, just having different backends optimized for different use cases might be cleaner and reduce the abstraction overhead.
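
Purely to illustrate the idea (class and column names here are hypothetical, not sleap-io's actual API), a thin object-oriented access layer over a pandas DataFrame could look roughly like this:

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class Instance:
    """Lightweight view onto one animal instance in one frame."""
    frame: int
    track: str
    _df: pd.DataFrame  # shared backing store, not copied per instance

    @property
    def points(self) -> pd.DataFrame:
        # Vectorised filter on the backing DataFrame
        return self._df[(self._df["frame"] == self.frame)
                        & (self._df["track"] == self.track)]


# Backing store: one row per (frame, track, body part)
df = pd.DataFrame({
    "frame": [0, 0, 1, 1],
    "track": ["mouse1", "mouse1", "mouse1", "mouse1"],
    "node": ["snout", "tail_base", "snout", "tail_base"],
    "x": [1.0, 2.0, 1.1, 2.1],
    "y": [3.0, 4.0, 3.1, 4.1],
})

inst = Instance(frame=0, track="mouse1", _df=df)
print(inst.points)
```

Labeling-style random access goes through the small objects, while bulk queries can operate directly on the backing DataFrame in vectorised form.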

If you're going down the object-oriented model route, consider using a framework like attrs or plain dataclasses for readability and reducing boilerplate. See also these considerations with regards to performance and usability: [1] [2] [3]

In any case, give it a go for your test cases, benchmark it, and feel free to reach out if you need any feedback or have any for us!

@niksirbi
Member Author

Thank you for chiming in on this @talmo.
Since this project is still in early development, we are fully open to discussing basic design considerations. We want to choose data structures that will not make our lives difficult down the line.

The SLEAP data model appealed to us precisely because of the flexibility you mentioned (and a desire to not reinvent the wheel), but the performance considerations may indeed become a bottleneck. Not so much for our envisioned alpha product (import, smooth and plot tracks) but definitely for more complex kinematic analyses like the example you mentioned.

I am keen to stay in touch and follow the developments over at sleap-io, given that your team has thought about these issues for much longer than we have.

For now, we will likely try adopting the sleap-io model as is, and implement changes on the backend as things evolve. If the backend approach you end up with is good enough for our needs, we are happy to adopt it. Otherwise, we'll have to design backends tailored to our needs.

Just out of curiosity, have you given Dask much thought? We have benefited from Dask in other unrelated projects, but haven't yet thought through if/how to apply it to pose data. In case you have considered it and think it's a dead end, let us know.

Also thanks for the attrs references, I will read through and reconsider my use of Pydantic.

@niksirbi
Member Author

After some research and internal discussions, we decided to try using xarray.DataArray as a backend for pose tracking data.

A DataArray is an N-dimensional generalisation of a pandas Series. Its key components are:

  • values: a numpy.ndarray holding the array's values
  • dims: names for each axis (e.g., ['frames', 'individuals', 'bodyparts'])
  • coords: the levels of each dim - e.g., list of animal names, bodypart names
  • attrs: a dict for holding arbitrary metadata (attributes)

Multiple DataArray objects can also be combined into an xarray.Dataset, aligned along shared dimensions. For example, we could create a Dataset corresponding to a collection of videos, with the pose tracks of each video stored in a separate DataArray object.
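
As a rough sketch of what this could look like for pose tracks (dimension names, coordinates and metadata here are just an illustration, not a final design):

```python
import numpy as np
import xarray as xr

# Fake pose tracks: 100 frames, 2 individuals, 3 body parts, 2 spatial coords
rng = np.random.default_rng(0)
values = rng.random((100, 2, 3, 2))

tracks = xr.DataArray(
    values,
    dims=["frames", "individuals", "bodyparts", "space"],
    coords={
        "individuals": ["mouse1", "mouse2"],
        "bodyparts": ["snout", "centroid", "tail_base"],
        "space": ["x", "y"],
    },
    attrs={"fps": 30, "source_software": "SLEAP"},
)

# Multiple such arrays could then be collected into a Dataset
ds = xr.Dataset({"video1": tracks, "video2": tracks.copy()})
```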

| xarray Pros | xarray Cons |
| --- | --- |
| label-based indexing | not as widely known as numpy/pandas |
| numpy-like vectorisation and broadcasting | will require some learning for devs |
| pandas-like aggregation + groupby | |
| Dask integration for parallel computing | |
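
A few one-liners illustrating the "pros" above (again just a sketch; the `tracks` array from the previous snippet is rebuilt here so this runs standalone):

```python
import numpy as np
import xarray as xr

# Same shape and labels as in the construction sketch above
tracks = xr.DataArray(
    np.zeros((100, 2, 3, 2)),
    dims=["frames", "individuals", "bodyparts", "space"],
    coords={"individuals": ["mouse1", "mouse2"],
            "bodyparts": ["snout", "centroid", "tail_base"],
            "space": ["x", "y"]},
)

# Label-based indexing: one individual's snout trajectory over all frames
snout = tracks.sel(individuals="mouse1", bodyparts="snout")

# pandas-like aggregation: mean position of each body part across frames
mean_pos = tracks.mean(dim="frames")

# Dask integration: chunk along frames for lazy, parallel computation (requires dask)
lazy = tracks.chunk({"frames": 50})
```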

I'll give it a try and see if we can discover some unknown "cons" before we fully commit to it as a backend.

@talmo

talmo commented Jul 17, 2023

Would definitely recommend xarray over numpy recarray. If using this for prediction results only, then this should work great.

If using it for training data, I'd advise checking out some of the discussions in rly/ndx-pose#9 for workflow-specific considerations. Basically, you may not want to over-optimize for timeseries since most annotation for pose is done in single images that are explicitly not consecutive in time.

@niksirbi
Member Author

Would definitely recommend xarray over numpy recarray. If using this for prediction results only, then this should work great.

Thanks for the input! Most of the things we want to do will operate on the prediction results only. movement is meant for post-SLEAP/DLC analysis, meaning we use already predicted poses as the input.

@niksirbi niksirbi mentioned this issue Aug 16, 2023
@niksirbi niksirbi added this to the v0.1 - alpha release milestone Oct 16, 2023
@sfmig sfmig removed this from the v0.1 - alpha release milestone Aug 27, 2024