NikosAlexandris/rekx

rekx 🦖

Under Heavy Development

1

What ?

rekx provides an interactive command line interface to the Kerchunk library [@Durant2023]. It assists in creating virtual aggregate datasets, also known as Kerchunk reference sets, which allow efficient, parallel and cloud-friendly access to data in situ without duplicating the original datasets.
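
To make the idea of a reference set more concrete, here is a minimal sketch, assuming the version 1 layout of the Kerchunk reference format: a mapping from Zarr-style keys to either inline content or a `[url, offset, length]` pointer into an original file. The file names, byte offsets and the `byte_range` helper are hypothetical and are not part of rekx or Kerchunk.

```python
import json

# Hypothetical reference set: keys are Zarr-style paths, values are either
# inline strings (metadata) or [url, offset, length] pointers into the
# original files. File names and byte offsets are made up for illustration.
references = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temperature/0.0.0": ["data/file_2020.nc", 8192, 4096],
        "temperature/1.0.0": ["data/file_2021.nc", 8192, 4096],
    },
}


def byte_range(reference_set, key):
    """Return (url, first_byte, last_byte) for a chunk stored by reference."""
    entry = reference_set["refs"][key]
    if isinstance(entry, str):
        raise ValueError(f"{key} is stored inline, not by reference")
    url, offset, length = entry
    return url, offset, offset + length - 1


print(byte_range(references, "temperature/1.0.0"))
```

A reader only needs the listed byte range of the original file to fetch a chunk, which is why the original datasets do not have to be duplicated.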

More than a functional tool, rekx serves an educational purpose on matters around chunking, compression and efficient data reading from common scientific file formats such as NetCDF, which is used extensively to store large time series. While there is abundant documentation on such topics, it is often highly technical and oriented towards developers; rekx tries to simplify these concepts through practical examples.

Why ?

Existing tools for managing HDF and NetCDF data, such as cdo, nco, and others, often have overlapping functionalities and present a steep learning curve for non-experts. rekx focuses on the practical aspects of efficient data access and tries to simplify these processes.

It features simple command line tools to:

  • diagnose data structures
  • validate uniform chunking across files
  • suggest good chunking shapes
  • parameterise the rechunking of datasets
  • create and aggregate Kerchunk reference sets
  • time data read operations for performance analysis
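
As an illustration of what validating uniform chunking across files involves, here is a minimal sketch in plain Python. It assumes the chunk shape of one variable has already been read from each file; the file names, the shapes and the `validate_uniform_chunking` function are hypothetical and do not reflect rekx's actual implementation.

```python
from collections import Counter


def validate_uniform_chunking(chunk_shapes):
    """Return (is_uniform, most_common_shape, files deviating from it)."""
    if not chunk_shapes:
        raise ValueError("no chunk shapes given")
    counts = Counter(chunk_shapes.values())
    common_shape, _ = counts.most_common(1)[0]
    deviating = sorted(
        name for name, shape in chunk_shapes.items() if shape != common_shape
    )
    return not deviating, common_shape, deviating


# Hypothetical chunk shapes (time, lat, lon) of one variable in three files
shapes = {
    "file_2020.nc": (1, 2600, 2600),
    "file_2021.nc": (1, 2600, 2600),
    "file_2022.nc": (1, 1300, 1300),
}
print(validate_uniform_chunking(shapes))
```

Files that deviate from the common shape are candidates for rechunking before they are aggregated into a single reference set.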

rekx is dedicated to practicality, simplicity, and essence.

Interested ? Head over to the documentation.

To Do

  • Complete backend for rechunking, support for
    • NetCDF4
    • Xarray
    • nccopy
  • Simplify command line interface
    • merge "multi" commands into single/simple ones ?
    • make common-shape and validate options of shapes ?
    • clean up the nonsensical suggest-alternative command or merge it into suggest
    • merge reference-parquet into reference
    • as above, same for/with combine commands
    • does a separate select-fast make sense ?
    • review various select/read commands
  • Go through :
  • Write clean and meaningful docstrings for each and every function
  • Pytest each and every (?) function
  • Packaging
  • Documentation
    • Use https://squidfunk.github.io/mkdocs-material/
    • Simple examples
      • Diagnose
      • Suggest
      • Rechunk
      • Kerchunk
        • JSON
          • Create references
          • Combine references
          • Read data from aggregated reference and load in memory
        • Parquet
      • Select (aka read)
        • From Xarray-supported datasets
        • From Kerchunk references
    • Tutorial
      • Rechunking and Kerchunking SARAH3 products
    • Add visuals to Concepts

Footnotes

  1. Original T-Rex drawn by pikisuperstar on Freepik

About

rekx (or reKX from XKer) : Kerchunk after Xarray
