From 2f3600d5f23854bcb2567a21e29f5c10e1fd574f Mon Sep 17 00:00:00 2001
From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Date: Thu, 7 Dec 2023 06:04:22 +0000
Subject: [PATCH] [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci
---
 intermediate/data_tidying/05.1_intro.md | 28 ++++++++++++-------
 intermediate/data_tidying/05.2_examples.md | 6 ++--
 .../data_tidying/05.3_ice_velocity.ipynb | 8 +-----
 .../data_tidying/05.4_contributing.md | 4 +--
 intermediate/data_tidying/05.5_scipy_talk.md | 5 ++--
 5 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/intermediate/data_tidying/05.1_intro.md b/intermediate/data_tidying/05.1_intro.md
index 935b4da7..4f61292e 100644
--- a/intermediate/data_tidying/05.1_intro.md
+++ b/intermediate/data_tidying/05.1_intro.md
@@ -1,21 +1,22 @@
 # Introduction
 
-Array data that are represented by Xarray objects are often multivariate, multi-dimensional, and very complex. Part of the beauty of Xarray is that it is adaptable and scalable to represent a large number of data structures. However, this can also introduce difficulty (especially for learning users) in arriving at a workable structure that will best suit one's analytical needs.
+Array data that are represented by Xarray objects are often multivariate, multi-dimensional, and very complex. Part of the beauty of Xarray is that it is adaptable and scalable to represent a large number of data structures. However, this can also introduce difficulty (especially for learning users) in arriving at a workable structure that will best suit one's analytical needs.
 
-This project is motivated by community sentiment and experiences that often, the hardest part of learning and teaching Xarray is teaching users how best to use Xarray conceptually.
We hope to leverage the experiences of Xarray and geospatial data users to arrive at a unifying definition of 'tidy' data in this context and best practices for 'tidying' geospatial raster data represented by Xarray objects.
+This project is motivated by community sentiment and experiences that often, the hardest part of learning and teaching Xarray is teaching users how best to use Xarray conceptually. We hope to leverage the experiences of Xarray and geospatial data users to arrive at a unifying definition of 'tidy' data in this context and best practices for 'tidying' geospatial raster data represented by Xarray objects.
 
 This page discusses common data ‘tidying’ steps and presents principles to keep in mind when organizing data in Xarray. We also point out helpful extensions to simplify and automate this process for specific dataset types like satellite imagery.
 
-A great first step is familiarizing yourself with the [terminology](https://docs.xarray.dev/en/stable/user-guide/terminology.html) used in the Xarray ecosystem.
+A great first step is familiarizing yourself with the [terminology](https://docs.xarray.dev/en/stable/user-guide/terminology.html) used in the Xarray ecosystem.
 
 ## A brief primer on tidy data
 
 Tidy data was developed by Hadley Wickham for tabular datasets in the R programming language. Many resources comprehensively explain this concept and the ecosystem of tools built upon it. Below is a very brief explanation:
 
-**Data tidying** is the process of structuring datasets to facilitate analysis. Wickham writes: "...tidy datasets are all alike, but every messy dataset is messy in its own way. Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning)" (Wickham, 2014).
+**Data tidying** is the process of structuring datasets to facilitate analysis. Wickham writes: "...tidy datasets are all alike, but every messy dataset is messy in its own way.
Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning)" (Wickham, 2014).
 
 ### Tidy data principles for tabular datasets
-The concept of [tidy data](https://vita.had.co.nz/papers/tidy-data.pdf) was developed by Hadley Wickham in the R programming language, and is a set of principles to guide facilitating tabular data for analysis.
+
+The concept of [tidy data](https://vita.had.co.nz/papers/tidy-data.pdf) was developed by Hadley Wickham in the R programming language, and is a set of principles to guide facilitating tabular data for analysis.
 
 ```
 "Tidy datasets are all alike, but every messy dataset is messy in its own way." - Wickham, 2014
@@ -31,34 +32,41 @@ Wickham defines three core principles of tidy data for tabular principles. They
 
 ### Common use-case: Manipulating individual observations to an x-y-time datacube
 
-Data downloaded or accessed from DAACs and other providers is often (for good reason) separated into temporal observations or spatial subsets. This minimizes the services that must be provided for different datasets and allows the user to access just the material that they need. However, most workflows will involve some sort of spatial and/or temporal investigation of an observable, which will usually require the analyst to arrange individual files into spatial mosaics and/or temporal cubes. In addition to being a source of duplicated effort and work, these steps also introduce decision-points that can be stumbling blocks for newer users. We hope a tidy framework for xarray will streamline the process of preparing data for analysis by providing specific expectations of what 'tidied' datasets look like as well as common patterns and tools to use to arrive at a tidy state.
+Data downloaded or accessed from DAACs and other providers is often (for good reason) separated into temporal observations or spatial subsets.
This minimizes the services that must be provided for different datasets and allows the user to access just the material that they need. However, most workflows will involve some sort of spatial and/or temporal investigation of an observable, which will usually require the analyst to arrange individual files into spatial mosaics and/or temporal cubes. In addition to being a source of duplicated effort and work, these steps also introduce decision-points that can be stumbling blocks for newer users. We hope a tidy framework for xarray will streamline the process of preparing data for analysis by providing specific expectations of what 'tidied' datasets look like as well as common patterns and tools to use to arrive at a tidy state.
 
 ## Tidy data principles for Xarray data structures
 
-These are guidelines to keep in mind while you are organizing your data. For detailed definitions of the terms mentioned below (and more), check out Xarray's [Terminology page](https://docs.xarray.dev/en/stable/user-guide/terminology.html).
+These are guidelines to keep in mind while you are organizing your data. For detailed definitions of the terms mentioned below (and more), check out Xarray's [Terminology page](https://docs.xarray.dev/en/stable/user-guide/terminology.html).
+
+**1. Dimensions**
 
-**1. Dimensions**
 - Minimize the number of dimensional coordinates
 
 **2. Coordinates**
+
 - Non-dimensional coordinates can be numerous. Each should exist along one or multiple dimensions
 
 **3. Data Variables**
+
 - Data variables should be observables rather than contextual. Each should exist along one or multiple dimensions.
 
 **4. Contextual information (metadata)**
+
 - Metadata should only be stored as an attribute if it is static along the dimensions to which it is applied.
 - If metadata is dynamic, it should be stored as a coordinate variable.
-- Metadata `attrs` should be added such that dataset is self-describing (following CF-conventions)
+- Metadata `attrs` should be added such that dataset is self-describing (following CF-conventions)
 
 **5. Variable, attribute naming**
+
 - **Wherever possible, use cf-conventions for naming**
 - Variable names should be descriptive
 - Variable names should not contain information that belongs in a dimension or coordinate (ie. information stored in a variable name should be reduced to only the observable the variable describes.
 
 **6. Make us of & work within the framework of other tools**
+
 - Specification systems such as [CF]() and [STAC](https://stacspec.org/en), and related tools such as [Open Data Cube](https://www.opendatacube.org/), [PySTAC](https://pystac.readthedocs.io/en/stable/), [cf_xarray](https://cf-xarray.readthedocs.io/en/latest/),[stackstac](https://stackstac.readthedocs.io/en/latest/) and more make tidying possible and smoother, especially with large, cloud-optimized datasets.
--
+-
+
 ## Other guidelines and rules of thumb
 
 - Avoid storing important data in filenames
diff --git a/intermediate/data_tidying/05.2_examples.md b/intermediate/data_tidying/05.2_examples.md
index 45ef9aad..51f6229d 100644
--- a/intermediate/data_tidying/05.2_examples.md
+++ b/intermediate/data_tidying/05.2_examples.md
@@ -5,14 +5,12 @@ This page contains examples of 'tidying' datasets. If you have an example you'd
 ## 1. Aquarius
 
 This is an example of tidying a dataset comprised of locally downloaded files. Aquarius is a sea surface salinity dataset produced by NASA and accessed as network Common Data Form (NetCDF) files.
-You can find this example [here](https://gist.github.com/dcherian/66269bc2b36c2bc427897590d08472d7). This example focuses on data access steps and organizing data into a workable data cube.
+You can find this example [here](https://gist.github.com/dcherian/66269bc2b36c2bc427897590d08472d7).
This example focuses on data access steps and organizing data into a workable data cube.
 
 ## 2. ASE Ice Velocity
 
-Already integrated into the Xarray tutorial, this examples uses an ice velocity dataset derived from synthetic aperture radar imagery. You can find it [here](https://tutorial.xarray.dev/intermediate/data_tidying/05.3_ice_velocity.html). This example focuses on data access steps and organizing data into a workable data cube.
+Already integrated into the Xarray tutorial, this examples uses an ice velocity dataset derived from synthetic aperture radar imagery. You can find it [here](https://tutorial.xarray.dev/intermediate/data_tidying/05.3_ice_velocity.html). This example focuses on data access steps and organizing data into a workable data cube.
 
 ## 3. Harmonized Landsat-Sentinel
 
 This [example](https://nbviewer.org/gist/scottyhq/efd583d66999ce8f6e8bcefa81545b8d) features cloud-optimized data that does not need to be downloaded locally. Here, package such as [`odc-stac`](https://github.com/opendatacube/odc-stac) are used to accomplish much of the initial tidying (assembling an x,y,time cube). However, this example shows that there is frequently additional formatting required to make a dataset analysis ready.
-
-
diff --git a/intermediate/data_tidying/05.3_ice_velocity.ipynb b/intermediate/data_tidying/05.3_ice_velocity.ipynb
index fbb64439..5172cb96 100644
--- a/intermediate/data_tidying/05.3_ice_velocity.ipynb
+++ b/intermediate/data_tidying/05.3_ice_velocity.ipynb
@@ -473,11 +473,6 @@
   }
  ],
  "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
-  },
   "language_info": {
    "codemirror_mode": {
     "name": "ipython",
@@ -487,8 +482,7 @@
    "mimetype": "text/x-python",
    "name": "python",
    "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.11.3"
+   "pygments_lexer": "ipython3"
   }
  },
  "nbformat": 4,
diff --git a/intermediate/data_tidying/05.4_contributing.md b/intermediate/data_tidying/05.4_contributing.md
index 5ba9d082..4473b050 100644
--- a/intermediate/data_tidying/05.4_contributing.md
+++ b/intermediate/data_tidying/05.4_contributing.md
@@ -1,5 +1,5 @@
 ## Contributing
 
-This project is an evolving community effort. **We want to hear from you!**. Many workflows involve some version of the examples discussed here. The solutions you've developed in your work could help future users and help the community move toward more established norms around tidy data. Please consider submitting any examples you may have. You can create an issue [here](https://github.com/e-marshall/tidy-xarray/issues/new?assignees=&labels=&projects=&template=data-tidying-example-template.md&title=).If you have any questions or topics you'd like to discuss, please don't hesitate to create an issue on github.
+This project is an evolving community effort. **We want to hear from you!**. Many workflows involve some version of the examples discussed here. The solutions you've developed in your work could help future users and help the community move toward more established norms around tidy data. Please consider submitting any examples you may have.
You can create an issue [here](https://github.com/e-marshall/tidy-xarray/issues/new?assignees=&labels=&projects=&template=data-tidying-example-template.md&title=).If you have any questions or topics you'd like to discuss, please don't hesitate to create an issue on github.
 
-*note: issue template has some errors currently, need to fix*
\ No newline at end of file
+_note: issue template has some errors currently, need to fix_
diff --git a/intermediate/data_tidying/05.5_scipy_talk.md b/intermediate/data_tidying/05.5_scipy_talk.md
index bd1e10d3..a78573f0 100644
--- a/intermediate/data_tidying/05.5_scipy_talk.md
+++ b/intermediate/data_tidying/05.5_scipy_talk.md
@@ -1,10 +1,11 @@
 # SciPy 2023
 
-This project was initially presented at the 2023 SciPy conference in Austin, TX. You can check out the slides and a recording of the presentation below.
+This project was initially presented at the 2023 SciPy conference in Austin, TX. You can check out the slides and a recording of the presentation below.
 
 ## Slides
+
 The presentation slides are avaialble through the [2023 SciPy Conference Proceedings](https://conference.scipy.org/proceedings/scipy2023/slides.html) and can be downloaded [here](https://zenodo.org/records/8221167).
 
 ## Recording
 
-A recording of the presentation is available [here](https://www.youtube.com/watch?v=KZlG1im088s).
\ No newline at end of file
+A recording of the presentation is available [here](https://www.youtube.com/watch?v=KZlG1im088s).