WIP: revise top-level package description #2430

Merged: 9 commits into pydata:master on Jan 6, 2019


rabernat
Contributor

I have often complained that xarray's top-level package description assumes that the user knows all about pandas. I think this alienates many new users.

This is a first draft of a revised top-level description. Feedback from the community is very much needed here.

@fujiisoup
Member

fujiisoup commented Sep 23, 2018

Thanks, @rabernat

I have often complained that xarray's top-level package description assumes that the user knows all about pandas.

Agreed!

I remember that when I first encountered xarray, I didn't really understand what the term 'labelled data' means.
I don't think this terminology is very common.
It might be nice to have a more explicit definition, something like 'xarray provides a data structure to handle a data array and its coordinates consistently'.
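
To make "a data array and its coordinates handled consistently" concrete, here is a minimal sketch contrasting positional NumPy indexing with label-based selection in xarray. The variable names, dimension names, and values are invented for illustration and do not come from the xarray documentation.

```python
import numpy as np
import xarray as xr

# Hypothetical data: a measurement at 3 times and 2 locations.
values = np.array([[10.0, 20.0],
                   [11.0, 21.0],
                   [12.0, 22.0]])

# Plain NumPy: the caller has to remember that axis 0 is time
# and that column 1 corresponds to location "B".
numpy_result = values[:, 1].mean()

# xarray: dimension names and coordinate labels travel with the data.
temperature = xr.DataArray(
    values,
    dims=("time", "location"),
    coords={"time": [2000, 2001, 2002], "location": ["A", "B"]},
)

# Select by label and reduce over a named dimension.
xarray_result = temperature.sel(location="B").mean(dim="time")

print(numpy_result, float(xarray_result))
```

Both computations give the same number; the difference is that the xarray version states which location and which dimension are involved instead of relying on axis positions.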

doc/index.rst Outdated

Labelled multi-dimensional (a.k.a. N-dimensional) arrays are encountered in
many fields, especially physical sciences, engineering, and finance.
But multi-dimensional data doesn't fit neatly into pandas_, python's most
Member

If we are going to contrast directly with Pandas, I think we need to say what Pandas is first. Maybe also provide an example of what Pandas does (tabular data structures).

Member

@TomNicholas TomNicholas Oct 31, 2018

I agree that we can't assume that readers know what Pandas is - I certainly didn't. I think that users coming from a more data science background will have used Pandas but those coming from a more low-level array-based numpy/MATLAB/Fortran/C++ point-of-view won't have (e.g. all the physicists I work with).

I also think including an explicit example of a labelled data structure in this explanation would go a long way; the printable representation of an xarray Dataset gives a good idea of how it labels the data it contains.
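
As a rough illustration of that suggestion, the sketch below builds a small Dataset and prints it. The variable names, coordinates, and values are hypothetical, and the exact layout of the printed representation depends on the installed xarray version.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset: two variables sharing the same labelled dimensions.
ds = xr.Dataset(
    data_vars={
        "temperature": (("time", "location"), np.zeros((3, 2))),
        "pressure": (("time", "location"), np.ones((3, 2))),
    },
    coords={"time": [2000, 2001, 2002], "location": ["A", "B"]},
)

# The printed representation lists the dimensions, coordinates,
# and data variables, which is where the "labels" become visible.
print(ds)
```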

@jhamman
Member

jhamman commented Sep 23, 2018

As a reference, we recently wrote some similar prose for xarray's numfocus page:

https://numfocus.org/project/xarray

Xarray is an open source library providing high-level, easy-to-use data structures and analysis tools for working with multidimensional labeled datasets and arrays in Python.

Xarray is a Python library that provides data structures and tools for working with multidimensional labeled datasets and arrays. Xarray enables users to perform operations on complex datasets. Xarray interoperates with many of the core libraries in the scientific Python ecosystem making it a powerful high-level tool for data analysis.

Xarray has been used in a wide variety of academic and industry contexts for applications as varied as weather/climate, computational physics, astronomy, biology, econometrics, machine learning and finance. It is a core component of Pangeo, a community platform for Big Data geoscience.

Examples of results enabled by xarray include:

modeling the environmental and socioeconomic impacts of climate change
understanding the life cycle of viruses from single-cell RNA sequencing data
measuring the speed of galaxies in a telescope survey

Contributor

@dcherian dcherian left a comment

I had trouble with the phrase "labelled data" too. I've added an example that maybe helps clear that bit up.

doc/index.rst Outdated
popular data analysis package focused on label tabular data.
Xarray provides a pandas-like and pandas-compatible toolkit for
analytics on multi-dimensional arrays.
Our approach adopts the `Common Data Model`_ for self-
describing scientific data in widespread use in the Earth sciences:
``xarray.Dataset`` is an in-memory representation of a netCDF file.
Collaborator

This is not completely accurate: an xarray.Dataset represents a netCDF-3 or netCDF-4 classic file, but only one of the Groups in a netCDF-4 file that uses the new netCDF-4 Data Model https://www.unidata.ucar.edu/software/netcdf/workshops/2011/datamodels/Nc4-uml.html (compatible but not identical with the cited Common Data Model). This may sound pedantic at this level, but I found the subtleties of the netCDF 3/4 data models very hard to grasp once I had the mental map between an xarray.Dataset and a netCDF-4 file.

IMHO the best option is to keep the reference to the Unidata Common Data Model, since xarray uses the extended type system, and to add a quick reference to the CDM concept of a Group.

dcherian and others added 3 commits January 4, 2019 21:07
Co-Authored-By: rabernat <ryan.abernathey@gmail.com>
Co-Authored-By: rabernat <ryan.abernathey@gmail.com>
Co-Authored-By: rabernat <ryan.abernathey@gmail.com>
@rabernat
Contributor Author

rabernat commented Jan 4, 2019

Given this recent twitter thread, I think we should revive and finish this off.

@rabernat
Contributor Author

rabernat commented Jan 4, 2019

Based on the comments I received, I have written a second draft of a revised top-level description.

Xarray also provides a large and growing library of functions for advanced
analytics and visualization with these data structures.
Xarray was inspired by and borrows heavily from pandas_, a highly popular data
analysis package focused on labelled tabular data.
Member

It would be nice to still see the words "netCDF" somewhere (or maybe that's implicit in our mentioning of the "Common Data Model"?).

Roughly speaking we have three audiences here:

  • NumPy users who want labels
  • pandas users who want to work with higher-dimensional data
  • netCDF users who want good in-memory data-structures

Contributor Author

I removed it in response to @alexamici's comments. But in retrospect I agree that it belongs there. (I personally had never heard of CDM before xarray.)

Member

I would prioritize mentioning netCDF over the CDM and maybe drop CDM entirely from the brief intro. I don't think many people know what the "common data model" refers to, and worse it seems to be a heavily overloaded term, even in technical contexts (e.g., the top hit from Google is something unrelated from Microsoft).

Collaborator

Roughly speaking we have three audiences here:

* NumPy users who want labels

* pandas users who want to work with higher-dimensional data

* netCDF users who want good in-memory data-structures

This seems key enough that I might even put this somewhere in the docs?

and

* pandas users who want to work with higher-dimensional data
->
* pandas users who want to work with higher-dimensional data and an explicit, production-capable API

Member

This might be good stuff to add to the “Why xarray” page.

doc/index.rst Outdated
are an essential part of computational science.
They are encountered in a wide range of fields, including physics, astronomy,
geoscience, bioinformatics, engineering, finance, and deep learning.
In python, numpy_ provides the fundamental data structure and API for
Member

nit: numpy -> NumPy

doc/index.rst Outdated
However, real-world datasets are usually more than just raw numbers;
they have "labels" which encode information about how the array values map
to locations in space, time, etc.
By adopting the the `Common Data Model`_ for self-describing scientific data,
Member

the the -> the

Member

Maybe (I'm not actually sure this is better):
By adopting the self-describing data model of the netCDF file format

@shoyer
Member

shoyer commented Jan 4, 2019

This looks great now. Could you also kindly copy it into our setup.py and README.rst files?

@@ -2,19 +2,33 @@ xarray: N-D labeled arrays and datasets in Python
=================================================

**xarray** (formerly **xray**) is an open source project and Python package
Collaborator

@shoyer can we drop the reference to xray? The set of people that know the old xray and don't know the new xarray name is probably next to empty.

Contributor Author

Sadly, just today in the twitter thread under discussion, someone referenced xray and linked to the v0.2 documentation. 🤦‍♂️

doc/index.rst Outdated
In python, numpy_ provides the fundamental data structure and API for
working with raw ND arrays.
However, real-world datasets are usually more than just raw numbers;
they have "labels" which encode information about how the array values map
Collaborator

Not sure we need " around labels?

Contributor Author

removed

@max-sixty
Collaborator

This is looking great!

@rabernat
Contributor Author

rabernat commented Jan 5, 2019

Ready I think.

doc/index.rst Outdated
that makes working with labelled multi-dimensional arrays simple,
efficient, and fun!

Multi-dimensional (a.k.a. N-dimensional, ND) arrays (somtimes called "tensors")
Member

somtimes -> sometimes

@shoyer
Member

shoyer commented Jan 5, 2019

@rabernat I pushed some minor tweaks to your branch, please take a look!

@rabernat
Contributor Author

rabernat commented Jan 6, 2019

@shoyer - all your changes are 👍 with me.

@dcherian dcherian merged commit a0bbea8 into pydata:master Jan 6, 2019
shoyer pushed a commit that referenced this pull request Jan 6, 2019
* revise main package description

* Update doc/index.rst

Co-Authored-By: rabernat <ryan.abernathey@gmail.com>

* Update doc/index.rst

Co-Authored-By: rabernat <ryan.abernathey@gmail.com>

* Update doc/index.rst

Co-Authored-By: rabernat <ryan.abernathey@gmail.com>

* next draft

* add mention of netCDF

* eliminate CDM reference

* update README and setup.py

* Split long paragraph, minor rewordings
@shoyer
Member

shoyer commented Jan 7, 2019

I spent a few more hours working on this this afternoon -- please take a look at #2657!

dcherian pushed a commit to yohai/xarray that referenced this pull request Jan 24, 2019
* master:
  Remove broken Travis-CI builds (pydata#2661)
  Type checking with mypy (pydata#2655)
  Added Coarsen (pydata#2612)
  Improve test for GH 2649 (pydata#2654)
  revise top-level package description (pydata#2430)
  Convert ref_date to UTC in encode_cf_datetime (pydata#2651)
  Change an `==` to an `is`. Fix tests so that this won't happen again. (pydata#2648)
  ENH: switch Dataset and DataArray to use explicit indexes (pydata#2639)
  Use pycodestyle for lint checks. (pydata#2642)
  Switch whats-new for 0.11.2 -> 0.11.3
  DOC: document v0.11.2 release
  Use built-in interp for interpolation with resample (pydata#2640)
  BUG: pytest-runner no required for setup.py (pydata#2643)
shoyer pushed a commit that referenced this pull request Jan 26, 2019