Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_dict without data #2659

Merged
merged 5 commits into from
Jan 21, 2019
Merged

to_dict without data #2659

merged 5 commits into from
Jan 21, 2019

Conversation

rabernat
Copy link
Contributor

@rabernat rabernat commented Jan 7, 2019

This PR provides the ability to export Datasets and DataArrays to dictionary without the actual data. This could be useful for generating indices of dataset contents to expose to search indices or other automated data discovery tools

In the process of doing this, I refactored the core dictionary export function to live in the Variable class, since the same code was duplicated in several places.

@pep8speaks
Copy link

pep8speaks commented Jan 7, 2019

Hello @rabernat! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 08, 2019 at 08:43 Hours UTC

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a couple of nits


d.update({'data': ensure_us_time_resolution(self.values).tolist(),
'name': self.name})
d['coords'].update({k: self.coords[k].variable.to_dict(data=data)})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only adding one item, so just using indexing for assignment seems cleaner: d['coords'][k] = self.coords[k].variable.to_dict(data=data)

@@ -2909,6 +2909,13 @@ def test_to_and_from_dict(self):
ValueError, "cannot convert dict without the key 'data'"):
DataArray.from_dict(d)

# check the data=False option
expected_no_data = {**expected}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is a fancy Py3 way of doing a copy? :)

I would lean towards the more explicit expected.copy() given that you aren't inserting any extra fields here.

@rabernat
Copy link
Contributor Author

rabernat commented Jan 7, 2019

It just occurred to me that it would be nice to have some extra info about the data, such as dtype.

@@ -3045,11 +3045,20 @@ def test_to_and_from_dict(self):
# check roundtrip
assert_identical(ds, Dataset.from_dict(actual))

# check the data=False option
expected_no_data = expected.copy()
print(expected_no_data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a forgotten print here!

@fmaussion
Copy link
Member

It just occurred to me that it would be nice to have some extra info about the data, such as dtype.

II was about to ask about it ;-). What other attribute than dtype were you thinking about?

@shoyer
Copy link
Member

shoyer commented Jan 7, 2019

What other attribute than dtype were you thinking about?

Shape might also be a good idea.

@rabernat
Copy link
Contributor Author

rabernat commented Jan 8, 2019

Here is what the .to_dict(data=False) output looks like from the test example

x = np.random.randn(10)
y = np.random.randn(10)
t = list('abcdefghij')
ds = Dataset(OrderedDict([('a', ('t', x)),
                          ('b', ('t', y)), ('t', ('t', t))]))
ds.to_dict(data=False)
{'attrs': {},
 'coords': {'t': {'attrs': {},
   'dims': ('t',),
   'dtype': '<U1',
   'shape': (10,)}},
 'data_vars': {'a': {'attrs': {},
   'dims': ('t',),
   'dtype': 'float64',
   'shape': (10,)},
  'b': {'attrs': {}, 'dims': ('t',), 'dtype': 'float64', 'shape': (10,)}},
 'dims': {'t': 10}}

The one thing I don't like about this is the empty attributes. Maybe we could change it so that attrs is only included if it is non-empty. Alternatively, we coiuld add a squeeze_attrs option.

@dcherian
Copy link
Contributor

dcherian commented Jan 8, 2019

Does this help with #2347?

@rabernat
Copy link
Contributor Author

Maybe we could change it so that attrs is only included if it is non-empty.

How do people feel about this? If folks are fine with it as is, then LGTM.

@rabernat rabernat mentioned this pull request Jan 11, 2019
8 tasks
@fmaussion
Copy link
Member

fmaussion commented Jan 11, 2019

Maybe we could change it so that attrs is only included if it is non-empty.

How do people feel about this? If folks are fine with it as is, then LGTM.

I'd rather have an empty dict as in the current implementation (like xarray)

@rabernat
Copy link
Contributor Author

Let's get this merged?

----------
data : bool, optional
Whether to include the actual data in the dictionary. When set to
False, returns just the schema.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be useful to allow including the data for the coordinates but not the data variables?

I'm thinking something like

ds.to_dict(data='coords')

@shoyer
Copy link
Member

shoyer commented Jan 21, 2019

I think we should merge this -- @rabernat feel free to go ahead and do that.

We can leave data='coords' for a follow-up, if anyone has real use-cases for it.

@rabernat rabernat merged commit a7d55b9 into pydata:master Jan 21, 2019
dcherian pushed a commit to dcherian/xarray that referenced this pull request Jan 24, 2019
* master:
  stale requires a label (pydata#2701)
  Update indexing.rst (pydata#2700)
  add line break to message posted (pydata#2698)
  Config for closing stale issues (pydata#2684)
  to_dict without data (pydata#2659)
  Update asv.conf.json (pydata#2693)
  try no rasterio in py36 env (pydata#2691)
  Detailed report for testing.assert_equal and testing.assert_identical (pydata#1507)
  Hotfix for pydata#2662 (pydata#2678)
  Update README.rst (pydata#2682)
  Fix test failures with numpy=1.16 (pydata#2675)
dcherian pushed a commit to yohai/xarray that referenced this pull request Jan 24, 2019
* refactor-plot-utils: (22 commits)
  review comment.
  small rename
  stale requires a label (pydata#2701)
  Update indexing.rst (pydata#2700)
  add line break to message posted (pydata#2698)
  Config for closing stale issues (pydata#2684)
  to_dict without data (pydata#2659)
  Update asv.conf.json (pydata#2693)
  try no rasterio in py36 env (pydata#2691)
  Detailed report for testing.assert_equal and testing.assert_identical (pydata#1507)
  Hotfix for pydata#2662 (pydata#2678)
  Update README.rst (pydata#2682)
  Fix test failures with numpy=1.16 (pydata#2675)
  lint
  Back to map_dataarray_line
  Refactor out cmap_params, cbar_kwargs processing
  Refactor out colorbar making to plot.utils._add_colorbar
  flake8
  facetgrid refactor
  Refactor out utility functions.
  ...
@rabernat
Copy link
Contributor Author

rabernat commented Feb 6, 2019

Too late, I found this:

https://binary-array-ld.github.io/netcdf-ld/

@rabernat
Copy link
Contributor Author

And this
http://cf-json.org/specification

@rabernat
Copy link
Contributor Author

p.s. I learned about this from @kwilcox in pangeo-data/pangeo-datastore#3. He might be a good person to loop into this discussion.

@kwilcox
Copy link

kwilcox commented Feb 12, 2019

If you are interested I could implement an xarray -> cf-json -> xarray round-trip. It will be a bit different from to_dict in terms of carrying through variable and attribute types so the file can be re-produced from the JSON. They both have their use-cases!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants