
Solar astronomy data on GCS bucket #269

Closed · mrocklin opened this issue May 21, 2018 · 60 comments
@mrocklin (Member) commented May 21, 2018

@MSKirk was kind enough to upload a number of FITS files that correspond to a solar astronomy dataset.

I can load a little bit of the data with the following steps:

import gcsfs
from astropy.io import fits

fs = gcsfs.GCSFileSystem()
urls = ['https://storage.googleapis.com/' + fn
        for fn in fs.ls('pangeo-data/SDO_AIA_Images/094')]

[img] = fits.open(urls[0])  # each file holds a single HDU
x = img.data

cc @Cadair, @DanRyanIrish, @martindurant, @mmccarty

@martindurant (Contributor) commented May 21, 2018

Some FITS I/O functions work with arbitrary file-like objects; some do not. This means that we could relatively easily construct dask arrays from sets of FITS files:

import gcsfs
from astropy.io import fits

gcs = gcsfs.GCSFileSystem()
with gcs.open('pangeo-data/SDO_AIA_Images/094/aia_lev1_94a_2012_09_23t05_38_37_12z_image_lev1_fits.fits') as f:
    img = fits.getdata(f)

Edit: let me flesh that out a bit. We could use:

import numpy as np
import dask.array as da
import dask.bytes
from dask import delayed
from astropy.io import fits

files = dask.bytes.open_files('gcs://pangeo-data/SDO_AIA_Images/094/*.fits')
with files[0] as f:
    # grab one header; assume we know which extension we need
    head = fits.getheader(f)
naxes = head['NAXIS']
dtype = np.float32  # not sure how to derive from header (see note below)
shape = head['NAXIS2'], head['NAXIS1']  # 2D

def file_to_arr(fn):
    # fn is a dask OpenFile; entering the context yields a file-like object
    with fn as f:
        return fits.getdata(f)

arrs = [da.from_delayed(delayed(file_to_arr)(fn), shape=shape, dtype=dtype)
        for fn in files]
arr = da.stack(arrs)
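
(An aside on the dtype question above: the FITS BITPIX keyword encodes the on-disk pixel type, so one hedged way to derive it, ignoring any BSCALE/BZERO rescaling, is:)

import numpy as np

def dtype_from_header(head):
    # BITPIX: 8 = unsigned byte, 16/32/64 = signed ints, -32/-64 = floats
    bitpix = head['BITPIX']
    if bitpix < 0:
        return np.dtype('float%d' % -bitpix)
    if bitpix == 8:
        return np.dtype('uint8')
    return np.dtype('int%d' % bitpix)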

@Cadair commented May 21, 2018

This is awesome! I will have a go at utilizing some of my DKIST code for this tomorrow.

@rabernat (Member) commented:

This is great! I welcome the involvement of astronomers here.

As the number of datasets continues to grow, it would be good to establish an authoritative catalog. Hopefully this can be Intake (#39), but I doubt it works with the FITS format.

@MSKirk commented May 21, 2018

I have about 460 GB of data (12,600 images) queued up for upload to the pangeo-data bucket. These images weren't thoughtfully chosen; they were just sitting on an external hard disk of mine. They represent 7 different EUV wavelengths in which the SDO AIA instrument observes the Sun. Let me know if you have any data-specific questions.

@martindurant (Contributor) commented:

There is no plugin for FITS in Intake yet; this is a question of demand :)

The outstanding problem is actually the coordinate systems (WCS) that FITS typically bundles, which in general are not aligned with the axes, and may not even be linear. It would be OK to have an array-and-WCS pair, but that model doesn't fit easily into the xarray coordinate-axes model.

@Cadair commented May 21, 2018

@MSKirk are they level 1.5 images, so they effectively all share the same WCS (along the time axis)?

@mrocklin (Member, Author) commented May 21, 2018 via email

@MSKirk commented May 22, 2018

@Cadair, these are level 1 images, but they were taken during a period of spacecraft stability, so any corrections the prep routines make will be relatively small. When I am back at my desk, I can work on a level 1.5 dataset.

@martindurant (Contributor) commented:

Actually, straightforward fits.open also seems to work on a gcsfs or dask file, so we can:

  • get the header and data array(s) in one function call, e.g., if we needed some key from the header to know which part of a global array the data were to be inserted into;
  • get only a specific part of the array of any given file using hdu.section[], allowing for chunking within files, not just between them (see the sketch below).
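
A minimal sketch of that second point, assuming the image lives in the primary HDU (the filename is one of the files mentioned earlier in the thread):

import gcsfs
from astropy.io import fits

fs = gcsfs.GCSFileSystem()
path = 'pangeo-data/SDO_AIA_Images/094/aia_lev1_94a_2012_09_23t05_38_37_12z_image_lev1_fits.fits'
with fs.open(path) as f:
    hdul = fits.open(f)
    head = hdul[0].header                    # header and data from one open call
    tile = hdul[0].section[0:1024, 0:1024]   # read only one corner of the array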

@Cadair commented May 22, 2018

So I have been playing for the last hour or so, and I have got as far as this: http://pangeo.pydata.org/user/cadair/lab/tree/examples%2FAIA1.ipynb (I don't know if you can actually follow that link?)

I made a conda env to run it, but the cluster seems to have issues with that. What's the best way to handle dependencies and such?

@martindurant (Contributor) commented:

We cannot follow that link - perhaps make it a gist?

@mrocklin (Member, Author) commented:

You should be able to conda/pip install into your notebook directory. However, in order to use distributed workers, you will also need to tell them which libraries to install when they start up. You can do this easily by modifying the worker-template.yaml file in your home directory. You'll want to extend the env: values as follows:

    env:
      - name: GCSFUSE_BUCKET  # this was here before
        value: pangeo-data
      - name: EXTRA_PIP_PACKAGES
        value: foo bar git+https://github.com/user/package.git@branch
      - name: EXTRA_CONDA_PACKAGES
        value: baz quuz -c conda-forge

I generally prefer pip packages over conda in this case, because the installation happens for every worker every time it starts up, and conda takes a while.

@Cadair commented May 24, 2018

I made it work 🎉 https://gist.github.com/Cadair/f65fbcb593c6d70eb8f3f3e48324f7c3

I used a cluster to read all the headers of all the files, then used this information to construct a WCS object and fed it all into NDCube, so we now have a coordinate-aware Dask array.

I did briefly try to construct a Dask DataFrame out of all the headers in all the FITS files, but I didn't get it to work and I have a limited amount of time for this today!
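
(For readers without the gist at hand, a rough sketch of the shape of that construction; the WCS keyword values below are placeholders, not the real AIA values:)

from astropy.wcs import WCS
from ndcube import NDCube

# `arr` is the stacked (time, y, x) dask array built earlier in the thread.
# WCS axes are given in FITS order (x, y, time), i.e. reversed w.r.t. numpy.
wcs = WCS(naxis=3)
wcs.wcs.ctype = ['HPLN-TAN', 'HPLT-TAN', 'UTC']   # lon, lat, time (placeholders)
wcs.wcs.cdelt = [0.6, 0.6, 12.0]                  # arcsec, arcsec, seconds
wcs.wcs.crpix = [2048.5, 2048.5, 1.0]

cube = NDCube(arr, wcs=wcs)   # coordinate-aware, and the data stay lazy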

@Cadair commented May 24, 2018

(oh the pip requirement is just ndcube as it pulls in everything else)

@martindurant (Contributor) commented:

@Cadair, in this particular example the coordinates are well-aligned with the axes, so this would be a good case for xarray as the container, especially if you were to have multiple cubes, one per pass-band. Not that I have anything against ndcube; I just don't know (yet) what advantages it has.

Does ndcube reference the dask-array lazily, or does the constructor ndcube.NDCube(arr, wcs=wcs) embed an asarray that would force the array to be materialized?

@Cadair commented May 24, 2018

While the coordinates are in practice well aligned, the spatial coordinates are subject to a TAN map projection, so they are not really appropriate for xarray, as we discussed on the call. NDCube has just been built to be coordinate-aware through astropy.wcs so that it can do the coordinate transforms in the "native" way.

NDCube should only reference the array lazily; I haven't yet encountered anywhere that it doesn't.

@rabernat (Member) commented May 24, 2018

> fed it all into NDCube, so we now have a coordinate-aware Dask array.

As a physical oceanographer / climate modeler, it is very interesting to me to learn how other disciplines handle their datasets. In this case, ndcube sounds quite similar to xarray, so there is possibly some opportunity for synergy / collaboration here.

> the spatial coordinates are subject to a TAN map projection, so they are not really appropriate for xarray

This is not fundamentally different from the situation with geographic data. Earth observations and models use different projections and coordinate systems, and we often have to transform from one coordinate system to another. Xarray is not limited to Cartesian geometry: it supports auxiliary non-dimension coordinates for the purpose of storing such information. However, since these projection / regridding operations are often domain-specific, they are not included in xarray itself. We use packages like cartopy or xesmf to handle the projection problem; these packages can work directly with xarray data structures, making them more flexible and less error-prone.

People often assume xarray's data model is more limited than it really is. I think this is a problem with our documentation. It's simply a general purpose container for storing labeled, multi-dimensional arrays. Being built closely around dask from the beginning, however, gives it some real advantages.

@rabernat (Member) commented:

To clarify, I am not suggesting anyone abandon widely established and successful packages such as ndcube. Just commenting out of interest for the sake of discussion. I'm sure you're very happy with your software stack. 😉

@Cadair commented May 24, 2018

I am very interested in potential uses of xarray for this type of data. I am also very interested to hear how you might handle the problem of two of the axes having their coordinates coupled (in this case via map projection), because the last time I looked into xarray I couldn't find a way of reconciling this with the way xarray works.

ndcube is still quite new (although built on top of older astropy tech), and we did evaluate xarray for the problem when we started working on it a little over a year ago. I think one of the major limitations could be the practical requirement to use astropy WCS functionality for the computation of some or all of the coordinate axes.

@Cadair commented May 24, 2018

I have now updated the gist with what I think is a good implementation of "don't load the whole FITS array into memory" when doing a slicing operation. I would love to hear what @martindurant or anyone else thinks about this way of doing it.
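
(For anyone who can't open the gist, a hedged guess at the general pattern rather than Cadair's actual code: wrap a windowed hdu.section read in dask.delayed, so slicing never materializes the full array:)

import dask.array as da
from dask import delayed
from astropy.io import fits

def read_window(open_file, row0, row1):
    # open lazily and pull only the requested rows of the primary HDU
    with open_file as f:
        return fits.open(f)[0].section[row0:row1, :]

# `files`, `shape` and `dtype` as defined earlier in the thread
window = da.from_delayed(delayed(read_window)(files[0], 0, 512),
                         shape=(512, shape[1]), dtype=dtype)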

@rabernat (Member) commented May 24, 2018

> I am also very interested to hear how you might handle the problem of two of the axes having their coordinates coupled (in this case via map projection), because the last time I looked into xarray I couldn't find a way of reconciling this with the way xarray works.

So in this case, we would distinguish between the logical coordinates of the array (must be 1D; one per axis) and the physical coordinate in space (possibly 2D or higher dimensionality). There is an extensive example of this in the xarray docs:
http://xarray.pydata.org/en/latest/examples/multidimensional-coords.html

A typical climate dataset (used in the example above) would have the following structure:

<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) datetime64[ns] 1980-09-16T12:00:00 1980-10-17 ...
    xc       (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
    yc       (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...

Tair is the air temperature, and its dimensions are (time, y, x). There aren't any actual coordinates for x and y; they are just logical array indices. However, there are additional "non-dimension coordinates" xc and yc, both with dimensions (y, x), which represent the latitudes and longitudes of the points. Working with cartopy, xarray can plot the data in geographic coordinates:
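
A short sketch of that plotting step, assuming the xarray tutorial copy of this dataset ('rasm'):

import xarray as xr
import cartopy.crs as ccrs
import matplotlib.pyplot as plt

ds = xr.tutorial.open_dataset('rasm')       # the dataset shown in the repr above
ax = plt.axes(projection=ccrs.Orthographic(-80, 35))
ds.Tair.isel(time=0).plot.pcolormesh(
    ax=ax, transform=ccrs.PlateCarree(),    # xc/yc hold plain lat/lon values
    x='xc', y='yc')
ax.coastlines()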

[figure: the resulting plot, Tair rendered in geographic coordinates]

We are still a long way from full "multidimensional coordinate" support. These open issues describe some improvements we would like to make: pydata/xarray#2028, pydata/xarray#1325, pydata/xarray#1553. Progress on these issues is limited by developer time; that's why I am eager to reach out to scientific communities with similar needs to see if there are ways we can work together. Perhaps the transformation routines you have in ndcube could help resolve some xarray issues!

Will any of you folks be at SciPy? I personally won't be, but @jhamman and lots of other pangeo / xarray contributors will. Perhaps we should consider a Birds of a Feather proposal on labeled nd-arrays.

@Cadair commented May 24, 2018

That's really cool! I will have to take a look at some point. I think what astro folks would need is for xc and yc in your example to be calculated on demand from a function rather than having to store them as arrays.

I will be at SciPy this year, a BoF sounds like a pretty good idea!

@jhamman (Member) commented May 24, 2018

+1 for a BoF event at SciPy. Proposals are due June 27.

@rabernat (Member) commented:

@Cadair, I just checked the cluster status and I noticed something odd

$ kubectl get pods --namespace=pangeo
NAME                                           READY     STATUS      RESTARTS   AGE
dask-cadair-5014ba80-f2r4lw                    0/1       OutOfcpu    0          1h
dask-cadair-5014ba80-fbczvs                    0/1       OutOfcpu    0          1h
dask-cadair-5014ba80-fd5d8j                    0/1       OutOfcpu    0          1h
dask-cadair-5014ba80-fgsczn                    0/1       OutOfcpu    0          1h
dask-cadair-5014ba80-fgvblt                    0/1       OutOfcpu    0          1h
dask-cadair-5014ba80-fxlrvj                    0/1       OutOfcpu    0          1h

Did you notice anything strange? I've not encountered that status before.

@Cadair commented May 24, 2018

Yeah, I have been having a few problems with things behaving strangely on me. I am not really sure what to report, as I don't fully understand what's going on 😆

@rabernat (Member) commented:

My understanding of kubernetes is also low, but my impression is that your dask worker pods are exceeding their CPU quota and being limited or stopped. I'm not sure how this is possible. Does your code use multithreading?

@Cadair commented May 24, 2018

No, I am just using a plain dask array and a map with a plain Python function (doing bytes I/O with dask), plus gather on the distributed client.
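
(Roughly this pattern, sketched here with illustrative names; files is the list from dask.bytes.open_files earlier in the thread:)

from dask.distributed import Client
from astropy.io import fits

client = Client()  # or the scheduler address of the existing pangeo cluster

def read_header(open_file):
    # plain Python function doing bytes I/O via a dask OpenFile
    with open_file as f:
        return fits.getheader(f)

futures = client.map(read_header, files)
headers = client.gather(futures)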

@martindurant (Contributor) commented:

On the OutOfcpu status: kubernetes/kubernetes#42433 may be relevant.

@stale stale bot added the stale label Aug 25, 2018
@MSKirk commented Aug 28, 2018

Sorry, I have been distracted by other projects for the last couple of months. @Cadair, have you continued working with solar images in dask?

@stale stale bot removed the stale label Aug 28, 2018
@martindurant (Contributor) commented:

I hope you are following https://github.com/ContinuumIO/intake-astro , which now has a partitioned FITS array and table reader that may already be generally useful.

@rabernat (Member) commented Oct 1, 2018

Hi everyone. I was invited to give a presentation at an EarthCube workshop organized by Gelu Nita, entitled "Towards Integration of Heliophysics, Data, Models, and Analysis Tools", at the New Jersey Institute of Technology on Nov. 14. I would really like to show off a simple example of pangeo integrating with solar astronomy. @MSKirk, can we revive this issue and continue to pursue this? It sounds like there has been some progress on the dask side.

I recommend we coordinate with #255 and spin up astro.pangeo.io. Is anyone interested in taking the lead on this?

@martindurant (Contributor) commented:

I assume @SimonKrughoff would also be interested in an astro.pangeo.io .

@MSKirk commented Oct 1, 2018

I like your idea, @rabernat, to shoot for a Nov. 14 demo. What do you need from me to help you move forward?

@rabernat (Member) commented Oct 1, 2018

There are two paths we could pursue:

  • Work on standing up astro.pangeo.io and develop the example from there (would want to collaborate with @SimonKrughoff)
  • Develop a standalone example and run it via pangeo binder

The first one is definitely more work.

In either case, you will probably want to use intake-astro to read the SDO_AIA_Images data you placed in the pangeo-data GCS bucket as dask arrays. From that point you could simply visualize the data interactively (perhaps with holoviews / datashader) or you could actually do some scientific analysis. Do you use sunpy? Do we know whether it plays well with dask?

If you place your example in a github repo, together with an appropriate environment.yaml, it should be easy to run with binder.

@mrocklin (Member, Author) commented Oct 1, 2018

> In either case, you will probably want to use intake-astro to read the SDO_AIA_Images data you placed in the pangeo-data GCS bucket as dask arrays. From that point you could simply visualize the data interactively (perhaps with holoviews / datashader) or you could actually do some scientific analysis. Do you use sunpy? Do we know whether it plays well with dask?

To be clear, tools like intake and holoviews/datashader are totally optional. In your position I would stick to whatever technologies you're comfortable with, probably adding in as few as possible to start. I suspect that adding Dask and Kubernetes is enough novelty when getting started.

@rabernat (Member) commented Oct 1, 2018

Fair point about holoviews/datashader. But intake solves a key problem: how to load the data lazily from GCS. I just tried it, and it works beautifully:

import intake

source = intake.open_fits_array('gcs://pangeo-data/SDO_AIA_Images/094/*.fits')
data = source.to_dask()

(Have to first conda install -c intake intake-astro.)

@mrocklin (Member, Author) commented Oct 1, 2018

As you like

@martindurant (Contributor) commented:

Obviously I will also try to help where I can, but time is always limited.
The code in intake-astro was partly derived from the SciPy work with some of the people on this thread, so they may have equivalent functionality elsewhere that I am not aware of.

@rabernat (Member) commented Oct 1, 2018

All of the following works beautifully for me. I am currently browsing the sun. ;)

import intake
import xarray as xr
import numpy as np
import holoviews as hv
from holoviews.operation.datashader import regrid
hv.extension('bokeh')

source = intake.open_fits_array('gcs://pangeo-data/SDO_AIA_Images/094/*.fits')
data = source.to_dask()
# the only trick was first wrapping in xarray...otherwise holoviews doesn't seem to work
xrds = xr.DataArray(data, dims=['time', 'y', 'x'],
                    coords={'time': ('time', np.arange(1800)),
                     'x': ('x', np.arange(4096)),
                     'y': ('y', np.arange(4096))},
                    name='fits_data')
hvds = hv.Dataset(xrds)
im = hvds.to(hv.Image, kdims=['x', 'y'], dynamic=True)

# new cell
%opts Image [width=800 height=800 logz=True] (cmap='magma')
regrid(im, dynamic=True)

[screenshot: interactive, zoomable view of the Sun rendered with holoviews/datashader]

@wtbarnes (Member) commented Oct 1, 2018

See the discussion on sunpy/sunpy#2715 re: lazily loading FITS. SunPy Map objects can also be constructed from a Dask array and an appropriate metadata object (a hedged sketch below); intake-astro might be a drop-in replacement for all of this, though!
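
For illustration, reusing names from earlier in the thread (arrs is the list of per-file dask arrays, headers the gathered FITS headers), that Map construction might look like:

import sunpy.map

# sunpy's Map factory accepts a (data, header) pair;
# the data here is still a lazy dask array
smap = sunpy.map.Map(arrs[0], headers[0])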

At the SciPy meeting back in July, @Cadair and I (with the help of @martindurant and @mrocklin) used pangeo and some of the AIA data on GCS to do some pretty cool data pipelining, though we never got around to writing it up. The basic workflow was:

  1. Lazily load a sequence (in time) of AIA images for multiple instrument channels
  2. "Prep" the images (i.e. scale to common resolution)
  3. "Derotate" (account for the differential rotation of the Sun so that a single pixel corresponded to the same physical location at each timestep)

After doing all of that, we did a correlation analysis between multiple AIA channels using dask.array.fft. All of the above are steps that solar physicists go through every day when analyzing this kind of data. Very often, these steps are done in serial, on desktops or even laptops.
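
The rechunk-then-FFT step might look roughly like this (the correlation analysis itself is more involved; arr is the (time, y, x) dask array from earlier in the thread):

import dask.array as da

# one chunk along time so the FFT can run over each pixel block
arr_t = arr.rechunk({0: -1, 1: 256, 2: 256})
power = da.absolute(da.fft.fft(arr_t, axis=0)) ** 2   # per-pixel power spectrum
mean_power = power.mean(axis=(1, 2)).compute()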

A demo showing that this could all be done in parallel and at scale would have a huge "wow" factor, I think.

If this example sounds interesting, I'd be happy to help put it together.

@mrocklin (Member, Author) commented Oct 1, 2018 via email

@rabernat (Member) commented Oct 1, 2018

> Scientifically I suspect that it would be interesting to rechunk into blocks and then do some time-series analysis on each pixel.

FWIW, this is structurally very similar to how we often analyze Earth imagery data. It is really fun to rechunk and persist into a dask cluster and then do the timeseries analysis in parallel.

> A demo that showed this could all be done in parallel and at scale would have a huge "wow" factor I think.

I agree. For my part, I am obviously not qualified to develop the heliophysics-specific analysis, but I can advise on how we approach similar tasks from the earth-science perspective.

I support @mrocklin's suggestion to make a binder with the necessary dependencies (e.g. intake-astro, sunpy, etc.) and then develop progressively more complex workflows from there. I have created https://github.com/pangeo-data/pangeo-astro-examples as a place for this. I also created the @pangeo-data/pangeo-astro team and invited everyone on this thread (and more) to it.

@MSKirk commented Oct 1, 2018

@wtbarnes and I are both very familiar with sunpy and Map objects. I think that, for the sunpy community, using some of their Map code will foster a lot more goodwill than doing without it. Plus, you get the standard color tables for each of the images (I have my own issues with them, but won't get on my soapbox just yet).

@rabernat (Member) commented Oct 1, 2018

Of course! Definitely best to leverage all the work that has gone into sunpy. It's clearly an amazing package.

For visualization, I am personally obsessed with holoviews / datashader and what they can do with cloud-backed datasets. (The screenshot I posted above was dynamically zoomable and scrollable in time.) For the long term, creating an integration layer between sunpy and holoviews might give us the best of both worlds.

@martindurant (Contributor) commented:

There are, though, further advantages to using Intake that may be attractive, if it is not too much of an imposition. Housing the data-loading code is only part of its purpose; consider also encoding data locations and loading parameters into catalog files, which can be stored remotely and updated in place. Note that the headers, including the WCS, are available on the data source objects. (Just making sure everyone has all the information!)

Indeed, the intake-astro outputs may be easy to coerce into Map objects, if that seems like a good idea.
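
For illustration, a hypothetical catalog entry and its use might look like this (the file name, entry name, and driver string are assumptions, not a published catalog):

import intake

# A file 'sdo.yaml' might contain (illustrative):
#   sources:
#     aia_094:
#       driver: fits_array
#       args:
#         url: gcs://pangeo-data/SDO_AIA_Images/094/*.fits

cat = intake.open_catalog('sdo.yaml')   # hypothetical catalog file
arr = cat.aia_094.to_dask()             # hypothetical entry name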

@MSKirk commented Oct 1, 2018

I am giving an introductory sunpy tutorial at the end of the month. I will play around with holoviews / datashader to see if I can get a "hey look, I am flying" moment.

@rabernat (Member) commented Oct 1, 2018

Ok, so I will revisit this issue in a few weeks to see how things are progressing.

In the meantime, please feel free to put examples into https://github.com/pangeo-data/pangeo-astro-examples

@MSKirk commented Oct 2, 2018

@rabernat are you using a specific build of intake? I am not seeing the function open_fits_array in my version, 0.2.6.

@martindurant (Contributor) commented:

@MSKirk, you also need intake-astro.

@wtbarnes (Member) commented Nov 7, 2018
wtbarnes commented Nov 7, 2018

@wafels This discussion may be interesting to you as well.

stale bot commented Jan 6, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 6, 2019
stale bot commented Jan 14, 2019

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
