Solar astronomy data on GCS bucket #269
Some fitsio functions work with any file-like object; some do not. This means that we could relatively easily construct dask arrays from sets of FITS files.
edit: let me flesh that out a bit: we could use something along the lines of the sketch below.
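A minimal sketch of the idea, assuming the files live under the `pangeo-data/SDO_AIA_Images` prefix used later in this thread and that every image shares a common shape and dtype (the 4096×4096 shape and int16 dtype here are illustrative guesses about the AIA images):

```python
import dask.array as da
import gcsfs
from astropy.io import fits
from dask import delayed

fs = gcsfs.GCSFileSystem(token='anon')
paths = sorted(fs.glob('pangeo-data/SDO_AIA_Images/094/*.fits'))

def read_fits(path):
    # fits.getdata accepts file-like objects, so we can read straight from GCS
    with fs.open(path, 'rb') as f:
        return fits.getdata(f)

# one delayed read per file, stacked into a single lazy (time, y, x) cube;
# the shape and dtype are assumptions about these particular images
arrays = [da.from_delayed(delayed(read_fits)(p), shape=(4096, 4096), dtype='int16')
          for p in paths]
stack = da.stack(arrays)
```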
|
This is awesome! I will have a go at utilizing some of my DKIST code for this tomorrow. |
This is great! I welcome the involvement of astronomers here. As the number of datasets continues to grow, it would be good to establish an authoritative catalog. Hopefully this can be intake (#39), but I doubt it works with the FITS format. |
I have about 460 GB of data (12,600 images) queued up to get uploaded to the pangeo-data drive. These images haven't been thoughtfully chosen, but were sitting on an external hard disk of mine. They represent 7 different EUV wavelengths that the SDO AIA instrument observes the sun in. Let me know if you have any data specific questions. |
There is no plugin for FITS in Intake yet; this is a question of demand :) The outstanding problem is actually the coordinate systems (WCS) that FITS typically bundles, which are not aligned with the axes in general, or maybe not even linear. It would be OK to have an array-and-WCS pair, but the model doesn't easily fit into the xarray coordinate-axes model. |
@MSKirk are they level 1.5 images? So effectively all share the same WCS? (Along the time axis) |
I suggest that we don't bother with intake for the moment. I don't think that's on the critical path to exploring things here. I think that the next step is for people to write a couple of simple analyses using this data so that others see what is possible, and then hopefully build off of that original analysis.
|
@Cadair, these are level 1 images, but they were taken during a time of stability for the spacecraft, so any corrections that the prep routines make are going to be relatively small. When I am back at my desk, I can work on a level 1.5 dataset. |
Actually, straight-forward
|
So I have been playing for the last hour or so, and I have got as far as this: http://pangeo.pydata.org/user/cadair/lab/tree/examples%2FAIA1.ipynb (don't know if you can actually follow that link?). I made a conda env to run the thing, but it seems to have issues with that. What's the best way to handle deps and such? |
We cannot follow that link - perhaps make it a gist? |
You should be able to conda/pip install into your notebook directory. However, in order to use distributed workers you will also need to tell them which libraries to install when they start up. You can do this easily by modifying the worker-template.yaml file in your home directory. You'll want to extend the env:

```yaml
env:
  - name: GCSFUSE_BUCKET   # this was here before
    value: pangeo-data
  - name: EXTRA_PIP_PACKAGES
    value: foo bar git+https://github.com/user/package.git@branch
  - name: EXTRA_CONDA_PACKAGES
    value: baz quuz -c conda-forge
```

I generally prefer pip packages over conda in this case, because the install happens for every worker every time they start up, and conda takes a while. |
I made it work 🎉 https://gist.github.com/Cadair/f65fbcb593c6d70eb8f3f3e48324f7c3 I used a cluster to read all the headers of all the files, then used this information to construct a WCS object and fed it all into NDCube, so we now have a coordinate-aware Dask array. I did briefly try to construct a Dask DataFrame out of all the headers in all the FITS files, but I didn't get it to work and I have a limited amount of time for this today! |
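A rough sketch of that pattern (not the gist's actual code; it reuses the hypothetical `fs`, `paths`, and `stack` from the sketch above, and the 12-second AIA cadence on the time axis is an assumption):

```python
from astropy.io import fits
from astropy.wcs import WCS
from distributed import Client
from ndcube import NDCube

client = Client()  # connect to the dask cluster

def get_header(path):
    # fits.getheader also works with file-like objects
    with fs.open(path, 'rb') as f:
        return dict(fits.getheader(f))

# read every header in parallel across the workers
headers = client.gather(client.map(get_header, paths))

# build a 3D WCS: the two spatial axes from the first header plus a time axis
# (WCS axes are in FITS order, i.e. reversed relative to the (time, y, x) array)
h = headers[0]
wcs = WCS(naxis=3)
wcs.wcs.ctype = [h['CTYPE1'], h['CTYPE2'], 'TIME']
wcs.wcs.crpix = [h['CRPIX1'], h['CRPIX2'], 0]
wcs.wcs.crval = [h['CRVAL1'], h['CRVAL2'], 0]
wcs.wcs.cdelt = [h['CDELT1'], h['CDELT2'], 12.0]  # assumed 12 s cadence

cube = NDCube(stack, wcs=wcs)  # a coordinate-aware, lazy dask-backed cube
```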
(oh the pip requirement is just |
@Cadair, in this particular example the coordinates are well-aligned with the axes, so this would be a good case for xarray as the container, especially if you were to have multiple cubes for each pass-band. Not that I have anything against ndcube! Does ndcube reference the dask-array lazily, or does the constructor evaluate it? |
While the coordinates are in practice well aligned, the spatial coordinates are subject to a TAN map projection, so they are not really appropriate for xarray, like we discussed on the call. NDCube has just been built to be coordinate-aware through astropy WCS. NDCube should only reference the array lazily; I haven't encountered anywhere where it doesn't yet. |
As a physical oceanographer / climate modeler, I find it very interesting to learn how other disciplines handle their datasets. In this case, ndcube sounds quite similar to xarray, so there is possibly some opportunity for synergy / collaboration here.
This is not fundamentally different from the situation with geographic data. Earth observations and models use different projections and coordinate systems, and we often have to transform from one coordinate system to another.

Xarray is not limited to cartesian geometry: it supports auxiliary non-dimension coordinates for the purpose of storing such information. However, since these projection / regridding operations are often domain specific, they are not included in xarray. We use packages like cartopy or xesmf to handle the projection problem; these packages can work directly with xarray data structures, making them more flexible and less error prone.

People often assume xarray's data model is more limited than it really is. I think this is a problem with our documentation. It's simply a general purpose container for storing labeled, multi-dimensional arrays. Being built closely around dask from the beginning, however, gives it some real advantages. |
To clarify, I am not suggesting anyone abandon widely established and successful packages such as ndcube. Just commenting out of interest for the sake of discussion. I'm sure you're very happy with your software stack. 😉 |
I am very interested in potential uses of xarray for this type of data. I am also very interested to hear how you might handle the problem of two of the axes having their coordinates coupled (in this case via map projection), because the last time I looked into xarray I couldn't find a way of reconciling this with the way xarray works. ndcube is still quite new (although built on top of older astropy tech), and we did evaluate xarray for the problem when we started working on it a little over a year ago. I think one of the major limitations could be the practical requirement to use astropy WCS functionality for the computation of some or all of the coordinate axes. |
I have now updated the gist with what I think is a good implementation of "don't load the whole FITS array into memory" when doing a slicing operation. I would love to hear what @martindurant or anyone else thinks about this way of doing it. |
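As a small illustration of what that lazy behaviour buys (continuing the hypothetical `stack` dask array from the earlier sketch):

```python
# slicing only builds a smaller task graph; nothing is read yet
sub = stack[:10, 1024:2048, 1024:2048]

# compute() now touches only the first ten files; the naive reader above
# still pulls each of those files in full, which is exactly the step a
# smarter per-slice reader (as in the gist) improves on
result = sub.compute()
print(result.shape)  # (10, 1024, 1024)
```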
So in this case, we would distinguish between the logical coordinates of the array (must be 1D; one per axis) and the physical coordinates in space (possibly 2D or higher dimensionality). There is an extensive example of this in the xarray docs. A typical climate dataset (used in the example above) would have the following structure:
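A minimal sketch of that structure with made-up values, where `x`/`y` are the 1D logical dimension coordinates and `lon`/`lat` are 2D physical coordinates:

```python
import numpy as np
import xarray as xr

ny, nx = 205, 275
lon2d = np.random.uniform(-180, 180, (ny, nx))  # 2D physical coordinates
lat2d = np.random.uniform(-90, 90, (ny, nx))

ds = xr.Dataset(
    {'temperature': (('y', 'x'), np.random.rand(ny, nx))},
    coords={'y': np.arange(ny),               # 1D logical coordinates
            'x': np.arange(nx),
            'lon': (('y', 'x'), lon2d),       # auxiliary non-dimension coords
            'lat': (('y', 'x'), lat2d)},
)
```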
We are still a long way from full "multidimensional coordinate" support. These open issues describe some improvements we would like to make: pydata/xarray#2028, pydata/xarray#1325, pydata/xarray#1553. Progress on these issues is limited by developer time; that's why I am eager to reach out to scientific communities with similar needs to see if there are ways we can work together. Perhaps the transformation routines you have in ndcube could help resolve some xarray issues!

Will any of you folks be at SciPy? I personally won't, but @jhamman and lots of other pangeo / xarray contributors will. Perhaps we should consider a Birds of a Feather proposal on labeled nd-arrays. |
That's really cool! I will have to take a look at some point. I think what astro folks would need is for xarray to be able to compute coordinate values via astropy WCS. I will be at SciPy this year; a BoF sounds like a pretty good idea! |
+1 for a BoF event at SciPy. Proposals are due June 27. |
@Cadair, I just checked the cluster status and I noticed something odd. Did you notice anything strange? I've not encountered that status before. |
Yeah I have been having a few problems with things behaving strangely on me, I am not really sure what to report as I don't fully understand what's going on 😆 |
My understanding of kubernetes is also low. But my impression is that your dask worker pods are exceeding their cpu quota and being limited or stopped. I'm not sure how this is possible. Does your code use multithreading? |
No, I am just using plain dask array and a map with a plain python function (doing bytes io with dask) and gather on the distributed client. |
On OutOfCpu: kubernetes/kubernetes#42433 is relevant? |
Sorry I have been distracted by other projects in the last couple of months. @Cadair have you been continuing to work with solar images in dask? |
I hope you are following https://github.com/ContinuumIO/intake-astro, which now has a partitioned FITS array and table reader that may already be generally useful. |
Hi everyone. I was invited to give a presentation at an EarthCube workshop organized by Gelu Nita, entitled "Towards Integration of Heliophysics, Data, Models, and Analysis Tools", at the New Jersey Institute of Technology on Nov. 14. I would really like to be able to show off some simple example of pangeo integrating with solar astronomy. @MSKirk, can we revive this issue and continue to pursue this? It sounds like there has been some progress on the dask side. I recommend we coordinate with #255 and spin up astro.pangeo.io. Is anyone interested in taking the lead on this? |
I assume @SimonKrughoff would also be interested in an astro.pangeo.io . |
I like your idea, @rabernat, to shoot for a Nov. 14 demo. What do you need from me to help you move forward? |
There are two ways we could pursue:

- spin up a dedicated astro.pangeo.io deployment (coordinating with #255), or
- publish the example as a binder.

The first one is definitely more work. In either case, you will probably want to use intake-astro to read the FITS data. If you place your example in a github repo, together with an appropriate environment specification, it can be launched from binder. |
To be clear, tools like intake and holoviews/datashader are totally optional. In your position I would stick to whatever technologies you're comfortable with, probably adding in as few as possible to start. I suspect that adding Dask and Kubernetes is enough novelty when getting started. |
Fair point about holoviews/datashader. But intake solves a key problem: how to load the data lazily from GCS. I just tried it and it works beautifully:

```python
source = intake.open_fits_array('gcs://pangeo-data/SDO_AIA_Images/094/*.fits')
data = source.to_dask()
```

(Have to first pip install intake-astro.) |
As you like |
Obviously I will also try to help where I can - but time is always limited. |
All of the following works beautifully for me. I am currently browsing the sun. ;)

```python
import intake
import xarray as xr
import numpy as np
import holoviews as hv
from holoviews.operation.datashader import regrid

hv.extension('bokeh')

source = intake.open_fits_array('gcs://pangeo-data/SDO_AIA_Images/094/*.fits')
data = source.to_dask()

# the only trick was first wrapping in xarray...otherwise holoviews doesn't seem to work
xrds = xr.DataArray(data, dims=['time', 'y', 'x'],
                    coords={'time': ('time', np.arange(1800)),
                            'x': ('x', np.arange(4096)),
                            'y': ('y', np.arange(4096))},
                    name='fits_data')
hvds = hv.Dataset(xrds)
im = hvds.to(hv.Image, kdims=['x', 'y'], dynamic=True)

# new cell
%opts Image [width=800 height=800 logz=True] (cmap='magma')
regrid(im, dynamic=True)
```
|
See the discussion on sunpy/sunpy#2715 re: lazily loading FITS. SunPy Map objects can also be constructed from a Dask array and an appropriate metadata object. At the SciPy meeting back in July, @Cadair and I (with the help of @martindurant and @mrocklin) used pangeo and some of the AIA data on GCS to do some pretty cool data pipelining, though we never got around to writing it up. After doing all of that, we did a correlation analysis between multiple AIA channels. A demo that showed this could all be done in parallel and at scale would have a huge "wow" factor, I think. If this example sounds interesting, I'd be happy to help put it together. |
:)

The next step here might be to make a binder?

Scientifically, I suspect that it would be interesting to rechunk into blocks and then do some time-series analysis on each pixel. I'll let the actual scientists say what they think though :)
|
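A sketch of that rechunking idea, reusing the `data` dask array from the intake example above (the 512-pixel tile size is an arbitrary choice):

```python
# put the whole time axis into one chunk per 512x512 spatial tile,
# so each per-pixel time series lives in a single block
ts = data.rechunk({0: -1, 1: 512, 2: 512})
ts = ts.persist()  # keep the rechunked blocks in cluster memory

# example per-pixel time-series reduction: variability of each pixel over time
variability = ts.std(axis=0).compute()
```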
FWIW, this is structurally very similar to how we often analyze Earth imagery data. It is really fun to rechunk and persist into a dask cluster and then do the timeseries analysis in parallel.
I agree. For my part, I am obviously not qualified to develop the heliophysics-specific analysis, but I can advise on how we approach similar tasks from the earth-science perspective. I support @mrocklin's suggestion to make a binder with the necessary dependencies (e.g. intake-astro, sunpy, etc.) and then develop progressively more complex workflows from there. I have created https://github.com/pangeo-data/pangeo-astro-examples as a place for this. I also created the @pangeo-data/pangeo-astro team and invited everyone on this thread (and more) to it. |
@wtbarnes and I are both very familiar with sunpy and Map objects. I think for the SunPy community, using some of their Maps code will foster a lot more good will than doing without it. Plus you can get the standard color tables for each of the images (I have my own issues with them, but won't get on my soapbox just yet). |
Of course! Definitely best to leverage all the work that has gone into sunpy. It's clearly an amazing package. For visualization, I am personally obsessed with holoviews / datashader and what they can do with cloud-backed datasets. (The screenshot I posted above was dynamically zoomable and scrollable in time.) For the long term, creating some integration layer between sunpy and holoviews might allow the best of both worlds. |
There are, though, further advantages to using Intake that may be attractive, if it is not too much of an imposition to use. Providing a place to frame and put data-loading code is only part of the purpose; consider also the idea of encoding data locations and loading parameters into catalog files, which can potentially be stored remotely and updated in place. Note that the headers, including WCS, are available in the data source objects. (Just wanting to make sure that everyone has all the information!) Indeed, the intake-astro outputs may be easy to coerce into Map objects, if that seems like a good idea. |
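For instance, a hypothetical catalog entry might look like the following (the `fits_array` driver name and `url` argument are assumptions about how intake-astro registers its reader), which users could then load with `intake.open_catalog('solar.yaml').aia_094.to_dask()`:

```yaml
# solar.yaml -- a hypothetical intake catalog
sources:
  aia_094:
    description: SDO/AIA 94 Angstrom images on GCS
    driver: fits_array        # assumed intake-astro driver name
    args:
      url: 'gcs://pangeo-data/SDO_AIA_Images/094/*.fits'
```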
I am giving an introductory sunpy tutorial at the end of the month. I will play around with holoviews / datashader to see if I can get a "hey look, I am flying" moment. |
Ok, so I will revisit this issue in a few weeks to see how things are progressing. In the meantime, please feel free to put examples into https://github.com/pangeo-data/pangeo-astro-examples |
@rabernat are you using a specific build of intake? |
@MSKirk, you also need intake-astro |
@wafels This discussion may be interesting to you as well. |
@MSKirk was kind enough to upload a number of FITS files that correspond to a solar astronomy dataset.
I can load a little bit of the data with the following steps:
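For example, something along these lines (a sketch with anonymous GCS access; the exact paths and HDU layout are assumptions and may differ):

```python
import gcsfs
from astropy.io import fits

fs = gcsfs.GCSFileSystem(token='anon')
paths = fs.ls('pangeo-data/SDO_AIA_Images/094')

# read one image; the data may live in a different HDU depending on the files
with fs.open(paths[0], 'rb') as f:
    hdulist = fits.open(f)
    header = hdulist[0].header
    image = hdulist[0].data
```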
cc @Cadair, @DanRyanIrish, @martindurant, @mmccarty