Update of HyperSpy Markers API changes for the hspy/zspy format #164

Merged · 8 commits · Oct 5, 2023

Conversation

@ericpre (Member) commented Sep 28, 2023

Follow-up of hyperspy/hyperspy#3148.

Progress of the PR

  • update to the new markers API,
  • fix parsing of tuples of strings,
  • bump the matplotlib minimum requirement to 3.5 to support matplotlib.collections.Collection.set_offset_transform, which was added in matplotlib 3.5 (see the sketch after this list),
  • various fixes dealing with ragged arrays:
    • the ragged attribute wasn't set properly in a save/load cycle,
    • the dtype of ragged arrays was incorrectly saved in zspy,
    • fix an issue with several ragged array datasets in the same group,
  • update the user guide: bump the file version,
  • add a changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
  • check the formatting of the changelog entry (and eventual user guide changes) in the docs/readthedocs.org:rosettasciio build of this PR (link in GitHub checks),
  • add tests,
  • ready for review.
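
As a hedged aside, a minimal sketch of the matplotlib >= 3.5 API behind the version bump; the collection, segments, and offsets here are made up for illustration, not taken from the PR:

import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

fig, ax = plt.subplots()
collection = LineCollection(
    [[(0, 0), (1, 1)]],    # one illustrative line segment
    offsets=[(0.5, 0.5)],  # offset applied to the segment
)
# set_offset_transform was added in matplotlib 3.5; earlier versions only
# accepted the offset transform through the constructor (transOffset),
# hence the bumped minimum requirement.
collection.set_offset_transform(ax.transData)
ax.add_collection(collection)
ax.set_xlim(0, 2)
ax.set_ylim(0, 2)
plt.show()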

@codecov bot commented Sep 28, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison: base (e4e71ad) 85.56% vs head (0255321) 85.59%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #164      +/-   ##
==========================================
+ Coverage   85.56%   85.59%   +0.03%     
==========================================
  Files          76       76              
  Lines       10132    10148      +16     
  Branches     2210     2216       +6     
==========================================
+ Hits         8669     8686      +17     
+ Misses        945      944       -1     
  Partials      518      518              
Files                     Coverage Δ
rsciio/hspy/_api.py       93.18% <ø> (ø)
rsciio/zspy/_api.py       95.77% <ø> (ø)
rsciio/_hierarchical.py   76.30% <93.75%> (+0.40%) ⬆️

... and 2 files with indirect coverage changes


@ericpre (Member, Author) commented Oct 4, 2023

@CSSFrancis, do you want to review this PR?

@CSSFrancis (Member)

@ericpre I can do that now! Sorry, I've been trying to finish writing some papers over the last couple of days, so I haven't been the best at responding :)

@CSSFrancis (Member)

@ericpre I looked over this briefly and it looks like a really good improvement overall! I've been meaning to revisit how ragged arrays are saved and loaded, as I think it is a bit problematic.

One thing that is a bit concerning is that loading ragged arrays into memory is very slow for the zspy format. Saving them is not so much of an issue, but the loading time appears to be much slower than the saving time, and it also grows superlinearly with the number of markers, which makes things extra bad :).

For example:

import time

import matplotlib.pyplot as plt
import numpy as np
import hyperspy.api as hs

save_times_z, load_times_z = [], []
save_times_h, load_times_h = [], []

num_pos = [100, 500, 1000, 2000, 4000, 10000]
for i in num_pos:
    # Build a ragged (dtype=object) array with one small sub-array per position
    test = np.empty(i, dtype=object)
    for j in np.ndindex(test.shape):
        test[j] = np.array([[1, 1], [2, 2]])

    s = hs.signals.BaseSignal(test)

    tic = time.time()
    s.save("data.zspy", overwrite=True)
    save_times_z.append(time.time() - tic)

    tic = time.time()
    hs.load("data.zspy")
    load_times_z.append(time.time() - tic)

    tic = time.time()
    s.save("data.hspy", overwrite=True)
    save_times_h.append(time.time() - tic)

    tic = time.time()
    hs.load("data.hspy")
    load_times_h.append(time.time() - tic)

plt.plot(num_pos, load_times_z, label="loading time (zspy)")
plt.plot(num_pos, save_times_z, label="saving time (zspy)")
plt.plot(num_pos, load_times_h, label="loading time (hspy)")
plt.plot(num_pos, save_times_h, label="saving time (hspy)")
plt.xlabel("number of positions")
plt.ylabel("time in sec")
plt.legend()
[plot: saving/loading times vs number of positions for the zspy and hspy formats]

I realize that this is probably something upstream in zarr and not related to this PR, but I think it will cause problems with saving markers like this.

@CSSFrancis (Member) left a review comment

This looks like a good change! It also cleans up some things. The only thing I am worried about is the slow loading of ragged zarr arrays.

That could potentially "brick" loading of a 4D-STEM dataset: if you save a large ragged array alongside it, you then can't access the data because loading the ragged array is so slow.

(two review threads on rsciio/_hierarchical.py, both resolved)

@CSSFrancis (Member)

@ericpre Let me know if you have any thoughts on why this might be; otherwise, I can try to look into this more and figure out some way to save/load efficiently.

@ericpre (Member, Author) commented Oct 4, 2023

> @ericpre Let me know if you have any thoughts on why this might be; otherwise, I can try to look into this more and figure out some way to save/load efficiently.

I don't think that there is much to be done here; as you said, it must come from zarr/numcodecs.

Commit: "…plementation of nd ragged array support in zarr"
@CSSFrancis (Member)

So it seems like the VLenArray numcodec isn't the problem:

# Run in IPython/Jupyter (uses the %timeit magic).
import numcodecs
import numpy as np

# Assumed codec definition (not shown in the original snippet).
vlen_arr_codec = numcodecs.VLenArray("f8")

def benchmark_codec(codec, a):
    print(codec)
    print("encode")
    %timeit codec.encode(a)
    enc = codec.encode(a)
    print("decode")
    %timeit codec.decode(enc)
    print("size         : {:,}".format(len(enc)))

np.random.seed(42)
data4 = np.array(
    [
        np.random.random(size=np.random.randint(0, 20)).astype(np.float64)
        for i in range(200000)
    ],
    dtype=object,
)
benchmark_codec(vlen_arr_codec, data4)

This seems to work just fine and scales nicely. The issue instead seems to be that each array is treated as a separate chunk, so the _decode_chunk function is called for every position.
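
As a side note, here is a minimal sketch (mine, not from the PR, assuming zarr 2.x) illustrating the suspected cause: reading the same object array back with one element per chunk versus a single chunk:

import time

import numcodecs
import numpy as np
import zarr

rng = np.random.default_rng(42)
data = np.empty(10000, dtype=object)
for i in range(data.shape[0]):
    data[i] = rng.integers(0, 10, size=rng.integers(1, 10))

for chunks in (1, data.shape[0]):  # one element per chunk vs one single chunk
    z = zarr.array(
        data, dtype=object, chunks=chunks, object_codec=numcodecs.VLenArray(int)
    )
    tic = time.time()
    _ = z[:]  # load everything back into memory
    print(f"chunks={chunks}: {time.time() - tic:.3f} s")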

If we set the number of chunks here to 1:

@staticmethod
def _get_object_dset(group, data, key, chunks, **kwds):
    """Creates a Zarr Array object for saving ragged data"""
    these_kwds = kwds.copy()
    these_kwds.update(dict(dtype=object, exact=True, chunks=chunks))
    dset = group.require_dataset(
        key, data.shape, object_codec=numcodecs.VLenArray(int), **these_kwds
    )
    return dset
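
For clarity, a hedged sketch of what that single-chunk change could look like; the function name is mine for illustration, and this is not the code actually committed:

import numcodecs

def _get_object_dset_single_chunk(group, data, key, chunks, **kwds):
    """Create a Zarr Array for ragged data using one chunk for the whole dataset."""
    these_kwds = kwds.copy()
    # chunks=data.shape -> a single chunk, so reading back decodes once
    # instead of once per position.
    these_kwds.update(dict(dtype=object, exact=True, chunks=data.shape))
    return group.require_dataset(
        key, data.shape, object_codec=numcodecs.VLenArray(int), **these_kwds
    )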

Then things work a little better :)
[plot: save/load timings with a single chunk, showing much faster zspy loading]

@CSSFrancis (Member) commented Oct 4, 2023

@ericpre What do you think?

I would vote that we force any numpy array with dtype=object into one chunk for compression/loading.

We could also try to guess the ideal chunk size by looking at the underlying data.

For a dask array, we can leave its chunking scheme and assume that the person saving/loading the data has some idea of what they are doing. (A sketch of this policy follows.)
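
A hypothetical sketch of that policy; the helper name and signature are mine for illustration, not actual rosettasciio code:

def _guess_ragged_chunks(data, chunks=None):
    """Hypothetical helper: pick chunks for a ragged (dtype=object) dataset."""
    if chunks is not None:
        # An explicit user choice always wins.
        return chunks
    if hasattr(data, "chunks"):
        # Dask array: keep its chunking scheme and assume the person
        # saving/loading the data knows what they are doing.
        return data.chunks
    # Plain numpy object array: force a single chunk covering the whole
    # dataset, so loading decodes one chunk instead of one per position.
    return data.shape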

@ericpre (Member, Author) commented Oct 5, 2023

Thank you @CSSFrancis for looking at this. I tried using one chunk in _get_object_dset, but reading is still slow; maybe you changed something somewhere else?
Do you want to sort out this issue in a separate PR? The issue is not introduced by this PR, which is about getting the test suite to work with the markers update!

@CSSFrancis (Member)

> Thank you @CSSFrancis for looking at this. I tried using one chunk in _get_object_dset, but reading is still slow; maybe you changed something somewhere else?
> Do you want to sort out this issue in a separate PR? The issue is not introduced by this PR, which is about getting the test suite to work with the markers update!

Yep, I can do that!

Do you want to merge this, then? I will open a new PR.

@ericpre merged commit 42574d2 into hyperspy:main on Oct 5, 2023 (31 of 32 checks passed).
@ericpre added this to the v0.2 milestone on Oct 6, 2023.