Update of HyperSpy Markers API changes for the hspy/zspy format #164

Conversation
…oid conflict with several dataset with object dtype in same group - typically occurring with variable length markers. Bump file version to 3.3
…rspy 2.0 - `collection.set_offset_transform` is required
Codecov Report

Attention: Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #164      +/-   ##
==========================================
+ Coverage   85.56%   85.59%   +0.03%
==========================================
  Files          76       76
  Lines       10132    10148      +16
  Branches     2210     2216       +6
==========================================
+ Hits         8669     8686      +17
+ Misses        945      944       -1
  Partials      518      518
```

☔ View full report in Codecov by Sentry.
@CSSFrancis, do you want to review this PR?

@ericpre I can do that now! Sorry, I've been trying to finish writing some papers the last couple of days, so I haven't been the best at responding :)
@ericpre I looked over this briefly and it looks like a really good improvement overall! I've been meaning to go back to the idea of saving and loading ragged arrays, as I think that it is a bit problematic. One thing that is a bit concerning is that loading ragged arrays into memory is very slow for the `zspy` format. For example:

```python
import time

import matplotlib.pyplot as plt
import hyperspy.api as hs
import numpy as np

save_times_z = []
load_times_z = []
save_times_h = []
load_times_h = []
num_pos = [100, 500, 1000, 2000, 4000, 10000]

for i in num_pos:
    # Build a ragged (object-dtype) array with one small 2x2 array per position
    test = np.empty((i,), dtype=object)
    for j in np.ndindex(test.shape):
        test[j] = np.array([[1, 1], [2, 2]])
    s = hs.signals.BaseSignal(test)

    tic = time.time()
    s.save("data.zspy", overwrite=True)
    save_times_z.append(time.time() - tic)

    tic = time.time()
    hs.load("data.zspy")
    load_times_z.append(time.time() - tic)

    tic = time.time()
    s.save("data.hspy", overwrite=True)
    save_times_h.append(time.time() - tic)

    tic = time.time()
    hs.load("data.hspy")
    load_times_h.append(time.time() - tic)

plt.plot(num_pos, load_times_z, label="loading time (zspy)")
plt.plot(num_pos, save_times_z, label="saving time (zspy)")
plt.plot(num_pos, load_times_h, label="loading time (hspy)")
plt.plot(num_pos, save_times_h, label="saving time (hspy)")
plt.xlabel("number of positions")
plt.ylabel("time in sec")
plt.legend()
```

I realize that this is probably something upstream in `zarr`/`numcodecs`.
This looks like a good change! It also cleans up some stuff. The only thing I am worried about is the slow loading for ragged `.zarr` arrays.

This has the problem of potentially "bricking" the loading of a 4-D STEM dataset: if you save a large ragged array alongside it, you then can't access the data because the loading time for the ragged array is very slow.

@ericpre Let me know if you have any thoughts on why this might be; otherwise I can try to look into this more to figure out some way to save/load efficiently.
I don't think that there is much to be done here; as you said, it must come from zarr/numcodecs.
…plementation of nd ragged array support in zarr
So it seems like the `VLenArray` numcodec isn't the problem, as:

```python
def benchmark_codec(codec, a):
    print(codec)
    print('encode')
    %timeit codec.encode(a)
    enc = codec.encode(a)
    print('decode')
    %timeit codec.decode(enc)
    print('size : {:,}'.format(len(enc)))

np.random.seed(42)
data4 = np.array([np.random.random(size=np.random.randint(0, 20)).astype(np.float64)
                  for i in range(200000)], dtype=object)
data4.shape
benchmark_codec(vlen_arr_codec, data4)
```

This seems to work just fine and scales nicely. The issue seems instead to be that each array is being treated as a separate chunk. If we set the number of chunks here to 1:

rosettasciio/rsciio/zspy/_api.py Lines 88 to 97 in e4e71ad
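To illustrate why one chunk per element is so costly, here is a minimal stdlib-plus-numpy sketch (not RosettaSciIO or zarr code; the helper names are made up) that models a chunked object-dtype store: each chunk is serialized and deserialized independently, so fixed per-chunk overhead dominates when every element gets its own chunk.

```python
import pickle
import time

import numpy as np


def save_chunked(data, chunk_size):
    # One serialized blob per chunk, mimicking a chunked store on disk.
    return [pickle.dumps(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def load_chunked(blobs):
    # Deserialize every chunk and stitch the object array back together.
    return np.concatenate([pickle.loads(b) for b in blobs])


n = 20000
data = np.empty(n, dtype=object)
for i in range(n):
    data[i] = np.array([[1.0, 1.0], [2.0, 2.0]])

for chunk_size in (1, n):  # one element per chunk vs. a single chunk
    blobs = save_chunked(data, chunk_size)
    tic = time.time()
    out = load_chunked(blobs)
    print(f"chunks={len(blobs):>6}: load {time.time() - tic:.4f} s")
```

With one element per chunk there are 20000 deserialization calls instead of one, which mirrors the overhead seen when every ragged element becomes its own zarr chunk.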
@ericpre What do you think? I would vote that we force any numpy array with `dtype=object` to be saved as a single chunk. We could also try to guess the ideal chunk size by looking at the underlying data.
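On the idea of guessing a chunk size from the underlying data, a rough sketch could sample element sizes and aim for a target number of bytes per chunk (the helper name and the one-megabyte target are made up for illustration):

```python
import numpy as np


def guess_object_chunk_size(data, target_bytes=1_000_000, sample=100):
    """Estimate a chunk length for a 1-D object-dtype array by sampling
    the in-memory size of its elements (hypothetical helper)."""
    rng = np.random.default_rng(0)
    idx = rng.choice(data.size, size=min(sample, data.size), replace=False)
    mean_bytes = float(np.mean([data[i].nbytes for i in idx]))
    return max(1, int(target_bytes // max(mean_bytes, 1.0)))


data = np.empty(1000, dtype=object)
for i in range(1000):
    data[i] = np.zeros((2, 2))  # 32 bytes per element (4 float64 values)

print(guess_object_chunk_size(data))  # 1_000_000 // 32 = 31250
```

A sampled estimate like this would only be a heuristic: for very ragged data the mean element size can be a poor predictor, which is why forcing a single chunk is the simpler option.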
Thank you @CSSFrancis for looking at this. I tried using a single chunk in `zspy`.
Yep, I can do that! Do you want to merge this, and then I will open a new PR?
Follow up of hyperspy/hyperspy#3148.

Progress of the PR

- Use `matplotlib.collections.Collection.set_offset_transform`, which was added in matplotlib 3.5
- Fix the `ragged` attribute, which wasn't set properly in a `save`/`load` cycle with `zspy`
- Add a changelog entry in the `upcoming_changes` folder (see `upcoming_changes/README.rst`)
- Check the `docs/readthedocs.org:rosettasciio` build of this PR (link in github checks)
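For reference, a minimal sketch of the matplotlib ≥ 3.5 API mentioned above (the collection type and offsets here are arbitrary examples): a collection's offsets are now tied to a transform via `Collection.set_offset_transform` rather than by assigning the transform directly.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, for the sketch only
import matplotlib.pyplot as plt
from matplotlib.collections import CircleCollection

fig, ax = plt.subplots()
# Arbitrary example collection: one circle at data coordinates (0.5, 0.5).
coll = CircleCollection(sizes=[100], offsets=[(0.5, 0.5)])
# set_offset_transform was added in matplotlib 3.5 and is required by
# the HyperSpy 2.0 markers API.
coll.set_offset_transform(ax.transData)
ax.add_collection(coll)
```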