Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v3] Re-opening an existing RemoteStore V3 group causes the first child to be deleted #2359

Closed
Tracked by #2412
rabernat opened this issue Oct 14, 2024 · 7 comments · Fixed by #2442
Closed
Tracked by #2412
Labels
V3 Affects the v3 branch
Milestone

Comments

@rabernat
Copy link
Contributor

rabernat commented Oct 14, 2024

This is a weird and unexpected behavior

import xarray as xr
import zarr
import s3fs
print(xr.__version__, zarr.__version__)
# -> ('0.9.7.dev3734+g26081d4f', '3.0.0b1.dev8+g9bbfd88')

s3 = s3fs.S3FileSystem()
# replace with your favorite bucket
target_path = "icechunk-test/ryan/zarr-v3/test-92"
store = zarr.storage.RemoteStore(s3, mode="w", path=target_path)

# create a group with two children
group = zarr.open_group(store=store, mode="w", zarr_format=3)
group.create("A", shape=1)
group.create("B", shape=1)
print(group.members())
#  (('A',
#  <Array <RemoteStore(S3FileSystem, icechunk-test/ryan/zarr-v3/test-era5-v3-92)>/A shape=(1,) dtype=float64>),
#  ('B',
#   <Array <RemoteStore(S3FileSystem, icechunk-test/ryan/zarr-v3/test-era5-v3-92)>/B shape=(1,) dtype=float64>))

# re-open the group
store2 = zarr.storage.RemoteStore(s3, mode="w", path=target_path)
group = zarr.open_group(store=store2, mode="w", zarr_format=3)
print(group.members())
# (('B',
#  <Array <RemoteStore(S3FileSystem, icechunk-test/ryan/zarr-v3/test-era5-v3-92)>/B shape=(1,) dtype=float64>),)

Array A has disappeared! That's bad!

This might have something to do with #2338. Really not sure. I haven't gotten to the bottom of it, but I thought I would post a report anyway.

@rabernat rabernat added the V3 Affects the v3 branch label Oct 14, 2024
@jhamman
Copy link
Member

jhamman commented Oct 14, 2024

I understand what is going on here. I believe this is the intended behavior for our new stores but am open to the idea that we should change it!

In Zarr-Python 2, stores were just a mapping interface on top of a storage backend. No access mode was applied at the store level - except for stores that required an access mode (ZipStore, DBMStore). Ironically, the FSStore also took a mode argument but w there just meant "for writing" -- not overwrite.

In Zarr-Python 3, we moved the access mode handling to the store so that we could have a single place where we could enforce access. The simplest way to think of this is that the store is kind of like a file object, opening it in w mode will give you a new (empty) store. The analogy is the same as an HDF5 file -- you enter through the root of the file and if you open it with mode='w', you get an empty hierarchy.

In your example, opened the store in w mode which, as expected produced an empty store. What is more surprising to me is that you found array "A" in the group after opening store2 with mode w.

Questions:

  1. Apart from the last "surprise", does the model make sense?
  2. Is this too big of a departure from 2.18? Are there ways to mitigate potential downsides?

Edit: here's a short example using a memory store with what I would consider expected behavior. Note that the only place where mode is specified is when we instantiate the Store.

data = {}
store = zarr.storage.MemoryStore(data, mode='w')

# create a group with two children
group = zarr.open_group(store=store, zarr_format=3)
print(group.members())
# ()

group.create("A", shape=1)
print(group.members())
# (('A', <Array memory://5393639360/A shape=(1,) dtype=float64>),)

group.create("B", shape=1)
print(group.members())
# (('B', <Array memory://5393639360/B shape=(1,) dtype=float64>), ('A', <Array memory://5393639360/A shape=(1,) dtype=float64>))

# re-open the group in a mode
store2 = zarr.storage.MemoryStore(data, mode='a')
group = zarr.open_group(store=store2, zarr_format=3)
print(group.members())
(('B', <Array memory://5393696256/B shape=(1,) dtype=float64>), ('A', <Array memory://5393696256/A shape=(1,) dtype=float64>))

# re-open the group in w mode
store2 = zarr.storage.MemoryStore(data, mode='w')
group = zarr.open_group(store=store2, zarr_format=3)
print(group.members())
# ()

@rabernat
Copy link
Contributor Author

rabernat commented Oct 14, 2024

The simplest way to think of this is that the store is kind of like a file object, opening it in w mode will give you a new (empty) store.

I think this is a really bad default behavior and will lead to catastrophic scenarios of data loss. Simply opening a store in 'w' mode should absolutely not delete the entire contents of the store. IMO, simply opening or initializing a store should have no side effects regardless of the mode, other than perhaps checking for the existence of the store.

On the other hand, the various Array and Group open / create APIs also have a mode, and that's where I would expect things to happen. Calling zarr.open_group(mode="w") should delete only the contents of that group.

@d-v-b
Copy link
Contributor

d-v-b commented Oct 17, 2024

Simply opening a store in 'w' mode should absolutely not delete the entire contents of the store.

I think this is right. @jhamman could you restate (or link to) the original reasoning for moving the mode semantics to the stores and away from arrays / groups? I am curious to see how we can fulfill the goals of that change while avoiding the behavior ryan is decrying.

@jhamman
Copy link
Member

jhamman commented Oct 17, 2024

I don't want to come across like I have a particular stake in the outcome here, but I'll try to describe the argument for how things are today. We can definitely change the semantics and/or order of operations for when actions occure.

Zarr-Python 2

  • mode is handled by the top level API (open_group and open_array), and is passed to some stores (FSStore and ZipStore)
    • FSStore supported w and r modes with no side effects for w mode
    • ZipStore supported w, r, a, and x modes with side effects for existing stores if opened in w or x mode.
    • Most other stores did not include a mode and, because the store api was a simple mutable mapping, no expectation of store state was provided by the store. Likewise, the store API could do nothing to enforce read-only mode.
  • The overwrite: bool argument is used in the Group class to allow for clobbering exiting children. This is detached from the store mode (that may or may not be present)

Zarr-Python 3

  • we saw adding the mode parameter to the store as a way to unify the behavior of the stores and provide the rest of the zarr API with a consistent target.
  • As a result, we have (mostly) been able to avoid permissions checks throughout the Array and Group APIs
  • I can't say there was a ton of thought behind it, but we choose to use standard file mode semantics for what w, r, a, etc meant. To me, there are three modes we should consider (they need not use the string names we have for them today)
    • "fresh start / read-write" - a mode producing an empty store
    • "read-only" - no writes/deletes are permitted to the store
    • "read-write" - full-access - read, write, edit, append, whatever you want

Some additional notes:

  • Do we agree that open_{array, group}(..., path='foo', mode='w') should have the side effect of rm -r foo*? If so, perhaps we need to change the order of operations and only apply Store.clear() after the path has been applied.

This was referenced Oct 17, 2024
@jhamman jhamman added this to the 3.0.0 milestone Oct 18, 2024
@jhamman jhamman changed the title Re-opening an existing RemoteStore V3 group causes the first child to be deleted [v3] Re-opening an existing RemoteStore V3 group causes the first child to be deleted Oct 18, 2024
@jhamman jhamman moved this to Todo in Zarr-Python - 3.0 Oct 18, 2024
@jhamman
Copy link
Member

jhamman commented Oct 21, 2024

I spent some time over the weekend thinking through this situation and I've come up with a new proposal.

  1. Stores continue to support a write mode, but a more limited set of modes, just w or r. Neither have side effects at open time.
  2. Store.clear is never called by zarr when opening a store.
  3. We finish teaching zarr how to delete things , requiring all writable stores implement erase_prefix. Or perhaps provide a default implementation that looks like [store.delete(p) async for p in store.list_prefix(prefix)].
  4. Creating new hierarchy members would a) require mode == 'w' and b) call erase_prefix if overwrite=True (and error if overwrite=False and member already exists).
  5. In user facing zarr api, we would also want to carefully align all the uses of mode, overwrite, and clobber.

@zarr-developers/python-core-devs - what do you think about this plan?

@TomAugspurger
Copy link
Contributor

I like it. Especially that the operations are local to the prefix being updated, and are delayed until the operation is invoked.

@normanrz
Copy link
Contributor

I like having less modes and side effects!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
V3 Affects the v3 branch
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

5 participants