Edge case of decompressing legacy file with top-level metadata #65

brianzhang01 · 2022-06-03T21:15:49Z

Back in April 2021, I had some TreeSequence objects with top-level metadata that I wanted to write out using tszip, so I proposed PR #35. I've stuck to a version of tszip with that change for the past year.

PR #42 involved a rewrite of tszip that mostly supports legacy versions, but I have found that the top-level metadata in files I had previously written seem to be missing. Top-level metadata from newly tszipped files works fine as it is stored in root.items() and is correctly handled, but the previous metadata is in root.attrs.items(). Another piece of good news is that PR #35 never got its own tszip version on PyPI, so I'm probably the only who uses it. 😄

After quite a bit of debugging, trying to understand what PR #42 did, I came up with a solution that seems to work locally.

def decompress_zarr(root):
    coordinates = root["coordinates"][:]
    dict_repr = {"sequence_length": root.attrs["sequence_length"]}

    # Added by brianzhang01
    for key, value in root.attrs.items():
        if key == "metadata_schema":
            dict_repr[key] = json.dumps(value)
        elif key == "metadata":
            dict_repr[key] = json.dumps(value).encode("utf-8")

Would someone with more knowledge of the codebase be able to take a look and modify as necessary? My PR #35 includes a test, test_small_msprime_top_level_metadata, that can be used to construct a TreeSequence with some top-level metadata. If you write it out with a version right after PR #35 and then read it in with the latest version, I see the following keys for root.attrs.items():

format_name
format_version
metadata
metadata_schema
provenance
sequence_length

and the following are the keys of root.items():

coordinates
edges
individuals
migrations
mutations
nodes
populations
provenances
sites

I would also suggest checking whether the format_name, format_version, and provenance fields of root.attrs are correctly handled for the legacy versions. It seemed fine to me with the provenance correctly carried over, but I notice that only root.attrs["sequence_length"] is accessed in the current decompress_zarr() function.

Thank you very much.

The text was updated successfully, but these errors were encountered:

jeromekelleher · 2022-06-06T09:27:52Z

Thanks for the report @brianzhang01. Just so we're clear, did the version of the file format that you're using ever get released? If not, then it's hard for us to justify adding complexity to the codebase to support it, unless there are end-users out there being affected.

Would a script to fix-up your tszipped files so that they use the same format as the released version help, or do you have files "out in the wild" that use this format?

brianzhang01 · 2022-06-06T21:17:05Z

Thanks @jeromekelleher . I expect myself to be the only end user, as I was using a version of the code between PR #35 and PR #42 (which did not get its own PyPI release). I also don't know of anyone besides me who uses top-level metadata in tskit.TreeSequence objects. I do have thousands of tszipped files which wrote top-level metadata in this way...

If it's simpler, feel free to close the issue and I will add the 5 lines of code posted above to a local version of tszip as a bridge between old and new formats. We can alternatively merge those 5 lines of code in, but it would only benefit myself.

Also, I looked into my earlier comment about "whether the format_name, format_version, and provenance fields of root.attrs are correctly handled". I realize this was a misunderstanding; those fields are not loaded in the decompress_zarr function, but they are used in the check_format and print_summary functions, and their representation has stayed the same before and after PR #42.

jeromekelleher · 2022-06-07T09:21:02Z

Using development versions of any tskit-dev project in production is something we do discourage @brianzhang01, becomes it makes the process of incremental development (where we have to ensure that code is production ready on every commit) far more dificult. We need to be free to try stuff out without taking on the contract of perpetual support. Adding your patch above will end up being quite a lot of work because we'll have to contrive specific test cases for it and maintain it into the future.

Perhaps a reasonable way forward would be to convert your files to use the production format? It should be only a few lines of Python to convert the top-level metadata, and it should be very fast because you won't need to decompress the data arrays. I'm happy to help with this process if you post a small example of one of your zarr files.

I wouldn't recommend using a local patched version of tszip, unless it's for a once-off upgrade to the production format.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Edge case of decompressing legacy file with top-level metadata #65

Edge case of decompressing legacy file with top-level metadata #65

brianzhang01 commented Jun 3, 2022 •

edited

Loading

jeromekelleher commented Jun 6, 2022

brianzhang01 commented Jun 6, 2022 •

edited

Loading

jeromekelleher commented Jun 7, 2022

Edge case of decompressing legacy file with top-level metadata #65

Edge case of decompressing legacy file with top-level metadata #65

Comments

brianzhang01 commented Jun 3, 2022 • edited Loading

jeromekelleher commented Jun 6, 2022

brianzhang01 commented Jun 6, 2022 • edited Loading

jeromekelleher commented Jun 7, 2022

brianzhang01 commented Jun 3, 2022 •

edited

Loading

brianzhang01 commented Jun 6, 2022 •

edited

Loading