Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Edge case of decompressing legacy file with top-level metadata #65

Open
brianzhang01 opened this issue Jun 3, 2022 · 3 comments
Open

Comments

@brianzhang01
Copy link
Member

brianzhang01 commented Jun 3, 2022

Back in April 2021, I had some TreeSequence objects with top-level metadata that I wanted to write out using tszip, so I proposed PR #35. I've stuck to a version of tszip with that change for the past year.

PR #42 involved a rewrite of tszip that mostly supports legacy versions, but I have found that the top-level metadata in files I had previously written seem to be missing. Top-level metadata from newly tszipped files works fine as it is stored in root.items() and is correctly handled, but the previous metadata is in root.attrs.items(). Another piece of good news is that PR #35 never got its own tszip version on PyPI, so I'm probably the only who uses it. 😄

After quite a bit of debugging, trying to understand what PR #42 did, I came up with a solution that seems to work locally.

def decompress_zarr(root):
    coordinates = root["coordinates"][:]
    dict_repr = {"sequence_length": root.attrs["sequence_length"]}

    # Added by brianzhang01
    for key, value in root.attrs.items():
        if key == "metadata_schema":
            dict_repr[key] = json.dumps(value)
        elif key == "metadata":
            dict_repr[key] = json.dumps(value).encode("utf-8")

Would someone with more knowledge of the codebase be able to take a look and modify as necessary? My PR #35 includes a test, test_small_msprime_top_level_metadata, that can be used to construct a TreeSequence with some top-level metadata. If you write it out with a version right after PR #35 and then read it in with the latest version, I see the following keys for root.attrs.items():

format_name
format_version
metadata
metadata_schema
provenance
sequence_length

and the following are the keys of root.items():

coordinates
edges
individuals
migrations
mutations
nodes
populations
provenances
sites

I would also suggest checking whether the format_name, format_version, and provenance fields of root.attrs are correctly handled for the legacy versions. It seemed fine to me with the provenance correctly carried over, but I notice that only root.attrs["sequence_length"] is accessed in the current decompress_zarr() function.

Thank you very much.

@jeromekelleher
Copy link
Member

Thanks for the report @brianzhang01. Just so we're clear, did the version of the file format that you're using ever get released? If not, then it's hard for us to justify adding complexity to the codebase to support it, unless there are end-users out there being affected.

Would a script to fix-up your tszipped files so that they use the same format as the released version help, or do you have files "out in the wild" that use this format?

@brianzhang01
Copy link
Member Author

brianzhang01 commented Jun 6, 2022

Thanks @jeromekelleher . I expect myself to be the only end user, as I was using a version of the code between PR #35 and PR #42 (which did not get its own PyPI release). I also don't know of anyone besides me who uses top-level metadata in tskit.TreeSequence objects. I do have thousands of tszipped files which wrote top-level metadata in this way...

If it's simpler, feel free to close the issue and I will add the 5 lines of code posted above to a local version of tszip as a bridge between old and new formats. We can alternatively merge those 5 lines of code in, but it would only benefit myself.

Also, I looked into my earlier comment about "whether the format_name, format_version, and provenance fields of root.attrs are correctly handled". I realize this was a misunderstanding; those fields are not loaded in the decompress_zarr function, but they are used in the check_format and print_summary functions, and their representation has stayed the same before and after PR #42.

@jeromekelleher
Copy link
Member

Using development versions of any tskit-dev project in production is something we do discourage @brianzhang01, becomes it makes the process of incremental development (where we have to ensure that code is production ready on every commit) far more dificult. We need to be free to try stuff out without taking on the contract of perpetual support. Adding your patch above will end up being quite a lot of work because we'll have to contrive specific test cases for it and maintain it into the future.

Perhaps a reasonable way forward would be to convert your files to use the production format? It should be only a few lines of Python to convert the top-level metadata, and it should be very fast because you won't need to decompress the data arrays. I'm happy to help with this process if you post a small example of one of your zarr files.

I wouldn't recommend using a local patched version of tszip, unless it's for a once-off upgrade to the production format.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants