some manifest challenges - wrong manifests; regenerating them; more fields needed #1849

Open
ctb opened this issue Feb 21, 2022 · 3 comments


@ctb
Contributor

ctb commented Feb 21, 2022

Over in #1837, I'm discovering some fun challenges with manifests 🎉 .

First, it turns out that manifests do not contain seed or license (see also #1846 for motivation and discovery), so those should be added.
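A minimal sketch of what carrying those two fields could look like, assuming a manifest is a CSV with one row per sketch. The `SketchInfo` class, the other column names, and the default values here are illustrative assumptions, not sourmash's actual manifest schema or API.

```python
import csv
import io
from dataclasses import dataclass, asdict


# Hypothetical stand-in for the per-sketch metadata a manifest row records.
@dataclass
class SketchInfo:
    internal_location: str
    md5: str
    ksize: int
    moltype: str
    scaled: int
    name: str
    seed: int = 42        # assumed default, for illustration
    license: str = "CC0"  # assumed default, for illustration


def write_manifest(rows):
    """Serialize rows to CSV text, including the seed and license columns."""
    fieldnames = list(SketchInfo.__dataclass_fields__)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return out.getvalue()


def read_manifest(text):
    """Parse CSV text back into a list of dict rows (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


rows = [
    SketchInfo("a.sig", "abc123", 31, "DNA", 1000, "genome A"),
    SketchInfo("b.sig", "def456", 31, "DNA", 1000, "genome B", seed=43),
]
manifest_text = write_manifest(rows)
loaded = read_manifest(manifest_text)
```

With the extra columns in place, a selection step could filter on seed or license from the manifest alone, without opening the underlying signature files.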

Second, in #1837 itself, the get_manifest code has the option of regenerating manifests, but we haven't really standardized the API for getting a 'fresh' manifest. Right now we just iterate over an internal API, if it's available. But some classes don't need to do that - e.g. the SqliteIndex in #1808 generates the index fresh each time, and doesn't support the internal API for iterating over all signatures! Not sure what to do here, but maybe we need a standard API for regenerating a manifest?
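One possible shape for such a standard API, sketched under heavy assumptions: every index class exposes the same method for producing a fresh manifest and implements it however suits its storage. All class and method names below (`Index`, `regenerate_manifest`, `InMemoryIndex`, `TableBackedIndex`, and this `get_manifest` signature) are hypothetical and are not sourmash's actual classes.

```python
from abc import ABC, abstractmethod


class Index(ABC):
    """Hypothetical base class; not sourmash's actual Index."""

    @abstractmethod
    def regenerate_manifest(self):
        """Return a fresh list of manifest rows (dicts) describing the contents."""


class InMemoryIndex(Index):
    """Builds a manifest by iterating over every stored sketch."""

    def __init__(self, sketches):
        self._sketches = list(sketches)  # (location, metadata dict) pairs

    def regenerate_manifest(self):
        return [dict(meta, internal_location=loc) for loc, meta in self._sketches]


class TableBackedIndex(Index):
    """Stands in for an index that keeps metadata in a table and has no
    way to iterate over signatures; it answers from the table directly."""

    def __init__(self, table):
        self._table = table  # list of dict rows

    def regenerate_manifest(self):
        return [dict(row) for row in self._table]


def get_manifest(idx, cached=None, rebuild=False):
    """Use the cached manifest unless a rebuild is requested or none exists."""
    if cached is not None and not rebuild:
        return cached
    return idx.regenerate_manifest()


mem = InMemoryIndex([("a.sig", {"ksize": 31}), ("b.sig", {"ksize": 21})])
tab = TableBackedIndex([{"internal_location": "row-1", "ksize": 31}])
stale = [{"internal_location": "old.sig", "ksize": 31}]

kept = get_manifest(mem, cached=stale)
fresh = get_manifest(mem, cached=stale, rebuild=True)
from_table = get_manifest(tab)
```

The point of the design is that the caller never needs to know whether an index can iterate over its signatures; it only asks for a fresh manifest.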

Third, there are some interesting corner cases popping up in #1837 where the manifest may (or may not) contain all signatures in the database. One specific case is ZipFileLinearIndex, where if the manifest was generated with traverse_all_files, it may contain signatures from files that don't have .sig in the name. This results in oddities where you get different reports out of sourmash sig fileinfo depending on whether you've asked it to regenerate the manifest or not. For example, if you're looking at tests/test-data/prot/all.zip, the included manifest does contain dna-sig.noext, but if you regenerate the manifest from an index loaded without traverse_all_files=True, you'll exclude it. See the test_fileinfo_4_zip* tests as well as the test_sig_manifest_7_allzip tests for tests that explore this behavior.
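The mismatch can be reproduced in miniature with nothing but the standard library. This is a sketch of the general effect, not of ZipFileLinearIndex itself: the `list_members` helper and its name filter are assumptions, and the member names other than dna-sig.noext are made up.

```python
import io
import zipfile


def build_zip():
    """Create an in-memory zip with two '.sig' members and one without."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("dna-sig.sig", "{}")
        zf.writestr("dna-sig.noext", "{}")
        zf.writestr("protein.sig.gz", "{}")
    buf.seek(0)
    return buf


def list_members(zf, traverse_all_files=False):
    """Hypothetical loader filter: skip names lacking '.sig' unless told otherwise."""
    return [
        name
        for name in zf.namelist()
        if traverse_all_files or ".sig" in name
    ]


with zipfile.ZipFile(build_zip()) as zf:
    # What a manifest built with traverse_all_files=True would list.
    stored = list_members(zf, traverse_all_files=True)
    # What a manifest regenerated with the default filter would list.
    regenerated = list_members(zf)

missing = sorted(set(stored) - set(regenerated))
```

Here `missing` holds the one member that the stored manifest knows about but the regenerated one drops, which is the same kind of disagreement that shows up in the fileinfo reports.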

In some sense this is a known problem with manifests - they can get out of date or be wrong! - and I'm actually kind of happy to have these edge cases around so that we can test weird branches in the code, but I also think they probably are worth a bit of long-term attention ;).

ref: #1599

@ctb
Contributor Author

ctb commented Mar 25, 2022

interesting post via luiz - ninja build system thoughts - with a nice section on manifests.

@ctb
Contributor Author

ctb commented Mar 26, 2022

from #1352 (comment), an interesting idea:

I guess this could then lead to a gradation of collection/index storages:

  • level 0, random collection of files, gotta traverse and load them all to figure out if they're correct
  • level 1, partial/incomplete/untrusted manifest allowing ignoring of some of the signatures based on characteristics; this might be something where after a full traversal, a manifest is generated automatically for some cases (like zip files and directory indexes). note, this would actually be a pretty good use case for zip files, which can store things like manifests alongside signatures, unlike .sig files.
  • level 2, contents completely managed by sourmash, manifest is completely trustworthy (e.g. LCA/revindex databases, or SBTs)
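One possible reading of these levels, sketched as a loading policy. Everything here is hypothetical (the `ManifestTrust` enum, `plan_load`, and the row layout); in particular, how level 1 should treat files the manifest does not mention is an assumption of this sketch, not something the comment settles.

```python
from enum import IntEnum


class ManifestTrust(IntEnum):
    NONE = 0     # level 0: no usable manifest, traverse everything
    PARTIAL = 1  # level 1: manifest may be incomplete or out of date
    MANAGED = 2  # level 2: contents and manifest fully managed together


def plan_load(trust, manifest_rows, all_locations, wanted):
    """Return the locations that must be opened to find wanted sketches."""
    if trust == ManifestTrust.NONE or manifest_rows is None:
        return list(all_locations)

    selected = [r["internal_location"] for r in manifest_rows if wanted(r)]
    if trust == ManifestTrust.MANAGED:
        return selected

    # PARTIAL: skip listed sketches that do not match, but still open any
    # file the manifest does not mention, since it may hold a match.
    listed = {r["internal_location"] for r in manifest_rows}
    unlisted = [loc for loc in all_locations if loc not in listed]
    return selected + unlisted


manifest = [
    {"internal_location": "a.sig", "ksize": 31},
    {"internal_location": "b.sig", "ksize": 21},
]
locations = ["a.sig", "b.sig", "c.noext"]


def want_k31(row):
    return row["ksize"] == 31


level0 = plan_load(ManifestTrust.NONE, manifest, locations, want_k31)
level1 = plan_load(ManifestTrust.PARTIAL, manifest, locations, want_k31)
level2 = plan_load(ManifestTrust.MANAGED, manifest, locations, want_k31)
```

The higher the trust level, the fewer files need to be opened: level 0 opens all three, level 1 skips only the sketch the manifest rules out, and level 2 opens just the match.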

@ctb
Contributor Author

ctb commented Sep 5, 2022

additional information that could be useful in manifests: the type of sketch (FracMinHash, MinHash, etc) - ref #751 also
