new data integrity checks #54

Open
2 tasks
gregcaporaso opened this issue Jan 12, 2017 · 5 comments

Comments

@gregcaporaso
Member

  • naming of OTU directories should be of the format: <similarity-threshold>-otus
  • expected sequences files should be named expected-sequences.fasta, and no other fasta files should be present in those directories
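
The two checks above could be sketched roughly as follows. This is only an illustration, not the project's actual integrity-check code: the function names are hypothetical, the similarity threshold is assumed to be an integer percentage (for example 97 or 100), and the set of extensions treated as fasta files is also an assumption.

```python
import re

# Assumption: <similarity-threshold> is an integer percentage, e.g. "97-otus".
OTU_DIR_PATTERN = re.compile(r"^(\d{1,3})-otus$")

# Assumption: these extensions are what counts as "a fasta file".
FASTA_EXTENSIONS = (".fasta", ".fa", ".fna")

EXPECTED_SEQUENCES_NAME = "expected-sequences.fasta"


def is_valid_otu_dir_name(name: str) -> bool:
    """Return True if name looks like <similarity-threshold>-otus."""
    match = OTU_DIR_PATTERN.match(name)
    return bool(match) and 0 < int(match.group(1)) <= 100


def find_unexpected_fasta(filenames: list[str]) -> list[str]:
    """Return fasta files that are not named expected-sequences.fasta."""
    return [
        f for f in filenames
        if f.lower().endswith(FASTA_EXTENSIONS) and f != EXPECTED_SEQUENCES_NAME
    ]
```

A check would then fail for any directory where `is_valid_otu_dir_name` returns False or `find_unexpected_fasta` returns a non-empty list.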
@nbokulich
Contributor

I agree with item 1 but have some "devil's advocate" questions regarding item 2.

In some ways, the source directory could be useful as a sort of "junk drawer" for the mock community, and contributors could include other information that doesn't fit elsewhere. For example, a list of GenBank accession numbers for whole genome sequences (which might not be appropriate in the "expected taxonomy" directories, which are specific to reference databases that provide taxonomy information). Of course, we have control over this so the files would never be "junk", just a collection of useful files that do not fit in the other directories (which are more regulated).

Naming conventions in source could also have some flexibility. For example, expected-sequences.fasta can be rather vague; full-length-16S-expected-sequences.fasta or V4-domain-expected-sequences.fasta could be more informative.

What do you think?
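
One way the more flexible naming suggested above might be checked is a suffix match that accepts an optional descriptive prefix. This is a minimal sketch under that assumption; the function name is hypothetical and the exact rule is not settled in this thread.

```python
EXPECTED_SUFFIX = "expected-sequences.fasta"


def is_expected_sequences_file(filename: str) -> bool:
    """Accept expected-sequences.fasta with an optional '<description>-' prefix."""
    if filename == EXPECTED_SUFFIX:
        return True
    # Require a non-empty description followed by a hyphen before the suffix.
    prefixed = "-" + EXPECTED_SUFFIX
    return filename.endswith(prefixed) and len(filename) > len(prefixed)
```

Under this rule, V4-domain-expected-sequences.fasta and full-length-16S-expected-sequences.fasta would both pass, while a name such as unexpected-sequences.fasta would not.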

@gregcaporaso
Member Author

I think that all makes sense, I'm good with it.

@nbokulich
Contributor

What should we do for shotgun metagenome datasets? I think I support keeping the <similarity-threshold>-otus requirement across the board for simplicity's sake, and such datasets could be labeled 100-otus. But would it be better to enforce this rule only for marker-gene datasets, and use different rules for metagenome datasets?

@gregcaporaso
Member Author

I think your suggestions would work well.

@nbokulich
Contributor

Thanks! I will make that rule standard then when I update the integrity checks.
