Germplasm and Genotype data #67

Open
laceysanderson opened this issue Oct 12, 2018 · 3 comments

@laceysanderson

I've been enjoying Tripal DevSeed as a quick way to get a testing environment up but need data to test germplasm and genotype-based functionality... I am willing to contribute to this project to see such data added if there is interest :-)

@bradfordcondon
Contributor

Absolutely 👍 I just don't know what sort of input to expect, so if you can share that we can take it from there.

There are 3 components to DevSeed:

a) guides on how to load the data
b) scripts to generate/miniaturize data given relevant input
c) the data itself

If we can only get c) in for now that's fine. I think a) is great too, and I'm not sure how useful b) actually is...

@laceysanderson
Author

I will definitely contribute both A and C :-)

For the data, I noticed the current minified set focuses on Fraxinus excelsior. Is the preference for public datasets (difficult for germplasm data) or would real-life data that is anonymized as Tripalus databasica also be welcome?

As far as what the data looks like:

  • Genotypic data would be a VCF file. This can be easily loaded via our genotypes_loader. I could also provide a GFF3 for variants and markers, although both can be created by the loader so I'm not sure if it's useful? Might be nice if people want the markers but not the genotypes 🤷‍♀️
  • Germplasm data would be a CSV describing each "Cross" or "Accession" and a second CSV with relationship information. I can provide templates for the Tripal Bulk Loader to load this data.
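
To make the genotype input concrete, here is a minimal sketch of a script that writes a tiny, random VCF. It is only an illustration: the function name `make_vcf`, the sample names, and the `snpN` marker IDs are hypothetical, and whether the genotypes_loader needs additional header lines or fields is not confirmed in this thread.

```python
import random


def make_vcf(samples, n_variants=5, chrom="chr1", seed=0):
    """Return a small VCF 4.2 document as a string (random, not real biology)."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    bases = "ACGT"
    fixed_columns = ["CHROM", "POS", "ID", "REF", "ALT",
                     "QUAL", "FILTER", "INFO", "FORMAT"]
    lines = [
        "##fileformat=VCFv4.2",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "#" + "\t".join(fixed_columns + list(samples)),
    ]
    pos = 0
    for i in range(n_variants):
        pos += rng.randint(50, 500)  # keep positions strictly increasing
        ref = rng.choice(bases)
        alt = rng.choice([b for b in bases if b != ref])
        # One unphased diploid call per sample; "./." marks a missing call.
        calls = [rng.choice(["0/0", "0/1", "1/1", "./."]) for _ in samples]
        record = [chrom, str(pos), f"snp{i + 1}", ref, alt,
                  ".", "PASS", ".", "GT"] + calls
        lines.append("\t".join(record))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical anonymized sample names.
    print(make_vcf(["TD-ACC-001", "TD-ACC-002"], n_variants=3), end="")
```

Because the variants are random, the file is only useful for exercising loaders and displays, not for anything where the biology has to make sense.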

@bradfordcondon
Contributor

bradfordcondon commented Oct 12, 2018

> For the data, I noticed the current minified set focuses on Fraxinus excelsior. Is the preference for public datasets (difficult for germplasm data) or would real-life data that is anonymized as Tripalus databasica also be welcome?

This is actually why I switched to writing Python scripts that randomly generated biomaterials during the peer review process: too many questions about which records I used and why. I didn't go all the way and switch to a Tripalus databasica because I still wanted to use "real" biological sequences, since I wanted the feature annotations to make sense. So in your case we have to ask "will someone care if the biology makes sense?" In particular with variants, I guess.

In an ideal world you would contribute to F. excelsior with fake or anonymous data. If that's not possible then I'd be OK with a separate organism folder for T. databasica if that's your preference. If you look at the biomaterial generator (https://github.com/statonlab/tripal_dev_seed/blob/master/bin/generate_biomaterials.py) you might be surprised at how easy it is to build a random metadata generator thanks to Faker.

Edit: well, since that file produces XML it's going to be a little extra confusing. How about https://github.com/statonlab/tripal_dev_seed/blob/master/bin/generate_expression.py, which generates TSV data?
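
As a rough illustration of the random-metadata idea applied to germplasm, the sketch below produces the two CSVs described earlier in the thread: one row per "Accession" or "Cross", plus a relationship file. It is not taken from the linked scripts and uses the standard library's `random` instead of Faker so that it runs without extra dependencies. The column names, the `TD-` naming scheme, and the relationship type names are all assumptions made for the example.

```python
import csv
import io
import random

ORGANISM = "Tripalus databasica"  # anonymized placeholder organism


def generate_germplasm(n_accessions=4, n_crosses=2, seed=0):
    """Return (germplasm_rows, relationship_rows) as lists of dicts."""
    rng = random.Random(seed)
    accessions = [
        {"name": f"TD-ACC-{i:03d}", "type": "Accession", "organism": ORGANISM}
        for i in range(1, n_accessions + 1)
    ]
    crosses, relationships = [], []
    for i in range(1, n_crosses + 1):
        # Pick two distinct accessions as the parents of each cross.
        maternal, paternal = rng.sample(accessions, 2)
        cross = {"name": f"TD-CROSS-{i:03d}", "type": "Cross",
                 "organism": ORGANISM}
        crosses.append(cross)
        # Relationship type names below are hypothetical.
        relationships.append({"subject": maternal["name"],
                              "relationship": "is_maternal_parent_of",
                              "object": cross["name"]})
        relationships.append({"subject": paternal["name"],
                              "relationship": "is_paternal_parent_of",
                              "object": cross["name"]})
    return accessions + crosses, relationships


def to_csv(rows):
    """Serialize a non-empty list of dicts to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    germplasm, relationships = generate_germplasm()
    print(to_csv(germplasm))
    print(to_csv(relationships))
```

Seeding the generator keeps the output stable between runs, which matters if the generated files are committed as the data component.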

> I could also provide a GFF3 for variants and markers, although both can be created by the loader

Are you hoping to integrate with the pre-generated DevSeed Seeder in the test suite? If so, would it be necessary to have all the files pre-generated for automatic loading? Or does your genotypes loader automatically load the resulting GFF/variants info? If it does then I don't see the need to provide both unless you want to.

> I can provide templates for the Tripal Bulk Loader to load this data.

Yes please.
