TDR dev dataset is stale #17

Comments
Step one is to settle on the subset of projects, or more specifically staging areas, from which we should populate the dataset.
Sounds good to me. If you want to just choose one 10X project that has analysis results and project matrices, that would work. And then whatever we chose previously for having a small SS2 dataset, and ideally at least one larger one so that we can test scale a bit more.
Could you link or name the specific projects?
This would be a good 10X project: 559bb888-7829-41f2-ace5-2c05c7eb81e9. We may want to keep the existing analysis results in dev for that SS2 project, though we may end up replacing them with our second iteration that should fix the project matrices that didn't index properly. Lmk what you think.
We would also like
And https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6 since it's the only project with organically described CGMs. |
A consolidated list with all the projects mentioned above:
The challenge now is to translate this into a list of staging areas. If that list ends up including a DSS adapter's staging area, we will need to decide whether we want to import it as is, thereby increasing the size of the new dataset, or whether we want to create a stripped-down copy of the staging area that only includes the mentioned projects. Another challenge is to match the projects between the primary staging area and the CGM staging areas.
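One input to that decision could be a rough size estimate. Assuming the links table carries a project_id column, as the BQ query quoted further down in this thread suggests, counting subgraphs per project in an already imported dataset gives a sense of how much each candidate would contribute; the dataset name below is only an example and would need to be pointed at the relevant dataset:

-- Sketch: number of subgraphs (links rows) per project in an existing dataset.
select
  project_id,
  count(*) as subgraph_count
from `broad-jade-dev-data.hca_dev_20201203___20210524_lattice.links`
group by project_id
order by subgraph_count desc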
Here are the staging areas:
What's the plan on only importing the relevant CGM data and metadata files? Will you create a stripped down SA for these? We need to dig up the analysis SA(s) for these. Not sure if there is only one or more than one.
Since these are organic CGMs that are in the same SA as the rest of the project, nothing special needs to be done here.
We need to dig up the analysis SA(s) for these. Not sure if there is only one or more than one.
We need to dig up the analysis SA(s) for these. Not sure if there is only one or more than one.
@hannes-ucsc re: your first comment, yes, we will be creating a stripped down SA.
Cool. Let me know if you'd like us to help with any of that.
Thanks @hannes-ucsc. For https://dev.singlecell.gi.ucsc.edu/explore/projects/8c3c290d-dfff-4553-8868-54ce45f4ba7f do you have a mechanism for finding the related analysis files? I have been digging them up ad hoc for the other projects, but the 6800 links files for that project make that impractical and will require a bit of automation on my part to find them.

Re: the lattice data for f0f89c14-7460-4bab-9d42-22228a91f185, I note that the staging area contains a single

These are the related analysis files for the other 3 projects:
We usually use a BQ query like this one:

select
json_extract(analysis_file.content, "$.file_core.file_name") as file_name
from `broad-jade-dev-data.hca_dev_20201203___20210524_lattice.links` as links
join unnest(json_extract_array(links.content, '$.links')) as content_links
on json_extract_scalar(content_links, '$.link_type') = 'process_link'
join unnest(json_extract_array(content_links, '$.outputs')) as outputs
on json_extract_scalar(outputs, '$.output_type') = 'analysis_file'
join `broad-jade-dev-data.hca_dev_20201203___20210524_lattice.analysis_file` as analysis_file
on json_extract_scalar(outputs, '$.output_id') = analysis_file.analysis_file_id
where project_id = '8c3c290d-dfff-4553-8868-54ce45f4ba7f'
limit 100
@theathorn can we ask the Stanford folks to repopulate the staging area? I have the feeling this is a temporary condition.
There are a few more metadata entities that are associated with each analysis file. The unit of work should be subgraphs, not data files or individual entities. Let me ask this: since we're OK with including the dcp1-migrated data (which significantly increases the size of the dataset, both in number of rows and volume of data), wouldn't it be easier to just copy prod to dev, so to speak?
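To illustrate the point about subgraphs being the unit of work, here is a sketch of how the query above could be generalized to enumerate every entity that a project's subgraphs reference, i.e. inputs, outputs and protocols, instead of just the analysis files. This is only an illustration: it assumes the links table also exposes a links_id column, which this thread does not confirm, and it ignores supplementary_file_link entries.

with subgraph_links as (
  select links.links_id, content_link
  from `broad-jade-dev-data.hca_dev_20201203___20210524_lattice.links` as links
  join unnest(json_extract_array(links.content, '$.links')) as content_link
    on json_extract_scalar(content_link, '$.link_type') = 'process_link'
  where project_id = '8c3c290d-dfff-4553-8868-54ce45f4ba7f'
)
-- outputs of each process link, e.g. analysis_file entities
select links_id,
       json_extract_scalar(entity, '$.output_type') as entity_type,
       json_extract_scalar(entity, '$.output_id') as entity_id
from subgraph_links, unnest(json_extract_array(content_link, '$.outputs')) as entity
union all
-- inputs, e.g. cell_suspension or sequence_file entities
select links_id,
       json_extract_scalar(entity, '$.input_type'),
       json_extract_scalar(entity, '$.input_id')
from subgraph_links, unnest(json_extract_array(content_link, '$.inputs')) as entity
union all
-- protocols attached to each process
select links_id,
       json_extract_scalar(entity, '$.protocol_type'),
       json_extract_scalar(entity, '$.protocol_id')
from subgraph_links, unnest(json_extract_array(content_link, '$.protocols')) as entity

The resulting entity_type/entity_id pairs would amount to a per-project manifest of metadata entities.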
@hannes-ucsc, thanks.
Ah, thanks, we'll take that into account and update the list.
We do not have that ability. However, we will need to build such a thing for the TDR prod migration that we'll be speaking about.
The team discussed this today after the DCP demo. Consensus is that Monster implements tooling to selectively copy the meta(data) of individual projects between two TDR instances. This would eliminate the need to retain staging areas after they were imported, so that they might be imported again, a need we never actually specified. The ability to only copy selected projects would address the concern of
Regression from d5c935f
For demo, show diversity of sources in service responses.
Rather than patching DataBiosphere/azul#2873 and DataBiosphere/azul#2870, we think that it's time to repopulate the dev dataset from scratch with a subset of projects from the dcp4 catalog used in prod. This would 1) reduce the size of the dev catalog and 2) make sure it is more representative of the current production systems. For example, the current dev snapshot does not have any intact analysis subgraphs or DCP/2-generated matrices.