TDR dev dataset is stale #17

Closed
hannes-ucsc opened this issue Apr 23, 2021 · 20 comments
Labels
data [subject] Data or metadata [use of this label is uncommon] debt [type] A defect incurring continued engineering cost demoed [process] Successfully demonstrated to team orange [process] Done by the Azul team spike:1 [process] Spike estimate of one point task [type] Resolution requires engineering action other than code changes

Comments

@hannes-ucsc
Collaborator

hannes-ucsc commented Apr 23, 2021

Rather than patching

DataBiosphere/azul#2873
DataBiosphere/azul#2870

we think that it's time to repopulate the dev dataset from scratch with a subset of projects from the dcp4 catalog used in prod. This would 1) reduce the size of the dev catalog and 2) make sure it is more representative of the current production systems. For example, the current dev snapshot does not have any intact analysis subgraphs or DCP/2-generated matrices.

@github-actions github-actions bot added the orange [process] Done by the Azul team label Apr 23, 2021
@hannes-ucsc
Collaborator Author

Step one is to settle on the subset of projects, more specifically staging areas, that we should use to populate the dataset from.

@kbergin
Collaborator

kbergin commented Apr 23, 2021

Sounds good to me.. if you want to just choose one 10X project that has analysis results and project matrices that would work. And then whatever we chose previously for having a small SS2 dataset and ideally at least one larger one so that we can test scale a bit more.

@hannes-ucsc
Collaborator Author

Could you link or name the specific projects?

@kbergin
Collaborator

kbergin commented May 10, 2021

This would be a good 10X project: 559bb888-7829-41f2-ace5-2c05c7eb81e9
This one for SS2 is already in dev and it's the one we've recently produced analysis results for: 8c3c290d-dfff-4553-8868-54ce45f4ba7f

We may want to keep the existing analysis results in dev for that SS2 project, though we may end up replacing them with our second iteration that should fix the project matrices that didn't index properly. Lmk what you think.

@hannes-ucsc hannes-ucsc removed the orange [process] Done by the Azul team label May 24, 2021
@github-actions github-actions bot added the orange [process] Done by the Azul team label May 24, 2021
@hannes-ucsc hannes-ucsc removed the orange [process] Done by the Azul team label May 24, 2021
@jessebrennan
Contributor

We would also like 5b5f05b7-2482-468d-b76d-8f68c04a7a47 (Substantia_nigra_and_locus_coeruleus) in order to validate our solution to DataBiosphere/azul#3095.

@hannes-ucsc
Collaborator Author

hannes-ucsc commented Jun 10, 2021

And https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6 since it's the only project with organically described CGMs.

@hannes-ucsc
Collaborator Author

The challenge is now to translate this to a list of staging areas. If that list ends up including a DSS adapter's staging area we will need to decide if we want to import that as is, thereby increasing the size of the new dataset, or if we want to create a stripped down copy of the staging area that only includes the mentioned projects. Another challenge is to match the projects between primary staging area and CGM staging areas.

@aherbst-broad
Contributor

Here are the staging areas:

@hannes-ucsc
Collaborator Author

hannes-ucsc commented Jun 24, 2021

> Here are the staging areas:
>
> * From prod:
>   * @kbergin (10x): https://data.humancellatlas.org/explore/projects/559bb888-7829-41f2-ace5-2c05c7eb81e9?catalog=dcp6
>     * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/559bb888-7829-41f2-ace5-2c05c7eb81e9
>     * CGMs
>       * gs://broad-dsp-monster-hca-prod-ucsc-storage/cgm_dcp2ebi/data/liver_immune_prohect_annotations.txt
>       * gs://broad-dsp-monster-hca-prod-ucsc-storage/cgm_dcp2ebi/data/normalised_expression_matrix.h5

What's the plan on only importing the relevant CGM data and metadata files? Will you create a stripped down SA for these?

We need to dig up the analysis SA(s) for these. Not sure if there is only one or more than one.

>   * @hannes-ucsc (organic CGMs): https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6
>     * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/b963bd4b-4bc1-4404-8425-69d74bc636b8
>     * CGMs
>       * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/b963bd4b-4bc1-4404-8425-69d74bc636b8/data/annotation_200112.csv

Since these are organic CGMs that are in the same SA as the rest of the project, nothing special needs to be done here.

>       * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/b963bd4b-4bc1-4404-8425-69d74bc636b8/data/covid_portal.h5ad
>   * @jessebrennan (large analysis subgraph): https://data.humancellatlas.org/explore/projects/5b5f05b7-2482-468d-b76d-8f68c04a7a47?catalog=dcp6

We need to dig up the analysis SA(s) for these. Not sure if there is only one or more than one.

>     * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/5b5f05b7-2482-468d-b76d-8f68c04a7a47
>     * No CGMs
>
> * From dev:
>   * @kbergin (SS2): https://dev.singlecell.gi.ucsc.edu/explore/projects/8c3c290d-dfff-4553-8868-54ce45f4ba7f

We need to dig up the analysis SA(s) for these. Not sure if there is only one or more than one.

>     * gs://broad-dsp-monster-hca-prod-ucsc-storage/prod/no-analysis/metadata/project
>   * @hannes-ucsc ("Lattice") https://dev.singlecell.gi.ucsc.edu/explore/projects/f0f89c14-7460-4bab-9d42-22228a91f185
>     * gs://broad-dsp-monster-hca-dev-lattice/staging/f0f89c14-7460-4bab-9d42-22228a91f185

@aherbst-broad
Contributor

@hannes-ucsc re: your first comment, yes, we will be creating a stripped down SA.
Re: the analysis files, good catch, I will attempt to dig them up.

@hannes-ucsc
Collaborator Author

Cool. Let me know if you'd like us to help with any of that.

@aherbst-broad
Contributor

Thanks @hannes-ucsc .

For https://dev.singlecell.gi.ucsc.edu/explore/projects/8c3c290d-dfff-4553-8868-54ce45f4ba7f do you have a mechanism for finding the related analysis files? I have been digging them up ad-hoc for the other projects, but the 6800 links files for that project make that impractical and will require a bit of automation on my part to find them.

Re: the lattice data for f0f89c14-7460-4bab-9d42-22228a91f185, I note that the staging area contains a single /data with no associated /metadata, /links or /descriptors directory. My hunch is that those sub-folders got removed at some point between our dev testing and now. Is there another potential project that would fit the bill? If not, we can likely reconstruct out of bigquery or potentially ask lattice to re-stage, but that will take a bit of work as well.
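A check like the one described above (a staging area that has /data but no /metadata, /links or /descriptors) could be sketched as follows. This is only an illustration: `missing_dirs` is a hypothetical helper, and it assumes the caller already has a flat listing of object paths from whatever bucket listing tool is at hand.

```python
from typing import Iterable, Set

# Top-level directories of a staging area, as named in the comment above.
EXPECTED_DIRS = ("data", "metadata", "links", "descriptors")


def missing_dirs(staging_area: str, object_paths: Iterable[str]) -> Set[str]:
    """Return the expected top-level directories that contain no objects."""
    prefix = staging_area.rstrip("/") + "/"
    seen = set()
    for path in object_paths:
        if not path.startswith(prefix):
            continue  # object lives outside this staging area
        top, sep, rest = path[len(prefix):].partition("/")
        if sep and rest:
            # Only count a directory if at least one object sits below it
            seen.add(top)
    return set(EXPECTED_DIRS) - seen
```

For a listing that only contains objects under `data/`, the result would be the three other directory names, which matches the state reported for the Lattice staging area.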

These are the related analysis files for the other 3 projects:

@theathorn theathorn added the orange [process] Done by the Azul team label Jun 29, 2021
@theathorn theathorn added the spike:1 [process] Spike estimate of one point label Jun 29, 2021
@hannes-ucsc
Collaborator Author

> For https://dev.singlecell.gi.ucsc.edu/explore/projects/8c3c290d-dfff-4553-8868-54ce45f4ba7f do you have a mechanism for finding the related analysis files? I have been digging them up ad-hoc for the other projects, but the 6800 links files for that project make that impractical and will require a bit of automation on my part to find them.

We usually use a BQ query like this one:

-- List the file names of the analysis files produced in the given project
select
    json_extract(analysis_file.content, "$.file_core.file_name") as file_name
from `broad-jade-dev-data.hca_dev_20201203___20210524_lattice.links` as links
-- One row per link in each links document; keep only process links
join unnest(json_extract_array(links.content, '$.links')) as content_links
    on json_extract_scalar(content_links, '$.link_type') = 'process_link'
-- One row per process output; keep only analysis files
join unnest(json_extract_array(content_links, '$.outputs')) as outputs
    on json_extract_scalar(outputs, '$.output_type') = 'analysis_file'
-- Look up the metadata of each analysis file by its entity ID
join `broad-jade-dev-data.hca_dev_20201203___20210524_lattice.analysis_file` as analysis_file
    on json_extract_scalar(outputs, '$.output_id') = analysis_file.analysis_file_id
where project_id = '8c3c290d-dfff-4553-8868-54ce45f4ba7f'
-- Remove the limit to list every analysis file in the project
limit 100
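Since the query is pinned to one snapshot and one project, a small wrapper could make it reusable for the other projects. The following is a minimal sketch, not an existing Azul or Monster utility: `analysis_files_query` is a hypothetical name, the SQL is the query from this comment without the `limit`, and the name check is an assumption needed because table names cannot be passed as bound query parameters.

```python
import re
import uuid

# Conservative pattern for BigQuery project and dataset names (assumption)
_NAME = re.compile(r"[A-Za-z0-9_-]+")


def analysis_files_query(bq_project: str, dataset: str, project_id: str) -> str:
    """Build the analysis file query for one HCA project in one dataset."""
    for name in (bq_project, dataset):
        if not _NAME.fullmatch(name):
            raise ValueError(f"Unexpected characters in {name!r}")
    # Raises ValueError unless project_id is a well-formed UUID
    project_id = str(uuid.UUID(project_id))
    source = f"{bq_project}.{dataset}"
    return f"""
select
    json_extract(analysis_file.content, "$.file_core.file_name") as file_name
from `{source}.links` as links
join unnest(json_extract_array(links.content, '$.links')) as content_links
    on json_extract_scalar(content_links, '$.link_type') = 'process_link'
join unnest(json_extract_array(content_links, '$.outputs')) as outputs
    on json_extract_scalar(outputs, '$.output_type') = 'analysis_file'
join `{source}.analysis_file` as analysis_file
    on json_extract_scalar(outputs, '$.output_id') = analysis_file.analysis_file_id
where project_id = '{project_id}'
"""
```

The returned string can be pasted into the BigQuery console or handed to any BigQuery client.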

> Re: the lattice data for f0f89c14-7460-4bab-9d42-22228a91f185, I note that the staging area contains a single /data with no associated /metadata, /links or /descriptors directory. My hunch is that those sub-folders got removed at some point between our dev testing and now. Is there another potential project that would fit the bill? If not, we can likely reconstruct out of bigquery or potentially ask lattice to re-stage, but that will take a bit of work as well.

@theathorn can we ask the Stanford folks to repopulate the staging area? I have the feeling this is a temporary condition.

> These are the related analysis files for the other 3 projects:
>
> * https://data.humancellatlas.org/explore/projects/559bb888-7829-41f2-ace5-2c05c7eb81e9?catalog=dcp6
>   * Analysis files:
>     * gs://fc-ece18604-add9-45db-9bc2-78c77e471f71/staging/data/liver-immune-cells-human-blood-10XV2.loom
>     * gs://fc-58886c47-be1c-41f7-8d8b-df18aef417cc/staging/data/liver-immune-cells-human-liver-10XV2.loom
>
> * https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6
>   * Analysis files:
>     * None
>
> * https://data.humancellatlas.org/explore/projects/5b5f05b7-2482-468d-b76d-8f68c04a7a47?catalog=dcp6
>   * Analysis files:
>     * gs://fc-239060e3-44ef-4bfb-93f5-b388950be17e/staging/data/substantia-negra-human-brain-10XV2-nuclei.loom

There are a few more metadata entities that are associated with each analysis file. The unit of work should be subgraphs, not data files or individual entities.
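To illustrate what treating the subgraph as the unit of work could look like, here is a minimal sketch that collects every entity referenced by one links document rather than only the analysis file. The `links`, `link_type`, `outputs`, `output_type` and `output_id` properties appear in the query above. The `process_type`, `process_id`, `inputs` and `protocols` properties are assumed from the HCA links schema and should be verified against it. `subgraph_entities` is a hypothetical helper, not part of Azul.

```python
import json
from typing import Dict, Set


def subgraph_entities(links_content: str) -> Dict[str, Set[str]]:
    """Group by entity type the IDs referenced by one links document.

    Returns an empty dict if no process in the document outputs an
    analysis file.
    """
    entities: Dict[str, Set[str]] = {}
    has_analysis_file = False
    for link in json.loads(links_content)["links"]:
        if link["link_type"] != "process_link":
            continue  # other link types are ignored in this sketch
        entities.setdefault(link["process_type"], set()).add(link["process_id"])
        for kind in ("input", "output", "protocol"):
            for ref in link.get(kind + "s", []):
                entity_type = ref[kind + "_type"]
                entities.setdefault(entity_type, set()).add(ref[kind + "_id"])
                if kind == "output" and entity_type == "analysis_file":
                    has_analysis_file = True
    return entities if has_analysis_file else {}
```

The result lists the processes, protocols and input files that would have to travel with each analysis file. The links document itself belongs to the subgraph as well.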

Let me ask this: Since we're ok with including the dcp1-migrated data (which significantly increases the size of the dataset, both in # of rows and volume of data) wouldn't it be easier to just copy prod to dev so to speak?

@hannes-ucsc hannes-ucsc removed their assignment Jun 30, 2021
@aherbst-broad
Contributor

aherbst-broad commented Jul 1, 2021

@hannes-ucsc , thanks.

> There are a few more metadata entities that are associated with each analysis file. The unit of work should be subgraphs, not data files or individual entities.

Ah, thanks, we'll take that into account and update the list.

> Let me ask this: Since we're ok with including the dcp1-migrated data (which significantly increases the size of the dataset, both in # of rows and volume of data) wouldn't it be easier to just copy prod to dev so to speak?

We do not have that ability. However, we will need to build such a thing for the TDR prod migration that we'll be speaking about.

@hannes-ucsc
Collaborator Author

hannes-ucsc commented Jul 6, 2021

The team discussed this today after the DCP demo. The consensus is that Monster will implement tooling to selectively copy the (meta)data of individual projects between two TDR instances. This would eliminate the need to retain staging areas after they were imported, so that they might be imported again, a need we never actually specified. The ability to only copy selected projects would address the concern of dev being a costly 100% copy of prod.

@theathorn theathorn added stub [process] Placeholder for a ticket to be resolved by another team data [subject] Data or metadata [use of this label is uncommon] task [type] Resolution requires engineering action other than code changes labels Jul 6, 2021
@hannes-ucsc
Collaborator Author

For demo, show diversity of sources in service responses.

@hannes-ucsc hannes-ucsc added DCP demo no demo [process] Not to be demonstrated at the end of the sprint labels Oct 3, 2021
hannes-ucsc added a commit to DataBiosphere/azul that referenced this issue Oct 4, 2021
@theathorn theathorn added the demoed [process] Successfully demonstrated to team label Oct 12, 2021
@hannes-ucsc hannes-ucsc removed the no demo [process] Not to be demonstrated at the end of the sprint label Jul 31, 2024