TDR `dev` dataset is stale #17

hannes-ucsc · 2021-04-23T17:22:17Z

Rather than patching

DataBiosphere/azul#2873
DataBiosphere/azul#2870

we think that it's time to repopulate the dev dataset from scratch with a subset of projects from the dcp4 catalog used in prod. This would 1) reduce the size of the dev catalog and 2) make sure it is more representative of the current production systems. For example, the current dev snapshot does not have any intact analysis subgraphs or DCP/2-generated matrices.

hannes-ucsc · 2021-04-23T17:23:49Z

Step one is to settle on the subset of projects, or more specifically the staging areas, from which we should populate the dataset.

kbergin · 2021-04-23T18:47:38Z

Sounds good to me. If you want to just choose one 10X project that has analysis results and project matrices, that would work. Then add whatever we chose previously as a small SS2 dataset, and ideally at least one larger one so that we can test scale a bit more.

hannes-ucsc · 2021-04-23T20:04:16Z

Could you link or name the specific projects?

kbergin · 2021-05-10T19:11:48Z

This would be a good 10X project: 559bb888-7829-41f2-ace5-2c05c7eb81e9
This one for SS2 is already in dev and it's the one we've recently produced analysis results for: 8c3c290d-dfff-4553-8868-54ce45f4ba7f

We may want to keep the existing analysis results in dev for that SS2 project, though we may end up replacing them with our second iteration that should fix the project matrices that didn't index properly. Lmk what you think.

jessebrennan · 2021-06-08T20:04:44Z

We would also like 5b5f05b7-2482-468d-b76d-8f68c04a7a47 (Substantia_nigra_and_locus_coeruleus) in order to validate our solution to DataBiosphere/azul#3095.

hannes-ucsc · 2021-06-10T02:01:49Z

And https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6 since it's the only project with organically described CGMs.

hannes-ucsc · 2021-06-10T19:49:29Z

A consolidated list with all the projects mentioned above:

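From prod:
- @kbergin (10x): https://data.humancellatlas.org/explore/projects/559bb888-7829-41f2-ace5-2c05c7eb81e9?catalog=dcp6
- @hannes-ucsc (organic CGMs): https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6
- @jessebrennan (large analysis subgraph): https://data.humancellatlas.org/explore/projects/5b5f05b7-2482-468d-b76d-8f68c04a7a47?catalog=dcp6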
From dev:
- @kbergin (SS2): https://dev.singlecell.gi.ucsc.edu/explore/projects/8c3c290d-dfff-4553-8868-54ce45f4ba7f
- @hannes-ucsc ("Lattice") https://dev.singlecell.gi.ucsc.edu/explore/projects/f0f89c14-7460-4bab-9d42-22228a91f185

hannes-ucsc · 2021-06-10T19:53:28Z

The challenge is now to translate this to a list of staging areas. If that list ends up including a DSS adapter's staging area we will need to decide if we want to import that as is, thereby increasing the size of the new dataset, or if we want to create a stripped down copy of the staging area that only includes the mentioned projects. Another challenge is to match the projects between primary staging area and CGM staging areas.

aherbst-broad · 2021-06-24T17:25:50Z

Here are the staging areas:

From prod:
- @kbergin (10x): https://data.humancellatlas.org/explore/projects/559bb888-7829-41f2-ace5-2c05c7eb81e9?catalog=dcp6
  - gs://broad-dsp-monster-hca-prod-ebi-storage/prod/559bb888-7829-41f2-ace5-2c05c7eb81e9
  - CGMs
    - gs://broad-dsp-monster-hca-prod-ucsc-storage/cgm_dcp2ebi/data/liver_immune_prohect_annotations.txt
    - gs://broad-dsp-monster-hca-prod-ucsc-storage/cgm_dcp2ebi/data/normalised_expression_matrix.h5
- @hannes-ucsc (organic CGMs): https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6
  - gs://broad-dsp-monster-hca-prod-ebi-storage/prod/b963bd4b-4bc1-4404-8425-69d74bc636b8
  - CGMs
    - gs://broad-dsp-monster-hca-prod-ebi-storage/prod/b963bd4b-4bc1-4404-8425-69d74bc636b8/data/annotation_200112.csv
    - gs://broad-dsp-monster-hca-prod-ebi-storage/prod/b963bd4b-4bc1-4404-8425-69d74bc636b8/data/covid_portal.h5ad
- @jessebrennan (large analysis subgraph): https://data.humancellatlas.org/explore/projects/5b5f05b7-2482-468d-b76d-8f68c04a7a47?catalog=dcp6
  - gs://broad-dsp-monster-hca-prod-ebi-storage/prod/5b5f05b7-2482-468d-b76d-8f68c04a7a47
  - No CGMs
From dev:
- @kbergin (SS2): https://dev.singlecell.gi.ucsc.edu/explore/projects/8c3c290d-dfff-4553-8868-54ce45f4ba7f
  - gs://broad-dsp-monster-hca-prod-ucsc-storage/prod/no-analysis/metadata/project
- @hannes-ucsc ("Lattice") https://dev.singlecell.gi.ucsc.edu/explore/projects/f0f89c14-7460-4bab-9d42-22228a91f185
  - gs://broad-dsp-monster-hca-dev-lattice/staging/f0f89c14-7460-4bab-9d42-22228a91f185

hannes-ucsc · 2021-06-24T17:41:21Z

> Here are the staging areas:
>
> * From prod:
>   * @kbergin (10x): https://data.humancellatlas.org/explore/projects/559bb888-7829-41f2-ace5-2c05c7eb81e9?catalog=dcp6
>     * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/559bb888-7829-41f2-ace5-2c05c7eb81e9
>     * CGMs
>       * gs://broad-dsp-monster-hca-prod-ucsc-storage/cgm_dcp2ebi/data/liver_immune_prohect_annotations.txt
>       * gs://broad-dsp-monster-hca-prod-ucsc-storage/cgm_dcp2ebi/data/normalised_expression_matrix.h5

What's the plan on only importing the relevant CGM data and metadata files? Will you create a stripped down SA for these?

We need to dig up the analysis SA(s) for these. Not sure if there is only one or more than one.

>   * @hannes-ucsc (organic CGMs): https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6
>     * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/b963bd4b-4bc1-4404-8425-69d74bc636b8
>     * CGMs
>       * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/b963bd4b-4bc1-4404-8425-69d74bc636b8/data/annotation_200112.csv
>       * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/b963bd4b-4bc1-4404-8425-69d74bc636b8/data/covid_portal.h5ad

Since these are organic CGMs that are in the same SA as the rest of the project, nothing special needs to be done here.

>   * @jessebrennan (large analysis subgraph): https://data.humancellatlas.org/explore/projects/5b5f05b7-2482-468d-b76d-8f68c04a7a47?catalog=dcp6
>     * gs://broad-dsp-monster-hca-prod-ebi-storage/prod/5b5f05b7-2482-468d-b76d-8f68c04a7a47
>     * No CGMs

We need to dig up the analysis SA(s) for these. Not sure if there is only one or more than one.

> * From dev:
>   * @kbergin (SS2): https://dev.singlecell.gi.ucsc.edu/explore/projects/8c3c290d-dfff-4553-8868-54ce45f4ba7f
>     * gs://broad-dsp-monster-hca-prod-ucsc-storage/prod/no-analysis/metadata/project

We need to dig up the analysis SA(s) for these. Not sure if there is only one or more than one.

>   * @hannes-ucsc ("Lattice") https://dev.singlecell.gi.ucsc.edu/explore/projects/f0f89c14-7460-4bab-9d42-22228a91f185
>     * gs://broad-dsp-monster-hca-dev-lattice/staging/f0f89c14-7460-4bab-9d42-22228a91f185

aherbst-broad · 2021-06-24T18:31:10Z

@hannes-ucsc re: your first comment, yes, we will be creating a stripped down SA.
Re: the analysis files, good catch, I will attempt to dig them up.

hannes-ucsc · 2021-06-25T00:33:06Z

Cool. Let me know if you'd like us to help with any of that.

aherbst-broad · 2021-06-28T14:28:23Z

Thanks @hannes-ucsc .

For https://dev.singlecell.gi.ucsc.edu/explore/projects/8c3c290d-dfff-4553-8868-54ce45f4ba7f do you have a mechanism for finding the related analysis files? I have been digging them up ad-hoc for the other projects, but the 6800 links files for that project make that impractical and will require a bit of automation on my part to find them.

Re: the Lattice data for f0f89c14-7460-4bab-9d42-22228a91f185, I note that the staging area contains a single /data directory with no associated /metadata, /links or /descriptors directory. My hunch is that those sub-folders got removed at some point between our dev testing and now. Is there another potential project that would fit the bill? If not, we can likely reconstruct the staging area from BigQuery or potentially ask Lattice to re-stage, but that will take a bit of work as well.
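A check like the one described above can be sketched in a few lines. This is only an illustration: `missing_dirs` and `EXPECTED_DIRS` are hypothetical names, the object names in the usage example are made up, and treating all four sub-directories as required is an assumption based on the ones named in this comment. The object listing itself would come from whatever bucket listing tool is at hand.

```python
from typing import Iterable

# Sub-directories named in this thread. Treating all four as required
# is an assumption made for this sketch.
EXPECTED_DIRS = ("data", "metadata", "links", "descriptors")


def missing_dirs(object_names: Iterable[str], staging_prefix: str) -> list[str]:
    """Return the expected sub-directories that have no object under `staging_prefix`."""
    prefix = staging_prefix.rstrip("/") + "/"
    present = set()
    for name in object_names:
        if name.startswith(prefix):
            rest = name[len(prefix):]
            # Only a name with something below its first path component
            # counts as evidence that the sub-directory exists.
            first, sep, _ = rest.partition("/")
            if sep:
                present.add(first)
    return [d for d in EXPECTED_DIRS if d not in present]


# Hypothetical listing of a staging area that only has data files.
names = ["staging/example-project/data/matrix.h5ad"]
print(missing_dirs(names, "staging/example-project"))
# prints ['metadata', 'links', 'descriptors']
```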

These are the related analysis files for the other 3 projects:

https://data.humancellatlas.org/explore/projects/559bb888-7829-41f2-ace5-2c05c7eb81e9?catalog=dcp6
- Analysis files:
  * gs://fc-ece18604-add9-45db-9bc2-78c77e471f71/staging/data/liver-immune-cells-human-blood-10XV2.loom
  * gs://fc-58886c47-be1c-41f7-8d8b-df18aef417cc/staging/data/liver-immune-cells-human-liver-10XV2.loom
https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6
- Analysis files:
  * None
https://data.humancellatlas.org/explore/projects/5b5f05b7-2482-468d-b76d-8f68c04a7a47?catalog=dcp6
- Analysis files:
  * gs://fc-239060e3-44ef-4bfb-93f5-b388950be17e/staging/data/substantia-negra-human-brain-10XV2-nuclei.loom

hannes-ucsc · 2021-06-30T21:13:49Z

> For https://dev.singlecell.gi.ucsc.edu/explore/projects/8c3c290d-dfff-4553-8868-54ce45f4ba7f do you have a mechanism for finding the related analysis files? I have been digging them up ad-hoc for the other projects, but the 6800 links files for that project make that impractical and will require a bit of automation on my part to find them.

We usually use a BQ query like this one:

-- List the names of the analysis files produced by a project's subgraphs
select
    -- json_extract_scalar returns the bare string rather than a JSON-quoted one
    json_extract_scalar(analysis_file.content, '$.file_core.file_name') as file_name
from `broad-jade-dev-data.hca_dev_20201203___20210524_lattice.links` as links
join unnest(json_extract_array(links.content, '$.links')) as content_links
    on json_extract_scalar(content_links, '$.link_type') = 'process_link'
join unnest(json_extract_array(content_links, '$.outputs')) as outputs
    on json_extract_scalar(outputs, '$.output_type') = 'analysis_file'
join `broad-jade-dev-data.hca_dev_20201203___20210524_lattice.analysis_file` as analysis_file
    on json_extract_scalar(outputs, '$.output_id') = analysis_file.analysis_file_id
where links.project_id = '8c3c290d-dfff-4553-8868-54ce45f4ba7f'
limit 100
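For anyone working from the links files in a staging area rather than from BigQuery, the same traversal can be sketched in Python. This is a rough equivalent of the joins above, not the tooling either team uses: `analysis_file_ids` is a hypothetical helper, the sample document is made up, and only the JSON fields that the query itself references (`links`, `link_type`, `outputs`, `output_type`, `output_id`) are relied on.

```python
import json


def analysis_file_ids(links_content: str) -> list[str]:
    """Collect the IDs of analysis file outputs from one links document."""
    doc = json.loads(links_content)
    ids = []
    for link in doc.get("links", []):
        # Mirrors the first join condition: only process links are considered.
        if link.get("link_type") != "process_link":
            continue
        for output in link.get("outputs", []):
            # Mirrors the second join condition: keep analysis file outputs.
            if output.get("output_type") == "analysis_file":
                ids.append(output["output_id"])
    return ids


# Made-up links document with one analysis file and one unrelated output.
sample = json.dumps({
    "links": [
        {
            "link_type": "process_link",
            "outputs": [
                {"output_type": "analysis_file", "output_id": "file-1"},
                {"output_type": "sequence_file", "output_id": "file-2"},
            ],
        }
    ]
})
print(analysis_file_ids(sample))  # prints ['file-1']
```

The returned IDs would still need to be matched against the analysis file metadata to obtain file names, which is what the final join in the query does.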

> Re: the lattice data for f0f89c14-7460-4bab-9d42-22228a91f185, I note that the staging area contains a single /data with no associated /metadata, /links or /descriptors directory. My hunch is that those sub-folders got removed at some point between our dev testing and now. Is there another potential project that would fit the bill? If not, we can likely reconstruct out of bigquery or potentially ask lattice to re-stage, but that will take a bit of work as well.

@theathorn can we ask the Stanford folks to repopulate the staging area? I have the feeling this is a temporary condition.

> These are the related analysis files for the other 3 projects:
>
> * https://data.humancellatlas.org/explore/projects/559bb888-7829-41f2-ace5-2c05c7eb81e9?catalog=dcp6
>   * Analysis files:
>     * gs://fc-ece18604-add9-45db-9bc2-78c77e471f71/staging/data/liver-immune-cells-human-blood-10XV2.loom
>     * gs://fc-58886c47-be1c-41f7-8d8b-df18aef417cc/staging/data/liver-immune-cells-human-liver-10XV2.loom
> * https://data.humancellatlas.org/explore/projects/b963bd4b-4bc1-4404-8425-69d74bc636b8?catalog=dcp6
>   * Analysis files:
>     * None
> * https://data.humancellatlas.org/explore/projects/5b5f05b7-2482-468d-b76d-8f68c04a7a47?catalog=dcp6
>   * Analysis files:
>     * gs://fc-239060e3-44ef-4bfb-93f5-b388950be17e/staging/data/substantia-negra-human-brain-10XV2-nuclei.loom

There are a few more metadata entities that are associated with each analysis file. The unit of work should be subgraphs, not data files or individual entities.
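To illustrate what treating the subgraph as the unit of work could look like, here is a minimal sketch that gathers every entity referenced by the process links of one links document. The field names for processes, inputs and protocols (`process_type`, `process_id`, `input_type`, `input_id`, `protocol_type`, `protocol_id`) are assumptions that follow the same naming pattern as the output fields used in the query above, and `subgraph_entities` is a hypothetical helper, not existing tooling.

```python
import json
from collections import defaultdict


def subgraph_entities(links_content: str) -> dict[str, set[str]]:
    """Map each entity type to the IDs referenced by one links document."""
    doc = json.loads(links_content)
    entities: dict[str, set[str]] = defaultdict(set)
    for link in doc.get("links", []):
        if link.get("link_type") != "process_link":
            continue
        # Assumed fields: the process that connects the inputs to the outputs.
        if "process_id" in link:
            entities[link.get("process_type", "process")].add(link["process_id"])
        # Assumed fields: inputs and protocols are taken to follow the same
        # `<kind>_type` / `<kind>_id` pattern as the outputs.
        for kind in ("input", "output", "protocol"):
            for ref in link.get(kind + "s", []):
                entity_type = ref.get(f"{kind}_type")
                entity_id = ref.get(f"{kind}_id")
                if entity_type and entity_id:
                    entities[entity_type].add(entity_id)
    return dict(entities)
```

Copying a subgraph would then mean copying every entity in the returned mapping, together with the links document itself, rather than only the analysis data files.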

Let me ask this: Since we're ok with including the dcp1-migrated data (which significantly increases the size of the dataset, both in # of rows and volume of data) wouldn't it be easier to just copy prod to dev so to speak?

aherbst-broad · 2021-07-01T12:48:13Z

@hannes-ucsc , thanks.

> There are a few more metadata entities that are associated with each analysis file. The unit of work should be subgraphs, not data files or individual entities.

Ah, thanks, we'll take that into account and update the list.

> Let me ask this: Since we're ok with including the dcp1-migrated data (which significantly increases the size of the dataset, both in # of rows and volume of data) wouldn't it be easier to just copy prod to dev so to speak?

We do not have that ability. However, we will need to build such a thing for the TDR prod migration that we'll be speaking about.

hannes-ucsc · 2021-07-06T16:21:48Z

The team discussed this today after the DCP demo. The consensus is that Monster should implement tooling to selectively copy the (meta)data of individual projects between two TDR instances. This would eliminate the need to retain staging areas after they have been imported just so that they can be imported again, a need we never actually specified. The ability to copy only selected projects would address the concern of dev being a costly 100% copy of prod.

hannes-ucsc · 2021-09-10T20:49:09Z

https://humancellatlas.slack.com/archives/C01360XN04S/p1631200873039200

Regression from d5c935f

hannes-ucsc · 2021-10-03T17:50:19Z

For demo, show diversity of sources in service responses.

github-actions bot added the orange [process] Done by the Azul team label Apr 23, 2021

hannes-ucsc removed the orange [process] Done by the Azul team label May 24, 2021

github-actions bot added the orange [process] Done by the Azul team label May 24, 2021

hannes-ucsc removed the orange [process] Done by the Azul team label May 24, 2021

jessebrennan mentioned this issue Jun 8, 2021

Samples don't reflect downstream entities from stitched subgraphs DataBiosphere/azul#3095

Closed

theathorn added the orange [process] Done by the Azul team label Jun 29, 2021

theathorn assigned hannes-ucsc Jun 29, 2021

theathorn added the spike:1 [process] Spike estimate of one point label Jun 29, 2021

hannes-ucsc removed their assignment Jun 30, 2021

theathorn added stub [process] Placeholder for a ticket to be resovled by another team data [subject] Data or metadata [use of this label is uncommon] task [type] Resolution requires engineering action other than code changes labels Jul 6, 2021

theathorn assigned aherbst-broad Jul 6, 2021

nadove-ucsc pushed a commit to DataBiosphere/azul that referenced this issue Oct 2, 2021

[u r] Fix: TDR dev dataset is stale (HumanCellAtlas/dcp2#17)

dcd6e1b

nadove-ucsc added a commit to DataBiosphere/azul that referenced this issue Oct 2, 2021

[u r] Fix: TDR dev dataset is stale (HumanCellAtlas/dcp2#17, PR #3441)

ff56875

hannes-ucsc added a commit to DataBiosphere/azul that referenced this issue Oct 3, 2021

Really remove lungmap catalogs (HumanCellAtlas/dcp2#17)

14d2200

Regression from d5c935f

hannes-ucsc mentioned this issue Oct 3, 2021

Derive prefix from snapshot size (HumanCellAtlas/dcp2#17) DataBiosphere/azul#3488

Merged

49 tasks

hannes-ucsc added a commit to DataBiosphere/azul that referenced this issue Oct 3, 2021

[r] Derive prefix from snapshot size (HumanCellAtlas/dcp2#17)

2902006

hannes-ucsc added DCP demo no demo [process] Not to be demonstrated at the end of the sprint labels Oct 3, 2021

hannes-ucsc added a commit to DataBiosphere/azul that referenced this issue Oct 3, 2021

[r] Derive prefix from snapshot size (HumanCellAtlas/dcp2#17, PR #3488)

074e0d8

hannes-ucsc mentioned this issue Oct 3, 2021

TDR's GET /repository/v1/snapshots takes too long #46

Closed

hannes-ucsc added a commit to DataBiosphere/azul that referenced this issue Oct 4, 2021

[u] Fix upgrade instructions (HumanCellAtlas/dcp2#17)

3b9cd6f

theathorn added the demoed [process] Successfully demonstrated to team label Oct 12, 2021

theathorn closed this as completed Oct 12, 2021

hannes-ucsc removed the no demo [process] Not to be demonstrated at the end of the sprint label Jul 31, 2024
