Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Samples don't reflect downstream entities from stitched subgraphs #3095

Closed
dsotirho-ucsc opened this issue Jun 1, 2021 · 13 comments
Closed
Assignees
Labels
bug [type] A defect preventing use of the system as specified code [subject] Production code demo [process] To be demonstrated at the end of the sprint demoed [process] Successfully demonstrated to team orange [process] Done by the Azul team spike:3 [process] Spike estimate of three points

Comments

@dsotirho-ucsc
Copy link
Contributor

dsotirho-ucsc commented Jun 1, 2021

For example there is no bam termFacet entry in Samples, because the bam files are part of the stitched subgraph while the sample is part of the primary subgraph.

[Edit by Jesse to show reproductions and both prod and dev]

dev

  1. Filtering by file format shows projects with BAMs
    https://dev.singlecell.gi.ucsc.edu/explore/projects?filter=%5B%7B%22facetName%22:%22fileFormat%22,%22terms%22:%5B%22bam%22%5D%7D%5D
    projects_with_bams_dev

  2. But selecting the samples tab shows no samples for this project and BAMs
    https://dev.singlecell.gi.ucsc.edu/explore/samples?filter=%5B%7B%22facetName%22:%22fileFormat%22,%22terms%22:%5B%22bam%22%5D%7D,%7B%22facetName%22:%22projectId%22,%22terms%22:%5B%224a95101c-9ffc-4f30-a809-f04518a23803%22%5D%7D%5D
    no_samples_dev

  3. Even though if we select this project with out filtering for BAMs we do have samples
    https://dev.singlecell.gi.ucsc.edu/explore/samples?filter=%5B%7B%22facetName%22:%22projectId%22,%22terms%22:%5B%224a95101c-9ffc-4f30-a809-f04518a23803%22%5D%7D%5D
    samples_exist_dev


prod

  1. Filtering by file format shows projects with BAMs
    https://data.humancellatlas.org/explore/projects?filter=%5B%7B%22facetName%22:%22fileFormat%22,%22terms%22:%5B%22bam%22%5D%7D%5D
    projects_with_bams_prod

  2. But selecting the samples tab shows no samples for this project and BAMs
    https://data.humancellatlas.org/explore/samples?filter=%5B%7B%22facetName%22:%22fileFormat%22,%22terms%22:%5B%22bam%22%5D%7D,%7B%22facetName%22:%22projectId%22,%22terms%22:%5B%22996120f9-e84f-409f-a01e-732ab58ca8b9%22%5D%7D%5D
    no_samples_prod

  3. Even though if we select this project with out filtering for BAMs we do have samples
    https://data.humancellatlas.org/explore/samples?filter=%5B%7B%22facetName%22:%22projectId%22,%22terms%22:%5B%22996120f9-e84f-409f-a01e-732ab58ca8b9%22%5D%7D%5D
    samples_exist_prod

@dsotirho-ucsc dsotirho-ucsc added the orange [process] Done by the Azul team label Jun 1, 2021
@dsotirho-ucsc
Copy link
Contributor Author

dsotirho-ucsc commented Jun 1, 2021

Possibly a dup of #2909

[edit: not a dupe of that (hannes)]

@melainalegaspi melainalegaspi added bug [type] A defect preventing use of the system as specified code [subject] Production code labels Jun 2, 2021
@melainalegaspi melainalegaspi added the spike:3 [process] Spike estimate of three points label Jun 2, 2021
@melainalegaspi
Copy link

@jessebrennan spike to come up with estimate and a more concrete reproduction.

@melainalegaspi
Copy link

@hannes-ucsc: "@jessebrennan to provide reproduction for both prod and dev."

@jessebrennan
Copy link
Contributor

After studying and familiarizing myself with the code, I think the issue title is accurate. The motivation for why we don't include entities from stitched subgraphs is explained here

I'm not sure what the potential draw-backs are, but an obvious solution to me would be to allow traversal of stitched subgraphs when compiling sample entities.

@jessebrennan
Copy link
Contributor

@hannes-ucsc: "Add screenshots to prove there are samples for these projects when BAMs are not filtered."

@jessebrennan
Copy link
Contributor

After talking with @hannes-ucsc, we want to try allowing sample entities from stitched subgraphs:

Index: src/azul/plugins/metadata/hca/transform.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/plugins/metadata/hca/transform.py b/src/azul/plugins/metadata/hca/transform.py
--- a/src/azul/plugins/metadata/hca/transform.py	(revision 57732bbf42a9c678b0fac430453949e5a2ca1a31)
+++ b/src/azul/plugins/metadata/hca/transform.py	(date 1623181565508)
@@ -1333,7 +1333,7 @@
         project = self._get_project(self.api_bundle)
         samples: MutableMapping[str, Sample] = dict()
         for file in api.not_stitched(self.api_bundle.files.values()):
-            self._find_ancestor_samples(file, samples, stitched=False)
+            self._find_ancestor_samples(file, samples, stitched=True)
         for sample in samples.values():
             assert not sample.is_stitched, sample
             visitor = TransformerVisitor()

but first want to test that this indexes properly with the projects that caused #2868.

To do this, we'll add one of these projects to dev HumanCellAtlas/dcp2#17 (comment).

@hannes-ucsc
Copy link
Member

More like

Index: src/azul/plugins/metadata/hca/transform.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/plugins/metadata/hca/transform.py b/src/azul/plugins/metadata/hca/transform.py
--- a/src/azul/plugins/metadata/hca/transform.py	(revision 3968bc5c6cd29142dbaa62a0f5d23fc90222d559)
+++ b/src/azul/plugins/metadata/hca/transform.py	(date 1629753272419)
@@ -443,9 +443,7 @@
 
     def _find_ancestor_samples(self,
                                entity: api.LinkedEntity,
-                               samples: MutableMapping[str, Sample],
-                               *,
-                               stitched=True,
+                               samples: MutableMapping[str, Sample]
                                ):
         """
         Populate the `samples` argument with the sample ancestors of the given
@@ -461,11 +459,10 @@
         :
         """
         if isinstance(entity, sample_types):
-            if stitched or not entity.is_stitched:
-                samples[str(entity.document_id)] = entity
+            samples[str(entity.document_id)] = entity
         else:
             for parent in entity.parents.values():
-                self._find_ancestor_samples(parent, samples, stitched=stitched)
+                self._find_ancestor_samples(parent, samples)
 
     def _visit_file(self, file):
         visitor = TransformerVisitor()
@@ -1378,7 +1375,7 @@
         project = self._get_project(self.api_bundle)
         samples: MutableMapping[str, Sample] = dict()
         for file in api.not_stitched(self.api_bundle.files.values()):
-            self._find_ancestor_samples(file, samples, stitched=False)
+            self._find_ancestor_samples(file, samples)
         for sample in samples.values():
             assert not sample.is_stitched, sample
             visitor = TransformerVisitor()

Just like for #3281, it is possible that removing the condition reintroduces hitting size and time limits of the sort that caused #2868 and we'll need to test that against prod before we promote this change to prod. Assignee to come up with strategy on how to do that and describe the strategy here. I'd love to test on dev but we can't currently do that because dev is stale and doesn't have the analysis bundles that would cause those types of issues.

@hannes-ucsc
Copy link
Member

Consider solving this and #3281 in the same PR.

@nadove-ucsc
Copy link
Contributor

Assignee to come up with strategy on how to do that and describe the strategy here.

  1. Can all the subgraphs that timed out in Timeout/OOM for some subgraphs in snapshot hca_prod_20201120_dcp2___20210302 #2868 (they're all in the the snapshot currently indexed in the DCP8 catalog on prod). This will require operator permissions. Conveniently, I am currently the operator.
  2. Write a simple Repository Plugin subclass that hard-codes those bundles into its list_bundles and fetch_bundle methods, thus allowing them to be indexed without authorizing the service account to access any prod resources.
  3. Index those bundles into a new catalog on my personal deployment using the hard-coded plugin.

@nadove-ucsc
Copy link
Contributor

Alternatively: temporarily edit the TDR plugin to use my personal credentials instead of the service account credentials, and then just index the bundles in my personal without writing a new hard-coding plugin. Note that I had to implement this hack anyway in order to can the bundles.

@nadove-ucsc
Copy link
Contributor

Much less convoluted solution wold just be index the canned bundles during the integration test. However we need to make sure the ElasticSearch instance sizes matches prod.

@hannes-ucsc
Copy link
Member

Sounds good. #3339 will increase the sandbox ES instance size to the same as prod.

@hannes-ucsc
Copy link
Member

For demo, attempt to reproduce.

@theathorn theathorn added the demo [process] To be demonstrated at the end of the sprint label Dec 2, 2021
@melainalegaspi melainalegaspi added the demoed [process] Successfully demonstrated to team label Jan 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug [type] A defect preventing use of the system as specified code [subject] Production code demo [process] To be demonstrated at the end of the sprint demoed [process] Successfully demonstrated to team orange [process] Done by the Azul team spike:3 [process] Spike estimate of three points
Projects
None yet
Development

No branches or pull requests

6 participants