-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Samples don't reflect downstream entities from stitched subgraphs #3095
Comments
[edit: not a dupe of that (hannes)] |
@jessebrennan spike to come up with estimate and a more concrete reproduction. |
@hannes-ucsc: "@jessebrennan to provide reproduction for both prod and dev." |
After studying and familiarizing myself with the code, I think the issue title is accurate. The motivation for why we don't include entities from stitched subgraphs is explained here I'm not sure what the potential draw-backs are, but an obvious solution to me would be to allow traversal of stitched subgraphs when compiling sample entities. |
@hannes-ucsc: "Add screenshots to prove there are samples for these projects when BAMs are not filtered." |
After talking with @hannes-ucsc, we want to try allowing sample entities from stitched subgraphs:
but first want to test that this indexes properly with the projects that caused #2868. To do this, we'll add one of these projects to |
More like Index: src/azul/plugins/metadata/hca/transform.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/plugins/metadata/hca/transform.py b/src/azul/plugins/metadata/hca/transform.py
--- a/src/azul/plugins/metadata/hca/transform.py (revision 3968bc5c6cd29142dbaa62a0f5d23fc90222d559)
+++ b/src/azul/plugins/metadata/hca/transform.py (date 1629753272419)
@@ -443,9 +443,7 @@
def _find_ancestor_samples(self,
entity: api.LinkedEntity,
- samples: MutableMapping[str, Sample],
- *,
- stitched=True,
+ samples: MutableMapping[str, Sample]
):
"""
Populate the `samples` argument with the sample ancestors of the given
@@ -461,11 +459,10 @@
:
"""
if isinstance(entity, sample_types):
- if stitched or not entity.is_stitched:
- samples[str(entity.document_id)] = entity
+ samples[str(entity.document_id)] = entity
else:
for parent in entity.parents.values():
- self._find_ancestor_samples(parent, samples, stitched=stitched)
+ self._find_ancestor_samples(parent, samples)
def _visit_file(self, file):
visitor = TransformerVisitor()
@@ -1378,7 +1375,7 @@
project = self._get_project(self.api_bundle)
samples: MutableMapping[str, Sample] = dict()
for file in api.not_stitched(self.api_bundle.files.values()):
- self._find_ancestor_samples(file, samples, stitched=False)
+ self._find_ancestor_samples(file, samples)
for sample in samples.values():
assert not sample.is_stitched, sample
visitor = TransformerVisitor() Just like for #3281, it is possible that removing the condition reintroduces hitting size and time limits of the sort that caused #2868 and we'll need to test that against prod before we promote this change to prod. Assignee to come up with strategy on how to do that and describe the strategy here. I'd love to test on dev but we can't currently do that because dev is stale and doesn't have the analysis bundles that would cause those types of issues. |
Consider solving this and #3281 in the same PR. |
|
Alternatively: temporarily edit the TDR plugin to use my personal credentials instead of the service account credentials, and then just index the bundles in my personal without writing a new hard-coding plugin. Note that I had to implement this hack anyway in order to can the bundles. |
Much less convoluted solution wold just be index the canned bundles during the integration test. However we need to make sure the ElasticSearch instance sizes matches prod. |
Sounds good. #3339 will increase the sandbox ES instance size to the same as |
For demo, attempt to reproduce. |
For example there is no
bam
termFacet entry inSamples
, because thebam
files are part of the stitched subgraph while the sample is part of the primary subgraph.[Edit by Jesse to show reproductions and both prod and dev]
dev
Filtering by file format shows projects with BAMs
https://dev.singlecell.gi.ucsc.edu/explore/projects?filter=%5B%7B%22facetName%22:%22fileFormat%22,%22terms%22:%5B%22bam%22%5D%7D%5D
But selecting the samples tab shows no samples for this project and BAMs
https://dev.singlecell.gi.ucsc.edu/explore/samples?filter=%5B%7B%22facetName%22:%22fileFormat%22,%22terms%22:%5B%22bam%22%5D%7D,%7B%22facetName%22:%22projectId%22,%22terms%22:%5B%224a95101c-9ffc-4f30-a809-f04518a23803%22%5D%7D%5D
Even though if we select this project with out filtering for BAMs we do have samples
https://dev.singlecell.gi.ucsc.edu/explore/samples?filter=%5B%7B%22facetName%22:%22projectId%22,%22terms%22:%5B%224a95101c-9ffc-4f30-a809-f04518a23803%22%5D%7D%5D
prod
Filtering by file format shows projects with BAMs
https://data.humancellatlas.org/explore/projects?filter=%5B%7B%22facetName%22:%22fileFormat%22,%22terms%22:%5B%22bam%22%5D%7D%5D
But selecting the samples tab shows no samples for this project and BAMs
https://data.humancellatlas.org/explore/samples?filter=%5B%7B%22facetName%22:%22fileFormat%22,%22terms%22:%5B%22bam%22%5D%7D,%7B%22facetName%22:%22projectId%22,%22terms%22:%5B%22996120f9-e84f-409f-a01e-732ab58ca8b9%22%5D%7D%5D
Even though if we select this project with out filtering for BAMs we do have samples
https://data.humancellatlas.org/explore/samples?filter=%5B%7B%22facetName%22:%22projectId%22,%22terms%22:%5B%22996120f9-e84f-409f-a01e-732ab58ca8b9%22%5D%7D%5D
The text was updated successfully, but these errors were encountered: