
feat(elasticsearch): elasticsearch implementation #26

Merged
2 commits merged into feat/aggregation on Jun 30, 2021

Conversation

williamhaley
Contributor

Jira Ticket: PXP-8190

Sibling PR to uc-cdis/cloud-automation#1638

New Features

  • Use Elasticsearch to power aggregate metadata service APIs

Dependency updates

  • Remove dependencies for Redis and add dependencies for Elasticsearch

from mds import logger


agg_mds_index = "commons-index"
Contributor

Should we parameterize this to facilitate any future blue-green deployment?

Contributor Author

I think we can keep it static for now and parameterize it when we see a demand/need.
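If parameterization is wanted later, one minimal sketch is to read the index name from the environment and fall back to the current static value (the AGG_MDS_INDEX variable name is illustrative, not something this PR defines):

import os

# Hypothetical: fall back to the current static name when no override is set,
# so each stack in a blue-green deployment could point at its own index.
agg_mds_index = os.environ.get("AGG_MDS_INDEX", "commons-index")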

elastic_search_client = None


async def init(hostname: str = "0.0.0.0", port: int = 9200):
Contributor

Should the Kubernetes deployment descriptor's readinessProbe be updated to make sure this host:port is responsive before standing up a new pod and adding it to the load balancer rotation?
https://github.com/uc-cdis/cloud-automation/blob/master/kube/services/metadata/metadata-deploy.yaml#L60

Or maybe this check should be included in the logic that is executed as part of the _status endpoint (which is currently instrumented by the k8s readinessProbe).

@router.get("/_status")
async def get_status():
    now = await db.scalar("SELECT now()")
    return dict(status="OK", timestamp=now)

Contributor Author

You're right. I'll figure out the best way to incorporate that.
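For reference, a minimal sketch of folding an Elasticsearch check into the existing _status handler, assuming an async Elasticsearch client is exposed by the datastore module (the import paths and the HTTPException handling are illustrative, not part of this PR):

from fastapi import APIRouter, HTTPException

from mds import db  # assumed: existing database handle used by the current _status check
from mds.agg_mds import datastore  # assumed location of the Elasticsearch client

router = APIRouter()


@router.get("/_status")
async def get_status():
    # Existing PostgreSQL check
    now = await db.scalar("SELECT now()")

    # Hypothetical addition: ping() returns False instead of raising on failure,
    # so turn it into a 500 that the k8s readinessProbe will treat as not-ready.
    if not await datastore.elastic_search_client.ping():
        raise HTTPException(status_code=500, detail="Elasticsearch unavailable")

    return dict(status="OK", timestamp=now)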

# Flatten out this structure
doc = doc[key]["gen3_discovery"]

normalize_string_or_object(doc, "__manifest")
Contributor
@mcannalte Jun 29, 2021

Would it be possible to set a configurable list of, say, FIELDS_TO_NORMALIZE=["__manifest",...] ?
That might save some time later, since the discovery metadata schema has gone through a few changes already and seems prone to change

Edit: sorry for the late comment; I thought I left these a couple of days ago, but it turns out I left them as a pending review. These are intended more as guiding questions anyhow, so no need to let them get in the way of progress.

Contributor Author

That's a good point. The way I have this now is a bit messy, especially with the TODO. I'd meant to follow up on this and forgot. Ideally I'd like to clean up the source data so that the normalization isn't necessary at all, but I agree that at least setting a variable for the fields to be normalized is a meaningful start.
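A small sketch of the reviewer's suggestion, assuming normalize_string_or_object keeps its current signature (FIELDS_TO_NORMALIZE and any entries beyond __manifest are illustrative):

# Hypothetical config: fields whose values may arrive as either JSON strings or objects
FIELDS_TO_NORMALIZE = ["__manifest"]


def normalize_doc(doc: dict) -> None:
    # Apply the existing normalization helper to every configured field
    for field in FIELDS_TO_NORMALIZE:
        normalize_string_or_object(doc, field)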

exit(1)

await datastore.init(hostname, port)
await datastore.drop_all()
Contributor

I'm mostly unfamiliar with ElasticSearch, so this may be impossible/impractical, but is there a way to do this update more atomically, rather than delete + insert?
If this populate is only run once/day, then maybe it's not worth it. But if it runs often enough and there is enough metadata, it seems like there could be meaningful downtime between the delete and the end of the update.

Contributor Author

Great points. I punted on that rather than doing the work to diff/incrementally update records. At this point dropping/re-populating is definitely easier, but I'll spend some time looking at an incremental diff. One way or another I'm sure we're going to need that eventually, and it's better to get some insight into it before we desperately need it.
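For reference, Elasticsearch index aliases are one common way to make the re-populate effectively atomic without an incremental diff: build a fresh index, then repoint an alias to it in a single update_aliases call so readers never see a half-empty index. A minimal sketch, assuming an AsyncElasticsearch client from elasticsearch-py 7.x (none of these names come from this PR):

import time

from elasticsearch import AsyncElasticsearch


async def repopulate_atomically(es: AsyncElasticsearch, docs):
    alias = "commons-index"                    # logical name queried by the API
    new_index = f"{alias}-{int(time.time())}"  # versioned physical index

    await es.indices.create(index=new_index)
    for i, doc in enumerate(docs):
        await es.index(index=new_index, id=i, body=doc)

    # Collect whatever indices currently back the alias so they can be dropped
    old_indices = []
    if await es.indices.exists_alias(name=alias):
        old_indices = list(await es.indices.get_alias(name=alias))

    # Add the new index and remove the old ones in one atomic aliases update
    actions = [{"add": {"index": new_index, "alias": alias}}]
    actions += [{"remove_index": {"index": idx}} for idx in old_indices]
    await es.indices.update_aliases(body={"actions": actions})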

@williamhaley williamhaley force-pushed the feat/elasticsearch branch 2 times, most recently from dc1f17d to 1bf9d80 on June 30, 2021 15:24
@williamhaley
Contributor Author

Per core-product discussions I am merging this functionality in now so that it can be utilized for the HEAL MVP

@williamhaley williamhaley merged commit aacfa0c into feat/aggregation Jun 30, 2021
@williamhaley williamhaley deleted the feat/elasticsearch branch June 30, 2021 21:01