
feat(elasticsearch): elasticsearch implementation #26

Merged
2 commits merged into feat/aggregation on Jun 30, 2021

Conversation

williamhaley
Contributor

Jira Ticket: PXP-8190

Sibling PR to uc-cdis/cloud-automation#1638

New Features

  • Use Elasticsearch to power aggregate metadata service APIs

Dependency updates

  • Remove dependencies for Redis and add dependencies for Elasticsearch

from mds import logger


agg_mds_index = "commons-index"
Contributor

Should we parameterize this to facilitate any future blue-green deployment?

Contributor Author

I think we can keep it static for now and parameterize it when we see a demand/need.
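If parameterization is wanted later, one minimal sketch is to read the index name from the environment and fall back to the current static value (the AGG_MDS_INDEX variable name is illustrative, not something this PR defines):

import os

# Hypothetical: fall back to the current static name when no override is set,
# so each stack in a blue-green deployment could point at its own index.
agg_mds_index = os.environ.get("AGG_MDS_INDEX", "commons-index")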

elastic_search_client = None


async def init(hostname: str = "0.0.0.0", port: int = 9200):
Contributor

Should the Kubernetes deployment descriptor's readinessProbe be updated to make sure this host:port is responsive before standing up a new pod and adding it to the load balancer rotation?
https://github.com/uc-cdis/cloud-automation/blob/master/kube/services/metadata/metadata-deploy.yaml#L60

Or maybe this check should be included in the logic that is executed as part of the _status endpoint (which is currently instrumented by the k8s readinessProbe).

@router.get("/_status")
async def get_status():
    now = await db.scalar("SELECT now()")
    return dict(status="OK", timestamp=now)

Contributor Author

You're right. I'll figure out the best way to incorporate that.
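For reference, a minimal sketch of folding an Elasticsearch check into the existing _status handler, assuming an async Elasticsearch client is exposed by the datastore module (the import paths and the HTTPException handling are illustrative, not part of this PR):

from fastapi import APIRouter, HTTPException

from mds import db  # assumed: existing database handle used by the current _status check
from mds.agg_mds import datastore  # assumed location of the Elasticsearch client

router = APIRouter()


@router.get("/_status")
async def get_status():
    # Existing PostgreSQL check
    now = await db.scalar("SELECT now()")

    # Hypothetical addition: ping() returns False instead of raising on failure,
    # so turn it into a 500 that the k8s readinessProbe will treat as not-ready.
    if not await datastore.elastic_search_client.ping():
        raise HTTPException(status_code=500, detail="Elasticsearch unavailable")

    return dict(status="OK", timestamp=now)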

# Flatten out this structure
doc = doc[key]["gen3_discovery"]

normalize_string_or_object(doc, "__manifest")
Contributor
@mcannalte Jun 29, 2021

Would it be possible to set a configurable list of, say, FIELDS_TO_NORMALIZE=["__manifest",...] ?
That might save some time later, since the discovery metadata schema has gone through a few changes already and seems prone to change

Edit: sorry for the late comment; I thought I left these a couple of days ago, but it turns out I left them as a pending review. These are intended more as guiding questions anyhow, so no need to let them get in the way of progress.

Contributor Author

That's a good point. The way I have this now is a bit messy, especially with the TODO. I'd meant to follow up on this and forgot. Ideally I'd like to clean up the source data so that the normalization isn't necessary at all, but I agree that at least setting a variable for the fields to be normalized is a meaningful start.
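A small sketch of the reviewer's suggestion, assuming normalize_string_or_object keeps its current signature (FIELDS_TO_NORMALIZE and any entries beyond __manifest are illustrative):

# Hypothetical config: fields whose values may arrive as either JSON strings or objects
FIELDS_TO_NORMALIZE = ["__manifest"]


def normalize_doc(doc: dict) -> None:
    # Apply the existing normalization helper to every configured field
    for field in FIELDS_TO_NORMALIZE:
        normalize_string_or_object(doc, field)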

exit(1)

await datastore.init(hostname, port)
await datastore.drop_all()
Contributor

I'm mostly unfamiliar with ElasticSearch, so this may be impossible/impractical, but is there a way to do this update more atomically, rather than delete + insert?
If this populate is only run once/day, then maybe it's not worth it. But if it runs often enough and there is enough metadata, it seems like there could be meaningful downtime between the delete and the end of the update.

Contributor Author

Great points. I punted on that rather than doing the work to diff/incrementally update records. At this point dropping/re-populating is definitely easier, but I'll spend some time looking at an incremental diff. One way or another I'm sure we're going to need that eventually, and it's better to get some insight into it before we desperately need it.
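For reference, Elasticsearch index aliases are one common way to make the re-populate effectively atomic without an incremental diff: build a fresh index, then repoint an alias to it in a single update_aliases call so readers never see a half-empty index. A minimal sketch, assuming an AsyncElasticsearch client from elasticsearch-py 7.x (none of these names come from this PR):

import time

from elasticsearch import AsyncElasticsearch


async def repopulate_atomically(es: AsyncElasticsearch, docs):
    alias = "commons-index"                    # logical name queried by the API
    new_index = f"{alias}-{int(time.time())}"  # versioned physical index

    await es.indices.create(index=new_index)
    for i, doc in enumerate(docs):
        await es.index(index=new_index, id=i, body=doc)

    # Collect whatever indices currently back the alias so they can be dropped
    old_indices = []
    if await es.indices.exists_alias(name=alias):
        old_indices = list(await es.indices.get_alias(name=alias))

    # Add the new index and remove the old ones in one atomic aliases update
    actions = [{"add": {"index": new_index, "alias": alias}}]
    actions += [{"remove_index": {"index": idx}} for idx in old_indices]
    await es.indices.update_aliases(body={"actions": actions})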

@williamhaley williamhaley force-pushed the feat/elasticsearch branch 2 times, most recently from dc1f17d to 1bf9d80 on June 30, 2021 15:24
@williamhaley
Contributor Author

Per core-product discussions I am merging this functionality in now so that it can be utilized for the HEAL MVP

@williamhaley williamhaley merged commit aacfa0c into feat/aggregation Jun 30, 2021
@williamhaley williamhaley deleted the feat/elasticsearch branch June 30, 2021 21:01