feat(elasticsearch): elasticsearch implementation #26
Conversation
Drop-in replacement of Redis with Elasticsearch.
from mds import logger

agg_mds_index = "commons-index"
Should we parameterize this to facilitate any future blue-green deployment?
I think we can keep it static for now and parameterize it when we see a demand or need for it.
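For reference, a minimal sketch of what that parameterization could look like, assuming the index name is read from an environment variable (the AGG_MDS_INDEX name here is hypothetical, not part of this PR):

import os

# Hypothetical: read the index name from the environment so a future
# blue-green deployment can point the service at e.g. "commons-index-green"
# vs. "commons-index-blue" without a code change.
agg_mds_index = os.environ.get("AGG_MDS_INDEX", "commons-index")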
elastic_search_client = None


async def init(hostname: str = "0.0.0.0", port: int = 9200):
Should the Kubernetes deployment descriptor's readinessProbe be updated to make sure this host:port is responsive before standing up a new pod and adding it to the load balancer rotation?
https://github.com/uc-cdis/cloud-automation/blob/master/kube/services/metadata/metadata-deploy.yaml#L60
Or maybe this check should be included in the logic executed as part of the _status endpoint (which is currently instrumented by the k8s readinessProbe).
@router.get("/_status")
async def get_status():
now = await db.scalar("SELECT now()")
return dict(status="OK", timestamp=now)
You're right. I'll figure out the best way to incorporate that.
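One possible shape for that, as a sketch only: extend _status so the readinessProbe also fails when Elasticsearch is unreachable. This assumes the synchronous elasticsearch client (whose ping() returns a boolean rather than raising) and constructs a client inline purely to keep the example self-contained; in the PR it would come from datastore.init(). The existing db.scalar("SELECT now()") check is omitted here for brevity.

from elasticsearch import Elasticsearch
from fastapi import APIRouter, HTTPException

router = APIRouter()

# In the PR this client would be the one created by datastore.init();
# it is built here only so the sketch runs on its own.
elastic_search_client = Elasticsearch(["http://localhost:9200"])


@router.get("/_status")
async def get_status():
    # ping() returns False instead of raising when the cluster is down,
    # so the readinessProbe fails fast rather than waiting on a query timeout.
    if not elastic_search_client.ping():
        raise HTTPException(status_code=503, detail="Elasticsearch unreachable")
    return dict(status="OK")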
# Flatten out this structure
doc = doc[key]["gen3_discovery"]

normalize_string_or_object(doc, "__manifest")
Would it be possible to set a configurable list of, say, FIELDS_TO_NORMALIZE=["__manifest", ...]? That might save some time later, since the discovery metadata schema has gone through a few changes already and seems prone to change.
Edit: sorry for the late comment. I thought I left these a couple of days ago, but it turns out I left them as a pending review. These are intended more as guiding questions anyhow; no need to let them get in the way of progress.
That's a good point. The way I have this now is a bit messy, especially with the TODO. I'd meant to follow up on this and forgot. Ideally I'd like to clean up the source data so that the normalizations aren't necessary at all, but I agree that at least setting a variable for the fields to be normalized is a meaningful start.
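As a starting point, a sketch of that variable; the body of normalize_string_or_object here is an illustrative guess at its intent, not the PR's actual implementation, and loading the list from config is an assumption:

import json

# Fields whose values may arrive either as JSON-encoded strings or as objects.
# Keeping the list in one place (or loading it from config) means future
# discovery-schema changes only require touching this constant.
FIELDS_TO_NORMALIZE = ["__manifest"]  # extend as the schema evolves


def normalize_string_or_object(doc: dict, key: str) -> None:
    """If doc[key] is a JSON string, decode it in place; leave objects untouched."""
    value = doc.get(key)
    if isinstance(value, str):
        doc[key] = json.loads(value)


def normalize_doc(doc: dict) -> dict:
    for field in FIELDS_TO_NORMALIZE:
        normalize_string_or_object(doc, field)
    return doc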
exit(1)

await datastore.init(hostname, port)
await datastore.drop_all()
I'm mostly unfamiliar with Elasticsearch, so this may be impossible or impractical, but is there a way to do this update more atomically, rather than delete + insert?
If this populate is only run once a day, then maybe it's not worth it. But if it runs often enough and there is enough metadata, it seems like there could be meaningful downtime between the delete and the end of the update.
Great points. I punted on that rather than doing the work to diff/incrementally update records. At this point dropping and re-populating is definitely easier, but I'll spend some time looking at an incremental diff. One way or another I'm sure we're going to need that in time, and it's better to get some insight into it before we desperately need it.
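For reference, the usual Elasticsearch pattern for avoiding that gap without diffing records is to build the new index under a fresh name and then repoint an alias in a single atomic call. A rough sketch, assuming the 7.x synchronous client; the index names are illustrative, not from this PR:

from elasticsearch import Elasticsearch

client = Elasticsearch(["http://localhost:9200"])

ALIAS = "commons-index"          # what the query path reads from
NEW_INDEX = "commons-index-v2"   # illustrative name for the freshly populated index
OLD_INDEX = "commons-index-v1"   # illustrative name for the index being replaced


def swap_alias(old_index: str = OLD_INDEX, new_index: str = NEW_INDEX) -> None:
    """Atomically repoint the alias from old_index to new_index.

    Both actions are applied in one update_aliases call, so readers never
    see a moment where the alias resolves to nothing.
    """
    client.indices.update_aliases(
        body={
            "actions": [
                {"remove": {"index": old_index, "alias": ALIAS}},
                {"add": {"index": new_index, "alias": ALIAS}},
            ]
        }
    )
    # The old index can be dropped once nothing references it.
    client.indices.delete(index=old_index, ignore=[404])

With this, populate would write into the new index and only call swap_alias() after the bulk load succeeds, so queries against the alias keep hitting the old data until the new copy is complete.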
Force-pushed from dc1f17d to 1bf9d80.
Per core-product discussions, I am merging this functionality in now so that it can be utilized for the HEAL MVP.
Jira Ticket: PXP-8190
Sibling PR to uc-cdis/cloud-automation#1638
New Features
Dependency updates