NOAA harvest duplicates in catalog #2744
Comments
Heard from Geoplatform: they have recorded a total of 25787 duplicates, of which 25709 are geospatial. We should re-examine this issue, either by removing duplicates with our de-duper script or by fixing the problem at the source.
For example, you can see |
We will need to modify our de-dupe scripts to handle GUIDs. Currently the de-dupe system is set up to handle DCAT-US files, which use There may be other breaking changes related to
We are getting the de-dupe code set up locally and will then perform the following steps:
So this will handle duplicated GUIDs within an org, not duplicated GUIDs across multiple orgs.
Yes, this shortcut would handle only the single-organization use case. Otherwise we could fully parameterize it and make the organization an optional parameter...
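To make the scoping question above concrete, here is a minimal sketch of grouping records by GUID, either within one organization or across all of them. It is not the datagov-dedupe implementation: the function name `find_duplicate_guids`, the `per_org` flag, and the record fields (`id`, `organization`, `guid`, `metadata_modified`) are all assumptions made for illustration.

```python
from collections import defaultdict


def find_duplicate_guids(datasets, per_org=True):
    """Return groups of records that share a GUID.

    Hypothetical record shape (an assumption, not the real CKAN schema):
    {"id", "organization", "guid", "metadata_modified"}.

    With per_org=True the grouping key is (organization, guid), so the same
    GUID appearing in two different organizations is NOT reported.
    With per_org=False the key is (None, guid), so it is.
    """
    groups = defaultdict(list)
    for ds in datasets:
        org = ds["organization"] if per_org else None
        groups[(org, ds["guid"])].append(ds)

    duplicates = {}
    for key, members in groups.items():
        if len(members) > 1:
            # Oldest first; ISO 8601 timestamps sort correctly as strings.
            duplicates[key] = sorted(members, key=lambda d: d["metadata_modified"])
    return duplicates
```

Which member of each group to keep (oldest, newest, or something else) is a policy decision that this sketch deliberately leaves open.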
So by running this we immediately run into this bug: #2413. Essentially NOAA datasets have weird/bad tags (eg
For future: options to run this are |
I don't see |
It's in the most recent PR: GSA/datagov-dedupe#23. It's definitely not a bad idea to use the read-only fast version of the site for read-only API calls. I don't think it speeds things up by a huge amount like the db indexes did, but if it puts less load on the admin server I think that's a good thing.
OK. But be aware that the read-only API is talking to a SOLR replica. The SOLR replication delay might cause some issues if we read a dataset right after updating it.
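One way to guard against that replication delay is to poll the read-only endpoint until it reflects the write. The sketch below is only an illustration under assumptions: `wait_until_visible` is a hypothetical helper, `fetch` stands in for whatever call reads the dataset from the replica, and comparing `metadata_modified` timestamps is an assumed freshness check rather than anything the de-dupe script is known to do.

```python
import time


def wait_until_visible(fetch, expected_modified, attempts=5, delay=1.0, sleep=time.sleep):
    """Poll a replica-backed read until it reflects a recent write.

    fetch: hypothetical zero-argument callable returning a dict (or None)
           for the dataset as seen through the read-only API.
    expected_modified: ISO 8601 timestamp of the write we are waiting for.
    Returns the fresh record, or None if the replica never caught up.
    """
    for attempt in range(attempts):
        record = fetch()
        # ISO 8601 strings in the same format compare correctly as text.
        if record is not None and record.get("metadata_modified", "") >= expected_modified:
            return record
        if attempt < attempts - 1:
            sleep(delay)
    return None
```

If the helper returns `None`, the caller can fall back to reading from the primary (admin) server instead of acting on stale data.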
We will go with option 1 for now.
We'll be on CKAN 2.9 soon. Once that happens, we can just run the de-dupe script.
Closing this as we're tracking this in other issues.
A number of NOAA harvests have duplicates. There are far fewer than there were in catalog-classic. This is a follow-on to Investigate NOAA Harvests Duplicates.
catalog-next duplicates:
IOOS WAF 7
NGDC MGG Geology 1
NGDC MGG Sonar Water Column 3
How to reproduce
Review the harvest job report for duplicates for these WAFs
Expected behavior
If the records are duplicates this is the right behavior. If not, then determine why this is happening.
Sketch