
NOAA harvest duplicates in catalog #2744

Closed
1 of 2 tasks
chris-macdermaid opened this issue Feb 4, 2021 · 14 comments
Assignees
Labels
bug Software defect or bug harvest-duplicates Issues related to Duplicated Datasets support Issues from agency requests or affecting users

Comments

@chris-macdermaid
Contributor

chris-macdermaid commented Feb 4, 2021

A number of NOAA harvests have duplicates, though far fewer than there were in catalog-classic. This is a follow-on to
Investigate NOAA Harvests Duplicates.

catalog-next duplicates:
IOOS WAF 7
NGDC MGG Geology 1
NGDC MGG Sonar Water Column 3

How to reproduce

  1. Review the harvest job reports for these WAFs and check for duplicates.

Expected behavior

If the records are duplicates this is the right behavior. If not, then determine why this is happening.

Sketch

  • Investigate why duplicates are occurring on catalog-next. If it is a NOAA issue, stop and close.
  • Identify a long-term path to removing/fixing them (offending code, manual intervention, etc.). Create new issues to resolve.
@chris-macdermaid chris-macdermaid added the bug Software defect or bug label Feb 4, 2021
@pjsharpe07 pjsharpe07 changed the title NOAA harvest dublicates in ckan-next NOAA harvest duplicates in ckan-next Feb 4, 2021
@jbrown-xentity
Contributor

Heard from Geoplatform. They've recorded a total of 25787 duplicates, of which 25709 are geospatial.
You can confirm by hitting this API to get the duplicates by guid (be warned, it's a lot).
According to Geoplatform and a visual scan, most of the duplicates are from NOAA (which are WAF harvest sources).

We should re-examine this issue, either removing duplicates via our de-duper script or fixing the issue at the source.

@jbrown-xentity jbrown-xentity changed the title NOAA harvest duplicates in ckan-next NOAA harvest duplicates in catalog Oct 7, 2021
@mogul mogul added the support Issues from agency requests or affecting users label Oct 7, 2021
@jbrown-xentity
Contributor

For example, you can see that noaa-tree-5046 has 5 datasets via this API call. The metadata_modified date shows the first dataset was loaded 8/6/2021, followed by one every week after (8/13, 8/20, 8/27, 9/3). That looks suspicious and seems directly related to the harvest process. All are harvested from the same source: ee428166-33c7-4eef-aee8-66156e0e9e08, NGDC Paleo. All have different harvest_object_ids, except one, which has no harvest info at all.
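The kind of duplicate report described above can be sketched in a few lines: given an organization's datasets, group them by their guid extra and flag any guid that appears more than once. This is an illustrative sketch, not the actual de-duper code; the field layout assumes CKAN's package_search output, where guid lives in the extras list.

```python
from collections import defaultdict

def find_duplicate_guids(datasets):
    """Group dataset dicts by their 'guid' extra and return the guids
    that appear more than once, with the matching dataset names."""
    by_guid = defaultdict(list)
    for ds in datasets:
        # CKAN stores guid as an "extra"; datasets without one are skipped
        extras = {e["key"]: e["value"] for e in ds.get("extras", [])}
        guid = extras.get("guid")
        if guid:
            by_guid[guid].append(ds["name"])
    return {g: names for g, names in by_guid.items() if len(names) > 1}
```

In practice the dataset list would come from paging through https://catalog.data.gov/api/action/package_search with a query like guid:* AND organization:noaa-gov.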

@jbrown-xentity
Contributor

We will need to modify our de-dupe scripts to handle guid. Currently the de-dupe system is set up to handle DCAT-US files, which use identifier as the unique key per dataset; ISO metadata uses guid.
We will need to parameterize this usage of identifier to allow guid searches as well. We could do a search on the organization to find out what kind of harvest source is being used: hit both https://catalog.data.gov/api/action/package_search?q=identifier:*%20AND%20organization:noaa-gov and https://catalog.data.gov/api/action/package_search?q=guid:*%20AND%20organization:noaa-gov and see which call gives > 0 results.
Use the default functionality to get an output of what cleanup would occur: run pipenv run python duplicates-identifier-api.py noaa-gov

There may be other breaking changes related to identifier that will need to be handled...
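The organization check suggested above could look roughly like this. It is a sketch, not the actual script: pick_unique_key and count_for_key are hypothetical names, and the injectable fetch callable is only there so the logic can be exercised without hitting the live API.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API = "https://catalog.data.gov/api/action/package_search"

def count_for_key(org, key, fetch=None):
    """Return how many datasets in `org` carry `key` (identifier or guid),
    using a package_search query like the two URLs in the comment above."""
    query = quote(f"{key}:* AND organization:{org}")
    url = f"{API}?q={query}&rows=0"
    fetch = fetch or (lambda u: json.load(urlopen(u)))
    return fetch(url)["result"]["count"]

def pick_unique_key(org, fetch=None):
    """Choose 'identifier' (DCAT-US) or 'guid' (ISO) based on which query matches."""
    if count_for_key(org, "identifier", fetch) > 0:
        return "identifier"
    if count_for_key(org, "guid", fetch) > 0:
        return "guid"
    raise ValueError(f"no datasets with identifier or guid found for {org}")
```

An organization that mixes DCAT-US and ISO sources would match both queries, so a fully parameterized flag on the script would still be the safer long-term option.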

@jbrown-xentity jbrown-xentity self-assigned this Oct 29, 2021
@jbrown-xentity
Contributor

We are getting the de-dupe code set up locally, and will then perform the following steps:

  • Run through the usage steps
  • Test running for NOAA using the organization name noaa-gov
  • Then change the identifier noted above to guid, and iterate

@FuhuXia
Member

FuhuXia commented Nov 3, 2021

So this will handle duplicated GUIDs within an org, not duplicated GUIDs across multiple orgs.

...
We could do a search on the organization to find out what kind of harvest source is being used, just hit both https://catalog.data.gov/api/action/package_search?q=identifier:*%20AND%20organization:noaa-gov and https://catalog.data.gov/api/action/package_search?q=guid:*%20AND%20organization:noaa-gov and see which call gives > 0 results
...

@jbrown-xentity
Contributor

Yes, this shortcut would only handle the single-organization use case. Otherwise we could fully parameterize it and make it an optional parameter...

@jbrown-xentity
Contributor

So by running this we immediately run into this bug: #2413. Essentially, NOAA datasets have weird/bad tags (e.g. university of north carolina coastal studies institute (unc-csi), or a bunch of spaces). The spatial harvester temporarily overrides normal CKAN tag validation so that these datasets can be inserted/edited via harvesting. However, these datasets can't be edited through normal CKAN processes without "fixing" the tag to be CKAN compliant (a tag can only contain alpha-numeric characters and -_.). We fixed this in our branch of ckanext-spatial (https://github.com/GSA/ckanext-spatial/pull/12/files), but we planned on only implementing it on CKAN 2.9 on py3, where it's much easier to write integration tests (see https://github.com/GSA/catalog.data.gov/pull/356/files). Not sure how to tackle this from here; options would be:

  1. Wait until we are on CKAN 2.9.
  2. Implement the patch on ckanext-spatial for the FCS environment.
  3. Remove the update_package code in the deduper (it is not critical to the process, but it makes the process much safer).
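For reference, the tag fix referred to above amounts to rewriting tags into CKAN's accepted alphabet (alphanumerics plus -_.) and length limits. The function below is a minimal illustrative sketch under those stated rules, not the actual ckanext-spatial patch:

```python
import re

def munge_tag(tag, min_len=2, max_len=100):
    """Rewrite a tag so it only contains characters CKAN accepts
    (alphanumerics plus '-', '_', '.'), in the spirit of the fix above."""
    tag = tag.lower().strip()
    tag = re.sub(r"[^a-z0-9\-_.]+", "-", tag)  # collapse illegal runs to '-'
    tag = re.sub(r"-+", "-", tag).strip("-")   # tidy repeated/edge hyphens
    return tag[:max_len].ljust(min_len, "_")   # enforce length limits
```

With a munger like this, the bad NOAA tag from the comment above becomes a hyphenated, CKAN-compliant string, and an all-spaces tag is padded up to the minimum length instead of failing validation.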

@jbrown-xentity
Contributor

For future reference, this can be run with:
while ! pipenv run python duplicates-identifier-api.py noaa-gov --newest --geospatial --api-read-url https://catalog.data.gov --commit; do sleep 60; done

@FuhuXia
Member

FuhuXia commented Nov 15, 2021

I don't see the --api-read-url flag in the usage README. Was it created in an attempt to speed up execution? Now that the speed issue is resolved by a db index, does this flag still have a reason to exist?

@jbrown-xentity
Contributor

I don't see the --api-read-url flag in the usage README. Was it created in an attempt to speed up execution? Now that the speed issue is resolved by a db index, does this flag still have a reason to exist?

It's in the most recent PR: GSA/datagov-dedupe#23. It's definitely not a bad idea to use the read-only fast version of the site to do read-only api calls. I don't think it speeds it up by a huge amount like db indexes did, but if it puts less load on the admin server I think that's a good thing.

@FuhuXia
Member

FuhuXia commented Nov 15, 2021

OK. But be aware that the read-only API talks to a SOLR replica. The SOLR replication delay might cause issues if we read a dataset right after updating it.
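One way to guard against that replication delay is to poll the replica until it reflects the revision just written, with a bounded number of retries. A hypothetical sketch, assuming a revision_id field on the dataset and an injectable fetch callable; neither is existing de-duper code:

```python
import time

def read_after_update(fetch, dataset_id, expected_revision, retries=5, delay=2):
    """Poll a read-only endpoint until it reflects a just-written revision,
    guarding against SOLR replication lag. `fetch` is any callable that
    returns the dataset dict for an id."""
    for _ in range(retries):
        ds = fetch(dataset_id)
        if ds.get("revision_id") == expected_revision:
            return ds  # replica has caught up
        time.sleep(delay)
    raise TimeoutError(f"replica still stale for {dataset_id} after {retries} tries")
```

The alternative is simpler: route any read that immediately follows a write to the admin (write) server, and reserve the read replica for bulk scans where staleness is harmless.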

@jbrown-xentity
Contributor

We will go with option 1 for now.
In order to get off the spatial fork for GSA, we should consider implementing the iDatasetForm on https://github.com/GSA/ckanext-datagovcatalog instead.

@mogul
Contributor

mogul commented Dec 2, 2021

We'll be on CKAN 2.9 soon. Once that happens, we can just run the de-dupe script.

@mogul mogul moved this to Product Backlog in data.gov team board Mar 23, 2022
@hkdctol
Contributor

hkdctol commented Sep 29, 2022

Closing this as we're tracking this in other issues.

@hkdctol hkdctol closed this as completed Sep 29, 2022
Repository owner moved this from Product Backlog to Done in data.gov team board Sep 29, 2022
@btylerburton btylerburton added the harvest-duplicates Issues related to Duplicated Datasets label Dec 21, 2023
@hkdctol hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Dec 26, 2023