don't add invenio-classifier keywords to records #554

Closed
michamos opened this issue Aug 26, 2024 · 4 comments

@michamos
Collaborator

michamos commented Aug 26, 2024

Currently, we're adding keywords extracted by invenio-classifier in the article workflows to Literature records. These are only temporary, and are supposed to be later replaced by manually assigned keywords. However, DESY has stopped assigning those keywords manually, so we should stop putting the automated keywords in records too.
This requires removing the prepare_keywords workflow step from the article and core_selection workflows.

Once that's implemented, we'll also need to clean up existing records.

@PascalEgn
Collaborator

PascalEgn commented Aug 29, 2024

Instructions to clean up after inspirehep/inspire-next#4350 is deployed on prod:

Exec into the primary postgres node, run psql, switch to the database with \c inspirhep, inspect the table with \d records_metadata, and run the SQL script:

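UPDATE records_metadata
SET json = (
    SELECT COALESCE(
        jsonb_set(
            json::jsonb,
            '{keywords}',
            (
                SELECT jsonb_agg(elem)
                FROM jsonb_array_elements(json::jsonb->'keywords') AS elem
                -- IS DISTINCT FROM also keeps keywords that have no source
                WHERE elem->>'source' IS DISTINCT FROM 'classifier'
            )
        ),
        -- jsonb_agg yields NULL when no keyword survives, which makes
        -- jsonb_set return NULL; drop the keywords key in that case
        json::jsonb - 'keywords'
    )
)
WHERE json::jsonb->'keywords' IS NOT NULL;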

@michamos
Collaborator Author

@PascalEgn it's a very bad idea to modify records in the DB directly, as many things are done at the application level during an update (reindexing the record and any dependent records, storing a new version in the records_metadata_versions table, triggering updates at ORCID and HAL if needed, etc.). This kind of large-scale metadata update should be done using https://github.com/inspirehep/curation-scripts if possible, or with custom code. In any case, it's not yet clear exactly how to clean this up in existing records, because of merges where classifier metadata overwrote human-generated keywords. I'll update this once I have a clearer picture.

@drjova drjova assigned michamos and unassigned PascalEgn Sep 10, 2024
@michamos
Collaborator Author

michamos commented Sep 10, 2024

For the cleanup script, a simple search turns out to be enough: keywords.source:classifier. All keywords with source=classifier should be removed from the matching records. This will remove a bit too much because of the merge problem, but DESY still has the data and will re-upload the correct keywords.
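To make the removal rule concrete, here is a minimal sketch of the per-record filtering step only. It is not the actual curation script: the helper name remove_classifier_keywords is hypothetical, the record is assumed to be a plain dict with an optional keywords list of dicts carrying a source key, and running the search and saving the record through the application are left out.

```python
from copy import deepcopy


def remove_classifier_keywords(record):
    """Return a copy of ``record`` without keywords whose source is "classifier".

    Hypothetical helper: ``record`` is assumed to be a plain dict shaped like
    a Literature record, with an optional ``keywords`` list of dicts.
    """
    cleaned = deepcopy(record)
    kept = [
        keyword
        for keyword in cleaned.get("keywords", [])
        # Keywords without a "source" key are kept.
        if keyword.get("source") != "classifier"
    ]
    if kept:
        cleaned["keywords"] = kept
    else:
        # Assumption: an empty keywords list is not wanted, so drop the key.
        cleaned.pop("keywords", None)
    return cleaned


# Illustrative record; the field values are made up for this example.
record = {
    "titles": [{"title": "An example record"}],
    "keywords": [
        {"value": "quark", "source": "classifier"},
        {"value": "lattice", "schema": "INSPIRE"},
    ],
}
print(remove_classifier_keywords(record)["keywords"])
```

Working on a copy leaves the input untouched, so the before and after states can be compared or logged before anything is saved.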

@PascalEgn
Collaborator

I opened a PR on the curation-scripts repo, you can check if it's fine @michamos :)

@drjova drjova closed this as completed Sep 25, 2024