Runprov extractor #433

jsheunis · 2024-03-06T11:16:17Z

Closes #432

This ports functionality from two of datalad-metalad's extractors: core and runprov.

The former was added previously to the abcdj branch and is now cherry-picked into this branch.

The latter is newly added as a script, catalog_runprov, that runs a slightly refactored version of the 'runprov' extractor in datalad-metalad. Additionally, it translates the output of that code into a metadata record that is compliant with datalad-catalog's dataset schema, so that the script's output can be directly 'catalog-added' as an entry to an existing catalog.

The main reason for porting this functionality here is to have self-contained scripts inside the package that make dependence on metalad unnecessary.

This ports most functionality from datalad_metalad.extractors.core into a script that then also adds translation into the catalog schema. The script receives a path to a datalad dataset as a parameter and outputs a metadata record that can immediately be added to a catalog. In this way, the dependence on metalad is removed and the explicit Translator functionality of the catalog (which also depends on jq bindings) does not have to be used. The reason for doing this is to have a self-contained script that could in future just be ripped out and replaced with whatever new functionality supersedes this.
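To illustrate the idea of translating extractor output in plain Python rather than through a jq-based Translator, here is a minimal sketch. It is not the code in this PR: the function name `translate_to_catalog_record`, the input keys, and the exact set of output fields are assumptions for illustration and should be checked against the actual extractor output and datalad-catalog's dataset schema.

```python
import json


def translate_to_catalog_record(extracted: dict) -> dict:
    """Map a (hypothetical) extractor output onto a catalog-style dataset record.

    All key names here are illustrative assumptions, not the confirmed schema.
    """
    core = extracted.get("extracted_metadata", {})
    return {
        "type": "dataset",
        "dataset_id": extracted["dataset_id"],
        "dataset_version": extracted["dataset_version"],
        "name": core.get("name", ""),
        # Record where the metadata came from
        "metadata_sources": {
            "sources": [
                {
                    "source_name": extracted["extractor_name"],
                    "source_version": extracted["extractor_version"],
                }
            ]
        },
    }


# Made-up input resembling an extractor result (placeholder values only)
example_output = {
    "dataset_id": "00000000-0000-0000-0000-000000000000",
    "dataset_version": "abc123",
    "extractor_name": "catalog_core",
    "extractor_version": "1",
    "extracted_metadata": {"name": "demo-dataset"},
}

record = translate_to_catalog_record(example_output)
print(json.dumps(record, indent=2))
```

Keeping the mapping in a single plain function like this is one way to make the translation step easy to swap out later, which is the stated motivation for the self-contained script.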

This commit adds a script that runs a slightly refactored version of the 'runprov' extractor in datalad-metalad. Additionally, it translates the output of that code into a metadata record that is compliant with datalad-catalog's dataset schema, so that the script's output can be directly 'catalog-added' as an entry to an existing catalog. The main reason for porting this functionality is to have a self-contained script inside the package that makes dependence on metalad unnecessary.

netlify · 2024-03-06T11:16:32Z

✅ Deploy Preview for datalad-catalog canceled.

Name	Link
🔨 Latest commit	`608bc40`
🔍 Latest deploy log	https://app.netlify.com/sites/datalad-catalog/deploys/664fa49c235d9b0008913099

jsheunis · 2024-03-06T12:04:47Z

The following script:

- takes paths to a dataset and to a catalog as arguments
- extracts core metadata as well as runprov metadata from the dataset
- translates these records to catalog-ready records
- adds the records to the catalog

from argparse import ArgumentParser
import json

from datalad_catalog.extractors import (
    catalog_core,
    catalog_runprov,
)
from datalad_catalog.constraints import EnsureWebCatalog
from datalad_next.constraints.dataset import EnsureDataset



def get_metadata_records(dataset):
    """Return catalog-ready core and runprov records for a dataset."""
    # first get core dataset-level metadata
    core_record = catalog_core.get_catalog_metadata(dataset)
    # then get runprov dataset-level metadata
    runprov_record = catalog_runprov.get_catalog_metadata(
        source_dataset=dataset,
        process_type='dataset')
    # return both
    return core_record, runprov_record


def add_to_catalog(records, catalog):
    # Imported here so that datalad.api is only loaded when records are added
    from datalad.api import catalog_add

    # Add each metadata record to the catalog as a JSON-serialized string
    for r in records:
        catalog_add(
            catalog=catalog,
            metadata=json.dumps(r),
        )


if __name__ == "__main__":

    parser = ArgumentParser()
    parser.add_argument(
        "dataset_path", type=str, help="Path to the datalad dataset",
    )
    parser.add_argument(
        "catalog_path", type=str, help="Path to the catalog",
    )
    args = parser.parse_args()
    # Ensure the first argument points to an installed dataset with an ID
    ds = EnsureDataset(
        installed=True, purpose="extract metadata", require_id=True
    )(args.dataset_path).ds
    # Ensure the second argument points to an existing catalog
    catalog = EnsureWebCatalog()(args.catalog_path)
    core_record, runprov_record = get_metadata_records(ds)

    print(json.dumps(core_record))
    print("\n")
    print(json.dumps(runprov_record))

    # Add metadata to catalog
    add_to_catalog([core_record, runprov_record], catalog)

The script shows how core and runprov metadata can be extracted from a datalad dataset, translated into the catalog schema, and added to an existing catalog.
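As a variation on the script's print statements, the records could be written to a JSON lines file (one record per line) so they can be inspected or added to a catalog in a separate step. This is a minimal sketch with hypothetical helper names (`write_jsonl`, `read_jsonl`) and placeholder records. Whether catalog-add accepts such a file as its metadata argument should be verified against the datalad-catalog documentation.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, out_path):
    """Write one JSON object per line and return the output path."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return out_path


def read_jsonl(path):
    """Read a JSON lines file back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder records standing in for core_record and runprov_record
core_record = {"type": "dataset", "dataset_id": "abc", "dataset_version": "v1", "name": "demo"}
runprov_record = {"type": "dataset", "dataset_id": "abc", "dataset_version": "v1"}

out_file = write_jsonl(
    [core_record, runprov_record],
    Path(tempfile.mkdtemp()) / "records.jsonl",
)
print(read_jsonl(out_file))
```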

jsheunis added 3 commits March 5, 2024 15:18

refactor core metadata extractor, abstract out some functionality

82e1e0b

linting fixes

fb5690c

jsheunis mentioned this pull request Apr 10, 2024

Apply provenance-related features from upstream sfb1451/metadata-catalog#94

Closed

jsheunis added 2 commits May 23, 2024 22:13

add a helper script for core and runprov metadata

8bd5dfa

the script shows how core and runprov metadata can be extracted from a datalad dataset, translated into the catalog schema, and added to an existing catalog

linting improvements

608bc40

jsheunis merged commit 68b7665 into main May 23, 2024
9 of 13 checks passed

jsheunis mentioned this pull request May 23, 2024

Add ability to ingest, generate and render runprov metadata #37

Closed
