This repository hosts our Data Dictionary (DD) and accompanying utilities. These utilities include tools for converting the dictionary between XLSX, the definitive YAML spec, and the JSON representation that is to be referenced via Helm values.
- Update Dictionary:
  - Our dictionary maintainer is expected to update the contents of the XLSX file at path `gdcdictionary/xlsx/nodes_schema_impowr.xlsx`
- Propose Changes:
  - Once the file has been updated, the dictionary maintainer is expected to open a pull request on GitHub containing the new file contents
- Generate Specification:
  - After the pull request has been merged, a developer or automation will run the following scripts:
    - Convert the XLSX file to a rendered set of YAML files
      `python3 dictionaryutils/setup.py install --force`
      `python3 dictionaryutils/utils/tsv2yaml.py -i gdcdictionary/xlsx/ -o gdcdictionary/schemas/ -e xlsx`
    - Aggregate the YAML files and dump the specification to a single JSON file
      `python3 dictionaryutils/utils/dump_schema.py`
- Deploy Dictionary:
  - With the new YAML spec and JSON dump created, a developer or automation will then deploy the dictionary updates to our data commons
    - The JSON artifact will be uploaded to a bucket for hosting
    - The IMPOWR gen3 Helm chart will be updated to reference the latest JSON dump
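
For anyone automating the Generate Specification step, the three commands listed above could be wrapped in a small Python script along the following lines. This is only a sketch: the function names `build_commands` and `generate_specification` are hypothetical and not part of this repository, and the script assumes it is run from the repository root.

```python
import subprocess
import sys

XLSX_DIR = "gdcdictionary/xlsx/"
SCHEMA_DIR = "gdcdictionary/schemas/"


def build_commands(python=sys.executable):
    """Return the Generate Specification commands, in the order they must run."""
    return [
        # Install the dictionary utilities.
        [python, "dictionaryutils/setup.py", "install", "--force"],
        # Convert the XLSX file to a rendered set of YAML files.
        [python, "dictionaryutils/utils/tsv2yaml.py",
         "-i", XLSX_DIR, "-o", SCHEMA_DIR, "-e", "xlsx"],
        # Aggregate the YAML files into a single JSON dump.
        [python, "dictionaryutils/utils/dump_schema.py"],
    ]


def generate_specification(dry_run=False):
    """Print each command and, unless dry_run is set, execute it."""
    for command in build_commands():
        print("$", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError, so a failed conversion
            # stops the pipeline before the JSON dump is produced.
            subprocess.run(command, check=True)
```

Calling `generate_specification(dry_run=True)` prints the commands without running them, which is a cheap way to confirm the paths before touching the schemas directory.
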
The IMPOWR data dictionary was initialized from the DCF data dictionary provided by the gen3 team. The DCF data dictionary is a baseline dictionary: it serves as a starting point from which users can create their own dictionaries. The flexibility of the DCF dictionary makes the process of creating a new dictionary relatively simple. You may read more official documentation on the DCF data dictionary here.
Our dictionary is maintained in the XLSX file format. We selected this format because it is the friendliest for non-developers who are more focused on research and analysis. The file contains three sheets:
- nodes_impowr
  - Nodes in the data model are strongly typed and individually defined for a specific data type
  - Nodes may not have hyphens in their names; use underscores instead
    - Consequently, link names should follow the same requirement
  - Nodes are grouped into categories that represent broad roles for the node, such as `analysis` or `biospecimen`
  - Nodes have a series of `systemProperties`; these properties are filled in automatically by the system unless otherwise defined by the user
  - Nodes may be the parent or child of another node
    - These node relationships are expressed as a link which points a child to its parent
      - `link_name`: name of the relationship; no hyphens, only underscores as a delimiter
      - `link_label`: description of the relationship
      - `link_backref`: child type
      - `link_target`: parent type
      - `link_multiplicity`: type of relationship, e.g. `one_to_many`, `many_to_one`, `many_to_many`, etc.
      - `link_required`: whether the relationship is required or not (`True`, `False`, or empty)
- variables_impowr
  - Each node will have a collection of `properties`. Each row in this sheet represents a single `property` of a specified node
  - Properties will have a specified `type`
    - Possible types include `string`, `number`, `boolean`, `integer`, `enum`, etc.
      - `enum` indicates that the property has a defined set of values that it may take
  - You may also configure a property's `maximum`, `minimum`, `default` value, `format`, etc.
- enums_impowr
  - Each `enum` property will list its available options in this sheet
  - In this sheet, a given row will specify the node, property, and value for an `enum` option
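
To make the relationship between the three sheets more concrete, the sketch below shows one way their rows might be checked and combined once read into Python dictionaries. It is an illustration only, not the logic used by `tsv2yaml.py`. The column keys `node`, `property`, `value`, and `type`, the helper names, and the sample `subject` rows are all assumptions made for this example. The set of multiplicities contains only the three values named above, so a real sheet may legitimately use others.

```python
from collections import defaultdict

# Only the multiplicities named in the text; the real list may be longer ("etc.").
KNOWN_MULTIPLICITIES = {"one_to_many", "many_to_one", "many_to_many"}


def check_link(row):
    """Return a list of problems found in one link definition."""
    problems = []
    if "-" in row.get("link_name", ""):
        problems.append("link_name must use underscores, not hyphens")
    if row.get("link_multiplicity") not in KNOWN_MULTIPLICITIES:
        problems.append("link_multiplicity is not a recognized value")
    if str(row.get("link_required", "")).strip() not in {"True", "False", ""}:
        problems.append("link_required must be True, False, or empty")
    return problems


def collect_enums(enum_rows):
    """Group enum option rows by (node, property), preserving row order."""
    options = defaultdict(list)
    for row in enum_rows:
        options[(row["node"], row["property"])].append(row["value"])
    return dict(options)


def build_property(node, variable_row, enums):
    """Turn one variables-sheet row into a JSON-Schema-like definition."""
    name = variable_row["property"]
    if variable_row["type"] == "enum":
        # Enum properties take their allowed values from the enums sheet.
        return {"enum": enums.get((node, name), [])}
    definition = {"type": variable_row["type"]}
    for key in ("minimum", "maximum", "default", "format"):
        # Empty cells are skipped; a literal 0 is kept.
        if variable_row.get(key) not in (None, ""):
            definition[key] = variable_row[key]
    return definition


# Made-up sample rows, for illustration only.
enums = collect_enums([
    {"node": "subject", "property": "sex", "value": "female"},
    {"node": "subject", "property": "sex", "value": "male"},
])
sex = build_property("subject", {"property": "sex", "type": "enum"}, enums)
age = build_property(
    "subject",
    {"property": "age", "type": "integer", "minimum": 0, "maximum": ""},
    enums,
)
bad_link = check_link({
    "link_name": "subject-study",
    "link_multiplicity": "many_to_one",
    "link_required": "True",
})
print(sex, age, bad_link)
```

Keeping the enum options in their own sheet means a new allowed value is a single added row, rather than an edit to a long cell in the variables sheet.
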
Our dictionary maintainer will most likely use GitHub Desktop to interact with our repository and propose dictionary changes. The general order of operations for this process is:
- If this is your first time opening GitHub Desktop, you will need to clone (download) this repository to your local machine
- Create a new branch for proposing dictionary changes
- Open the XLSX dictionary file in Excel
- Perform your dictionary modifications, save the XLSX file, and close Excel
- Commit your changes and create a pull request
- Install dependencies
  `python3 dictionaryutils/setup.py install --force`
- Convert XLSX to YAML
  `python3 dictionaryutils/utils/tsv2yaml.py -i gdcdictionary/xlsx/ -o gdcdictionary/schemas/ -e xlsx`
- Convert YAML to XLSX
  `python3 dictionaryutils/utils/yaml2tsv.py -i gdcdictionary/schemas/ -o gdcdictionary/xlsx/ -e xlsx -d impowr`
- Dump YAML schema to JSON
  `python3 -m dictionaryutils.utils.dump_schema`