Review and fix "required" and "default" flags for vocab download #278

Open
mik-ohdsi opened this issue Apr 7, 2022 · 14 comments
Labels
bug Defects

Comments

@mik-ohdsi

A recent forum post highlighted the issue that some vocabularies are set to "required" and as such will always be part of a download bundle and cannot be deselected. In particular, among these, the vocabularies Korean Revenue Code, OSM and SOPT seemed a little off to be indispensable for an OMOP CDM.

These are the ones currently marked as "OMOP required":

vocabulary_id_v5
CDM
Cohort Type
Concept Class
Condition Status
Condition Type
Cost
Cost Type
Death Type
Device Type
Domain
Drug Type
Episode
Korean Revenue Code
Meas Type
Metadata
None
Note Type
Observation Type
Obs Period Type
OSM
Plan
Plan Stop Reason
Procedure Type
Relationship
SOPT
Sponsor
Type Concept
UB04 Point of Origin
UB04 Pri Typ of Adm
UB04 Pt dis status
UB04 Typ bill
UCUM
US Census
Visit
Visit Type
Vocabulary

I guess we can remove all the ones with a "Type" in their name except for the new Type Concepts as they have replaced them. The respective concepts per vocabulary ID could probably also be retired.
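The filter suggested here can be sketched in a few lines of Python. The list is copied from the "OMOP required" list above; the helper name `legacy_type_vocabularies` and the rule "the ID ends in ` Type`" are assumptions made for illustration, not something Athena implements.

```python
# Vocabulary IDs currently flagged "OMOP required" (copied from the list above).
REQUIRED_NOW = [
    "CDM", "Cohort Type", "Concept Class", "Condition Status", "Condition Type",
    "Cost", "Cost Type", "Death Type", "Device Type", "Domain", "Drug Type",
    "Episode", "Korean Revenue Code", "Meas Type", "Metadata", "None",
    "Note Type", "Observation Type", "Obs Period Type", "OSM", "Plan",
    "Plan Stop Reason", "Procedure Type", "Relationship", "SOPT", "Sponsor",
    "Type Concept", "UB04 Point of Origin", "UB04 Pri Typ of Adm",
    "UB04 Pt dis status", "UB04 Typ bill", "UCUM", "US Census", "Visit",
    "Visit Type", "Vocabulary",
]


def legacy_type_vocabularies(vocabulary_ids):
    """Return the IDs that end in ' Type' (hypothetical rule for the old type vocabularies).

    'Type Concept' does not end in ' Type', so it is kept automatically.
    """
    return [v for v in vocabulary_ids if v.endswith(" Type")]


to_drop = legacy_type_vocabularies(REQUIRED_NOW)
still_required = [v for v in REQUIRED_NOW if v not in to_drop]

print("Candidates to drop from 'required':", to_drop)
print("Remaining 'required':", still_required)
```

On the list above this picks out the twelve per-domain type vocabularies and leaves Type Concept and the UB04 vocabularies untouched.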

There was also the notion to mark more vocabularies as default that have standard concepts.

Here are the ones with standard concepts or classifications and their respective counts, together with a proposal for how to set the default and required flags:

| vocabulary_id | description | S / C | row_count | default now | default future | required now | required future |
|---|---|---|---|---|---|---|---|
| ABMS | Provider Specialty (American Board of Medical Specialties) | S | 85 | X | X | | |
| AMT | Australian Medicines Terminology (NEHTA) | S | 6839 | | | | |
| APC | Ambulatory Payment Classification (CMS) | S | 715 | | | | |
| ATC | WHO Anatomic Therapeutic Chemical Classification | C | 6509 | X | X | | |
| BDPM | Public Database of Medications (Social-Sante) | S | 1106 | | | | |
| Cancer Modifier | Diagnostic Modifiers of Cancer (OMOP) | S | 3251 | | | | |
| CDM | OMOP Common DataModel | S | 1045 | X | | X | X |
| CDT | Current Dental Terminology (ADA) | S | 869 | | | | |
| CMS Place of Service | Place of Service Codes for Professional Claims (CMS) | S | 51 | X | X | | |
| Cohort | Legacy OMOP HOI or DOI cohort | C | 78 | | | | |
| Condition Status | OMOP Condition Status | S | 22 | X | | X | X |
| Cost | OMOP Cost | S | 51 | | | X | X |
| CPT4 | Current Procedural Terminology version 4 (AMA) | C | 3492 | X | X | | |
| CPT4 | Current Procedural Terminology version 4 (AMA) | S | 12922 | X | X | | |
| Currency | International Currency Symbol (ISO 4217) | S | 180 | X | X | | |
| CVX | CDC Vaccine Administered CVX (NCIRD) | S | 217 | | | | |
| DA_France | Disease Analyzer France (IQVIA) | S | 6366 | | | | |
| dm+d | Dictionary of Medicines and Devices (NHS) | S | 21071 | | | | |
| DRG | Diagnosis-related group (CMS) | S | 752 | | | | |
| EphMRA ATC | Anatomical Classification of Pharmaceutical Products (EphMRA) | C | 895 | | | | |
| Episode | OMOP Episode | S | 14 | X | | X | X |
| ETC | Enhanced Therapeutic Classification (FDB) | C | 2755 | | | | |
| Ethnicity | OMOP Ethnicity | S | 2 | X | X | | |
| Gemscript | Gemscript (Resip) | S | 64761 | | | | |
| Gender | OMOP Gender | S | 2 | X | | | X |
| GGR | Commented Drug Directory (BCFI) | S | 751 | | | | |
| GRR | Global Reference Repository (IQVIA) | S | 138739 | | | | |
| HCPCS | Healthcare Common Procedure Coding System (CMS) | S | 8427 | X | X | | |
| HemOnc | HemOnc | C | 367 | | | | |
| HemOnc | HemOnc | S | 2015 | | | | |
| HES Specialty | Hospital Episode Statistics Specialty (NHS) | S | 57 | | | | |
| ICD10PCS | ICD-10 Procedure Coding System (CMS) | S | 194874 | | | | |
| ICD9Proc | International Classification of Diseases, Ninth Revision, Clinical Modification, Volume 3 (NCHS) | S | 2223 | X | X | | |
| ICDO3 | International Classification of Diseases for Oncology, Third Edition (WHO) | S | 56972 | | | | |
| Indication | Indications and Contraindications (FDB) | C | 4739 | | | | |
| ISBT | Information Standard for Blood and Transplant 128 Product (ICCBBA) | S | 17336 | | | | |
| ISBT Attribute | Information Standard for Blood and Transplant 128 Product Attribute (ICCBBA) | C | 1657 | | | | |
| JMDC | Japan Medical Data Center Drug Code (JMDC) | S | 1313 | | | | |
| KDC | Korean Drug Code (HIRA) | S | 112 | | | | |
| KNHIS | Korean National Health Information System | S | 3 | | | | |
| Korean Revenue Code | Korean Revenue Code | S | 7 | X | | | |
| LOINC | Logical Observation Identifiers Names and Codes (Regenstrief Institute) | C | 48305 | X | X | | |
| LOINC | Logical Observation Identifiers Names and Codes (Regenstrief Institute) | S | 110702 | X | X | | |
| LPD_Australia | Longitudinal Patient Data Australia (IQVIA) | S | 1620 | | | | |
| MDC | Major Diagnostic Categories (CMS) | S | 26 | | | | |
| MedDRA | Medical Dictionary for Regulatory Activities (MSSO) | C | 76939 | | | | |
| Medicare Specialty | Medicare provider/supplier specialty codes (CMS) | S | 112 | X | X | | |
| Metadata | Metadata | S | 1 | X | | X | X |
| MMI | Modernizing Medicine (MMI) | S | 4 | | | | |
| NAACCR | Data Standards & Data Dictionary Volume II (NAACCR) | S | 26105 | | | | |
| NCIt | NCI Thesaurus (National Cancer Institute) | S | 1899 | | | | |
| NDC | National Drug Code (FDA and manufacturers) | S | 11219 | X | X | | |
| Nebraska Lexicon | Nebraska Lexicon | S | 4187 | | | | |
| NFC | New Form Code (EphMRA) | C | 692 | | | | |
| NUCC | National Uniform Claim Committee Health Care Provider Taxonomy Code Set (NUCC) | S | 674 | X | X | | |
| OMOP Extension | OMOP Extension (OHDSI) | S | 553 | X | X | | |
| OMOP Genomic | OMOP Genomic vocabulary | S | 79791 | | | | |
| OPCS4 | OPCS Classification of Interventions and Procedures version 4 (NHS) | S | 2373 | | | | |
| OSM | OpenStreetMap | S | 203339 | | | X | |
| PCORNet | National Patient-Centered Clinical Research Network (PCORI) | S | 2 | | | | |
| Plan | Health Plan - contract to administer healthcare transactions by the payer, facilitated by the sponsor | S | 11 | X | | X | X |
| Plan Stop Reason | Plan Stop Reason - Reason for termination of the Health Plan | S | 13 | X | | X | X |
| PPI | AllOfUs_PPI (Columbia) | S | 2120 | | | | |
| Provider | OMOP Provider | S | 6 | | | | X |
| Race | Race and Ethnicity Code Set (USBC) | S | 50 | X | X | | |
| Relationship | OMOP Relationship | S | 14 | X | | X | X |
| Revenue Code | UB04/CMS1450 Revenue Codes (CMS) | S | 538 | X | X | | |
| RxNorm | RxNorm (NLM) | C | 35087 | X | X | | |
| RxNorm | RxNorm (NLM) | S | 148139 | X | X | | |
| RxNorm Extension | RxNorm Extension (OHDSI) | S | 1819247 | X | X | | |
| SMQ | Standardised MedDRA Queries (MSSO) | C | 318 | | | | |
| SNOMED | Systematic Nomenclature of Medicine - Clinical Terms (IHTSDO) | S | 540590 | X | X | | |
| SNOMED Veterinary | SNOMED Veterinary | S | 31994 | | | | |
| SOPT | Source of Payment Typology (PHDSC) | S | 162 | X | X | X | |
| SPL | Structured Product Labeling (FDA) | C | 573209 | X | X | | |
| Sponsor | Sponsor - institution or individual financing healthcare transactions | S | 6 | X | X | X | X |
| Type Concept | OMOP Type Concept | S | 79 | X | | X | X |
| UB04 Pri Typ of Adm | UB04 Claim Inpatient Admission Type Code (CMS) | S | 6 | | | X | X |
| UB04 Typ bill | UB04 Type of Bill - Institutional (USHIK) | S | 4 | | | X | X |
| UCUM | Unified Code for Units of Measure (Regenstrief Institute) | S | 922 | X | | X | X |
| UK Biobank | UK Biobank | C | 292 | | | | |
| UK Biobank | UK Biobank | S | 3837 | | | | |
| US Census | United States Census Bureau | S | 13 | | | X | X |
| Visit | OMOP Visit | S | 19 | X | | X | X |
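Counts like the ones in this table can presumably be reproduced with a grouped query over the OMOP CONCEPT table. The sketch below assumes the standard `vocabulary_id` and `standard_concept` columns and runs against an in-memory SQLite database with a handful of made-up rows, so the numbers it prints are toy values, not Athena content. The description column would have to come from the VOCABULARY table, which is left out here.

```python
import sqlite3

# In-memory stand-in for the CONCEPT table; only the columns the query needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept ("
    "concept_id INTEGER PRIMARY KEY, vocabulary_id TEXT, standard_concept TEXT)"
)

# Made-up toy rows for illustration only (not real concept IDs or counts).
toy_rows = [
    (1, "Gender", "S"),
    (2, "Gender", "S"),
    (3, "ATC", "C"),
    (4, "NDC", "S"),
    (5, "NDC", None),      # non-standard concept, excluded by the filter
    (6, "ICD10CM", None),  # vocabulary with no S/C rows in this toy data
]
conn.executemany("INSERT INTO concept VALUES (?, ?, ?)", toy_rows)

# One row per (vocabulary, S/C) pair, counting only standard and classification concepts.
QUERY = """
SELECT vocabulary_id, standard_concept, COUNT(*) AS row_count
FROM concept
WHERE standard_concept IN ('S', 'C')
GROUP BY vocabulary_id, standard_concept
ORDER BY vocabulary_id, standard_concept
"""

counts = conn.execute(QUERY).fetchall()
for vocabulary_id, s_or_c, row_count in counts:
    print(vocabulary_id, s_or_c, row_count)
```

Grouping by both columns is what yields separate S and C rows for vocabularies such as CPT4, LOINC and RxNorm.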

Please review, @cgreich and @fdefalco!

Thanks - mik

@fdefalco

fdefalco commented Apr 7, 2022

I agree with most of the entries in the table. Ones I would question if they should be required in the future:

US Census
UB04 Typ bill
UB04 Pri Typ of Adm
Sponsor

We could also remove the idea of "Required" in the interest of transparency. Instead, a note on the page would mark a vocabulary as "Highly Recommended" when it is what we currently consider "Required", while still allowing the user to deselect it.

Then we would only have a boolean for "Default" for each vocabulary that can be edited by the user when creating their vocabulary download.
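A minimal sketch of what such a model could look like, assuming one record per vocabulary with a single `default` boolean and a purely informational note. The class name, field names, warning text and flag values are hypothetical and only illustrate the idea; nothing here reflects how Athena is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabularyOption:
    vocabulary_id: str
    default: bool   # pre-selected when the download page opens
    note: str = ""  # e.g. "Highly Recommended"; shown to the user, never enforced


def initial_selection(options):
    """Vocabularies that are ticked before the user changes anything."""
    return {o.vocabulary_id for o in options if o.default}


def warnings_for(options, selection):
    """One warning per default vocabulary that the user has deselected."""
    return [
        f"{o.vocabulary_id} is selected by default"
        f"{' (' + o.note + ')' if o.note else ''}; remove it at your own risk."
        for o in options
        if o.default and o.vocabulary_id not in selection
    ]


# Hypothetical flag values, for illustration only.
options = [
    VocabularyOption("CDM", default=True, note="Highly Recommended"),
    VocabularyOption("SNOMED", default=True),
    VocabularyOption("OSM", default=False),
]

selection = initial_selection(options) - {"CDM"}  # the user unticks CDM
messages = warnings_for(options, selection)
print(messages)
```

The point of the design is that nothing is locked: deselecting a default only produces a warning, and the download proceeds with whatever the user chose.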

@mik-ohdsi
Author

Hi, thanks for the input, @fdefalco
I think we have to keep "required" for the very foundational data that a CDM needs in order to function (CDM, Metadata, and a couple of others).
The ones you listed we would keep in for ease of use (they are small and in most cases needed), as it would not really make sense to NOT load them.
@cgreich , can you provide more input?
thanks - Mik

@mik-ohdsi mik-ohdsi added the bug Defects label Apr 22, 2022
@fdefalco

Is there a timeline for implementation of this particular feature?

@mik-ohdsi
Author

I had hoped @cgreich would give us his final "placet". I would then hand over the above list for processing by the vocab team, and it should go to Athena with the next release.

@mik-ohdsi
Author

Bumping this issue, @cgreich and @fdefalco.
What is the verdict?
I would also add the CVX vocabulary to default. And we have that funny OMOP supplier vocabulary with one non-standard concept in it... Do we need that?

@mik-ohdsi
Author

@ssuvorov-fls - could you check whether the new settings above would break anything once they end up in Athena? Can we test-run this in a QA instance?

@Alexdavv
Member

Alexdavv commented May 13, 2022

> Korean Revenue Code, OSM and SOPT seemed a little off to be indispensable for an OMOP CDM

I think the unspoken convention was to include everything that belongs to a Domain without a table of its own, so that you don't miss the concepts for such "service" things as gender_concept_id, unit_concept_id, modifier_concept_id, route_concept_id, etc. After all, it's not really obvious which vocabularies to pick if you want to add one more table/domain to your CDM. Region_concept_id somehow never materialized as a field, but it explains why OSM and US Census are there.

> I guess we can remove all the ones with a "Type" in their name except for the new Type Concepts as they have replaced them

I wouldn't do it, because users who are updating their ETLs from old vocabulary versions would simply lose the concepts that appear in their mappings. I would never do it for the small "service" vocabularies.

> Here are the ones with standard concepts or classifications and their respective count together with a proposal how to set the default and required flags

I didn't get the logic behind this. How is Gender more important than Race? And why is Sponsor better than Geography?
We need to come up with clear rules.

> There was also the notion to mark more vocabularies as default that have standard concepts

I don't think it's a great choice before we have cleaned up the EAV data. Otherwise, people will start mapping to UKB, PPI and NAACCR. And that is already the case.

@mik-ohdsi
Author

mik-ohdsi commented May 13, 2022

> I think the unspoken convention was to include everything that belongs to a Domain without a table of its own, so that you don't miss the concepts for such "service" things as gender_concept_id, unit_concept_id, modifier_concept_id, route_concept_id, etc. After all, it's not really obvious which vocabularies to pick if you want to add one more table/domain to your CDM. Region_concept_id somehow never materialized as a field, but it explains why OSM and US Census are there.

OSM is, however, one of the reasons this whole discussion started... I guess I would still take it out of "required".

> I wouldn't do it, because users who are updating their ETLs from old vocabulary versions would simply lose the concepts that appear in their mappings. I would never do it for the small "service" vocabularies.

Hmm... have we mapped the old type concepts over to the new ones? If so, it would make sense to keep them. But otherwise, aren't they simply useless now and all non-standard?

> I didn't get the logic behind this. How is Gender more important than Race? And why is Sponsor better than Geography? We need to come up with clear rules.

Well, this is derived a little from how it was before. Gender is really indispensable, whereas Race & Ethnicity is, as we know, US-centric... and they are still marked as default, so most people will keep them in their download. They just have a choice to deselect.

> > There was also the notion to mark more vocabularies as default that have standard concepts

> I don't think it's a great choice before we have cleaned up the EAV data. Otherwise, people will start mapping to UKB, PPI and NAACCR. And that is already the case.

Of course we would not follow that notion blindly, and hence the above are not marked as default. But you cannot prevent people from selecting them for download unless we make them something like license-restricted (only restricted by something other than a license).

@fdefalco

The original intent of the discussion was to promote transparency and flexibility in vocabulary download. As it stands, vocabularies that are not listed or selected are included in the download, so for transparency they should be listed and selected by default. For flexibility, the user should have the option to unselect vocabularies. I'm not sure what benefit preventing a user from unselecting a vocabulary would provide; if you reject defaults, you should be doing so for a well-understood reason. Perhaps a warning on the page that says 'Default vocabularies are selected to provide important concepts to most ETL processes, remove them from the selected vocabularies at your own risk.' :)

@mik-ohdsi
Author

@cgreich has an even stricter view on this. I think he used the word "dogmatic". Let's hear him out. (Christian, one exception to the rule should be vocabularies that have standard items but are also license restricted such as CDT or ISBT).

@fdefalco

I think Patrick echoed my concern on transparency here: https://forums.ohdsi.org/t/osm-vocabulary/16303/11

@cgreich
Contributor

cgreich commented May 17, 2022

Are we debating here or there?

@fdefalco

We are discussing the changes to be made as part of this issue here, informed by the conversation there. I don't think there is any debate regarding the need for transparency of vocabularies that are included in a download. I imagine the remaining debate is whether or not to give the user the ability to control whether 'default' vocabularies are included. My vote is that the user is provided control, with a stern warning about why defaults should be left as is.

@cgreich
Contributor

cgreich commented May 21, 2022

@fdefalco:

Hang on a sec. Right now, the thinking is we have three categories (not two):

  • Default. This is what everybody always has to have, since it is part of the OMOP CDM. These are the standard and classification concepts. Not vocabularies.
  • Recommended. These are vocabularies (beyond their standard concepts) which are pre-clicked, but can be unclicked. The problem is what they are. It strongly depends on the geography of the data source what should be recommended. In the US, NDC would be recommended (only the devices are standard, until we figure that out), but, say, in France NDC is a huge corpus of useless concepts.
  • Rest. These are the vocabularies that are not recommended. They are not checked, but can be clicked.

The proprietary vocabularies are in the Rest category, since they need to be individually clicked and processed anyway.

We will have to change Athena to always include all standard concepts (easy), and create different sets of recommended vocabularies (North America, Europe, Rest of World maybe). Not a big deal, but will require some work.
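One way to read this proposal, as a minimal sketch under stated assumptions: every standard (S) or classification (C) concept always goes into the download, and the recommended or hand-picked vocabularies add their remaining concepts on top. The function name, the toy concept rows and the contents of the regional sets are hypothetical; only the placement of NDC under North America follows the example given above.

```python
# Hypothetical regional presets; the real sets would still have to be defined.
RECOMMENDED_BY_REGION = {
    "North America": {"NDC"},
    "Europe": set(),
    "Rest of World": set(),
}


def concepts_for_download(concepts, selected_vocabularies):
    """Keep all S/C concepts, plus every concept of a selected vocabulary."""
    return [
        c
        for c in concepts
        if c["standard_concept"] in ("S", "C")
        or c["vocabulary_id"] in selected_vocabularies
    ]


# Made-up toy rows for illustration only (not real concept IDs).
concepts = [
    {"concept_id": 1, "vocabulary_id": "NDC", "standard_concept": "S"},
    {"concept_id": 2, "vocabulary_id": "NDC", "standard_concept": None},
    {"concept_id": 3, "vocabulary_id": "SNOMED", "standard_concept": "S"},
    {"concept_id": 4, "vocabulary_id": "SNOMED", "standard_concept": None},
]

europe = concepts_for_download(concepts, RECOMMENDED_BY_REGION["Europe"])
north_america = concepts_for_download(concepts, RECOMMENDED_BY_REGION["North America"])

print([c["concept_id"] for c in europe])
print([c["concept_id"] for c in north_america])
```

With these toy rows, the Europe preset yields only the standard concepts, while the North America preset additionally pulls in the non-standard NDC row.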

5 participants