-
Jenny, |
-
@bold, ***@***.***>
Then we need to be specific about how different encounters are uploaded for a given person_id. We need a recipe for uploading distinct hospitalizations. One recipe could be to simply append to the OMOP tables. For example, add a second line to the visit_occurrence table to reflect the second admission and modify the observation_period table to reflect the last available data element for that person_id; all the other tables would be straight unions across admissions.
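As a rough illustration of the append recipe described above, here is a minimal Python sketch. It assumes rows are held as dictionaries keyed by standard OMOP column names; the function name `append_admission` and all sample values are hypothetical and not part of any agreed SOP.

```python
from datetime import date


def append_admission(visits, obs_period, new_visit):
    """Return (visits, obs_period) with one more admission for the same person.

    Hypothetical helper: the inputs are left unmodified and new objects are returned.
    """
    known_ids = {v["visit_occurrence_id"] for v in visits}
    if new_visit["visit_occurrence_id"] in known_ids:
        raise ValueError("visit_occurrence_id already present")

    # VISIT_OCCURRENCE simply gains one line for the new admission.
    updated_visits = visits + [new_visit]

    # OBSERVATION_PERIOD is widened so it still covers all available data.
    updated_period = dict(obs_period)
    updated_period["observation_period_start_date"] = min(
        obs_period["observation_period_start_date"], new_visit["visit_start_date"]
    )
    updated_period["observation_period_end_date"] = max(
        obs_period["observation_period_end_date"], new_visit["visit_end_date"]
    )
    return updated_visits, updated_period


# Made-up example: person 123 is admitted a second time in March.
first_visits = [
    {"visit_occurrence_id": 1, "person_id": 123,
     "visit_start_date": date(2023, 1, 1), "visit_end_date": date(2023, 1, 5)},
]
first_period = {
    "person_id": 123,
    "observation_period_start_date": date(2023, 1, 1),
    "observation_period_end_date": date(2023, 1, 5),
}
second_visit = {"visit_occurrence_id": 2, "person_id": 123,
                "visit_start_date": date(2023, 3, 10), "visit_end_date": date(2023, 3, 12)}

all_visits, period = append_admission(first_visits, first_period, second_visit)
```

The other clinical tables would be concatenated in the same way as `updated_visits`; only OBSERVATION_PERIOD needs its bounds recomputed.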
Gilles
From: Del Bold ***@***.***>
Sent: Thursday, November 30, 2023 3:18 PM
To: chorus-ai/data_acq_SOP ***@***.***>
Cc: Clermont, Gilles ***@***.***>; Mention ***@***.***>
Subject: Re: [chorus-ai/data_acq_SOP] Target folder structure (Discussion #14)
@clermontg<https://github.com/clermontg> Could I suggest that we not enforce visit_occurrence_id on the folders? A continuous waveform can sometimes belong to several encounters. Where would we put that waveform? We wouldn't want to break it up. Sometimes encounters overlap, for instance when a patient in the ICU leaves the unit to have a procedure done. There should be one physical continuous waveform. You know what I mean?
Sites Directory Structure
* /person_id
  * /OMOP
    * PERSON.csv
    * OBSERVATION_PERIOD.csv
    * VISIT_OCCURRENCE.csv
    * VISIT_DETAIL.csv
    * CONDITION_OCCURRENCE.csv
    * DRUG_EXPOSURE.csv
    * PROCEDURE_OCCURRENCE.csv
    * DEVICE_EXPOSURE.csv
    * MEASUREMENT.csv
    * OBSERVATION.csv
    * DEATH.csv
    * NOTE.csv
    * NOTE_NLP.csv
    * FACT_RELATIONSHIP.csv
  * /IMAGES
    * person_id-study_id.hdf5
  * /WAVEFORMS
    * person_id-starttime1-range.hdf5
    * person_id-starttime2-range.hdf5
With person_id and starttime, we can align the waveform with multiple encounters. -Del
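A minimal sketch of how paths in this layout could be generated. The exact encoding of `starttime` and `range` is not fixed in this thread, so the sketch assumes epoch seconds for the start and a length in seconds for the range; both helper names are hypothetical.

```python
from pathlib import PurePosixPath


def omop_path(person_id, table):
    """Path of one OMOP table inside a person's folder, e.g. 123/OMOP/PERSON.csv."""
    return PurePosixPath(str(person_id)) / "OMOP" / f"{table.upper()}.csv"


def waveform_path(person_id, start_epoch_s, range_s):
    """Path of one waveform file; start and range encodings are assumptions."""
    name = f"{person_id}-{start_epoch_s}-{range_s}.hdf5"
    return PurePosixPath(str(person_id)) / "WAVEFORMS" / name


example_table = str(omop_path(123, "visit_occurrence"))
example_waveform = str(waveform_path(123, 1672568130, 3600))
```

Because the file name carries only person_id and time, the same file can later be matched to any number of encounters without being split.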
-
We are not requesting the NOTE table (yet).
Chester, I am not sure what you mean by your comment about occurrence and detail. Visit_detail reflects all the nursing units a patient has moved across during a single visit occurrence, so it is essential for computing ICU length of stay and the datetimes of ICU admission and discharge. It is true that one can build visit_occurrence from visit_detail. Is this what you meant?
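To make the role of visit_detail concrete, here is a small Python sketch that derives ICU admission, ICU discharge, and ICU length of stay from visit_detail rows. The `is_icu` flag is an assumption for illustration; in practice each site would derive it from its own unit or concept mapping. The function name and sample rows are also hypothetical.

```python
from datetime import datetime, timedelta


def icu_summary(details):
    """Return (first ICU admission, last ICU discharge, total ICU time) or None."""
    icu_rows = [d for d in details if d["is_icu"]]
    if not icu_rows:
        return None
    admission = min(d["visit_detail_start_datetime"] for d in icu_rows)
    discharge = max(d["visit_detail_end_datetime"] for d in icu_rows)
    # Sum the ICU segments separately so time spent on the ward is not counted.
    total = sum(
        (d["visit_detail_end_datetime"] - d["visit_detail_start_datetime"]
         for d in icu_rows),
        timedelta(),
    )
    return admission, discharge, total


# Made-up stay: ward, ICU (48 h), ward, ICU again (12 h).
stay = [
    {"is_icu": False, "visit_detail_start_datetime": datetime(2023, 1, 1, 8),
     "visit_detail_end_datetime": datetime(2023, 1, 1, 20)},
    {"is_icu": True, "visit_detail_start_datetime": datetime(2023, 1, 1, 20),
     "visit_detail_end_datetime": datetime(2023, 1, 3, 20)},
    {"is_icu": False, "visit_detail_start_datetime": datetime(2023, 1, 3, 20),
     "visit_detail_end_datetime": datetime(2023, 1, 4, 10)},
    {"is_icu": True, "visit_detail_start_datetime": datetime(2023, 1, 4, 10),
     "visit_detail_end_datetime": datetime(2023, 1, 4, 22)},
]
icu_admit, icu_discharge, icu_los = icu_summary(stay)
```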
Gilles
From: Chester Guan (Ziyuan Guan) ***@***.***>
Sent: Thursday, November 30, 2023 4:05 PM
Here are some of my naïve ideas, and I think our key goal right now is to upload the data and keep all the capabilities to organize the data as we want eventually.
@del42<https://github.com/del42>, I think maybe you used visit_occurrence_id as visit_detail_id, and that's fine.
Although it is okay for UF data to link on visit_occurrence_id, I still vote for Del's idea. The combination of person_id and start timestamp will be a unique identifier for waveform files and will not lose any information needed to link with EHR data.
We may need to consider the timezone issue when we use timestamps. UNIX epoch time would be okay.
Besides that, I suggest that each site provide crosswalk files when uploading unstructured data, indicating:
For waveform:
* patientID
* visit_occurrence_id or visit_detail_id or other EHR indexer
* admissionStartTime -> it can be ICU admission or hospital admission based on data's granularity
* admissionEndTime -> same as above
* waveFormStartTime
* waveFormFilePath or other unique file path identifier
* (optional) waveFormEndTime (or an equivalent, such as the recording duration)
Although the above information can be captured/extracted from target HDF5 or CCDEF waveform files, I believe a simple crosswalk file will be much easier.
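A minimal sketch of what writing such a waveform crosswalk could look like, with all times stored as UNIX epoch seconds to sidestep the timezone issue. The column names follow the list above; the helper names and the choice to reject timezone-naive datetimes are assumptions for illustration.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

FIELDS = [
    "patientID", "visit_occurrence_id", "admissionStartTime", "admissionEndTime",
    "waveFormStartTime", "waveFormFilePath", "waveFormEndTime",
]


def to_epoch(moment):
    """Convert a timezone-aware datetime to UNIX epoch seconds."""
    if moment.tzinfo is None:
        raise ValueError("naive datetime: the site timezone must be attached first")
    return int(moment.timestamp())


def crosswalk_csv(rows):
    """Serialize crosswalk rows (dicts keyed by FIELDS) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# The same instant expressed in two timezones yields the same epoch value.
site_tz = timezone(timedelta(hours=-5))
local_start = datetime(2023, 1, 1, 5, 15, 30, tzinfo=site_tz)
utc_start = datetime(2023, 1, 1, 10, 15, 30, tzinfo=timezone.utc)

crosswalk_text = crosswalk_csv([{
    "patientID": 123,
    "visit_occurrence_id": 1,
    "admissionStartTime": to_epoch(datetime(2023, 1, 1, 0, 0, tzinfo=site_tz)),
    "admissionEndTime": to_epoch(datetime(2023, 1, 5, 0, 0, tzinfo=site_tz)),
    "waveFormStartTime": to_epoch(local_start),
    "waveFormFilePath": "123/WAVEFORMS/example.hdf5",
    "waveFormEndTime": to_epoch(local_start) + 3600,
}])
```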
For Images: (I am not sure how to use studyID to locate the images, but from my understanding, it's similar to waveform files)
* patientID
* visit_occurrence_id or visit_detail_id or other EHR indexer
* admissionStartTime
* admissionEndTime
* imageFilePath or other unique file handler
As long as the image was produced during the admission period, it should be fine.
-
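Could I suggest how we should name the files in the waveform and image folders?
Clinical Waveform Data Naming Convention in Source Format
Adjustments to the naming convention for clinical waveform data stored in this case in HDF5 format are made to include the duration in seconds and exclude the modality or type of waveform. The format is as follows:
* Patient Identification: A unique person ID, typically the numeric person_id from the OMOP PERSON table.
* Study Date: The date when the waveform data was recorded, in the format YYYYMMDD.
* Start Time: The start time of the recording, in HHMMSS format.
* Duration in Seconds: The duration of the recording in seconds.
Example File Name
Person123_20230101_101530_3600.h5
This example represents a recording for person 123 on January 1, 2023, starting at 10:15:30, with a duration of 3600 seconds.
(DICOM) Image Naming Convention
* Patient Identification: Typically includes a patient ID, which could be a number or a combination of letters and numbers, assigned by the medical facility.
* Study Date: The date when the study was conducted, usually in the format YYYYMMDD.
* Study Time: The time of the study, often in HHMMSS format.
* Modality: Refers to the type of equipment used for the scan, such as MR (Magnetic Resonance), CT (Computed Tomography), or US (Ultrasound).
* Series Number: Indicates the sequence of a particular series of images in a study.
* Instance Number: Represents the specific image number within a series.
Example of a DICOM Image Name
PatientID_20230101_101530_CT_01_001.dcm
This example would represent a CT scan for a patient, conducted on January 1, 2023, at 10:15:30. This image belongs to the first series of the study and is the first instance in that series.
Notes
* This naming convention simplifies the retrieval for the APIs and central cloud processing ETLs.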
-
So, if a recording covers 8 days of data, the date and time of the first entry in the h5 should be used, correct?
From: Del Bold ***@***.***>
Sent: Monday, December 4, 2023 1:01 PM
Could I suggest how we should name the files in the waveform and image folders?
This is how I would like to submit. Please take it or leave it.
Clinical Waveform Data Naming Convention in Source Format
Adjustments to the naming convention for clinical waveform data stored in this case in HDF5 format are made to include the duration in seconds and exclude the modality or type of waveform. The format is as follows:
* Patient Identification: A unique person ID, typically the numeric person_id from the OMOP PERSON table.
* Study Date: The date when the waveform data was recorded, in the format YYYYMMDD.
* Start Time: The start time of the recording, in HHMMSS format.
* Duration in Seconds: The duration of the recording in seconds.
Example File Name
Person123_20230101_101530_3600.h5
This example represents a recording for person 123 on January 1, 2023, starting at 10:15:30, with a duration of 3600 seconds.
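A minimal sketch of how a consumer could parse this waveform file name back into its parts. The regular expression and the function name are assumptions based only on the example above; the convention does not specify a timezone, so the parsed start time is left timezone-naive here.

```python
import re
from datetime import datetime, timedelta

# Person<id>_<YYYYMMDD>_<HHMMSS>_<duration in seconds>.h5
_WAVEFORM_NAME = re.compile(r"^Person(\d+)_(\d{8})_(\d{6})_(\d+)\.h5$")


def parse_waveform_name(file_name):
    """Split a waveform file name into person_id, start, duration, and end."""
    match = _WAVEFORM_NAME.match(file_name)
    if match is None:
        raise ValueError(f"not a valid waveform file name: {file_name!r}")
    person_id, day, clock, duration = match.groups()
    start = datetime.strptime(day + clock, "%Y%m%d%H%M%S")
    return {
        "person_id": int(person_id),
        "start": start,
        "duration_s": int(duration),
        "end": start + timedelta(seconds=int(duration)),
    }


one_hour = parse_waveform_name("Person123_20230101_101530_3600.h5")
# A recording covering 8 days keeps the timestamp of its first entry in the name.
eight_days = parse_waveform_name("Person123_20230101_101530_691200.h5")
```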
(DICOM) Image Naming Convention
* Patient Identification: Typically includes a patient ID, which could be a number or a combination of letters and numbers, assigned by the medical facility.
* Study Date: The date when the study was conducted, usually in the format YYYYMMDD.
* Study Time: The time of the study, often in HHMMSS format.
* Modality: Refers to the type of equipment used for the scan, such as MR (Magnetic Resonance), CT (Computed Tomography), or US (Ultrasound).
* Series Number: Indicates the sequence of a particular series of images in a study.
* Instance Number: Represents the specific image number within a series.
Example of a DICOM Image Name
PatientID_20230101_101530_CT_01_001.dcm
This example would represent a CT scan for a patient, conducted on January 1, 2023, at 10:15:30. This image belongs to the first series of the study and is the first instance in that series.
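A small sketch of how a name following this image convention could be assembled. The zero-padding widths (two digits for the series, three for the instance) are inferred from the single example above and are an assumption, as is the function name.

```python
from datetime import datetime


def dicom_file_name(patient_id, study_time, modality, series_number, instance_number):
    """Build <patient>_<YYYYMMDD>_<HHMMSS>_<modality>_<series>_<instance>.dcm."""
    return (
        f"{patient_id}_{study_time:%Y%m%d}_{study_time:%H%M%S}_"
        f"{modality}_{series_number:02d}_{instance_number:03d}.dcm"
    )


example_name = dicom_file_name("PatientID", datetime(2023, 1, 1, 10, 15, 30), "CT", 1, 1)
```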
Notes
* This naming convention simplifies the retrieval for the APIs and central cloud processing ETLs.
-
Del,
I will not micromanage this and am fine with any scheme that works. Things that make sense for me usually do not make sense for most other humans 😉 Did you modify the text in the Draft request or do you want me to?
Gilles
From: Del Bold ***@***.***>
Sent: Monday, December 4, 2023 10:51 AM
All I am suggesting is that we name the waveform file with person_id, timestamp, and duration. That is how I have managed it in the past, and linking the raw waveform using that information works much better. It does not require us to break up the continuous file or to make duplicates for overlapping encounters. Plus, I am a bit afraid of linking a file to the wrong encounter and making it unusable.
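As an illustration of this linking approach, the sketch below finds every encounter that a waveform overlaps in time, using only the start time and duration carried by the file name. The visit rows use the OMOP column names `visit_start_datetime` and `visit_end_datetime`; the function name and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def overlapping_visits(wave_start, duration_s, visits):
    """Return visit_occurrence_ids whose time span overlaps the waveform."""
    wave_end = wave_start + timedelta(seconds=duration_s)
    return [
        v["visit_occurrence_id"]
        for v in visits
        # Two intervals overlap when each one starts before the other ends.
        if v["visit_start_datetime"] < wave_end and v["visit_end_datetime"] > wave_start
    ]


# Made-up encounters for one person; the waveform spans the first two.
person_visits = [
    {"visit_occurrence_id": 1, "visit_start_datetime": datetime(2023, 1, 1, 8),
     "visit_end_datetime": datetime(2023, 1, 3, 12)},
    {"visit_occurrence_id": 2, "visit_start_datetime": datetime(2023, 1, 3, 12),
     "visit_end_datetime": datetime(2023, 1, 6, 9)},
    {"visit_occurrence_id": 3, "visit_start_datetime": datetime(2023, 2, 1, 0),
     "visit_end_datetime": datetime(2023, 2, 2, 0)},
]
linked = overlapping_visits(datetime(2023, 1, 2, 0), 2 * 24 * 3600, person_visits)
```

The waveform file itself stays in one piece; the link to encounters is computed when needed rather than baked into the folder or file name.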
-
@del42
OK, this one we do need to talk about.
So, how do you suggest representing two distinct admissions in our suggested folder structure?
Gilles
From: Del Bold ***@***.***>
Sent: Monday, December 4, 2023 10:46 AM
@clermontg<https://github.com/clermontg> I am a bit confused by your comment that we need a recipe for uploading distinct hospitalizations. Encounters are visit_occurrences in OMOP, and a person can have several visit_occurrences. We probably don't need to add a second line to the visit_occurrence table to reflect the second admission or modify the observation_period table to reflect the last available data element for that person_id. To my understanding, it is already there.
-
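Gilles, are we sure the inclusion of a per-patient OMOP folder in this structure is required? Having the OMOP data, including all the vocabulary, in two places would seem to:
* create very large redundancies between the patient-level OMOP and the central OMOP
* create data integrity challenges
If the rationale is very clear, I don't want to suggest that this approach should be reconsidered. Is there a source of information that gives a clear understanding of
* how those redundancy and integrity risks are managed
* the added value of the person-level OMOP instance for required functions like defining cohorts or ML feature engineering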
Given the ability to link files to the aggregate OMOP structure, we might be wise to make sure we're not taking on large tool development tasks for problems that might already be mostly solved, like querying data to define cohorts or generating person-level data structures for feature engineering.
Querying to define cohorts and generating data frames (in Python or R) are already handled by mature software pipelines in the OHDSI tool stack. If we need different functionality for doing that with waveforms and images, an alternative might be needed. The documentation of the data extraction functions in the PLP package illustrates how data extraction and transformation to a row-per-person data frame is supported within OHDSI for ML model development. Those functions use the FeatureExtraction package to do that work. FeatureExtraction gets data from a relational OMOP DB structure and puts it into a row-per-patient data frame. Output from FeatureExtraction is stored within an object that includes associated metadata about the meaning of default indicator variables (e.g. condition occurrence) or custom covariates (lab or other numeric values, trends, etc.). The OHDSI pipelines like those in PLP use those objects for feature engineering by the many flavors of ML supported in the OHDSI HADES packages, like the DeepPatientLevelPrediction package, or other approaches depending on use case requirements.
These pipelines are quite full-featured and flexible. They support ensemble models and include tools for clinically informed inspection of data-driven feature selection results, automated displays of results in various standard outputs, linkage with the [OHDSI Prediction Model library](https://delphi.ohdsi.org/) that supports upload and download of models and cumulative collection of model performance metadata at different institutions, etc.
In general, they standardize inputs in ways that leverage the OMOP information model and phenotyping resources, allow very flexible adaptation of model types (hyperparameter specification and other design options) or use of completely novel classifiers, and provide standardized outputs that conform to ML best practices for model inspection and sharing. I sketch this mature OMOP/OHDSI-based support just to help us collectively clarify where we think new approaches and associated tools need to be developed based on new ways of storing and accessing the OMOP data as suggested by this file structure.
It seems fairly urgent to reach a clear conclusion about this approach, to prevent sites from having to redo work plans for data contribution and to give clear guidance to people who are working on tool development to meet our project requirements. That type of concern is the motivation I had for us all to clearly define what the requirements are and the existing support for them. I hope this comment is useful and conveys the deep respect I have for your expertise in working in this space.
-
Andrew,
The problem at hand is collection and upload. What you suggest has been discussed as an alternative model, with the main argument being that sites would not have to deconstruct their OMOP tables into person_id-specific tables. Writing a script to do this is trivial, though. A main advantage of the per-person structure is that all domains of data pertaining to a patient can be easily identified, and it is a more natural way to segregate waveform and image files. This is also the model that MIMIC has adopted for their waveform files. If there were only OMOP tables, an argument could be made for all patients to be included in single clinical tables. The person_id-centric arrangement also favors data integrity as sites add new patients or update individual patients: you know where to look. With the alternative model, you would have to develop procedures for properly appending, updating, or replacing data. This seems risky. There is no data redundancy, just a larger number of files and folders.
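For illustration, a script of the kind mentioned above might look like the following Python sketch, which splits one site-wide OMOP table into per-person files under `<person_id>/OMOP/`. It assumes the table has a `person_id` column and fits in memory; the function name and the sample rows are hypothetical.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def split_table_by_person(source_csv, table_name, out_root):
    """Write <out_root>/<person_id>/OMOP/<table_name>.csv for each person."""
    rows_by_person = defaultdict(list)
    with open(source_csv, newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        for row in reader:
            rows_by_person[row["person_id"]].append(row)

    for person_id, rows in rows_by_person.items():
        target_dir = Path(out_root) / person_id / "OMOP"
        target_dir.mkdir(parents=True, exist_ok=True)
        with open(target_dir / f"{table_name}.csv", "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    return sorted(rows_by_person)


# Small demonstration with made-up rows in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "MEASUREMENT.csv"
    source.write_text(
        "measurement_id,person_id,value_as_number\n"
        "1,123,7.4\n"
        "2,456,98\n"
        "3,123,7.3\n"
    )
    person_ids = split_table_by_person(source, "MEASUREMENT", Path(tmp) / "upload")
    person_123_lines = (
        Path(tmp) / "upload" / "123" / "OMOP" / "MEASUREMENT.csv"
    ).read_text().splitlines()
```

A production version would likely stream large tables instead of grouping them in memory, but the per-person layout it produces would be the same.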
Centrally, all .csv files will of course be assembled into a single database, so OHDSI scripts can be run to construct cohorts, etc. A script to do this is also simple to develop (a script will need to be developed anyway to merge data from different sites).
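The central assembly step could be sketched as follows: walk every person folder and concatenate the per-person copies of one table. This assumes the `<person_id>/OMOP/<TABLE>.csv` layout discussed in this thread; the function name and sample data are hypothetical, and loading the result into an actual database is left out.

```python
import csv
import tempfile
from pathlib import Path


def merge_person_tables(upload_root, table_name):
    """Concatenate <person_id>/OMOP/<table_name>.csv across all person folders."""
    merged = []
    # Sorting keeps the output order stable from run to run.
    for path in sorted(Path(upload_root).glob(f"*/OMOP/{table_name}.csv")):
        with open(path, newline="") as handle:
            merged.extend(csv.DictReader(handle))
    return merged


# Small demonstration with two made-up person folders.
with tempfile.TemporaryDirectory() as tmp:
    for person_id, row_id, value in [("123", "1", "7.4"), ("456", "2", "98")]:
        folder = Path(tmp) / person_id / "OMOP"
        folder.mkdir(parents=True)
        (folder / "MEASUREMENT.csv").write_text(
            "measurement_id,person_id,value_as_number\n"
            f"{row_id},{person_id},{value}\n"
        )
    merged_rows = merge_person_tables(tmp, "MEASUREMENT")
```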
So, this will work.
Gilles
From: Andrew Williams ***@***.***>
Sent: Thursday, December 7, 2023 11:46 AM
Gilles, are we sure the inclusion of a per-patient OMOP folder in this structure is required?
Having the OMOP data including all the vocabulary in two places would seem to:
* create very large redundancies between the patient-level OMOP and the central OMOP
* data integrity challenges
If the rationale is very clear, I don't want to suggest that this approach should be reconsidered.
Is there a source of information that gives a clear understanding of
* how those redundancy and integrity risks are managed
* the added value of the person-level OMOP instance for required functions like defining cohorts or ML feature engineering
Given the ability to link files to the aggregate OMOP structure, we might be wise to make sure we're not taking on large tool development tasks for problems that might already be mostly solved like querying data to define cohorts or generating person-level data structures for feature engineering.
Querying to define cohorts and generating data frames (in Python or R) are already handled by mature software pipelines in the OHDSI tool stack. If we need different functionality for doing that with waveforms and images, an alternative might be needed. The documentation of the data extraction functions in the PLP package<https://ohdsi.github.io/PatientLevelPrediction/reference/index.html> illustrates how that happens for ML model development.
Those functions use the FeatureExtraction<http://ohdsi.github.io/FeatureExtraction/> package to do that work. FeatureExtraction gets data from a relational OMOP DB structure and puts it into a row-per-patient data frame. Output from FeatureExtraction is stored within an object that includes associated metadata about the meaning of default indicator variables (e.g. condition occurrence) or custom covariates (lab or other numeric values, trends, etc.). The OHDSI pipelines like those in PLP use those objects for feature engineering by the many flavors of ML supported in the OHDSI HADES packages, like the DeepPatientLevelPrediction<https://github.com/OHDSI/DeepPatientLevelPrediction> package, or other approaches depending on user preferences.
It seems fairly urgent to reach a clear conclusion about this approach to prevent sites from having to redo work plans for data contribution and give a clear guidance to people who are working on tool development to meet our project requirements.
That type of concern is the motivation I had for us all to clearly define what the requirements are and the existing support for them. I hope this comment is useful and conveys the deep respect I have for your expertise in working in this space.
-
Hi community, can someone please clarify the target folder structure in the DA SOP? https://chorus-ai.github.io/data_acq_SOP/docs/Data-Uploading/Data_uploading/
Should we split each structured table by patient and make 13 tables for each patient (13*N files in total per upload), or should each upload contain 13 tables, each holding all patients?