Medical Data Extraction Pipeline

Overview

This pipeline uses large language models (LLMs) served through Azure Machine Learning to analyze and process healthcare data. It was designed for a competition hosted on Azure. The main components are data preprocessing, translation, interaction with a scoring service, and storage of results in Azure Blob Storage.

Code Structure

Transcription and Translation

The code begins by loading patient transcripts from the provided input_1.json file.
It then uses the Googletrans library to translate them, so that all text is in English for further processing.
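
As a rough illustration of this step, the sketch below loads the transcripts and translates each one. The JSON layout (a list of objects with "id" and "text" keys) and all helper names are assumptions made for the example, not details taken from the project script. The Googletrans call matches the synchronous API of the 4.0.0-rc1 release pinned in the install command and needs network access.

```python
import json
from typing import Callable, Dict, List


def load_transcripts(path: str) -> List[Dict[str, str]]:
    """Read transcripts from a JSON file.

    Assumed layout: a list of {"id": ..., "text": ...} objects.
    """
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def make_googletrans_translator() -> Callable[[str], str]:
    """Return a text -> English function backed by Googletrans."""
    # Imported here so the other helpers can be used without Googletrans.
    from googletrans import Translator

    translator = Translator()
    return lambda text: translator.translate(text, dest="en").text


def translate_transcripts(
    transcripts: List[Dict[str, str]],
    translate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Return copies of the transcripts with the "text" field translated."""
    return [{**t, "text": translate(t["text"])} for t in transcripts]
```

A typical call would be `translate_transcripts(load_transcripts("input_1.json"), make_googletrans_translator())`. Passing the translator in as a function keeps the step easy to test with a stand-in.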

Azure ML Scoring Service Integration

The pipeline sends requests to an LLM scoring service deployed on Azure Machine Learning at https://roberta.eastus.inference.ml.azure.com/score.
A set of predefined questions about patient information is used to query the service for answers.
The returned answers and scores are then collected for analysis.
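
The request to the scoring service could look roughly like the sketch below, which uses only the standard library. The payload shape ("question" and "context" keys) is an assumption, since the real schema depends on the scoring script behind the endpoint. Key-based authentication with a bearer token is also assumed here, and the function names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict

SCORING_URL = "https://roberta.eastus.inference.ml.azure.com/score"


def build_request(question: str, context: str, api_key: str) -> urllib.request.Request:
    """Build a POST request carrying one question and one transcript."""
    # Assumed payload shape; adjust to whatever the deployed model expects.
    payload = {"question": question, "context": context}
    return urllib.request.Request(
        SCORING_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def ask(question: str, context: str, api_key: str, timeout: float = 30.0) -> Dict[str, Any]:
    """Send the request and return the decoded JSON response."""
    request = build_request(question, context, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `ask` once per (transcript, question) pair and keeping the responses gives the raw material for the analysis step.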

Data Processing

The results are organized into a DataFrame with columns such as Transcript ID, Question, Answer, and Scores.
This DataFrame is saved as a CSV file named input_1.csv.
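
A minimal sketch of this step follows. The keys of the input records ("transcript_id", "question", "answer", "score") and the exact column labels are assumptions chosen for the example; the project script may name them differently.

```python
from typing import Any, Dict, List

COLUMNS = ["Transcript ID", "Question", "Answer", "Score"]


def build_rows(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten collected results into one row per (transcript, question)."""
    return [
        {
            "Transcript ID": r["transcript_id"],
            "Question": r["question"],
            "Answer": r.get("answer", ""),  # empty string if the service gave no answer
            "Score": r.get("score"),        # None if the service gave no score
        }
        for r in results
    ]


def save_csv(rows: List[Dict[str, Any]], path: str = "input_1.csv"):
    """Write the rows to CSV through a pandas DataFrame and return the DataFrame."""
    # Imported here so build_rows can be used without pandas installed.
    import pandas as pd

    frame = pd.DataFrame(rows, columns=COLUMNS)
    frame.to_csv(path, index=False)
    return frame
```

Fixing the column order in `COLUMNS` keeps the CSV layout stable even when a record is missing a field.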

Azure Blob Storage

The processed data is stored in Azure Blob Storage.
A connection string and a container name are configured so that the CSV file can be uploaded.
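
The upload could be sketched as below, assuming the azure-storage-blob package is installed (it is not part of the install command in this README) and that the connection string is read from an environment variable named AZURE_STORAGE_CONNECTION_STRING. The function name, the variable name, and the optional `service_client` parameter are choices made for this example.

```python
import os
from typing import Any, Optional


def upload_file(
    local_path: str,
    container: str,
    blob_name: str,
    service_client: Optional[Any] = None,
) -> None:
    """Upload a local file to a blob, overwriting any existing blob of that name."""
    if service_client is None:
        # Imported here so a stand-in client can be passed without the SDK installed.
        from azure.storage.blob import BlobServiceClient

        service_client = BlobServiceClient.from_connection_string(
            os.environ["AZURE_STORAGE_CONNECTION_STRING"]
        )
    blob_client = service_client.get_blob_client(container=container, blob=blob_name)
    with open(local_path, "rb") as fh:
        blob_client.upload_blob(fh, overwrite=True)
```

Reading the connection string from the environment keeps the secret out of the source file, which matters if the script is committed to a public repository.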

Azure Pipeline

The pipeline showcases the integration of various Azure services for end-to-end data processing.
Data is translated, sent to a scoring service, and the results are stored in Azure Blob Storage for easy access.
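
The overall flow can be sketched as a single function that wires the stages together. Everything here is illustrative: the stage functions are passed in as parameters, and the record keys ("id", "text", "answer", "score") are assumptions rather than the project's actual field names.

```python
from typing import Any, Callable, Dict, List


def run_pipeline(
    transcripts: List[Dict[str, str]],
    questions: List[str],
    translate: Callable[[str], str],
    score: Callable[[str, str], Dict[str, Any]],
    store: Callable[[List[Dict[str, Any]]], None],
) -> List[Dict[str, Any]]:
    """Translate each transcript, ask every question, then store all rows."""
    rows: List[Dict[str, Any]] = []
    for transcript in transcripts:
        english_text = translate(transcript["text"])  # translate once per transcript
        for question in questions:
            result = score(question, english_text)
            rows.append(
                {
                    "Transcript ID": transcript["id"],
                    "Question": question,
                    "Answer": result.get("answer"),
                    "Score": result.get("score"),
                }
            )
    store(rows)  # e.g. write a CSV and upload it to Blob Storage
    return rows
```

Because each stage is an argument, the same function can run against the real services or against simple stand-ins during local testing.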

How to Run

Install Dependencies:
- Ensure that all required dependencies, including Azure ML SDK, Googletrans, and Pandas, are installed. Use the following command:
```
pip install azureml-sdk googletrans==4.0.0-rc1 pandas
```
Azure ML Authentication:
- Make sure to set up your Azure Machine Learning authentication by configuring your Azure ML workspace. Refer to the Azure ML SDK documentation.
Run the Code:
- Execute the provided Python script (competition_pipeline.py) to run the competition pipeline.
Review Results:
- Check the output CSV file (input_1.csv) to see the results of the scoring service for each transcript and question.

Azure Pipeline Configuration

This repository does not include an Azure Pipelines definition. However, a pipeline can be configured on Azure DevOps or any other CI/CD platform to automate the execution of the provided script.

Note

Ensure that you have the necessary permissions and configurations in your Azure environment to use Azure ML services and Azure Blob Storage.

Architecture Diagram
