John Snow Labs LangTest 1.9.0: Hugging Face Callback Integration, Advanced Templatic Augmentation, Comprehensive Model Benchmarking, Expanded Clinical Dataset Support (MedQA, PubMedQA, MedMCQA), Insightful Blog Posts, and Enhanced User Experience with Key Bug Fixes #911
Announced by ArshaanNazir in Announcements
📢 Highlights
🌟 LangTest 1.9.0 Release by John Snow Labs
We're excited to announce the latest release of LangTest, featuring significant enhancements that make it more versatile and user-friendly. This update introduces seamless integration of a Hugging Face callback, letting users plug automatic testing directly into transformers training. Another addition is Enhanced Templatic Augmentation with Automated Sample Generation. We also expanded LangTest's utility in language testing by conducting comprehensive benchmarks across various models and datasets, offering deep insights into performance metrics. Moreover, the inclusion of additional clinical datasets like MedQA, PubMedQA, and MedMCQA broadens our scope to cater to diverse testing needs. Coupled with insightful blog posts and numerous bug fixes, this release further cements LangTest as a robust and comprehensive tool for language testing and evaluation.
Integration of Hugging Face's callback class in LangTest facilitates seamless incorporation of an automatic testing callback into transformers' training loop for flexible and customizable model training experiences.
Enhanced Templatic Augmentation with Automated Sample Generation: A key addition in this release is our innovative feature that auto-generates sample templates for templatic augmentation. By setting generate_templates to True, users can effortlessly create structured templates, which can then be reviewed and customized with the show_templates option.
In our Model Benchmarking initiative, we conducted extensive tests on various models across diverse datasets (MMLU-Clinical, OpenBookQA, MedMCQA, MedQA), revealing insights into their performance and limitations, enhancing our understanding of the landscape for robustness testing.
Enhancement: Implemented functionality to save model responses (actual and expected results) for original and perturbed questions from the large language model (LLM) in a pickle file. This enables efficient reuse of model outputs on the same dataset, allowing subsequent evaluations without rerunning the model each time (see the sketch after this list).
Optimized API Efficiency with Bug Fixes in Model Calls.
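The caching pattern behind the response-saving enhancement can be pictured as follows. This is a generic sketch of the idea (pickling model responses keyed by question so a dataset can be re-evaluated without re-querying the model), not LangTest's internal implementation; the cache path and function are hypothetical.

```python
import os
import pickle

CACHE_PATH = "model_responses.pkl"  # hypothetical file name, for illustration only

def get_responses(model, questions):
    """Run the model once, then reuse pickled outputs on later evaluations."""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)  # reuse saved responses instead of rerunning
    responses = {q: model(q) for q in questions}  # original and perturbed questions
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(responses, f)  # persist for the next evaluation pass
    return responses
```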
🔥 Key Enhancements:
🤗 Hugging Face Callback Integration
We introduced a callback class for use in transformers model training. Callbacks in transformers are objects that can tailor the training loop's behavior within the PyTorch or Keras Trainer: they can inspect the training loop state, make decisions (such as early stopping), or execute actions (logging, saving, or evaluation). LangTest leverages this capability with an automatic testing callback that is flexible, adaptable, and integrates seamlessly with any transformers model.
Create a callback instance with one line and pass it to the trainer's callbacks argument:
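A minimal sketch of that flow is below. The import path and constructor arguments for LangTestCallback are assumptions based on the description above, so check the LangTest documentation for the exact signature; model and train_dataset stand in for your own training setup.

```python
from transformers import Trainer, TrainingArguments
from langtest.callback import LangTestCallback  # import path assumed

# Hypothetical arguments: the test task and a LangTest test configuration file.
my_callback = LangTestCallback(task="text-classification", config="tests.yml")

trainer = Trainer(
    model=model,                                  # your transformers model
    args=TrainingArguments(output_dir="output"),
    train_dataset=train_dataset,                  # your training data
    callbacks=[my_callback],                      # LangTest tests run during training
)
trainer.train()
```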
🚀 Enhanced Templatic Augmentation with Automated Sample Generation
Users can now enable the automatic generation of sample templates by setting generate_templates to True. This feature utilizes the advanced capabilities of LLMs to create structured templates that can be used for templatic augmentation. To ensure quality and relevance, users can review the generated templates by setting show_templates to True.
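A minimal sketch of the flow, assuming an NER Harness and the augment API; aside from generate_templates and show_templates, the argument names and data paths here are illustrative:

```python
from langtest import Harness

harness = Harness(task="ner",
                  model={"model": "dslim/bert-base-NER", "hub": "huggingface"})

# Auto-generate structured augmentation templates with an LLM, then display
# them so they can be reviewed and customized before use.
harness.augment(
    training_data={"data_source": "train.conll"},  # illustrative input path
    save_data_path="augmented_train.conll",        # illustrative output path
    generate_templates=True,  # let the LLM propose sample templates
    show_templates=True,      # print the generated templates for review
)
```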
🚀 Benchmarking Different Models
In our Model Benchmarking initiative, we conducted comprehensive tests on a range of models across diverse datasets. This rigorous evaluation provided valuable insights into the performance of these models, pinpointing areas where even large language models exhibit limitations. By scrutinizing their strengths and weaknesses, we gained a deeper understanding of the landscape for robustness testing.
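These benchmark runs follow LangTest's standard Harness flow. A sketch of one such run is below; the model name and the dataset identifier passed as data_source are assumptions, so substitute the values listed in the LangTest documentation:

```python
from langtest import Harness

# Benchmark one model on one dataset with LangTest's robustness tests.
harness = Harness(
    task="question-answering",
    model={"model": "gpt-3.5-turbo", "hub": "openai"},  # assumed model/hub names
    data={"data_source": "OpenBookQA"},                 # assumed dataset identifier
)
harness.generate().run().report()  # generate test cases, run them, summarize pass rates
```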
MMLU-Clinical
We extracted the clinical subsets from the MMLU dataset to create a specialized MMLU-Clinical dataset. This curated dataset specifically targets clinical domains, offering a more focused evaluation of language understanding models. It includes questions and answers on clinical topics, each sample presenting a question with four choices, one of which is correct, making it valuable for evaluating models' reasoning, fact recall, and knowledge application in clinical scenarios.
How the Dataset Looks
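Each record pairs a clinical question with four options and a single correct answer; an illustrative shape is below (field names are assumptions, not the dataset's exact schema):

```python
# Illustrative record shape; field names are assumed, not the exact schema.
sample = {
    "question": "...",                        # the clinical question text
    "options": ["...", "...", "...", "..."],  # four answer choices
    "answer": "...",                          # the single correct choice
}
```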
OpenBookQA
The OpenBookQA dataset is a collection of multiple-choice questions that require complex reasoning and inference based on general knowledge, similar to an “open-book” exam. The questions are designed to test the ability of natural language processing models to answer questions that go beyond memorizing facts and involve understanding concepts and their relations. The dataset contains 500 questions, each with four answer choices and one correct answer. The questions cover various topics in science, such as biology, chemistry, physics, and astronomy.
How the Dataset Looks
MedMCQA
MedMCQA is a large-scale Multiple-Choice Question Answering (MCQA) benchmark dataset designed to address real-world medical entrance exam questions.
How the Dataset Looks
Dataset info:
MedQA
MedQA is a multiple-choice question answering benchmark based on the United States Medical Licensing Examination (USMLE). The questions are collected from professional medical board exams.
How the Dataset Looks
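Any of these clinical datasets can be plugged into the same Harness flow by swapping the data_source value; a sketch, with the identifier strings assumed:

```python
from langtest import Harness

harness = Harness(
    task="question-answering",
    model={"model": "gpt-3.5-turbo", "hub": "openai"},  # assumed model/hub names
    data={"data_source": "MedQA"},  # assumed identifier; likewise "PubMedQA", "MedMCQA"
)
harness.generate().run().report()
```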
🚀 Community Contributions:
Our team has published the following blog posts on Hugging Face's community platform:
Streamlining ML Workflows: Integrating MLFlow Tracking with LangTest for Enhanced Model Evaluations
Evaluating Large Language Models on Gender-Occupational Stereotypes Using the Wino Bias Test
🚀 New LangTest Blogs:
🐛 Bug Fixes
What's Changed
Full Changelog: 1.8.0...1.9.0