🐢 Open-Source Evaluation & Testing for ML & LLM systems
-
Updated
Nov 8, 2024 - Python
🐢 Open-Source Evaluation & Testing for ML & LLM systems
A curated list of awesome responsible machine learning resources.
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
Deliver safe & effective language models
Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
Aligning AI With Shared Human Values (ICLR 2021)
RuLES: a benchmark for evaluating rule-following in language models
Code accompanying the paper Pretraining Language Models with Human Preferences
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
📚 A curated list of papers & technical articles on AI Quality & Safety
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
Toolkits to create a human-in-the-loop approval layer to monitor and guide AI agents workflow in real-time.
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
[CCS'24] SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
Attack to induce LLMs within hallucinations
Reading list for adversarial perspective and robustness in deep reinforcement learning.
Add a description, image, and links to the ai-safety topic page so that developers can more easily learn about it.
To associate your repository with the ai-safety topic, visit your repo's landing page and select "manage topics."