- AutoPlan: does not rely on human demonstrations; instead it collects feedback from the environment and generates reflections
- RetroFormer: freezes the base LLM and trains a separate retrospective model with policy-gradient methods to refine the reflections it feeds back to the agent (see the REINFORCE sketch after this list)
- ADAPTING LLM AGENTS THROUGH COMMUNICATION: applies PPO training directly to an open-source LLM using feedback and the agent's exploration trajectories
- Text2Reward: converts feedback into executable reward code to minimize the ambiguity of natural-language feedback
- ExpeL: LLM Agents Are Experiential Learners: leverages cross-task experience from both successful and failed trajectories to improve the agent's learning
- ALIGNING LANGUAGE MODELS WITH JUDGMENTS: creates contrastive samples pairing correct/incorrect predictions with feedback to train the LLM toward better alignment
- Reflexion: uses self-reflection to improve performance and relies on an oracle to determine when the reasoning should stop (see the reflection-loop sketch after this list)
- RAP: uses MCTS for planning, with the LLM serving as the world model
- RATS: combines MCTS, reward evaluation, and reflection within the LLM search process (see the MCTS sketch after this list)
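A minimal sketch of the reflection loop shared by AutoPlan/Reflexion-style methods. This is illustrative only, not any paper's actual implementation: `llm` is a hypothetical completion callable and `run_episode` is an assumed hook that executes a plan in the environment and returns a success signal plus the trajectory.

```python
# Illustrative self-reflection loop (Reflexion/AutoPlan style); names are hypothetical.
# llm(prompt: str) -> str, run_episode(plan: str) -> (success: bool, trajectory: str)

def reflect_and_retry(llm, run_episode, task, max_trials=4):
    reflections = []  # memory of self-generated lessons across trials
    for _ in range(max_trials):
        prompt = (
            f"Task: {task}\n"
            + "".join(f"Lesson {i}: {r}\n" for i, r in enumerate(reflections))
            + "Propose a step-by-step plan:"
        )
        plan = llm(prompt)
        success, trajectory = run_episode(plan)  # environment feedback acts as the oracle
        if success:                              # oracle signal stops the loop early
            return plan
        # On failure, ask the model to explain what went wrong and store the lesson.
        reflections.append(
            llm(f"The plan failed. Trajectory:\n{trajectory}\n"
                "Explain what went wrong and how to fix it.")
        )
    return None
```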
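For the policy-gradient refinement idea behind RetroFormer (and, more coarsely, PPO training on trajectories), the sketch below shows a toy REINFORCE-style loss: the frozen actor LLM is untouched, and only the reflection-generating policy is updated with a reward equal to the improvement in episode return after its reflection is applied. The function name, shapes, and baseline choice are assumptions for illustration, not the papers' actual training code.

```python
# Toy REINFORCE-style loss for a retrospective policy; illustrative, not Retroformer's code.
# `token_logprobs` are the log-probabilities of the generated reflection tokens, shape [T].
import torch

def reinforce_loss(token_logprobs: torch.Tensor, reward: float, baseline: float = 0.0):
    advantage = reward - baseline                 # e.g. return_after_reflection - return_before
    return -(advantage * token_logprobs.sum())    # minimize negative expected reward

# Hypothetical usage for one collected reflection episode:
# loss = reinforce_loss(logprobs, reward=0.8, baseline=0.3)
# loss.backward(); optimizer.step()
```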
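A compact MCTS sketch in the spirit of RAP/RATS: an LLM proposes candidate actions, an LLM world model predicts next states, a reward evaluator scores them, and UCT guides the search. `propose_actions`, `simulate`, and `evaluate` are hypothetical hooks standing in for the LLM calls; this is a generic UCT implementation, not either paper's exact algorithm.

```python
# Minimal MCTS over LLM-proposed actions; the three callables below are assumed hooks.
# propose_actions(state) -> list[str], simulate(state, action) -> str, evaluate(state) -> float
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):
    # Upper Confidence bound for Trees: exploit high-value children, explore rarely visited ones.
    return node.value / (node.visits + 1e-9) + c * math.sqrt(
        math.log(node.parent.visits + 1) / (node.visits + 1e-9)
    )

def mcts(root_state, propose_actions, simulate, evaluate, n_iter=50):
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # Selection: descend by UCT until reaching a leaf.
        while node.children:
            node = max(node.children, key=uct)
        # Expansion: the LLM proposes actions; the LLM world model predicts next states.
        for action in propose_actions(node.state):
            node.children.append(Node(simulate(node.state, action), parent=node))
        # Evaluation: score one child (or the leaf itself) with the reward evaluator.
        leaf = random.choice(node.children) if node.children else node
        reward = evaluate(leaf.state)
        # Backpropagation: update value estimates along the path back to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits).state if root.children else root.state
```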