Show HN: Next-Gen AI Training: LLM-RLHF-Tuning with PPO and DPO


This repository offers a comprehensive toolkit for Reinforcement Learning from Human Feedback (RLHF): instruction fine-tuning, reward-model training, and both PPO and DPO training, with ready-made configurations for the Alpaca, LLaMA, and LLaMA2 models.
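For context on what PPO-based RLHF optimizes, here is a minimal, illustrative sketch of the standard clipped PPO surrogate loss for a single action. This is not code from the repository; the function name and signature are hypothetical.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped PPO surrogate loss for one action (illustrative sketch).

    logp_new / logp_old: log-probability of the action under the
    current and old policies; advantage: estimated advantage.
    """
    # Probability ratio between current and old policy
    ratio = math.exp(logp_new - logp_old)
    # Clip the ratio to [1 - eps, 1 + eps] to limit the update size
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # Pessimistic (min) of clipped and unclipped objectives, negated
    # so that minimizing this loss maximizes the surrogate objective
    return -min(ratio * advantage, clipped * advantage)
```

When the new and old policies agree, the loss is just the negative advantage; when the ratio drifts outside the clip range, the gradient incentive is capped, which is what keeps PPO updates conservative.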

  • Adds LLaMA2 and DPO support.
  • PPO training with a single base model.
  • Includes reward-model training.
  • Supports the LoRA training method.
  • Contributions are encouraged.
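DPO, unlike PPO, dispenses with an explicit reward model and optimizes a preference loss directly. As a rough sketch (again, not the repository's actual code; the function name and inputs are assumptions), the per-pair DPO loss looks like:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trained policy and a frozen reference model.
    """
    # Implicit reward: log-ratio of policy to reference, per response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) computed stably via log1p
    return math.log1p(math.exp(-logits))
```

The loss equals log(2) when the policy and reference agree, and shrinks as the policy favors the chosen response more strongly than the reference does, which is why DPO needs only preference pairs plus a frozen reference copy of the base model.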