🛡️ NAIL Institute - AVE Database

Preference Data Manipulation

🟠 HIGH model_poisoning theoretical AVE-2025-0080

· aka: RLHF Poisoning

Summary

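An attacker manipulates the human preference data used in reinforcement learning from human feedback (RLHF) so that the reward model, and any policy optimized against it, learns a systematic bias in its outputs.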
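
The mechanism can be illustrated with a toy simulation. The sketch below is not taken from the AVE entry or the NAIL SDK. It assumes a one-parameter Bradley-Terry reward model and a single hypothetical binary feature (for instance, "response mentions product X"), and all function names are invented for illustration. Honest annotators are indifferent to the feature; the attacker relabels a fraction of comparisons so the response carrying the feature always wins.

```python
import math
import random


def make_preferences(n, seed=0):
    """Generate toy pairwise comparisons as (x, label) tuples.

    x is feature(A) - feature(B) for one hypothetical feature, so x is +1
    when only response A has the feature and -1 when only B has it.
    label is 1 if the annotator preferred A. Honest annotators ignore the
    feature, so every label is a fair coin flip.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.choice([-1.0, 1.0])
        label = 1 if rng.random() < 0.5 else 0
        data.append((x, label))
    return data


def poison(data, fraction, seed=1):
    """Relabel roughly `fraction` of comparisons so the response that has
    the feature is always marked as preferred."""
    rng = random.Random(seed)
    poisoned = []
    for x, label in data:
        if rng.random() < fraction:
            label = 1 if x > 0 else 0
        poisoned.append((x, label))
    return poisoned


def fit_reward_weight(data, steps=500, lr=0.5):
    """Fit w in P(A preferred) = sigmoid(w * x) by gradient ascent on the
    mean log-likelihood. A positive w means the reward model favors the
    feature."""
    w = 0.0
    for _ in range(steps):
        grad = sum(
            (label - 1.0 / (1.0 + math.exp(-w * x))) * x for x, label in data
        ) / len(data)
        w += lr * grad
    return w


if __name__ == "__main__":
    clean = make_preferences(2000)
    tampered = poison(clean, fraction=0.2)
    print("weight learned from honest labels:  ", fit_reward_weight(clean))
    print("weight learned from poisoned labels:", fit_reward_weight(tampered))
```

Under these assumptions the weight fitted on honest labels should stay near zero, while relabeling 20% of the comparisons pushes the expected win rate of the feature to 0.6 and the fitted weight toward ln(0.6/0.4), roughly 0.41. A policy optimized against such a reward model would then be rewarded for producing the feature, which is the systematic bias described above.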

Blast Radius

Because the bias is introduced at training time, it carries over to every deployment of the affected model.

Prerequisites

Access to the RLHF preference data collection process, sufficient to submit or alter preference labels.

Environment

  • Frameworks: LangGraph
  • Models tested: [Available in NAIL SDK]
  • Multi-agent: No
  • Tools required: No
  • Memory required: No