Apr 06, 2022

Aligning Generative Language Models with Human Values

NAACL 2022 - Findings

This paper proposes SENSEI, a new reinforcement-learning-based method that embeds human value judgments into each step of language generation. SENSEI uses an Actor-Critic framework in which the Critic is a reward distributor that simulates how humans assign rewards, while the Actor guides generation in the direction of maximum reward. Compared with five existing methods on three human-value alignment datasets, SENSEI not only achieves higher alignment performance in both automatic and human evaluations, but also shows improved robustness and better transfer learning to unseen human values.

Similar thoughts

Training Socially Aligned Language Models in Simulated Human Society

Mind's Eye: Grounded Language Model Reasoning Through Simulation

Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits

Non-Parallel Text Style Transfer with Self-Parallel Supervision

Knowledge Infused Decoding