Self-Criticism: Aligning Large Language Models with their Understanding of Helpfulness, Honesty, and Harmlessness

DOI: 10.18653/v1/2023.emnlp-industry.62 Publication Date: 2023-12-10T21:58:19Z
ABSTRACT
Recently, there has been a notable surge in the significance of large language models (LLMs) that engage in conversational-style interactions, such as ChatGPT and Claude, which contribute significantly to the progress of artificial general intelligence (AGI). Typically, these models undergo a two-phase fine-tuning process: instruction fine-tuning (IF) and reinforcement learning from human feedback (RLHF). These methods aim to align LLMs to be helpful, honest, and harmless (HHH). However, RLHF, which incorporates independent reward models trained on high-quality datasets, incurs high costs in terms of hardware resources and human effort. Therefore, we explore the possibility of aligning LLMs with their own understanding of HHH through IF and in-context learning (ICL). In this study, we propose a novel framework called Self-Criticism, which allows LLMs to align themselves based on the definition of HHH they have learned from a large-scale text corpus. We begin by employing a given LLM to discriminate responses with few-shot ICL. Subsequently, the model evaluates its own generated responses and learns to produce "better" responses based on self-judgment. Finally, the model is retrained on the self-generated responses to distill the whole process. By analyzing our proposed method, we also find interesting connections between Self-Criticism, goal-conditioned reinforcement learning, and pseudo-labeling. Experimental results demonstrate that our method achieves performance nearly identical to RLHF in both human evaluation and evaluation by other LLMs, with only a minimal alignment tax.
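The abstract outlines a three-step loop: sample candidate responses, self-judge them against the model's own HHH criteria via few-shot ICL, and retrain on the best self-generated responses. A minimal sketch of that loop is below; the function names, the toy stand-in model, and the numeric scoring rule are illustrative assumptions, not the authors' implementation (a real system would prompt the LLM with few-shot HHH examples and parse its verdict).

```python
# Hypothetical sketch of the Self-Criticism loop (assumptions, not the paper's code):
# 1) sample candidate responses, 2) self-judge them with few-shot HHH criteria,
# 3) keep the best response per prompt as data for distillation fine-tuning.

def toy_model(prompt, seed=0):
    # Placeholder LLM: returns a canned response whose "quality" varies by seed.
    return f"response-{seed} to {prompt!r}"

def toy_judge(prompt, response):
    # Placeholder self-judgment. In the real framework, the same LLM would be
    # prompted with few-shot HHH discrimination examples; here we just score
    # by the seed embedded in the response string.
    return int(response.split("-")[1].split(" ")[0])

def self_criticism_round(prompts, n_samples=4):
    # One round: generate n_samples candidates, self-judge, keep the top one.
    distill_data = []
    for p in prompts:
        candidates = [toy_model(p, seed=i) for i in range(n_samples)]
        best = max(candidates, key=lambda r: toy_judge(p, r))
        distill_data.append((p, best))
    return distill_data  # in the paper, used to retrain (distill) the model

data = self_criticism_round(["Explain RLHF briefly."])
print(data[0][1])
```

The selected (prompt, best-response) pairs play the role of pseudo-labels: the model's own judgments supervise its next round of fine-tuning, which is where the connection to pseudo-labeling noted in the abstract comes from.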