Self-Criticism: Aligning Large Language Models with their Understanding of Helpfulness, Honesty, and Harmlessness
DOI: 10.18653/v1/2023.emnlp-industry.62
Publication Date: 2023-12-10
AUTHORS (7)
ABSTRACT
Recently, there has been a notable surge in the significance of large language models (LLMs) that engage in conversational-style interactions, such as ChatGPT and Claude, as they contribute significantly to the progress of artificial general intelligence (AGI). Typically, these models undergo a two-phase fine-tuning process: instruction fine-tuning (IF) and reinforcement learning from human feedback (RLHF). These methods aim to align LLMs to be helpful, honest, and harmless (HHH). However, RLHF, which incorporates an independent reward model trained on high-quality datasets, incurs high costs in terms of hardware resources and human effort. Therefore, we explore the possibility of aligning LLMs with their own understanding of HHH through IF and in-context learning (ICL). In this study, we propose a novel framework called Self-Criticism, which allows LLMs to align themselves with HHH based on the definition they learned from a large-scale text corpus. We begin by employing the LLM on a given dataset to learn the discrimination of HHH responses through few-shot ICL. Subsequently, the model evaluates its own generated responses and learns to produce "better" ones based on this self-judgment. Finally, the model is retrained on the self-generated responses to distill the whole process. By analyzing our proposed method, we also find interesting connections between Self-Criticism, goal-conditioned reinforcement learning, and pseudo-labeling. Experimental results demonstrate that our method achieves performance nearly identical to RLHF in both human evaluation and evaluation by other LLMs, with only a minimal alignment tax.
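
For readers who want the shape of the method, below is a minimal sketch of the three-stage loop the abstract describes (discriminate via few-shot ICL, self-revise based on the judgment, retrain on the result). The generate interface, the prompt wording, and the GOOD/BAD heuristic are illustrative assumptions, not the authors' implementation:

# Illustrative sketch of the Self-Criticism loop described in the abstract.
# The `generate` interface, prompts, and GOOD/BAD heuristic are hypothetical
# stand-ins, not the authors' code.

from typing import Callable, List, Tuple

Generate = Callable[[str], str]  # any text-in/text-out LLM handle fits

HHH_RUBRIC = (
    "A good response is helpful, honest, and harmless (HHH).\n"
    "Example A: <polite refusal of an unsafe request> -> GOOD\n"
    "Example B: <response that fabricates facts> -> BAD\n"
)  # few-shot ICL examples for discriminating HHH responses


def self_criticism_round(
    generate: Generate,
    prompts: List[str],
) -> List[Tuple[str, str]]:
    """One round: generate, self-judge via few-shot ICL, then self-revise."""
    distill_set: List[Tuple[str, str]] = []
    for prompt in prompts:
        response = generate(prompt)

        # Stage 1: the model discriminates its own response against HHH,
        # using only its learned definition plus a few in-context examples.
        verdict = generate(
            f"{HHH_RUBRIC}\nPrompt: {prompt}\nResponse: {response}\n"
            "Judge this response as GOOD or BAD:"
        )

        # Stage 2: if self-judged BAD, ask the model for a better response.
        if "BAD" in verdict.upper():
            response = generate(
                f"{HHH_RUBRIC}\nPrompt: {prompt}\n"
                f"A response judged BAD: {response}\n"
                "Write a better (more HHH) response:"
            )

        distill_set.append((prompt, response))

    # Stage 3: retrain the model on `distill_set` with ordinary supervised
    # fine-tuning, distilling the whole judge-and-revise process.
    return distill_set

Seen this way, the self-judged responses act as pseudo-labels for the final supervised fine-tuning step, which is the pseudo-labeling connection the abstract points to.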