When in Doubt, Cascade: Towards Building Efficient and Capable Guardrails

FOS: Computer and information sciences; Computer Science - Computation and Language (cs.CL)
DOI: 10.48550/arxiv.2407.06323 Publication Date: 2024-07-08
ABSTRACT
Large language models (LLMs) have convincing performance in a variety of downstream tasks. However, these systems are prone to generating undesirable outputs such as harmful and biased text. In order to remedy such generations, the development of guardrail (or detector) models has gained traction. Motivated by findings from developing a detector for social bias, we adopt the notion of a use-mention distinction, which we identified as a primary source of under-performance in preliminary versions of our bias detector. Armed with this information, we describe a fully extensible and reproducible synthetic data generation pipeline that leverages taxonomy-driven instructions to create targeted and labeled data. Using this pipeline, we generate over 300K unique contrastive samples and provide extensive experiments to systematically evaluate performance on a suite of open datasets. We show that our method achieves competitive performance at a fraction of the compute cost and offers insight into iteratively building efficient and capable models. Warning: This paper contains examples of text which are toxic, biased, and potentially harmful.
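The abstract describes generating contrastive samples from taxonomy-driven instructions, built around the use-mention distinction (a statement that *uses* biased language versus one that merely *mentions* or discusses it). The sketch below is a minimal illustration of that idea; the taxonomy, templates, and function names are illustrative assumptions, not the paper's actual pipeline, which the abstract says is instruction-driven rather than template-driven.

```python
# Illustrative sketch (assumed, not from the paper): pairing a "use" of a
# biased claim with a "mention" of the same claim yields contrastive samples
# with opposite labels for training a bias detector.

# Hypothetical two-level taxonomy: category -> target groups.
TAXONOMY = {
    "social_bias": ["group A", "group B"],
}

# A "use" asserts the biased claim; a "mention" discusses it without endorsing it.
USE_TEMPLATE = "People from {group} are inherently untrustworthy."
MENTION_TEMPLATE = (
    'The claim that "people from {group} are untrustworthy" is a harmful stereotype.'
)

def generate_contrastive_pairs(taxonomy):
    """Return (text, label) samples: label 1 = harmful use, 0 = benign mention."""
    samples = []
    for category, groups in taxonomy.items():
        for group in groups:
            samples.append((USE_TEMPLATE.format(group=group), 1))
            samples.append((MENTION_TEMPLATE.format(group=group), 0))
    return samples

pairs = generate_contrastive_pairs(TAXONOMY)
```

In the paper's actual pipeline, an instruction-following LLM would stand in for the fixed templates, allowing the taxonomy to drive much more varied generations at scale.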