When in Doubt, Cascade: Towards Building Efficient and Capable Guardrails
FOS: Computer and information sciences
Computation and Language (cs.CL)
DOI: 10.48550/arxiv.2407.06323
Publication Date: 2024-07-08
AUTHORS (4)
ABSTRACT
Large language models (LLMs) have convincing performance in a variety of downstream tasks. However, these systems are prone to generating undesirable outputs such as harmful and biased text. In order to remedy such generations, the development of guardrail (or detector) models has gained traction. Motivated by findings from developing a detector for social bias, we adopt the notion of a use-mention distinction, which we identified as the primary source of under-performance in preliminary versions of our bias detector. Armed with this information, we describe a fully extensible and reproducible synthetic data generation pipeline that leverages taxonomy-driven instructions to create targeted, labeled data. Using this pipeline, we generate over 300K unique contrastive samples and provide extensive experiments to systematically evaluate performance on a suite of open datasets. We show that our method achieves competitive performance at a fraction of the cost in compute and offers insight into iteratively building efficient and capable guardrail models. Warning: This paper contains examples of text that is toxic, biased, and potentially harmful.
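The cascading strategy named in the title can be illustrated with a minimal sketch. The Python code below is a hedged illustration only, not the authors' implementation: the GuardrailVerdict type, the small_detector and large_detector callables, and the 0.9 confidence threshold are all hypothetical names and values chosen for the example. It shows the general pattern the title suggests: a small, cheap guardrail screens every input, and only cases where it is uncertain are escalated to a larger, more capable (and more expensive) detector.

from dataclasses import dataclass
from typing import Callable

@dataclass
class GuardrailVerdict:
    label: str         # e.g. "harmful" or "benign" (hypothetical label set)
    confidence: float  # detector's confidence in its label, in [0, 1]

def cascade(
    text: str,
    small_detector: Callable[[str], GuardrailVerdict],  # hypothetical cheap first-pass model
    large_detector: Callable[[str], GuardrailVerdict],  # hypothetical costly fallback model
    threshold: float = 0.9,  # assumed confidence cutoff, not from the paper
) -> GuardrailVerdict:
    """Return the small detector's verdict unless it is uncertain,
    in which case defer ("cascade") to the large detector."""
    verdict = small_detector(text)
    if verdict.confidence >= threshold:
        return verdict
    return large_detector(text)

Under this pattern, compute cost scales with how often the small model is uncertain rather than with total traffic, which is consistent with the abstract's claim of competitive performance at a fraction of the compute cost.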