Self-Regularization with Latent Space Explanations for Controllable LLM-based Classification

DOI: 10.48550/arxiv.2502.14133 Publication Date: 2025-02-19
ABSTRACT
Modern text classification methods heavily rely on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for classification model training. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant ones, to guarantee regulatory compliance or to improve the generalizability of classification models. This limitation arises because LLM embeddings are opaque and difficult to interpret. In this paper, we propose a novel framework to identify and regularize unintended features in the latent space. Specifically, we first pre-train a sparse autoencoder (SAE) to extract interpretable features from LLM latent spaces. To ensure the SAE can capture task-specific features, we further fine-tune it on task-specific datasets. When training the classification model, we propose a simple and effective regularizer that minimizes the similarity between the classifier weights and the identified unintended features, thereby removing their impact on classification. We evaluate the proposed framework on three real-world tasks, including toxic chat detection, reward modeling, and disease diagnosis. Results show that the proposed framework significantly improves the classifier's generalizability by regularizing features that are not semantically correlated with each task. This work pioneers controllable text classification over LLM latent spaces by leveraging interpreted features to address generalizability, fairness, and privacy challenges. We will release our code and data once accepted.
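
To make the regularization idea concrete, below is a minimal illustrative sketch, not the authors' released code: a linear classifier over frozen LLM embeddings trained with an added penalty that discourages its weight vectors from aligning with SAE feature directions flagged as unintended. All names and values here (UNINTENDED_DIRS, lambda_reg, the dimensions, the random toy data) are hypothetical assumptions for illustration only.

```python
# Sketch of a "self-regularized" classifier: cross-entropy plus a penalty on the
# cosine similarity between classifier weights and unintended feature directions.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM, NUM_CLASSES = 768, 2  # assumed embedding size / label count

# Hypothetical decoder directions of SAE features identified as unintended
# (e.g., sensitive or task-irrelevant concepts), one unit vector per row.
UNINTENDED_DIRS = F.normalize(torch.randn(4, EMB_DIM), dim=-1)

classifier = nn.Linear(EMB_DIM, NUM_CLASSES)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
lambda_reg = 0.1  # regularization strength (assumed, not from the paper)

def unintended_similarity(weight: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """Mean absolute cosine similarity between each class weight vector
    and each identified unintended feature direction."""
    w = F.normalize(weight, dim=-1)  # (num_classes, emb_dim)
    sims = w @ dirs.t()              # (num_classes, num_dirs)
    return sims.abs().mean()

# Toy training step: random tensors stand in for frozen LLM embeddings + labels.
embeddings = torch.randn(32, EMB_DIM)
labels = torch.randint(0, NUM_CLASSES, (32,))

logits = classifier(embeddings)
loss = F.cross_entropy(logits, labels) + lambda_reg * unintended_similarity(
    classifier.weight, UNINTENDED_DIRS
)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In this reading, driving the similarity term toward zero keeps the decision boundary roughly orthogonal to the unintended concept directions, so those features contribute little to the prediction while task-relevant directions remain unconstrained.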