Self-Regularization with Latent Space Explanations for Controllable LLM-based Classification

DOI: 10.48550/arxiv.2502.14133 Publication Date: 2025-02-19
ABSTRACT
Modern text classification methods heavily rely on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for classification model training. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant ones, to guarantee regulatory compliance or to improve the generalizability of classification models. This limitation arises because LLM embeddings are opaque and difficult to interpret. In this paper, we propose a novel framework to identify and regularize unintended features in the latent space. Specifically, we first pre-train a sparse autoencoder (SAE) to extract interpretable features from LLM latent spaces. To ensure the SAE can capture task-specific features, we further fine-tune it on task-specific datasets. When training the classification model, we propose a simple and effective regularizer that minimizes the similarity between the classifier weights and the identified unintended features, thereby removing their impact on classification. We evaluate the proposed framework on three real-world tasks, including toxic chat detection, reward modeling, and disease diagnosis. Results show that the proposed framework significantly improves the classifier's generalizability by regularizing features that are not semantically correlated with each task. This work pioneers controllable text classification over LLM latent spaces by leveraging interpreted features to address generalizability, fairness, and privacy challenges. We will release our code and data once accepted.
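
To make the regularization idea concrete, below is a minimal illustrative sketch, not the authors' released code: a linear classifier over frozen LLM embeddings trained with an added penalty that discourages its weight vectors from aligning with SAE feature directions flagged as unintended. All names and values here (UNINTENDED_DIRS, lambda_reg, the dimensions, the random toy data) are hypothetical assumptions for illustration only.

```python
# Sketch of a "self-regularized" classifier: cross-entropy plus a penalty on the
# cosine similarity between classifier weights and unintended feature directions.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM, NUM_CLASSES = 768, 2  # assumed embedding size / label count

# Hypothetical decoder directions of SAE features identified as unintended
# (e.g., sensitive or task-irrelevant concepts), one unit vector per row.
UNINTENDED_DIRS = F.normalize(torch.randn(4, EMB_DIM), dim=-1)

classifier = nn.Linear(EMB_DIM, NUM_CLASSES)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
lambda_reg = 0.1  # regularization strength (assumed, not from the paper)

def unintended_similarity(weight: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """Mean absolute cosine similarity between each class weight vector
    and each identified unintended feature direction."""
    w = F.normalize(weight, dim=-1)  # (num_classes, emb_dim)
    sims = w @ dirs.t()              # (num_classes, num_dirs)
    return sims.abs().mean()

# Toy training step: random tensors stand in for frozen LLM embeddings + labels.
embeddings = torch.randn(32, EMB_DIM)
labels = torch.randint(0, NUM_CLASSES, (32,))

logits = classifier(embeddings)
loss = F.cross_entropy(logits, labels) + lambda_reg * unintended_similarity(
    classifier.weight, UNINTENDED_DIRS
)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In this reading, driving the similarity term toward zero keeps the decision boundary roughly orthogonal to the unintended concept directions, so those features contribute little to the prediction while task-relevant directions remain unconstrained.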