PCSD-BERT: Establishment and Validation of a Syndrome Differentiation Model for Pancreatic Cancer in Traditional Chinese Medicine (Preprint)

DOI: 10.2196/preprints.70602
Publication Date: 2025-01-06T13:59:17Z
ABSTRACT
Background: Syndrome differentiation plays a crucial role in traditional Chinese medicine (TCM) diagnosis and treatment planning. However, the process depends heavily on expert experience, which limits systematic standardization.
Objective: This study established a Bidirectional Encoder Representations from Transformers (BERT)-based TCM syndrome differentiation model (PCSD-BERT) and validated it using in-house pancreatic cancer medical records. The model aims to digitalize expert knowledge, enabling its storage and reuse to support standardized syndrome differentiation in TCM clinical practice.
Methods: This study retrospectively collected pancreatic cancer case records from the Department of Integrative Oncology at Fudan University Shanghai Cancer Center between 2011 and 2024. Feature engineering was based on relevant guidelines and expert knowledge, and syndromes with at least 500 case records were included for training. PCSD-BERT was trained on masked language modeling (MLM) and multi-class classification tasks, with ten-fold cross-validation to enhance generalizability. PCSD-BERT was compared with language models commonly embedded in existing TCM diagnostic tools (LSTM and Text-CNN), a BERT model without fine-tuning, and several large language models (LLMs) used with prompt engineering: ChatGPT 4, ChatGPT 4o, ChatGPT o1-Pro, Kimi, Ernie Bot 4.0 Turbo, HuaTuoGPT II, and Zhipu Qingyan. After training, PCSD-BERT’s syndrome differentiation performance was evaluated in practical applications using in-house data, with attention mechanism visualizations used to observe word association patterns in syndrome differentiation tasks. Integrated gradients were also used to assess the model’s ability to associate terms with syndrome labels.
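The ten-fold cross-validation mentioned in the Methods can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' pipeline: the abstract does not say whether folds were stratified, whether metrics were macro-averaged, or whether the reported spread is a standard deviation. The label names (`s0` to `s3`) and the majority-class stand-in classifier are hypothetical placeholders for the real syndrome labels and for PCSD-BERT.

```python
import random
import statistics
from collections import Counter, defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds with similar label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a share of each label.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def macro_scores(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "precision": statistics.mean(precisions),
        "recall": statistics.mean(recalls),
        "f1": statistics.mean(f1s),
        "accuracy": accuracy,
    }


def summarize(fold_scores):
    """Mean and sample standard deviation of each metric across folds."""
    return {
        metric: (
            statistics.mean(s[metric] for s in fold_scores),
            statistics.stdev(s[metric] for s in fold_scores),
        )
        for metric in fold_scores[0]
    }


# Hypothetical toy data: four labels with unequal frequencies.
labels = ["s0"] * 40 + ["s1"] * 30 + ["s2"] * 20 + ["s3"] * 10
folds = stratified_folds(labels, k=10, seed=0)

fold_scores = []
for test_idx in folds:
    held_out = set(test_idx)
    train_labels = [lab for i, lab in enumerate(labels) if i not in held_out]
    # Placeholder "model": always predict the most frequent training label.
    majority = Counter(train_labels).most_common(1)[0][0]
    y_true = [labels[i] for i in test_idx]
    y_pred = [majority] * len(y_true)
    fold_scores.append(macro_scores(y_true, y_pred))

summary = summarize(fold_scores)
```

Each entry of `summary` is a `(mean, standard deviation)` pair, which is one plausible way to arrive at figures written in the form "0.955±0.020". Replacing the majority-class placeholder with a trained classifier would leave the splitting and aggregation logic unchanged.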
Results: Following model establishment, a total of 6,830 case records were included and four syndrome labels were defined. On the test dataset, PCSD-BERT outperformed all baseline models and all LLMs used with prompt engineering, with a precision of 0.955±0.020, recall of 0.935±0.039, F1-score of 0.951±0.23, and accuracy of 0.919±0.025. PCSD-BERT yielded syndrome differentiation results consistent with expert diagnoses across all syndrome categories. Visualization of the attention mechanism indicated that the model effectively identified relationships among TCM terms, constructing accurate inter-word associations. Integrated gradient analysis further revealed a high degree of concordance between the model’s predictions and clinical criteria, supporting alignment with TCM diagnostic principles.
Conclusions: The PCSD-BERT model precisely identified TCM symptoms and syndrome patterns in medical case records, outperforming both the LLMs and the models embedded in existing TCM diagnostic tools in syndrome differentiation. The model has preliminarily achieved the digital storage and standardized application of expert knowledge, laying a foundation for multimodal integration tasks related to syndrome differentiation.
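For readers unfamiliar with integrated gradients, the attribution method named in the abstract, the following is a minimal sketch on a toy logistic model rather than on BERT. The weights, bias, input vector, and all-zero baseline are hypothetical values chosen for illustration; the authors' actual model, baseline, and tooling are not described in the abstract. The path integral is approximated with a midpoint Riemann sum.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, weights, bias):
    """Toy differentiable model: logistic regression output in (0, 1)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def model_gradient(x, weights, bias):
    """Analytic gradient of the model output with respect to each input."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Approximate integrated gradients along the straight path
    from baseline to x using the midpoint rule."""
    totals = [0.0] * len(x)
    for step in range(steps):
        alpha = (step + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(model_gradient(point, weights, bias)):
            totals[i] += g
    # Scale the averaged gradient by the input's distance from the baseline.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


# Hypothetical values: three binary "term present" features.
weights = [2.0, -1.0, 0.5]
bias = -0.5
x = [1.0, 1.0, 0.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, weights, bias)
output_change = model(x, weights, bias) - model(baseline, weights, bias)
```

The attributions sum, up to numerical error, to `output_change`. This is the completeness property of integrated gradients, and it is what lets per-term scores be read as shares of the model's output for a given label. A feature equal to its baseline value, such as the third one here, receives zero attribution.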