- Cardiac Arrest and Resuscitation
- Software System Performance and Reliability
- Cardiac Structural Anomalies and Repair
- Semiconductor Materials and Devices
- Topic Modeling
- Software-Defined Networks and 5G
- Traditional Chinese Medicine Studies
- Neurological Disease Mechanisms and Treatments
- Machine Learning and Algorithms
- Mechanical Circulatory Support Devices
- Interconnection Networks and Systems
- Cloud Computing and Resource Management
- Mineral Processing and Grinding
- Anesthesia and Neurotoxicity Research
- Cardiac Ischemia and Reperfusion
- Natural Language Processing Techniques
- Brain Tumor Detection and Classification
- Network Time Synchronization Technologies
- Anomaly Detection Techniques and Applications
- Advanced Neural Network Applications
- Neurological Disorders and Treatments
Fujian Medical University
2015-2021
Union Hospital
2015-2021
Alibaba Group (United States)
2020
Microsoft Research (United Kingdom)
2016
Over the past one and a half years, we have been using RDMA over commodity Ethernet (RoCEv2) to support some of Microsoft's highly-reliable, latency-sensitive services. This paper describes the challenges we encountered during the process and the solutions we devised to address them. In order to scale RoCEv2 beyond VLAN, we have designed a DSCP-based priority flow-control (PFC) mechanism to ensure large-scale deployment. We have addressed the safety challenges brought by PFC-induced deadlock (yes, it happened!), RDMA transport livelock, and the NIC PFC pause frame...
We present the design, implementation, and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout...
Deep neural networks (DNNs) have gained tremendous attention as compelling solutions for applications such as image classification, object detection, speech recognition, and so forth. Their great success comes with extensive training to make sure the model accuracy is good enough for those applications. Nowadays, it is becoming challenging to train a DNN because 1) the size of the data keeps increasing, which usually needs more iterations to train; 2) the algorithms evolve rapidly, which requires the training phase to be short and quick...
Large-scale distributed model training requires simultaneous training on up to thousands of machines. Faulty machine detection is critical when an unexpected fault occurs in a machine. From our experience, a training task can encounter two faults per day on average, possibly leading to a halt for hours. To address the drawbacks of time-consuming and labor-intensive manual scrutiny, we propose Minder, an automatic faulty machine detector for distributed training tasks. The key idea of Minder is to automatically and efficiently detect faulty distinctive monitoring metric...
The use of ischemic preconditioning (IPC) to protect the myocardium is usually not effective in elderly patients. The aim of the present study was to design new methods to achieve enhanced myocardial protection, based on the differential role of endogenous adenosine (ADO) and ADO receptors (ARs) in the effects of IPC in young and old animals. An improved New Zealand white rabbit model of myocardial ischemia/reperfusion was established using the Langendorff model. Adult or old hearts, with or without exposure to IPC, were used in order to assess the roles of ADO and ARs in the different effects of IPC....
High-speed RDMA networks are being rapidly adopted in the industry for their low latency and reduced CPU overheads. To verify that RDMA can be used in production, system administrators need to understand the set of application workloads that can potentially trigger abnormal performance behaviors (e.g., unexpected low throughput, PFC pause frame storm). We design and implement Collie, a tool for users to systematically uncover performance anomalies in RDMA subsystems without the need to access hardware internal designs. Instead of individually testing each...
Background: Acute myocardial infarction–induced cardiac arrest has a high mortality rate. Objective: To investigate the risk factors of extracorporeal membrane oxygenation combined with percutaneous coronary intervention in rescuing acute myocardial infarction–induced cardiac arrest. Methods: Forty-three eligible patients were assigned into death and survival groups. Their general clinical data, treatment outcomes, and various indicators at 24, 48, and 72 h after implantation were compared. The risk factors affecting outcomes were determined by multivariate...