Shuai Lu

ORCID: 0000-0001-7466-2064
Research Areas
  • Software Engineering Research
  • Topic Modeling
  • Software Testing and Debugging Techniques
  • Natural Language Processing Techniques
  • Advanced Malware Detection Techniques
  • Software Reliability and Analysis Research
  • Bone health and osteoporosis research
  • Multimodal Machine Learning Applications
  • Web Data Mining and Analysis
  • Electrodeposition and Electroless Coatings
  • Corrosion Behavior and Inhibition
  • High Entropy Alloys Studies
  • High-Temperature Coating Behaviors
  • Electric and Hybrid Vehicle Technologies
  • Microstructure and Mechanical Properties of Steels
  • Aluminum Alloys Composites Properties
  • Climate Change and Health Impacts
  • Aluminum Alloy Microstructure Properties
  • Software System Performance and Reliability
  • Adversarial Robustness in Machine Learning
  • Metal Alloys Wear and Properties
  • Space Satellite Systems and Control
  • Industrial Vision Systems and Defect Detection
  • GDF15 and Related Biomarkers
  • Nanoporous metals and alloys

Peking University
2017-2025

Beijing Institute of Neurosurgery
2025

Beijing Jishuitan Hospital
2022-2025

Capital Medical University
2025

Changzhou University
2024-2025

Harbin University of Science and Technology
2024-2025

University of Science and Technology Beijing
2023-2024

Microsoft Research Asia (China)
2022-2024

Nanjing Medical University
2024

Henan University of Science and Technology
2024

Code summarization, aiming to generate a succinct natural language description of source code, is extremely useful for code search and code comprehension. It has played an important role in software maintenance and evolution. Previous approaches generate summaries by retrieving them from similar code snippets. However, these approaches rely heavily on whether similar code snippets can be retrieved and on how similar the snippets are, and they fail to capture the API knowledge in the source code, which carries vital information about the functionality of the code. In this paper, we propose a novel approach, named...

10.24963/ijcai.2018/314 article EN 2018-07-01

Evaluation metrics play a vital role in the growth of an area, as they define the standard for distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but neither is suitable enough to evaluate code: BLEU was originally designed to evaluate natural language, neglecting important syntactic and semantic features of code, and perfect accuracy is too strict and thus underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed...

10.48550/arxiv.2009.10297 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Code review is an essential part of the software development lifecycle, since it aims at guaranteeing the quality of code. Modern code review activities require developers to view, understand, and even run the programs to assess logic, functionality, latency, style, and other factors. It turns out that developers have to spend far too much time reviewing the code of their peers. Accordingly, it is in significant demand to automate the code review process. In this research, we focus on utilizing pre-training techniques for the tasks in the code review scenario. We collect a...

10.1145/3540250.3549081 article EN Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering 2022-11-07

The software development life cycle is profoundly influenced by bugs; their introduction, identification, and eventual resolution account for a significant portion of software cost. This has motivated software engineering researchers and practitioners to propose different approaches for automating the identification and repair of software defects.

10.1145/3611643.3613892 article EN 2023-11-30

In recent years, artificial intelligence (AI) has made incredible progress. Advanced foundation models such as ChatGPT can offer powerful conversation, in-context learning, and code generation abilities for a broad range of open-domain tasks. They can also generate high-level solution outlines for domain-specific tasks based on their acquired common-sense knowledge. Nonetheless, they still face difficulties in specialized tasks because they lack sufficient data during pretraining or make errors in neural network...

10.34133/icomputing.0063 article EN cc-by Intelligent Computing 2023-11-13

Producing the embedding of a sentence in an unsupervised way is valuable to natural language matching and retrieval problems in practice. In this work, we conduct a thorough examination of pretrained-model-based unsupervised sentence embeddings. We study four pretrained models and conduct massive experiments on seven datasets regarding sentence semantics. We have three main findings. First, averaging all tokens is better than only using the [CLS] vector. Second, combining both top and bottom layers is better than only using top layers. Lastly, an easy whitening-based...

10.18653/v1/2021.findings-emnlp.23 preprint EN cc-by 2021-01-01

Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking a syntactic-level structure of code like the abstract syntax tree (AST), we use data flow in...

10.48550/arxiv.2009.08366 preprint EN other-oa arXiv (Cornell University) 2020-01-01

In programming, the names of program entities, especially methods, are the intuitive characteristic for understanding the functionality of the code. To ensure the readability and maintainability of programs, methods should be named properly. Specifically, the names should be meaningful and consistent with other names used in related contexts in their codebase. In recent years, many automated approaches have been proposed to suggest consistent names for methods, among which neural machine translation (NMT) based models are widely used and have achieved state-of-the-art results. However, these NMT-based...

10.1145/3510003.3510154 article EN Proceedings of the 44th International Conference on Software Engineering 2022-05-21

The study investigates the effects of austenitizing temperature on the microstructure and wear resistance of hot work die steels with different silicon (Si) content. The results indicate that the steel with high Si content, austenitized at 1110 °C, exhibits superior wear resistance, which can be attributed to the precipitation of a large amount of fine vanadium carbides during the tempering process. The elevated facilitates martensite transformation quenching increases hardness steels. Low impact toughness is obtained in low content...

10.1002/srin.202400952 article EN steel research international 2025-02-21

Late diagnosis is one of the main bottlenecks in the prevention of musculoskeletal aging-related diseases, and it is urgent to build an early detection model. Twenty-two features were included in models based on binary and multiple classification, respectively, using XGBoost. In testing, the accuracy rate (63.74%~92.40%) and AUC (0.74~0.96) of the binary-classification models were higher than the accuracy rate (61.40%~85.96%) and AUC (0.63~0.86) of the multiple-classification models. The optimal model had an accuracy of 87.13% and an AUC of 0.92, including cooking, drinking...

10.21203/rs.3.rs-6124947/v1 preprint EN Research Square (Research Square) 2025-04-16

Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform. The availability of such data and baselines can...

10.48550/arxiv.2102.04664 preprint EN other-oa arXiv (Cornell University) 2021-01-01

Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code. However, most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures. In this paper, we investigate how well pre-trained models can understand and perform code execution. We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution, which challenges existing models such as Codex. We then present CodeExecutor, a Transformer...

10.18653/v1/2023.findings-acl.308 article EN cc-by Findings of the Association for Computational Linguistics: ACL 2023 2023-01-01

Most advanced unsupervised anomaly detection (UAD) methods rely on modeling feature representations of frozen encoder networks pre-trained on large-scale datasets, e.g. ImageNet. However, the features extracted from encoders that are borrowed from natural image domains coincide little with the features required in the target UAD domain, such as industrial inspection and medical imaging. In this paper, we propose a novel epistemic UAD method, namely ReContrast, which optimizes the entire network to reduce biases towards...

10.48550/arxiv.2306.02602 preprint EN cc-by arXiv (Cornell University) 2023-01-01

This paper addresses the question: In neural dialog systems, why do sequence-to-sequence (Seq2Seq) neural networks generate short and meaningless replies for open-domain response generation? We conjecture that in a dialog system, due to the randomness of spoken language, there may be multiple equally plausible replies for one utterance, causing the deficiency of a Seq2Seq model. To evaluate our conjecture, we propose a systematic way to mimic the dialog scenario in machine translation systems with both real datasets and toy datasets generated elaborately...

10.1109/icassp.2019.8682634 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17

This paper addresses the question: Why do neural dialog systems generate short and meaningless replies? We conjecture that, in a dialog system, an utterance may have multiple equally plausible replies, causing the deficiency of neural networks in the dialog application. We propose a systematic way to mimic the dialog scenario in a machine translation system, and manage to reproduce the phenomenon of generating short and less meaningful sentences in the translation setting, showing evidence of our conjecture.

10.48550/arxiv.1712.02250 preprint EN other-oa arXiv (Cornell University) 2017-01-01