- Software System Performance and Reliability
- Network Security and Intrusion Detection
- Software Engineering Research
- Cloud Computing and Resource Management
- Advanced Malware Detection Techniques
- Anomaly Detection Techniques and Applications
- Service-Oriented Architecture and Web Services
- Software Reliability and Analysis Research
- Data Quality and Management
- Security and Verification in Computing
- Caching and Content Delivery
- Green IT and Sustainability
- Topic Modeling
- Natural Language Processing Techniques
- IoT and Edge/Fog Computing
- Software Testing and Debugging Techniques
- Advanced Software Engineering Methodologies
- Internet of Things and Social Network Interactions
- Energy and Environmental Systems
- Mobile and Web Applications
- Adversarial Robustness in Machine Learning
- Recommender Systems and Techniques
- Liver Disease and Transplantation
- Web Applications and Data Management
- Organ Transplantation Techniques and Outcomes
Microsoft Research Asia (China)
2019-2024
Microsoft (Germany)
2024
University of Chinese Academy of Sciences
2024
Microsoft Research (United Kingdom)
2021-2024
Henan Provincial People's Hospital
2021
Zhengzhou University
2021
Shanghai Ninth People's Hospital
2019
Shanghai Jiao Tong University
2018-2019
Korea Telecom (South Korea)
2006-2018
Fudan University
2017-2018
Ensuring the reliability and availability of cloud services necessitates efficient root cause analysis (RCA) for incidents. Traditional RCA methods, which rely on manual investigations of data sources such as logs and traces, are often laborious, error-prone, and challenging for on-call engineers. In this paper, we introduce RCACopilot, an innovative system empowered by a large language model for automating RCA. RCACopilot matches incoming incidents to corresponding incident handlers based on their alert types,...
Overlapping community structure is a significant feature of large-scale complex networks. Some existing community detection algorithms cannot be applied to large-scale networks due to their high time or space complexity. Label propagation algorithms were proposed for detecting communities in such networks because of their linear complexity; however, most of them can only detect non-overlapping communities, or their results are inaccurate and unstable. Aimed at these defects, we propose an improved overlapping community detection algorithm, LPANNI (Label Propagation Algorithm with Neighbor...
Logs provide first-hand information for engineers to diagnose failures in large-scale online service systems. Log parsing, which transforms semi-structured raw log messages into structured data, is a prerequisite of automated log analysis such as log-based anomaly detection and diagnosis. Almost all existing log parsers follow the general idea of extracting the common part as templates and the dynamic part as parameters. However, these log parsing methods often neglect the semantic meaning of log messages. Furthermore, the high diversity...
Logging, which aims to determine the position of logging statements, the verbosity levels, and the log messages, is a crucial process for software reliability enhancement. In recent years, numerous automatic logging tools have been designed to assist developers in one of the logging tasks (e.g., providing suggestions on whether to log in try-catch blocks). These tools are useful in certain situations yet cannot provide a comprehensive logging solution in general. Moreover, although recent research has started to explore end-to-end logging, it is still largely...
With the rapid growth of cloud service systems and their increasing complexity, service failures become unavoidable. Outages, which are critical service failures, could dramatically degrade system availability and impact user experience. To minimize service downtime and ensure high system availability, we develop an intelligent outage management approach, called AirAlert, which can forecast the occurrence of outages before they actually happen and diagnose the root cause after they indeed occur. AirAlert works as a global watcher for the entire cloud system,...
The management of cloud service incidents (unplanned interruptions or outages of a service/product) greatly affects customer satisfaction and business revenue. After years of efforts, cloud enterprises are able to solve most incidents automatically and in a timely manner. However, in practice, we still observe critical incidents that occurred in an unexpected manner and that the orchestrated diagnosis workflow failed to mitigate. In order to accelerate the understanding of unprecedented incidents and provide actionable recommendations, a modern incident management system employs...
In recent years, more and more traditional shrink-wrapped software is provided as 7x24 online services. Incidents (events that lead to service disruptions or outages) could affect service availability and cause great financial loss. Therefore, mitigating the incidents is important and time critical. In practice, a document describing the mitigation process, called a troubleshooting guide (TSG), is usually used to reduce the Time To Mitigate (TTM). To investigate the usage of TSGs in real-world online services, we conduct the first empirical study on...
Log parsing, which extracts log templates and parameters, is a critical prerequisite step for automated log analysis techniques. Though existing log parsers have achieved promising accuracy on public log datasets, they still face many challenges when applied in the industry. Through studying the characteristics of real-world log data and analyzing the limitations of existing log parsers, we identify two problems. Firstly, it is non-trivial to scale a log parser to a vast number of logs, especially in scenarios where the log data is extremely imbalanced. Secondly,...
Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents occurring within these systems can lead to service disruptions and adversely affect user experience. To swiftly resolve such incidents, on-call engineers depend on crafting domain-specific language (DSL) queries to analyze telemetry data. However, writing these queries can be challenging and time-consuming. This paper presents a thorough empirical study on the utilization of KQL, a DSL employed for incident management in a large-scale cloud system at...
Large Language Models have advanced automated software development; however, it remains a challenge to correctly infer dependencies, namely, identifying the internal components and external packages required for a repository to successfully run. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors on the generated repository. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs' capability...
The advancement of large language models has intensified the need to modernize enterprise applications and migrate legacy systems to secure, versatile languages. However, existing code translation benchmarks primarily focus on individual functions, overlooking the complexities involved in translating entire repositories, such as maintaining inter-module coherence and managing dependencies. While some recent repository-level translation benchmarks attempt to address these challenges, they still face limitations, including poor...
Code translation is a crucial activity in the software development and maintenance process, and researchers have recently begun to focus on using pre-trained large language models (LLMs) for code translation. However, existing LLMs only learn the contextual semantics of code during pre-training, neglecting executability information closely related to the execution state of the code, which results in unguaranteed code executability and unreliable automated code translation. To address this issue, we propose ExeCoder, an LLM specifically designed for code translation,...
The urban climate in South China is marked by high complexity and substantial precipitation, posing significant challenges to road performance. This study focuses on the importance of precise zoning for roads and the application of performance grade (PG) asphalt grading technology to enhance pavement durability. Meteorological data from multiple stations across the region were analyzed to identify key climatic indicators. Using spatial interpolation methods and fuzzy c-means clustering, the region was classified into five distinct...
The rising popularity of service-oriented architecture to construct versatile distributed systems makes Web service recommendation and composition a hot research topic. It is a challenge to design accurate personalized QoS prediction approaches for Web service recommendation due to the unpredictable Internet environment and the sparsity of available historical QoS information. In this paper, we propose a novel landmark-based QoS prediction framework and then present two clustering-based prediction algorithms for Web services, named UBC and WSBC, aiming at enhancing the QoS prediction accuracy via...
In large-scale online service systems, incidents occur frequently due to a variety of causes, from updates of software and hardware to changes in the operation environment. These incidents could significantly degrade the system's availability and customers' satisfaction. Some incidents are linked because they are duplicate or inter-related. The linked incidents can greatly help on-call engineers find mitigation solutions and identify the root causes. In this work, we investigate the incidents and their links in a representative real-world incident management (IcM) system. Based...
Although tremendous efforts have been devoted to the quality assurance of online service systems, in reality, these systems still come across many incidents (i.e., unplanned interruptions and outages), which can decrease user satisfaction or cause economic loss. To better understand the characteristics of incidents and improve the incident management process, we perform the first large-scale empirical analysis of incidents collected from 18 real-world online service systems in Microsoft. Surprisingly, we find that although a large number of incidents could occur over a short...
Rapid UI responsiveness is a key consideration to Android app developers. However, the complicated concurrency model of Android makes it hard for developers to understand and further diagnose the UI performance. This paper presents DiagDroid, a tool specifically designed for Android UI performance diagnosis. The key notion of DiagDroid is that UI-triggered asynchronous executions contribute to the UI performance, and hence their performance and their runtime dependency should be properly captured to facilitate performance diagnosis. However, there are tremendous ways to start asynchronous executions, posing a great...
Logs are crucial to the management and maintenance of software systems. In recent years, log analysis research has achieved notable progress on various topics such as log parsing and log-based anomaly detection. However, real voices from front-line practitioners are seldom heard. For example, what are the pain points of log analysis in practice? In this work, we conduct a comprehensive survey study on log analysis at Microsoft. We collected feedback from 105 employees through a questionnaire of 13 questions and individual interviews with 12 employees....
Cloud systems have become increasingly popular in recent years due to their flexibility and scalability. Each time cloud computing applications and services hosted on the cloud are affected by a cloud outage, users can experience slow response times, connection issues or total service disruption, resulting in a significant negative business impact. Outages are usually comprised of several concurring events/source causes, and therefore understanding the context of outages is a very challenging yet crucial first step toward...
Cloud-based services are surging into popularity in recent years. However, outages, i.e., severe incidents that always impact multiple services, can dramatically affect user experience and incur economic losses. Locating the root-cause service, i.e., the service that contains the root cause of the outage, is a crucial step to mitigate the outage. In current industrial practice, this is generally performed in a bootstrap manner and largely depends on human efforts: the service that directly causes the outage is identified first, and the suspected root cause is traced back...
Cloud providers use automated watchdogs or monitors to continuously observe service availability and proactively report incidents when system performance degrades. Improper monitoring can lead to delays in the detection and mitigation of production incidents, which can be extremely expensive in terms of customer impacts and manual toil from engineering resources. Therefore, a systematic understanding of the pitfalls in current monitoring practices and how they lead to incidents is crucial for ensuring the continuous reliability of cloud services.
Cloud incidents (service interruptions or performance degradation) dramatically degrade the reliability of large-scale cloud systems, causing customer dissatisfaction and revenue loss. With years of efforts, cloud providers are able to solve most incidents automatically and rapidly. The secret of this ability is intelligent incident detection. Only when incidents are detected timely, accurately, and comprehensively, can they be diagnosed and mitigated at a satisfiable speed. To overcome the limitations of traditional rule-based incident detection, we...
Failure diagnosis is critical to the maintenance of large-scale cloud systems, which has attracted tremendous attention from academia and industry over the last decade. In this paper, we focus on diagnosing batch failures, which occur to a batch of instances of the same subject (e.g., API requests, VMs, nodes, etc.), resulting in degraded service availability and performance. Manual investigation over a large volume of high-dimensional telemetry data (e.g., logs, traces, and metrics) is labor-intensive and time-consuming, like finding a needle in a haystack....