NFDI4DS | UHH-SEMS - Publication Details

Yu Kang

ORCID: 0009-0004-1735-5876

About

Contact & Profiles

A5043798385

Research Areas

Software System Performance and Reliability
Network Security and Intrusion Detection
Software Engineering Research
Cloud Computing and Resource Management
Advanced Malware Detection Techniques
Anomaly Detection Techniques and Applications
Service-Oriented Architecture and Web Services
Software Reliability and Analysis Research
Data Quality and Management
Security and Verification in Computing
Caching and Content Delivery
Green IT and Sustainability
Topic Modeling
Natural Language Processing Techniques
IoT and Edge/Fog Computing
Software Testing and Debugging Techniques
Advanced Software Engineering Methodologies
Internet of Things and Social Network Interactions
Energy and Environmental Systems
Mobile and Web Applications
Adversarial Robustness in Machine Learning
Recommender Systems and Techniques
Liver Disease and Transplantation
Web Applications and Data Management
Organ Transplantation Techniques and Outcomes

Microsoft Research Asia (China)
2019-2024

Microsoft (Germany)
2024

University of Chinese Academy of Sciences
2024

Microsoft Research (United Kingdom)
2021-2024

Henan Provincial People's Hospital
2021

Zhengzhou University
2021

Shanghai Ninth People's Hospital
2019

Shanghai Jiao Tong University
2018-2019

Korea Telecom (South Korea)
2006-2018

Fudan University
2017-2018

Automatic Root Cause Analysis via Large Language Models for Cloud Incidents

OPENALEX - Publications

Yinfang Chen Huaibing Xie Minghua Ma Yu Kang Xin Gao and 13 more

Ensuring the reliability and availability of cloud services necessitates efficient root cause analysis (RCA) for incidents. Traditional RCA methods, which rely on manual investigation of data sources such as logs and traces, are often laborious, error-prone, and challenging for on-call engineers. In this paper, we introduce RCACopilot, an innovative system empowered by a large language model for automating RCA. RCACopilot matches incoming incidents to corresponding incident handlers based on their alert types,...

10.1145/3627703.3629553 article EN 2024-04-18

LPANNI: Overlapping Community Detection Using Label Propagation in Large-Scale Complex Networks

Meilian Lu Zhenglin Zhang Zhihe Qu Yu Kang

Overlapping community structure is a significant feature of large-scale complex networks. Some existing detection algorithms cannot be applied to such networks due to their high time or space complexity. Label propagation algorithms were proposed for detecting communities in large-scale networks because of their linear complexity; however, most of them can only detect non-overlapping communities, or their results are inaccurate and unstable. Aimed at these defects, we propose an improved overlapping community detection algorithm, LPANNI (Label Propagation Algorithm with Neighbor...

10.1109/tkde.2018.2866424 article EN IEEE Transactions on Knowledge and Data Engineering 2018-08-21

UniParser: A Unified Log Parser for Heterogeneous Log Data

Yudong Liu Xu Zhang Shilin He Hongyu Zhang Liqun Li and 7 more

Logs provide first-hand information for engineers to diagnose failures in large-scale online service systems. Log parsing, which transforms semi-structured raw log messages into structured data, is a prerequisite of automated log analysis such as log-based anomaly detection and diagnosis. Almost all existing log parsers follow the general idea of extracting the common part as templates and the dynamic part as parameters. However, these parsing methods often neglect the semantic meaning of log messages. Furthermore, the high diversity...

10.1145/3485447.3511993 article EN Proceedings of the ACM Web Conference 2022 2022-04-25

UniLog: Automatic Logging via LLM and In-Context Learning

Junjielong Xu Z.Y. Cui Yuan Zhao Xu Zhang Shilin He and 7 more

Logging, which aims to determine the position of logging statements, the verbosity levels, and the log messages, is a crucial process for software reliability enhancement. In recent years, numerous automatic logging tools have been designed to assist developers in one of the logging tasks (e.g., providing suggestions on whether to log in try-catch blocks). These tools are useful in certain situations yet cannot provide a comprehensive logging solution in general. Moreover, although recent research has started to explore end-to-end logging, it is still largely...

10.1145/3597503.3623326 article EN 2024-02-06

Outage Prediction and Diagnosis for Cloud Service Systems

Yujun Chen Xian Yang Qingwei Lin Hongyu Zhang Feng Gao and 7 more

With the rapid growth of cloud service systems and their increasing complexity, failures become unavoidable. Outages, which are critical failures, could dramatically degrade system availability and impact user experience. To minimize downtime and ensure high availability, we develop an intelligent outage management approach, called AirAlert, which can forecast the occurrence of outages before they actually happen and diagnose the root cause after they indeed occur. AirAlert works as a global watcher for the entire system,...

10.1145/3308558.3313501 article EN 2019-05-13

Towards intelligent incident management: why we need it and how we make it

Zhuangbin Chen Yu Kang Liqun Li Xu Zhang Hongyu Zhang and 12 more

The management of cloud service incidents (unplanned interruptions or outages of a service/product) greatly affects customer satisfaction and business revenue. After years of efforts, enterprises are able to solve most incidents automatically and in a timely manner. However, in practice, we still observe critical incidents that occurred in an unexpected manner and that the orchestrated diagnosis workflow failed to mitigate. In order to accelerate the understanding of unprecedented incidents and provide actionable recommendations, a modern incident management system employs...

10.1145/3368089.3417055 article EN 2020-11-08

Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4

Xuchao Zhang Supriyo Ghosh Chetan Bansal Rujia Wang Minghua Ma and 2 more

10.1145/3663529.3663846 article EN 2024-07-10

How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems

Jiajun Jiang Weihai Lu Junjie Chen Qingwei Lin Pu Zhao and 7 more

In recent years, more and more traditional shrink-wrapped software is provided as 7x24 online services. Incidents (events that lead to service disruptions or outages) could affect service availability and cause great financial loss. Therefore, mitigating the incidents is important and time critical. In practice, a document describing the mitigation process, called a troubleshooting guide (TSG), is usually used to reduce the Time To Mitigate (TTM). To investigate the usage of TSGs in real-world online services, we conduct the first empirical study on...

10.1145/3368089.3417054 article EN 2020-11-08

SPINE: a scalable log parser with feedback guidance

Xuheng Wang Xu Zhang Liqun Li Shilin He Hongyu Zhang and 7 more

Log parsing, which extracts log templates and parameters, is a critical prerequisite step for automated log analysis techniques. Though existing log parsers have achieved promising accuracy on public datasets, they still face many challenges when applied in the industry. Through studying the characteristics of real-world log data and analyzing the limitations of existing parsers, we identify two problems. Firstly, it is non-trivial to scale a log parser to a vast number of logs, especially in scenarios where the logs are extremely imbalanced. Secondly,...

10.1145/3540250.3549176 article EN Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering 2022-11-07

Xpert: Empowering Incident Management with Query Recommendations via Large Language Models

Yuxuan Jiang Chaoyun Zhang Shilin He Z. Q. Yang Minghua Ma and 6 more

Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents occurring within these systems can lead to service disruptions and adversely affect user experience. To swiftly resolve such incidents, on-call engineers depend on crafting domain-specific language (DSL) queries to analyze telemetry data. However, writing these queries can be challenging and time-consuming. This paper presents a thorough empirical study of the utilization of KQL, a DSL employed for incident management in a large-scale system at...

10.1145/3597503.3639081 article EN 2024-04-12

DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale

Linghao Zhang Junhao Wang Shilin He Chaoyun Zhang Yu Kang and 11 more

Large Language Models have advanced automated software development; however, it remains a challenge to correctly infer dependencies, namely, identifying the internal components and external packages required for a repository to successfully run. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors on the generated repository. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs' capability...

10.48550/arxiv.2501.13699 preprint EN arXiv (Cornell University) 2025-01-23

Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation

Xing Zhang Jian Wen Fangkai Yang Pu Zhao Yu Kang and 9 more

The advancement of large language models has intensified the need to modernize enterprise applications and migrate legacy systems to secure, versatile languages. However, existing code translation benchmarks primarily focus on individual functions, overlooking the complexities involved in translating entire repositories, such as maintaining inter-module coherence and managing dependencies. While some recent repository-level benchmarks attempt to address these challenges, they still face limitations, including poor...

10.48550/arxiv.2501.16050 preprint EN arXiv (Cornell University) 2025-01-27

ExeCoder: Empowering Large Language Models with Executability Representation for Code Translation

Minghua He Fangkai Yang Pu Zhao Wenjie Yin Yu Kang and 4 more

Code translation is a crucial activity in the software development and maintenance process, and researchers have recently begun to focus on using pre-trained large language models (LLMs) for code translation. However, existing LLMs only learn the contextual semantics of code during pre-training, neglecting executability information closely related to the execution state of the code, which results in unguaranteed executability and unreliable automated translation. To address this issue, we propose ExeCoder, an LLM specifically designed for code translation,...

10.48550/arxiv.2501.18460 preprint EN arXiv (Cornell University) 2025-01-30

Research on Urban Road Design Method in South China Based on Climate Zoning

Huanyu Chang X.‐S. Wang Naren Fang Yu Kang

The urban climate in South China is marked by high complexity and substantial precipitation, posing significant challenges to road performance. This study focuses on the importance of precise climate zoning for urban roads and the application of performance grade (PG) asphalt grading technology to enhance pavement durability. Meteorological data from multiple stations across the region were analyzed to identify key climatic indicators. Using spatial interpolation methods and fuzzy c-means clustering, the region was classified into five distinct...

10.3390/su17041671 article EN Sustainability 2025-02-17

A Clustering-Based QoS Prediction Approach for Web Service Recommendation

Jieming Zhu Yu Kang Zibin Zheng Michael R. Lyu

The rising popularity of service-oriented architecture to construct versatile distributed systems makes Web service recommendation and composition a hot research topic. It is a challenge to design accurate personalized QoS prediction approaches for Web service recommendation due to the unpredictable Internet environment and the sparsity of available historical QoS information. In this paper, we propose a novel landmark-based QoS prediction framework and then present two clustering-based prediction algorithms for Web services, named UBC and WSBC, aiming at enhancing the prediction accuracy via...

10.1109/isorcw.2012.27 article EN 2012-04-01

Identifying linked incidents in large-scale online service systems

Yujun Chen Xian Yang Hang Dong Xiaoting He Hongyu Zhang and 7 more

In large-scale online service systems, incidents occur frequently due to a variety of causes, from updates of software and hardware to changes in the operation environment. These incidents could significantly degrade the system's availability and customers' satisfaction. Some incidents are linked because they are duplicate or inter-related. The linked incidents can greatly help on-call engineers find mitigation solutions and identify the root causes. In this work, we investigate incidents and their links in a representative real-world incident management (IcM) system. Based...

10.1145/3368089.3409768 article EN 2020-11-08

How incidental are the incidents?

Junjie Chen Shu Zhang Xiaoting He Qingwei Lin Hongyu Zhang and 6 more

Although tremendous efforts have been devoted to the quality assurance of online service systems, in reality, these systems still come across many incidents (i.e., unplanned interruptions and outages), which can decrease user satisfaction or cause economic loss. To better understand the characteristics of incidents and improve the incident management process, we perform the first large-scale empirical analysis of incidents collected from 18 real-world online service systems at Microsoft. Surprisingly, we find that although a large number of incidents could occur over a short...

10.1145/3324884.3416624 article EN 2020-12-21

MonitorAssistant: Simplifying Cloud Service Monitoring via Large Language Models

Zhaoyang Yu Minghua Ma Chaoyun Zhang Si Qin Yu Kang and 7 more

10.1145/3663529.3663826 article EN 2024-07-10

DiagDroid: Android performance diagnosis via anatomizing asynchronous executions

Yu Kang Yangfan Zhou Hui Xu Michael R. Lyu

Rapid UI responsiveness is a key consideration to Android app developers. However, the complicated concurrency model of Android makes it hard for developers to understand and further diagnose UI performance. This paper presents DiagDroid, a tool specifically designed for Android UI performance diagnosis. The key notion of DiagDroid is that UI-triggered asynchronous executions contribute to UI performance, and hence their performance and runtime dependency should be properly captured to facilitate diagnosis. However, there are tremendous ways to start asynchronous executions, posing a great...

10.1145/2950290.2950316 article EN 2016-11-01

An empirical study of log analysis at Microsoft

Shilin He Xu Zhang Pinjia He Yong Xu Liqun Li and 6 more

Logs are crucial to the management and maintenance of software systems. In recent years, log analysis research has achieved notable progress on various topics such as log parsing and log-based anomaly detection. However, real voices from front-line practitioners are seldom heard. For example, what are the pain points of log analysis in practice? In this work, we conduct a comprehensive survey study on log analysis at Microsoft. We collected feedback from 105 employees through a questionnaire of 13 questions and individual interviews with 12 employees....

10.1145/3540250.3558963 article EN Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering 2022-11-07

Assess and Summarize: Improve Outage Understanding with Large Language Models

Pengxiang Jin Shenglin Zhang Minghua Ma Haozhe Li Yu Kang and 11 more

Cloud systems have become increasingly popular in recent years due to their flexibility and scalability. Each time cloud computing applications and services hosted on the cloud are affected by an outage, users can experience slow response times, connection issues, or total service disruption, resulting in a significant negative business impact. Outages are usually comprised of several concurring events/source causes, and therefore understanding the context of outages is a very challenging yet crucial first step toward...

10.1145/3611643.3613891 article EN 2023-11-30

Fast Outage Analysis of Large-Scale Production Clouds with Service Correlation Mining

Yaohui Wang Guozheng Li Zijian Wang Yu Kang Yangfan Zhou and 11 more

Cloud-based services are surging into popularity in recent years. However, outages, i.e., severe incidents that always impact multiple services, can dramatically affect user experience and incur economic losses. Locating the root-cause service, i.e., the service that contains the root cause of the outage, is a crucial step to mitigate the outage. In current industrial practice, this is generally performed in a bootstrap manner and largely depends on human efforts: the service that directly causes the outage is identified first, and the suspected root cause is traced back...

10.1109/icse43902.2021.00085 article EN 2021-05-01

Detection Is Better Than Cure: A Cloud Incidents Perspective

Vaibhav Ganatra Anjaly Parayil Supriyo Ghosh Yu Kang Minghua Ma and 3 more

Cloud providers use automated watchdogs or monitors to continuously observe service availability and proactively report incidents when system performance degrades. Improper monitoring can lead to delays in the detection and mitigation of production incidents, which can be extremely expensive in terms of customer impacts and manual toil from engineering resources. Therefore, a systematic understanding of the pitfalls in current monitoring practices and how they can lead to incidents is crucial for ensuring continuous reliability of cloud services.

10.1145/3611643.3613898 article EN 2023-11-30

An Intelligent Framework for Timely, Accurate, and Comprehensive Cloud Incident Detection

Yichen Li Xu Zhang Shilin He Zhuangbin Chen Yu Kang and 9 more

Cloud incidents (service interruptions or performance degradation) dramatically degrade the reliability of large-scale cloud systems, causing customer dissatisfaction and revenue loss. With years of efforts, cloud providers are able to solve most incidents automatically and rapidly. The secret of this ability is intelligent incident detection. Only when incidents are detected timely, accurately, and comprehensively, can they be diagnosed and mitigated at a satisfiable speed. To overcome the limitations of traditional rule-based incident detection, we...

10.1145/3544497.3544499 article EN ACM SIGOPS Operating Systems Review 2022-06-14

CONAN: Diagnosing Batch Failures for Cloud Systems

Liqun Li Xu Zhang Shilin He Yu Kang Hongyu Zhang and 6 more

Failure diagnosis is critical to the maintenance of large-scale cloud systems, which has attracted tremendous attention from academia and industry over the last decade. In this paper, we focus on diagnosing batch failures, which occur to a batch of instances of the same subject (e.g., API requests, VMs, nodes, etc.), resulting in degraded service availability and performance. Manual investigation over a large volume of high-dimensional telemetry data (e.g., logs, traces, and metrics) is labor-intensive and time-consuming, like finding a needle in a haystack....

10.1109/icse-seip58684.2023.00018 article EN 2023-05-01
