- Software System Performance and Reliability
- Software Engineering Research
- Cloud Computing and Resource Management
- Data Quality and Management
- Anomaly Detection Techniques and Applications
- Topic Modeling
- Network Security and Intrusion Detection
- Software Testing and Debugging Techniques
- Software Engineering Techniques and Practices
- Web Data Mining and Analysis
- IoT and Edge/Fog Computing
- Advanced Malware Detection Techniques
- Cloud Data Security Solutions
- Distributed and Parallel Computing Systems
- Spam and Phishing Detection
- Big Data and Business Intelligence
- Information Retrieval and Search Behavior
- Natural Language Processing Techniques
- Security and Verification in Computing
- Caching and Content Delivery
- Web Application Security Vulnerabilities
- Data Stream Mining Techniques
- Advanced Computational Techniques and Applications
- Open Source Software Innovations
- Semantic Web and Ontologies
Microsoft (United States): 2012-2025
Google (United States): 2025
Seattle University: 2025
University of Washington: 2025
University of California, Los Angeles: 2025
University of Illinois Urbana-Champaign: 2025
Menlo School: 2025
Microsoft (Germany): 2024
University of Chinese Academy of Sciences: 2024
Microsoft Research (United Kingdom): 2019-2024
Incident management for cloud services is a complex process involving several steps and has a huge impact on both service health and developer productivity. On-call engineers require a significant amount of domain knowledge and manual effort for root causing and mitigation of production incidents. Recent advances in artificial intelligence have resulted in state-of-the-art large language models like GPT-3.x (both GPT-3.0 and GPT-3.5), which have been used to solve a variety of problems ranging from question answering to text...
Social sign-on and social sharing are becoming an ever more popular feature of web applications. This success is largely due to the APIs and support offered by prominent social networks, such as Facebook, Twitter, and Google, on the basis of new open standards such as the OAuth 2.0 authorization protocol. A formal analysis of these protocols must account for malicious websites and common web application vulnerabilities, such as cross-site request forgery and redirectors. We model several configurations of the protocol in the applied pi-calculus and verify them...
Today's software development is distributed and involves continuous changes for new features, yet the development cycle has to be fast and agile. An important component of enabling this agility is selecting the right reviewers for every code-change - the smallest unit of the development cycle. Modern tool-based code review has proven an effective way to achieve appropriate review of changes. However, the selection of reviewers in these systems is at best manual. As teams scale, this poses the challenge of selecting the right reviewers, which in turn determines quality over time. While previous work...
Effort estimation models have long been studied in software engineering research. They help organizations and individuals plan and track the progress of their projects and individual tasks toward delivery milestones better. Towards this end, there is a large body of work that has been done on effort estimation for projects, but little at an individual checkin (Pull Request) level. In this paper we present a methodology that provides effort estimates for developer check-ins, which are displayed to developers to help them track their work items. Given the cloud development infrastructure pervasive in companies, it...
Production incidents in today's large-scale cloud services can be extremely expensive in terms of customer impacts and the engineering resources required to mitigate them. Despite continuous reliability efforts, cloud services still experience severe incidents due to various root-causes. Worse, many of these incidents last for a long period as existing techniques and practices fail to quickly detect them. To better understand the problems, we carefully study hundreds of recent high severity incidents and their postmortems in Microsoft-Teams, a distributed cloud-based service...
Large scale cloud services use Key Performance Indicators (KPIs) for tracking and monitoring performance. They usually have Service Level Objectives (SLOs) baked into the customer agreements, which are tied to these KPIs. Dependency failures, code bugs, infrastructure failures, and other problems can cause performance regressions. It is critical to minimize the time and manual effort in diagnosing and triaging such issues to reduce customer impact. The volume of logs and the mixed type of attributes (categorical, continuous) makes diagnosis...
Pull requests are a key part of the collaborative software development and code review process today. However, pull requests can also slow down development when the reviewer(s) or the author do not actively engage with the pull request. In this work, we design an end-to-end service, Nudge, for accelerating overdue pull requests towards completion by reminding the author or the reviewer(s) to engage with their pull requests. First, we use models based on effort estimation and machine learning to predict the completion time for a given pull request. Second, we use activity detection to filter out pull requests that may be overdue, but for which sufficient action...
Cloud providers introduce features (e.g., Spot VMs, Harvest VMs, and Burstable VMs) and optimizations (e.g., oversubscription, auto-scaling, power harvesting, and overclocking) to improve efficiency and reliability. To effectively utilize these features, it's crucial to understand the characteristics of workloads running in the cloud. However, workload characteristics can be complex and depend on multiple signals, making manual characterization difficult and unscalable. In this study, we conduct the first large-scale examination of first-party workloads at...
Cloud platforms remain underutilized despite multiple proposals to improve their utilization (e.g., disaggregation, harvesting, and oversubscription). Our characterization of the resource utilization of virtual machines (VMs) in Azure reveals that, while CPU is the main underutilized resource, we need to provide a solution to manage all resources holistically. We also observe that many VMs exhibit complementary temporal patterns, which can be leveraged to improve the oversubscription of resources. Based on these insights, we propose Coach: a system...
Recent advances in generative AI have led to large multi-modal models (LMMs) capable of simultaneously processing inputs of various modalities such as text, images, video, and audio. While these models demonstrate impressive capabilities, efficiently serving them in production environments poses significant challenges due to their complex architectures and heterogeneous resource requirements. We present the first comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention, on...
Recent Large Language Models (LLMs) have demonstrated satisfying general instruction following ability. However, small LLMs with about 7B parameters still struggle with fine-grained format following (e.g., JSON format), which seriously hinders the advancement of their applications. Most existing methods focus on benchmarking general instruction following while overlooking how to improve the specific format following ability of small LLMs. Besides, these methods often rely on evaluations based on advanced LLMs (e.g., GPT-4), which can introduce intrinsic bias and be costly due to API calls. In this...
Cloud systems are the backbone of today's computing industry. Yet, these systems remain complicated to design, build, operate, and improve. All these tasks require significant manual effort by both the developers and the operators of these systems. To reduce this manual burden, in this paper we set forth a vision for achieving holistic automation, intent-based system design and operation. We propose intent as a new abstraction within this context. Intent encodes the functional and operational requirements of the system at a high-level, which can be used to automate...
Large Language Model (LLM) inference workloads handled by global cloud providers can include both latency-sensitive and insensitive tasks, creating a diverse range of Service Level Agreement (SLA) requirements. Managing these mixed workloads is challenging due to the complexity of the stack, which includes multiple LLMs, hardware configurations, and geographic distributions. Current optimization strategies often silo the tasks to ensure that SLAs are met for latency-sensitive tasks, but this leads to significant under-utilization of expensive GPU...
The move from boxed products to services and the widespread adoption of cloud computing has had a huge impact on the software development life cycle and DevOps processes. Particularly, incident management has become critical for developing and operating large-scale services. Prior work has heavily focused on the challenges with incident triaging and de-duplication. In this work, we address the fundamental problem of structured knowledge extraction from service incidents. We have built SoftNER, a framework for unsupervised knowledge extraction. We frame the problem as a Named-Entity...