- Service-Oriented Architecture and Web Services
- Cloud Computing and Resource Management
- Parallel Computing and Optimization Techniques
- Access Control and Trust
- Advanced Neural Network Applications
- Distributed and Parallel Computing Systems
- Stochastic Gradient Optimization Techniques
- Advanced Data Storage Technologies
- Distributed Systems and Fault Tolerance
- Business Process Modeling and Analysis
- Software Engineering and Design Patterns
- Advanced Database Systems and Queries
- Caching and Content Delivery
- Interconnection Networks and Systems
- Research in Social Sciences
- Urban and Freight Transport Logistics
- Advanced Software Engineering Methodologies
- Web Applications and Data Management
- Social and Educational Sciences
- Information and Cyber Security
- Quality and Supply Management
- Mobile and Web Applications
- Advanced Graph Neural Networks
- Mobile Agent-Based Network Management
- Peer-to-Peer Network Technologies
Microsoft (United States)
2011-2017
Cloud computing is a new computing paradigm, combining diverse client devices -- PCs, smartphones, sensors, single-function, and embedded -- with computation and data storage in the cloud. As with every advance in computing, programming is a fundamental challenge, as the cloud is a concurrent, distributed system running on unreliable hardware and networks.
Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing protocols cannot effectively leverage available network resources in the face of ever-increasing hardware heterogeneity. To address this, we propose Blink, a collective communication library that dynamically generates optimal primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage heterogeneous channels for faster data transfers....
Multi-server distributed systems are becoming increasingly popular with the emergence of cloud computing. These systems need to provide high throughput and low latency, which is a difficult task to achieve. Manual performance tuning and diagnosis of such systems, however, is hard, as the amount of relevant data is large. To help system developers with diagnosis, we have developed a tool called Performance Anomaly Detector (PAD). PAD combines user-driven navigation analysis with automatic correlation and comparative analysis techniques. The...
Many service applications use actors as a programming model for the middle tier, to simplify synchronization, fault-tolerance, and scalability. However, efficient operation of such actors in multiple, geographically distant datacenters is challenging, due to the very high communication latency. Caching and replication are essential to hide latency and exploit locality, but it is not a priori clear how to combine these techniques with the actor model. We present Geo, an open-source geo-distributed actor system that improves...