- Distributed and Parallel Computing Systems
- Parallel Computing and Optimization Techniques
- Cloud Computing and Resource Management
- Advanced Data Storage Technologies
- Interconnection Networks and Systems
- Embedded Systems Design Techniques
- Scientific Computing and Data Management
- Ferroelectric and Negative Capacitance Devices
- Advanced Memory and Neural Computing
- Computational Physics and Python Applications
- Underwater Vehicles and Communication Systems
- CCD and CMOS Imaging Sensors
- Target Tracking and Data Fusion in Sensor Networks
- Big Data and Business Intelligence
- Advanced Neural Network Applications
- Computational Drug Discovery Methods
- Genetics, Bioinformatics, and Biomedical Research
- Advanced Electron Microscopy Techniques and Applications
- Robotics and Sensor-Based Localization
- Protein Degradation and Inhibitors
- Bioinformatics and Genomic Networks
- Protein Structure and Dynamics
- Quantum-Dot Cellular Automata
KTH Royal Institute of Technology
2023-2025
Quantum computer simulators are an indispensable tool for prototyping quantum algorithms and verifying the functioning of existing quantum computer hardware. The current largest quantum computers feature more than one thousand qubits, challenging their classical simulators. State-vector simulators are challenged by the exponential increase of representable quantum states with respect to the number of qubits, making the simulation of fifty qubits practically unfeasible. An appealing approach for simulating quantum computers is adopting the tensor network approach, whose memory requirements fundamentally...
Memory disaggregation has recently been adopted in data centers to improve resource utilization, motivated by cost and sustainability. Recent studies on large-scale HPC facilities have also highlighted memory underutilization. A promising and non-disruptive option for memory disaggregation is rack-scale memory pooling, where node-local memory is supplemented by shared memory pools. This work outlines the prospects and requirements for adoption and clarifies several misconceptions. We propose a quantitative method for dissecting application requirements on the memory system from the top...
The conventional model of resource allocation in HPC systems is static. Thus, a job cannot leverage newly available resources in the system or release underutilized resources during execution. In this paper, we present Kub, a methodology that enables elastic execution of HPC workloads on Kubernetes so that the resources allocated to a job can be dynamically scaled during execution. One main optimization of our method is to maximize the reuse of the originally allocated resources so that the disruption to the running job can be minimized. The scaling procedure is coordinated among nodes through remote procedure calls on Kubernetes for deploying workloads in the cloud. We...
Recent development in lightweight OS-level virtualization, containers, provides a potential solution for running HPC applications on the cloud platform. In this work, we focus on the impact of different layers in a containerized environment when migrating HPC containers from a dedicated HPC system to a cloud platform. On three ARM-based platforms, including the latest Nvidia Grace CPU, we use six representative HPC applications to characterize the impact of container virtualization, host OS and kernel, and rootless and privileged container execution. Our results indicate less than 4% container overhead in DGEMM,...
Memory management across discrete CPU and GPU physical memory is traditionally achieved through explicit GPU allocations and data copy or unified virtual memory. The Grace Hopper Superchip, for the first time, supports an integrated CPU-GPU system page table, hardware-level addressing of system-allocated memory, and a cache-coherent NVLink-C2C interconnect, bringing an alternative solution for enabling a Unified Memory system. In this work, we provide an in-depth study of memory management on the Grace Hopper Superchip, in both in-memory and memory oversubscription scenarios. We provide a suite...
OpenCUBE aims to develop an open-source full software stack for a Cloud computing blueprint deployed on EPI hardware, adaptable to emerging workloads across the computing continuum. OpenCUBE prioritizes energy awareness and utilizes open APIs, Open Source components, advanced SiPearl Rhea processors, and a RISC-V accelerator. The project leverages representative workloads, such as cloud-native workflows of weather forecast data management, molecular docking, and space weather, for evaluation and validation.
In drug discovery, molecular docking aims at characterizing the binding of a drug-like molecule to a macromolecule. AutoDock-GPU, a state-of-the-art docking software, estimates the geometrical conformation of a docked ligand-protein complex by minimizing a scoring function. Our profiling results indicate that the current reduction operation that is heavily used in the scoring function is sub-optimal. Thus, we developed a method to accelerate the sum reduction of four-element vectors using matrix operations on NVIDIA Tensor Cores. We integrated the new reduction operation into...
Complex workflows play a critical role in accelerating scientific discovery. In many scientific domains, efficient workflow management can lead to faster scientific output and broader user groups. Workflows that leverage resources across the boundary between cloud and HPC are a strong driver for the convergence of HPC and cloud. This study investigates the transition and deployment of a GPU-accelerated molecular docking workflow that was designed for HPC systems onto a cloud-native environment with Kubernetes and Apache Airflow. The case study focuses on state-of-the-art...
Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application performance. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a test and evaluation methodology for characterizing the performance of data movements on multi-GPU systems, stressing different communication options on AMD MI250X GPUs,...
High-end ARM processors are emerging in data centers and HPC systems, posing as a strong contender to x86 machines. Memory-centric profiling is an important approach for dissecting an application's bottlenecks on memory access and guiding optimizations. Many existing memory profiling tools leverage hardware performance counters and precise event sampling, such as Intel PEBS and AMD IBS, to achieve high accuracy and low overhead. In this work, we present a multi-level memory profiling tool for ARM processors, leveraging the Statistical Profiling Extension (SPE)....
Disaggregated memory breaks the boundary of monolithic servers to enable memory provisioning on demand. Using network-attached memory to provide memory expansion for memory-intensive applications on compute nodes can improve the overall memory utilization of a cluster and reduce the total cost of ownership. However, current software solutions for leveraging network-attached memory must consume resources on the compute node for memory management tasks. Emerging off-path smartNICs provide general-purpose programmability at low-cost low-power cores. This work provides a general architecture design that...
High-performance GPU-accelerated particle filter methods are critical for object detection applications, ranging from autonomous driving and robot localization to time-series prediction. In this work, we investigate the design, development and optimization of particle-filter using half-precision on CUDA cores and compare their performance and accuracy with single- and double-precision baselines on Nvidia V100, A100, A40 and T4 GPUs. To mitigate numerical instability and precision losses, we introduce algorithmic...