- Parallel Computing and Optimization Techniques
- Graph Theory and Algorithms
- Advanced Data Storage Technologies
- Cloud Computing and Resource Management
- Interconnection Networks and Systems
- Advanced Neural Network Applications
- Numerical Methods and Algorithms
- Advanced Memory and Neural Computing
- Embedded Systems Design Techniques
- Advanced Graph Neural Networks
- Ferroelectric and Negative Capacitance Devices
- Tensor decomposition and applications
- Formal Methods in Verification
- Error Correcting Code Techniques
- Data Management and Algorithms
- Real-Time Systems Scheduling
- Complexity and Algorithms in Graphs
- Matrix Theory and Algorithms
- Adversarial Robustness in Machine Learning
- Genetic Associations and Epidemiology
- Caching and Content Delivery
- CCD and CMOS Imaging Sensors
- Evolution and Genetic Dynamics
- Complex Network Analysis Techniques
- Low-power high-performance VLSI design
Carnegie Mellon University, 2015-2024
Universidade Estadual de Campinas (UNICAMP), 2018
The University of Texas at Austin, 2005-2016
We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine the tuning parameters for high-end instantiations of matrix-matrix multiplication. This is of both practical and scientific importance: it greatly reduces the development effort required to implement the level-3 BLAS while also advancing our understanding of how hierarchically layered memories interact with...
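For readers unfamiliar with the structure being modeled, the following is a minimal sketch, not the BLIS code itself, of the Goto/BLIS-style layered loop nest around a micro-kernel. The block sizes NC, KC, MC and the micro-tile sizes MR, NR (the values below are illustrative) are the tuning parameters the analytical model determines; packing and the hand-optimized micro-kernel are omitted.

```c
/* Sketch of the layered (Goto/BLIS-style) blocking for C += A*B,
 * column-major storage with leading dimensions lda/ldb/ldc.
 * NC/KC/MC target the L3/L2/L1 caches; MR x NR is the register tile. */
enum { NC = 4096, KC = 256, MC = 96, MR = 8, NR = 4 };   /* illustrative values */

void gemm_blocked(int m, int n, int k,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double *C, int ldc)
{
    for (int jc = 0; jc < n; jc += NC)                       /* 5th loop: panels of B/C  */
      for (int pc = 0; pc < k; pc += KC)                     /* 4th loop: k dimension    */
        for (int ic = 0; ic < m; ic += MC)                   /* 3rd loop: panels of A/C  */
          for (int jr = jc; jr < jc + NC && jr < n; jr += NR)    /* 2nd loop: NR columns */
            for (int ir = ic; ir < ic + MC && ir < m; ir += MR)  /* 1st loop: MR rows    */
              /* "micro-kernel": an MR x NR block of C updated with KC rank-1 updates */
              for (int p = pc; p < pc + KC && p < k; ++p)
                for (int j = jr; j < jr + NR && j < n; ++j)
                  for (int i = ir; i < ir + MR && i < m; ++i)
                    C[i + j*ldc] += A[i + p*lda] * B[p + j*ldb];
}
```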
This paper has two main contributions. First, we propose a novel coding technique, Generalized PolyDot, for matrix-vector products that advances on existing techniques for coded matrix operations under storage and communication constraints. Next, we apply this technique to the problem of training large Deep Neural Networks (DNNs) on unreliable nodes that are prone to soft errors, e.g., bit flips during computation that produce erroneous outputs. An additional difficulty imposed by DNN training is that the parameter values (weight matrices)...
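To make the general idea of coded matrix-vector products concrete, here is a toy sketch using a plain checksum (parity) code, not Generalized PolyDot itself: one extra worker holds the sum of the other workers' blocks of A, so a single lost or corrupted block of A*x can be reconstructed. All sizes and names are illustrative.

```c
/* Toy checksum-coded matrix-vector product: recover a lost block of y = A*x. */
#include <stdio.h>

#define N       4          /* matrix dimension (toy size)              */
#define WORKERS 2          /* systematic workers, N/WORKERS rows each  */

int main(void) {
    double A[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    double x[N]    = {1,1,1,1};
    int rows = N / WORKERS;

    /* Parity block: element-wise sum of the workers' row blocks of A. */
    double P[N/WORKERS][N] = {{0}};
    for (int w = 0; w < WORKERS; ++w)
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < N; ++j)
                P[i][j] += A[w*rows + i][j];

    /* Each worker computes its block of y = A*x; the parity worker computes P*x. */
    double y[N], yp[N/WORKERS];
    for (int i = 0; i < N; ++i) {
        y[i] = 0;
        for (int j = 0; j < N; ++j) y[i] += A[i][j] * x[j];
    }
    for (int i = 0; i < rows; ++i) {
        yp[i] = 0;
        for (int j = 0; j < N; ++j) yp[i] += P[i][j] * x[j];
    }

    /* If worker 0's block of y is lost, recover it from the parity result. */
    for (int i = 0; i < rows; ++i) {
        double recovered = yp[i];
        for (int w = 1; w < WORKERS; ++w) recovered -= y[w*rows + i];
        printf("row %d: recovered %.1f, true %.1f\n", i, recovered, y[i]);
    }
    return 0;
}
```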
In this paper, we address the question of how to automatically map computational kernels to highly efficient code for a wide range of computing platforms and how to establish the correctness of the synthesized code. More specifically, we focus on two fundamental problems that software developers are faced with: performance portability across the ever-changing landscape of parallel platforms, and correctness guarantees for sophisticated floating-point code. The problem is approached as follows: we develop a formal framework to capture algorithms, platforms, and program...
BLIS is a new software framework for instantiating high-performance BLAS-like dense linear algebra libraries. We demonstrate how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures. The systems for which we do so include state-of-the-art general-purpose, low-power, and many-core architectures. We show that, with very little effort, BLIS yields sequential and parallel implementations that are competitive in performance with ATLAS, OpenBLAS (an effort to maintain and extend the GotoBLAS),...
The importance of the Sparse Matrix dense Vector multiplication (SpMV) operation in graph analytics and numerous scientific applications has led to the development of custom accelerators intended to overcome the difficulties of sparse data operations on general-purpose architectures. However, efficient SpMV on large problems (i.e., where the working set exceeds on-chip storage) is severely constrained by its strong dependence on the limited amount of fast random access memory available at scale. Additionally, unstructured matrices with high...
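For reference, a baseline CSR SpMV kernel is shown below (names are illustrative); the irregular, data-dependent gathers from the dense vector via the column indices are the source of the random-access pressure described above.

```c
/* Baseline CSR SpMV: y = A*x for an n_rows x n_cols sparse matrix A. */
void spmv_csr(int n_rows,
              const int *row_ptr, const int *col_idx, const double *val,
              const double *x, double *y)
{
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];   /* gather from x: irregular access */
        y[i] = sum;
    }
}
```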
Symmetric tensor operations arise in a wide variety of computations. However, the benefit of exploiting symmetry in order to reduce storage and computation is in conflict with the desire to simplify memory access patterns. In this paper, we propose a blocked data structure (Blocked Compact Storage) wherein we consider the tensor by blocks and store only the unique blocks of a symmetric tensor. We propose an algorithm-by-blocks, already shown to be of benefit for matrix computations, that exploits this format by utilizing a series of temporary tensors to avoid redundant...
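As a down-scaled illustration of "store only the unique blocks" (the matrix case only; the paper targets higher-order tensors, and all names and sizes here are assumptions), an N x N symmetric matrix tiled into blocks can keep just the blocks with block-row <= block-col and transpose indices on access:

```c
/* Blocked compact storage of a symmetric matrix: only upper-triangular blocks kept. */
#define NB 4           /* number of blocks per dimension (assumed) */
#define BS 64          /* block size (assumed)                     */

/* Linear index of stored block (bi, bj), bi <= bj, upper triangle laid out row by row. */
static int block_index(int bi, int bj) {
    return bi * NB - bi * (bi - 1) / 2 + (bj - bi);
}

/* Element (i, j): map to the stored block, swapping indices when below the diagonal. */
double sym_get(const double *blocks, int i, int j) {
    int bi = i / BS, bj = j / BS, ii = i % BS, jj = j % BS;
    if (bi > bj) { int t = bi; bi = bj; bj = t; t = ii; ii = jj; jj = t; }
    return blocks[(long)block_index(bi, bj) * BS * BS + ii * BS + jj];
}
```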
We present two high-performance implementations of the convolution operator via the direct algorithm that outperform the so-called lowering approach, based on the im2col transform plus the gemm kernel, on an ARMv8-based processor. One of our methods presents the additional advantage of zero memory overhead, while the other employs an additional yet rather moderate workspace, substantially smaller than that required by the im2col+gemm solution. In contrast with a previous implementation of a similar direct convolution, this work exhibits the key advantage of preserving...
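For contrast with the lowering approach, a reference direct convolution is sketched below (NCHW layout, unit stride, no padding; the paper's implementations reorganize and vectorize these loops, and the names here are illustrative):

```c
/* Reference direct convolution: out = filt (*) in, no extra workspace. */
void conv_direct(int C_in, int C_out, int H, int W, int K,
                 const float *in,    /* C_in  x H x W             */
                 const float *filt,  /* C_out x C_in x K x K      */
                 float *out)         /* C_out x (H-K+1) x (W-K+1) */
{
    int Ho = H - K + 1, Wo = W - K + 1;
    for (int co = 0; co < C_out; ++co)
        for (int ho = 0; ho < Ho; ++ho)
            for (int wo = 0; wo < Wo; ++wo) {
                float acc = 0.0f;
                for (int ci = 0; ci < C_in; ++ci)
                    for (int kh = 0; kh < K; ++kh)
                        for (int kw = 0; kw < K; ++kw)
                            acc += in[(ci * H + ho + kh) * W + wo + kw] *
                                   filt[((co * C_in + ci) * K + kh) * K + kw];
                out[(co * Ho + ho) * Wo + wo] = acc;
            }
}
```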
The computation of convolution layers in deep neural networks typically relies on high-performance routines that trade space for time by using additional memory (either for packing purposes or as required by the algorithm) to improve performance. The problems with such an approach are two-fold. First, these routines incur a memory overhead which reduces the overall size of the network that can fit on embedded devices with limited memory capacity. Second, these routines were not optimized for performing convolution, which means that the performance obtained is usually less than conventionally...
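The space-for-time trade referred to above is easiest to see in the im2col lowering step, sketched here (illustrative names; same layout assumptions as the direct convolution sketch): the input is expanded into a (C_in*K*K) x (Ho*Wo) matrix so that convolution becomes one gemm, at the cost of a workspace roughly K*K times the input size.

```c
/* im2col: expand the input so that convolution reduces to a single gemm. */
void im2col(int C_in, int H, int W, int K,
            const float *in,   /* C_in x H x W                    */
            float *col)        /* (C_in*K*K) x (Ho*Wo), row-major */
{
    int Ho = H - K + 1, Wo = W - K + 1;
    for (int ci = 0; ci < C_in; ++ci)
        for (int kh = 0; kh < K; ++kh)
            for (int kw = 0; kw < K; ++kw) {
                int row = (ci * K + kh) * K + kw;
                for (int ho = 0; ho < Ho; ++ho)
                    for (int wo = 0; wo < Wo; ++wo)
                        col[row * Ho * Wo + ho * Wo + wo] =
                            in[(ci * H + ho + kh) * W + wo + kw];
            }
}
```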
Cyber-physical systems (CPSs), ranging from critical infrastructures such as power plants, to modern (semi) autonomous vehicles, are systems that use software to control physical processes. CPSs are made up of many different computational components. Each component runs its own piece of software that implements its algorithms, based on its model of the environment. Each component then interacts with other components through the signals and values it sends out. Collectively, these components, and the code they run, drive the complex behaviors society has come...
A theorem related to the accumulation of Householder transformations into a single orthogonal transformation, known as the compact WY transform, is presented. It provides a simple characterization of the computation of this transform and suggests an alternative algorithm for computing it. It also suggests an alternative transformation, the UT transform, with the same utility as the compact WY transform, which requires less computation and has similar stability properties. That transformation was first published over a decade ago but has gone unnoticed by the community.
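For context, a sketch of the two accumulation forms in one common convention; the convention and symbols below are assumptions for illustration, not taken from the abstract.

```latex
% Convention (assumed): H_i = I - \tfrac{1}{\tau_i}\, u_i u_i^{\mathsf T},
% with \tau_i = \tfrac{1}{2} u_i^{\mathsf T} u_i so that each H_i is orthogonal.
% Compact WY accumulation: for U = [\,u_0 \mid \cdots \mid u_{k-1}\,] there is an
% upper triangular T, built column by column via a recurrence, such that
\[
  H_0 H_1 \cdots H_{k-1} \;=\; I - U\,T\,U^{\mathsf T}.
\]
% UT transform: the same product written with an inverse instead,
\[
  H_0 H_1 \cdots H_{k-1} \;=\; I - U\,T_{\mathrm{UT}}^{-1}\,U^{\mathsf T},
  \qquad
  T_{\mathrm{UT}} \;=\; \operatorname{striu}(U^{\mathsf T} U)
                       + \operatorname{diag}(\tau_0,\ldots,\tau_{k-1}),
\]
% i.e. T_{UT} is read off directly from U^{\mathsf T} U rather than accumulated.
```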
Data movement between the processor and the memory hierarchy is a fundamental bottleneck that limits the performance of many applications on modern computer architectures. Tiling and loop permutation are key techniques for improving data locality. However, selecting effective tile-sizes and loop permutations is particularly challenging for tensor contractions due to the large number of loops. Even state-of-the-art compilers usually produce sub-optimal tile-sizes and permutations, as they rely on naïve cost models. In this paper we provide an...
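The two transformations in question are shown below for the simplest contraction, C(i,j) += A(i,k) * B(k,j); the loop order of the tile loops is the permutation choice and Ti/Tj/Tk are the tile sizes, both of which must be chosen per contraction and per machine (the values and names here are illustrative).

```c
/* Tiled, permuted loop nest for C(i,j) += A(i,k) * B(k,j), row-major. */
enum { Ti = 32, Tj = 32, Tk = 32 };   /* illustrative tile sizes */

void contract_tiled(int NI, int NJ, int NK,
                    const double *A, const double *B, double *C)
{
    for (int it = 0; it < NI; it += Ti)
      for (int kt = 0; kt < NK; kt += Tk)
        for (int jt = 0; jt < NJ; jt += Tj)            /* one of many tile-loop permutations */
          for (int i = it; i < it + Ti && i < NI; ++i)
            for (int k = kt; k < kt + Tk && k < NK; ++k)
              for (int j = jt; j < jt + Tj && j < NJ; ++j)
                C[i * NJ + j] += A[i * NK + k] * B[k * NJ + j];
}
```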
Computing systems are evolving rapidly. At the device level, emerging devices are beginning to compete with traditional CMOS systems. At the architecture level, novel architectures are successfully avoiding the communication bottleneck that is a central feature, and limitation, of the von Neumann architecture. Furthermore, such systems are increasingly plagued by unreliability. This unreliability arises at the device or gate level and can percolate up to the processor and system level if left unchecked. The goal of this article is to survey recent...
Linear algebra-based approaches to exact triangle counting often require sparse matrix multiplication as a primitive operation. Non-linear-algebra approaches to the same problem assume that the adjacency matrix of the graph is not available. In this paper, we show that both approaches can be unified into a single approach that separates the data format from the algorithm design. By casting the problem as a matrix multiplication, different algorithms in which each triangle is counted exactly once can be identified. In addition, by choosing the appropriate data format, an algorithm equivalent to compact-forward is attained. We show that our approach yields...
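A dense toy version of the "each triangle counted exactly once" formulation is sketched below: with L the strictly lower-triangular part of the adjacency matrix A, the count is sum((L*L) .* L). Real implementations use a sparse, masked matrix product; this sketch (names assumed) is only to make the formulation concrete.

```c
/* Count triangles of a simple undirected graph given its dense 0/1 adjacency matrix. */
long count_triangles(int n, const unsigned char *A /* n x n, symmetric, zero diagonal */)
{
    long count = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < i; ++j) {              /* only where L(i,j) = 1 */
            if (!A[i * n + j]) continue;
            long dot = 0;                          /* (L*L)(i,j) = sum_k L(i,k)*L(k,j) */
            for (int k = j + 1; k < i; ++k)
                dot += (A[i * n + k] & A[k * n + j]);
            count += dot;                          /* triangle {j,k,i} counted once */
        }
    return count;
}
```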
In this work we present a performance exploration of Eager K-truss, a linear-algebraic formulation of the K-truss graph algorithm. We address issues related to the load imbalance of parallel tasks in symmetric, triangular graphs by presenting a fine-grained approach to executing the support computation. This approach also increases the available parallelism, making it amenable to GPU execution. We demonstrate our approach using implementations in Kokkos and evaluate them on an Intel Skylake CPU and an Nvidia Tesla V100 GPU. Overall, we observe between...
Graphs play a key role in data analytics. Graphs and the software systems used to work with them are highly diverse. Algorithms interact with hardware in different ways, and which graph solution works best on a given platform changes with the structure of the graph. This makes it difficult to decide which programming framework is best for a given situation. In this paper, we try to make sense of this diverse landscape. We evaluate five frameworks for graph analytics: SuiteSparse GraphBLAS, Galois, the NWGraph library, the Graph Kernel Collection, and GraphIt. We use the GAP Benchmark...
Edge-centric algorithms using the linear algebraic approach typically require the use of both incidence and adjacency matrices. Due to the two different data formats, information contained in the graph is replicated, thereby incurring time and space overheads for the replication. Using K-truss as an example, we demonstrate that an edge-centric algorithm, Eager K-truss, can be identified from a formulation that uses only the adjacency matrix. In addition, our implementation of the algorithm out-performs Galois' by an average (over 53 graphs) of more than 13 times, and by up to 71 times.
We propose a coded computing strategy for the Fast Fourier Transform (FFT) algorithm in a fully distributed setting, which does not have a powerful master node orchestrating the worker nodes. The setting requires a large amount of data movement between nodes, and this communication is often the bottleneck in parallel computing. We identify the cost of each step of the FFT using the α-β model, commonly used in the high-performance computing literature to estimate communication latency. We show that by using a (P, K) systematic MDS code, the overhead of coding is negligible...
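For readers unfamiliar with it, the α-β model charges each message by a latency term plus a bandwidth term; the per-step costs the paper accumulates are sums of such terms along the critical path (the specifics of the (P, K) code are not reproduced from the truncated abstract).

```latex
% alpha-beta (latency-bandwidth) model: sending a message of $n$ words costs
\[
  T(n) \;=\; \alpha + \beta\, n ,
\]
% where $\alpha$ is the per-message latency and $\beta$ the per-word
% (inverse-bandwidth) cost. The total cost of a distributed algorithm is the
% sum of these terms over all communication steps on the critical path, which
% is how the coded and uncoded FFT variants are compared.
```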
We discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use of loop and array indices, and represents these algorithms as they are formally derived and presented. We report on two implementations of the workqueuing model, neither of which requires explicit indices to specify parallelism. The first implementation uses the experimental taskq pragma, which may influence the adoption of a similar construct into...
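A generic illustration (not the FLAME code itself, and process_block is a hypothetical routine) of the workqueuing idea in today's OpenMP terms: independent block operations are enqueued as tasks and scheduled by the runtime, with no explicit index-based partitioning of the loop. The paper predates the standard task construct and used the experimental taskq pragma instead.

```c
/* Task-based parallelism over matrix blocks, without explicit index partitioning. */
#include <omp.h>

void process_block(double *block, int size);   /* hypothetical per-block operation */

void apply_to_blocks(double **blocks, int nblocks, int block_size)
{
    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < nblocks; ++b) {
            double *blk = blocks[b];
            #pragma omp task firstprivate(blk)
            process_block(blk, block_size);
        }
    }   /* tasks complete at the implicit barrier closing the single region */
}
```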
PageRank is an important vertex-ranking algorithm that suffers from poor performance and efficiency due to its notorious memory access behavior. Furthermore, as graphs become bigger and sparser, applications are inhibited because most current solutions rely profoundly on large amounts of fast random-access memory, which is not easily scalable. In this paper we present a 16nm ASIC-based shared memory platform for the implementation of PageRank that fundamentally accelerates Sparse Matrix dense Vector multiplication (SpMV), the core kernel of PageRank....
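The role of SpMV is visible in a reference power iteration for PageRank, sketched below (simplified: dangling nodes ignored, CSR-stored transition matrix, illustrative names; d is the damping factor). The inner CSR loop is the SpMV that dominates the run time and exhibits the irregular memory accesses described above.

```c
/* Simplified PageRank power iteration built around a CSR SpMV. */
#include <stdlib.h>

void pagerank(int n, const int *row_ptr, const int *col_idx, const double *val,
              double *rank, int iters, double d)
{
    double *next = malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) rank[i] = 1.0 / n;
    for (int t = 0; t < iters; ++t) {
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;                       /* SpMV row: irregular gathers of rank[] */
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += val[k] * rank[col_idx[k]];
            next[i] = (1.0 - d) / n + d * sum;      /* damping / teleport term */
        }
        for (int i = 0; i < n; ++i) rank[i] = next[i];
    }
    free(next);
}
```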
Software tools for linkage disequilibrium (LD) analyses are designed to calculate LD among all genetic variants in a single region. Since compute and memory requirements grow quadratically with the distance between variants, using these tools for long-range LD calculations leads to long execution times and an increased allocation of computational resources. Furthermore, widely used tools do not fully utilize the computational resources of modern processors and/or graphics processing cards, limiting future large-scale analyses on thousands of samples....
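For reference, the standard pairwise LD statistics such tools compute for a pair of biallelic variants are:

```latex
% Allele frequencies $p_A$, $p_B$ and haplotype frequency $p_{AB}$ give
\[
  D \;=\; p_{AB} - p_A\, p_B ,
  \qquad
  r^2 \;=\; \frac{D^2}{p_A(1 - p_A)\, p_B(1 - p_B)} .
\]
% Evaluating $r^2$ for all pairs of $m$ variants is an $O(m^2)$ workload,
% which is why cost grows quadratically with the genomic range considered.
```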
Current microprocessor trends show a steady increase in the number of cores and/or threads present on the same CPU die. While this improves performance for compute-bound applications, the benefits for memory-bound applications are limited. The discrete Fourier transform (DFT) is an example of such an application, where increasing the number of cores does not yield a corresponding increase in performance. In this paper, we present an alternate solution for using the increased number of cores/threads available on a typical multicore system. We propose to repurpose some of the cores as soft...
Implementing complex arithmetic routines with Single Instruction Multiple Data (SIMD) instructions requires the use of instructions that are usually not found in their real-arithmetic counterparts. These instructions, such as shuffles and addsubs, are often bottlenecks for many kernels, as modern architectures can execute more arithmetic operations than such data-reorganization operations. In this work, we focus on using a variety of data layouts (mixed format) for storing complex numbers at different stages of the computation so as to limit the use of these instructions. Using matrix...
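The layout effect is easy to see in plain C (illustrative sketch, not the paper's kernels): with the interleaved (re,im,re,im,...) format a vectorized complex multiply needs shuffle/addsub-style reorganization, whereas the split format turns the same multiply into pure element-wise vector arithmetic.

```c
/* Interleaved layout: a[2*i] = Re(a_i), a[2*i+1] = Im(a_i). */
void cmul_interleaved(int n, const float *a, const float *b, float *c) {
    for (int i = 0; i < n; ++i) {
        float ar = a[2*i], ai = a[2*i+1], br = b[2*i], bi = b[2*i+1];
        c[2*i]   = ar * br - ai * bi;    /* vectorizing this needs lane shuffles */
        c[2*i+1] = ar * bi + ai * br;
    }
}

/* Split layout: separate arrays of real and imaginary parts.
 * Each output is an element-wise expression over whole arrays, so SIMD code
 * needs no cross-lane data reorganization. */
void cmul_split(int n, const float *ar, const float *ai,
                const float *br, const float *bi, float *cr, float *ci) {
    for (int i = 0; i < n; ++i) {
        cr[i] = ar[i] * br[i] - ai[i] * bi[i];
        ci[i] = ar[i] * bi[i] + ai[i] * br[i];
    }
}
```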
We propose FFTX, a new framework for building high-performance FFT-based applications on exascale machines. Complex node architectures lead to multiple levels of parallelism and demand efficient ways of managing data communication. The current FFTW interface falls short in maximizing performance in such scenarios. FFTX is designed to enable application developers to leverage expert-level, automatic optimizations while navigating a familiar interface. It is backwards compatible with FFTW and extends the FFTW interface into an embedded...
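The "familiar interface" is FFTW's plan/execute style, shown below with standard FFTW 3 calls (FFTX's own extensions are not reproduced from the truncated abstract); the separation of an expensive planning step from execution is the hook such frameworks build on.

```c
/* Standard FFTW 3 plan/execute usage for a 1-D complex forward transform. */
#include <fftw3.h>

void forward_fft(int n)
{
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

    /* Planning: the optimization step, done once and reused across executions. */
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    /* ... fill `in` with data ... */
    fftw_execute(p);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
}
```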
Genomic datasets are steadily growing in size as more genomes are sequenced and new genetic variants are discovered. Datasets that comprise thousands of samples and millions of single-nucleotide polymorphisms (SNPs) exhibit excessive computational demands and can lead to prohibitively long analyses, making the deployment of high-performance approaches a prerequisite for the thorough analysis of current and future large-scale datasets. In this work, we demonstrate a kernel for calculating linkage disequilibrium (LD) across genomes, i.e.,...