Maurício Serrano

ORCID: 0000-0003-0250-5881
Research Areas
  • Parallel Computing and Optimization Techniques
  • Logic, programming, and type systems
  • Advanced Data Storage Technologies
  • Security and Verification in Computing
  • Distributed systems and fault tolerance
  • Software Engineering Research
  • Distributed and Parallel Computing Systems
  • Software Engineering Techniques and Practices
  • Software System Performance and Reliability
  • Interconnection Networks and Systems
  • Software Testing and Debugging Techniques
  • Speech Recognition and Synthesis
  • Advanced Software Engineering Methodologies
  • Embedded Systems Design Techniques
  • Advanced Neural Network Applications
  • Graph Theory and Algorithms
  • Cloud Computing and Resource Management
  • Advanced Graph Neural Networks
  • Formal Methods in Verification
  • Ferroelectric and Negative Capacitance Devices
  • Real-Time Systems Scheduling
  • Music and Audio Processing
  • Speech and Audio Processing
  • Service-Oriented Architecture and Web Services
  • Advanced Database Systems and Queries

IBM (United States)
1999-2021

Universidade de Brasília
2015-2021

IBM Research - Thomas J. Watson Research Center
1999-2020

Pontifical Catholic University of Rio de Janeiro
2008-2011

Shell (Malaysia)
2008

University of California, Santa Barbara
1993-2007

Intel (United States)
2001-2004

Los Angeles Mission College
2001

Jalapeño is a virtual machine for Java™ servers written in the Java language. To be able to address the requirements of servers (performance and scalability in particular), Jalapeño was designed "from scratch" to be as self-sufficient as possible. Jalapeño's unique object model and memory layout allows a hardware null-pointer check as well as fast access to array elements, fields, and methods. Run-time services conventionally provided in native code are implemented primarily in Java. Java threads are multiplexed by virtual processors (implemented as operating system...

10.1147/sj.391.0211 article EN IBM Systems Journal 2000-01-01

This paper presents a simple and efficient data flow algorithm for escape analysis of objects in Java programs to determine (i) if an object can be allocated on the stack; (ii) if an object is accessed only by a single thread during its lifetime, so that synchronization operations on that object can be removed. We introduce a new program abstraction for escape analysis, the connection graph, that is used to establish reachability relationships between objects and object references. We show that the connection graph can be summarized for each method such that the same summary information may be used effectively in different...

10.1145/320384.320386 article EN 1999-10-01

Many studies point to the difficulty of scaling existing computer architectures to meet the needs of an exascale system (i.e., capable of executing $10^{18}$ floating-point operations per second), consuming no more than 20 MW in power, by around the year 2020. This paper outlines a new architecture, the Active Memory Cube, which reduces the energy...

10.1147/jrd.2015.2409732 article EN IBM Journal of Research and Development 2015-03-01

The Jalapeño dynamic optimizing compiler for Java. Authors: Michael G. Burke (IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY), Jong-Deok Choi, Stephen Fink, David Grove, Michael Hind, Vivek Sarkar, Mauricio Serrano, V. C. Sreedhar, Harini Srinivasan, John Whaley. JAVA '99: Proceedings of the ACM 1999 conference on Java Grande, June 1999, pages 129–141...

10.1145/304065.304113 article EN 1999-06-01

Language-supported synchronization is a source of serious performance problems in many Java programs. Even single-threaded applications may spend up to half their time performing useless synchronization due to the thread-safe nature of the Java libraries. We solve this performance problem with a new algorithm that allows lock and unlock operations to be performed with only a few machine instructions in the most common cases. Our locks only require a partial word per object, and were implemented without increasing object size. We present measurements from our...

10.1145/277650.277734 article EN Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation 1998-05-01

As deep neural networks get more complex and input datasets get larger, it can take days or even weeks to train a deep neural network to the desired accuracy. Therefore, enabling distributed deep learning at a massive scale is critical since it offers the potential to reduce the training time from weeks to hours. In this article, we present BlueConnect, an efficient communication library for distributed deep learning that is highly optimized for popular GPU-based platforms. BlueConnect decomposes a single all-reduce operation into a large number of parallelizable...

10.1147/jrd.2019.2947013 article EN IBM Journal of Research and Development 2019-10-14

Advances in deep neural networks (DNNs) and the availability of massive real-world data have enabled superhuman levels of accuracy on many AI tasks and ushered in the explosive growth of AI workloads across the spectrum of computing devices. However, their superior accuracy comes at a high computational cost, which necessitates approaches beyond traditional computing paradigms to improve their operational efficiency. Leveraging the application-level insight of error resilience, we demonstrate how approximate computing (AxC) can significantly boost the efficiency...

10.1109/jproc.2020.3029453 article EN Proceedings of the IEEE 2020-11-10

The growing prevalence and computational demands of Artificial Intelligence (AI) workloads have led to widespread use of hardware accelerators in their execution. Scaling the performance of AI accelerators across generations is pivotal to their success in commercial deployments. The intrinsic error-resilient nature of AI workloads presents a unique opportunity for performance/energy improvement through precision scaling. Motivated by the recent algorithmic advances in precision scaling for inference and training, we designed RaPiD...

10.1109/isca52012.2021.00021 article EN 2021-06-01

This article presents an escape analysis framework for Java to determine (1) if an object is not reachable after its method of creation returns, allowing the object to be allocated on the stack, and (2) if an object is reachable only from a single thread during its lifetime, allowing unnecessary synchronization operations on that object to be removed. We introduce a new program abstraction for escape analysis, the connection graph, that is used to establish reachability relationships between objects and object references. We show that the connection graph can be succinctly summarized for each method such that the same summary information may be used in...

10.1145/945885.945892 article EN ACM Transactions on Programming Languages and Systems 2003-11-01

Calling context profiles are used in many inter-procedural code optimizations and in overall program understanding. Unfortunately, the collection of profile information is highly intrusive due to the high frequency of method calls in most applications. Previously proposed calling-context profiling mechanisms consequently suffer from either low accuracy, high overhead, or both. We have developed a new approach for building the calling context tree at runtime, called adaptive bursting. By selectively inhibiting redundant...

10.1145/1133255.1134012 article EN ACM SIGPLAN Notices 2006-06-11

This paper studies the memory behavior of important Java workloads used in benchmarking Java Virtual Machines (JVMs), based on instrumentation of both application and library code in a state-of-the-art JVM, and provides structured information about these workloads to help guide systems' design. We begin by characterizing the inherent memory behavior of the benchmarks, such as information on the breakup of heap accesses among different categories and on the hotness of references to fields and methods. We then provide detailed information about misses in the data TLB and caches, including the distribution of misses over different kinds of accesses. In...

10.1145/378420.378783 article EN 2001-06-01

Calling context profiles are used in many inter-procedural code optimizations and in overall program understanding. Unfortunately, the collection of profile information is highly intrusive due to the high frequency of method calls in most applications. Previously proposed calling-context profiling mechanisms consequently suffer from either low accuracy, high overhead, or both. We have developed a new approach for building the calling context tree at runtime, called adaptive bursting. By selectively inhibiting redundant...

10.1145/1133981.1134012 article EN 2006-06-11

Deep Neural Networks (DNNs) have emerged as a core tool for machine learning. The computations performed during DNN training and inference are dominated by operations on the weight matrices describing the DNN. As DNNs incorporate more stages and more nodes per stage, these weight matrices may be required to be sparse because of memory limitations. The GraphBLAS.org math library standard was developed to provide high performance manipulation of sparse weight matrices and input/output vectors. For sufficiently sparse matrices, a sparse matrix library requires significantly less memory than...

10.1109/hpec.2017.8091098 preprint EN 2017-09-01

Cache miss stalls hurt performance because of the large gap between memory and processor speeds - for example, the popular server benchmark SPEC JBB2000 spends 45% of its cycles stalled waiting for memory requests on the Itanium® 2 processor. Traversing linked data structures causes a large portion of these stalls. Prefetching for linked data structures remains a major challenge because serial data dependencies between elements in a linked data structure preclude the timely materialization of prefetch addresses. This paper presents Mississippi Delta (MS Delta), a novel technique for prefetching...

10.1145/996841.996873 article EN Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation 2004-06-09

This paper presents the design and implementation of the Quicksilver quasi-static compiler for Java. Quasi-static compilation is a new approach that combines the benefits of static and dynamic compilation, while maintaining compliance with the Java standard, including support for its dynamic features. A quasi-static compiler relies on the generation and reuse of persistent code images to reduce the overhead of compilation during program execution, and to provide identical, testable and reliable binaries over different program executions. At runtime, the quasi-static compiler adapts pre-compiled code images to the current JVM instance,...

10.1145/353171.353176 article EN 2000-10-01

Transparency is a critical concern to democratic societies. As software permeates our social lives, Software Transparency is becoming a quality criterion that demands more attention from software developers. We present in this paper an approach to capture transparency-related requirements patterns through argumentation. We represented initial transparency knowledge as requirements patterns. These patterns stimulated stakeholders' arguments on an open discussion about the transparency of a given software. We apply an argumentation framework to the graphs. Transparency-related...

10.1109/repa.2011.6046723 article EN 2011-08-01

Multistreamed processors can significantly improve processor throughput by allowing interleaved execution of instructions from multiple instruction streams. We present an analytical modeling technique to evaluate the effect of dynamically interleaving additional instruction streams within superscalar architectures. Using this technique, estimates of the instructions executed per cycle (IPC) for a processor architecture are quickly calculated given simple descriptions of the workload and hardware characteristics. To validate this technique, SPEC89 benchmark...

10.1109/hicss.1994.323172 article EN 1994-01-01

Trace-based compilation is a promising technique for language compilers and binary translators. It offers the potential to expand the compilation scopes that have traditionally been limited by method boundaries. Detecting repeating cyclic execution paths and capturing the detected repetitions into traces is a key requirement for trace selection algorithms to achieve good optimization and performance with small amounts of code. One important class of repetition detection is cyclic-path-based repetition detection, where a cyclic execution path (a path that starts and ends at the same...

10.1145/1950365.1950412 article EN 2011-03-05

The ubiquitous adoption of systems specialized for AI requires bridging two seemingly conflicting challenges: the need to deliver extreme processing efficiencies while employing familiar programming interfaces, making them compelling even for non-expert users. We take a significant first step towards this goal and present an end-to-end software stack for the RaPiD AI accelerator developed by IBM Research. We present a set of software extensions, called Deeptools, that leverage and work within popular deep learning frameworks....

10.1109/mm.2019.2931584 article EN IEEE Micro 2019-07-31

This paper presents a simple and efficient data flow algorithm for escape analysis of objects in Java programs to determine (i) if an object can be allocated on the stack; (ii) if an object is accessed only by a single thread during its lifetime, so that synchronization operations on that object can be removed. We introduce a new program abstraction for escape analysis, the connection graph, that is used to establish reachability relationships between objects and object references. We show that the connection graph can be summarized for each method such that the same summary information may be used effectively in different calling...

10.1145/320385.320386 article EN ACM SIGPLAN Notices 1999-10-01

A two-dimensional mesh of processing elements (PE's) with separable row and column buses (i.e., broadcast mechanisms for rows and columns that can be logically divided into a number of local buses through the use of PE-controlled switches) has been shown to be quite effective for semigroup computation, prefix computation, and a wide class of other computations that do not require excessive communication or data routing. For meshes with separable row/column buses, the authors show how semigroup and prefix computations can be performed with the same asymptotic time complexity without the provision of buses for every row and every column, and discuss...

10.1109/71.246069 article EN IEEE Transactions on Parallel and Distributed Systems 1993-01-01

We present an approach for building calling context information useful for program understanding, performance analysis and optimizations. Our approach exploits a lightweight profiling mechanism providing partial call traces. The goal is to reconstruct the calling context tree as accurately as possible, and to help the user navigate through it. We propose three steps to merge the partial call traces into a smaller number of call trees. We intend to minimize errors such that the final calling contexts represent actual components of the real calling context tree with very high probability. The first step...

10.1109/cgo.2009.12 article EN 2009-03-01

We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts). Using a 4-bit integer representation, a naïve quantization approach applied to the LSTM portion of these models results in significant Word Error Rate (WER) degradation. On the other hand, we show that minimal accuracy loss...

10.21437/interspeech.2021-1962 article EN Interspeech 2022 2021-08-27

Graph processing is becoming a crucial component for analyzing big data arising in many application domains such as social and biological networks, fraud detection, and sentiment analysis. As a result, a number of computational models for graph analytics have been proposed in the literature to help users write efficient large scale graph algorithms. In this paper we present an alternative model for implementing graph algorithms using a linear algebra based specification. We first specify a set of linear algebra primitives that allows users to express...

10.1145/2903150.2903164 article EN 2016-05-16