José Nelson Amaral

ORCID: 0000-0002-9943-1809
Research Areas
  • Parallel Computing and Optimization Techniques
  • Distributed and Parallel Computing Systems
  • Advanced Data Storage Technologies
  • Distributed systems and fault tolerance
  • Software Testing and Debugging Techniques
  • Cloud Computing and Resource Management
  • Software Engineering Research
  • Embedded Systems Design Techniques
  • Logic, programming, and type systems
  • Interconnection Networks and Systems
  • Algorithms and Data Compression
  • Network Packet Processing and Optimization
  • Formal Methods in Verification
  • AI-based Problem Solving and Planning
  • Caching and Content Delivery
  • Advanced Neural Network Applications
  • Software System Performance and Reliability
  • VLSI and FPGA Design Techniques
  • Advanced Database Systems and Queries
  • Neural Networks and Applications
  • Software-Defined Networks and 5G
  • Optimization and Packing Problems
  • Scheduling and Optimization Algorithms
  • Security and Verification in Computing
  • Computational Physics and Python Applications

University of Alberta
2016-2025

University of California System
2021

Athabasca University
2004-2016

Universidade Estadual de Campinas (UNICAMP)
2015

IBM (Canada)
2014

The University of Texas at Austin
1992-2002

University of Delaware
1998-2002

Pontifícia Universidade Católica do Rio Grande do Sul
1990-2002

This paper describes an end-to-end system implementation of the transactional memory (TM) programming model on top of the hardware transactional memory (HTM) of the Blue Gene/Q (BG/Q) machine. The TM programming model supports most C/C++ constructs on top of a best-effort HTM with the help of a complete software stack including the compiler, the kernel, and the TM runtime.

10.1145/2370816.2370836 article EN 2012-09-19

Syntax errors are made by novice and experienced programmers alike; however, novice programmers lack the years of experience that help them quickly resolve these frustrating errors. Standard LR parsers are of little help, typically resolving syntax errors and their precise location poorly. We propose a methodology that locates where syntax errors occur and suggests possible changes to the token stream that can fix the error identified. This methodology finds syntax errors by using language models trained on correct source code to find tokens that seem out of place. Fixes are synthesized by consulting...

10.1109/saner.2018.8330219 article EN 2018-03-01

A frustrating aspect of software development is that compiler error messages often fail to locate the actual cause of a syntax error. An errant semicolon or brace can result in many errors reported throughout the file. We seek to find the actual source of these syntax errors by relying on the consistency of software: valid source code is usually repetitive and unsurprising. We exploit this consistency by constructing a simple N-gram language model of lexed source code tokens. We implemented an automatic Java syntax-error locator using the corpus of the project itself and evaluated its performance...

10.1145/2597073.2597102 article EN 2014-05-20
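The N-gram idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the trigram order, and the scoring rule (counting how many unseen trigrams cover each token) are assumptions chosen for brevity, standing in for a proper language-model score.

```python
from collections import Counter

def train_trigrams(token_streams):
    """Count every token trigram seen in known-good token streams."""
    counts = Counter()
    for tokens in token_streams:
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        for i in range(len(padded) - 2):
            counts[tuple(padded[i:i + 3])] += 1
    return counts

def most_surprising_index(tokens, counts):
    """Return the index of the token covered by the most unseen trigrams."""
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
    unseen = [0] * len(padded)
    for i in range(len(padded) - 2):
        if counts[tuple(padded[i:i + 3])] == 0:
            # Every token inside an unseen trigram becomes more suspicious.
            for j in range(i, i + 3):
                unseen[j] += 1
    scores = unseen[2:2 + len(tokens)]  # drop the two start markers
    return max(range(len(tokens)), key=lambda k: scores[k])

# Hypothetical corpus of already-lexed, valid statements.
corpus = [
    ["int", "x", "=", "1", ";"],
    ["int", "y", "=", "2", ";"],
    ["x", "=", "y", ";"],
]
model = train_trigrams(corpus)
broken = ["int", "x", "=", "=", "1", ";"]  # stray "=" token
suspect = most_surprising_index(broken, model)
```

Here `suspect` points into the region around the duplicated `=`, which is where a repair tool would start looking.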

The rapid adoption and the diversification of cloud computing technology exacerbate the importance of a sound experimental methodology for this domain. This work investigates how to measure and report performance in the cloud, and how well the cloud research community is already doing it. We propose a set of eight important methodological principles that combine best practices from nearby fields with concepts applicable only to clouds, and with new ideas about the time-accuracy trade-off. We show how these principles are applicable using a practical use-case experiment. To...

10.1109/tse.2019.2927908 article EN publisher-specific-oa IEEE Transactions on Software Engineering 2019-07-10

This paper describes the design and implementation of a scalable run-time system and an optimizing compiler for Unified Parallel C (UPC). An experimental evaluation on BlueGene/L®, a distributed-memory machine, demonstrates that the combination of the compiler with the runtime system produces programs with performance comparable to that of efficient MPI programs and good scalability up to hundreds of thousands of processors. Our runtime system design solves the problem of maintaining shared object consistency efficiently in a distributed memory machine. Our compiler infrastructure simplifies code...

10.1145/1133255.1133995 article EN ACM SIGPLAN Notices 2006-06-11

An unsound claim can misdirect a field, encouraging the pursuit of unworthy ideas and the abandonment of promising ideas. An inadequate description of a claim can make it difficult to reason about the claim, for example, to determine whether the claim is sound. Many practitioners will acknowledge the threat of unsound claims or inadequate descriptions of claims to their field. We believe that this situation is exacerbated, and even encouraged, by the lack of a systematic approach to exploring, exposing, and addressing the source of unsound claims and poor exposition. This article proposes a framework that identifies three...

10.1145/2983574 article EN ACM Transactions on Programming Languages and Systems 2016-10-13

How should digital design be taught to computing science students in a single one-semester course? This work advocates the use of state-of-the-art design tools and programmable devices and presents a series of laboratory exercises to help students learn digital logic. Each exercise introduces new concepts and produces a complete stand-alone apparatus that is fun and interesting to use. These exercises lead to the most challenging capstone designs for a single-semester course of which the authors are aware. Fast progress is made possible by providing students with predesigned...

10.1109/te.2004.837048 article EN IEEE Transactions on Education 2005-02-01

This paper describes the design and implementation of a scalable run-time system and an optimizing compiler for Unified Parallel C (UPC). An experimental evaluation on BlueGene/L®, a distributed-memory machine, demonstrates that the combination of the compiler with the runtime system produces programs with performance comparable to that of efficient MPI programs and good scalability up to hundreds of thousands of processors. Our runtime system design solves the problem of maintaining shared object consistency efficiently in a distributed memory machine. Our compiler infrastructure simplifies code...

10.1145/1133981.1133995 article EN 2006-06-11

Most contemporary processors offer some version of Single Instruction Multiple Data (SIMD) machinery: vector registers and instructions to manipulate data stored in such registers. The central idea of this paper is to use these SIMD resources to improve the performance of the tail of recursive sorting algorithms. When the number of elements to be sorted reaches a set threshold, data is loaded into the vector registers, manipulated in-register, and the result is stored back to memory. Three implementations of sorting with two different SIMD machineries, x86-64's SSE2 and G5's...

10.1145/1248377.1248436 article EN 2007-06-09
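The threshold-switching structure described in the abstract above can be sketched in scalar Python. This is only an illustration under stated assumptions: the cutoff of 4, the function names, and the use of a fixed compare-exchange network as a stand-in for in-register SIMD sorting are all hypothetical and not taken from the paper.

```python
THRESHOLD = 4  # hypothetical cutoff; the paper's threshold is not given here

# Standard 5-step compare-exchange network for 4 inputs. The sequence of
# comparisons is fixed in advance, which is what makes such networks a
# natural fit for vector min/max instructions.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

def sort_small(values):
    """Sort up to THRESHOLD numbers with a data-independent network."""
    lanes = list(values) + [float("inf")] * (THRESHOLD - len(values))
    for i, j in NETWORK_4:
        lanes[i], lanes[j] = min(lanes[i], lanes[j]), max(lanes[i], lanes[j])
    return lanes[:len(values)]

def hybrid_sort(data):
    """Quicksort that hands small partitions to the fixed network."""
    if len(data) <= THRESHOLD:
        return sort_small(data)
    pivot = data[len(data) // 2]
    smaller = [x for x in data if x < pivot]
    equal = [x for x in data if x == pivot]
    larger = [x for x in data if x > pivot]
    return hybrid_sort(smaller) + equal + hybrid_sort(larger)

result = hybrid_sort([5, 3, 9, 1, 7, 2, 8, 6, 4, 0])
```

In a real SIMD version, `sort_small` would load the values into vector registers and apply the same comparison pattern with vector instructions before storing the result back to memory.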

A new compilation framework enables the execution of numerical-intensive applications, written in Python, on a hybrid execution environment formed by a CPU and a GPU. This compiler automatically computes the set of memory locations that need to be transferred to the GPU, and produces the correct mapping between the CPU and the GPU address spaces. Thus, the programming model implements a virtual shared address space. This framework is implemented as a combination of unPython, an ahead-of-time compiler from Python/NumPy to the C programming language, and jit4GPU, a just-in-time compiler from C to the AMD CAL interface. Experimental...

10.1145/1735688.1735695 article EN 2010-03-14

Improving the performance of work-stealing load-balancing algorithms in distributed shared-memory systems is challenging. These algorithms need to overcome the high costs of contention among workers, communication and remote data references between nodes, and their impact on the locality preferences of tasks. Prior research focuses on stealing from a victim that best exploits data locality, and on using special deques to minimize contention between local and remote workers. This work explores the selection of tasks that are favourable for migration across nodes memory...

10.1109/icpp.2013.19 article EN 2013-10-01

Finding the best state assignment for implementing a synchronous sequential circuit is important for reducing silicon area or chip count in many digital designs. This state assignment problem (SAP) belongs to a broader class of combinatorial optimization problems than the well-studied traveling salesman problem, which can be formulated as a special case of SAP. The search for a good solution is considerably involved for the SAP due to a large number of equivalent solutions, and no effective heuristic has been found so far to cater to all types of circuits....

10.1109/21.370202 article EN IEEE Transactions on Systems Man and Cybernetics 1995-04-01

In this paper, we address the problem of generating an optimal instruction sequence S for a Directed Acyclic Graph (DAG), where S is optimal in terms of the number of registers used. We call this the Minimum Register Instruction Sequence (MRIS) problem. The motivation for revisiting the MRIS problem stems from several modern architecture innovations/requirements that have put the instruction sequencing problem in a new context. We develop an efficient heuristic solution for the MRIS problem. This solution is based on the notion of an instruction lineage: a set of instructions that can definitely share a single register. The formation...

10.1109/tc.2003.1159750 article EN IEEE Transactions on Computers 2003-01-01

This paper describes Memory-Pooling-Assisted Data Splitting (MPADS), a framework that combines data structure splitting with memory pooling. Although the name MPADS may call to mind padding, a distinction of this framework is that it does not insert padding. MPADS relies on pointer analysis to ensure that splitting is safe and applicable to type-unsafe languages. MPADS makes no assumption about type safety. The analysis can identify cases in which the transformation could lead to incorrect code and thus abandons those cases.

10.1145/1375634.1375649 article EN 2008-06-07

This paper presents a detailed analysis of the application of Hardware Transactional Memory (HTM) support for loop parallelization with Thread-Level Speculation (TLS) and describes a careful evaluation of the implementation of TLS on the HTM extensions available in such machines. The sample implementation of TLS over HTM described in this paper also provides evidence that the programming effort to implement TLS over HTM is non-trivial. Thus an extension to OpenMP both makes TLS more accessible to programmers and allows for the easy tuning of TLS parameters. As a result, it several important...

10.1109/tpds.2017.2752169 article EN IEEE Transactions on Parallel and Distributed Systems 2017-09-14

Convolution is one of the most computationally intensive operations that must be performed for machine learning model inference. A traditional approach to computing convolutions is known as the Im2Col + BLAS method. This article proposes SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation toolchain that can be integrated into machine-learning compilers. SConv introduces: (a) Convolution Slicing Analysis (CSA), a convolution-specific 3D cache-blocking analysis pass that focuses on tile reuse over the cache...

10.1145/3625004 article EN ACM Transactions on Architecture and Code Optimization 2023-09-20
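To make the contrast in the abstract above concrete, here is a minimal sketch of the traditional Im2Col approach next to a direct convolution. It assumes a single-channel 2D input, stride 1, and no padding, and it uses the machine-learning convention (no kernel flip). The function names are hypothetical, a plain Python dot product stands in for the BLAS call, and nothing here reflects SConv's own slicing or code generation.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2D image into one flat row."""
    rows, cols = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(kh) for j in range(kw)]
        for r in range(rows - kh + 1)
        for c in range(cols - kw + 1)
    ]

def conv2d_im2col(image, kernel):
    """Convolution as (patch matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [w for row in kernel for w in row]
    out_cols = len(image[0]) - kw + 1
    # This matrix-vector product is the step a BLAS routine would perform.
    flat_out = [sum(p * w for p, w in zip(patch, flat_kernel))
                for patch in im2col(image, kh, kw)]
    return [flat_out[i:i + out_cols] for i in range(0, len(flat_out), out_cols)]

def conv2d_direct(image, kernel):
    """Same result computed in place, without building the patch matrix."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [sum(image[r + i][c + j] * kernel[i][j]
             for i in range(kh) for j in range(kw))
         for c in range(len(image[0]) - kw + 1)]
        for r in range(len(image) - kh + 1)
    ]

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
via_im2col = conv2d_im2col(image, kernel)
via_direct = conv2d_direct(image, kernel)
```

The patch matrix built by `im2col` duplicates overlapping input elements, which is the memory overhead that direct-convolution approaches avoid.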

This article presents Forma, a practical, safe, and automatic data reshaping framework that reorganizes arrays to improve data locality. Forma splits large aggregated data types into smaller ones to improve data locality. Arrays of these large data types are then replaced by multiple arrays of the smaller types. These new arrays form natural data streams that have smaller memory footprints, better locality, and are more suitable for hardware stream prefetching. Forma consists of a field-sensitive alias analyzer, a data type checker, a portable structure reshaping planner, and an array reshaper. An extensive experimental...

10.1145/1290520.1290522 article EN ACM Transactions on Programming Languages and Systems 2007-11-01
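The layout change described in the abstract above, replacing one array of large records with several arrays of smaller types, can be illustrated as follows. This is a hand-written sketch, not Forma itself: the `reshape` helper, the field names, and the choice of one array per field (the most extreme split) are assumptions, and a compiler framework would perform the transformation automatically after its safety analyses.

```python
from array import array

def reshape(records, field_names, typecodes):
    """Turn an array of records into one contiguous array per field."""
    return {
        name: array(code, (rec[i] for rec in records))
        for i, (name, code) in enumerate(zip(field_names, typecodes))
    }

# Hypothetical records: (key, weight) pairs stored as one array of tuples.
records = [(1, 2.5), (2, 3.5), (3, 4.0)]
fields = reshape(records, ["key", "weight"], ["i", "d"])

# A loop that only needs the weights now walks a single dense stream
# instead of striding over whole records.
total_weight = sum(fields["weight"])
```

After the split, a traversal that touches one field reads a compact, sequential stream, which is the property the abstract associates with smaller footprints and better prefetching.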

Well-crafted libraries deliver much higher performance than code generated by sophisticated application programmers using advanced optimizing compilers. When a code pattern for which a well-tuned library implementation exists is found in the source code of an application, the highest performing solution is to replace the pattern with a call to the library. Idiom-recognition solutions in the past either required pattern matching machinery that was outside of the compilation framework or provided a very brittle solution that would fail even for minor variants in the pattern source code. This...

10.1145/3459010 article EN ACM Transactions on Architecture and Code Optimization 2021-06-28

Support Vector Machines (SVMs) are used to discover method-specific compilation strategies in Testarossa, a commercial Just-in-Time (JiT) compiler employed in the IBM® J9 Java™ Virtual Machine. The learning process explores a large number of different compilation strategies to generate the data needed for training models. The trained machine-learned model is integrated with the compiler to predict a compilation plan that balances code quality and compilation effort on a per-method basis. The machine-learned plans outperform the original Testarossa for start-up performance, but not for throughput performance, for which...

10.5555/2190025.2190072 article EN 2011-04-02

The goal of Partitioned Global Address Space (PGAS) languages is to improve programmer productivity in large scale parallel machines. However, PGAS programs may have many fine-grained shared accesses that lead to performance degradation. Manual code transformations or compiler optimizations are required to improve the performance of programs with fine-grained accesses. The downside of manual code transformations is the increased program complexity that hinders programmer productivity. On the other hand, most compiler optimizations of fine-grained accesses require knowledge of physical data mapping and the use of parallel loop constructs.

10.1145/2464996.2465006 article EN 2013-05-28