- Parallel Computing and Optimization Techniques
- Interconnection Networks and Systems
- Embedded Systems Design Techniques
- Distributed systems and fault tolerance
- Advanced Data Storage Technologies
- Advanced Frequency and Time Standards
- Distributed and Parallel Computing Systems
- Network Time Synchronization Technologies
- Radiation Effects in Electronics
- VLSI and Analog Circuit Testing
- VLSI and FPGA Design Techniques
- Advancements in PLL and VCO Technologies
- Optimization and Search Problems
- Satellite Communication Systems
- Formal Methods in Verification
- Logic, programming, and type systems
- GNSS positioning and interference
- Power Line Communications and Noise
- Inertial Sensor and Navigation
- Sensor Technology and Measurement Systems
- Smart Grid Security and Resilience
- Algorithms and Data Compression
- Advanced Graph Theory Research
- Real-time simulation and control systems
- Numerical Methods and Algorithms
IBM (India): 2023
CSIR National Physical Laboratory of India: 1977-2022
University of Burdwan: 1977-2022
Amity University: 2016
Council of Scientific and Industrial Research: 2011
Hewlett-Packard (United States): 2010
University of Illinois Chicago: 2006
University of Illinois Urbana-Champaign: 1988-2005
Northwestern University: 1997-2005
Evanston Hospital: 2005
An approach to the problem of automatic data partitioning is introduced. The notion of constraints on data distribution is presented, and it is shown how, based on performance considerations, a compiler identifies constraints to be imposed on the distribution of various data structures. These constraints are then combined by the compiler to obtain a complete and consistent picture of the data distribution scheme, one that offers good performance in terms of the overall execution time. Results of a study performed on Fortran programs taken from the Linpack and Eispack libraries and the Perfect Benchmarks to determine the applicability of the approach to real programs are presented. The results...
The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. Extensive studies have been done of error coverage...
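As a rough illustration of algorithm-based error detection for matrix multiplication, the sketch below uses the classic row/column checksum encoding. It is not taken from the paper: the function names are hypothetical, it runs on a single process rather than a hypercube, and it uses integer data so that roundoff does not enter the comparison.

```python
def matmul(a, b):
    """Plain matrix product on lists of lists."""
    inner = len(b)
    return [[sum(row[k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))] for row in a]


def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def inconsistent_rows_and_cols(c_full):
    """Return (bad_rows, bad_cols) whose data no longer match the checksums.

    c_full is a full-checksum matrix: last row = column sums,
    last column = row sums. Empty lists mean no error was detected.
    """
    n, p = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:p]) != c_full[i][p]]
    bad_cols = [j for j in range(p)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    return bad_rows, bad_cols


# Multiply the encoded operands; the product carries its own checksums.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_full = matmul(with_column_checksum(A), with_row_checksum(B))
```

A single corrupted data element shows up as exactly one inconsistent row and one inconsistent column, and their intersection locates it.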
Reconfigurable hardware has the potential for significant performance improvements by providing support for application-specific operations. We report our experience with Chimaera, a prototype system that integrates a small and fast reconfigurable functional unit (RFU) into the pipeline of an aggressive, dynamically-scheduled superscalar processor. Chimaera is capable of performing 9-input/1-output operations on integer data. We discuss the Chimaera C compiler that automatically maps computations for execution in the RFU. ... of: (1)...
This paper presents a new integrated compiler framework for improving the cache performance of scientific applications. In addition to applying loop transformations, the method includes data layout optimizations, i.e., those that change the memory layouts of data structures (arrays in this case). A key characteristic of this approach is that loop transformations are used to improve temporal locality while data layout optimizations are used to improve spatial locality. This optimization framework was used with sixteen loop nests from several benchmarks and math libraries, and the measured...
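To make the link between loop order, memory layout, and spatial locality concrete, the following sketch computes the distance in memory between consecutive accesses to `A[i][j]` as the innermost loop advances. It is only an illustration under assumed names (`linear_offset`, `innermost_stride`), not the framework described in the abstract.

```python
def linear_offset(i, j, n_rows, n_cols, layout):
    """Element offset of A[i][j] under a row-major or column-major layout."""
    if layout == "row_major":
        return i * n_cols + j
    if layout == "col_major":
        return j * n_rows + i
    raise ValueError(f"unknown layout: {layout}")


def innermost_stride(loop_order, n_rows, n_cols, layout):
    """Offset change between consecutive accesses to A[i][j] when the
    innermost loop variable (last entry of loop_order) increases by one."""
    inner = loop_order[-1]
    di, dj = (1, 0) if inner == "i" else (0, 1)
    return (linear_offset(di, dj, n_rows, n_cols, layout)
            - linear_offset(0, 0, n_rows, n_cols, layout))


# A 4 x 8 array traversed with j innermost:
stride_row_major = innermost_stride(("i", "j"), 4, 8, "row_major")  # unit stride
stride_col_major = innermost_stride(("i", "j"), 4, 8, "col_major")  # jumps a column
```

A stride of 1 means successive iterations touch adjacent elements, which is the spatial-locality goal. In this toy model it can be reached either by interchanging the loops or by changing the array layout.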
Appropriate data distribution has been found to be critical for obtaining good performance on Distributed Memory Multicomputers like the CM-5, Intel Paragon and IBM SP-1. It has also been found that some programs need to change their distributions during execution for better performance (redistribution). This work focuses on automatically generating efficient routines for redistribution. We present a new mathematical representation for regular distributions called PITFALLS and then discuss algorithms for redistribution based on this representation. A...
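The redistribution problem can be stated with a brute-force sketch: for every array element, find its owner before and after, and group elements by (sender, receiver) pair. This is not the PITFALLS representation or the paper's algorithm; it assumes a one-dimensional block-cyclic distribution, and the names `owner` and `redistribution_schedule` are hypothetical.

```python
def owner(index, block, nprocs):
    """Processor owning element `index` under a block-cyclic distribution
    with the given block size over `nprocs` processors."""
    return (index // block) % nprocs


def redistribution_schedule(n, src_block, src_procs, dst_block, dst_procs):
    """Map each (sender, receiver) pair to the element indices it must move.

    Enumerates all n elements, so the cost grows with the array size.
    """
    schedule = {}
    for i in range(n):
        pair = (owner(i, src_block, src_procs), owner(i, dst_block, dst_procs))
        schedule.setdefault(pair, []).append(i)
    return schedule


# 8 elements, from blocks of 2 on 2 processors to blocks of 1 on 2 processors.
plan = redistribution_schedule(8, 2, 2, 1, 2)
```

Here `plan[(0, 1)]` lists the elements that processor 0 sends to processor 1. A compact mathematical representation of regular distributions would aim to derive such sets without visiting every element.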
Simulated annealing, a methodology for solving combinatorial optimization problems, is a very computationally expensive algorithm and, as such, numerous researchers have undertaken efforts to parallelize it. In this paper, we investigate three of these parallel simulated annealing strategies when applied to standard cell placement, specifically the TimberWolfSC placement tool. We have examined a parallel moves strategy, as well as two new approaches to parallel cell placement: multiple Markov chains and speculative computation. These...
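For readers unfamiliar with the serial algorithm that these strategies parallelize, here is a minimal simulated annealing loop on a toy placement problem: cells sit in a single row and the cost is the total length of two-pin nets. The cost model, cooling schedule, and parameter values are assumptions chosen for illustration and have nothing to do with TimberWolfSC itself.

```python
import math
import random


def wire_length(order, nets):
    """Sum of slot distances between the two cells of every net."""
    pos = {cell: slot for slot, cell in enumerate(order)}
    return sum(abs(pos[a] - pos[b]) for a, b in nets)


def anneal(order, nets, temp=10.0, cooling=0.95, moves_per_temp=50,
           min_temp=0.01, seed=0):
    """Serial simulated annealing with pairwise-swap moves.

    Returns the best ordering seen and its cost.
    """
    rng = random.Random(seed)
    current = list(order)
    cost = wire_length(current, nets)
    best, best_cost = list(current), cost
    while temp > min_temp:
        for _ in range(moves_per_temp):
            i, j = rng.sample(range(len(current)), 2)
            current[i], current[j] = current[j], current[i]
            new_cost = wire_length(current, nets)
            delta = new_cost - cost
            # Metropolis rule: always accept improvements, sometimes accept
            # uphill moves with probability exp(-delta / temp).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = list(current), cost
            else:
                current[i], current[j] = current[j], current[i]  # undo swap
        temp *= cooling
    return best, best_cost


cells = [0, 1, 2, 3, 4, 5]
nets = [(0, 5), (5, 1), (1, 4), (4, 2), (2, 3)]
best_order, best_cost = anneal(cells, nets)
```

Each move depends on the state left by the previous one, which is why parallelizing the method requires strategies such as evaluating several moves at once, running multiple chains, or speculating on move outcomes.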
This paper describes a behavioral synthesis tool called AccelFPGA which reads in high-level descriptions of digital signal processing (DSP) applications written in MATLAB, and automatically generates synthesizable register transfer level (RTL) models and simulation testbenches in VHDL or Verilog. The RTL models can be synthesized using commercial logic synthesis tools and place and route tools onto field-programmable gate arrays (FPGAs). This paper describes how powerful directives are used to provide high-level architectural tradeoffs for the DSP designer....
An important problem facing numerous research projects on parallelizing compilers for distributed memory machines is that of automatically determining a suitable data partitioning scheme for a program. Most of the current projects leave this tedious problem almost entirely to the user. In this paper, we present a novel approach to the problem of automatic data partitioning. We introduce the notion of constraints on data distribution, and show how a compiler can infer those constraints by looking at the data reference patterns in the source code of the program, and how these constraints may be combined to obtain...
Several issues concerning the design of an I/O (input/output) system for a multiprocessor such as a hypercube are examined. A methodology is proposed for connecting the I/O processors to the hypercube for efficient I/O access. The effect of I/O communication on the hypercube network is analyzed. Different disk organizations that can be employed within the I/O system are evaluated to see which organization has better performance. It is observed that parallelism in serving an I/O request plays a dominant role for a scientific workload. The problem of mapping specific data structures such as matrices onto disks...
A hyperplane based approach for optimizing spatial locality in loop nests. Authors: M. Kandemir (EECS Dept., Syracuse University, Syracuse, NY), A. Choudhary (ECE Dept., Northwestern University, Evanston, IL), N. Shenoy, P. Banerjee, J. Ramanujam (Louisiana State University, Baton Rouge, LA). ICS '98: Proceedings of the 12th international conference on Supercomputing, July 1998, pages 69–76. https://doi.org/10.1145/277830.277849. Online: 13 July 1998.
Field Programmable Gate Arrays (FPGAs) have recently been used as an effective platform for implementing many image/signal processing applications. MATLAB is one of the most popular languages to model image/signal processing applications. We present the MATCH compiler that takes MATLAB as input and produces a hardware description in RTL VHDL, which can be mapped to an FPGA using commercial CAD tools. This dramatically reduces the time to implement an application on an FPGA. We present results on some image and signal processing algorithms for which hardware was synthesized using our compiler for the Xilinx XC4028 FPGA with an external memory. We also...
The design of two reconfiguration strategies for hypercube multicomputer architectures under failures is discussed. The first scheme uses spare processors attached to certain processors in the hypercube by means of a novel embedding technique. The second approach places spare processors between specific links in the hypercube. Both schemes involve the mapping of a logical or virtual hypercube onto a set of physical processors in the final reconfigured hypercube and hence suffer some performance degradation.
The problem of designing fault-tolerant disk arrays that are not susceptible to 100% load increases on the functional disks when one of the disks in the system fails is addressed. A technique that combines the advantages of parity schemes and traditional dual copy methods and offers a wide variety of options for providing fault-tolerance is proposed. A theoretical framework for solving the problem is presented and a number of constructive techniques are proposed. By utilizing the same amount of hardware as the earlier schemes but with a better data organization and different reconstruction...
The introduction of advanced FPGA architectures, with built-in DSP support, has given DSP designers a new hardware alternative. By exploiting their inherent parallelism, it is expected that FPGAs can outperform DSP processors. This paper describes the process and considerations for automatically translating binaries targeted for general DSP processors into Register Transfer Level (RTL) VHDL or Verilog code to be mapped onto commercial FPGAs. The Texas Instruments C6000 DSP processor architecture is chosen as the platform,...
A discussion is presented of a fault-tolerant hypercube multiprocessor architecture which uses a novel algorithm-based fault-detection approach for identifying faulty processors. The scheme involves the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube. The authors have implemented system-level fault-detection mechanisms for various parallel applications on a 16-processor Intel iPSC hypercube multiprocessor. They report results of two applications: matrix multiplication and fast Fourier transform. They have performed...
The performance evaluation, workload characterization, and trace-driven simulation of a hypercube multicomputer running realistic workloads are presented. Eleven representative parallel applications were selected as benchmarks. Software monitoring techniques were then used to collect execution traces. Based on the measurement results, both the computation and communication behavior of these parallel programs were investigated. The various time interval distributions were modeled by statistical functions which were verified by nonlinear...
A scheme for dealing with roundoff errors in algorithm-based fault tolerance methods, which complicate the check phases of the algorithm, is presented. The method is based on error analysis incorporating some simplifications which result in easier derivation of error bounds and in more useful error expressions for cases where the theoretical error bound may be too wide to be of much use as a check in the check phase. The methods are used to derive error bounds for three applications, and it is shown that fault-tolerant encodings of these applications using the authors' error bounds achieve high error coverage and no false alarms...
Efficient high level design tools that can map behavioral descriptions to FPGA architectures are one of the key requirements to fully leverage FPGAs for high throughput computations and to meet time-to-market pressures. We present a compiler that takes as input algorithms described in MATLAB and generates RTL VHDL. The RTL VHDL can then be mapped to FPGAs using existing commercial tools. The input application is mapped to multiple FPGAs by parallelizing the application and embedding communication and synchronization primitives automatically. Our compiler infers the minimum number of bits...
We suggest a new approach aimed at reducing the effort required to program distributed-memory multicomputers. The key idea in our approach is to automatically convert a program written in a library-based programming language (MATLAB) to a parallel program based on the ScaLAPACK library. In the process of performing this conversion, we apply compiler optimizations that simultaneously exploit task and data parallelism. As our results show, our approach is feasible and practical, and our optimization provides significant performance benefits.
Applications that require digital signal processing (DSP) functions are typically mapped onto general purpose DSP processors. With the introduction of advanced FPGA architectures with built-in DSP support, a new hardware alternative is available for DSP designers. By exploiting their inherent parallelism, it is expected that FPGAs can outperform DSP processors. However, the migration of assembly code to FPGAs is a very arduous process. This paper describes the process and considerations for automatically translating software binary codes targeted...
The role of HP Labs, the central research arm of Hewlett-Packard, is to look beyond the roadmap of currently offered products and services to solve some of the exciting technical problems that will be critical to customers. The paper mentions that active collaboration with academic, commercial, and government partners augments and accelerates knowledge creation and technology transfer.
The authors present the architecture, methodology and performance evaluation of a message-passing coprocessor (MPC) which can accelerate message communication in a distributed memory multicomputer (i.e. the iPSC/2 hypercube). The MPC is a microprogrammable processor which offloads from the CPU the burden of message communication and speeds up the software processing by directly executing message passing instructions in microcode. It supports process scheduling, message buffer management, and fast buffer copying. The most unique feature of the MPC is that it performs caching for expected...
High-level synthesis compilers often produce reoccurring patterns in intermediate CDFGs during translation. By identifying large reoccurring patterns, one may reduce area and communication overhead by efficiently reusing hardware for multiple operations. This paper presents an algorithm for dynamically generating templates of reoccurring patterns for resource sharing in CDFGs. Results show a 40-80% reduction in area using small, incremental template growth, with variations within a 5% margin among varying look-ahead depths.