- Parallel Computing and Optimization Techniques
- Interconnection Networks and Systems
- Advanced Data Storage Technologies
- Matrix Theory and Algorithms
- Distributed and Parallel Computing Systems
- Advanced Neural Network Applications
- Molecular Junctions and Nanostructures
- Advanced Numerical Methods in Computational Mathematics
- Embedded Systems Design Techniques
- Machine Learning and ELM
- Advanced NMR Techniques and Applications
- Force Microscopy Techniques and Applications
- Stochastic Gradient Optimization Techniques
- Electron and X-Ray Spectroscopy Techniques
- Neural Networks and Applications
- Protein Structure and Dynamics
- Ferroelectric and Negative Capacitance Devices
- Advanced Optimization Algorithms Research
- Elasticity and Material Modeling
- Machine Learning in Materials Science
- Distributed Systems and Fault Tolerance
- Graph Theory and Applications
- Advanced Memory and Neural Computing
- Scientific Computing and Data Management
- Model Reduction and Neural Networks
Cerebras Systems (United States)
2020-2023
Intel (United States)
2013-2019
Apple (United States)
2007
Apple (Israel)
2007
Oracle (United States)
2002-2006
University of California, Los Angeles
1997
The performance of CPU-based and GPU-based systems is often low for PDE codes, where large, sparse, structured systems of linear equations must be solved. Iterative solvers are limited by data movement, both between caches and memory and between nodes. Here we describe the solution of such systems on the Cerebras Systems CS-1, a wafer-scale processor that has the memory bandwidth and communication latency to perform well. We achieve 0.86 PFLOPS on a single system for BiCGStab applied to a linear system arising from a 7-point finite difference stencil on a 600 × 595 × 1536 mesh, achieving...
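The solver named in this abstract is a short algorithm; the following is a minimal, unpreconditioned BiCGStab applied to a matrix-free 7-point stencil. The grid size, boundary conditions, and tolerance here are illustrative only, not the paper's setup.

```python
import numpy as np

def laplacian_7pt(u_flat, n):
    """Matrix-free 7-point stencil on an n^3 grid, zero Dirichlet boundaries."""
    u = np.zeros((n + 2, n + 2, n + 2))
    u[1:-1, 1:-1, 1:-1] = u_flat.reshape(n, n, n)
    c = u[1:-1, 1:-1, 1:-1]
    out = (6.0 * c
           - u[:-2, 1:-1, 1:-1] - u[2:, 1:-1, 1:-1]
           - u[1:-1, :-2, 1:-1] - u[1:-1, 2:, 1:-1]
           - u[1:-1, 1:-1, :-2] - u[1:-1, 1:-1, 2:])
    return out.ravel()

def bicgstab(A, b, tol=1e-10, maxiter=500):
    """Unpreconditioned BiCGStab (van der Vorst); A is a callable matvec."""
    x = np.zeros_like(b)
    r = b - A(x)
    r0 = r.copy()                       # shadow residual
    rho = alpha = omega = 1.0
    v = np.zeros_like(b)
    p = np.zeros_like(b)
    for _ in range(maxiter):
        rho_new = r0 @ r
        beta = (rho_new / rho) * (alpha / omega)
        rho = rho_new
        p = r + beta * (p - omega * v)
        v = A(p)
        alpha = rho / (r0 @ v)
        s = r - alpha * v
        t = A(s)
        omega = (t @ s) / (t @ t)
        x = x + alpha * p + omega * s
        r = s - omega * t
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
    return x

n = 6                                   # toy grid; the paper's mesh is far larger
b = np.ones(n ** 3)
x = bicgstab(lambda u: laplacian_7pt(u, n), b)
```

Matrix-free operators like `laplacian_7pt` are the natural fit for stencil hardware: the solver only ever needs a matvec, never an assembled matrix.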
The evolution of molecular dynamics (MD) simulations has been intimately linked to that of computing hardware. For decades following the creation of MD, simulations have improved with computing power along three principal dimensions: accuracy, atom count (spatial scale), and duration (temporal scale). Since the mid-2000s, computer platforms have, however, failed to provide strong scaling for MD, as scale-out central processing unit (CPU) and graphics processing unit (GPU) platforms deliver substantial increases in spatial scale that do not lead to proportional increases in temporal scale. Important scientific problems therefore...
The NAS parallel benchmarks (NPB) are a set of applications commonly used to evaluate parallel systems. We use the NPB-OpenMP version to examine the performance of Intel's new Xeon Phi co-processor, focusing in particular on the many-core aspect of the architecture. A first analysis studies scalability up to 244 threads on 61 cores and the impact of affinity settings on scaling; it also compares these characteristics with those of traditional CPUs. The application of several well-established optimization techniques allows us to identify common bottlenecks that can...
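As a back-of-the-envelope companion to such scaling studies, Amdahl's law bounds the speedup achievable on many threads; the serial fraction below is purely illustrative.

```python
def amdahl_speedup(serial_fraction, threads):
    """Amdahl's law: ideal speedup on `threads` threads for a code whose
    non-parallelizable fraction of work is `serial_fraction`."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)

# Even a 1% serial fraction caps speedup far below the 244-thread ideal.
s = amdahl_speedup(0.01, 244)   # roughly 71x, not 244x
```

This is why scalability studies on many-core parts focus so heavily on eliminating serial sections and load imbalance.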
This paper provides a systematic comparison of various characteristics of computationally intensive workloads. Our analysis focuses on standard HPC benchmarks and representative applications. For the selected workloads we provide a wide range of characterizations based on instruction tracing and hardware counter measurements.
Online Normalization is a new technique for normalizing the hidden activations of a neural network. Like Batch Normalization, it normalizes over the sample dimension. While it does not use batches, it is as accurate as Batch Normalization. We resolve a theoretical limitation of Batch Normalization by introducing an unbiased technique for computing the gradient of normalized activations. Online Normalization works with automatic differentiation by adding a statistical normalization primitive. It can be used in cases not covered by some other normalizers, such as recurrent networks, fully connected...
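A forward-pass-only sketch of the idea: normalize each sample with exponentially decayed running statistics instead of batch statistics. The decay rate and epsilon below are illustrative, and the paper's key contribution, the unbiased gradient for the backward pass, is not shown.

```python
import numpy as np

class OnlineNorm:
    """Sketch of batch-free normalization (forward pass only): each sample is
    normalized by exponentially decayed running estimates of mean and
    variance, updated one sample at a time. `alpha` and `eps` are
    illustrative choices, not the paper's settings."""

    def __init__(self, num_features, alpha=0.99, eps=1e-5):
        self.mu = np.zeros(num_features)
        self.var = np.ones(num_features)
        self.alpha, self.eps = alpha, eps

    def __call__(self, x):
        y = (x - self.mu) / np.sqrt(self.var + self.eps)
        # Update running statistics after using them on the current sample.
        d = x - self.mu
        self.var = self.alpha * self.var + self.alpha * (1 - self.alpha) * d * d
        self.mu = self.mu + (1 - self.alpha) * d
        return y

rng = np.random.default_rng(0)
norm = OnlineNorm(4)
# Stream 5000 samples from N(3, 2); outputs should approach zero mean, unit std.
outs = np.array([norm(rng.normal(3.0, 2.0, size=4)) for _ in range(5000)])
```

After a warm-up period the running estimates track the input distribution, so no batch is ever materialized.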
We have implemented fast Fourier transforms for one-, two-, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections. Our wafer-scale FFT (wsFFT) parallelizes an $n^3$ problem with up to $n^2$ PEs. At this degree of parallelism, each PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps...
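The pencil decomposition reduces a 3D transform to three rounds of 1D FFTs, one axis per round. The serial sketch below mirrors that structure; on the wafer, each pencil would belong to one PE, with data transposed across the mesh between supersteps.

```python
import numpy as np

def fft3d_pencils(a):
    """3D FFT as three supersteps of 1D FFTs along pencils, one axis per
    superstep; mathematically equivalent to np.fft.fftn on a cube."""
    for axis in range(3):
        a = np.fft.fft(a, axis=axis)
    return a

n = 8
x = np.random.default_rng(1).random((n, n, n))
y = fft3d_pencils(x)
```

The separability of the DFT is what makes this decomposition exact: transforming each axis in turn yields the full 3D transform.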
In this work, we apply the ideas of domain decomposition and multi‐grid methods to PDE‐based eigenvalue problems represented in two equivalent variational formulations. To find the lowest eigenpair, we use a “subspace correction” framework for deriving a multiplicative algorithm that minimizes the Rayleigh quotient at the current iteration. By considering an equivalent minimization formulation proposed by Mathew and Reddy, we can use the theory of Schwarz algorithms for non‐linear optimization, developed by Tai and Espedal, to analyse the convergence...
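To make the objective concrete: the lowest eigenpair minimizes the Rayleigh quotient $R(x) = x^\top A x / x^\top x$. Below is a toy projected-gradient minimizer of $R$, a stand-in for intuition only; the paper's subspace-correction and Schwarz machinery is far more sophisticated, and the fixed step size here is an arbitrary illustrative choice.

```python
import numpy as np

def lowest_eigenpair(A, iters=3000):
    """Toy Rayleigh-quotient minimization: gradient steps on
    R(x) = x'Ax / x'x, renormalizing x after each step."""
    x = np.random.default_rng(0).standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        lam = x @ A @ x                      # current Rayleigh quotient
        x = x - 0.5 * (A @ x - lam * x)      # step along -gradient direction
        x /= np.linalg.norm(x)               # project back to the unit sphere
    return lam, x

# 1-D discrete Laplacian: the lowest eigenpair is the smoothest mode.
n = 10
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
lam, vec = lowest_eigenpair(A)
```

Each step moves opposite the Rayleigh-quotient gradient $2(Ax - R(x)\,x)/x^\top x$, so stationary points are exactly the eigenvectors of $A$.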
This work presents a general methodology for estimating the performance of an HPC workload when running on a future hardware architecture. Further, it demonstrates the methodology on a significant scientific application -- the Gyrokinetic Toroidal Code (GTC) -- executing on Sun's proposed next-generation petascale computer architecture. For GTC, we identify the important phases of an iteration and perform low-level analysis that includes instruction tracing and component simulations of the processor and memory systems. The low-level analysis is complemented...
Molecular dynamics (MD) simulations have transformed our understanding of the nanoscale, driving breakthroughs in materials science, computational chemistry, and several other fields, including biophysics and drug design. Even on exascale supercomputers, however, runtimes are excessive for the systems and timescales of scientific interest. Here, we demonstrate strong scaling of MD on the Cerebras Wafer-Scale Engine. By dedicating a processor core to each simulated atom, we achieve a 179-fold improvement in timesteps per second versus...
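The per-atom work underlying such simulations is a short integration loop. Below is a serial velocity-Verlet sketch on a toy harmonic oscillator; the force law, step size, and units are illustrative, and no claim is made that this matches the paper's force fields.

```python
import numpy as np

def velocity_verlet(pos, vel, force, mass, dt, steps):
    """Velocity-Verlet integration -- the kind of per-atom update that, in a
    core-per-atom mapping, each processor core would repeat every timestep."""
    f = force(pos)
    for _ in range(steps):
        vel = vel + 0.5 * dt * f / mass   # half-kick
        pos = pos + dt * vel              # drift
        f = force(pos)                    # recompute forces at new positions
        vel = vel + 0.5 * dt * f / mass   # second half-kick
    return pos, vel

# Toy system: one particle in a harmonic well (k = m = 1).
pos, vel = velocity_verlet(np.array([1.0]), np.array([0.0]),
                           lambda x: -x, 1.0, 0.01, 1000)
energy = 0.5 * vel[0] ** 2 + 0.5 * pos[0] ** 2   # should stay near 0.5
```

Velocity Verlet is symplectic, so total energy stays bounded near its initial value over long runs, which is why it is the workhorse integrator for MD.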
We present a high-level and accessible Application Programming Interface (API) for the solution of field equations on the Cerebras Systems Wafer-Scale Engine (WSE), with over two orders of magnitude performance gain relative to traditional distributed computing approaches. The domain-specific API is called the WSE Field-equation API (WFA). WFA outperforms OpenFOAM on NETL's Joule 2.0 supercomputer in time to solution. While this is consistent with hand-optimized assembly codes, WFA provides an easy-to-use, Python...
Solving 3-D partial differential equations in a finite element model is computationally intensive and requires extremely high memory and communication bandwidth. This paper describes a novel approach in which mesh points of varying resolution are mapped onto a large 2-D homogeneous array of processors. Cerebras has developed a supercomputer powered by a 21.5 cm Wafer-Scale Engine (WSE) with 850,000 programmable compute cores. With 2.6 trillion transistors in a 7 nm process, this is by far the largest chip in the world. It is structured as...