- Topic Modeling
- Advanced Neural Network Applications
- Advanced Data Storage Technologies
- RNA and protein synthesis mechanisms
- Natural Language Processing Techniques
- Machine Learning and Data Classification
- Advanced Data Compression Techniques
- Neural Networks and Applications
- Hydraulic and Pneumatic Systems
- Speech Recognition and Synthesis
- Multimodal Machine Learning Applications
- Image and Signal Denoising Methods
- Human Pose and Action Recognition
- RNA Research and Splicing
- RNA modifications and cancer
- Industrial Technology and Control Systems
- Ferroelectric and Negative Capacitance Devices
- Advanced Image and Video Retrieval Techniques
- Cloud Computing and Resource Management
- Power Systems and Technologies
Binghamton University
2023-2024
Tsinghua University
2024
Institut de Biologie systémique et synthétique
2024
Large Language Models (LLMs), including the LLaMA model, have exhibited their efficacy across various general-domain natural language processing (NLP) tasks. However, their performance on high-performance computing (HPC) domain tasks has been less than optimal due to the specialized expertise required to interpret model responses. In response to this challenge, we propose HPC-GPT, a novel LLaMA-based model that undergoes supervised fine-tuning using generated QA (Question-Answer) instances for the HPC domain. To evaluate its...
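The abstract describes supervised fine-tuning of a LLaMA-based model on generated QA instances. As a minimal sketch of that general pattern, the snippet below fine-tunes a causal language model on question/answer pairs with Hugging Face `transformers`; the base model name, the `hpc_qa.jsonl` file, and the hyperparameters are placeholders, not the paper's actual setup.

```python
# Minimal sketch of supervised fine-tuning a causal LM on QA pairs.
# Assumes a hypothetical hpc_qa.jsonl with {"question": ..., "answer": ...} records;
# model name and hyperparameters are placeholders, not the paper's configuration.
import json
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Format each QA instance as a single prompt/response string.
records = [json.loads(line) for line in open("hpc_qa.jsonl")]
texts = [f"Question: {r['question']}\nAnswer: {r['answer']}" for r in records]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = Dataset.from_dict({"text": texts}).map(
    tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hpc-gpt-sft", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```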
Cis-acting mRNA elements play a key role in the regulation of mRNA stability and translation efficiency. Revealing the interactions of these elements and their impact plays a crucial role in understanding this process, which supports the development of mRNA-based medicines or vaccines. Deep neural networks (DNN) can learn complex cis-regulatory codes from RNA sequences. However, extracting these codes efficiently from a DNN remains a significant challenge. Here, we propose a method based on our toolkit NeuronMotif and motif mutagenesis, which not only...
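NeuronMotif and the paper's motif-mutagenesis procedure are their own contribution; purely as a rough illustration of the general idea of in-silico mutagenesis, the sketch below scores the effect of every single-base substitution on a model's output. The `predict` function is a stand-in for a trained sequence DNN and is an assumption, not part of NeuronMotif.

```python
# Illustrative in-silico mutagenesis: substitute each base at each position
# and record the change in a model's predicted score. `predict` is a
# placeholder for a trained sequence DNN, not taken from NeuronMotif.
import numpy as np

BASES = "ACGU"

def predict(seq: str) -> float:
    # Placeholder scoring function; a real DNN would go here.
    return sum(1.0 for b in seq if b == "G") / len(seq)

def mutagenesis_map(seq: str) -> np.ndarray:
    """Return a (len(seq), 4) matrix of score deltas for each substitution."""
    base_score = predict(seq)
    deltas = np.zeros((len(seq), len(BASES)))
    for i, original in enumerate(seq):
        for j, b in enumerate(BASES):
            if b == original:
                continue
            mutant = seq[:i] + b + seq[i + 1:]
            deltas[i, j] = predict(mutant) - base_score
    return deltas

print(mutagenesis_map("AUGGCCAUUG").round(2))
```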
Large transformer models have recently achieved great success across various domains. With a growing number of model parameters, large-scale training today typically involves model sharding, data parallelism, and model parallelism. Thus, the throughput of large-scale training depends heavily on network bandwidth, since the combination of sharding with multiple parallelism strategies incurs communication costs. However, prior characterizations on high-bandwidth DGX machines that use TFLOPS as the metric may not reflect the performance of a system with lower...
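Since the abstract contrasts TFLOPS-based characterizations with actual training throughput, a common back-of-the-envelope check is to convert measured tokens per second into achieved TFLOPS per device using the rough 6·N FLOPs-per-token rule of thumb for transformer training (forward plus backward pass). The sketch below does that conversion; the model size and throughput numbers are illustrative, not results from the paper.

```python
# Rough conversion from training throughput to achieved TFLOPS per device,
# using the common ~6 * parameters FLOPs-per-token estimate for forward
# plus backward passes. All numbers are illustrative.
def achieved_tflops(params: float, tokens_per_sec: float, num_devices: int) -> float:
    flops_per_token = 6.0 * params
    total_flops_per_sec = flops_per_token * tokens_per_sec
    return total_flops_per_sec / num_devices / 1e12

# Example: a 7B-parameter model processing 20k tokens/s across 8 accelerators.
print(f"{achieved_tflops(7e9, 20_000, 8):.1f} TFLOPS per device")
```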
Heterogeneous hardware like the Gaudi processor has been developed to enhance computations, especially the matrix operations in Transformer-based large language models (LLMs) and generative AI tasks. However, our analysis indicates that Transformers are not fully optimized on such emerging hardware, primarily due to inadequate optimizations in non-matrix computational kernels such as Softmax and in heterogeneous resource utilization, particularly when processing long sequences. To address these issues, we propose an...
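The Softmax kernel singled out here is a non-matrix, memory-bound operation. As a minimal reference for what such a kernel computes, the sketch below implements the standard numerically stable row-wise softmax in NumPy; it is a generic version for illustration, not the paper's optimized Gaudi kernel.

```python
# Reference row-wise softmax with max-subtraction for numerical stability;
# this is the operation a fused non-matrix kernel would implement, not the
# paper's Gaudi-specific optimization.
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    shifted = scores - scores.max(axis=-1, keepdims=True)  # avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

attn_scores = np.random.randn(4, 128, 128)  # (heads, query_len, key_len)
probs = softmax(attn_scores)
assert np.allclose(probs.sum(axis=-1), 1.0)
```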
Transformer models have achieved remarkable success in various machine learning tasks but suffer from high computational complexity and resource requirements. The quadratic complexity of the self-attention mechanism further exacerbates these challenges when dealing with long sequences and large datasets. Specialized AI hardware accelerators, such as the Habana GAUDI architecture, offer a promising solution to tackle these issues. GAUDI features a Matrix Multiplication Engine (MME) and a cluster of fully programmable Tensor...
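To make the quadratic cost concrete: naive self-attention materializes an n by n score matrix per head, so compute and memory grow with the square of the sequence length. The small sketch below shows this for a single head; shapes and sizes are illustrative only and unrelated to the paper's GAUDI implementation.

```python
# Naive single-head self-attention; the (n, n) score matrix is the source of
# the quadratic compute and memory cost discussed above. Shapes are illustrative.
import numpy as np

def naive_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # shape (n, n): quadratic in sequence length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                     # shape (n, d)

for n in (1024, 2048, 4096):
    q = k = v = np.random.randn(n, 64).astype(np.float32)
    out = naive_attention(q, k, v)
    score_bytes = n * n * 4                # fp32 score matrix
    print(f"n={n}: score matrix ~{score_bytes / 1e6:.0f} MB, output {out.shape}")
```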