NFDI4DS | UHH-SEMS - Publication Details

Chronos

OPENALEX - Publications

Wentao Han Youshan Miao Kaiwei Li Ming Wu Fan Yang and 4 more

Temporal graphs capture changes in graphs over time and are becoming a subject that attracts increasing interest from the research communities, for example, to understand temporal characteristics of social interactions on a time-evolving graph. Chronos is a storage and execution engine designed and optimized specifically for running in-memory iterative graph computation on temporal graphs. Locality is at the center of the design, where the layout and the scheduling are carefully designed, so that common "bulk" operations are scheduled to maximize the benefit of data...

10.1145/2592798.2592799 article EN 2014-04-14

ImmortalGraph

Youshan Miao Wentao Han Kaiwei Li Ming Wu Fan Yang and 4 more

Temporal graphs that capture graph changes over time are attracting increasing interest from research communities, for functions such as understanding temporal characteristics of social interactions on a time-evolving graph. ImmortalGraph is a storage and execution engine designed and optimized specifically for temporal graphs. Locality is at the center of ImmortalGraph’s design: temporal graphs are carefully laid out in both persistent storage and memory, taking into account data locality in both time and graph-structure dimensions. ImmortalGraph introduces the notion...

10.1145/2700302 article EN ACM Transactions on Storage 2015-07-24

WarpLDA

Jianfei Chen Kaiwei Li Jun Zhu Wenguang Chen

Developing efficient and scalable algorithms for Latent Dirichlet Allocation (LDA) is of wide interest for many applications. Previous work has developed an O(1) Metropolis-Hastings (MH) sampling method for each token. However, its performance is far from being optimal due to frequent cache misses caused by random accesses to the parameter matrices. In this paper, we first carefully analyze the memory access behavior of existing LDA algorithms and their locality at document level. We then develop WarpLDA, which achieves time...

10.14778/2977797.2977801 article EN Proceedings of the VLDB Endowment 2016-06-01

Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer

Haohuan Fu Liao Junfeng Wei Xue Lanning Wang Dexun Chen and 20 more

This paper reports our efforts on refactoring and optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight supercomputer, which uses a many-core processor that consists of management processing elements (MPEs) and clusters of computing processing elements (CPEs). To map the large code base of CAM to the millions of cores on the Sunway system, we take OpenACC-based refactoring as the major approach, and apply source-to-source translator tools to exploit the most suitable parallelism for the CPE cluster, and to fit the intermediate variable into the limited on-chip fast buffer. For...

10.1109/sc.2016.82 article EN 2016-11-01

Reconfigured Scan Forest for Test Application Cost, Test Data Volume, and Test Power Reduction

Dong Xiang Kaiwei Li Jiaguang Sun Hideo Fujiwara

A new scan architecture called reconfigured scan forest is proposed for cost-effective scan testing. Multiple scan flip-flops can be grouped based on structural analysis that avoids untestable faults due to reconvergent fanouts. The proposed architecture allows only a few scan flip-flops to be connected to the XOR trees. The size of the XOR trees is greatly reduced compared with the original scan forest; therefore, the area overhead and routing complexity can be reduced. It is shown that test application cost, test data volume, and test power are reduced compared with the conventional full scan design with a single scan chain and several recent scan testing methods.

10.1109/tc.2007.1002 article EN IEEE Transactions on Computers 2007-03-14

WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation

Jianfei Chen Kaiwei Li Jun Zhu Wenguang Chen

Developing efficient and scalable algorithms for Latent Dirichlet Allocation (LDA) is of wide interest for many applications. Previous work has developed an O(1) Metropolis-Hastings sampling method for each token. However, the performance is far from being optimal due to random accesses to the parameter matrices and frequent cache misses. In this paper, we first carefully analyze the memory access efficiency of existing LDA algorithms by the scope of random access, which is the size of the memory region in which random accesses fall, within a short period of time. We then develop WarpLDA,...

10.48550/arxiv.1510.08628 preprint EN other-oa arXiv (Cornell University) 2015-01-01

Refactoring and optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight supercomputer

Haohuan Fu Liao Junfeng Wei Xue Lanning Wang Dexun Chen and 20 more

This paper reports our efforts on refactoring and optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight supercomputer, which uses a many-core processor that consists of management processing elements (MPEs) and clusters of computing processing elements (CPEs). To map the large code base of CAM to the millions of cores on the Sunway system, we take OpenACC-based refactoring as the major approach, and apply source-to-source translator tools to exploit the most suitable parallelism for the CPE cluster, and to fit the intermediate variable into the limited on-chip fast buffer. For...

10.5555/3014904.3015016 article EN IEEE International Conference on High Performance Computing, Data, and Analytics 2016-11-13

SaberLDA

Kaiwei Li J.F. Chen Wenguang Chen Jun Zhu

Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear to the number of topics.

10.1145/3037697.3037740 article EN 2017-04-04

Auxo: a temporal graph management system

Wentao Han Kaiwei Li Shimin Chen Wenguang Chen

As real-world graphs are often evolving over time, interest in analyzing the temporal behavior of graphs has grown. Herein, we propose Auxo, a novel temporal graph management system to support temporal graph analysis. It supports both efficient global and local queries with low space overhead. Auxo organizes temporal graph data in spatio-temporal chunks. A chunk spans a particular time interval and covers a set of vertices in a graph. We propose chunk layout and chunk splitting designs to achieve the desired efficiency and the above-mentioned goals. First, by carefully choosing the split policy,...

10.26599/bdma.2018.9020030 article EN cc-by Big Data Mining and Analytics 2018-11-19

ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation

Zhiyu Mei Wei Fu Kaiwei Li Guangju Wang Huanchen Zhang and 1 more

Reinforcement Learning from Human Feedback (RLHF) stands as a pivotal technique in empowering large language model (LLM) applications. Since RLHF involves diverse computational workloads and intricate dependencies among multiple LLMs, directly adopting parallelization techniques from supervised training can result in sub-optimal performance. To overcome this limitation, we propose a novel approach named parameter ReaLlocation, which dynamically redistributes LLM parameters in the cluster and adapts...

10.48550/arxiv.2406.14088 preprint EN arXiv (Cornell University) 2024-06-20

Generating Compact Robust and Non-Robust Tests for Complete Coverage of Path Delay Faults Based on Stuck-at Tests

Dong Xiang Kaiwei Li Hideo Fujiwara Jiaguang Sun

A new test generation method for fully scanned or combinational circuits is proposed for complete coverage of path delay faults based on single stuck-at tests. The proposed method adds the target path into the original circuit, where all off-path inputs are connected with the corresponding nodes in the original circuit. Test fault reduced to that at fanout branch, additional connects its source node disjoint dynamic test compaction scheme reduce size set process generation. conjoint counts paths. presents a very compact robustly and...

10.1109/iccd.2006.4380854 article EN Proceedings, IEEE International Conference on Computer Design/Proceedings - IEEE International Conference on Computer Design 2006-10-01

Constraining Transition Propagation for Low-Power Scan Testing Using a Two-Stage Scan Architecture

Dong Xiang Kaiwei Li Hideo Fujiwara K. Thulasiraman Jiaguang Sun

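A two-stage scan architecture is proposed to constrain transition propagation within a small part of the scan flip-flops. Most scan flip-flops are deactivated during test application. The first stage includes multiple scan chains, where each scan chain is driven by a primary input. Each scan flip-flop in the multiple scan chains drives a group of scan flip-flops in the second stage. Scan flip-flops in different stages use separate clock signals. Test signals assigned to scan flip-flops in the multiple scan chains are applied to the second stage one clock cycle after the test vector has been applied to the multiple scan chains. There exists no transition at the scan flip-flops in the second stage when...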

10.1109/tcsii.2007.892393 article EN IEEE Transactions on Circuits and Systems II Analog and Digital Signal Processing 2007-05-01

A Novel Program Scheme to Optimize Program Disturbance in Dual-Deck 3D NAND Flash Memory

Xinlei Jia Lei Jin Jianquan Jia Kaikai You Kaiwei Li and 10 more

The dual-deck architecture with aligned upper and lower decks is considered a promising technology to meet the demand of increasing word-line (WL) layers in 3D NAND flash. However, relevant reliability studies are still lacking for the dual-deck array. In this work, an abnormal program disturbance phenomenon of the bottom WLs in the upper deck is reported, and the physical mechanisms were studied. According to experimental analysis and TCAD simulations, the un-programmed dummy WLs at the joint region can introduce excessive residual electrons...

10.1109/led.2022.3178155 article EN IEEE Electron Device Letters 2022-05-27

SaberLDA

Kaiwei Li Jianfei Chen Wenguang Chen Jun Zhu

Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear to the number of topics.

10.1145/3093315.3037740 article EN ACM SIGOPS Operating Systems Review 2017-04-04

SaberLDA

Kaiwei Li Jianfei Chen Wenguang Chen Jun Zhu

Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear to the number of topics. In this paper, we propose SaberLDA, a GPU-based LDA system that...

10.1145/3093336.3037740 article EN ACM SIGPLAN Notices 2017-04-04

SaberLDA: Sparsity-Aware Learning of Topic Models on GPUs

Kaiwei Li Jianfei Chen Wenguang Chen Jun Zhu

Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear to the number of topics. In this paper, we propose SaberLDA, a GPU-based LDA system that...

10.48550/arxiv.1610.02496 preprint EN other-oa arXiv (Cornell University) 2016-01-01

12-Lead ECG Classification Based on Channel Self-Attention Mechanism Integrated Expert Features

Kaiwei Li Jinmeng Li Jiahui Li Hongkuan Zhang

10.1109/icsp62122.2024.10744000 article EN 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP) 2024-04-19

Fast and effective fault simulation for path delay faults based on selected testable paths

Dong Xiang Yang Zhao Kaiwei Li Hideo Fujiwara

Test generation and fault simulation of path delay faults are very time-consuming. A new fault simulation method for fully enhanced scan designed circuits is proposed for path delay faults based on single stuck-at tests without circuit transformation. The proposed method identifies robustly and non-robustly testable paths first, from which a selected path circuit (SPC) is constructed. The SPC contains no internal fanouts. Fault simulation of path delay faults is reduced to 3-valued logic simulation on the selected path circuit. Fault simulation is completed by only tracing the active part of the circuit. An effective fault dropping technique is also adopted for the selective scheme. The scheme...

10.1109/test.2007.4437636 article EN 2007-01-01

Scan-Based BIST Using an Improved Scan Forest Architecture

Dong Xiang Ming-jing Chen Kaiwei Li Yu-Liang Wu

Scan forest is an efficient scan architecture which can reduce the test application cost, the power of testing, and the test data volume greatly. The scan forest is modified for scan-based BIST. Techniques are used to make the existing scan forest an improved architecture that is more suitable for scan-based BIST. A scan flip-flop regrouping technique is introduced to make the groups have similar sizes. Sufficient experimental results show that the proposed techniques improve the popular test-per-scan architecture greatly on fault coverage and test length. It is shown according to the experimental results that the test length is reduced by 77.3% on average for all benchmark circuits.

10.1109/ats.2004.78 article EN 2005-04-06

SaberLDA: Sparsity-Aware Learning of Topic Models on GPUs

Kaiwei Li Jianfei Chen Wenguang Chen Jun Zhu

Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images, which are required to model datasets and a large number of topics, e.g., tens of thousands of topics for industry scale applications. Although distributed CPU systems have been used to address this problem, they are slow and resource inefficient. GPU-based systems have emerged as a promising alternative because of their high computational power and memory bandwidth. However, existing GPU-based LDA systems can only learn thousands of topics, because they use dense data structures, and have linear...

10.1109/tpds.2020.2979702 article EN cc-by IEEE Transactions on Parallel and Distributed Systems 2020-04-08

SaberLDA

Kaiwei Li Jianfei Chen Wenguang Chen Jun Zhu

Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear to the number of topics. In this paper, we propose SaberLDA, a GPU-based LDA system that...

10.1145/3093337.3037740 article EN ACM SIGARCH Computer Architecture News 2017-04-04