Kaiwei Li

ORCID: 0000-0002-8015-0812
Research Areas
  • Data Management and Algorithms
  • VLSI and Analog Circuit Testing
  • Integrated Circuits and Semiconductor Failure Analysis
  • Topic Modeling
  • Graph Theory and Algorithms
  • Engineering and Test Systems
  • Algorithms and Data Compression
  • Image Retrieval and Classification Techniques
  • Advanced Image and Video Retrieval Techniques
  • Text and Document Classification Technologies
  • Caching and Content Delivery
  • Advanced Database Systems and Queries
  • Opportunistic and Delay-Tolerant Networks
  • Meteorological Phenomena and Simulations
  • Advancements in Photolithography Techniques
  • Advanced Graph Neural Networks
  • Distributed and Parallel Computing Systems
  • Natural Language Processing Techniques
  • Bayesian Methods and Mixture Models
  • Advanced Data Storage Technologies
  • Web Data Mining and Analysis
  • Speech Recognition and Synthesis
  • Parallel Computing and Optimization Techniques
  • VLSI and FPGA Design Techniques
  • Cellular Automata and Applications

Shandong Academy of Sciences
2024

Qilu University of Technology
2024

Tsinghua University
2005-2020

Microsoft Research (United Kingdom)
2014

Xiaomi (China)
2006

Temporal graphs capture changes in graphs over time and are becoming a subject that attracts increasing interest from the research communities, for example, to understand temporal characteristics of social interactions on a time-evolving graph. Chronos is a storage and execution engine designed and optimized specifically for running in-memory iterative graph computation on temporal graphs. Locality is at the center of the Chronos design, where the layout and scheduling are carefully designed, so that common "bulk" operations are scheduled to maximize the benefit of data...

10.1145/2592798.2592799 article EN 2014-04-14

Temporal graphs that capture graph changes over time are attracting increasing interest from research communities, for functions such as understanding temporal characteristics of social interactions on a time-evolving graph. ImmortalGraph is a storage and execution engine designed and optimized specifically for temporal graphs. Locality is at the center of ImmortalGraph’s design: temporal graphs are carefully laid out in both persistent storage and memory, taking into account data locality in both time and graph-structure dimensions. ImmortalGraph introduces the notion...

10.1145/2700302 article EN ACM Transactions on Storage 2015-07-24

Developing efficient and scalable algorithms for Latent Dirichlet Allocation (LDA) is of wide interest for many applications. Previous work has developed an O(1) Metropolis-Hastings (MH) sampling method for each token. However, its performance is far from being optimal due to frequent cache misses caused by random accesses to the parameter matrices. In this paper, we first carefully analyze the memory access behavior of existing LDA algorithms in terms of cache locality at the document level. We then develop WarpLDA, which achieves O(1) time...

10.14778/2977797.2977801 article EN Proceedings of the VLDB Endowment 2016-06-01

This paper reports our efforts on refactoring and optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight supercomputer, which uses a many-core processor that consists of management processing elements (MPEs) and clusters of computing processing elements (CPEs). To map the large code base of CAM to the millions of cores on the Sunway system, we take OpenACC-based refactoring as the major approach, and apply source-to-source translator tools to exploit the most suitable parallelism for the CPE cluster and to fit the intermediate variable into the limited on-chip fast buffer. For...

10.1109/sc.2016.82 article EN 2016-11-01

A new scan architecture called reconfigured scan forest is proposed for cost-effective scan testing. Multiple scan flip-flops can be grouped based on structural analysis that avoids untestable faults due to reconvergent fanouts. The architecture allows only a few scan flip-flops to be connected to the XOR trees. The size of the XOR trees is greatly reduced compared with the original scan forest; therefore, area overhead and routing complexity are reduced. It is shown that test application cost, test data volume, and test power are reduced compared with the conventional full scan design with a single scan chain and several recent scan testing methods.

10.1109/tc.2007.1002 article EN IEEE Transactions on Computers 2007-03-14

Developing efficient and scalable algorithms for Latent Dirichlet Allocation (LDA) is of wide interest for many applications. Previous work has developed an O(1) Metropolis-Hastings sampling method for each token. However, the performance is far from being optimal due to random accesses to the parameter matrices and frequent cache misses. In this paper, we first carefully analyze the memory access efficiency of existing LDA algorithms by the scope of random access, which is the size of the memory region in which random accesses fall within a short period of time. We then develop WarpLDA,...

10.48550/arxiv.1510.08628 preprint EN other-oa arXiv (Cornell University) 2015-01-01

This paper reports our efforts on refactoring and optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight supercomputer, which uses a many-core processor that consists of management processing elements (MPEs) and clusters of computing processing elements (CPEs). To map the large code base of CAM to the millions of cores on the Sunway system, we take OpenACC-based refactoring as the major approach, and apply source-to-source translator tools to exploit the most suitable parallelism for the CPE cluster and to fit the intermediate variable into the limited on-chip fast buffer. For...

10.5555/3014904.3015016 article EN IEEE International Conference on High Performance Computing, Data, and Analytics 2016-11-13

Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear in the number of topics.

10.1145/3037697.3037740 article EN 2017-04-04

As real-world graphs are often evolving over time, interest in analyzing the temporal behavior of graphs has grown. Herein, we propose Auxo, a novel temporal graph management system to support temporal graph analysis. It supports both efficient global and local queries with low space overhead. Auxo organizes temporal graph data in spatio-temporal chunks. A chunk spans a particular time interval and covers a set of vertices in a graph. We propose chunk layout and chunk splitting designs to achieve the desired efficiency and the above-mentioned goals. First, by carefully choosing the split policy,...

10.26599/bdma.2018.9020030 article EN cc-by Big Data Mining and Analytics 2018-11-19

Reinforcement Learning from Human Feedback (RLHF) stands as a pivotal technique in empowering large language model (LLM) applications. Since RLHF involves diverse computational workloads and intricate dependencies among multiple LLMs, directly adopting parallelization techniques from supervised training can result in sub-optimal performance. To overcome this limitation, we propose a novel approach named parameter ReaLlocation, which dynamically redistributes LLM parameters in the cluster and adapts...

10.48550/arxiv.2406.14088 preprint EN arXiv (Cornell University) 2024-06-20

A new test generation method for fully scanned or combinational circuits is proposed for complete coverage of path delay faults based on single stuck-at tests. The adds the target into original circuit, where all off inputs are connected with corresponding nodes in circuit. Test fault reduced to that at fanout branch, additional connects its source node disjoint dynamic test compaction scheme reduce size set process generation. conjoint counts paths. presents a very compact robustly and...

10.1109/iccd.2006.4380854 article EN Proceedings, IEEE International Conference on Computer Design/Proceedings - IEEE International Conference on Computer Design 2006-10-01

A two-stage scan architecture is proposed to constrain transition propagation within a small part of the scan flip-flops. Most scan flip-flops are deactivated during test application. The first stage includes multiple scan chains, where each scan chain is driven by a primary input. Each scan flip-flop in the multiple scan chains drives a group of scan flip-flops in the second stage. Scan flip-flops in different stages use separate clock signals. Test signals assigned to the second stage are applied one clock cycle after the test vector has been applied to the multiple scan chains. There exists no at when

10.1109/tcsii.2007.892393 article EN IEEE Transactions on Circuits and Systems II Analog and Digital Signal Processing 2007-05-01

The dual-deck architecture with aligned upper and lower decks is considered a promising technology to meet the demand of increasing word-line (WL) layers in 3D NAND flash. However, relevant reliability studies are still lacking for the dual-deck array. In this work, an abnormal program disturbance phenomenon of the bottom WLs in the upper deck is reported, and its physical mechanisms are studied. According to experimental analysis and TCAD simulations, the un-programmed dummy WLs at the joint region can introduce excessive residual electrons...

10.1109/led.2022.3178155 article EN IEEE Electron Device Letters 2022-05-27

Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear in the number of topics.

10.1145/3093315.3037740 article EN ACM SIGOPS Operating Systems Review 2017-04-04

Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear in the number of topics. In this paper, we propose SaberLDA, a GPU-based LDA system that...

10.1145/3093336.3037740 article EN ACM SIGPLAN Notices 2017-04-04

Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear in the number of topics. In this paper, we propose SaberLDA, a GPU-based LDA system that...

10.48550/arxiv.1610.02496 preprint EN other-oa arXiv (Cornell University) 2016-01-01

10.1109/icsp62122.2024.10744000 article EN 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP) 2024-04-19

Test generation and fault simulation of path delay faults are very time-consuming. A new test generation method for fully enhanced scan designed circuits is proposed for path delay faults based on single stuck-at tests without circuit transformation. The method first identifies robustly and non-robustly testable paths, for which a selected path circuit (SPC) is constructed. The SPC contains no internal fanouts. Fault simulation is reduced to 3-valued logic simulation of the circuit and is completed by only tracing the active part. An effective fault dropping technique is also adopted for the selective scheme. The scheme...

10.1109/test.2007.4437636 article EN 2007-01-01

Scan forest is an efficient scan architecture which can greatly reduce the test application cost, the power of testing, and the test data volume. The scan forest is modified for scan-based BIST. Techniques are used to turn the existing scan forest into an improved scan forest that is more suitable for BIST. A scan flip-flop regrouping technique is introduced so that the groups have similar sizes. Sufficient experimental results show that the proposed techniques greatly improve the popular test-per-scan BIST on fault coverage and test length. It is shown according to the experimental results that test length is reduced by 77.3% on average for all benchmark circuits.

10.1109/ats.2004.78 article EN 2005-04-06

Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images, which are required to model large datasets and a large number of topics, e.g., tens of thousands of topics for industry scale applications. Although distributed CPU systems have been used to address this problem, they are slow and resource inefficient. GPU-based systems have emerged as a promising alternative because of their high computational power and memory bandwidth. However, existing GPU-based LDA systems can only learn thousands of topics because they use dense data structures, and have linear...

10.1109/tpds.2020.2979702 article EN cc-by IEEE Transactions on Parallel and Distributed Systems 2020-04-08

Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear in the number of topics. In this paper, we propose SaberLDA, a GPU-based LDA system that...

10.1145/3093337.3037740 article EN ACM SIGARCH Computer Architecture News 2017-04-04