- Data Management and Algorithms
- VLSI and Analog Circuit Testing
- Integrated Circuits and Semiconductor Failure Analysis
- Topic Modeling
- Graph Theory and Algorithms
- Engineering and Test Systems
- Algorithms and Data Compression
- Image Retrieval and Classification Techniques
- Advanced Image and Video Retrieval Techniques
- Text and Document Classification Technologies
- Caching and Content Delivery
- Advanced Database Systems and Queries
- Opportunistic and Delay-Tolerant Networks
- Meteorological Phenomena and Simulations
- Advancements in Photolithography Techniques
- Advanced Graph Neural Networks
- Distributed and Parallel Computing Systems
- Natural Language Processing Techniques
- Bayesian Methods and Mixture Models
- Advanced Data Storage Technologies
- Web Data Mining and Analysis
- Speech Recognition and Synthesis
- Parallel Computing and Optimization Techniques
- VLSI and FPGA Design Techniques
- Cellular Automata and Applications
Shandong Academy of Sciences, 2024
Qilu University of Technology, 2024
Tsinghua University, 2005-2020
Microsoft Research (United Kingdom), 2014
Xiaomi (China), 2006
Temporal graphs capture changes in graphs over time and are becoming a subject that attracts increasing interest from the research communities, for example, to understand temporal characteristics of social interactions on a time-evolving graph. Chronos is a storage and execution engine designed and optimized specifically for running in-memory iterative graph computation on temporal graphs. Locality is at the center of the Chronos design, where the layout and the scheduling are carefully designed, so that common "bulk" operations are scheduled to maximize the benefit of data...
Temporal graphs that capture graph changes over time are attracting increasing interest from research communities, for functions such as understanding temporal characteristics of social interactions on a time-evolving graph. ImmortalGraph is a storage and execution engine designed and optimized specifically for temporal graphs. Locality is at the center of ImmortalGraph’s design: temporal graphs are carefully laid out in both persistent storage and memory, taking into account data locality in both time and graph-structure dimensions. ImmortalGraph introduces the notion...
Developing efficient and scalable algorithms for Latent Dirichlet Allocation (LDA) is of wide interest for many applications. Previous work has developed an O(1) Metropolis-Hastings (MH) sampling method for each token. However, its performance is far from being optimal due to frequent cache misses caused by random accesses to the parameter matrices. In this paper, we first carefully analyze the memory access behavior of existing LDA algorithms and locality at the document level. We then develop WarpLDA, which achieves O(1) time...
This paper reports our efforts on refactoring and optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight supercomputer, which uses a many-core processor that consists of management processing elements (MPEs) and clusters of computing processing elements (CPEs). To map the large code base of CAM to the millions of cores on the Sunway system, we take OpenACC-based refactoring as the major approach, and apply source-to-source translator tools to exploit the most suitable parallelism for the CPE cluster and to fit the intermediate variable into the limited on-chip fast buffer. For...
A new scan architecture called reconfigured scan forest is proposed for cost-effective scan testing. Multiple scan flip-flops can be grouped based on structural analysis that avoids untestable faults due to reconvergent fanouts. The architecture allows only a few scan flip-flops to be connected to the XOR trees. The size of the XOR trees can be greatly reduced compared with the original scan forest; therefore, area overhead and routing complexity can be reduced. It is shown that test application cost, test data volume, and test power can be reduced compared with the conventional full scan design with a single scan chain and several recent scan testing methods.
Developing efficient and scalable algorithms for Latent Dirichlet Allocation (LDA) is of wide interest for many applications. Previous work has developed an O(1) Metropolis-Hastings sampling method for each token. However, the performance is far from being optimal due to random accesses to the parameter matrices and frequent cache misses. In this paper, we first carefully analyze the memory access efficiency of existing LDA algorithms by the scope of random access, which is the size of the memory region in which random accesses fall, within a short period of time. We then develop WarpLDA,...
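The scope-of-random-access notion in the abstract above can be illustrated with a small measurement over an access trace. This is only a sketch under stated assumptions: the function name `access_scope`, the window length, and both traces are hypothetical, and "size of the memory region" is approximated here as the address span (max minus min plus one) inside a sliding window of consecutive accesses. It is not the measurement procedure used by the WarpLDA authors.

```python
from typing import List


def access_scope(trace: List[int], window: int) -> int:
    """Largest address span touched inside any window of consecutive accesses."""
    if window <= 0:
        raise ValueError("window must be positive")
    worst = 0
    # Slide a fixed-size window over the trace; a short trace is one window.
    for start in range(0, max(len(trace) - window + 1, 1)):
        chunk = trace[start:start + window]
        if chunk:
            worst = max(worst, max(chunk) - min(chunk) + 1)
    return worst


# Hypothetical traces over a parameter array with 1000 entries.
matrix_wide = [0, 950, 12, 700, 333, 999, 5, 480]  # accesses scattered over the whole array
row_local = [40, 41, 43, 40, 42, 44, 41, 45]       # accesses confined to neighbouring entries

wide_scope = access_scope(matrix_wide, window=4)
local_scope = access_scope(row_local, window=4)
print(wide_scope, local_scope)
```

A smaller scope means the working set of a short time window is more likely to fit in cache, which is the intuition behind analyzing samplers this way.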
Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear in the number of topics.
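The complexity concern raised in the abstract above can be made concrete by comparing a dense and a sparse layout for one document's topic counts. This is a minimal sketch, assuming a hypothetical document and hypothetical helper names (`dense_row`, `sparse_row`); it illustrates why dense storage grows with the number of topics and is not a description of any particular GPU system's data structures.

```python
from typing import Dict, List


def dense_row(assignments: List[int], num_topics: int) -> List[int]:
    """Document-topic counts as a dense vector: one slot per topic."""
    row = [0] * num_topics
    for k in assignments:
        row[k] += 1
    return row


def sparse_row(assignments: List[int]) -> Dict[int, int]:
    """The same counts as a sparse map: one entry per topic actually used."""
    row: Dict[int, int] = {}
    for k in assignments:
        row[k] = row.get(k, 0) + 1
    return row


# Hypothetical document: 6 tokens assigned to 3 distinct topics out of 10,000.
num_topics = 10_000
assignments = [7, 7, 4021, 7, 9999, 4021]

dense = dense_row(assignments, num_topics)
sparse = sparse_row(assignments)

dense_slots = len(dense)    # grows with the number of topics
sparse_slots = len(sparse)  # bounded by the number of tokens in the document
print(dense_slots, sparse_slots)
```

With the dense layout, every pass over the row costs time and space proportional to the number of topics, whereas the sparse map is bounded by the document length regardless of how many topics the model has.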
As real-world graphs are often evolving over time, interest in analyzing the temporal behavior of graphs has grown. Herein, we propose Auxo, a novel temporal graph management system to support temporal graph analysis. It supports both efficient global and local queries with low space overhead. Auxo organizes temporal graph data in spatio-temporal chunks. A chunk spans a particular time interval and covers a set of vertices in the graph. We propose chunk layout and chunk splitting designs to achieve the desired efficiency and the above-mentioned goals. First, by carefully choosing the chunk split policy,...
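The chunk organization described in the abstract above, where each chunk spans a time interval and covers a set of vertices, can be sketched as a small data structure. Everything here is an assumption made for illustration: the `Chunk` class, the half-open interval convention, the edge tuple layout, and the interpretation of a local query as "edges of one vertex in a time range" and a global query as "all edges up to a time". It is not Auxo's actual layout or split policy.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Edge = Tuple[int, int, int]  # (timestamp, source vertex, destination vertex)


@dataclass
class Chunk:
    t_start: int        # inclusive start of the chunk's time interval
    t_end: int          # exclusive end of the chunk's time interval
    vertices: Set[int]  # source vertices this chunk covers
    edges: List[Edge] = field(default_factory=list)


def local_query(chunks: List[Chunk], v: int, t_from: int, t_to: int) -> List[Edge]:
    """Edges leaving vertex v with timestamps in [t_from, t_to)."""
    hits: List[Edge] = []
    for c in chunks:
        # Skip chunks whose vertex set or time interval cannot match.
        if v not in c.vertices or c.t_end <= t_from or c.t_start >= t_to:
            continue
        hits.extend(e for e in c.edges if e[1] == v and t_from <= e[0] < t_to)
    return sorted(hits)


def global_query(chunks: List[Chunk], t: int) -> List[Edge]:
    """All edges recorded up to and including time t (a whole-graph view)."""
    return sorted(e for c in chunks if c.t_start <= t for e in c.edges if e[0] <= t)


# Hypothetical data: two time intervals by two vertex groups.
chunks = [
    Chunk(0, 10, {1, 2}, [(1, 1, 2), (4, 2, 3)]),
    Chunk(0, 10, {3, 4}, [(2, 3, 1)]),
    Chunk(10, 20, {1, 2}, [(12, 1, 3)]),
    Chunk(10, 20, {3, 4}, [(15, 4, 1)]),
]

print(local_query(chunks, v=1, t_from=0, t_to=20))
print(global_query(chunks, t=12))
```

The point of the sketch is that a local query only has to open chunks whose vertex set and interval overlap the request, while a global query walks chunks in time order.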
Reinforcement Learning from Human Feedback (RLHF) stands as a pivotal technique in empowering large language model (LLM) applications. Since RLHF involves diverse computational workloads and intricate dependencies among multiple LLMs, directly adopting parallelization techniques from supervised training can result in sub-optimal performance. To overcome this limitation, we propose a novel approach named parameter ReaLlocation, which dynamically redistributes LLM parameters in the cluster and adapts...
A new test generation method for fully scanned or combinational circuits is proposed for complete coverage of path delay faults based on single stuck-at tests. The method adds the target path into the original circuit, where all off-path inputs are connected with the corresponding nodes in the original circuit. Test generation for the path delay fault is reduced to that for a single stuck-at fault at the fanout branch where the additional path connects to its source node. A disjoint dynamic test compaction scheme is proposed to reduce the size of the test set in the process of test generation. A conjoint dynamic test compaction scheme counts paths. The method presents a very compact test set for robustly and...
A two-stage scan architecture is proposed to constrain transition propagation within a small part of the scan flip-flops. Most scan flip-flops are deactivated during test application. The first stage includes multiple scan chains, where each scan chain is driven by a primary input. Each scan flip-flop in the multiple scan chains drives a group of scan flip-flops in the second stage. Scan flip-flops in different stages use separate clock signals. Test signals assigned to the scan flip-flops in the second stage are applied one cycle after the test vector has been applied to the scan chains. There exists no transition at the scan flip-flops in the second stage when...
The dual-deck architecture with aligned upper and lower decks is considered a promising technology to meet the demand of increasing word-line (WL) layers of 3D NAND flash. However, relevant reliability studies are still lacking for the dual-deck array. In this work, an abnormal program disturbance phenomenon of the bottom WLs in the upper deck is reported, and the physical mechanisms are studied. According to experimental analysis and TCAD simulations, the un-programmed dummy WLs at the joint region can introduce excessive residual electrons...
Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPU-based systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPU-based LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear in the number of topics. In this paper, we propose SaberLDA, a GPU-based LDA system that...
Test generation and fault simulation of path delay faults are very time-consuming. A new method for fully enhanced scan designed circuits is proposed for path delay faults based on single stuck-at tests without circuit transformation. The method identifies robustly and non-robustly testable paths first, based on which a selected path circuit (SPC) is constructed. The SPC contains no internal fanouts. Fault simulation is reduced to 3-valued logic simulation of the circuit. The simulation is completed by only tracing the active part of the circuit. An effective fault dropping technique is also adopted for the selective scheme. The scheme...
Scan forest is an efficient scan architecture which can greatly reduce the test application cost, the power of testing, and the test data volume. The scan forest architecture is modified for scan-based BIST. Techniques are used to improve the existing scan forest architecture and make it more suitable for scan-based BIST. A scan flip-flop regrouping technique is introduced to make the groups have similar sizes. Sufficient experimental results show that the proposed techniques greatly improve the popular test-per-scan architecture on fault coverage and test length. It is shown according to the experimental results that test length can be reduced 77.3% on average for all benchmark circuits.
Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images, which are required to model datasets with a large number of topics, e.g., tens of thousands of topics for industry-scale applications. Although distributed CPU systems have been used to address this problem, they are slow and resource inefficient. GPU-based systems have emerged as a promising alternative because of their high computational power and memory bandwidth. However, existing GPU-based LDA systems can only learn thousands of topics because they use dense data structures, linear...