- Interconnection Networks and Systems
- Parallel Computing and Optimization Techniques
- Embedded Systems Design Techniques
- Advanced Data Storage Technologies
- Distributed and Parallel Computing Systems
- Advanced Data Compression Techniques
- Video Coding and Compression Technologies
- Embedded Systems and FPGA Design
- Ion-surface interactions and analysis
- Neural Networks and Applications
- Ferroelectric and Negative Capacitance Devices
- PAPR reduction in OFDM
- RFID technology advancements
- Evaluation and Optimization Models
- Evaluation Methods in Various Fields
- VLSI and Analog Circuit Testing
- Image Processing Techniques and Applications
- Service-Oriented Architecture and Web Services
- Numerical Methods and Algorithms
- Rice Cultivation and Yield Improvement
- VLSI and FPGA Design Techniques
- Advancements in PLL and VCO Technologies
- Video Analysis and Summarization
- Software System Performance and Reliability
- Petri Nets in System Modeling
Baidu (China)
2020-2023
OnApp (Gibraltar)
2018
Shandong Institute of Automation
2015-2018
Chinese Academy of Sciences
2016-2018
Institute of Automation
2015-2017
University of Chinese Academy of Sciences
2017
University of Science and Technology of China
2008-2015
Tiangong University
2015
Beihang University
2014
General Electric (Norway)
2005
As the feature size of semiconductor processes scales down to 10 nm and below, it is possible to assemble systems with high-performance processors that can theoretically provide computational power of up to tens of PFLOPS. However, the power consumption of these systems is also rocketing to millions of watts, and the actual performance is only around 60% of the theoretical performance. Today, power efficiency and sustained performance have become the main foci of processor designers. Traditional computing architectures such as superscalar and GPGPU are proven to be inefficient, and there is a big gap between...
This article consists only of a collection of slides from the author's conference presentation.
In order to be able to handle a wide range of AI applications, such as speech, image, language and autonomous driving, it is necessary that an accelerator is flexible enough for diversified workloads. Baidu Kunlun, a chip designed in-house by Baidu, achieves this capability with high programmability, flexibility and performance. Kunlun was inspired by the XPU architecture [1]. The chip is implemented in Samsung 14 nm process technology. Its peak performance is 230 TOPS@INT8 at 900 MHz and up to 281 TOPS@INT8 at 1.1 GHz boost...
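The two peak figures quoted above are consistent with throughput scaling linearly with clock frequency. The short Python check below is only an arithmetic sketch under that assumption; the function name `scaled_peak_tops` is hypothetical and does not come from the paper.

```python
def scaled_peak_tops(base_tops: float, base_mhz: float, target_mhz: float) -> float:
    """Scale a peak-throughput figure linearly with clock frequency.

    Assumption: the number of operations per cycle is fixed, so peak
    throughput is proportional to the clock. This is not stated in the
    abstract; it is only a plausibility check on the quoted numbers.
    """
    return base_tops * target_mhz / base_mhz


# 230 TOPS@INT8 at 900 MHz, rescaled to the 1.1 GHz boost clock.
boost = scaled_peak_tops(230.0, 900.0, 1100.0)
print(f"{boost:.1f} TOPS@INT8")  # prints "281.1 TOPS@INT8"
```

The result rounds to the 281 TOPS@INT8 boost figure given in the abstract.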
In high-performance computing systems, each node communicates via a high-speed serial bus to ensure sufficient data transfer bandwidth. However, buses of different protocols are very difficult to connect directly, which is not conducive to the extensibility of HPC (high-performance computing) clusters. In this paper, we propose UPI, an inter-node communication interface based on FPGA, which can transmit different protocols (PCIe protocol and Ethernet protocol) simultaneously. More importantly, many bus-supported nodes can be connected to the same...
Design Space Exploration (DSE) is a critical step in chip design. The tradeoffs and interactions among design parameters are traditionally evaluated by simulating or synthesizing a variety of designs, which is intractable. Predictive modeling techniques have been applied to predict design performance for DSE. For system-on-a-chip (SoC) DSE cases, however, it is difficult to achieve high accuracy with previous methods due to their limitations. In this paper, we propose a new estimation method based on...
In FPGA-based SoCs, each interconnect bus such as PCIe and Ethernet has a separate physical layer interface. The physical layer (PHY) incurs considerable power consumption and area overhead. In this paper, we propose a flexible physical layer interface (Unified PHY Interface, UPI) based on FPGA and describe its design. More specifically, UPI can parse various packets automatically by adding an interface convertor between the PHY and the upper layer. Thus, the architecture can be realized using the same PHY for each controller. We implemented UPI on two Xilinx Virtex-7 FPGAs with Synopsys...
Platform-as-a-Service (PaaS) systems offer customers a rich environment in which to build, deploy, and run applications. Today's PaaS offerings are tailored mainly to the needs of web and mobile application developers, and involve a fairly rigid stack of components and features. The vision of the H2020 5GPPP Phase 2 Next Generation Platform-as-a-Service (NGPaaS) project is to enable "build-to-order" customized PaaSes for a wide range of use cases with telco-grade 5G characteristics. This paper sets out the salient and innovative...
SSD (solid state device) has shown great potential in astronomy data storage. Data compression is an essential task to obtain higher storage density and bandwidth. This paper proposes a distributed compressor customized for FPGA-based SSD. Our data-driven compressor copes with data in units of bytes. Two algorithms, run-length and length-limited Huffman, are utilized, and the encoder is further developed to reduce latency. Experimental results indicate that our proposed compressor achieves 1 GB/s bandwidth with fewer than 2500 LUTs utilized...
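As a rough illustration of the byte-granular run-length stage mentioned above, here is a minimal software sketch in Python. It is not the paper's hardware design: the `(count, value)` pair format, the 255-byte run cap, and the function names `rle_encode` and `rle_decode` are all assumptions made for this example, and the length-limited Huffman stage is not shown.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode data as (count, value) byte pairs; runs are capped at 255.

    Assumed format for illustration only, not the paper's encoder.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Extend the run while the next byte repeats and the count fits in one byte.
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Invert rle_encode by expanding each (count, value) pair."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        count, value = encoded[k], encoded[k + 1]
        out += bytes([value]) * count
    return bytes(out)


sample = b"\x00\x00\x00\x07"
packed = rle_encode(sample)  # three zero bytes, then a single 0x07
assert rle_decode(packed) == sample
```

Run-length coding only pays off when the input has long runs of identical bytes; on data without runs this pair format doubles the size, which is one reason a second entropy-coding stage is commonly paired with it.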
The high performance processing (HPP) architecture is an innovative architecture which targets high performance computing with excellent power efficiency and computing performance. It is suitable for data-intensive applications like supercomputing, machine learning and wireless communication. An example chip with four application-specific integrated circuit (ASIC) cores, the first generation of HPP, has been taped out successfully under the Taiwan Semiconductor Manufacturing Company (TSMC) 40 nm low-power process. It shows great energy efficiency over traditional...
Traditional system reliability models have almost neglected the coupling between different states (normal, failure, etc.) and the continuous variation process of performance. This paper presents a reliability modeling method based on hybrid Petri nets (HPN), which combines discrete states and continuous performance together during modeling to describe the coupling relationship. Firstly, the normal running mode and fault mode were established using HPN to describe the logical relationship between states. Secondly, to account for each state, the corresponding models of uncertain external...
An improved anti-aliasing sampling algorithm is proposed to reduce the increasing memory consumption caused by super-sampling in mobile devices. Six-point anisotropy blends two samples of a pixel, as well as samples of nearby pixels. Experimental results showed that six-point anisotropy reduced memory consumption by 50% compared with the traditional FLIPQUAD algorithm. This method achieves similar quality with only 50% of the memory consumption.
At present, most ceramic factories carry materials by hand. This paper designed the structure of the product with an automatic pick-and-place manipulator as the research object. Furthermore, it applied a 3D model and motion simulation to the manipulator through SolidWorks. It then gave a design scheme of the manipulator and finally obtained the feasibility of the scheme. At the same time, this paper adopted a control strategy of force loop and position loop with a fuzzy adaptive PID algorithm to ensure the precision requirement. It also made a dynamic simulation for the movement of the manipulator. The...
Spatial accelerators enable the pervasive use of energy-efficient solutions for computation-intensive applications. In mapping onto spatial accelerators, a large kernel is usually partitioned into multiple subgraphs because of resource constraints, leading to more memory accesses and access conflicts. To minimize the conflicts, existing works either neglect the interference or pay little attention to the data's life cycle along the execution order. To this end, this paper proposes an optimized memory allocation approach for multi-subgraph mapping on spatial accelerators by...
With the wide use of 4G networks, the corresponding bandwidth processing has become a critical issue. The currently recognized 4G network is LTE-A. In baseband processing for LTE-A, the physical layer algorithm is the biggest bottleneck for processors, so an application-specific integrated circuit (ASIC) design is necessary. This article introduces a communication-dedicated coprocessor (TxCP), specifically for LTE-A uplink shared/control channel (PUSCH/PUCCH) bit-level acceleration. Internally, it supports PUSCH/PUCCH CRC, Turbo...
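For readers unfamiliar with the bit-level CRC step named above, the following Python sketch computes a 24-bit CRC one bit at a time. It assumes the CRC24A generator polynomial commonly given for LTE transport blocks (0x864CFB, zero initial value, most significant bit first, no final XOR). The abstract does not say which CRC variants the coprocessor implements, and the function name `crc24a` is hypothetical.

```python
CRC24A_POLY = 0x864CFB  # assumed generator polynomial (x^24 term implicit)


def crc24a(data: bytes) -> int:
    """Bitwise MSB-first CRC-24 with zero initial value and no final XOR."""
    reg = 0
    for byte in data:
        reg ^= byte << 16  # align the new byte with the top of the 24-bit register
        for _ in range(8):
            reg <<= 1
            if reg & 0x1000000:  # a 1 shifted out of bit 23: subtract the polynomial
                reg ^= 0x1000000 | CRC24A_POLY
    return reg & 0xFFFFFF


payload = b"uplink transport block"
parity = crc24a(payload)
# Appending the 3 parity bytes makes the whole codeword divisible by the polynomial.
assert crc24a(payload + parity.to_bytes(3, "big")) == 0
```

A software loop like this handles one bit per iteration, whereas a dedicated hardware unit can process many bits per clock cycle, which is the usual motivation for offloading such bit-level work.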
This paper presents a novel register file with self-indexed features, targeting DSP/media algorithms with massive data locality. The self-indexed register file (SIRF) contains 128 high-speed registers, 4 input ports and output ports. It can be accessed in double circular window mode, or simply in immediate index mode. SIRF can eliminate write-after-write (WAW) dependence without register renaming in hardware or redundant register allocation in compilers, and it can also reduce address computation if the accessing pattern satisfies ... SIRF was implemented in a high performance...