Yong Zhao

ORCID: 0000-0003-2644-952X
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Natural Language Processing Techniques
  • Face recognition and analysis
  • Advanced Neural Network Applications
  • Advanced Image and Video Retrieval Techniques
  • Video Surveillance and Tracking Methods
  • Emotion and Mood Recognition
  • Speech and dialogue systems
  • Topic Modeling
  • Human Pose and Action Recognition
  • Structural Behavior of Reinforced Concrete
  • Scientific Computing and Data Management
  • Distributed and Parallel Computing Systems
  • Language, Metaphor, and Cognition
  • Advanced Vision and Imaging
  • Translation Studies and Practices
  • Image Enhancement Techniques
  • Face and Expression Recognition
  • Generative Adversarial Networks and Image Synthesis
  • Anomaly Detection Techniques and Applications
  • Asian Culture and Media Studies
  • Advanced Optical Network Technologies
  • Advanced Adaptive Filtering Techniques

Tongji University
2010-2024

Peking University
2013-2024

Northwestern Polytechnical University
2004-2024

National University of Defense Technology
2023-2024

Ningxia University
2023

China Resources (China)
2023

Xi'an City Planning&Design Institute (China)
2023

Guizhou University
2022

Ocean University of China
2015-2022

Microsoft (United States)
2006-2021

A new type of End-to-End system for text-dependent speaker verification is presented in this paper. Previously, using the phonetic discriminate/speaker discriminate DNN as a feature extractor has shown promising results. The extracted frame-level (bottleneck, posterior or d-vector) features are equally weighted and aggregated to compute an utterance-level representation (d-vector or i-vector). In this work we use a CNN to extract noise-robust features. These are smartly combined to form a vector through attention...

10.1109/slt.2016.7846261 article EN 2022 IEEE Spoken Language Technology Workshop (SLT) 2016-12-01

The ResNet-based architecture has been widely adopted to extract speaker embeddings for text-independent verification systems. By introducing residual connections to the CNN and standardizing the blocks, the ResNet structure is capable of training deep networks to achieve highly competitive recognition performance. However, when the input feature space becomes more complicated, simply increasing the depth and width...

10.1109/slt48900.2021.9383531 article EN 2022 IEEE Spoken Language Technology Workshop (SLT) 2021-01-19

Deep Neural Network Hidden Markov Models, or DNN-HMMs, are recently very promising acoustic models achieving good speech recognition results over Gaussian mixture model based HMMs (GMM-HMMs). In this paper, for emotion recognition from speech, we investigate DNN-HMMs with restricted Boltzmann Machine (RBM) unsupervised pre-training, and with discriminative pre-training. Emotion recognition experiments are carried out with these two on the eNTERFACE'05 database and the Berlin database, respectively, and compared with those of the GMM-HMMs,...

10.1109/acii.2013.58 article EN 2013-09-01

Teacher-student (T/S) learning has been shown to be effective for a variety of problems such as domain adaptation and model compression. One shortcoming of T/S learning is that the teacher model, not always perfect, sporadically produces wrong guidance in the form of posterior probabilities that misleads the student model towards suboptimal performance. To overcome this problem, we propose a conditional T/S learning scheme, in which a "smart" student model selectively chooses to learn from either the teacher model or the ground truth labels conditioned on whether the teacher can correctly...

10.1109/icassp.2019.8683438 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17
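
The conditional selection described in the abstract above can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the abstract is truncated, so the selection rule used here (follow the teacher only when its top-scoring class matches the ground truth label, otherwise fall back to the one-hot label) is an assumption, and the names `conditional_target` and `cross_entropy` are hypothetical.

```python
import math


def conditional_target(teacher_posterior, label):
    """Pick the training target for one frame.

    Assumed rule: use the teacher's soft posteriors when the teacher's
    top class equals the ground truth label, else use a one-hot label.
    """
    predicted = max(range(len(teacher_posterior)),
                    key=teacher_posterior.__getitem__)
    if predicted == label:
        return list(teacher_posterior)
    one_hot = [0.0] * len(teacher_posterior)
    one_hot[label] = 1.0
    return one_hot


def cross_entropy(target, student_posterior):
    """Cross-entropy between the chosen target and the student output."""
    return -sum(t * math.log(s)
                for t, s in zip(target, student_posterior) if t > 0.0)


# Teacher is right about class 0, so the student follows the soft target.
soft = conditional_target([0.7, 0.2, 0.1], label=0)
# Teacher is wrong about class 1, so the student follows the hard label.
hard = conditional_target([0.7, 0.2, 0.1], label=1)
loss = cross_entropy(hard, [0.25, 0.5, 0.25])
```

The same cross-entropy is used in both branches; only the target distribution changes from frame to frame.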

This paper describes a system that generates speaker-annotated transcripts of meetings by using a microphone array and a 360-degree camera. The hallmark of the system is its ability to handle overlapped speech, which has been an unsolved problem in realistic settings for over a decade. We show that this problem can be addressed with a continuous speech separation approach. In addition, we describe an online audio-visual speaker diarization method that leverages face tracking and identification, sound source localization, and, if...

10.1109/asru46091.2019.9003827 article EN 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019-12-01

This paper describes the Microsoft speaker diarization system for monaural multi-talker recordings in the wild, evaluated at the diarization track of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020. We will first explain our system design to address issues in handling real recordings. We then present the details of the components, which include a Res2Net-based embedding extractor, conformer-based continuous speech separation with leakage filtering, and a modified DOVER (short for Diarization Output Voting Error Reduction) method...

10.1109/icassp39728.2021.9413832 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13

The first Provenance Challenge was set up in order to provide a forum for the community to understand the capabilities of different provenance systems and the expressiveness of their representations. To this end, a functional magnetic resonance imaging workflow was defined, which participants had to either simulate or run in order to produce some provenance representation, from which identified queries had to be implemented and executed. Sixteen teams responded to the challenge, and submitted their inputs. In this paper, we present the challenge and queries, and summarize...

10.1002/cpe.1233 article EN Concurrency and Computation Practice and Experience 2007-11-02

Text-independent speaker verification imposes no constraints on the spoken content and usually needs long observations to make a reliable prediction. In this paper, we propose two speaker embedding approaches by integrating phonetic information into the attention-based residual convolutional neural network (CNN). Phonetic features are extracted from the bottleneck layer of a pretrained acoustic model. In implicit phonetic attention (IPA), the phonetic features are projected by a transformation into multi-channel feature maps, and then combined with the raw features as...

10.1109/asru46091.2019.9003826 article EN 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019-12-01

In this paper, we propose a scalable adaptation technique that adapts the deep neural network (DNN) model through the low-rank plus diagonal (LRPD) decomposition. It is desired that an adaptation method can properly accommodate the available development data with a variable amount of adaptation parameters. Thus, the resulting adapted models neither over-fit nor under-fit as the development data vary in size for different speakers. The technique developed in this paper is inspired by observing that the adaptation matrices are very close to an identity matrix or diagonally dominant. The LRPD restructures...

10.1109/icassp.2016.7472630 article EN 2016-03-01
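
To make the low-rank plus diagonal idea from the abstract above concrete, here is a minimal Python sketch. The abstract is cut off before it gives the exact form, so this assumes a k x k adaptation matrix is stored as W ≈ D + P Q, with D diagonal, P of size k x c, and Q of size c x k; the function names and the sizes in the example are hypothetical.

```python
def lrpd_param_count(k, c):
    """Parameters needed for a diagonal (k) plus rank-c factors (2 * k * c)."""
    return k + 2 * k * c


def lrpd_apply(x, diag, P, Q):
    """Compute (D + P Q) x without ever forming the full k x k matrix.

    x    : input vector of length k
    diag : diagonal entries of D, length k
    P    : k rows of length c
    Q    : c rows of length k
    """
    # Project x down to c dimensions, then back up to k dimensions.
    z = [sum(q * xi for q, xi in zip(row, x)) for row in Q]
    low_rank = [sum(p * zi for p, zi in zip(row, z)) for row in P]
    return [d * xi + lr for d, xi, lr in zip(diag, x, low_rank)]


# With an identity diagonal and zero factors the layer is left unchanged,
# which matches the observation that adaptation matrices stay near identity.
unchanged = lrpd_apply([1.0, 2.0], [1.0, 1.0], [[0.0], [0.0]], [[0.0, 0.0]])

# Footprint comparison for an illustrative 2048-node layer and rank 4.
full = 2048 * 2048
compact = lrpd_param_count(2048, 4)
```

Under this assumption the rank c is the knob that scales the number of adaptation parameters to the amount of data available per speaker.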

The use of deep networks to extract embeddings for speaker recognition has proven successful. However, such embeddings are susceptible to performance degradation due to the mismatches among the training, enrollment, and test conditions. In this work, we propose an adversarial speaker verification (ASV) scheme to learn the condition-invariant embedding via multi-task training. In ASV, a speaker classification network and a condition identification network are jointly optimized to minimize the classification loss and simultaneously mini-maximize the condition loss. The target labels can be...

10.1109/icassp.2019.8682488 preprint EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17

In reinforced concrete structures, the interface between old and new concrete is commonplace in precast splices, repairs, and construction joints, and is vital for structural integrity. While extensive research has been conducted on the mechanical bonding properties of interfaces in concrete, there remains a notable gap in understanding the shear behavior under various interfacial treatments. To investigate the influence of preparation methods and surface roughness on the shear performance of the new-to-old interface, direct shear tests were conducted on Z-shaped specimens....

10.1016/j.cscm.2024.e03549 article EN cc-by-nc Case Studies in Construction Materials 2024-07-20

The description, composition, and execution of even logically simple scientific workflows are often complicated by the need to deal with "messy" issues like heterogeneous storage formats and ad-hoc file system structures. We show how these difficulties can be overcome via a typed, compositional workflow notation within which physical representation is cleanly separated from logical typing, and by the implementation of this notation within the context of a powerful runtime that supports distributed execution. The resulting system is capable of both...

10.1145/1084805.1084813 article EN ACM SIGMOD Record 2005-09-01

A response variation (RV) element is introduced to control the consistency of an adaptive wideband beamformer's response over the frequency range of interest. By incorporating it into the linearly constrained minimum variance (LCMV) beamformer, we develop a novel beamformer with improved output signal-to-interference-plus-noise ratio (SINR), compared to both the traditional formulation and the eigenvector based formulation, due...

10.1109/tap.2011.2110630 article EN IEEE Transactions on Antennas and Propagation 2011-02-03

Deep cement mixing piles are a key technology for treating settlement distress of soft soil subgrade. However, it is very challenging to accurately evaluate the quality of pile construction due to the limitations of the pile material, the large number of piles, and the small pile spacing. Here, we propose the idea of transforming defect detection into quality evaluation of ground improvement. Geological models of pile-group reinforced subgrade are constructed and their ground-penetrating radar response characteristics are revealed. We have also developed attribute analysis...

10.1038/s41467-023-39236-4 article EN cc-by Nature Communications 2023-06-10

This paper presents a novel unsupervised algorithm to detect salient regions and to segment out foreground objects from the background. In contrast to previous unidirectional saliency-based object segmentation methods, in which only the detected saliency map is used to guide the segmentation, our algorithm mutually exploits detection/segmentation cues from each other. To achieve this goal, an initial saliency map is generated by the proposed segmentation driven low-rank matrix recovery model. Such a saliency map is exploited to initialize the object segmentation model, which is formulated as an energy...

10.1109/tip.2015.2456497 article EN IEEE Transactions on Image Processing 2015-07-15

The outstanding Histogram-of-Oriented-Gradients (HOG) feature proposed by Dalal and Triggs is a state-of-the-art technique for pedestrian detection, and it is usually applied with a linear support vector machine (SVM) in a sliding-window framework. Most other algorithms for pedestrian detection use HOG as the basic feature, and combine other features with HOG to form the feature set. Hence, HOG is actually the most efficient and fundamental feature for pedestrian detection. However, HOG cannot adequately handle scale variation of pedestrians. In addition, simply downsampling an image into...

10.1109/mits.2015.2427366 article EN IEEE Intelligent Transportation Systems Magazine 2015-01-01

To develop speaker adaptation algorithms for deep neural network (DNN) that are suitable for large-scale online deployment, it is desirable that the adaptation model be represented in a compact form and learned in an unsupervised fashion. In this paper, we propose a novel low-footprint adaptation technique for DNN that adapts the DNN model through node activation functions. The approach introduces slope and bias parameters in the sigmoid activation functions for each speaker, allowing the adaptation model to be stored in a small-sized storage space. We show that this adaptation technique can be formulated in a linear regression fashion,...

10.1109/icassp.2015.7178784 article EN 2015-04-01
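
The per-speaker slope and bias parameters mentioned in the abstract above can be illustrated with a short Python sketch. The exact parameterization is not given in the truncated abstract, so this assumes the adapted activation is sigmoid(slope * z + bias), where slope = 1 and bias = 0 recover the unadapted node; `adapted_sigmoid`, `adapt_layer`, and `adaptation_footprint` are hypothetical names.

```python
import math


def adapted_sigmoid(z, slope=1.0, bias=0.0):
    """Sigmoid with a per-node, per-speaker slope and bias (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(slope * z + bias)))


def adapt_layer(pre_activations, slopes, biases):
    """Apply the adapted sigmoid to every node of one hidden layer."""
    return [adapted_sigmoid(z, a, b)
            for z, a, b in zip(pre_activations, slopes, biases)]


def adaptation_footprint(num_nodes):
    """Per-speaker parameters: one slope and one bias for each node."""
    return 2 * num_nodes


# Default parameters leave the node unchanged: sigmoid(0) = 0.5.
baseline = adapted_sigmoid(0.0)

# A speaker-specific slope and bias shift where each node saturates.
outputs = adapt_layer([0.0, 1.0], slopes=[1.0, 2.0], biases=[0.0, -2.0])
```

Because only two numbers per node are stored for each speaker, the footprint grows linearly with layer width rather than quadratically as it would for a full weight matrix.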

Grouted sleeve connection is the widely used method for the connection of stressed steel reinforcement bars in precast concrete structures. The reliability of the connection depends on the bond strength between the bars and the grouting materials. In this study, uniaxial tensile tests of 204 grouted sleeve specimens were carried out. The test parameters included bar diameter, anchorage length, grouting material strength, etc. Test results indicate that there are two main failure modes of the specimens, including fracture failure, which are mainly affected by the anchorage length of the bars. When greater than...

10.1016/j.cscm.2024.e02883 article EN cc-by-nc-nd Case Studies in Construction Materials 2024-01-21

The undercooled solidification of the Inconel 718 superalloy under a high magnetic field was performed for the first time at undercoolings of about 200 °C. The results show that the magnetic field can significantly refine the grains of the alloys, with the average grain size decreasing from 241 ± 92 μm at 0 T to about a third of that under fields of 3 to 9 T. Detailed EBSD analysis provides clear evidence that Icosahedral Short-Range Order (ISRO) enhanced nucleation occurs. The present work opens up a new way for grain refinement of alloys.

10.1080/21663831.2024.2360700 article EN cc-by-nc Materials Research Letters 2024-06-18

Deep CNN networks have shown great success in various tasks for text-independent speaker recognition. In this paper, we explore two approaches for modeling long temporal contexts to improve the performance of the ResNet networks. The first approach is simply integrating the utterance-level mean and variance normalization into the ResNet architecture. Secondly, we combine the BLSTM and ResNet into one unified architecture. The BLSTM layers model long range, supposedly phonetically aware, context information, which could facilitate the ResNet to learn the optimal attention weight...

10.1109/icassp40776.2020.9053767 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09