Pingchuan Ma

ORCID: 0000-0003-3752-0803
Research Areas
  • Speech and Audio Processing
  • Music and Audio Processing
  • Speech Recognition and Synthesis
  • Face recognition and analysis
  • Generative Adversarial Networks and Image Synthesis
  • Natural Language Processing Techniques
  • Hearing Loss and Rehabilitation
  • Domain Adaptation and Few-Shot Learning
  • Video Surveillance and Tracking Methods
  • Manufacturing Process and Optimization
  • Computer Graphics and Visualization Techniques
  • Video Analysis and Summarization
  • Human Pose and Action Recognition
  • Adversarial Robustness in Machine Learning
  • Bayesian Modeling and Causal Inference
  • Advanced Vision and Imaging
  • Multimodal Machine Learning Applications
  • BIM and Construction Integration
  • Hand Gesture Recognition Systems
  • Advanced Multi-Objective Optimization Algorithms
  • Multisensory perception and integration
  • Advanced Image Processing Techniques
  • Advanced Computational Techniques and Applications
  • Machine Learning in Healthcare
  • Animal Vocal Communication and Behavior

Shandong University of Science and Technology
2024-2025

Henan Agricultural University
2025

Imperial College London
2018-2024

Massachusetts Institute of Technology
2020-2024

Ludwig-Maximilians-Universität München
2023-2024

LMU Klinikum
2023-2024

Hong Kong University of Science and Technology
2022-2024

University of Hong Kong
2022-2024

Korea Institute of Machinery and Materials
2024

Heidelberg University
2021

Several end-to-end deep learning approaches have been recently presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly...

10.1109/icassp.2018.8461326 article EN 2018-04-01

Lip-reading has attracted a lot of research attention lately thanks to advances in deep learning. The current state-of-the-art model for recognition of isolated words in-the-wild consists of a residual network and Bidirectional Gated Recurrent Unit (BGRU) layers. In this work, we address the limitations of this model and propose changes which further improve its performance. Firstly, the BGRU layers are replaced with Temporal Convolutional Networks (TCN). Secondly, we greatly simplify the training procedure, which allows us to train one...

10.1109/icassp40776.2020.9053841 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09

In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented transformer (Conformer), that can be trained in an end-to-end manner. In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms, respectively, which are then fed to conformers, and then fusion takes place via a Multi-Layer Perceptron (MLP). The model learns to recognise characters using a combination of CTC and an attention mechanism. We show that end-to-end training, instead of using pre-computed features as is common...

10.1109/icassp39728.2021.9414567 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13

Audio-visual speech recognition has received a lot of attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets. However, accurate labelling of datasets is time-consuming and expensive. Hence, in this work, we investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size. For this purpose, we use publicly-available pre-trained ASR...

10.1109/icassp49357.2023.10096889 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05

Recent works in speech recognition rely either on connectionist temporal classification (CTC) or sequence-to-sequence models for character-level recognition. CTC assumes conditional independence of individual characters, whereas attention-based models can provide nonsequential alignments. Therefore, we could use a CTC loss in combination with an attention-based model in order to force monotonic alignments and at the same time get rid of the conditional independence assumption. In this paper, we use the recently proposed hybrid CTC/attention architecture for audio-visual...

10.1109/slt.2018.8639643 article EN 2018 IEEE Spoken Language Technology Workshop (SLT) 2018-12-01
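The loss combination described in the abstract above is conventionally implemented as a convex combination of the CTC and attention negative log-likelihoods. A minimal sketch in plain Python, assuming a weighting factor `lam` and toy loss values (both illustrative, not taken from the paper):

```python
# Hybrid CTC/attention objective: L = lam * L_ctc + (1 - lam) * L_att.
# lam = 1.0 recovers pure CTC (monotonic alignment), lam = 0.0 pure
# attention (no conditional-independence assumption).
def hybrid_loss(ctc_nll: float, att_nll: float, lam: float = 0.2) -> float:
    """Weighted sum of the two branch losses."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_nll + (1.0 - lam) * att_nll

# Toy values: the CTC branch regularises alignments while the attention
# branch models inter-character dependencies.
loss = hybrid_loss(ctc_nll=4.0, att_nll=2.0, lam=0.25)
```

In practice both terms come from the same shared encoder, so the weight trades off alignment regularisation against modelling power.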

Lipreading has witnessed a lot of progress due to the resurgence of neural networks. Recent works have placed emphasis on aspects such as improving performance by finding the optimal architecture or improving generalization. However, there is still a significant gap between the current methodologies and the requirements for an effective deployment of lipreading in practical scenarios. In this work, we propose a series of innovations that significantly bridge that gap: first, we raise the state-of-the-art performance by a wide margin on LRW and LRW-1000 to 88.5%...

10.1109/icassp39728.2021.9415063 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13

Video-to-speech is the process of reconstructing the audio speech from a video of a spoken utterance. Previous approaches to this task have relied on a two-step process where an intermediate representation is inferred from the video and is then decoded into waveform audio using a vocoder or a waveform reconstruction algorithm. In this work, we propose a new end-to-end video-to-speech model based on generative adversarial networks (GANs) which translates spoken video to waveform without using any intermediate representation or separate waveform synthesis algorithm. Our model consists of an encoder-decoder architecture that receives raw video as input...

10.1109/tcyb.2022.3162495 article EN IEEE Transactions on Cybernetics 2022-04-19

Generative AI models provide foundational knowledge and sufficient reasoning to aid in individual aspects of a computational design and modeling workflow (see Part 1 <https://doi.org/10.1162/99608f92.cc80fe30>). In this work, we analyze the ability of state-of-the-art LLMs (in particular, GPT-4) to reason about the entire end-to-end workflow, from conceptualization to realization, for two target domains: static physical objects (here, furniture) and dynamical cyberphysical systems (here, quadcopters). Our investigation...

10.1162/99608f92.0705d8bd article EN cc-by Harvard data science review 2024-05-28

The progress in generative AI, particularly large language models (LLMs), opens new prospects in design and manufacturing. Our research explores the use of these tools throughout the entire design and manufacturing workflow. We assess the capabilities of LLMs in various tasks: converting text prompts into designs, generating design spaces and variations, transforming designs into manufacturing instructions, evaluating design performance, and searching for designs based on performance metrics. We identify and discuss the current strengths and limitations of LLMs, suggesting areas...

10.21428/e4baedd9.745b62fa article EN cc-by-nc 2024-03-27

A major challenge of AI + Science lies in their inherent incompatibility: today's AI is primarily based on connectionism, while science depends on symbolism. To bridge the two worlds, we propose a framework to seamlessly synergize Kolmogorov-Arnold Networks (KANs) and science. The framework highlights KANs' usage for three aspects of scientific discovery: identifying relevant features, revealing modular structures, and discovering symbolic formulas. The synergy is bidirectional: science to KAN (incorporating scientific knowledge into KANs), and KAN...

10.48550/arxiv.2408.10205 preprint EN arXiv (Cornell University) 2024-08-19

We present a learning-based method to control a coupled 2D system involving both fluid and rigid bodies. Our approach is used to modify the fluid/rigid simulator's behavior by applying control forces only at the simulation domain boundaries. The rest of the domain, corresponding to the interior, is governed by the Navier-Stokes equation for fluids and the Newton-Euler equation for rigid bodies. We represent our controller using a general neural net, which is trained using deep reinforcement learning. Our formulation decomposes the control task into two stages: a precomputation training stage...

10.1145/3197517.3201334 article EN ACM Transactions on Graphics 2018-07-30

In this work, we present the Densely Connected Temporal Convolutional Network (DC-TCN) for lip-reading of isolated words. Although Temporal Convolutional Networks (TCN) have recently demonstrated great potential in many vision tasks, their receptive fields are not dense enough to model the complex temporal dynamics in lip-reading scenarios. To address this problem, we introduce dense connections into the network to capture more robust temporal features. Moreover, our approach utilises the Squeeze-and-Excitation block, a light-weight attention mechanism, to further...

10.1109/wacv48630.2021.00290 preprint EN 2021-01-01
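The receptive-field argument in the DC-TCN abstract above can be made concrete: the temporal receptive field of a stack of dilated 1-D convolutions is determined by its kernel sizes and dilation schedule. A small illustrative calculation (the kernel sizes and dilations below are hypothetical, not the paper's configuration):

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in frames) of stacked dilated 1-D convolutions:
    rf = 1 + sum over layers of (kernel_size - 1) * dilation."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))

# A plain TCN with exponentially growing dilations covers a long but
# temporally sparse context ...
plain = receptive_field([3, 3, 3, 3], [1, 2, 4, 8])   # 31 frames
# ... whereas mixing in small dilations (which dense connections
# effectively provide) fills short-range gaps at the cost of range.
dense = receptive_field([3, 3, 3, 3], [1, 1, 2, 4])   # 17 frames
```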

Several training strategies and temporal models have been recently proposed for isolated word lip-reading in a series of independent works. However, the potential of combining the best strategies and investigating the impact of each of them has not been explored. In this paper, we systematically investigate the performance of state-of-the-art data augmentation approaches, temporal models and other training strategies, like self-distillation and using word boundary indicators. Our results show that Time Masking (TM) is the most important augmentation, followed by mixup, while Densely-Connected...

10.1109/icassp43922.2022.9746706 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27

Speech is a means of communication which relies on both audio and visual information. The absence of one modality can often lead to confusion or misinterpretation of information. In this paper we present an end-to-end temporal model capable of directly synthesising audio from silent video, without needing to transform to-and-from intermediate features. Our proposed approach, based on GANs, is capable of producing natural sounding, intelligible speech which is synchronised with the video. The performance of our model is evaluated on the GRID dataset for...

10.21437/interspeech.2019-1445 article EN Interspeech 2019 2019-09-13

The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training objective to learn from the other. In this work, we propose Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically,...

10.21437/interspeech.2021-1360 article EN Interspeech 2021 2021-08-27

TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's development principles and contents and highlight key features we include in its latest version (2.1): self-supervised learning pre-trained pipelines and training...

10.1109/asru57964.2023.10389648 article EN 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2023-12-16

Self-supervised representation learning has recently attracted a lot of research interest for both the audio and visual modalities. However, most works typically focus on a particular modality or feature alone, and there has been very limited work that studies the interaction between the two modalities for self-supervised representations. We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech. We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip...

10.1109/icassp40776.2020.9053415 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09

Automatic cleaning of carbon blocks based on machine vision is currently an important aspect of industrial intelligent applications. The recognition of carbon block types and center point localization are the core contents of this task, but existing instance segmentation algorithms perform poorly in this task. This paper proposes an algorithm based on improved YOLOv8 (YOLOv8-HDSA), which achieves highly accurate edge segmentation. YOLOv8-HDSA designs a Selective Reinforcement Feature Fusion Module (SRFF) that utilizes...

10.1038/s41598-025-91495-x article EN cc-by-nc-nd Scientific Reports 2025-03-09

Colletotrichum graminicola can cause leaf spots and stalk rot in maize. The primary function of carbohydrate esterases (CEs) is to eliminate ester modifications from monosaccharides, oligosaccharides, and polysaccharides, thereby facilitating the hydrolysis of sugars. We identified 128 CE genes through whole-genome analysis and functional annotation of C. graminicola TZ-3 here. We further analyzed the physicochemical properties, subcellular localization, conserved motifs, gene structures, and promoter regulatory elements of these...

10.3390/agriculture15070781 article EN cc-by Agriculture 2025-04-03

Accurately describing images with text is a foundation of explainable AI. Vision-Language Models (VLMs) like CLIP have recently addressed this by aligning images and texts in a shared embedding space, expressing semantic similarities between vision and language embeddings. VLM classification can be improved with descriptions generated by Large Language Models (LLMs). However, it is difficult to determine the contribution of the actual description semantics, as the performance gain may also stem from semantic-agnostic ensembling...

10.1609/aaai.v39i6.32638 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2025-04-11

Current discriminative depth estimation methods often produce blurry artifacts, while generative approaches suffer from slow sampling due to curvatures in the noise-to-depth transport. Our method addresses these challenges by framing depth estimation as a direct transport between image and depth distributions. We are the first to explore flow matching in this field, and we demonstrate that its interpolation trajectories enhance both training and sampling efficiency while preserving high performance. While generative models typically require extensive training data,...

10.1609/aaai.v39i3.32330 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2025-04-11
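The straight interpolation trajectories that flow matching relies on, as mentioned in the abstract above, can be written down directly: a sample is moved along x_t = (1 - t)·x0 + t·x1, and the network regresses the constant velocity x1 - x0. A minimal NumPy sketch with toy vectors standing in for source and depth samples (not the paper's actual setup):

```python
import numpy as np

def interpolate(x0, x1, t):
    """Linear (rectified-flow style) path x_t = (1 - t) * x0 + t * x1."""
    return (1.0 - t) * x0 + t * x1

def velocity_target(x0, x1):
    """Regression target for the velocity field: constant along the path."""
    return x1 - x0

x0 = np.zeros(3)                 # source sample (e.g. noise or image-conditioned start)
x1 = np.array([1.0, 2.0, 3.0])   # target sample (e.g. depth map, flattened)
t = 0.5
xt = interpolate(x0, x1, t)
v = velocity_target(x0, x1)
# With a perfectly learned field, one Euler step of size (1 - t)
# carries x_t exactly to the endpoint: x1 = x_t + (1 - t) * v.
```

The straightness of these paths is what permits few-step sampling compared with curved diffusion trajectories.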

Tasks in multi-task learning often correlate, conflict, or even compete with each other. As a result, a single solution that is optimal for all tasks rarely exists. Recent papers introduced the concept of Pareto optimality to this field and directly cast multi-task learning as multi-objective optimization problems, but the solutions returned by existing methods are typically finite, sparse, and discrete. We present a novel, efficient method that generates locally continuous Pareto sets and Pareto fronts, which opens up the possibility of continuous analysis...

10.48550/arxiv.2006.16434 preprint EN other-oa arXiv (Cornell University) 2020-01-01
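The contrast the abstract above draws with finite, discrete solution sets can be illustrated by the standard weighted-sum scalarization baseline: each trade-off weight yields a single Pareto-optimal point, so sweeping the weight only ever samples the front discretely. A toy two-objective example (objectives chosen purely for illustration, not from the paper):

```python
import numpy as np

# Two conflicting quadratic objectives over a scalar decision variable x;
# their Pareto set is the interval [-1, 1].
f1 = lambda x: (x - 1.0) ** 2
f2 = lambda x: (x + 1.0) ** 2

def scalarized_minimum(w: float) -> float:
    """Minimize w * f1 + (1 - w) * f2 on a grid; each w gives ONE point."""
    xs = np.linspace(-2.0, 2.0, 4001)
    vals = w * f1(xs) + (1.0 - w) * f2(xs)
    return float(xs[np.argmin(vals)])

# Sweeping w produces only a discrete sampling of the Pareto set
# (analytically, the minimizer is x = 2w - 1).
front = [scalarized_minimum(w) for w in (0.0, 0.5, 1.0)]
```

Methods that generate locally continuous Pareto sets, as in the paper above, avoid this per-weight discretization.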