- 3D Shape Modeling and Analysis
- Speech and Audio Processing
- Computer Graphics and Visualization Techniques
- Human Motion and Animation
- Human Pose and Action Recognition
- Advanced Vision and Imaging
- Face recognition and analysis
- Image Retrieval and Classification Techniques
- Image Processing and 3D Reconstruction
- Hand Gesture Recognition Systems
- Music and Audio Processing
- Robotics and Sensor-Based Localization
- Social Robot Interaction and HRI
- Speech Recognition and Synthesis
- Emotion and Mood Recognition
- Speech and dialogue systems
- Advanced Image and Video Retrieval Techniques
- Blind Source Separation Techniques
- Multimodal Machine Learning Applications
- Video Analysis and Summarization
- AI in Service Interactions
- Computational Geometry and Mesh Generation
- Humor Studies and Applications
- Face and Expression Recognition
- Hearing Impairment and Communication
Koç University
2014-2023
Centre National de la Recherche Scientifique
2003
Boğaziçi University
1993-2002
It is well known that early integration (also called data fusion) is effective when the modalities are correlated, and that late integration (also called decision or opinion fusion) is optimal when the modalities are uncorrelated. In this paper, we propose a new multimodal fusion strategy for open-set speaker identification using a combination of early and late integration following canonical correlation analysis (CCA) of speech and lip texture features. We also propose a method for high precision...
We address content-based retrieval of complete 3D object models by a probabilistic generative description of local shape properties. The proposed framework characterizes a 3D object with sampled multivariate probability density functions of its local surface features. This density-based descriptor can be efficiently computed via kernel density estimation (KDE) coupled with the fast Gauss transform. The non-parametric KDE technique allows reliable characterization of a diverse set of shapes and yields descriptors which remain relatively...
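The density-based descriptor idea can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not in the abstract: the local surface feature is a single scalar per surface sample (the abstract describes multivariate features), the kernel is Gaussian with a hand-picked bandwidth, the density is evaluated by direct summation rather than the fast Gauss transform, and two descriptors are compared with an L1 distance. The names `kde_descriptor` and `l1_distance` are hypothetical.

```python
import math


def kde_descriptor(features, targets, bandwidth):
    """Evaluate a Gaussian kernel density estimate of scalar features.

    features:  scalar local feature values sampled on one object (assumed input)
    targets:   fixed points at which the density is sampled; using the same
               targets for every object makes the descriptors comparable
    bandwidth: kernel width (chosen by hand in this sketch)
    """
    norm = 1.0 / (len(features) * bandwidth * math.sqrt(2.0 * math.pi))
    descriptor = []
    for t in targets:
        # Direct O(N) sum per target; the fast Gauss transform would speed this up.
        total = sum(math.exp(-0.5 * ((t - f) / bandwidth) ** 2) for f in features)
        descriptor.append(norm * total)
    return descriptor


def l1_distance(a, b):
    """Sum of absolute differences between two descriptors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    targets = [i / 10.0 for i in range(11)]          # 0.0, 0.1, ..., 1.0
    object_a = kde_descriptor([0.20, 0.25, 0.30, 0.80], targets, 0.1)
    object_b = kde_descriptor([0.22, 0.27, 0.31, 0.78], targets, 0.1)
    object_c = kde_descriptor([0.60, 0.65, 0.90, 0.95], targets, 0.1)
    print("a vs b:", l1_distance(object_a, object_b))
    print("a vs c:", l1_distance(object_a, object_c))
```

Because the density is smooth in the feature values, small perturbations of the samples move the descriptor only slightly, which is the property a retrieval descriptor needs.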
There have been several studies that jointly use audio, lip intensity, and lip geometry information for speaker identification and speech-reading applications. This paper proposes using explicit lip motion information, instead of or in addition to lip intensity and/or geometry information, within a unified feature selection and discrimination analysis framework, and addresses two important issues: 1) Is using explicit lip motion information useful, and, 2) if so, what are the best lip motion features for these two applications? The best lip motion features for speaker identification are considered to be those that result in the highest discrimination of individual speakers...
3-D scene representation is utilized during the scene extraction, modeling, transmission, and display stages of a 3DTV framework. To this end, different representation technologies are proposed to fulfill the requirements of the 3DTV paradigm. Dense point-based methods are appropriate for free-view 3DTV applications, since they can generate novel views easily. As surface representations, polygonal meshes are quite popular due to their generality and current hardware support. Unfortunately, there is no inherent smoothness in their description, and the resulting...
We present a dense correspondence method for isometric shapes, which is accurate yet computationally efficient. We minimize the isometric distortion directly in the 3D Euclidean space, i.e., in the domain where isometry is originally defined, by using a coarse-to-fine sampling and combinatorial matching algorithm. Our method does not require any initialization and aims to find an accurate solution in the minimum-distortion sense for perfectly isometric shapes. We demonstrate the performance of our method on various isometric (or nearly isometric) pairs of shapes.
We propose a novel framework for learning many-to-many statistical mappings from musical measures to dance figures towards generating plausible music-driven dance choreographies. We obtain music-to-dance mappings through the use of four statistical models: 1) musical measure models, representing a many-to-one relation, each of which associates different melody patterns to a given dance figure via a hidden Markov model (HMM); 2) an exchangeable figures model, which captures the diversity in a dance performance through a one-to-many relation, extracted by unsupervised clustering of musical measure segments based on...
We present a multimodal open-set speaker identification system that integrates information coming from audio, face, and lip motion modalities. For fusion of multiple modalities, we propose a new adaptive cascade rule that favors reliable modality combinations through a cascade of classifiers. The order of the classifiers in the cascade is adaptively determined based on the reliability of each modality combination. A novel reliability measure, which genuinely fits the open-set speaker identification problem, is also proposed to assess the accept or reject decisions of a classifier. A formal framework is developed...
In this correspondence, the problem of directional and multiscale edge detection is considered. An orthogonal and linear-phase M-band wavelet transform is used to decompose the image into M×M channels. These channels are then combined such that each combination, which we refer to as a decomposition filter, results in zero-crossings at the locations of edges corresponding to different directions and resolutions, and inherently performs regularization against noise. By applying a zero-crossing detector on the outputs of the decomposition filters, edge maps...
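The final step described above, applying a zero-crossing detector to a filter output, can be sketched as follows. This is only an illustration under stated assumptions: the filter response is taken as given (the M-band wavelet decomposition itself is not implemented), it is a non-empty 2D list of floats, and a pixel is marked when its sign differs from its right or lower neighbour. The function name `zero_crossings` is hypothetical.

```python
def zero_crossings(response):
    """Return a binary edge map marking sign changes between neighbours.

    response: non-empty 2D list of filter outputs (assumed to be precomputed
              by some decomposition filter; not computed here).
    A pixel is set to 1 when the product with its right or lower neighbour
    is negative, i.e. the response changes sign between the two samples.
    """
    rows, cols = len(response), len(response[0])
    edges = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            v = response[r][c]
            # At the image border, compare the pixel with itself (no crossing).
            right = response[r][c + 1] if c + 1 < cols else v
            down = response[r + 1][c] if r + 1 < rows else v
            if v * right < 0 or v * down < 0:
                edges[r][c] = 1
    return edges


if __name__ == "__main__":
    # A response that flips sign between the first and second columns,
    # as a vertical edge would produce.
    response = [
        [0.8, -0.5, -0.6],
        [0.9, -0.4, -0.7],
    ]
    for row in zero_crossings(response):
        print(row)
```

Running one such detector per decomposition filter would give one edge map per direction and resolution; a practical detector would typically also threshold the local slope to suppress crossings caused by noise.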
We propose a new two-stage framework for joint analysis of head gesture and speech prosody patterns of a speaker towards automatic realistic synthesis of head gestures from speech prosody. In the first stage analysis, we perform Hidden Markov Model (HMM) based unsupervised temporal segmentation of head gesture and speech prosody features separately to determine elementary head gesture and speech prosody patterns, respectively, for a particular speaker. In the second stage, joint analysis of correlations between these elementary patterns is performed using Multi-Stream HMMs to determine an audio-visual mapping model. The resulting model...
We present a purely isometric method that establishes 3D correspondence between two (nearly) isometric shapes. Our method evenly samples high-curvature vertices from the given mesh representations, and then seeks an injective mapping from one vertex set to the other that minimizes the isometric distortion. We formulate the problem of shape correspondence as combinatorial optimization over the domain of all possible mappings, which then reduces in a probabilistic setting to a log-likelihood maximization problem that we solve via the Expectation-Maximization (EM) algorithm. The EM algorithm is...
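The objective named in this abstract, an injective mapping that minimizes isometric distortion, can be made concrete with a small sketch. It is not the EM-based method of the paper: an exhaustive search over all injective mappings is swapped in, which is only feasible for a handful of samples. It also assumes that pairwise distances between the sampled vertices (for instance geodesic distances, normalized to a common scale) are already available as matrices, and it measures distortion as the mean absolute difference of corresponding distances. All function names are hypothetical.

```python
from itertools import permutations


def isometric_distortion(dist_a, dist_b, mapping):
    """Mean absolute difference of pairwise distances under a mapping.

    dist_a, dist_b: square distance matrices of the two sample sets (assumed
                    precomputed and expressed in the same scale)
    mapping:        mapping[i] is the index in B assigned to sample i of A
    """
    n = len(mapping)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(dist_a[i][j] - dist_b[mapping[i]][mapping[j]])
            pairs += 1
    return total / pairs if pairs else 0.0


def best_injective_mapping(dist_a, dist_b):
    """Brute-force search over all injective mappings from A into B.

    Requires len(dist_a) <= len(dist_b). The cost grows factorially, so this
    only illustrates the objective; it does not scale to real meshes.
    """
    n, m = len(dist_a), len(dist_b)
    return min(
        permutations(range(m), n),
        key=lambda p: isometric_distortion(dist_a, dist_b, p),
    )


if __name__ == "__main__":
    dist_a = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
    # The same three samples listed in a different order.
    dist_b = [[0, 2, 1], [2, 0, 3], [1, 3, 0]]
    mapping = best_injective_mapping(dist_a, dist_b)
    print(mapping, isometric_distortion(dist_a, dist_b, mapping))
```

For two perfectly isometric sample sets the optimal mapping has zero distortion; for nearly isometric shapes the minimum is small but nonzero, and symmetric shapes can admit more than one minimizer.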
We address the problem of object recognition from RGB-D images using deep convolutional neural networks (CNNs). We advocate the use of 3D CNNs to fully exploit the 3D spatial information in depth images, as well as the use of pretrained 2D CNNs to learn features from RGB-D images. There exists currently no large-scale dataset available comprising depth information as compared to those for RGB data. Hence, transfer learning from 2D source data is key to be able to train deep 3D CNNs. To this end, we propose a hybrid 2D/3D convolutional neural network that can be initialized with pretrained 2D CNNs and can then be trained over a relatively small...
We present SlotAdapt, an object-centric learning method that combines slot attention with pretrained diffusion models by introducing adapters for slot-based conditioning. Our method preserves the generative power of pretrained diffusion models, while avoiding their text-centric conditioning bias. We also incorporate an additional guidance loss into our architecture to align cross-attention from adapter layers with slot attention. This enhances the alignment of our model with the objects in the input image without using external supervision. Experimental...
Thanks to a remarkably great ability to show amusement and engagement, laughter is one of the most important social markers in human interactions. Laughing together can actually help to set up a positive atmosphere and favors the creation of new relationships. This paper presents a data collection of social interaction dialogs involving humor between a human participant and a robot. In this work, interaction scenarios have been designed in order to study social markers such as laughter. They have been implemented within two automatic systems developed in the Joker project: a social dialog...
The authors combine two different biometric modalities for next-generation vehicles that use person recognition. Next-generation vehicles will undoubtedly feature biometric person recognition as part of an effort to improve the driving experience. Today's technology prevents such systems from operating satisfactorily under adverse conditions. A proposed framework for achieving person recognition successfully combines different biometric modalities, as borne out in two case studies.
We address the symmetric flip problem that is inherent to multi-resolution isometric shape matching algorithms. To this effect, we extend our previous work, which handles the dense isometric correspondence problem in the original 3D Euclidean space via coarse-to-fine combinatorial matching. The key idea is based on keeping track of all optimal solutions, which may be more than one due to symmetry, especially at coarse levels, throughout the denser levels of the shape matching process. We compare the resulting dense correspondence algorithm with state-of-the-art techniques...
Multimodal speech and speaker modeling and recognition are widely accepted as vital aspects of state-of-the-art human-machine interaction systems. While correlations between speech and lip motion, as well as between speech and facial expressions, are widely studied, relatively little work has been done to investigate the correlations between speech and gesture. Detection and modeling of head, hand, and arm gestures of a speaker have been studied extensively, and these gestures were shown to carry linguistic information. A typical example is the head gesture while saying "yes/no". In this study, the correlation between gestures and speech is investigated. In speech signal analysis,...
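We present a new framework for joint analysis of head gesture and speech prosody patterns of a speaker towards automatic realistic synthesis of head gestures from speech prosody. The proposed two-stage analysis aims to "learn" both elementary prosody and head gesture patterns for a particular speaker, as well as the correlations between these head gesture and prosody patterns from a training video sequence. The resulting audio-visual mapping model is then employed to synthesize natural head gestures from arbitrary input test speech given a head model for the speaker. Objective and subjective evaluations indicate that the proposed synthesis-by-analysis scheme provides natural-looking head gestures for the speaker with any input test speech.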
We present a 3-D correspondence method to match the geometric extremities of two shapes which are partially isometric. We consider the most general setting of the isometric partial shape correspondence problem, in which shapes to be matched may have multiple common parts at arbitrary scales as well as parts that are not similar. Our rank-and-vote-and-combine algorithm identifies and ranks potentially correct matches by exploring the space of all possible partial maps between coarsely sampled extremities. The qualified top-ranked matchings are then subjected...
We present a bimodal audio-visual speaker identification system. The objective is to improve the recognition performance over conventional unimodal schemes. The proposed system exploits not only the temporal and spatial correlations existing in the speech and video signals of a speaker, but also the cross-correlation between these two modalities. Lip images extracted from each video frame are transformed onto an eigenspace. The obtained eigenlip coefficients are interpolated to match the rate of the speech signal and fused with Mel frequency cepstral...
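The rate-matching step in this abstract can be sketched as below. Everything beyond "interpolate the eigenlip coefficients to the audio rate and fuse them" is an assumption made for illustration: linear interpolation, the example rates (25 video frames per second, 100 audio feature vectors per second), and fusion by simple per-frame concatenation. The eigenspace projection and the audio feature extraction are not implemented; both inputs are treated as precomputed lists of vectors. The function names are hypothetical.

```python
def interpolate_to_rate(frames, src_rate, dst_rate):
    """Linearly interpolate a sequence of feature vectors to a new frame rate.

    frames:   list of equal-length feature vectors sampled at src_rate (Hz)
    returns:  vectors sampled at dst_rate (Hz) over the same time span
    """
    if len(frames) < 2:
        return [list(f) for f in frames]
    # Number of output samples covering the span of the input sequence.
    n_out = int((len(frames) - 1) * dst_rate / src_rate + 1e-9) + 1
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate            # position in source-frame units
        i = min(int(pos), len(frames) - 2)       # left neighbour index
        w = pos - i                              # weight of the right neighbour
        out.append([(1 - w) * a + w * b for a, b in zip(frames[i], frames[i + 1])])
    return out


def fuse(audio_feats, lip_feats):
    """Concatenate audio and lip feature vectors frame by frame (assumed fusion)."""
    n = min(len(audio_feats), len(lip_feats))
    return [list(audio_feats[t]) + list(lip_feats[t]) for t in range(n)]


if __name__ == "__main__":
    lip = [[0.0, 10.0], [1.0, 8.0], [2.0, 6.0]]   # 3 video frames at 25 Hz
    lip_100 = interpolate_to_rate(lip, 25, 100)    # resampled to 100 Hz
    audio = [[0.1 * t] for t in range(len(lip_100))]
    for vec in fuse(audio, lip_100):
        print(vec)
```

Upsampling the slower visual stream, rather than downsampling the audio features, keeps the temporal resolution of the speech signal in the fused vectors.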