Bugra Tekin

ORCID: 0000-0001-8811-9919
Research Areas
  • Human Pose and Action Recognition
  • Hand Gesture Recognition Systems
  • Advanced Vision and Imaging
  • Anomaly Detection Techniques and Applications
  • Video Surveillance and Tracking Methods
  • Robot Manipulation and Learning
  • Human Motion and Animation
  • 3D Shape Modeling and Analysis
  • Advanced Neural Network Applications
  • Multimodal Machine Learning Applications
  • Gait Recognition and Analysis
  • Image and Object Detection Techniques
  • Robotics and Sensor-Based Localization
  • Sparse and Compressive Sensing Techniques
  • Video Analysis and Summarization
  • Tensor Decomposition and Applications
  • Diabetic Foot Ulcer Assessment and Management
  • Advanced Image and Video Retrieval Techniques
  • Image and Signal Denoising Methods
  • Natural Language Processing Techniques
  • Augmented Reality Applications
  • AI in Service Interactions
  • Social Robot Interaction and HRI
  • Gaze Tracking and Assistive Technology
  • Speech and Dialogue Systems

Swiss Federal Institute of Metrology
2024

META Health
2024

Microsoft Research (United Kingdom)
2019-2023

Microsoft (United States)
2021-2022

Microsoft (Switzerland)
2019

École Polytechnique Fédérale de Lausanne
2013-2018

Centre d'Imagerie BioMedicale
2013

We propose a single-shot approach for simultaneously detecting an object in an RGB image and predicting its 6D pose without requiring multiple stages or having to examine multiple hypotheses. Unlike a recently proposed single-shot technique for this task [10] that only predicts an approximate pose that must then be refined, ours is accurate enough not to require additional post-processing. As a result, it is much faster - 50 fps on a Titan X (Pascal) GPU - and more suitable for real-time processing. The key component of our method is a new CNN architecture...

10.1109/cvpr.2018.00038 article EN 2018-06-01
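The abstract above hinges on recovering a 6D pose from the 2D image projections of an object's 3D bounding-box corners, a PnP-style geometric step. As an illustrative sketch only - the intrinsics `K` and pose `R`, `t` below are made up, not values from the paper - the following NumPy snippet projects the corners of a unit cube and recovers the camera matrix from those projections with a direct linear transform:

```python
import numpy as np

def project(P, X):
    """Project Nx3 world points with a 3x4 camera matrix P."""
    Xh = np.hstack([X, np.ones((len(X), 1))])          # homogeneous coordinates
    x = (P @ Xh.T).T
    return x[:, :2] / x[:, 2:3]                         # perspective division

def dlt_pose(X, x):
    """Recover a 3x4 projection matrix from 2D-3D correspondences (DLT)."""
    A = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        Xh = [Xw, Yw, Zw, 1.0]
        A.append([*Xh, 0, 0, 0, 0, *(-u * np.array(Xh))])
        A.append([0, 0, 0, 0, *Xh, *(-v * np.array(Xh))])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)                         # null-space solution

# 3D bounding-box corners of a unit cube (the keypoints a network would target)
corners = np.array([[cx, cy, cz] for cx in (0, 1) for cy in (0, 1)
                    for cz in (0, 1)], dtype=float)

# Hypothetical ground-truth camera: intrinsics K, rotation R, translation t
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([[0.2], [0.1], [4.0]])
P_true = K @ np.hstack([R, t])

uv = project(P_true, corners)      # what the CNN is trained to predict
P_est = dlt_pose(corners, uv)      # PnP-style recovery from those 2D points
err = np.abs(project(P_est, corners) - uv).max()
```

In the paper itself the 2D corner locations come from the network rather than from a known ground-truth projection; the DLT here merely stands in for the PnP solver.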

We present a unified framework for understanding 3D hand and object interactions in raw image sequences from egocentric RGB cameras. Given a single image, our model jointly estimates the 3D hand and object poses, models their interactions, and recognizes the object and action classes with a single feed-forward pass through a neural network. We propose a single architecture that does not rely on external detection algorithms but rather is trained end-to-end on single images. We further merge and propagate information in the temporal domain to infer interactions between hand and object trajectories...

10.1109/cvpr.2019.00464 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

Most recent approaches to monocular 3D human pose estimation rely on Deep Learning. They typically involve regressing from an image either 3D joint coordinates directly or 2D joint locations from which the 3D coordinates are inferred. Both approaches have their strengths and weaknesses, and we therefore propose a novel architecture designed to deliver the best of both worlds by performing both simultaneously and fusing the information along the way. At the heart of our framework is a trainable fusion scheme that learns how to fuse the information optimally instead of being hand-designed....

10.1109/iccv.2017.425 article EN 2017-10-01

Most recent approaches to monocular 3D pose estimation rely on Deep Learning. They either train a Convolutional Neural Network to directly regress from an image to a 3D pose, which ignores the dependencies between human joints, or model these dependencies via a max-margin structured learning framework, which involves a high computational cost at inference time. In this paper, we introduce a Deep Learning regression architecture for structured prediction of 3D human pose from monocular images that relies on an overcomplete auto-encoder to learn a high-dimensional latent pose representation...

10.5244/c.30.130 article EN 2016-01-01

We propose an efficient approach to exploiting motion information from consecutive frames of a video sequence to recover the 3D pose of people. Previous approaches typically compute candidate poses in individual frames and then link them in a post-processing step to resolve ambiguities. By contrast, we directly regress from a spatio-temporal volume of bounding boxes to a 3D pose in the central frame. We further show that, for this approach to achieve its full potential, it is essential to compensate for the motion in consecutive frames so that the subject remains centered. This allows us to effectively...

10.1109/cvpr.2016.113 article EN 2016-06-01

Modeling hand-object manipulations is essential for understanding how humans interact with their environment. While of practical importance, estimating the pose of hands and objects during interactions is challenging due to the large mutual occlusions that occur during manipulation. Recent efforts have been directed towards fully-supervised methods that require large amounts of labeled training samples. Collecting 3D ground-truth data for hand-object interactions, however, is costly, tedious, and error-prone. To overcome this challenge, we...

10.1109/cvpr42600.2020.00065 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01

We present a comprehensive framework for egocentric interaction recognition using markerless 3D annotations of two hands manipulating objects. To this end, we propose a method to create a unified dataset for egocentric 3D interaction recognition. Our method produces annotations of the 3D pose of two hands and the 6D pose of the manipulated objects, along with their interaction labels for each frame. Our dataset, called H2O (2 Hands and Objects), provides synchronized multi-view RGB-D images, interaction labels, object classes, ground-truth 3D poses for the left & right hands, 6D object poses, ground-truth camera poses, object meshes and scene point clouds. To the best...

10.1109/iccv48922.2021.00998 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01

Learning filters to produce sparse image representations in terms of over-complete dictionaries has emerged as a powerful way to create image features for many different purposes. Unfortunately, these filters are usually both numerous and non-separable, making their use computationally expensive. In this paper, we show that such filters can be computed as linear combinations of a smaller number of separable ones, thus greatly reducing the computational complexity at no cost in terms of performance. This makes filter learning approaches...

10.1109/tpami.2014.2343229 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2014-07-25
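The core claim of the separable-filter work above - that a non-separable 2D filter can be replaced by a linear combination of a few separable ones at no loss - can be sketched with an SVD-based decomposition. This is an illustrative stand-in for the learned decomposition in the paper, using a toy rank-2 filter:

```python
import numpy as np

def conv2d_valid(img, k):
    """Naive 'valid'-mode 2D cross-correlation, for illustration only."""
    kh, kw = k.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

def separable_approx(k, rank):
    """Express filter k as a sum of `rank` outer products of 1D filters (SVD)."""
    U, s, Vt = np.linalg.svd(k)
    return [(np.sqrt(s[r]) * U[:, r], np.sqrt(s[r]) * Vt[r]) for r in range(rank)]

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
# A rank-2 (hence exactly 2-separable) 7x7 filter built for the demo
k = (rng.standard_normal((7, 1)) @ rng.standard_normal((1, 7))
     + rng.standard_normal((7, 1)) @ rng.standard_normal((1, 7)))

full = conv2d_valid(img, k)
# Each separable term is two cheap 1D passes: a column filter, then a row filter
sep = sum(conv2d_valid(conv2d_valid(img, col[:, None]), row[None, :])
          for col, row in separable_approx(k, rank=2))
err = np.abs(full - sep).max()
```

The payoff is the cost model: one pass with a KxK filter is O(K^2) per pixel, while R separable terms cost O(2RK), a large saving whenever R is much smaller than K.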

Mixed reality headsets, such as the Microsoft HoloLens 2, are powerful sensing devices with integrated compute capabilities, which makes them an ideal platform for computer vision research. In this technical report, we present HoloLens 2 Research Mode, an API and a set of tools enabling access to the raw sensor streams. We provide an overview of the API and explain how it can be used to build mixed reality applications based on processing sensor data. We also show how to combine the Research Mode sensor data with the built-in eye and hand tracking capabilities provided by HoloLens 2. By releasing...

10.48550/arxiv.2008.11239 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Building an interactive AI assistant that can perceive, reason, and collaborate with humans in the real world has been a long-standing pursuit in the AI community. This work is part of a broader research effort to develop intelligent agents that can interactively guide humans through performing tasks in the physical world. As a first step in this direction, we introduce HoloAssist, a large-scale egocentric human interaction dataset, where two people collaboratively complete physical manipulation tasks. The task performer executes the task while...

10.1109/iccv51070.2023.01854 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

The lack of large-scale real datasets with annotations makes transfer learning a necessity for video activity understanding. We aim to develop an effective method for few-shot transfer learning for first-person action classification. We leverage independently trained local visual cues to learn representations that can be transferred from a source domain, which provides primitive action labels, to a different target domain using only a handful of examples. The visual cues we employ include object-object interactions, hand grasps and motion within...

10.1109/tpami.2021.3058606 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2021-02-13

State-of-the-art methods for self-supervised sequential action alignment rely on deep networks that find correspondences across videos in time. They either learn frame-to-frame mapping across sequences, which does not leverage temporal information, or assume monotonic alignment between each video pair, which ignores variations in the order of actions. As such, these methods are not able to deal with common real-world scenarios that involve background frames or videos that contain non-monotonic sequences of actions. In this paper, we propose an approach to align...

10.1109/cvpr52688.2022.00222 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
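The monotonic-alignment assumption criticized in the abstract above corresponds to classic dynamic time warping. A minimal DTW sketch (toy 1-D "embeddings", not the paper's method) makes the assumption concrete: every frame must map forward-only onto the other sequence, so out-of-order actions cannot be matched:

```python
import numpy as np

def dtw(cost):
    """Classic dynamic time warping: minimal cumulative cost of a monotonic
    alignment path through an n x m pairwise-cost matrix."""
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Only forward moves are allowed: match, insert, or delete
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],
                                               D[i, j - 1],
                                               D[i - 1, j - 1])
    return D[n, m]

# Two videos of the same action at different speeds align perfectly...
a = np.array([0.0, 1.0, 2.0, 3.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 2.0, 3.0])
d_same = dtw(np.abs(a[:, None] - b[None, :]))

# ...while any content mismatch must be paid for along the monotonic path
d_diff = dtw(np.abs((a + 1.0)[:, None] - b[None, :]))
```

Because the path may only move right, down, or diagonally, DTW has no way to express a swap in the order of actions, which is exactly the limitation the paper's approach is designed to remove.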

We propose an efficient approach to exploiting motion information from consecutive frames of a video sequence to recover the 3D pose of people. Instead of computing candidate poses in individual frames and then linking them, as is often done, we regress directly from a spatio-temporal block of frames to a 3D pose in the central one. We will demonstrate that this allows us to effectively overcome ambiguities and improve upon the state-of-the-art on challenging sequences.

10.48550/arxiv.1504.08200 preprint EN other-oa arXiv (Cornell University) 2015-01-01

Temporal alignment of fine-grained human actions in videos is important for numerous applications in computer vision, robotics, and mixed reality. State-of-the-art methods directly learn an image-based embedding space by leveraging powerful deep convolutional neural networks. While being straightforward, their results are far from satisfactory, and the aligned videos exhibit severe temporal discontinuity without additional post-processing steps. The recent advancements in human body and hand pose estimation in the wild promise...

10.1109/cvpr52688.2022.00800 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01

The steerable wavelet transform is a redundant image representation with the remarkable property that its basis functions can be adaptively rotated to a desired orientation. This makes it well-suited for the design of wavelet-based algorithms applicable to images with a high amount of directional features. However, arbitrary modification of the wavelet-domain coefficients may violate consistency constraints, because a legitimate representation must be redundant. In this paper, by honoring the redundancy of the coefficients, we demonstrate that it is possible...

10.1109/icassp.2013.6637872 article EN IEEE International Conference on Acoustics Speech and Signal Processing 2013-05-01
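The steering property mentioned above - synthesizing a filter at an arbitrary orientation as a linear combination of a fixed basis - can be illustrated with first-order derivative-of-Gaussian filters, a standard textbook example rather than the paper's wavelet construction:

```python
import numpy as np

# First-order steerable basis: x- and y-derivatives of a Gaussian
x, y = np.meshgrid(np.arange(-5, 6), np.arange(-5, 6))
g = np.exp(-(x**2 + y**2) / (2 * 2.0**2))
gx, gy = -x * g, -y * g          # basis filters at 0 and 90 degrees

def steered(theta):
    """Synthesize the derivative filter at any orientation from the fixed basis."""
    return np.cos(theta) * gx + np.sin(theta) * gy

# Sanity check: a linear ramp oriented at theta0 responds most strongly
# to the filter steered to (approximately) theta0.
theta0 = 0.7
ramp = x * np.cos(theta0) + y * np.sin(theta0)
thetas = np.linspace(0.0, np.pi, 181)
resp = [abs(np.sum(steered(t) * ramp)) for t in thetas]
theta_hat = thetas[int(np.argmax(resp))]
```

The key point is that no filter is ever re-rendered at runtime: any orientation is reached by re-weighting the same two basis responses, which is the property higher-order steerable wavelets generalize.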

Generating natural hand-object interactions in 3D is challenging, as the resulting hand and object motions are expected to be physically plausible and semantically meaningful. Furthermore, generalization to unseen objects is hindered by the limited scale of available hand-object interaction datasets. We propose DiffH2O, a novel method to synthesize realistic, one- or two-handed object interactions from provided text prompts and the geometry of the object. The method introduces three techniques that enable effective learning from limited data. First, we decompose the task into...

10.48550/arxiv.2403.17827 preprint EN arXiv (Cornell University) 2024-03-26