Huseyin Coskun

ORCID: 0000-0002-4669-2220
Research Areas
  • Human Pose and Action Recognition
  • Anomaly Detection Techniques and Applications
  • Advanced Vision and Imaging
  • Gait Recognition and Analysis
  • Video Surveillance and Tracking Methods
  • Domain Adaptation and Few-Shot Learning
  • Adversarial Robustness in Machine Learning
  • Explainable Artificial Intelligence (XAI)
  • Multimodal Machine Learning Applications
  • Diabetic Foot Ulcer Assessment and Management
  • CCD and CMOS Imaging Sensors
  • Embedded Systems Design Techniques
  • Surgical Simulation and Training
  • Hand Gesture Recognition Systems
  • Stroke Rehabilitation and Recovery
  • Advanced Malware Detection Techniques
  • Image Processing Techniques and Applications
  • Interactive and Immersive Displays
  • Robotics and Sensor-Based Localization
  • Cell Image Analysis Techniques
  • Parallel Computing and Optimization Techniques
  • Multimedia Communication and Technology

Snap (United States)
2022

Technical University of Munich
2016-2019

One-shot pose estimation for tasks such as body joint localization, camera pose estimation, and object tracking is generally noisy, and temporal filters have been extensively used for regularization. One of the most widely used methods is the Kalman filter, which is both extremely simple and general. However, Kalman filters require a motion model and a measurement model to be specified a priori, which burdens the modeler and simultaneously demands that we use explicit models that are often only crude approximations of reality. For example, in the pose-estimation tasks mentioned...

10.1109/iccv.2017.589 article EN 2017-10-01
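
The modeling burden described above can be seen in a minimal constant-velocity Kalman filter. This is an illustrative sketch only: the motion model `F`, measurement model `H`, and noise variances `q` and `r` below are hand-specified toy choices, not values from the paper.

```python
import numpy as np

def kalman_filter(zs, q=1e-3, r=0.05):
    """Smooth noisy 1-D measurements with a constant-velocity Kalman filter.

    zs : noisy scalar position measurements.
    q, r : process / measurement noise variances. Both must be chosen by
    hand a priori, which is exactly the modeling burden noted above.
    """
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity motion model
    H = np.array([[1.0, 0.0]])               # we observe position only
    Q = q * np.eye(2)
    R = np.array([[r]])
    x = np.array([zs[0], 0.0])               # state: [position, velocity]
    P = np.eye(2)                            # state covariance
    out = []
    for z in zs:
        # predict step
        x = F @ x
        P = F @ P @ F.T + Q
        # update step
        y = z - H @ x                        # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
        x = x + (K @ y).ravel()
        P = (np.eye(2) - K @ H) @ P
        out.append(x[0])
    return np.array(out)
```

Note that the filter only performs well to the extent that the hand-picked model matches the true dynamics, which motivates learning these quantities from data instead.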

We present a sampling-free approach for computing the epistemic uncertainty of a neural network. Epistemic uncertainty is an important quantity for the deployment of deep networks in safety-critical applications, since it represents how much one can trust predictions on new data. Recently, promising works were proposed that use noise injection combined with Monte-Carlo sampling at inference time to estimate this quantity (e.g. Monte-Carlo dropout). Our main contribution is an approximation of the uncertainty estimated by these methods that does not require...

10.1109/iccv.2019.00302 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01
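
For context, the Monte-Carlo dropout baseline that this work approximates can be sketched as follows. The two-layer network, weights, and dropout rate are hypothetical; the paper's actual sampling-free variance propagation is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, W1, W2, p=0.5, n_samples=100):
    """Epistemic-uncertainty estimate via Monte-Carlo dropout.

    Runs a toy two-layer network n_samples times with dropout kept active
    at inference time and returns the predictive mean and variance. The
    variance across samples serves as the epistemic-uncertainty estimate.
    """
    preds = []
    for _ in range(n_samples):
        h = np.maximum(W1 @ x, 0.0)               # ReLU hidden layer
        mask = rng.random(h.shape) > p            # Bernoulli dropout mask
        h = h * mask / (1.0 - p)                  # inverted-dropout scaling
        preds.append(W2 @ h)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.var(axis=0)  # mean, epistemic variance
```

The repeated forward passes are what make this baseline expensive at inference time, which is the cost a sampling-free approximation avoids.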

We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task, while exploiting both frame-level and video-level information. We leverage a novel combination of temporal alignment loss and regularization terms, which can be used as supervision signals for training an encoder network. Specifically, the temporal alignment loss (i.e., Soft-DTW) aims for the minimum cost of temporally aligning videos in the embedding space. However, optimizing solely for this term leads to trivial solutions, particularly one where all...

10.1109/cvpr46437.2021.00550 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
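
The Soft-DTW alignment cost mentioned above replaces DTW's hard minimum with a differentiable soft minimum so it can serve as a training loss. A minimal sketch of the forward recurrence (omitting the gradient computation an actual training loss would need) might look like:

```python
import numpy as np

def soft_dtw(D, gamma=1.0):
    """Soft-DTW alignment cost for a pairwise distance matrix D (n x m).

    Smaller gamma approaches hard DTW; larger gamma gives a smoother,
    more easily optimized objective.
    """
    def softmin(a, b, c):
        # Differentiable soft minimum: -gamma * logsumexp(-values / gamma)
        vals = np.array([a, b, c]) / -gamma
        m = vals.max()
        return -gamma * (m + np.log(np.exp(vals - m).sum()))

    n, m = D.shape
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = D[i - 1, j - 1] + softmin(R[i - 1, j],
                                                R[i, j - 1],
                                                R[i - 1, j - 1])
    return R[n, m]
```

As the abstract notes, minimizing an alignment cost like this alone admits trivial solutions (e.g. a collapsed embedding makes every D entry zero), which is why additional regularization terms are needed.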

The lack of large-scale real datasets with annotations makes transfer learning a necessity for video activity understanding. We aim to develop an effective method for few-shot first-person action classification. We leverage independently trained local visual cues to learn representations that can be transferred from a source domain, which provides primitive action labels, to a different target domain using only a handful of examples. The visual cues we employ include object-object interactions, hand grasps and motion within...

10.1109/tpami.2021.3058606 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2021-02-13

State-of-the-art methods for self-supervised sequential action alignment rely on deep networks that find correspondences across videos in time. They either learn frame-to-frame mapping across sequences, which does not leverage temporal information, or assume a monotonic alignment between each video pair, which ignores variations in the order of actions. As such, these methods are not able to deal with common real-world scenarios that involve background frames or contain a non-monotonic sequence of actions. In this paper, we propose an approach to align...

10.1109/cvpr52688.2022.00222 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01

Vision Transformers (ViTs) are known for their scalability. In this work, we aim to scale down a ViT to fit an environment with dynamically changing resource constraints. We observe that smaller ViTs are intrinsically sub-networks of a larger ViT with different widths. Thus, we propose a general framework, named Scala, to enable a single network to represent multiple smaller ViTs with flexible inference capability, which aligns with the inherent design of ViTs to vary in width. Concretely, Scala activates several subnets during training and introduces Isolated...

10.48550/arxiv.2412.04786 preprint EN arXiv (Cornell University) 2024-12-06
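
The observation that smaller networks are sub-networks of a larger one at different widths can be illustrated on a toy two-layer network by slicing the first k hidden units. The layer shapes here are hypothetical, and this is only an illustration of width slicing, not Scala's actual training procedure.

```python
import numpy as np

def subnet_forward(x, W1, W2, width_ratio=0.5):
    """Run a narrower sub-network by keeping only the first k hidden units.

    width_ratio selects the fraction of the hidden layer to activate, so a
    single set of weights can serve multiple compute budgets at inference.
    """
    k = max(1, int(W1.shape[0] * width_ratio))
    h = np.maximum(W1[:k] @ x, 0.0)    # first k rows of the input projection
    return W2[:, :k] @ h               # matching first k columns of the output
```

With width_ratio=1.0 this reduces to the full network; smaller ratios trade accuracy for compute without storing separate models.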

Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution, high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency, while ensuring high-quality generation. Second,...

10.48550/arxiv.2412.09619 preprint EN arXiv (Cornell University) 2024-12-12

We have witnessed the unprecedented success of diffusion-based video generation over the past year. Recently proposed models from the community wield the power to generate cinematic, high-resolution videos with smooth motions from arbitrary input prompts. However, as a supertask of image generation, video generation models require more computation and are thus hosted mostly on cloud servers, limiting broader adoption among content creators. In this work, we propose a comprehensive acceleration framework to bring large-scale diffusion...

10.48550/arxiv.2412.10494 preprint EN arXiv (Cornell University) 2024-12-13

10.48550/arxiv.1908.00598 preprint EN other-oa arXiv (Cornell University) 2019-01-01

10.48550/arxiv.1708.01885 preprint EN other-oa arXiv (Cornell University) 2017-01-01

Effectively measuring the similarity between two human motions is necessary for several computer vision tasks such as gait analysis, person identification and action retrieval. Nevertheless, we believe that traditional approaches such as L2 distance or Dynamic Time Warping based on hand-crafted local pose metrics fail to appropriately capture the semantic relationship across motions and, as such, are not suitable for being employed within these tasks. This work addresses this limitation by means of a triplet-based...

10.48550/arxiv.1807.11176 preprint EN other-oa arXiv (Cornell University) 2018-01-01
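
A triplet-based objective of the general kind referenced above can be sketched as follows. The margin value and the embedding inputs are illustrative placeholders, not the paper's actual configuration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss on motion embeddings.

    Pulls the anchor toward a semantically similar motion and pushes it away
    from a dissimilar one, so that plain L2 distance in the learned embedding
    space reflects semantic motion similarity.
    """
    d_pos = np.linalg.norm(anchor - positive)   # distance to similar motion
    d_neg = np.linalg.norm(anchor - negative)   # distance to dissimilar motion
    return max(d_pos - d_neg + margin, 0.0)     # hinge on the margin violation
```

The loss is zero once the dissimilar motion is at least `margin` farther from the anchor than the similar one, which is what makes simple L2 retrieval meaningful in the trained space.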

The video action segmentation task is regularly explored under weaker forms of supervision, such as transcript supervision, where a list of actions is easier to obtain than dense frame-wise labels. In this formulation, the task presents various challenges for sequence modeling approaches due to the emphasis on action transition points, long sequence lengths, and frame contextualization, making the task well-posed for transformers. Given developments enabling transformers to scale linearly, we demonstrate through our architecture how they can be applied...

10.48550/arxiv.2201.05675 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Clustering is a ubiquitous tool in unsupervised learning. Most of the existing self-supervised representation learning methods typically cluster samples based on visually dominant features. While this works well for image-based self-supervision, it often fails for videos, which require understanding motion rather than focusing on background. Using optical flow as complementary information to RGB can alleviate this problem. However, we observe that a naive combination of the two views does not provide meaningful...

10.48550/arxiv.2207.10158 preprint EN cc-by arXiv (Cornell University) 2022-01-01

10.48550/arxiv.2111.09301 preprint EN cc-by-nc-nd arXiv (Cornell University) 2021-01-01

10.48550/arxiv.2103.17260 preprint EN other-oa arXiv (Cornell University) 2021-01-01