Emad Barsoum

ORCID: 0000-0002-4097-8690
Research Areas
  • Advanced Neural Network Applications
  • Human Motion and Animation
  • Topic Modeling
  • Multimodal Machine Learning Applications
  • Human Pose and Action Recognition
  • Natural Language Processing Techniques
  • Video Analysis and Summarization
  • Music and Audio Processing
  • Emotion and Mood Recognition
  • Domain Adaptation and Few-Shot Learning
  • Speech Recognition and Synthesis
  • Composite Structure Analysis and Optimization
  • Speech and Audio Processing
  • Video Surveillance and Tracking Methods
  • Advanced Image and Video Retrieval Techniques
  • Generative Adversarial Networks and Image Synthesis
  • Advanced Image Processing Techniques
  • Advanced Vision and Imaging
  • Elasticity and Material Modeling
  • Face and Expression Recognition
  • Stochastic Gradient Optimization Techniques
  • Autonomous Vehicle Technology and Safety
  • Gait Recognition and Analysis
  • Machine Learning and ELM
  • Algorithms and Data Compression

Advanced Micro Devices (United States)
2024

Columbia University
1993-2020

Microsoft (United States)
2016-2017

Ain Shams University
2005

Automatic emotion recognition from speech is a challenging task which relies heavily on the effectiveness of the speech features used for classification. In this work, we study the use of deep learning to automatically discover emotionally relevant features from speech. It is shown that using a recurrent neural network, we can learn both the short-time frame-level acoustic features that are emotionally relevant, as well as an appropriate temporal aggregation of those features into a compact utterance-level representation. Moreover, we propose a novel strategy for feature pooling over...

10.1109/icassp.2017.7952552 article EN 2017-03-01
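
As a hedged illustration of the idea in this abstract (a recurrent encoder over frame-level acoustic features followed by attention-weighted temporal pooling into an utterance-level vector), here is a minimal PyTorch sketch; the feature dimension, hidden size, and number of emotion classes are assumptions, not the paper's settings.

```python
# Minimal sketch (not the paper's exact architecture): GRU over frame-level
# acoustic features, attention-based temporal pooling, utterance-level classifier.
import torch
import torch.nn as nn

class AttentivePoolingRNN(nn.Module):
    def __init__(self, n_features=40, hidden=128, n_emotions=4):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)           # scores each frame
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, frames):                      # frames: (batch, time, n_features)
        h, _ = self.rnn(frames)                     # (batch, time, hidden)
        w = torch.softmax(self.attn(h), dim=1)      # attention weights over time
        utterance = (w * h).sum(dim=1)              # weighted pooling -> (batch, hidden)
        return self.classifier(utterance)

logits = AttentivePoolingRNN()(torch.randn(2, 300, 40))   # 2 utterances, 300 frames each
```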

Crowd sourcing has become a widely adopted scheme to collect ground truth labels. However, it is a well-known problem that these labels can be very noisy. In this paper, we demonstrate how to learn a deep convolutional neural network (DCNN) from noisy labels, using facial expression recognition as an example. More specifically, we have 10 taggers label each input image, and compare four different approaches to utilizing the multiple labels: majority voting, multi-label learning, probabilistic label drawing,...

10.1145/2993148.2993165 preprint EN 2016-10-31
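
To make the label-aggregation comparison concrete, below is a small sketch contrasting two of the listed strategies: majority voting (train on the argmax label) versus cross-entropy against the full tagger distribution. The tensor shapes, the eight-class layout, and the example counts are illustrative assumptions only.

```python
# Hedged sketch of two label-aggregation strategies for crowd-sourced labels.
import torch
import torch.nn.functional as F

tagger_counts = torch.tensor([[1., 7., 0., 2., 0., 0., 0., 0.]])   # 10 taggers, 8 classes (illustrative)
label_dist = tagger_counts / tagger_counts.sum(dim=1, keepdim=True)

logits = torch.randn(1, 8, requires_grad=True)                      # stand-in for DCNN output

# 1) Majority voting: collapse the taggers to a single hard label.
hard_loss = F.cross_entropy(logits, label_dist.argmax(dim=1))

# 2) Distribution target: cross-entropy against the full tagger label distribution.
soft_loss = -(label_dist * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```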

Predicting and understanding human motion dynamics has many applications, such as motion synthesis, augmented reality, security, and autonomous vehicles. Due to the recent success of generative adversarial networks (GAN), there has been much interest in probabilistic estimation and synthetic data generation using deep neural network architectures and learning algorithms. We propose a novel sequence-to-sequence model for human motion prediction, trained with a modified version of the improved Wasserstein GAN (WGAN-GP), in which we use a custom...

10.1109/cvprw.2018.00191 article EN 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2018-06-01
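
For context on the training objective mentioned above, the following is a minimal sketch of the WGAN-GP gradient penalty term, independent of the paper's actual sequence-to-sequence generator; the critic is a placeholder MLP and the pose-sequence shapes are assumptions.

```python
# Minimal WGAN-GP gradient-penalty sketch; the critic and shapes are placeholders.
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(32 * 63, 256), nn.ReLU(), nn.Linear(256, 1))

def gradient_penalty(real, fake, lam=10.0):
    eps = torch.rand(real.size(0), 1)                       # per-sample interpolation coefficients
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()   # push gradient norm toward 1

real = torch.randn(8, 32 * 63)    # 8 real sequences, 32 frames x 63 joint coordinates (assumed)
fake = torch.randn(8, 32 * 63)
penalty = gradient_penalty(real, fake)
```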

This paper presents the implementation details of the proposed solution to the Emotion Recognition in the Wild 2016 Challenge, in the category of video-based emotion recognition. The proposed approach takes the video stream from the audio-video trimmed clips provided by the challenge as input and produces the emotion label corresponding to this video sequence. The output is encoded as one out of seven classes: the six basic emotions (Anger, Disgust, Fear, Happiness, Sad, Surprise) and Neutral. Overall, the system consists of several pipelined modules: face detection, image...

10.1145/2993148.2997627 article EN 2016-10-31

Transformer-based LLMs have achieved exceptional performance across a wide range of NLP tasks. However, the standard self-attention mechanism suffers from quadratic time complexity and a linearly increasing cache size. Sliding window attention (SWA) solves this problem by restricting the attention range to a fixed-size local context window. Nevertheless, SWA employs a uniform window size for every head in every layer, making it inefficient at capturing context of varying scales. To mitigate this limitation, we propose Multi-Scale Window Attention...

10.48550/arxiv.2501.01039 preprint EN arXiv (Cornell University) 2025-01-01
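
As an illustration of the sliding-window idea discussed above, the sketch below builds per-head causal window masks with different window sizes, which is the intuition behind a multi-scale variant; the sequence length and window sizes are made-up values, not the paper's configuration.

```python
# Per-head sliding-window attention masks with different window sizes (illustrative).
import torch

def sliding_window_mask(seq_len, window):
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (i - j < window)          # causal, limited to `window` past tokens

seq_len, head_windows = 16, [4, 8, 16]          # one window size per head (assumed values)
masks = torch.stack([sliding_window_mask(seq_len, w) for w in head_windows])
# masks: (num_heads, seq_len, seq_len); apply as attn_scores.masked_fill(~masks, float("-inf"))
```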

Recent advancements in diffusion models have significantly facilitated text-guided video editing. However, there is a relative scarcity of research on image-guided video editing, a method that empowers users to edit videos by merely indicating a target object in the initial frame and providing an RGB image as reference, without relying on text prompts. In this paper, we propose a novel Image-guided Video Editing Diffusion model, termed IVEDiff. Built on top of image editing models, it is equipped with learnable motion...

10.48550/arxiv.2501.04325 preprint EN arXiv (Cornell University) 2025-01-08

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens, which are then verified in parallel by the larger target model. However, the limited capacity of the draft model often necessitates tree-based sampling to improve prediction accuracy, where multiple candidates are generated at each step. We identify a key limitation of this approach: candidates at the same step are derived from the same representation, limiting diversity and reducing overall effectiveness. To address this, we propose Jakiro,...

10.48550/arxiv.2502.06282 preprint EN arXiv (Cornell University) 2025-02-10
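
The toy sketch below shows the generic draft-then-verify loop that speculative decoding relies on (not Jakiro's specific drafting head); `draft_model`, `target_model`, and the greedy acceptance rule are simplifying placeholders.

```python
# Toy draft-then-verify loop for speculative decoding; models are random stand-ins.
import torch

def speculative_step(prefix, draft_model, target_model, k=4):
    # 1) Draft k tokens autoregressively with the small model (greedy for brevity).
    drafted = []
    for _ in range(k):
        drafted.append(int(draft_model(prefix + drafted).argmax()))
    # 2) Verify with the large model, keeping the longest agreeing run and
    #    replacing the first mismatching token with the target model's choice.
    accepted = []
    for tok in drafted:
        target_tok = int(target_model(prefix + accepted).argmax())
        accepted.append(target_tok)
        if target_tok != tok:
            break
    return prefix + accepted

vocab = 100
toy = lambda seq: torch.softmax(torch.randn(vocab), dim=0)   # placeholder draft/target models
print(speculative_step([1, 2, 3], toy, toy))
```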

Designing an efficient and effective neural network has remained a prominent topic in computer vision research. Depthwise convolution (DWConv) is widely used in CNNs and ViTs, but it needs frequent memory access during inference, which leads to low throughput. FasterNet attempts to introduce partial convolution (PConv) as an alternative to DWConv, but compromises the accuracy due to underutilized channels. To remedy this shortcoming and account for the redundancy between feature map channels, we propose a novel Partial visual...

10.32388/1l3te6 preprint EN cc-by 2025-03-21
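
Since the abstract positions its contribution relative to FasterNet's partial convolution, here is a minimal sketch of a PConv-style block that convolves only a subset of channels and passes the rest through untouched; the channel split ratio and kernel size are assumptions.

```python
# FasterNet-style partial convolution (PConv) sketch: convolve 1/div of the channels.
import torch
import torch.nn as nn

class PConv(nn.Module):
    def __init__(self, channels, div=4, kernel=3):
        super().__init__()
        self.c_conv = channels // div                                  # channels that get convolved
        self.conv = nn.Conv2d(self.c_conv, self.c_conv, kernel, padding=kernel // 2)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_conv, x.size(1) - self.c_conv], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)                   # remaining channels pass through

y = PConv(64)(torch.randn(1, 64, 32, 32))
```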

3D Gaussian Splatting (3D GS) has gained popularity due to its faster rendering speed and high-quality novel view synthesis. Some researchers have explored using 3D GS for reconstructing driving scenes. However, these methods often rely on various types of data, such as depth maps, 3D bounding boxes, and trajectories of moving objects. Additionally, the lack of annotations for synthesized images limits their direct application in downstream tasks. To address these issues, we propose EGSRAL, a 3D GS-based method that...

10.1609/aaai.v39i4.32403 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2025-04-11

With the increasing number of companies focusing on commercializing Augmented Reality (AR), Virtual Reality (VR) and wearable devices, the need for a hand-based input mechanism is becoming essential in order to make the experience natural, seamless and immersive. Hand pose estimation has progressed drastically in recent years due to the introduction of commodity depth cameras. Vision-based hand pose estimation is still a challenging problem due to its complexity arising from self-occlusion (between fingers), the close similarity between fingers, the dexterity of the hands, the speed of the pose and the high...

10.48550/arxiv.1604.06195 preprint EN other-oa arXiv (Cornell University) 2016-01-01

Diffusion-based image super-resolution (SR) models have attracted substantial interest due to their powerful restoration capabilities. However, prevailing diffusion models often struggle to strike an optimal balance between efficiency and performance. Typically, they either neglect to exploit the potential of existing extensive pretrained models, limiting their generative capacity, or they necessitate dozens of forward passes starting from random noise, compromising inference efficiency. In this paper, we present...

10.48550/arxiv.2409.17778 preprint EN arXiv (Cornell University) 2024-09-26

Crowd sourcing has become a widely adopted scheme to collect ground truth labels. However, it is a well-known problem that these labels can be very noisy. In this paper, we demonstrate how to learn a deep convolutional neural network (DCNN) from noisy labels, using facial expression recognition as an example. More specifically, we have 10 taggers label each input image, and compare four different approaches to utilizing the multiple labels: majority voting, multi-label learning, probabilistic label drawing,...

10.48550/arxiv.1608.01041 preprint EN other-oa arXiv (Cornell University) 2016-01-01

Lane detection is a fundamental task in autonomous driving, and it has achieved great progress as deep learning emerges. Previous anchor-based methods often design dense anchors, which highly depend on the training dataset and remain fixed during inference. We analyze that dense anchors are not necessary for lane detection, and propose a transformer-based lane detection framework based on a sparse anchor mechanism. To this end, we generate sparse anchors with position-aware lane queries and angle queries instead of traditional explicit anchors. We adopt Horizontal...

10.48550/arxiv.2404.07821 preprint EN arXiv (Cornell University) 2024-04-11
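
To make the anchor parameterization concrete, the snippet below generates a sparse line anchor from a starting x-position and an angle, in the spirit of the position/angle queries mentioned above; the image size, row sampling, and example anchors are arbitrary assumptions, not the paper's design.

```python
# Illustrative line-anchor generation from a start position and an angle.
import numpy as np

def line_anchor(start_x, angle_deg, img_h=320, img_w=800, n_points=72):
    ys = np.linspace(img_h - 1, 0, n_points)                  # sample rows bottom-to-top
    xs = start_x + (img_h - 1 - ys) / np.tan(np.radians(angle_deg))
    return np.stack([np.clip(xs, 0, img_w - 1), ys], axis=1)  # (n_points, 2) anchor coordinates

anchors = [line_anchor(x, a) for x, a in [(100, 70), (400, 90), (650, 110)]]
```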

Video Frame Interpolation (VFI) is a crucial technique in various applications such as slow-motion generation, frame rate conversion, video restoration, etc. This paper introduces an efficient video frame interpolation framework that aims to strike a favorable balance between efficiency and quality. Our framework follows a general paradigm consisting of a flow estimator and a refinement module, while incorporating carefully designed components. First of all, we adopt depth-wise convolution with large kernels in the flow estimator to simultaneously...

10.48550/arxiv.2404.11108 preprint EN arXiv (Cornell University) 2024-04-17
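
As a quick illustration of the building block named here, the snippet below applies a depth-wise convolution with a large kernel; the channel count and 7x7 kernel size are arbitrary choices rather than the paper's configuration.

```python
# Depth-wise convolution with a large kernel: one filter per channel, groups == channels.
import torch
import torch.nn as nn

dw_large = nn.Conv2d(64, 64, kernel_size=7, padding=3, groups=64)
feat = torch.randn(1, 64, 128, 128)
out = dw_large(feat)          # same spatial size; each channel convolved independently
```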

Pre-trained language models (PLMs) are engineered to be robust in contextual understanding and exhibit outstanding performance in various natural language processing tasks. However, their considerable size incurs significant computational and storage costs. Modern pruning strategies employ one-shot techniques to compress PLMs without the need for retraining on task-specific or otherwise general data; however, these approaches often lead to an indispensable reduction in performance. In this paper, we propose SDS, a...

10.48550/arxiv.2408.10473 preprint EN arXiv (Cornell University) 2024-08-19
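
To ground the phrase "one-shot compression without retraining", here is a generic sketch of one-shot unstructured magnitude pruning of a linear layer; it is not the SDS method itself, and the sparsity ratio is an arbitrary choice.

```python
# Generic one-shot magnitude pruning of a linear layer (not the SDS method).
import torch
import torch.nn as nn

def magnitude_prune_(linear: nn.Linear, sparsity=0.5):
    w = linear.weight.data
    k = int(w.numel() * sparsity)
    threshold = w.abs().flatten().kthvalue(k).values        # k-th smallest |w|
    w.mul_((w.abs() > threshold).float())                   # zero out the smallest weights in place

layer = nn.Linear(768, 768)
magnitude_prune_(layer, sparsity=0.5)
print((layer.weight == 0).float().mean())                   # roughly half the weights are now zero
```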

Diffusion models have achieved remarkable progress in the field of image generation due to their outstanding capabilities. However, these models require substantial computing resources because of the multi-step denoising process during inference. While traditional pruning methods have been employed to optimize these models, the retraining process necessitates large-scale training datasets and extensive computational costs to maintain generalization ability, making it neither convenient nor efficient. Recent studies attempt to utilize...

10.48550/arxiv.2410.16942 preprint EN arXiv (Cornell University) 2024-10-22

Recently, large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws, which significantly increase model size. However, the huge computation overhead during inference hinders deployment in industrial applications. Many works leverage traditional compression approaches to boost model inference, but these always introduce additional training costs to restore the performance, and the pruning results typically show noticeable performance drops compared to the original model when aiming for a...

10.48550/arxiv.2412.11494 preprint EN arXiv (Cornell University) 2024-12-16

3D Gaussian Splatting (3D GS) has gained popularity due to its faster rendering speed and high-quality novel view synthesis. Some researchers have explored using 3D GS for reconstructing driving scenes. However, these methods often rely on various data types, such as depth maps, 3D bounding boxes, and trajectories of moving objects. Additionally, the lack of annotations for synthesized images limits their direct application in downstream tasks. To address these issues, we propose EGSRAL, a 3D GS-based method that relies...

10.48550/arxiv.2412.15550 preprint EN arXiv (Cornell University) 2024-12-19

In text-to-image (T2I) generation applications, negative embeddings have proven to be a simple yet effective approach for enhancing generation quality. Typically, these embeddings are derived from user-defined negative prompts, which, while being functional, are not necessarily optimal. In this paper, we introduce ReNeg, an end-to-end method designed to learn improved Negative embeddings guided by a Reward model. We employ a reward feedback learning framework and integrate classifier-free guidance (CFG) into the training process, which was...

10.48550/arxiv.2412.19637 preprint EN arXiv (Cornell University) 2024-12-27
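
For reference, the sketch below shows the classifier-free guidance combination that such a training setup builds on: the denoiser is evaluated with the text condition and with a negative embedding, and the two predictions are blended with a guidance scale. The `denoiser` stand-in and the embedding shapes are placeholders, not ReNeg's actual model.

```python
# Classifier-free guidance combination with a (possibly learned) negative embedding.
import torch

def cfg_predict(denoiser, x_t, t, cond_emb, neg_emb, guidance_scale=7.5):
    eps_cond = denoiser(x_t, t, cond_emb)         # prediction conditioned on the text prompt
    eps_neg = denoiser(x_t, t, neg_emb)           # prediction conditioned on the negative embedding
    return eps_neg + guidance_scale * (eps_cond - eps_neg)

denoiser = lambda x, t, e: x * 0.1 + e.mean()     # stand-in for a diffusion UNet
x_t = torch.randn(1, 4, 64, 64)
cond_emb, neg_emb = torch.randn(77, 768), torch.zeros(77, 768)
eps = cfg_predict(denoiser, x_t, 0, cond_emb, neg_emb)
```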

Inspired by CapsNet's routing-by-agreement mechanism and its ability to learn object properties, we explore whether those properties in turn can determine new properties of the objects, such as their locations. We then propose a CapsNet architecture with coordinate atoms and a modified routing algorithm with unevenly distributed initial routing probabilities. The model is based on CapsNet but uses the routing algorithm to find objects' approximate positions in the image coordinate system. We also discuss how to derive the property of translation through coordinate atoms and show the importance of sparse representation....

10.48550/arxiv.1805.07706 preprint EN other-oa arXiv (Cornell University) 2018-01-01
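
As background for the modified routing algorithm described above, here is a compact numpy sketch of standard routing-by-agreement; the unevenly distributed initial routing probabilities would enter through `init_logits`, and all dimensions are illustrative.

```python
# Compact routing-by-agreement sketch; `init_logits` lets initial probabilities be non-uniform.
import numpy as np

def squash(v, axis=-1):
    n = np.linalg.norm(v, axis=axis, keepdims=True)
    return (n**2 / (1 + n**2)) * v / (n + 1e-9)

def route(u_hat, init_logits, iters=3):
    # u_hat: (n_in, n_out, dim) prediction vectors; init_logits: (n_in, n_out)
    b = init_logits.copy()
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)    # coupling coefficients
        s = (c[..., None] * u_hat).sum(axis=0)                   # weighted sum per output capsule
        v = squash(s)                                            # (n_out, dim) output capsules
        b = b + (u_hat * v[None]).sum(axis=-1)                   # agreement update
    return v

v = route(np.random.randn(32, 10, 16), np.zeros((32, 10)))
```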

Human motion prediction and understanding is a challenging problem due to the complex dynamics of human motion and the non-deterministic aspect of future prediction. We propose a novel sequence-to-sequence model for human motion prediction and feature learning, trained with a modified version of a generative adversarial network, with a custom loss function that takes inspiration from human motion animation and can control the variation between multiple predicted sequences from the same input poses. Our model learns to predict multiple future sequences of human poses from the same input sequence. We show that the discriminator learns a general presentation of human motion by using...

10.48550/arxiv.2012.15378 preprint EN other-oa arXiv (Cornell University) 2020-01-01