- Video Analysis and Summarization
- Multimodal Machine Learning Applications
- Human Motion and Animation
- Advanced Vision and Imaging
- Humor Studies and Applications
- Human Pose and Action Recognition
- Music and Audio Processing
- Subtitles and Audiovisual Media
- Brain Tumor Detection and Classification
- Reinforcement Learning in Robotics
- Advanced Optical Imaging Technologies
- 3D Shape Modeling and Analysis
- Augmented Reality Applications
- Computer Graphics and Visualization Techniques
- Cell Image Analysis Techniques
- Robotic Path Planning Algorithms
- Video Coding and Compression Technologies
- COVID-19 Diagnosis Using AI
- Evacuation and Crowd Dynamics
- Advanced Image Processing Techniques
- Hand Gesture Recognition Systems
- Advanced Image and Video Retrieval Techniques
- Robot Manipulation and Learning
École Polytechnique
2021-2024
Laboratoire d'Informatique de l'École Polytechnique
2021-2024
Université de Rennes
2022-2023
Institut de Recherche en Informatique et Systèmes Aléatoires
2022-2023
Centre National de la Recherche Scientifique
2021-2023
Institut national de recherche en informatique et en automatique
2022-2023
This paper presents JAWS, an optimization-driven approach that achieves the robust transfer of visual cinematic features from a reference in-the-wild video clip to a newly generated clip. To this end, we rely on an implicit-neural-representation (INR) in a way to compute a clip that shares the same cinematic features as the reference. We propose a general formulation of the camera optimization problem in an INR that computes extrinsic and intrinsic camera parameters as well as timing. By leveraging the differentiability of neural representations, we can back-propagate our designed losses...
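The abstract is cut off above, but the key mechanism it names, back-propagating losses to extrinsic and intrinsic camera parameters through a differentiable representation, can be illustrated independently of JAWS. The sketch below is a minimal stand-in under simplified assumptions (a fixed 3D point cloud instead of a trained INR, translation-only extrinsics, a single focal-length intrinsic); names such as `points_3d` and `ref_2d` are invented for the example.

```python
import torch

# Toy "scene": fixed 3D points standing in for a trained implicit representation.
torch.manual_seed(0)
points_3d = torch.randn(100, 3) + torch.tensor([0.0, 0.0, 5.0])

# Reference 2D projections produced by a hidden ground-truth camera.
true_t = torch.tensor([0.4, -0.2, 0.0])
true_f = 2.0
cam_ref = points_3d + true_t
ref_2d = true_f * cam_ref[:, :2] / cam_ref[:, 2:3]

# Parameters to recover: extrinsic translation t and intrinsic focal length f.
t = torch.zeros(3, requires_grad=True)
f = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.Adam([t, f], lr=1e-2)

for _ in range(2000):
    cam = points_3d + t                   # extrinsics (translation only, for brevity)
    proj = f * cam[:, :2] / cam[:, 2:3]   # differentiable pinhole projection (intrinsics)
    loss = ((proj - ref_2d) ** 2).mean()  # stand-in for the paper's designed losses
    opt.zero_grad()
    loss.backward()                       # gradients reach both t and f
    opt.step()

# Projections should now match the reference; t and f may trade off slightly
# due to the usual focal-length / depth ambiguity.
print(t.detach(), f.item())
```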
Automatically understanding funny moments (i.e., the moments that make people laugh) when watching comedy is challenging, as they relate to various features, such as body language, dialogues and culture. In this paper, we propose FunnyNet-W, a model that relies on cross- and self-attention for visual, audio and text data to predict funny moments in videos. Unlike most methods that rely on ground truth data in the form of subtitles, in this work we exploit modalities that come naturally with videos: (a) video frames, as they contain visual information indispensable for scene...
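The fusion pattern named in the abstract (cross- and self-attention over visual, audio and text streams) can be sketched generically. This is not the FunnyNet-W architecture; dimensions, module names and the pooling head are all assumptions made for a runnable toy example.

```python
import torch
import torch.nn as nn

class CrossSelfFusion(nn.Module):
    """Toy tri-modal fusion: audio/text tokens cross-attend to visual tokens,
    then one self-attention pass over the fused sequence."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)  # clip-level "funny / not funny" logit

    def forward(self, vis, aud, txt):
        ctx = torch.cat([aud, txt], dim=1)    # queries from audio + text tokens
        fused, _ = self.cross(ctx, vis, vis)  # cross-attention onto visual keys/values
        fused, _ = self.self_attn(fused, fused, fused)
        return self.head(fused.mean(dim=1))   # mean-pool, then predict

vis = torch.randn(2, 16, 256)  # 2 clips, 16 visual tokens each
aud = torch.randn(2, 8, 256)   # 8 audio tokens
txt = torch.randn(2, 12, 256)  # 12 text tokens
print(CrossSelfFusion()(vis, aud, txt).shape)  # torch.Size([2, 1])
```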
Stories and emotions in movies emerge through the effect of well-thought-out directing decisions, in particular camera placement and movement over time. Crafting compelling camera trajectories remains a complex iterative process, even for skilful artists. To tackle this, in this paper we propose a dataset called the Exceptional Trajectories (E.T.) with camera trajectories along with character information and textual captions encompassing descriptions of both camera and character. To our knowledge, this is the first dataset of its kind. To show the potential applications of the E.T. dataset,...
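The abstract stops before describing the data format, so the record below is purely speculative: a guess at what one trajectory sample might bundle (camera poses, character positions, paired captions). Every field name is hypothetical, not taken from the E.T. release.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrajectorySample:
    """Hypothetical E.T.-style record: camera trajectory + character info + captions."""
    camera_poses: np.ndarray         # (T, 4, 4) camera-to-world matrices over T frames
    character_positions: np.ndarray  # (T, 3) character root positions
    camera_caption: str              # text describing the camera motion
    character_caption: str           # text describing the character motion

sample = TrajectorySample(
    camera_poses=np.tile(np.eye(4), (120, 1, 1)),
    character_positions=np.zeros((120, 3)),
    camera_caption="The camera pushes in slowly toward the character.",
    character_caption="The character walks forward at a steady pace.",
)
print(sample.camera_poses.shape)  # (120, 4, 4)
```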
Recent advances in text-conditioned video diffusion have greatly improved video quality. However, these methods offer limited or sometimes no control to users on camera aspects, including dynamic camera motion, zoom, distorted lens and focus shifts. These motion and optical aspects are crucial for adding controllability of cinematic elements to generation frameworks, ultimately resulting in visual content that draws focus, enhances mood, and guides emotions according to filmmakers' controls. In this paper, we aim to close the...
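The camera aspects listed (zoom, lens distortion) are simple functions of the intrinsics, which is what makes them natural control knobs for a generator. As a generic camera-model sketch, not the paper's method, the snippet below applies a focal-length scale and a first-order Brown-Conrady radial distortion to normalized pixel coordinates; the function name and default coefficients are invented.

```python
import numpy as np

def apply_zoom_and_distortion(xy, zoom=1.5, k1=-0.2):
    """Map normalized image coordinates through a zoom and a radial distortion.

    xy:   (N, 2) points, origin at the principal point.
    zoom: focal-length multiplier (>1 narrows the field of view).
    k1:   first radial distortion coefficient (negative = barrel distortion).
    """
    r2 = np.sum(xy ** 2, axis=1, keepdims=True)
    distorted = xy * (1.0 + k1 * r2)  # Brown-Conrady radial term, first order
    return zoom * distorted

# Warp a small grid of image points.
grid = np.stack(np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5)), -1).reshape(-1, 2)
print(apply_zoom_and_distortion(grid)[:3])
```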
The artistic crafting of 3D animations by designers is a complex and iterative process. While classical animation tools have brought significant improvements in creating and manipulating shapes over time, most approaches rely on 2D input devices to create contents. With the advent of virtual reality technologies and their ability to immerse users in 3D worlds and to precisely track input devices in 6 dimensions (position and orientation), a number of VR creative tools have emerged, such as Quill, AnimVR, Tvori, Tiltbrush or MasterPieceVR. While these provide...
Transformers were initially introduced for natural language processing (NLP) tasks, but they were quickly adopted by most deep learning fields, including computer vision. They measure the relationships between pairs of input tokens (words in the case of text strings, parts of images for visual Transformers), termed attention. The cost is quadratic in the number of tokens. For image classification, the most common Transformer architecture uses only the Transformer encoder in order to transform the various input tokens. However, there are also numerous other...
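To make the cost statement concrete: scaled dot-product attention materializes an n x n score matrix between all token pairs, so doubling the token count quadruples the work and memory. A minimal generic implementation in NumPy (textbook attention, not any specific paper's variant):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over n tokens of dimension d."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n, n): the quadratic-cost pairwise term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

n, d = 8, 16
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (8, 16); the score matrix is (8, 8)
```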
Neural Radiance Fields (NeRFs) have revolutionized novel view synthesis, offering visually realistic, precise, and robust implicit scene reconstructions. While recent approaches enable NeRF editing, such as object removal, 3D shape modification, or material property manipulation, the manual annotation required prior to such edits makes the process tedious. Additionally, traditional 2D interaction tools lack an accurate sense of space, preventing precise manipulation and editing of scenes. In this paper, we introduce...
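For readers unfamiliar with the implicit reconstructions mentioned above: at its core, a NeRF is an MLP mapping a positionally encoded 3D point to color and density (view-direction input and volume rendering are omitted here). The sketch below is a stripped-down illustration with made-up layer sizes, not any specific paper's model.

```python
import torch
import torch.nn as nn

def positional_encoding(x, n_freqs=6):
    """NeRF-style sin/cos encoding of coordinates."""
    freqs = 2.0 ** torch.arange(n_freqs) * torch.pi
    enc = [fn(x[..., None] * freqs) for fn in (torch.sin, torch.cos)]
    return torch.cat([e.flatten(-2) for e in enc], dim=-1)

class TinyField(nn.Module):
    """Minimal implicit field: encoded 3D point -> (RGB, density)."""
    def __init__(self, n_freqs=6, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * 2 * n_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 color channels + 1 density
        )

    def forward(self, pts):
        out = self.mlp(positional_encoding(pts))
        return torch.sigmoid(out[..., :3]), torch.relu(out[..., 3])

rgb, sigma = TinyField()(torch.rand(1024, 3))
print(rgb.shape, sigma.shape)  # torch.Size([1024, 3]) torch.Size([1024])
```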