Lluís Gómez

ORCID: 0000-0003-1408-9803
Research Areas
  • Advanced Image and Video Retrieval Techniques
  • Handwritten Text Recognition Techniques
  • Multimodal Machine Learning Applications
  • Image Retrieval and Classification Techniques
  • Domain Adaptation and Few-Shot Learning
  • Natural Language Processing Techniques
  • Video Analysis and Summarization
  • Image Processing and 3D Reconstruction
  • Topic Modeling
  • Human Pose and Action Recognition
  • Vehicle License Plate Recognition
  • Web Data Mining and Analysis
  • Music and Audio Processing
  • Hand Gesture Recognition Systems
  • Hate Speech and Cyberbullying Detection
  • Human Mobility and Location-Based Analysis
  • Internet Traffic Analysis and Secure E-voting
  • Mathematics, Computing, and Information Processing
  • Generative Adversarial Networks and Image Synthesis
  • Sentiment Analysis and Opinion Mining
  • Speech Recognition and Synthesis
  • Digital Media Forensic Detection
  • Power Systems and Technologies
  • Geography and Education Methods
  • Subtitles and Audiovisual Media

Affiliations

Universitat Autònoma de Barcelona
2015-2024

Nankai University
2024

Computer Vision Center
2013-2023

Barcelona Supercomputing Center
2019

Artifex University
2019

Arizona State University
1998

Results of the ICDAR 2015 Robust Reading Competition are presented. A new Challenge 4 on Incidental Scene Text has been added to the existing Challenges on Born-Digital Images, Focused Scene Images and Video Text. Challenge 4 is run on a newly acquired dataset of 1,670 images, evaluating Text Localisation, Word Recognition and End-to-End pipelines. In addition, the dataset for Challenge 3 has been substantially updated with more video sequences and more accurate ground truth data. Finally, tasks assessing end-to-end system performance have been introduced in all Challenges. The competition took...

10.1109/icdar.2015.7333942 article EN 2015-08-01
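Text localisation in these Robust Reading challenges is typically scored by matching detected boxes against ground truth with an intersection-over-union criterion; a minimal sketch of that core computation (the `(x1, y1, x2, y2)` box format and the 0.5 threshold are assumptions, not the competition's exact protocol):

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A detection usually counts as correct when IoU exceeds a threshold;
# 0.5 is a common choice, though individual competitions vary.
```

Real evaluation protocols add one-to-one matching between detections and ground truth and handle "don't care" regions, so this is only the kernel of the computation.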

This report presents the final results of the ICDAR 2013 Robust Reading Competition. The competition is structured in three Challenges addressing text extraction in different application domains, namely born-digital images, real scene images and real-scene videos. The Challenges are organised around specific tasks covering text localisation, text segmentation and word recognition. The competition took place in the first quarter of 2013, and received a total of 42 submissions over the tasks offered. This report describes the datasets and ground truth specification, and details the performance...

10.1109/icdar.2013.221 article EN 2013-08-01

Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting the high-level semantic information present in images as textual cues in the Visual Question Answering process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks that accounts both for reasoning...

10.1109/iccv.2019.00439 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01

In this work we target the problem of hate speech detection in multimodal publications formed by a text and an image. We gather and annotate a large-scale dataset from Twitter, MMHS150K, and propose different models that jointly analyze textual and visual information for hate speech detection, comparing them with unimodal detection. We provide quantitative and qualitative results, and analyze the challenges of the proposed task. We find that, even though images are useful for the detection task, current multimodal models cannot outperform models analyzing only text. We discuss why, and open the field...

10.1109/wacv45572.2020.9093414 article EN 2020-03-01

Scene text extraction methodologies are usually based on the classification of individual regions or patches, using a priori knowledge of a given script or language. Human perception of text, on the other hand, is based on perceptual organisation, through which text emerges as a perceptually significant group of atomic objects. Therefore humans are able to detect text even in languages and scripts never seen before. In this paper, we argue that the text extraction problem could be posed as the detection of meaningful groups of regions. We present a method built around...

10.1109/icdar.2013.100 article EN 2013-08-01
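The grouping idea above can be illustrated with a toy single-link clustering of region centroids; the actual method relies on richer perceptual similarity cues, and the plain Euclidean distance threshold here is a placeholder:

```python
import math

def group_regions(centroids, max_dist):
    """Single-link grouping: two regions belong to the same group whenever
    a chain of centroids, each within max_dist of the next, connects them.
    Implemented with a small union-find over region indices."""
    parent = list(range(len(centroids)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if math.dist(centroids[i], centroids[j]) <= max_dist:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(centroids)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

For example, with centroids `[(0, 0), (1, 0), (10, 0)]` and a threshold of 2, the first two regions form one group and the third stays alone.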

Current image captioning systems perform at a merely descriptive level, essentially enumerating the objects in the scene and their relations. Humans, on the contrary, interpret images by integrating several sources of prior knowledge of the world. In this work, we aim to take a step closer to producing captions that offer a plausible interpretation of the scene, by integrating such contextual information into the captioning pipeline. For this we focus on the captioning of images used to illustrate news articles. We propose a novel captioning method that is able to leverage contextual information provided by the text of the news articles associated...

10.1109/cvpr.2019.01275 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

End-to-end training from scratch of current deep architectures for new computer vision problems would require Imagenet-scale datasets, and this is not always possible. In this paper we present a method that is able to take advantage of freely available multi-modal content to train computer vision algorithms without human supervision. We put forward the idea of performing self-supervised learning of visual features by mining a large-scale corpus of multi-modal (text and image) documents. We show that discriminative visual features can be learnt efficiently by training a CNN to predict...

10.1109/cvpr.2017.218 preprint EN 2017-07-01

We present a hybrid algorithm for detection and tracking of text in natural scenes that goes beyond the full-detection approaches in terms of time performance optimization. A state-of-the-art scene text detection module based on Maximally Stable Extremal Regions (MSER) is used to detect text asynchronously, while in a separate thread detected text objects are tracked by MSER propagation. The cooperation of these two modules yields real-time video processing at high frame rates even on low-resource devices.

10.1109/icpr.2014.536 article EN 2014-08-01
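The paper runs detection and tracking asynchronously in two threads; the following is a deliberately simplified, synchronous sketch of the underlying detect-then-propagate pattern, with `detect` and `propagate` as hypothetical stand-ins for the MSER detector and the MSER-propagation tracker:

```python
def track_text(frames, detect, propagate, detect_every=5):
    """Run the (expensive) detector only every `detect_every` frames and
    propagate the most recent detections through the frames in between.
    Returns the list of box sets, one per frame."""
    boxes, tracked = [], []
    for i, frame in enumerate(frames):
        if i % detect_every == 0:
            boxes = detect(frame)            # full detection pass
        else:
            boxes = propagate(boxes, frame)  # cheap per-frame update
        tracked.append(boxes)
    return tracked
```

The real system decouples the two stages with a detection thread instead of a fixed `detect_every` interval, which is what lets tracking keep up with the video frame rate on low-resource devices.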

In this paper, we propose a Text-Degradation Invariant Auto Encoder (Text-DIAE), a self-supervised model designed to tackle two tasks: text recognition (handwritten or scene-text) and document image enhancement. We start by employing a transformer-based architecture that incorporates three pretext tasks as learning objectives to be optimized during pre-training without the usage of labelled data. Each pretext task is specifically tailored for the final downstream tasks. We conduct several ablation experiments that confirm...

10.1609/aaai.v37i2.25328 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2023-06-26

This report presents the final results of the ICDAR 2017 Robust Reading Challenge on COCO-Text, a challenge on scene text detection and recognition based on the largest real scene text dataset currently available: the COCO-Text dataset. The competition is structured around three tasks: Text Localization, Cropped Word Recognition and End-To-End Recognition. It received a total of 27 submissions over the different opened tasks. This report describes the datasets and ground truth, details the performance evaluation protocols used, and gives a brief summary...

10.1109/icdar.2017.234 article EN 2017-11-01

This paper focuses on the problem of script identification in unconstrained scenarios. Script identification is an important prerequisite to recognition, and an indispensable condition for automatic text understanding systems designed for multi-language environments. Although widely studied for document images and handwritten documents, it remains an almost unexplored territory for scene text images. We detail a novel method for script identification in natural scene images that combines convolutional features with a Naive-Bayes Nearest Neighbor classifier. The proposed framework...

10.1109/das.2016.64 article EN 2016-04-01

Perceiving text is crucial to understand the semantics of outdoor scenes, and hence a critical requirement to build intelligent systems for driver assistance and self-driving. Most of the existing datasets for text detection and recognition comprise still images, and are mostly compiled with text in mind. This paper introduces a new "RoadText-1K" dataset of driving videos, 20 times larger than the largest existing dataset for text in videos. It comprises 1,000 video clips of driving without any bias towards text, with annotations for text bounding boxes and transcriptions in every frame. State-of-the-art...

10.1109/icra40945.2020.9196577 article EN 2020-05-01

This paper presents the final results of the ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system up to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question/answer pairs, where the answer is always grounded on text instances present in the image. The images are taken from 7 different public computer vision datasets, covering a wide range of scenarios. The competition was...

10.1109/icdar.2019.00251 article EN 2019-09-01
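ST-VQA answers are scored with Average Normalized Levenshtein Similarity (ANLS); a sketch under the commonly used definition (threshold 0.5 and case-insensitive comparison are assumed here):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def anls(predictions, gt_answers, tau=0.5):
    """Average Normalized Levenshtein Similarity over all questions.
    gt_answers[i] is the list of accepted answers for question i; each
    prediction scores 1 - NL against its best-matching answer, or 0 when
    the normalized distance NL reaches the threshold tau."""
    total = 0.0
    for pred, answers in zip(predictions, gt_answers):
        best = 0.0
        for gt in answers:
            nl = levenshtein(pred.lower(), gt.lower()) / max(len(pred), len(gt), 1)
            best = max(best, 1 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```

The soft score rewards near-miss readings of scene text (a single-character OCR error costs only a fraction of a point) while the threshold zeroes out answers that are essentially wrong.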

Scene text instances found in natural images carry explicit semantic information that can provide important cues to solve a wide array of computer vision problems. In this paper, we focus on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval. First, we obtain textual cues from images by employing a text reading system. Then, we combine textual features with salient image regions to exploit the complementary information carried by the two sources. Specifically, we employ a Graph Convolutional Network...

10.1109/wacv48630.2021.00407 article EN 2021-01-01

Explaining an image with missing or non-existent objects is known as object bias (hallucination) in image captioning. This behaviour is quite common in state-of-the-art captioning models, and is not desirable to humans. To decrease object hallucination in captioning, we propose three simple yet efficient training augmentation methods for sentences that require no new data and no increase in model size. By extensive analysis, we show that the proposed methods can significantly diminish our models' object bias on hallucination metrics. Moreover,...

10.1109/wacv51458.2022.00253 article EN 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2022-01-01

10.1007/s10032-016-0274-2 article EN International Journal on Document Analysis and Recognition (IJDAR) 2016-09-24

Text contained in an image carries high-level semantics that can be exploited to achieve a richer image understanding. In particular, the mere presence of text provides strong guiding content that should be employed to tackle a diversity of computer vision tasks such as image retrieval, fine-grained classification, and visual question answering. In this paper, we address the problem of fine-grained classification and image retrieval by leveraging textual information along with visual cues to comprehend the existing intrinsic relation between the two modalities. The...

10.1109/wacv45572.2020.9093373 article EN 2020-03-01

Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions, to mention a few. This has resulted in an improved matching between the visual representation of an image and the textual representation of its caption. Yet, current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval. In this paper, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text...

10.1109/wacv48630.2021.00227 article EN 2021-01-01

In this paper we present the LSDE string representation and its application to handwritten word spotting. LSDE is a novel embedding approach for representing strings that learns a space in which distances between projected points are correlated with the Levenshtein edit distance between the original strings. We show how such a representation produces a more semantically interpretable retrieval from the user's perspective than other state-of-the-art ones such as PHOC and DCToW. We also conduct a preliminary word spotting experiment on the George Washington dataset.

10.1109/icdar.2017.88 article EN 2017-11-01