- Advanced Image and Video Retrieval Techniques
- Image Retrieval and Classification Techniques
- Handwritten Text Recognition Techniques
- Multimodal Machine Learning Applications
- Video Analysis and Summarization
- Natural Language Processing Techniques
- Image Processing and 3D Reconstruction
- Medical Image Segmentation Techniques
- Music and Audio Processing
- Domain Adaptation and Few-Shot Learning
- Topic Modeling
- Vehicle License Plate Recognition
- Advanced Vision and Imaging
- Data Management and Algorithms
- Human Pose and Action Recognition
- Anomaly Detection Techniques and Applications
- Information Retrieval and Search Behavior
- Image and Object Detection Techniques
- Advanced Text Analysis Techniques
- Biomedical Text Mining and Ontologies
- Text and Document Classification Technologies
- Visual Attention and Saliency Detection
- Generative Adversarial Networks and Image Synthesis
- Optical measurement and interference techniques
- Advanced Image Processing Techniques
Amazon (United States)
2017-2023
Amazon (Germany)
2019-2022
Technion – Israel Institute of Technology
2021
California Institute of Technology
2021
University of Massachusetts Amherst
2008-2017
Amherst College
2001-2014
Universitat Autònoma de Barcelona
2014
Defense Advanced Research Projects Agency
2003
University of Hawaii System
1987
Libraries have traditionally used manual image annotation for indexing and then later retrieving their image collections. However, manual image annotation is an expensive and labor-intensive procedure, and hence there has been great interest in coming up with automatic ways to retrieve images based on content. Here, we propose an automatic approach to annotating and retrieving images based on a training set of images. We assume that regions in an image can be described using a small vocabulary of blobs. Blobs are generated from image features using clustering. Given a training set of images with annotations, we show that probabilistic models...
The ability to learn richer network representations generally boosts the performance of deep learning models. To improve representation learning in convolutional neural networks, we present a multi-branch architecture, which applies channel-wise attention across different network branches to leverage the complementary strengths of both feature-map attention and multi-path representation. Our proposed Split-Attention module provides a simple and modular computation block that can serve as a drop-in replacement for the popular...
Deep embeddings answer one simple question: How similar are two images? Learning these embeddings is the bedrock of verification, zero-shot learning, and visual search. The most prominent approaches optimize a deep convolutional network with a suitable loss function, such as contrastive loss or triplet loss. While a rich line of work focuses solely on the loss functions, we show in this paper that selecting training examples plays an equally important role. We propose distance weighted sampling, which selects more...
Retrieving images in response to textual queries requires some knowledge of the semantics of the picture. Here, we show how we can do both automatic image annotation and retrieval (using one-word queries) from images and videos using a multiple Bernoulli relevance model. The model assumes that a training set of images or videos along with keyword annotations is provided. Multiple keywords are provided for an image, and the specific correspondence between a keyword and an image is not provided. Each image is partitioned into a set of rectangular regions, and a real-valued feature vector is computed over these...
Libraries and other institutions are interested in providing access to scanned versions of their large collections of handwritten historical manuscripts on electronic media. Convenient access to a collection requires an index, which is manually created at great labor and expense. Since current handwriting recognizers do not perform well on historical documents, a technique called word spotting has been developed: clusters with occurrences of the same word in a collection are established using image matching. By annotating "interesting" clusters, an index...
Training robust deep video representations has proven to be much more challenging than learning deep image representations. This is in part due to the enormous size of raw video streams and the high temporal redundancy; the true and interesting signal is often drowned in too much irrelevant data. Motivated by the fact that the superfluous information can be reduced by up to two orders of magnitude by video compression (using H.264, HEVC, etc.), we propose to train a deep network directly on the compressed video. This representation has a higher information density, and we found the training to be easier. In...
Keyword spotting refers to the process of retrieving all instances of a given keyword from a document. In the present paper, a novel keyword spotting method for handwritten documents is described. It is derived from a neural network-based system for unconstrained handwriting recognition. As such it performs template-free spotting, i.e., it is not necessary for a keyword to appear in the training set. The keyword spotting is done using a modification of the CTC Token Passing algorithm in conjunction with a recurrent neural network. We demonstrate that the proposed systems outperform not only a classical...
We present DocFormer, a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model...
Scene Text Recognition (STR), the task of recognizing text against complex image backgrounds, is an active area of research. Current state-of-the-art (SOTA) methods still struggle to recognize text written in arbitrary shapes. In this paper, we introduce a novel architecture for STR, named Selective Context ATtentional Text Recognizer (SCATTER). SCATTER utilizes a stacked block architecture with intermediate supervision during training, which paves the way to successfully train a deep BiLSTM encoder, thus improving the encoding...
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). The VDU domain entails understanding documents (beyond mere OCR predictions), e.g., extracting information from a form, VQA for documents, and other tasks. VDU is challenging as it needs a model to make sense of multiple modalities (visual, language and spatial) to make a prediction. Our approach, termed DocFormerv2, is an encoder-decoder transformer which takes vision, language and spatial features as input. DocFormerv2 is pre-trained with unsupervised tasks employed asymmetrically...
A robust system is proposed to automatically detect and extract text in images from different sources, including video, newspapers, advertisements, stock certificates, photographs, and checks. Text is first detected using multiscale texture segmentation and spatial cohesion constraints, then cleaned up and extracted using a histogram-based binarization algorithm. An automatic performance evaluation scheme is also proposed.
There are many historical manuscripts written in a single hand which it would be useful to index. Examples include the W.B. DuBois collection at the University of Massachusetts and the early Presidential libraries at the Library of Congress. Since Optical Character Recognition (OCR) does not work well on handwriting, an alternative scheme based on matching the images of the words is proposed for indexing such texts. The current paper deals with the matching aspects of this process. Two different techniques for matching words are discussed. The first method matches...
Most offline handwriting recognition approaches proceed by segmenting words into smaller pieces (usually characters) which are recognized separately. The recognition result of a word is then the composition of the individually recognized parts. Inspired by results in cognitive psychology, researchers have begun to focus on holistic word recognition approaches. Here we present a holistic word recognition approach for single-author historical documents, which is motivated by the fact that for severely degraded documents a segmentation of words into characters will produce very poor results. The quality...
Challenges in information retrieval and language modeling: report of a workshop held at the Center for Intelligent Information Retrieval, University of Massachusetts Amherst, September 2002. Authors: James Allan, Jay Aslam, Nicholas Belkin, Chris Buckley, Jamie Callan, Bruce Croft, Sue Dumais, Norbert Fuhr, Donna Harman, David J. Harper, Djoerd Hiemstra, Thomas Hofmann, Eduard Hovy, Wessel Kraaij, John Lafferty, Victor Lavrenko, Lewis, Liz Liddy, R. Manmatha, Andrew McCallum, Ponte, Prager...
In this paper the score distributions of a number of text search engines are modeled. It is shown empirically that the score distributions on a per-query basis may be fitted using an exponential distribution for the set of non-relevant documents and a normal distribution for the set of relevant documents. Experiments show that this model fits TREC-3 and TREC-4 data for not only probabilistic search engines like INQUERY but also vector space search engines like SMART for English. We have also used this model to fit the output of other search engines like LSI search engines and search engines indexing other languages like Chinese.
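The score-distribution model in the abstract above can be illustrated with a minimal sketch. It assumes relevance labels are already known for a handful of scores, fits the two distributions by maximum likelihood, and combines them with Bayes' rule. The function names, the toy scores, and the prior `prior_rel` are all hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, pstdev


def fit_exponential(scores):
    """Maximum-likelihood rate of an exponential distribution (1 / sample mean)."""
    return 1.0 / mean(scores)


def fit_normal(scores):
    """Maximum-likelihood mean and standard deviation of a normal distribution."""
    return mean(scores), pstdev(scores)


def exp_pdf(x, rate):
    """Exponential density; zero for negative scores."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def prob_relevant(score, rate, mu, sigma, prior_rel):
    """Posterior probability of relevance for one score via Bayes' rule."""
    rel = prior_rel * normal_pdf(score, mu, sigma)
    non = (1.0 - prior_rel) * exp_pdf(score, rate)
    return rel / (rel + non)


# Toy, made-up scores for a single query (assumed labels, for illustration only).
non_relevant_scores = [0.05, 0.10, 0.15, 0.20, 0.50]
relevant_scores = [0.6, 0.7, 0.8]

rate = fit_exponential(non_relevant_scores)
mu, sigma = fit_normal(relevant_scores)

# Assumed prior: half of the retrieved documents are relevant.
p_high = prob_relevant(0.7, rate, mu, sigma, prior_rel=0.5)
p_low = prob_relevant(0.1, rate, mu, sigma, prior_rel=0.5)
```

With these toy numbers a score near the relevant mean receives a much higher posterior than a score in the exponential tail. In practice the labels are usually unknown, so the two components would have to be estimated as a mixture, which this sketch does not attempt.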
Finding text in images. Authors: Victor Wu (Computer Science Department, University of Massachusetts, Amherst, MA), R. Manmatha, Edward M. Riseman. DL '97: Proceedings of the second ACM international conference on Digital libraries, July 1997, pages 3–12. https://doi.org/10.1145/263690.263766. Published: 01 July 1997.
For the transition from traditional to digital libraries, the large number of handwritten manuscripts that exist pose a great challenge. Easy access to such collections requires an index, which is currently created manually at great cost. Because automatic handwriting recognizers fail on historical manuscripts, the word spotting technique has been developed: the words in a collection are matched as images and grouped into clusters which contain all instances of the same word. By annotating "interesting" clusters, an index that links...
Many libraries, museums, and other organizations contain large collections of handwritten historical documents, for example, the papers of early presidents like George Washington at the Library of Congress. The first step in providing recognition/retrieval tools is to automatically segment handwritten pages into words. State-of-the-art segmentation techniques like the gap metrics algorithm have been mostly developed and tested on highly constrained documents like bank checks and postal addresses. There has been little work on full handwritten pages, and this work has usually...
We propose simple and effective models for image annotation that make use of Convolutional Neural Network (CNN) features extracted from an image and word embedding vectors to represent their associated tags. Our first set of models is based on the Canonical Correlation Analysis (CCA) framework, which helps in modeling both views of the data: visual features (CNN features) and textual features (word embedding vectors). Results on all three variants of the CCA models, namely linear CCA, kernel CCA and CCA with k-nearest neighbor (CCA-KNN) clustering, are reported. The best...
Video action recognition is one of the representative tasks for video understanding. Over the last decade, we have witnessed great advancements in video action recognition thanks to the emergence of deep learning. But we have also encountered new challenges, including modeling long-range temporal information in videos, high computation costs, and incomparable results due to datasets and evaluation protocol variances. In this paper, we provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition. We first introduce the 17 video action recognition datasets that...
We propose a framework for sequence-to-sequence contrastive learning (SeqCLR) of visual representations, which we apply to text recognition. To account for the sequence-to-sequence structure, each feature map is divided into different instances over which the contrastive loss is computed. This operation enables us to contrast at the sub-word level, where from each image we extract several positive pairs and multiple negative examples. To yield effective visual representations for text recognition, we further suggest novel augmentation heuristics, different encoder architectures and custom...