- Topic Modeling
- Natural Language Processing Techniques
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Machine Learning and Data Classification
- Advanced Neural Network Applications
- Anomaly Detection Techniques and Applications
- Speech Recognition and Synthesis
- Advanced Image and Video Retrieval Techniques
- Neural Networks and Applications
- Information Retrieval and Search Behavior
- Computational and Text Analysis Methods
- Image Retrieval and Classification Techniques
- Web Data Mining and Analysis
- Multidisciplinary Science and Engineering Research
- Software System Performance and Reliability
- Mechanical and Optical Resonators
- Force Microscopy Techniques and Applications
- Machine Learning and Algorithms
- Infrastructure Maintenance and Monitoring
- Artificial Intelligence in Games
- Cloud Computing and Resource Management
- Cancer-related molecular mechanisms research
- Remote-Sensing Image Classification
- Spam and Phishing Detection
Google (United States)
2018-2023
Université de Montréal
2021
Centre Universitaire de Mila
2021
Lawrence Berkeley National Laboratory
2014
Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing for example, Transformers have become an indispensable staple in the modern deep learning stack. Recently, a dizzying number of "X-former" models have been proposed - Reformer, Linformer, Performer, Longformer, to name a few - which improve upon the original Transformer architecture, many of which make improvements around...
The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose \textsc{Synthesizer}, a...
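A minimal sketch of the "random alignment matrix" idea the abstract describes: the attention weights come from a learned (optionally frozen) L x L matrix rather than query-key dot products. This is an illustration under assumed shapes and names, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomSynthesizerAttention(nn.Module):
    """Attention whose alignment matrix is a parameter, not computed from Q and K."""

    def __init__(self, max_len: int, d_model: int, trainable: bool = True):
        super().__init__()
        # This matrix plays the role of softmax(QK^T / sqrt(d)); it can be frozen.
        self.attn_logits = nn.Parameter(torch.randn(max_len, max_len),
                                        requires_grad=trainable)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, L, d_model)
        L = x.size(1)
        weights = F.softmax(self.attn_logits[:L, :L], dim=-1)
        return weights @ self.value(x)             # (batch, L, d_model)
```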
Transformers do not scale very well to long sequence lengths, largely because of quadratic self-attention complexity. In recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on tasks and datasets makes it difficult to assess relative model quality amongst the many proposals. This paper proposes...
Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes from pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as...
We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend. Our method is based on differentiable sorting of internal representations. Concretely, we introduce a meta sorting network that learns to generate latent permutations over sequences. Given sorted sequences, we are then able to compute quasi-global attention with only local windows, improving the memory efficiency of the attention module. To this end, we propose algorithmic innovations such as Causal Sinkhorn Balancing and SortCut, a dynamic sequence truncation...
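An illustrative sketch of the Sinkhorn-normalization step behind the "latent permutations" mentioned above: iteratively normalizing the rows and columns of a score matrix in log space yields an approximately doubly-stochastic (soft permutation) matrix over sequence blocks. The function name and iteration count are assumptions.

```python
import torch

def sinkhorn_sort(scores: torch.Tensor, n_iters: int = 8) -> torch.Tensor:
    """scores: (n_blocks, n_blocks) raw sorting logits from a meta network."""
    log_alpha = scores
    for _ in range(n_iters):
        # Alternate row and column normalization in log space for stability.
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=-1, keepdim=True)
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=-2, keepdim=True)
    return log_alpha.exp()   # rows and columns each sum to approximately 1
```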
Despite the recent success of multi-task learning and transfer learning for natural language processing (NLP), few works have systematically studied the effect of scaling up the number of tasks during pre-training. Towards this goal, this paper introduces ExMix (Extreme Mixture): a massive collection of 107 supervised NLP tasks across diverse domains and task-families. Using ExMix, we study multi-task pre-training at the largest scale to date, and analyze co-training transfer amongst common families of tasks. Through this analysis, we show that manually curating an...
In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model which maps string queries directly to relevant docids; in other words, a DSI model answers queries using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, training procedures,...
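A rough sketch of the training recipe the abstract describes: one seq2seq model is trained both to map document text to a string docid (indexing) and to map queries to docids (retrieval). The model checkpoint, example strings, and helper function below are placeholders, not the paper's setup.

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def seq2seq_loss(source: str, target: str):
    """Standard teacher-forced cross-entropy loss for one (source, target) pair."""
    inputs = tok(source, return_tensors="pt", truncation=True)
    labels = tok(target, return_tensors="pt").input_ids
    return model(**inputs, labels=labels).loss

# Indexing example: document text -> its string docid.
loss_index = seq2seq_loss("contents of an example document", "42")
# Retrieval example: a query -> the docid of a relevant document.
loss_retrieve = seq2seq_loss("an example query about that document", "42")
(loss_index + loss_retrieve).backward()
```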
State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block...
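A condensed sketch of the block-scoring idea described above: candidate blocks of several sizes are formed by pooling character embeddings, scored position-wise, and combined by a softmax over block sizes. Downsampling and other details of the real module are omitted, and all names here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGBST(nn.Module):
    def __init__(self, d_model: int, block_sizes=(1, 2, 3, 4)):
        super().__init__()
        self.block_sizes = block_sizes
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, char_emb):                   # (batch, L, d_model)
        candidates, scores = [], []
        for b in self.block_sizes:
            # Mean-pool blocks of size b, then broadcast back to length L.
            pooled = F.avg_pool1d(char_emb.transpose(1, 2), b, stride=b,
                                  ceil_mode=True)
            pooled = pooled.repeat_interleave(b, dim=-1)[..., :char_emb.size(1)]
            pooled = pooled.transpose(1, 2)        # (batch, L, d_model)
            candidates.append(pooled)
            scores.append(self.scorer(pooled))     # (batch, L, 1)
        weights = torch.softmax(torch.stack(scores, dim=-1), dim=-1)  # over sizes
        blocks = torch.stack(candidates, dim=-1)   # (batch, L, d_model, n_sizes)
        return (blocks * weights).sum(dim=-1)      # soft mixture of block sizes
```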
When experiencing an information need, users want to engage with a domain expert, but often turn to an information retrieval system, such as a search engine, instead. Classical information retrieval systems do not answer information needs directly, but instead provide references to (hopefully authoritative) answers. Successful question answering systems offer a limited corpus created on-demand by human experts, which is neither timely nor scalable. Pre-trained language models, by contrast, are capable of directly generating prose that may be responsive to an information need, but at present...
Yi Tay, Mostafa Dehghani, Jai Prakash Gupta, Vamsi Aribandi, Dara Bahri, Zhen Qin, Donald Metzler. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
Recent advances in Transformer-based large language models (LLMs) have led to significant performance improvements across many tasks. These gains come with a drastic increase in the models' size, potentially leading to slow and costly use at inference time. In practice, however, the series of generations made by LLMs is composed of varying levels of difficulty. While certain predictions truly benefit from the models' full capacity, other continuations are more trivial and can be solved with reduced compute. In this work, we...
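A toy sketch of the early-exit intuition in the abstract (not the paper's actual method): after each decoder layer, a lightweight readout estimates confidence in the current token prediction, and computation stops at the first layer whose confidence clears a threshold. All names, shapes, and the confidence measure are assumptions.

```python
import torch
import torch.nn as nn

class EarlyExitStack(nn.Module):
    def __init__(self, layers: nn.ModuleList, lm_head: nn.Linear, threshold: float = 0.9):
        super().__init__()
        self.layers, self.lm_head, self.threshold = layers, lm_head, threshold

    @torch.no_grad()
    def forward(self, hidden):                     # hidden: (1, d_model)
        for depth, layer in enumerate(self.layers):
            hidden = layer(hidden)
            probs = torch.softmax(self.lm_head(hidden), dim=-1)
            if probs.max().item() >= self.threshold:       # easy token: exit early
                return probs.argmax(dim=-1), depth + 1
        return probs.argmax(dim=-1), len(self.layers)      # hard token: full depth
```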
The allure of superhuman-level capabilities has led to considerable interest in language models like GPT-3 and T5, wherein the research has, by and large, revolved around new model architectures, training tasks, and loss objectives, along with substantial engineering efforts to scale up model capacity and dataset size. Comparatively little work has been done to improve the generalization of these models through better optimization. In this work, we show that Sharpness-Aware Minimization (SAM), a recently proposed optimization...
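For context, a minimal sketch of the generic SAM update referenced above: perturb the weights a distance rho along the current gradient direction (toward higher loss), compute the gradient at the perturbed point, then apply the base optimizer step from the original weights. This is a generic illustration, not the paper's training setup.

```python
import torch

def sam_step(model, loss_fn, base_optimizer, rho: float = 0.05):
    loss_fn(model).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.clone() for p in params]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads])).item() + 1e-12

    # Step 1: perturb weights toward higher loss (the "sharpness" direction).
    for p, g in zip(params, grads):
        p.data.add_(g, alpha=rho / grad_norm)

    # Step 2: gradient at the perturbed weights.
    model.zero_grad()
    loss_fn(model).backward()

    # Step 3: restore original weights, then step with the new gradient.
    for p, g in zip(params, grads):
        p.data.sub_(g, alpha=rho / grad_norm)
    base_optimizer.step()
    model.zero_grad()
```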
Self-supervised contrastive representation learning has proved incredibly successful in the vision and natural language domains, enabling state-of-the-art performance with orders of magnitude less labeled data. However, such methods are domain-specific and little has been done to leverage this technique on real-world tabular datasets. We propose SCARF, a simple, widely-applicable technique for contrastive learning, where views are formed by corrupting a random subset of features. When applied to pre-train deep neural networks on the 69...
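A sketch of the view-construction step described above: a corrupted view of each row is made by replacing a random subset of its features with values drawn from those features' empirical marginal distributions; the original and corrupted views then serve as a positive pair for a standard contrastive loss. The corruption rate and names below are illustrative assumptions.

```python
import numpy as np

def scarf_corrupt(batch: np.ndarray, full_data: np.ndarray, rate: float = 0.6):
    """batch: (B, F) rows to corrupt; full_data: (N, F) table to sample from."""
    n_rows, n_feats = batch.shape
    corrupted = batch.copy()
    mask = np.random.rand(n_rows, n_feats) < rate        # features to replace
    # For each (row, feature), draw a replacement from that feature's column.
    random_rows = np.random.randint(0, full_data.shape[0], size=(n_rows, n_feats))
    replacements = full_data[random_rows, np.arange(n_feats)]
    corrupted[mask] = replacements[mask]
    return corrupted                                       # second "view" of batch
```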
Language models have recently been shown capable of performing regression tasks wherein numeric predictions are represented as decoded strings. In this work, we provide theoretical grounds for this capability and furthermore investigate the utility of causal auto-regressive sequence models when they are applied to any feature representation. We find that, despite being trained in the usual way - next-token prediction via cross-entropy loss - decoding-based regression is as performant as traditional approaches on tabular regression tasks, while...
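A toy illustration of "numeric predictions represented as decoded strings": a target value is serialized as a sign/digit/exponent token sequence, the decoder is trained with ordinary cross-entropy over those tokens, and predictions are recovered by parsing the decoded string. The specific tokenization scheme below is an assumption, not the paper's.

```python
def encode_number(x: float, mantissa_digits: int = 4) -> list[str]:
    """Serialize a float into sign, mantissa-digit, and exponent tokens."""
    sign = "+" if x >= 0 else "-"
    mantissa, exponent = f"{abs(x):e}".split("e")
    digits = mantissa.replace(".", "")[:mantissa_digits]
    return [sign, *digits, "E", str(int(exponent))]

def decode_number(tokens: list[str]) -> float:
    """Parse the token sequence back into a float prediction."""
    sign = 1.0 if tokens[0] == "+" else -1.0
    e_idx = tokens.index("E")
    digits = "".join(tokens[1:e_idx])
    mantissa = float(digits[0] + "." + digits[1:])
    return sign * mantissa * 10 ** int(tokens[e_idx + 1])

assert abs(decode_number(encode_number(-12.34)) + 12.34) < 1e-2
```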
Yikang Shen, Yi Tay, Che Zheng, Dara Bahri, Donald Metzler, Aaron Courville. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
In the era of pre-trained language models, Transformers are the de facto choice of model architecture. While recent research has shown promise in entirely convolutional, or CNN-based, architectures, they have not been explored using the pre-train-fine-tune paradigm. In this context, are convolutional models competitive to Transformers when pre-trained? This paper investigates this question and presents several interesting findings. Across an extensive set of experiments on 8 datasets/tasks, we find that CNN-based pre-trained models outperform their...
Recent advances in neural text generation modeling have resulted in a number of societal concerns related to how such approaches might be used in malicious ways. It is therefore desirable to develop a deeper understanding of the fundamental properties of such models. The study of artifacts that emerge in machine generated text as a result of modeling choices is a nascent research area. To this end, the extent and degree to which these artifacts surface in generated text is still unclear. In the spirit of better understanding generative models and their artifacts, we propose the new task of distinguishing...
Work in information retrieval has traditionally focused on ranking and relevance: given a query, return some number of results ordered by relevance to the user. However, the problem of determining how many results to return, i.e. how to optimally truncate the ranked result list, has received less attention despite being of critical importance in a range of applications. Such truncation is a balancing act between the overall relevance, or usefulness, of the results and the user cost of processing more results. In this work, we propose Choppy, an...
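A small sketch of the truncation problem described above: given per-result relevance probabilities, pick the cutoff that maximizes an expected utility trading relevance against the cost of reading more results. The linear cost model here is an illustrative assumption, not Choppy itself, which learns the cutoff from the score sequence.

```python
def best_cutoff(relevance_probs, cost_per_result: float = 0.3) -> int:
    """Return the list length that maximizes cumulative (relevance - cost)."""
    best_k, best_utility, utility = 0, 0.0, 0.0
    for k, p in enumerate(relevance_probs, start=1):
        utility += p - cost_per_result        # gain from result k minus user cost
        if utility > best_utility:
            best_k, best_utility = k, utility
    return best_k

print(best_cutoff([0.9, 0.8, 0.4, 0.2, 0.1]))   # -> 3
```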