Ben Athiwaratkun

ORCID: 0000-0002-2009-496X

Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Machine Learning and Data Classification
  • Advanced Neural Network Applications
  • Software Engineering Research
  • Advanced Text Analysis Techniques
  • Auction Theory and Applications
  • Multi-Agent Systems and Negotiation
  • Mathematical Dynamics and Fractals
  • Parallel Computing and Optimization Techniques
  • Text Readability and Simplification
  • Sentiment Analysis and Opinion Mining
  • Organizational Management and Leadership
  • Handwritten Text Recognition Techniques
  • Scientific Computing and Data Management
  • Text and Document Classification Technologies
  • Advanced Topology and Set Theory
  • Gaussian Processes and Bayesian Inference
  • Speech and dialogue systems
  • Limits and Structures in Graph Theory
  • Advanced Bandit Algorithms Research
  • Video Analysis and Summarization
  • Functional Equations Stability Results

Affiliations

Amazon (United States)
2021

Amazon (Germany)
2020-2021

Allen Institute for Artificial Intelligence
2021

Cornell University
2017-2019

California Institute of Technology
2018

Publications

In recent years great success has been achieved in sentiment classification for English, thanks in part to the availability of copious annotated resources. Unfortunately, most languages do not enjoy such an abundance of labeled data. To tackle the low-resource problem for languages without adequate annotated data, we propose the Adversarial Deep Averaging Network (ADAN) to transfer the knowledge learned from labeled data on a resource-rich source language to a low-resource language where only unlabeled data exist. ADAN has two discriminative branches: a sentiment classifier and...

10.1162/tacl_a_00039 article EN cc-by Transactions of the Association for Computational Linguistics 2018-12-01
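
As a rough illustration of the two-branch setup described above, the following is a minimal sketch (hypothetical layer sizes and vocabulary, not the authors' released code) of a deep averaging encoder feeding a sentiment classifier and a language discriminator through a gradient-reversal layer, so the shared features become language-invariant.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients flowing back into the shared encoder.
        return -ctx.lambd * grad_output, None

class ADAN(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=300, hidden=300, n_classes=2):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, emb_dim, mode="mean")  # deep averaging
        self.encoder = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.sentiment = nn.Linear(hidden, n_classes)   # trained on source-language labels
        self.language = nn.Linear(hidden, 2)            # source vs. target discriminator

    def forward(self, token_ids, lambd=1.0):
        h = self.encoder(self.embed(token_ids))
        return self.sentiment(h), self.language(GradReverse.apply(h, lambd))

model = ADAN()
sent_logits, lang_logits = model(torch.randint(0, 10000, (4, 20)))
print(sent_logits.shape, lang_logits.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```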

Malicious software, or malware, continues to be a problem for computer users, corporations, and governments. Previous research [1] has explored training file-based malware classifiers using a two-stage approach. In the first stage, a language model is used to learn a feature representation, which is then input to a second-stage classifier. In Pascanu et al. [1], the language model is either a standard recurrent neural network (RNN) or an echo state network (ESN). In this work, we propose several new classification architectures that include long...

10.1109/icassp.2017.7952603 article EN 2017-03-01
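
A hedged sketch of the two-stage idea on synthetic data (illustrative sizes; the EventLSTM name and pooling choices are assumptions): stage one trains a recurrent language model over event sequences, stage two trains a separate classifier on its pooled hidden states.

```python
import torch
import torch.nn as nn

class EventLSTM(nn.Module):
    def __init__(self, n_events=128, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_events, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.next_event = nn.Linear(hidden, n_events)   # stage-1 language-model head

    def forward(self, seq):
        h, _ = self.lstm(self.embed(seq))
        return self.next_event(h), h

lm = EventLSTM()
seq = torch.randint(0, 128, (8, 50))            # 8 files, 50 events each (synthetic)

# Stage 1: next-event prediction (language modeling) loss.
logits, hidden = lm(seq[:, :-1])
lm_loss = nn.functional.cross_entropy(logits.reshape(-1, 128), seq[:, 1:].reshape(-1))

# Stage 2: pool hidden states into a fixed feature vector, then train a
# separate malware/benign classifier on those features.
features = torch.cat([hidden.mean(dim=1), hidden.max(dim=1).values], dim=-1).detach()
classifier = nn.Linear(features.shape[-1], 2)
cls_logits = classifier(features)
print(lm_loss.item(), cls_logits.shape)
```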

We introduce Probabilistic FastText, a new model for word embeddings that can capture multiple word senses, sub-word structure, and uncertainty information. In particular, we represent each word with a Gaussian mixture density, where the mean of a mixture component is given by the sum of n-gram vectors. This representation allows the model to share statistical strength across sub-word structures (e.g. Latin roots), producing accurate representations of rare, misspelt, or even unseen words. Moreover, each component can capture a different word sense. Probabilistic FastText outperforms both FastText, which has no...

10.18653/v1/p18-1001 article EN cc-by Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018-01-01
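
A minimal sketch of the component-mean construction (hypothetical hashing scheme, bucket count, and dimensions): the mean of a word's mixture component is built from its character n-gram vectors, so misspelt or unseen words still land near their correct neighbors.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, BUCKETS = 50, 2**16
ngram_table = rng.normal(scale=0.1, size=(BUCKETS, DIM))   # shared n-gram vectors

def char_ngrams(word, n_min=3, n_max=5):
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

def component_mean(word):
    # Hash each n-gram into a bucket and combine the corresponding vectors
    # (averaged here for scale; the model uses a sum of n-gram vectors).
    idx = [hash(g) % BUCKETS for g in char_ngrams(word)]
    return ngram_table[idx].sum(axis=0) / max(len(idx), 1)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Misspelt variants share most n-grams, so their component means stay close.
print(cosine(component_mean("representation"), component_mean("reprsentation")))
print(cosine(component_mean("representation"), component_mean("zebra")))
```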

Presently the most successful approaches to semi-supervised learning are based on consistency regularization, whereby a model is trained to be robust to small perturbations of its inputs and parameters. To understand consistency regularization, we conceptually explore how loss geometry interacts with training procedures. The consistency loss dramatically improves generalization performance over supervised-only training; however, we show that SGD struggles to converge on the consistency loss and continues to make large steps that lead to changes in predictions on the test data. Motivated by...

10.48550/arxiv.1806.05594 preprint EN other-oa arXiv (Cornell University) 2018-01-01
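
A toy illustration of the two ingredients discussed above (synthetic data and a deliberately small model, not the paper's experimental setup): a consistency loss that penalizes prediction changes under input perturbation, plus a running average of the weights along the SGD trajectory.

```python
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
avg_model = copy.deepcopy(model)            # running weight average
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x_unlabeled = torch.randn(64, 10)
for step in range(1, 101):
    # Consistency regularization: predictions should match under small input noise.
    noisy = x_unlabeled + 0.1 * torch.randn_like(x_unlabeled)
    p_clean = torch.softmax(model(x_unlabeled), dim=-1)
    p_noisy = torch.softmax(model(noisy), dim=-1)
    loss = ((p_clean - p_noisy) ** 2).sum(dim=-1).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

    # Simple running average of weights over the SGD iterates.
    with torch.no_grad():
        for p_avg, p in zip(avg_model.parameters(), model.parameters()):
            p_avg.mul_((step - 1) / step).add_(p / step)

print("final consistency loss:", loss.item())
```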

Word embeddings provide point representations of words containing useful semantic information. We introduce multimodal word distributions formed from Gaussian mixtures, for multiple word meanings, entailment, and rich uncertainty information. To learn these distributions, we propose an energy-based max-margin objective. We show that the resulting approach captures uniquely expressive semantic information and outperforms alternatives, such as word2vec skip-grams and Gaussian embeddings, on benchmark datasets for word similarity and entailment.

10.18653/v1/p17-1151 article EN cc-by Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2017-01-01
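
A small sketch of an energy of this kind (spherical covariances and toy parameters; not the released training code): the expected likelihood kernel between two Gaussian mixtures, used inside a max-margin ranking loss that scores similar word pairs above negative pairs.

```python
import numpy as np

def log_elk_gaussian(mu1, var1, mu2, var2):
    # log inner product of two spherical Gaussians: log N(mu1 - mu2; 0, (var1 + var2) I)
    d = mu1.shape[0]
    v = var1 + var2
    diff = mu1 - mu2
    return -0.5 * (d * np.log(2 * np.pi * v) + diff @ diff / v)

def energy(mix1, mix2):
    # mix = list of (weight, mean, variance); log-sum-exp over component pairs.
    terms = [np.log(w1) + np.log(w2) + log_elk_gaussian(m1, v1, m2, v2)
             for w1, m1, v1 in mix1 for w2, m2, v2 in mix2]
    return np.logaddexp.reduce(terms)

def margin_loss(e_pos, e_neg, margin=1.0):
    # Push the energy of a similar pair above that of a negative pair.
    return max(0.0, margin - e_pos + e_neg)

rng = np.random.default_rng(0)
mk = lambda: [(0.5, rng.normal(size=8), 1.0), (0.5, rng.normal(size=8), 1.0)]
rock, stone, car = mk(), mk(), mk()
print(margin_loss(energy(rock, stone), energy(rock, car)))
```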

Convolutional Neural Networks (CNNs) are powerful models that achieve impressive results for image classification. In addition, pre-trained CNNs are also useful for other computer vision tasks as generic feature extractors. This paper aims to gain insight into the feature aspect of CNNs and demonstrate other uses of CNN features. Our results show that CNN feature maps can be used with Random Forests and SVMs to yield classification results that outperform the original CNN. A CNN that is less than optimal (e.g. not fully trained or overfitting) can still extract features for a Random Forest/SVM...

10.48550/arxiv.1507.02313 preprint EN other-oa arXiv (Cornell University) 2015-01-01
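
A hedged sketch of this pipeline (the choice of ResNet-18, which downloads pretrained weights, and the synthetic images and labels are stand-ins): a pre-trained CNN is used purely as a feature extractor, and a Random Forest and an SVM are trained on the extracted features.

```python
import torch
import torchvision.models as models
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Pre-trained CNN used purely as a generic feature extractor.
cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()          # drop the classification head
cnn.eval()

# Stand-in data: 32 images, 2 classes (replace with a real dataset).
images = torch.randn(32, 3, 224, 224)
labels = torch.randint(0, 2, (32,)).numpy()

with torch.no_grad():
    features = cnn(images).numpy()    # (32, 512) feature vectors

# Classical classifiers trained on CNN features.
rf = RandomForestClassifier(n_estimators=100).fit(features, labels)
svm = SVC(kernel="linear").fit(features, labels)
print(rf.score(features, labels), svm.score(features, labels))
```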

In recent years great success has been achieved in sentiment classification for English, thanks in part to the availability of copious annotated resources. Unfortunately, most languages do not enjoy such an abundance of labeled data. To tackle the low-resource problem for languages without adequate annotated data, we propose the Adversarial Deep Averaging Network (ADAN) to transfer the knowledge learned from labeled data on a resource-rich source language to a low-resource language where only unlabeled data exist. ADAN has two discriminative branches: a sentiment classifier and an adversarial...

10.48550/arxiv.1606.01614 preprint EN other-oa arXiv (Cornell University) 2016-01-01

We propose a generative framework for joint sequence labeling and sentence-level classification. Our model performs multiple sequence labeling tasks at once using a single, shared natural language output space. Unlike prior discriminative methods, our model naturally incorporates label semantics and shares knowledge across tasks. Our framework is general purpose, performing well in few-shot, low-resource, and high-resource settings. We demonstrate these advantages on popular named entity recognition, slot labeling, and intent classification benchmarks....

10.18653/v1/2020.emnlp-main.27 article EN cc-by Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020-01-01
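
An illustrative sketch of a shared natural-language output space (the bracket format and helper names are invented, not the paper's templates): both sequence labeling and sentence classification targets are rendered as text that a single seq2seq model could generate.

```python
def ner_target(tokens, spans):
    # spans: list of (start, end, label); emit the sentence with inline labels.
    out, i = [], 0
    for start, end, label in sorted(spans):
        out += tokens[i:start] + ["[", *tokens[start:end], "|", label, "]"]
        i = end
    return " ".join(out + tokens[i:])

def intent_target(intent):
    # Sentence-level classification expressed in the same text output space.
    return f"the intent is {intent}"

tokens = "book a flight to boston tomorrow".split()
print(ner_target(tokens, [(4, 5, "city"), (5, 6, "date")]))
# book a flight to [ boston | city ] [ tomorrow | date ]
print(intent_target("book_flight"))
# the intent is book_flight
```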

Large language models are increasingly becoming a cornerstone technology in artificial intelligence, the sciences, and society as a whole, yet optimal strategies for dataset composition and filtering remain largely elusive. Many of the top-performing models lack transparency in their dataset curation and model development processes, posing an obstacle to the development of fully open models. In this paper, we identify three core data-related challenges that must be addressed to advance open-source language models. These include (1) transparency in model development, including data...

10.48550/arxiv.2411.12372 preprint EN arXiv (Cornell University) 2024-11-19

Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents...

10.48550/arxiv.2406.04692 preprint EN arXiv (Cornell University) 2024-06-07
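
A minimal sketch of a layered Mixture-of-Agents loop (the query_llm helper and model names are placeholders, not a real API): each layer's agents see the prompt plus all responses from the previous layer, and a final aggregator synthesizes the answer.

```python
def query_llm(model: str, prompt: str) -> str:
    # Placeholder for a real chat-completion call to `model`.
    return f"[{model}] draft answer to: {prompt[:40]}..."

def mixture_of_agents(prompt, layers, aggregator):
    responses = []
    for layer in layers:                       # e.g. [["model-a", "model-b"], ...]
        context = prompt
        if responses:
            refs = "\n".join(f"- {r}" for r in responses)
            context = f"{prompt}\n\nPrevious agent responses:\n{refs}\n\nSynthesize an improved answer."
        responses = [query_llm(m, context) for m in layer]
    # Final aggregation step over the last layer's outputs.
    refs = "\n".join(f"- {r}" for r in responses)
    return query_llm(aggregator, f"{prompt}\n\nCandidate answers:\n{refs}\n\nProduce the best final answer.")

print(mixture_of_agents("Explain KV caching in LLM inference.",
                        layers=[["model-a", "model-b"], ["model-c", "model-d"]],
                        aggregator="model-e"))
```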

We present new benchmarks for the evaluation of code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into corresponding data in the target language. Using these benchmarks, we are able to assess the performance of code generation models in a multi-lingual fashion, and we discover the generalization ability of language models on out-of-domain languages, the advantages of multi-lingual over mono-lingual,...

10.48550/arxiv.2210.14868 preprint EN cc-by arXiv (Cornell University) 2022-01-01

By representing words with probability densities rather than point vectors, probabilistic word embeddings can capture rich and interpretable semantic information and uncertainty. The uncertainty information can be particularly meaningful in capturing entailment relationships -- whereby general words such as "entity" correspond to broad distributions that encompass more specific words such as "animal" or "instrument". We introduce density order embeddings, which learn hierarchical representations through encapsulation of probability densities....

10.48550/arxiv.1804.09843 preprint EN other-oa arXiv (Cornell University) 2018-01-01
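
A hedged sketch of the encapsulation idea (spherical Gaussians, toy numbers, and an assumed KL-based penalty standing in for the paper's divergence measures): a specific concept should incur only a small penalty against a general concept whose broad distribution encompasses it.

```python
import numpy as np

def kl_spherical(mu_f, var_f, mu_g, var_g):
    # KL( N(mu_f, var_f I) || N(mu_g, var_g I) ) for d-dimensional spherical Gaussians.
    d = mu_f.shape[0]
    diff = mu_g - mu_f
    return 0.5 * (d * var_f / var_g + diff @ diff / var_g - d + d * np.log(var_g / var_f))

def order_penalty(specific, general, gamma=2.0):
    # Violation is the amount by which the divergence exceeds a threshold.
    mu_s, var_s = specific
    mu_g, var_g = general
    return max(0.0, kl_spherical(mu_s, var_s, mu_g, var_g) - gamma)

dim = 8
entity = (np.zeros(dim), 4.0)            # broad distribution
animal = (np.full(dim, 0.3), 1.0)        # narrower, near the center of "entity"
zebra  = (np.full(dim, 3.0), 0.5)        # narrow and far from "entity"

print(order_penalty(animal, entity))     # small violation: "entity" encompasses "animal"
print(order_penalty(zebra, entity))      # much larger violation
```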

ML-powered code generation aims to assist developers to write code in a more productive manner by intelligently generating code blocks based on natural language prompts. Recently, large pretrained deep learning models have pushed the boundary of code generation and achieved impressive performance. However, the huge number of model parameters poses a significant challenge to their adoption in a typical software development environment, where a developer might use a standard laptop or mid-size server to develop code. Such a large model incurs costs in terms...

10.1145/3611643.3616302 article EN 2023-11-30

We introduce Probabilistic FastText, a new model for word embeddings that can capture multiple word senses, sub-word structure, and uncertainty information. In particular, we represent each word with a Gaussian mixture density, where the mean of a mixture component is given by the sum of n-gram vectors. This representation allows the model to share statistical strength across sub-word structures (e.g. Latin roots), producing accurate representations of rare, misspelt, or even unseen words. Moreover, each component can capture a different word sense. Probabilistic FastText outperforms both FastText, which...

10.48550/arxiv.1806.02901 preprint EN other-oa arXiv (Cornell University) 2018-01-01

Word embeddings provide point representations of words containing useful semantic information. We introduce multimodal word distributions formed from Gaussian mixtures, for multiple word meanings, entailment, and rich uncertainty information. To learn these distributions, we propose an energy-based max-margin objective. We show that the resulting approach captures uniquely expressive semantic information and outperforms alternatives, such as word2vec skip-grams and Gaussian embeddings, on benchmark datasets for word similarity and entailment.

10.48550/arxiv.1704.08424 preprint EN other-oa arXiv (Cornell University) 2017-01-01

Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL, a simple training-free method that applies...

10.48550/arxiv.2408.14690 preprint EN arXiv (Cornell University) 2024-08-26
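
A minimal sketch of training-free, magnitude-based activation sparsity in this spirit (toy tensors and a per-vector threshold rule, rather than calibrated per-layer thresholds): low-magnitude activation entries are zeroed before the matrix multiplication, so a sparse-aware kernel could skip the corresponding weight reads.

```python
import torch

def sparsify_activations(x: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    # Zero the fraction `sparsity` of entries with the smallest magnitude
    # in each hidden vector.
    k = int(x.shape[-1] * sparsity)
    if k == 0:
        return x
    threshold = x.abs().kthvalue(k, dim=-1, keepdim=True).values
    return torch.where(x.abs() <= threshold, torch.zeros_like(x), x)

hidden = torch.randn(4, 4096)            # activations entering a linear layer
weight = torch.randn(4096, 11008)        # e.g. an MLP up-projection

sparse_hidden = sparsify_activations(hidden, sparsity=0.5)
out_dense = hidden @ weight
out_sparse = sparse_hidden @ weight      # a sparse-aware kernel would skip zeroed rows of `weight`
print((sparse_hidden == 0).float().mean().item())                          # ~0.5 sparsity
print(torch.nn.functional.cosine_similarity(out_dense, out_sparse).mean().item())
```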

We propose a new framework, Translation between Augmented Natural Languages (TANL), to solve many structured prediction language tasks, including joint entity and relation extraction, nested named entity recognition, relation classification, semantic role labeling, event extraction, coreference resolution, and dialogue state tracking. Instead of tackling the problem by training task-specific discriminative classifiers, we frame it as a translation task between augmented natural languages, from which the task-relevant information can be...

10.48550/arxiv.2101.05779 preprint EN other-oa arXiv (Cornell University) 2021-01-01
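
An illustrative sketch of the augmented-natural-language idea (an invented example and a simplified bracket format, not the paper's exact templates): entity types and relations are written inline in the target sentence, and the structure can be parsed back out of the generated text.

```python
import re

def augment(tokens, entities, relations):
    # entities: {token: type}; relations: {token: (relation, head_token)}
    out = []
    for tok in tokens:
        if tok in entities:
            extra = f" | {entities[tok]}"
            if tok in relations:
                rel, head = relations[tok]
                extra += f" | {rel} = {head}"
            out.append(f"[ {tok}{extra} ]")
        else:
            out.append(tok)
    return " ".join(out)

def parse(augmented):
    # Recover (span, type, relation, head) tuples from the bracketed segments.
    return re.findall(r"\[ ([^|\]]+?) \| ([^|\]]+?)(?: \| ([^=\]]+?) = ([^\]]+?))? \]", augmented)

target = augment("Tolkien wrote the novel".split(),
                 entities={"Tolkien": "person", "novel": "book"},
                 relations={"novel": ("author", "Tolkien")})
print(target)   # [ Tolkien | person ] wrote the [ novel | book | author = Tolkien ]
print(parse(target))
```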

For infinite-measure-preserving rank-one transformations, we give a condition guaranteeing that all finite Cartesian products of the transformation with its inverse are ergodic. We show infinite Chacón satisfies this condition.

10.4064/sm170330-9-9 article EN Studia Mathematica 2018-01-01

Generative models, widely utilized in various applications, can often struggle with prompts corresponding to partial tokens. This struggle stems from tokenization, where partial tokens fall out of distribution during inference, leading to incorrect or nonsensical outputs. This paper examines a technique to alleviate the tokenization artifact on text completion in generative models while maintaining performance even in regular non-subword cases. The method, termed token alignment, involves backtracking to the last complete token and ensuring the model's...

10.48550/arxiv.2403.08688 preprint EN arXiv (Cornell University) 2024-03-13
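
A simplified sketch of the backtracking step (a toy whitespace tokenizer and invented helper names stand in for a real subword tokenizer; the full method additionally constrains decoding so the regenerated text reproduces the backtracked suffix).

```python
def align_prompt(text: str, tokenize, detokenize, backtrack: int = 1):
    """Trim the last `backtrack` tokens off the prompt and return the trimmed
    prompt plus the suffix that generation must reproduce."""
    token_ids = tokenize(text)
    keep, dropped = token_ids[:-backtrack], token_ids[-backtrack:]
    prefix_constraint = detokenize(dropped)     # generated text must start with this
    return detokenize(keep), prefix_constraint

# Toy whitespace "tokenizer" standing in for a real subword tokenizer.
vocab = {}
def tokenize(s):
    return [vocab.setdefault(w, len(vocab)) for w in s.split(" ")]
def detokenize(ids):
    inv = {i: w for w, i in vocab.items()}
    return " ".join(inv[i] for i in ids)

prompt = "def hello_wor"                        # ends mid-identifier (a partial token)
trimmed, must_start_with = align_prompt(prompt, tokenize, detokenize)
print(repr(trimmed), repr(must_start_with))     # 'def' 'hello_wor'
# The model continues from `trimmed`, with decoding constrained so its output
# begins with `must_start_with`, avoiding partial-token artifacts.
```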

In our study, we present bifurcated attention, a method developed for language model inference in single-context batch sampling contexts. This approach aims to reduce redundant memory IO costs, a significant factor in latency at high batch sizes and long context lengths. Bifurcated attention achieves this by dividing the attention mechanism during incremental decoding into two distinct GEMM operations, focusing on the KV cache from the prefill and the decoding process. The method ensures precise computation and maintains the usual computational load (FLOPs)...

10.48550/arxiv.2403.08845 preprint EN arXiv (Cornell University) 2024-03-13
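
A hedged sketch of the bifurcation (toy shapes, a single head, no masking): attention over the shared prefill KV cache is computed once against a single copy, attention over each sample's own decoded KV is computed separately, and the two partial results are combined under a joint softmax; the output matches the naive replicated-cache computation while reading the shared cache only once.

```python
import torch

d, prefill_len, dec_len, batch = 64, 512, 16, 8
k_shared = torch.randn(prefill_len, d)          # one copy of the prompt's KV cache
v_shared = torch.randn(prefill_len, d)
k_dec = torch.randn(batch, dec_len, d)          # per-sample decoded KV
v_dec = torch.randn(batch, dec_len, d)
q = torch.randn(batch, 1, d)                    # current-step queries

scale = d ** -0.5
# GEMM 1: all queries against the single shared prefill cache (no replication).
scores_shared = (q.squeeze(1) @ k_shared.T) * scale                      # (batch, prefill_len)
# GEMM 2: each query against its own decoded cache.
scores_dec = torch.einsum("bqd,bkd->bqk", q, k_dec).squeeze(1) * scale   # (batch, dec_len)

# Joint softmax over both segments, then combine the weighted values.
weights = torch.softmax(torch.cat([scores_shared, scores_dec], dim=-1), dim=-1)
w_shared, w_dec = weights[:, :prefill_len], weights[:, prefill_len:]
out = w_shared @ v_shared + torch.einsum("bk,bkd->bd", w_dec, v_dec)

# Reference: materialize the replicated cache and attend normally.
k_full = torch.cat([k_shared.expand(batch, -1, -1), k_dec], dim=1)
v_full = torch.cat([v_shared.expand(batch, -1, -1), v_dec], dim=1)
ref = torch.softmax(torch.einsum("bqd,bkd->bqk", q, k_full) * scale, dim=-1)
ref_out = torch.einsum("bqk,bkd->bqd", ref, v_full).squeeze(1)
print(torch.allclose(out, ref_out, atol=1e-5))  # True: same result, less memory IO
```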