- Topic Modeling
- Natural Language Processing Techniques
- Privacy, Security, and Data Protection
- Text Readability and Simplification
- Privacy-Preserving Technologies in Data
- Language and Cultural Evolution
- Artificial Intelligence in Healthcare and Education
- Machine Learning and Data Classification
- Human Mobility and Location-Based Analysis
- Domain Adaptation and Few-Shot Learning
- Reinforcement Learning in Robotics
- Sentiment Analysis and Opinion Mining
- Urban Transport and Accessibility
- Handwritten Text Recognition Techniques
- Speech Recognition and Synthesis
- Multimodal Machine Learning Applications
- Speech and Dialogue Systems
- Advanced Text Analysis Techniques
- Transportation Planning and Optimization
Naver (South Korea)
2020-2023
Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language task into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording...
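For illustration only (the templates below are invented, not taken from the released tooling), mapping a supervised example into a human-readable prompted form with several wordings can look like this:

```python
# Hypothetical sketch: converting one supervised NLI example into prompted form.
example = {"premise": "A man is playing a guitar.",
           "hypothesis": "A person is making music.",
           "label": "entailment"}

# Two made-up prompt templates with different wordings for the same task.
templates = [
    '{premise}\nQuestion: does this imply that "{hypothesis}"? Answer: {label}',
    'Suppose "{premise}". Can we infer that "{hypothesis}"? {label}',
]

for template in templates:
    print(template.format(**example))
    print("---")
```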
We address the problem of unsupervised abstractive summarization of collections of user generated reviews through self-supervision and control. We propose a self-supervised setup that considers an individual document as a target summary for a set of similar documents. This setting makes training simpler than previous approaches by relying only on the standard log-likelihood loss and mainstream models. We address hallucinations through the use of control codes, to steer the generation towards more coherent and relevant summaries.
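A minimal sketch of that self-supervised pairing, assuming a generic document-similarity function (the function and field names are hypothetical, not from the paper's code):

```python
# Build (source reviews, pseudo-summary) pairs: each review serves as the target
# summary for its k most similar reviews, so a standard seq2seq model can be
# trained with the usual log-likelihood loss.
def build_pairs(reviews, similarity, k=8):
    pairs = []
    for i, target in enumerate(reviews):
        others = [r for j, r in enumerate(reviews) if j != i]
        neighbours = sorted(others, key=lambda r: similarity(target, r),
                            reverse=True)[:k]
        pairs.append({"source": neighbours, "summary": target})
    return pairs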
Aligning language models with preferences can be posed as approximating a target distribution representing some desired behavior. Existing approaches differ both in the functional form of the target distribution and in the algorithm used to approximate it. For instance, Reinforcement Learning from Human Feedback (RLHF) corresponds to minimizing a reverse KL from an implicit target distribution arising from a KL penalty in the objective. On the other hand, Generative Distributional Control (GDC) has an explicit target distribution and minimizes a forward KL from it using the Distributional Policy Gradient (DPG) algorithm. In...
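As a rough restatement of the contrast above (notation assumed here, not quoted from the paper: a(x) the original model, r(x) a reward, β the KL-penalty coefficient, π_θ the policy being trained):

```latex
% Reverse KL: RLHF with a KL penalty implicitly targets a(x) exp(r(x)/beta).
\[
  p_{\mathrm{RLHF}}(x) \;\propto\; a(x)\,\exp\!\big(r(x)/\beta\big),
  \qquad
  \min_{\theta}\; D_{\mathrm{KL}}\!\left(\pi_{\theta}\,\|\,p_{\mathrm{RLHF}}\right).
\]
% Forward KL: GDC defines an explicit target p_GDC and minimizes the divergence
% in the other direction, optimized with the DPG algorithm.
\[
  \min_{\theta}\; D_{\mathrm{KL}}\!\left(p_{\mathrm{GDC}}\,\|\,\pi_{\theta}\right).
\]
```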
The power of natural language generation models has provoked a flurry of interest in automatic methods to detect if a piece of text is human or machine-authored. The problem has so far been framed in a standard supervised way and consists in training a classifier on annotated data to predict the origin of one given new document. In this paper, we frame the problem in an unsupervised and distributional way: we assume that we have access to a large collection of unannotated documents, a big fraction of which is machine-generated. We propose a method to detect those...
Research on Large Language Models (LLMs) has recently witnessed an increasing interest in extending the models' context size to better capture dependencies within long documents. While benchmarks have been proposed to assess long-range abilities, existing efforts have primarily considered generic tasks that are not necessarily aligned with real-world applications. In contrast, our work proposes a new benchmark for long-context LLMs focused on a practical meeting assistant scenario. In this scenario, the...
As large language models (LLMs) are increasingly used across various applications, there is a growing need to control text generation to satisfy specific constraints or requirements. This raises a crucial question: Is it possible to guarantee strict constraint satisfaction in the generated outputs while preserving the distribution of the original model as much as possible? We first define the ideal distribution - the one closest to the original model, which also always satisfies the expressed constraint - as the ultimate goal of guaranteed generation. We then state...
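As a toy illustration (not the paper's method), the ideal distribution is simply the original model restricted to constraint-satisfying outputs; plain rejection sampling draws from it exactly, though possibly at prohibitive cost when the acceptance rate is low (function names are hypothetical):

```python
# Rejection sampling from the "ideal distribution": the original model a(.)
# conditioned on a binary constraint c(y) = 1. Accepted samples follow
# a(y | c(y) = 1), the distribution closest to a that always satisfies c.
def sample_guaranteed(sample_from_model, satisfies_constraint, max_tries=1000):
    for _ in range(max_tries):
        y = sample_from_model()          # y ~ a(.)   (original model)
        if satisfies_constraint(y):      # keep only outputs with c(y) = 1
            return y
    raise RuntimeError("constraint never satisfied; acceptance rate too low")
```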
Pre-trained language models and other generative models have revolutionized NLP and beyond. However, these models tend to reproduce undesirable biases present in their training data. Also, they may overlook patterns that are important but challenging to capture. To address these limitations, researchers have introduced distributional control techniques. These techniques, not limited to language, allow controlling the prevalence (i.e., expectations) of any features of interest in the model's outputs. Despite their potential, widespread...
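As a toy illustration of what "controlling expectations of features" means (the names and example feature below are invented), one can first measure the gap between a feature's current prevalence under the model and the desired target moment; the control technique then adjusts the model so the two match:

```python
# Monte-Carlo estimate of E[phi(x)] under the current model, compared to a
# desired target expectation for that feature.
def feature_gap(sample_from_model, feature, target_expectation, n=1000):
    samples = [sample_from_model() for _ in range(n)]
    current = sum(feature(x) for x in samples) / n   # estimate of E[phi(x)]
    return current - target_expectation

# e.g. feature(x) = 1 if the output mentions a female character, else 0,
# with target_expectation = 0.5 to enforce balance.
```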
Public transport operations data, and in particular fare collection data, can be used to reconstruct and analyse mobility patterns. So far, various methods have been proposed and studied in some specific contexts. This paper proposes a general framework for looking at all the core elements of possible operational settings of public transport. It also describes novel methods for trip alignment, travellers' origin and destination detection, and vehicle load estimation. Two use cases illustrate and validate the efficiency of the reconstruction methods.
Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of any character string (e.g. English sentences) is to first transform it into a sequence of tokens that is scored by the model. However, there are exponentially many token sequences that represent any given string. To truly compute the probability of a string, one should marginalize over all tokenizations, which is typically intractable. Here, we analyze whether ignoring the marginalization is justified. To this end, we devise an...
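A toy illustration of the marginalization issue above: with a tiny vocabulary, a string admits several tokenizations, and its true probability is the sum over all of them rather than the score of one canonical segmentation (the vocabulary and log-probabilities are made up, and tokens are scored independently for simplicity, unlike a real LM which scores tokens in context):

```python
import math

# Made-up token log-probabilities.
vocab_logp = {"un": -1.0, "do": -1.2, "undo": -0.7, "u": -2.0, "n": -2.0}

def tokenizations(s, vocab):
    """Enumerate every way of segmenting s into vocabulary tokens."""
    if not s:
        yield []
        return
    for tok in vocab:
        if s.startswith(tok):
            for rest in tokenizations(s[len(tok):], vocab):
                yield [tok] + rest

string = "undo"
marginal = 0.0
for toks in tokenizations(string, vocab_logp):
    logp = sum(vocab_logp[t] for t in toks)  # toy: independent token scores
    marginal += math.exp(logp)
    print(toks, f"log p = {logp:.2f}")
print(f"marginal P({string!r}) = {marginal:.4f}")  # sum over all tokenizations
```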
As language models (LMs) become more capable, it is increasingly important to align them with human preferences. However, the dominant paradigm for training Preference Models (PMs) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset. We propose Compositional Preference Models (CPMs), a novel PM framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features...
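Schematically (the feature names, scorers, and weights below are invented for illustration), a compositional preference score combines several interpretable feature scores through a simple linear model rather than a single opaque end-to-end preference model:

```python
# Combine per-feature scalar scores into one global preference score.
def compositional_preference(response, feature_scorers, weights, bias=0.0):
    # feature_scorers: dict mapping a feature name to a function that returns
    # a scalar score for the response (e.g. elicited from an LM prompt).
    scores = {name: scorer(response) for name, scorer in feature_scorers.items()}
    total = bias + sum(weights[name] * scores[name] for name in scores)
    return total, scores

# Example usage with made-up features:
# total, per_feature = compositional_preference(
#     text,
#     {"helpfulness": score_helpfulness, "factuality": score_factuality},
#     {"helpfulness": 0.6, "factuality": 0.4},
# )
```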