- Topic Modeling
- Natural Language Processing Techniques
- Multimodal Machine Learning Applications
- Advanced Text Analysis Techniques
- Text Readability and Simplification
- Speech Recognition and Synthesis
- Text and Document Classification Technologies
- Domain Adaptation and Few-Shot Learning
- Neural Networks and Applications
- Advanced Graph Neural Networks
- Video Analysis and Summarization
- Advanced Neural Network Applications
- Machine Learning in Bioinformatics
- Web Data Mining and Analysis
- Face and Expression Recognition
- Sentiment Analysis and Opinion Mining
- Machine Learning and Data Classification
- Expert Finding and Q&A Systems
- Land Use and Ecosystem Services
- Oil Spill Detection and Mitigation
- Data Quality and Management
- Ferroelectric and Negative Capacitance Devices
- Environmental Impact and Sustainability
- Machine Learning and Algorithms
- Complex Network Analysis Techniques
Microsoft Research Asia (China)
2020-2024
Dalian University of Technology
2013-2024
Peking University
2015-2023
ETH Zurich
2023
Microsoft Research (India)
2021-2023
Tsinghua University
2023
Chinese University of Hong Kong
2023
Microsoft (Finland)
2022
Beijing Institute of Technology
2021-2022
Microsoft Research (United Kingdom)
2020-2022
Multi-label classification is an important yet challenging task in natural language processing. It is more complex than single-label classification in that the labels tend to be correlated. Existing methods tend to ignore the correlations between labels. Besides, different parts of the text can contribute differently to predicting different labels, which is not considered by existing models. In this paper, we propose to view the multi-label classification task as a sequence generation problem, and apply a sequence generation model with a novel decoder structure to solve it. Extensive...
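The sequence-generation view of multi-label classification can be illustrated with a toy decoding loop. This is a minimal sketch, not the paper's SGM model: the scorer is a random stand-in for the attention-based decoder, and the label vocabulary and stop token are hypothetical.

```python
# Minimal sketch: decoding a label *set* as a sequence, greedily emitting
# labels until a stop token, and masking labels already generated.
import numpy as np

LABELS = ["sports", "politics", "finance", "health", "<eos>"]

def score_next_label(prev_labels, rng):
    """Hypothetical stand-in for the decoder: returns one score per label."""
    return rng.normal(size=len(LABELS))

def decode_label_sequence(max_len=4, seed=0):
    rng = np.random.default_rng(seed)
    emitted = []
    for _ in range(max_len):
        scores = score_next_label(emitted, rng)
        # Mask labels that were already generated so the "sequence" stays a set.
        for i, lab in enumerate(LABELS[:-1]):
            if lab in emitted:
                scores[i] = -np.inf
        nxt = LABELS[int(np.argmax(scores))]
        if nxt == "<eos>":
            break
        emitted.append(nxt)
    return emitted

print(decode_label_sequence())
```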
A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot,...
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent refer expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities,...
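The grounded Markdown format described above can be illustrated with a small helper that discretizes a bounding box into location tokens. The 32x32 grid and the `<loc_i>` token naming are assumptions for illustration, not necessarily Kosmos-2's exact configuration.

```python
# Illustrative sketch of grounding a text span as a Markdown-style link whose
# target is a sequence of discrete location tokens (grid size is assumed).

def box_to_location_tokens(box, bins=32):
    """Map a normalized (x0, y0, x1, y1) box to two location tokens
    (top-left and bottom-right grid cells)."""
    x0, y0, x1, y1 = box
    def cell(x, y):
        col = min(int(x * bins), bins - 1)
        row = min(int(y * bins), bins - 1)
        return row * bins + col
    return f"<loc_{cell(x0, y0)}><loc_{cell(x1, y1)}>"

def ground_as_markdown(text_span, box):
    # ``[text span](bounding boxes)'' -- the link target holds the location tokens.
    return f"[{text_span}]({box_to_location_tokens(box)})"

print(ground_as_markdown("a snowman", (0.12, 0.30, 0.58, 0.95)))
```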
Large pretrained language models have shown surprising in-context learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without parameter updates. Despite the great success in performance, its working mechanism still remains an open question. In this paper, we explain language models as meta-optimizers and understand in-context learning as implicit finetuning. Theoretically, we figure out that Transformer attention has a dual form of gradient descent. On top of it, we understand ICL as follows: GPT...
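The dual-form claim can be sketched under a linearized-attention assumption (softmax dropped); the notation below is illustrative rather than the paper's exact derivation.

```latex
% Sketch of the dual form under a linear-attention approximation.
% X: the query's own context, X': the demonstration tokens, q: the query.
\[
\underbrace{W_V [X;\,X'] \bigl(W_K [X;\,X']\bigr)^{\top} q}_{\text{attention over context and demonstrations}}
= \underbrace{W_V X (W_K X)^{\top}}_{W_{\mathrm{ZSL}}} q
\; + \; \underbrace{W_V X' (W_K X')^{\top}}_{\Delta W_{\mathrm{ICL}}} q
\]
% The demonstrations contribute an outer-product update \(\Delta W_{\mathrm{ICL}}\)
% on top of the zero-shot weights, mirroring the accumulated outer-product
% updates produced by gradient descent on a linear layer.
```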
In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$...
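The parallel and recurrent retention paradigms can be sketched in a few lines; the snippet below is a single-head toy version without normalization, gating, or rotation, and the decay value is an arbitrary illustrative choice.

```python
# Toy sketch of retention: a decayed recurrent state update S_n = g*S_{n-1} + k_n v_n^T
# with output q_n S_n, and the equivalent parallel form (Q K^T odot D) V.
import numpy as np

def recurrent_retention(Q, K, V, gamma=0.9):
    """Q, K, V: (seq_len, d). Returns outputs of shape (seq_len, d)."""
    seq_len, d = Q.shape
    S = np.zeros((d, d))                      # recurrent state, O(1) in sequence length
    outputs = np.zeros((seq_len, d))
    for n in range(seq_len):
        S = gamma * S + np.outer(K[n], V[n])  # decayed state update
        outputs[n] = Q[n] @ S                 # O(1) per-token inference cost
    return outputs

def parallel_retention(Q, K, V, gamma=0.9):
    """Equivalent parallel form: (Q K^T elementwise D) V with a causal decay matrix D."""
    seq_len = Q.shape[0]
    n = np.arange(seq_len)
    D = np.where(n[:, None] >= n[None, :], gamma ** (n[:, None] - n[None, :]), 0.0)
    return (Q @ K.T * D) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
assert np.allclose(recurrent_retention(Q, K, V), parallel_retention(Q, K, V))
```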
Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit...
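A hedged sketch of how a ternary ("1.58-bit") weight matrix can be obtained: scale by the mean absolute value, then round and clip each entry to {-1, 0, 1}. This illustrates the quantization function only, not the quantization-aware training recipe.

```python
# Absmean ternary quantization sketch (illustrative, not the released recipe).
import numpy as np

def absmean_ternary_quantize(W, eps=1e-8):
    """Return ternary weights in {-1, 0, 1} plus the per-tensor scale."""
    scale = np.mean(np.abs(W)) + eps
    W_ternary = np.clip(np.round(W / scale), -1, 1)
    return W_ternary, scale

W = np.random.default_rng(0).normal(scale=0.02, size=(4, 4))
W_q, s = absmean_ternary_quantize(W)
print(W_q)                                              # entries are only -1, 0, or 1
print("max reconstruction error:", np.max(np.abs(W - s * W_q)))
```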
In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in the Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e.,...
In neural abstractive summarization, the conventional sequence-to-sequence (seq2seq) model often suffers from repetition and semantic irrelevance. To tackle the problem, we propose a global encoding framework, which controls the information flow from the encoder to the decoder based on the global information of the source context. It consists of a convolutional gated unit that performs global encoding to improve the representations of the source-side information. Evaluations on LCSTS and the English Gigaword both demonstrate that our model outperforms the baseline models, and the analysis shows that it is capable...
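The convolutional gated unit can be sketched roughly as a 1-D convolution over the encoder outputs that produces a sigmoid gate; kernel size, padding, and the single-layer setup below are illustrative assumptions rather than the paper's configuration.

```python
# Rough sketch: a context-aware sigmoid gate, computed by a 1-D convolution,
# filters each source-side encoder state before it reaches the decoder.
import torch
import torch.nn as nn

class ConvGatedUnit(nn.Module):
    def __init__(self, hidden, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2)

    def forward(self, enc_states):             # enc_states: (batch, seq, hidden)
        x = enc_states.transpose(1, 2)          # -> (batch, hidden, seq) for Conv1d
        gate = torch.sigmoid(self.conv(x))      # gate in [0, 1] from local context
        return (x * gate).transpose(1, 2)       # gated encoder representations

enc = torch.randn(2, 7, 16)
print(ConvGatedUnit(16)(enc).shape)             # torch.Size([2, 7, 16])
```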
Most of the existing models for document-level machine translation adopt dual-encoder structures: the representations of the source sentences and the document-level contexts are modeled with two separate encoders. Although these models can make use of the document-level contexts, they do not fully model the interaction between the contexts and the source sentences, and cannot directly adapt to recent pre-trained models (e.g., BERT) which encode multiple sentences with a single encoder. In this work, we propose a simple and effective unified encoder that outperforms the dual-encoder baselines in terms of BLEU and METEOR scores. Moreover,...
Zewen Chi, Shaohan Huang, Li Dong, Shuming Ma, Bo Zheng, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, Furu Wei. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in the Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully...
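The DeepNorm residual connection lends itself to a short sketch: x_{l+1} = LayerNorm(alpha * x_l + sublayer(x_l)) with a depth-dependent alpha and a beta-scaled initialization. The alpha/beta formulas below are the encoder-only values reported for DeepNet; the wiring itself is a simplified illustration with a linear layer standing in for the sublayer.

```python
# Sketch of a DeepNorm-style residual block with depth-dependent scaling.
import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    def __init__(self, hidden, num_layers):
        super().__init__()
        self.alpha = (2 * num_layers) ** 0.25           # up-scale the residual branch
        beta = (8 * num_layers) ** -0.25                # down-scale sublayer init
        self.sublayer = nn.Linear(hidden, hidden)       # stand-in for attention/FFN
        nn.init.xavier_normal_(self.sublayer.weight, gain=beta)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, x):
        return self.norm(self.alpha * x + self.sublayer(x))

block = DeepNormResidual(hidden=32, num_layers=100)
print(block(torch.randn(4, 32)).shape)                  # torch.Size([4, 32])
```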
Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, rendering the maximum sequence length restricted. To address this issue, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing the performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet...
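The token-selection geometry behind dilated attention can be sketched as below: split the sequence into segments and keep every r-th position, with segment length and dilation growing geometrically. This shows only the sparsification pattern, not the full attention computation or the mixing across patterns.

```python
# Toy sketch of the (segment length, dilation) index pattern behind dilated attention.
import numpy as np

def dilated_indices(seq_len, segment_len, dilation):
    """Token indices kept in each segment (causality and heads ignored)."""
    groups = []
    for start in range(0, seq_len, segment_len):
        segment = np.arange(start, min(start + segment_len, seq_len))
        groups.append(segment[::dilation])          # keep every `dilation`-th token
    return groups

seq_len = 16
for w, r in [(4, 1), (8, 2), (16, 4)]:              # geometrically growing (w, r)
    print(f"w={w:2d} r={r}:", [g.tolist() for g in dilated_indices(seq_len, w, r)])
```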
Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, Furu Wei. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
A sentence can be translated into more than one correct sentence. However, most of the existing neural machine translation models use only one of the correct translations as the target, and the other correct sentences are punished as incorrect in the training stage. Since the correct translations for one sentence share similar bag-of-words, it is possible to distinguish the correct translations from the incorrect ones by the bag-of-words. In this paper, we propose an approach that uses both the sentences and the bag-of-words as targets in the training stage, in order to encourage the model to generate potentially correct sentences that have not appeared in the training set. We evaluate our approach on a...
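A hedged sketch of the bag-of-words objective: sum the per-step output distributions into a sentence-level prediction and reward probability mass on the target word types, alongside the usual word-level cross-entropy. Normalization and weighting details are assumptions.

```python
# Toy bag-of-words auxiliary loss for a seq2seq decoder.
import torch

def bag_of_words_loss(step_probs, target_ids, eps=1e-8):
    """step_probs: (tgt_len, vocab) per-step softmax outputs.
    target_ids: target token ids forming the bag (duplicates allowed)."""
    bag_prob = step_probs.sum(dim=0).clamp(max=1.0)         # sentence-level "bag" prediction
    target_types = torch.unique(torch.tensor(target_ids))   # word types in the bag
    return -torch.log(bag_prob[target_types] + eps).mean()

probs = torch.softmax(torch.randn(5, 100), dim=-1)          # dummy decoder outputs
print(bag_of_words_loss(probs, [3, 17, 17, 42]))
```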
We propose a simple yet effective technique for neural network learning. The forward propagation is computed as usual. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-$k$ elements (in terms of magnitude) are kept. As a result, only $k$ rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction ($k$ divided by the vector dimension) in the computational cost. Surprisingly, experimental results demonstrate that we...
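The top-$k$ sparsified backward pass can be sketched with a custom autograd function: the forward pass is the identity, and the backward pass keeps only the $k$ largest-magnitude components of the output gradient, so only $k$ rows of the weight gradient are non-zero. This is a toy version in the spirit of the technique, not the paper's optimized implementation.

```python
# Top-k gradient sparsification sketch using a custom autograd function.
import torch

class TopKGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, k):
        ctx.k = k
        return x.view_as(x)                        # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        threshold = grad_out.abs().flatten().topk(ctx.k).values.min()
        mask = grad_out.abs() >= threshold          # keep only the k largest magnitudes
        return grad_out * mask, None

x = torch.randn(8, requires_grad=True)
w = torch.randn(8, 8, requires_grad=True)
h = TopKGrad.apply(w @ x, 2)
loss = ((h - torch.randn(8)) ** 2).sum()
loss.backward()
print((w.grad.abs().sum(dim=1) > 0).sum().item())   # -> 2: only k rows of w.grad are non-zero
```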
Language model pre-training has achieved success in many natural language processing tasks. Existing methods for cross-lingual pre-training adopt the Translation Language Model to predict masked words with the concatenation of the source sentence and its target equivalent. In this work, we introduce a novel cross-lingual pre-training method, called Alternating Language Modeling (ALM). It code-switches sentences of different languages rather than simply concatenating them, hoping to capture the rich cross-lingual context of words and phrases. More specifically, we randomly substitute source phrases...
Shuming Ma, Xu Sun, Wei Li, Sujian Li, Wenjie Li, Xuancheng Ren. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018.
We propose a novel model for multi-label text classification, which is based on sequence-to-sequence learning. The model generates higher-level semantic unit representations with multi-level dilated convolution, as well as a corresponding hybrid attention mechanism that extracts both the information at the word level and at the level of the semantic unit. Our designed dilated convolution effectively reduces the dimension and supports an exponential expansion of receptive fields without loss of local information, and the attention-over-attention mechanism is able to capture more...
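The multi-level dilated convolution can be sketched as a stack of 1-D convolutions with exponentially growing dilation rates, which expands the receptive field exponentially with depth without pooling. Channel sizes, kernel size, and the number of levels below are illustrative assumptions.

```python
# Sketch of stacked dilated convolutions producing higher-level "semantic unit" features.
import torch
import torch.nn as nn

def multi_level_dilated_conv(hidden=64, kernel_size=3, dilations=(1, 2, 4)):
    layers = []
    for d in dilations:
        layers += [nn.Conv1d(hidden, hidden, kernel_size,
                             dilation=d, padding=d * (kernel_size - 1) // 2),
                   nn.ReLU()]
    return nn.Sequential(*layers)

x = torch.randn(2, 64, 30)                      # (batch, hidden, seq_len)
units = multi_level_dilated_conv()(x)
print(units.shape)                              # torch.Size([2, 64, 30]): resolution preserved
# Receptive field after dilations (1, 2, 4) with kernel 3: 1 + 2*(1 + 2 + 4) = 15 tokens.
```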
Text summarization and sentiment classification both aim to capture the main ideas of the text but at different levels. Text summarization describes the text within a few sentences, while sentiment classification can be regarded as a special type of summarization which ``summarizes'' the text in an even more abstract fashion, i.e., a sentiment class. Based on this idea, we propose a hierarchical end-to-end model for joint learning of text summarization and sentiment classification, where the sentiment classification label is treated as a further ``summarization'' of the text summarization output. Hence, the sentiment classification layer is put upon the text summarization layer, and a hierarchical structure is derived. Experimental results on Amazon online...
Multi-label classification (MLC) aims to predict a set of labels for a given instance. Based on a pre-defined label order, the sequence-to-sequence (Seq2Seq) model trained via the maximum likelihood estimation method has been successfully applied to the MLC task and shows a powerful ability to capture high-order correlations between labels. However, the output labels are essentially an unordered set rather than an ordered sequence. This inconsistency tends to result in some intractable problems, e.g., sensitivity to the label order. To remedy...
Zewen Chi, Li Dong, Shuming Ma, Shaohan Huang, Saksham Singhal, Xian-Ling Mao, Heyan Huang, Xia Song, Furu Wei. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.
While pretrained encoders have achieved success in various natural language understanding (NLU) tasks, there is a gap between these pretrained encoders and natural language generation (NLG). NLG tasks are often based on the encoder-decoder framework, where the pretrained encoders can only benefit part of it. To reduce this gap, we introduce DeltaLM, a pretrained multilingual encoder-decoder model that regards the decoder as the task layer of off-the-shelf pretrained encoders. Specifically, we augment the pretrained multilingual encoder with a decoder and pre-train it in a self-supervised way. To take advantage of both the large-scale monolingual data and bilingual...
Current Chinese social media text summarization models are based on an encoder-decoder framework. Although the generated summaries are literally similar to the source texts, they often have low semantic relevance. In this work, our goal is to improve the semantic relevance between source texts and summaries for Chinese social media summarization. We introduce a Semantic Relevance Based neural model to encourage high semantic similarity between texts and summaries. In our model, the source text is represented by a gated attention encoder, while the summary representation is produced by the decoder. Besides, the similarity score between the representations...
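The semantic-relevance idea can be sketched as a similarity score between the source and summary representations (cosine similarity here) added to the training objective; the encoder/decoder representations and the loss weighting below are stand-ins, not the paper's exact formulation.

```python
# Rough sketch: penalize low similarity between source and summary representations.
import torch
import torch.nn.functional as F

def semantic_relevance_loss(source_repr, summary_repr):
    """Both inputs: (batch, hidden). Higher cosine similarity -> lower loss."""
    return 1.0 - F.cosine_similarity(source_repr, summary_repr, dim=-1).mean()

src = torch.randn(4, 128)      # stand-in for the gated-attention encoder state
summ = torch.randn(4, 128)     # stand-in for the decoder's summary representation
ce_loss = torch.tensor(2.3)    # stand-in for the usual generation cross-entropy
total_loss = ce_loss + 0.5 * semantic_relevance_loss(src, summ)
print(total_loss)
```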