Shuming Ma

ORCID: 0000-0003-1091-1206
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Multimodal Machine Learning Applications
  • Advanced Text Analysis Techniques
  • Text Readability and Simplification
  • Speech Recognition and Synthesis
  • Text and Document Classification Technologies
  • Domain Adaptation and Few-Shot Learning
  • Neural Networks and Applications
  • Advanced Graph Neural Networks
  • Video Analysis and Summarization
  • Advanced Neural Network Applications
  • Machine Learning in Bioinformatics
  • Web Data Mining and Analysis
  • Face and Expression Recognition
  • Sentiment Analysis and Opinion Mining
  • Machine Learning and Data Classification
  • Expert finding and Q&A systems
  • Land Use and Ecosystem Services
  • Oil Spill Detection and Mitigation
  • Data Quality and Management
  • Ferroelectric and Negative Capacitance Devices
  • Environmental Impact and Sustainability
  • Machine Learning and Algorithms
  • Complex Network Analysis Techniques

Microsoft Research Asia (China)
2020-2024

Dalian University of Technology
2013-2024

Peking University
2015-2023

ETH Zurich
2023

Microsoft Research (India)
2021-2023

Tsinghua University
2023

Chinese University of Hong Kong
2023

Microsoft (Finland)
2022

Beijing Institute of Technology
2021-2022

Microsoft Research (United Kingdom)
2020-2022

Multi-label classification is an important yet challenging task in natural language processing. It is more complex than single-label classification in that the labels tend to be correlated. Existing methods tend to ignore the correlations between labels. Besides, different parts of the text can contribute differently to predicting different labels, which is not considered by existing models. In this paper, we propose to view the multi-label classification task as a sequence generation problem, and apply a sequence generation model with a novel decoder structure to solve it. Extensive...

10.48550/arxiv.1806.04822 preprint EN other-oa arXiv (Cornell University) 2018-01-01
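
A minimal sketch of the sequence-generation view of multi-label classification described in the abstract above: labels are emitted one at a time by a decoder until an end-of-sequence token, so earlier predictions can condition later ones. The `decoder_step` interface and names are illustrative assumptions, not the authors' code.

    def decode_labels(decoder_step, encoder_state, id2label, eos_id, max_labels=10):
        """Greedy label-sequence decoding: each step is conditioned on the labels
        emitted so far, so correlations between labels can be captured."""
        hidden, prev, predicted = encoder_state, eos_id, []
        for _ in range(max_labels):
            scores, hidden = decoder_step(prev, hidden)   # one decoder step (hypothetical)
            next_id = max(range(len(scores)), key=scores.__getitem__)
            if next_id == eos_id:
                break
            predicted.append(id2label[next_id])
            prev = next_id
        return predicted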

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot,...

10.48550/arxiv.2302.14045 preprint EN other-oa arXiv (Cornell University) 2023-01-01

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent refer expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities,...

10.48550/arxiv.2306.14824 preprint EN other-oa arXiv (Cornell University) 2023-01-01
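
A rough illustration of the Markdown-style grounding format described in the Kosmos-2 abstract above: a bounding box is discretized into location tokens on a coarse grid and attached to its text span as a link. The grid size and token names are assumptions for illustration, not the exact tokenization used by the model.

    def box_to_location_tokens(box, image_w, image_h, bins=32):
        """Map a pixel-space box (x1, y1, x2, y2) to discrete location tokens."""
        x1, y1, x2, y2 = box
        def to_bin(v, size):
            return min(bins - 1, int(v / size * bins))
        tl = to_bin(y1, image_h) * bins + to_bin(x1, image_w)   # top-left grid cell
        br = to_bin(y2, image_h) * bins + to_bin(x2, image_w)   # bottom-right grid cell
        return f"<loc_{tl}><loc_{br}>"

    def ground_span(text_span, box, image_w, image_h):
        """Serialize a referring expression as a Markdown-style grounded link."""
        return f"[{text_span}]({box_to_location_tokens(box, image_w, image_h)})"

    # e.g. ground_span("a snowman", (10, 20, 180, 300), 224, 224)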

Large pretrained language models have shown surprising in-context learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without parameter updates. Despite the great success in performance, its working mechanism still remains an open question. In this paper, we explain language models as meta-optimizers and understand in-context learning as implicit finetuning. Theoretically, we figure out that Transformer attention has a dual form of gradient descent. On top of it, we understand ICL as follows: GPT...

10.18653/v1/2023.findings-acl.247 article EN cc-by Findings of the Association for Computational Linguistics: ACL 2023 2023-01-01
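
The dual form claimed in the abstract above can be sketched in one display, with softmax attention relaxed to linear attention. The symbols (X' for demonstration tokens, W_ZSL for the zero-shot component) follow the standard presentation of this argument rather than the paper's exact notation.

    % A linear layer updated by gradient descent computes F(x) = (W_0 + \Delta W) x,
    % with \Delta W = \sum_i e_i \otimes x'_i. Relaxed attention over the demonstrations
    % X' concatenated with the query context X takes the same dual form:
    \begin{aligned}
    \mathcal{F}_{\mathrm{ICL}}(q)
      &\approx W_V [X'; X] \bigl(W_K [X'; X]\bigr)^{\top} q \\
      &= W_V X (W_K X)^{\top} q \;+\; W_V X' (W_K X')^{\top} q \\
      &= W_{\mathrm{ZSL}}\, q \;+\; \Delta W_{\mathrm{ICL}}\, q ,
    \end{aligned}

so attention to the demonstrations contributes an outer-product update to the zero-shot weights, mirroring a gradient-descent step.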

In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables $O(1)$...

10.48550/arxiv.2307.08621 preprint EN other-oa arXiv (Cornell University) 2023-01-01
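
A toy numerical sketch of the retention mechanism named in the RetNet abstract above, showing that its parallel form (a decay-masked QK^T V product) matches its recurrent form (a single decaying state per step). Single head, real-valued, no rotation or normalization; purely illustrative.

    import numpy as np

    def retention_parallel(Q, K, V, gamma):
        """Parallel form: (Q K^T ⊙ D) V with causal decay mask D[n, m] = gamma^(n-m)."""
        T = Q.shape[0]
        n, m = np.indices((T, T))
        D = np.where(n >= m, gamma ** (n - m), 0.0)
        return (Q @ K.T * D) @ V

    def retention_recurrent(Q, K, V, gamma):
        """Recurrent form: S_t = gamma * S_{t-1} + k_t^T v_t, output o_t = q_t S_t."""
        S = np.zeros((K.shape[1], V.shape[1]))
        outputs = []
        for q, k, v in zip(Q, K, V):
            S = gamma * S + np.outer(k, v)     # constant-size state per step
            outputs.append(q @ S)
        return np.stack(outputs)

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((5, 4)) for _ in range(3))
    assert np.allclose(retention_parallel(Q, K, V, 0.9),
                       retention_recurrent(Q, K, V, 0.9))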

Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit...

10.48550/arxiv.2402.17764 preprint EN arXiv (Cornell University) 2024-02-27
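
A minimal sketch of ternary weight quantization in the spirit of the BitNet b1.58 abstract above: weights are scaled by their mean absolute value, then rounded and clipped to {-1, 0, 1}. This follows the absmean scheme described in the paper at a high level; training-time details (straight-through gradients, activation quantization) are omitted.

    import numpy as np

    def ternarize_absmean(W, eps=1e-5):
        """Quantize a weight matrix to {-1, 0, 1} with a per-tensor absmean scale."""
        scale = np.mean(np.abs(W)) + eps             # absmean scale
        W_q = np.clip(np.round(W / scale), -1, 1)    # ternary weights
        return W_q, scale                            # dequantize as W_q * scale

    W = np.random.randn(4, 4)
    W_q, scale = ternarize_absmean(W)
    print(np.unique(W_q))    # subset of [-1., 0., 1.]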

In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in the Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e.,...

10.1109/tpami.2024.3386927 article EN cc-by IEEE Transactions on Pattern Analysis and Machine Intelligence 2024-04-10
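
A brief sketch of the DeepNorm residual connection described above: the residual branch is up-weighted by a constant alpha before layer normalization, with a companion down-scaling of sub-layer weights at initialization. The alpha/beta formulas are the ones reported for encoder-only stacks; the code itself is illustrative, not the reference implementation.

    import torch.nn as nn

    class DeepNormBlock(nn.Module):
        """Residual block with DeepNorm: x_{l+1} = LN(alpha * x_l + G(x_l))."""
        def __init__(self, d_model, num_layers):
            super().__init__()
            self.alpha = (2 * num_layers) ** 0.25     # residual scaling (encoder-only)
            beta = (8 * num_layers) ** -0.25          # init scale for sub-layer weights
            self.norm = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                     nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))
            for m in self.ffn:
                if isinstance(m, nn.Linear):
                    nn.init.xavier_normal_(m.weight, gain=beta)

        def forward(self, x):
            return self.norm(self.alpha * x + self.ffn(x))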

In neural abstractive summarization, the conventional sequence-to-sequence (seq2seq) model often suffers from repetition and semantic irrelevance. To tackle the problem, we propose a global encoding framework, which controls the information flow from the encoder to the decoder based on the global information of the source context. It consists of a convolutional gated unit that performs global encoding to improve the representations of the source-side information. Evaluations on LCSTS and the English Gigaword both demonstrate that our model outperforms the baseline models, and the analysis shows that it is capable...

10.18653/v1/p18-2027 article EN cc-by 2018-01-01
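
A small sketch of the convolutional gated unit mentioned in the abstract above: a convolution over the encoder outputs produces an element-wise sigmoid gate that filters the source representations with global context. Layer sizes and the single-convolution design are simplifying assumptions.

    import torch
    import torch.nn as nn

    class ConvGatedUnit(nn.Module):
        """Gate encoder outputs with a convolutional view of the source context."""
        def __init__(self, d_model, kernel_size=3):
            super().__init__()
            self.conv = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)

        def forward(self, enc_out):                         # (batch, time, d_model)
            g = torch.sigmoid(self.conv(enc_out.transpose(1, 2)).transpose(1, 2))
            return enc_out * g                              # filtered source representations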

Most of the existing models for document-level machine translation adopt dual-encoder structures. The representations of the source sentences and the document-level contexts are modeled with two separate encoders. Although these models can make use of the contexts, they do not fully model the interaction between the contexts and the source sentences, and cannot directly adapt to recent pre-training models (e.g., BERT) which encode multiple sentences with a single encoder. In this work, we propose a simple and effective unified encoder that outperforms the baseline models in terms of BLEU and METEOR scores. Moreover,...

10.18653/v1/2020.acl-main.321 article EN cc-by 2020-01-01

Zewen Chi, Shaohan Huang, Li Dong, Shuming Ma, Bo Zheng, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, Furu Wei. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.

10.18653/v1/2022.acl-long.427 article EN cc-by Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022-01-01

In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in the Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully...

10.48550/arxiv.2203.00555 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, rendering the maximum sequence length restricted. To address this issue, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing the performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet...

10.48550/arxiv.2307.02486 preprint EN other-oa arXiv (Cornell University) 2023-01-01
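
A rough sketch of the index pattern behind the dilated attention named in the LongNet abstract above: the sequence is split into segments and, within each segment, only every r-th position is kept, with longer segments paired with larger dilation rates so the attentive field grows while cost stays roughly constant. Parameter names and the specific (segment, dilation) pairs are illustrative.

    def dilated_indices(seq_len, segment_len, dilation):
        """Positions each segment attends over after dilation-r subsampling."""
        groups = []
        for start in range(0, seq_len, segment_len):
            end = min(start + segment_len, seq_len)
            groups.append(list(range(start, end, dilation)))
        return groups

    # Mixture of (segment length, dilation) pairs: longer segments get sparser sampling,
    # so the covered distance grows geometrically at roughly constant per-segment cost.
    for w, r in [(4, 1), (8, 2), (16, 4)]:
        print(w, r, dilated_indices(16, w, r))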

Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, Furu Wei. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.

10.18653/v1/2023.acl-long.816 article EN cc-by 2023-01-01

A sentence can be translated into more than one correct sentence. However, most of the existing neural machine translation models only use one of the correct translations as the target, and the other correct sentences are punished as incorrect in the training stage. Since the correct translations for one sentence share a similar bag-of-words, it is possible to distinguish the correct translations from the incorrect ones by the bag-of-words. In this paper, we propose an approach that uses both the sentences and the bag-of-words as targets in the training stage, in order to encourage the model to generate potentially correct sentences that have not appeared in the training set. We evaluate our approach on a...

10.18653/v1/p18-2053 article EN cc-by 2018-01-01
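
A small sketch of the idea in the bag-of-words abstract above: alongside the usual token-level cross-entropy against one reference, an auxiliary loss rewards generating the right multiset of target words regardless of order. Building the predicted bag-of-words by summing per-step probabilities is one common formulation, offered here as an assumption rather than the paper's exact objective.

    import torch
    import torch.nn.functional as F

    def bow_loss(step_logits, target_token_ids):
        """Auxiliary bag-of-words loss.
        step_logits: (T_out, vocab) decoder logits; target_token_ids: 1-D reference ids."""
        probs = F.softmax(step_logits, dim=-1)
        bow = probs.sum(dim=0)                                  # expected word counts
        bow = (bow / bow.sum()).clamp(min=1e-9)                 # predicted bag distribution
        targets = torch.bincount(target_token_ids, minlength=step_logits.shape[1]).float()
        targets = targets / targets.sum()                       # reference bag distribution
        return -(targets * bow.log()).sum()                     # cross-entropy on the bags

    # total = ce_loss + lambda_bow * bow_loss(step_logits, target_ids)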

We propose a simple yet effective technique for neural network learning. The forward propagation is computed as usual. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-$k$ elements (in terms of magnitude) are kept. As a result, only $k$ rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction ($k$ divided by the vector dimension) in the computational cost. Surprisingly, experimental results demonstrate that we...

10.48550/arxiv.1706.06197 preprint EN other-oa arXiv (Cornell University) 2017-01-01
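
A tiny sketch of the top-k gradient sparsification described above: the gradient flowing back through a linear layer's output is pruned to its k largest-magnitude components, so only the corresponding rows of the weight gradient are nonzero. A conceptual illustration, not the authors' implementation.

    import numpy as np

    def sparsify_topk(grad, k):
        """Keep only the k largest-magnitude entries of a gradient vector."""
        mask = np.zeros_like(grad)
        mask[np.argsort(np.abs(grad))[-k:]] = 1.0
        return grad * mask

    # Backprop through y = W @ x with a sparsified output gradient:
    rng = np.random.default_rng(0)
    W, x = rng.standard_normal((6, 4)), rng.standard_normal(4)
    grad_y = sparsify_topk(rng.standard_normal(6), k=2)
    grad_W = np.outer(grad_y, x)     # only 2 of 6 rows are nonzero
    grad_x = W.T @ grad_y            # only 2 columns of W are touched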

Language model pre-training has achieved success in many natural language processing tasks. Existing methods for cross-lingual pre-training adopt the Translation Language Model to predict masked words with the concatenation of the source sentence and its target equivalent. In this work, we introduce a novel cross-lingual pre-training method, called Alternating Language Modeling (ALM). It code-switches sentences of different languages rather than using simple concatenation, hoping to capture the rich cross-lingual context of words and phrases. More specifically, we randomly substitute source phrases...

10.1609/aaai.v34i05.6480 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2020-04-03
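
A toy sketch of the code-switching construction mentioned in the ALM abstract above: source phrases with known translations are randomly swapped for their target-language counterparts to form a mixed-language training sequence. The phrase segmentation, alignment source, and switching rate are assumptions for illustration.

    import random

    def code_switch(source_phrases, aligned_translations, switch_prob=0.3):
        """Build a code-switched sequence by randomly replacing source phrases
        with their target-language translations."""
        switched = []
        for phrase in source_phrases:
            if phrase in aligned_translations and random.random() < switch_prob:
                switched.extend(aligned_translations[phrase])
            else:
                switched.append(phrase)
        return switched

    # e.g. code_switch(["we", "win the game"], {"win the game": ["赢得", "比赛"]})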

Shuming Ma, Xu Sun, Wei Li, Sujian Li, Wenjie Li, Xuancheng Ren. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018.

10.18653/v1/n18-1018 preprint EN cc-by 2018-01-01

We propose a novel model for multi-label text classification, which is based on sequence-to-sequence learning. The model generates higher-level semantic unit representations with multi-level dilated convolution as well as a corresponding hybrid attention mechanism that extracts information both at the word level and at the level of the semantic unit. Our designed dilated convolution effectively reduces dimension and supports an exponential expansion of receptive fields without loss of local information, and the attention-over-attention mechanism is able to capture more...

10.18653/v1/d18-1485 article EN cc-by Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing 2018-01-01

Text summarization and sentiment classification both aim to capture the main ideas of the text but at different levels. Text summarization is to describe the text within a few sentences, while sentiment classification can be regarded as a special type of summarization which ``summarizes'' the text in an even more abstract fashion, i.e., a sentiment class. Based on this idea, we propose a hierarchical end-to-end model for joint learning of text summarization and sentiment classification, where the sentiment classification label is treated as the further ``summarization'' of the text summarization output. Hence, the sentiment classification layer is put upon the text summarization layer, and a hierarchical structure is derived. Experimental results on Amazon online...

10.24963/ijcai.2018/591 article EN 2018-07-01

Multi-label classification (MLC) aims to predict a set of labels for a given instance. Based on a pre-defined label order, the sequence-to-sequence (Seq2Seq) model trained via the maximum likelihood estimation method has been successfully applied to the MLC task and shows a powerful ability to capture high-order correlations between labels. However, the output labels are essentially an unordered set rather than an ordered sequence. This inconsistency tends to result in some intractable problems, e.g., sensitivity to the label order. To remedy...

10.18653/v1/p19-1518 article EN cc-by 2019-01-01

Zewen Chi, Li Dong, Shuming Ma, Shaohan Huang, Saksham Singhal, Xian-Ling Mao, Heyan Huang, Xia Song, Furu Wei. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.

10.18653/v1/2021.emnlp-main.125 article EN cc-by Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021-01-01

While pretrained encoders have achieved success in various natural language understanding (NLU) tasks, there is a gap between these encoders and natural language generation (NLG). NLG tasks are often based on the encoder-decoder framework, where the pretrained encoders can only benefit part of it. To reduce this gap, we introduce DeltaLM, a pretrained multilingual encoder-decoder model that regards the decoder as the task layer of off-the-shelf pretrained encoders. Specifically, we augment the pretrained multilingual encoder with a decoder and pre-train it in a self-supervised way. To take advantage of both the large-scale monolingual data and bilingual...

10.48550/arxiv.2106.13736 preprint EN other-oa arXiv (Cornell University) 2021-01-01

Current Chinese social media text summarization models are based on an encoder-decoder framework. Although their generated summaries are literally similar to the source texts, they have low semantic relevance. In this work, our goal is to improve the semantic relevance between source texts and summaries for Chinese social media summarization. We introduce a Semantic Relevance Based neural model to encourage high semantic similarity between texts and summaries. In the model, the source text is represented by a gated attention encoder, while the summary representation is produced by the decoder. Besides, the similarity score between the representations...

10.18653/v1/p17-2100 article EN cc-by 2017-01-01
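
A minimal sketch of the semantic relevance idea in the abstract above: a cosine-similarity term between the encoder's source representation and the decoder's summary representation is added to the training objective, so summaries are pushed to stay semantically close to the source. The exact representations and weighting are assumptions.

    import torch.nn.functional as F

    def semantic_relevance_loss(ce_loss, source_repr, summary_repr, weight=1.0):
        """Combine generation loss with a term rewarding source/summary similarity."""
        similarity = F.cosine_similarity(source_repr, summary_repr, dim=-1).mean()
        return ce_loss - weight * similarity    # minimize CE, maximize similarity

    # total = semantic_relevance_loss(cross_entropy, enc_vec, dec_vec, weight=0.5)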