- Topic Modeling
- Natural Language Processing Techniques
- Multimodal Machine Learning Applications
- Advanced Text Analysis Techniques
- Text Readability and Simplification
- Speech Recognition and Synthesis
- Text and Document Classification Technologies
- Domain Adaptation and Few-Shot Learning
- Neural Networks and Applications
- Advanced Graph Neural Networks
- Video Analysis and Summarization
- Advanced Neural Network Applications
- Machine Learning in Bioinformatics
- Web Data Mining and Analysis
- Face and Expression Recognition
- Sentiment Analysis and Opinion Mining
- Machine Learning and Data Classification
- Expert Finding and Q&A Systems
- Land Use and Ecosystem Services
- Oil Spill Detection and Mitigation
- Data Quality and Management
- Ferroelectric and Negative Capacitance Devices
- Environmental Impact and Sustainability
- Machine Learning and Algorithms
- Complex Network Analysis Techniques
Microsoft Research Asia (China)
2020-2024
Dalian University of Technology
2013-2024
Peking University
2015-2023
ETH Zurich
2023
Microsoft Research (India)
2021-2023
Tsinghua University
2023
Chinese University of Hong Kong
2023
Microsoft (Finland)
2022
Beijing Institute of Technology
2021-2022
Microsoft Research (United Kingdom)
2020-2022
Multi-label classification is an important yet challenging task in natural language processing. It is more complex than single-label classification in that the labels tend to be correlated. Existing methods tend to ignore the correlations between labels. Besides, different parts of the text can contribute differently to predicting different labels, which is not considered by existing models. In this paper, we propose to view the multi-label classification task as a sequence generation problem, and apply a sequence generation model with a novel decoder structure to solve it. Extensive...
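The sequence-generation view of multi-label classification can be illustrated with a toy decoding loop. This is a minimal sketch, not the paper's SGM model: the scorer is a random stand-in for the attention-based decoder, and the label vocabulary and stop token are hypothetical.

```python
# Minimal sketch: decoding a label *set* as a sequence, greedily emitting
# labels until a stop token, and masking labels already generated.
import numpy as np

LABELS = ["sports", "politics", "finance", "health", "<eos>"]

def score_next_label(prev_labels, rng):
    """Hypothetical stand-in for the decoder: returns one score per label."""
    return rng.normal(size=len(LABELS))

def decode_label_sequence(max_len=4, seed=0):
    rng = np.random.default_rng(seed)
    emitted = []
    for _ in range(max_len):
        scores = score_next_label(emitted, rng)
        # Mask labels that were already generated so the "sequence" stays a set.
        for i, lab in enumerate(LABELS[:-1]):
            if lab in emitted:
                scores[i] = -np.inf
        nxt = LABELS[int(np.argmax(scores))]
        if nxt == "<eos>":
            break
        emitted.append(nxt)
    return emitted

print(decode_label_sequence())
```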
A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot,...
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent refer expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities,...
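The grounded Markdown format described above can be illustrated with a small helper that discretizes a bounding box into location tokens. The 32x32 grid and the `<loc_i>` token naming are assumptions for illustration, not necessarily Kosmos-2's exact configuration.

```python
# Illustrative sketch of grounding a text span as a Markdown-style link whose
# target is a sequence of discrete location tokens (grid size is assumed).

def box_to_location_tokens(box, bins=32):
    """Map a normalized (x0, y0, x1, y1) box to two location tokens
    (top-left and bottom-right grid cells)."""
    x0, y0, x1, y1 = box
    def cell(x, y):
        col = min(int(x * bins), bins - 1)
        row = min(int(y * bins), bins - 1)
        return row * bins + col
    return f"<loc_{cell(x0, y0)}><loc_{cell(x1, y1)}>"

def ground_as_markdown(text_span, box):
    # ``[text span](bounding boxes)'' -- the link target holds the location tokens.
    return f"[{text_span}]({box_to_location_tokens(box)})"

print(ground_as_markdown("a snowman", (0.12, 0.30, 0.58, 0.95)))
```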
Large pretrained language models have shown surprising in-context learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without parameter updates. Despite the great success in performance, its working mechanism still remains an open question. In this paper, we explain language models as meta-optimizers and understand in-context learning as implicit finetuning. Theoretically, we figure out that Transformer attention has a dual form of gradient descent. On top of it, we understand ICL as follows: GPT...
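The dual-form claim can be sketched under a linearized-attention assumption (softmax dropped); the notation below is illustrative rather than the paper's exact derivation.

```latex
% Sketch of the dual form under a linear-attention approximation.
% X: the query's own context, X': the demonstration tokens, q: the query.
\[
\underbrace{W_V [X;\,X'] \bigl(W_K [X;\,X']\bigr)^{\top} q}_{\text{attention over context and demonstrations}}
= \underbrace{W_V X (W_K X)^{\top}}_{W_{\mathrm{ZSL}}} q
\; + \; \underbrace{W_V X' (W_K X')^{\top}}_{\Delta W_{\mathrm{ICL}}} q
\]
% The demonstrations contribute an outer-product update \(\Delta W_{\mathrm{ICL}}\)
% on top of the zero-shot weights, mirroring the accumulated outer-product
% updates produced by gradient descent on a linear layer.
```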
In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$...
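The parallel and recurrent retention paradigms can be sketched in a few lines; the snippet below is a single-head toy version without normalization, gating, or rotation, and the decay value is an arbitrary illustrative choice.

```python
# Toy sketch of retention: a decayed recurrent state update S_n = g*S_{n-1} + k_n v_n^T
# with output q_n S_n, and the equivalent parallel form (Q K^T odot D) V.
import numpy as np

def recurrent_retention(Q, K, V, gamma=0.9):
    """Q, K, V: (seq_len, d). Returns outputs of shape (seq_len, d)."""
    seq_len, d = Q.shape
    S = np.zeros((d, d))                      # recurrent state, O(1) in sequence length
    outputs = np.zeros((seq_len, d))
    for n in range(seq_len):
        S = gamma * S + np.outer(K[n], V[n])  # decayed state update
        outputs[n] = Q[n] @ S                 # O(1) per-token inference cost
    return outputs

def parallel_retention(Q, K, V, gamma=0.9):
    """Equivalent parallel form: (Q K^T elementwise D) V with a causal decay matrix D."""
    seq_len = Q.shape[0]
    n = np.arange(seq_len)
    D = np.where(n[:, None] >= n[None, :], gamma ** (n[:, None] - n[None, :]), 0.0)
    return (Q @ K.T * D) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
assert np.allclose(recurrent_retention(Q, K, V), parallel_retention(Q, K, V))
```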
Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit...
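A hedged sketch of how a ternary ("1.58-bit") weight matrix can be obtained: scale by the mean absolute value, then round and clip each entry to {-1, 0, 1}. This illustrates the quantization function only, not the quantization-aware training recipe.

```python
# Absmean ternary quantization sketch (illustrative, not the released recipe).
import numpy as np

def absmean_ternary_quantize(W, eps=1e-8):
    """Return ternary weights in {-1, 0, 1} plus the per-tensor scale."""
    scale = np.mean(np.abs(W)) + eps
    W_ternary = np.clip(np.round(W / scale), -1, 1)
    return W_ternary, scale

W = np.random.default_rng(0).normal(scale=0.02, size=(4, 4))
W_q, s = absmean_ternary_quantize(W)
print(W_q)                                              # entries are only -1, 0, or 1
print("max reconstruction error:", np.max(np.abs(W - s * W_q)))
```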
In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in the Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e.,...
In neural abstractive summarization, the conventional sequence-to-sequence (seq2seq) model often suffers from repetition and semantic irrelevance. To tackle the problem, we propose a global encoding framework, which controls the information flow from the encoder to the decoder based on the global information of the source context. It consists of a convolutional gated unit that performs global encoding to improve the representations of the source-side information. Evaluations on LCSTS and the English Gigaword both demonstrate that our model outperforms the baseline models, and the analysis shows that it is capable...
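The convolutional gated unit can be sketched roughly as a 1-D convolution over the encoder outputs that produces a sigmoid gate; kernel size, padding, and the single-layer setup below are illustrative assumptions rather than the paper's configuration.

```python
# Rough sketch: a context-aware sigmoid gate, computed by a 1-D convolution,
# filters each source-side encoder state before it reaches the decoder.
import torch
import torch.nn as nn

class ConvGatedUnit(nn.Module):
    def __init__(self, hidden, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2)

    def forward(self, enc_states):             # enc_states: (batch, seq, hidden)
        x = enc_states.transpose(1, 2)          # -> (batch, hidden, seq) for Conv1d
        gate = torch.sigmoid(self.conv(x))      # gate in [0, 1] from local context
        return (x * gate).transpose(1, 2)       # gated encoder representations

enc = torch.randn(2, 7, 16)
print(ConvGatedUnit(16)(enc).shape)             # torch.Size([2, 7, 16])
```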
Most of the existing models for document-level machine translation adopt dual-encoder structures: the representations of the source sentences and the document-level contexts are modeled with two separate encoders. Although these models can make use of the document-level contexts, they do not fully model the interaction between the contexts and the source sentences, and cannot directly adapt to recent pre-trained models (e.g., BERT) which encode multiple sentences with a single encoder. In this work, we propose a simple and effective unified encoder that outperforms the dual-encoder baselines in terms of BLEU and METEOR scores. Moreover,...
Zewen Chi, Shaohan Huang, Li Dong, Shuming Ma, Bo Zheng, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, Furu Wei. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in the Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully...
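The DeepNorm residual connection lends itself to a short sketch: x_{l+1} = LayerNorm(alpha * x_l + sublayer(x_l)) with a depth-dependent alpha and a beta-scaled initialization. The alpha/beta formulas below are the encoder-only values reported for DeepNet; the wiring itself is a simplified illustration with a linear layer standing in for the sublayer.

```python
# Sketch of a DeepNorm-style residual block with depth-dependent scaling.
import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    def __init__(self, hidden, num_layers):
        super().__init__()
        self.alpha = (2 * num_layers) ** 0.25           # up-scale the residual branch
        beta = (8 * num_layers) ** -0.25                # down-scale sublayer init
        self.sublayer = nn.Linear(hidden, hidden)       # stand-in for attention/FFN
        nn.init.xavier_normal_(self.sublayer.weight, gain=beta)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, x):
        return self.norm(self.alpha * x + self.sublayer(x))

block = DeepNormResidual(hidden=32, num_layers=100)
print(block(torch.randn(4, 32)).shape)                  # torch.Size([4, 32])
```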
Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, rendering the maximum sequence length restricted. To address this issue, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing the performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet...
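The token-selection geometry behind dilated attention can be sketched as below: split the sequence into segments and keep every r-th position, with segment length and dilation growing geometrically. This shows only the sparsification pattern, not the full attention computation or the mixing across patterns.

```python
# Toy sketch of the (segment length, dilation) index pattern behind dilated attention.
import numpy as np

def dilated_indices(seq_len, segment_len, dilation):
    """Token indices kept in each segment (causality and heads ignored)."""
    groups = []
    for start in range(0, seq_len, segment_len):
        segment = np.arange(start, min(start + segment_len, seq_len))
        groups.append(segment[::dilation])          # keep every `dilation`-th token
    return groups

seq_len = 16
for w, r in [(4, 1), (8, 2), (16, 4)]:              # geometrically growing (w, r)
    print(f"w={w:2d} r={r}:", [g.tolist() for g in dilated_indices(seq_len, w, r)])
```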
Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, Furu Wei. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
A sentence can be translated into more than one correct sentence. However, most of the existing neural machine translation models use only one of the correct translations as the target, and the other correct sentences are punished as incorrect in the training stage. Since the correct translations for one sentence share similar bag-of-words, it is possible to distinguish the correct translations from the incorrect ones by the bag-of-words. In this paper, we propose an approach that uses both the sentences and the bag-of-words as targets in the training stage, in order to encourage the model to generate potentially correct sentences that have not appeared in the training set. We evaluate our approach on a...
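A hedged sketch of the bag-of-words objective: sum the per-step output distributions into a sentence-level prediction and reward probability mass on the target word types, alongside the usual word-level cross-entropy. Normalization and weighting details are assumptions.

```python
# Toy bag-of-words auxiliary loss for a seq2seq decoder.
import torch

def bag_of_words_loss(step_probs, target_ids, eps=1e-8):
    """step_probs: (tgt_len, vocab) per-step softmax outputs.
    target_ids: target token ids forming the bag (duplicates allowed)."""
    bag_prob = step_probs.sum(dim=0).clamp(max=1.0)         # sentence-level "bag" prediction
    target_types = torch.unique(torch.tensor(target_ids))   # word types in the bag
    return -torch.log(bag_prob[target_types] + eps).mean()

probs = torch.softmax(torch.randn(5, 100), dim=-1)          # dummy decoder outputs
print(bag_of_words_loss(probs, [3, 17, 17, 42]))
```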
We propose a simple yet effective technique for neural network learning. The forward propagation is computed as usual. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-$k$ elements (in terms of magnitude) are kept. As a result, only $k$ rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction ($k$ divided by the vector dimension) in the computational cost. Surprisingly, experimental results demonstrate that we...
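The top-$k$ sparsified backward pass can be sketched with a custom autograd function: the forward pass is the identity, and the backward pass keeps only the $k$ largest-magnitude components of the output gradient, so only $k$ rows of the weight gradient are non-zero. This is a toy version in the spirit of the technique, not the paper's optimized implementation.

```python
# Top-k gradient sparsification sketch using a custom autograd function.
import torch

class TopKGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, k):
        ctx.k = k
        return x.view_as(x)                        # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        threshold = grad_out.abs().flatten().topk(ctx.k).values.min()
        mask = grad_out.abs() >= threshold          # keep only the k largest magnitudes
        return grad_out * mask, None

x = torch.randn(8, requires_grad=True)
w = torch.randn(8, 8, requires_grad=True)
h = TopKGrad.apply(w @ x, 2)
loss = ((h - torch.randn(8)) ** 2).sum()
loss.backward()
print((w.grad.abs().sum(dim=1) > 0).sum().item())   # -> 2: only k rows of w.grad are non-zero
```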
Language model pre-training has achieved success in many natural language processing tasks. Existing methods for cross-lingual pre-training adopt the Translation Language Model to predict masked words with the concatenation of the source sentence and its target equivalent. In this work, we introduce a novel cross-lingual pre-training method, called Alternating Language Modeling (ALM). It code-switches sentences of different languages rather than simply concatenating them, hoping to capture the rich cross-lingual context of words and phrases. More specifically, we randomly substitute source phrases...
Shuming Ma, Xu Sun, Wei Li, Sujian Li, Wenjie Li, Xuancheng Ren. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018.
We propose a novel model for multi-label text classification, which is based on sequence-to-sequence learning. The model generates higher-level semantic unit representations with multi-level dilated convolution, as well as a corresponding hybrid attention mechanism that extracts both the information at the word level and at the level of the semantic unit. Our designed dilated convolution effectively reduces the dimension and supports an exponential expansion of receptive fields without loss of local information, and the attention-over-attention mechanism is able to capture more...
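The multi-level dilated convolution can be sketched as a stack of 1-D convolutions with exponentially growing dilation rates, which expands the receptive field exponentially with depth without pooling. Channel sizes, kernel size, and the number of levels below are illustrative assumptions.

```python
# Sketch of stacked dilated convolutions producing higher-level "semantic unit" features.
import torch
import torch.nn as nn

def multi_level_dilated_conv(hidden=64, kernel_size=3, dilations=(1, 2, 4)):
    layers = []
    for d in dilations:
        layers += [nn.Conv1d(hidden, hidden, kernel_size,
                             dilation=d, padding=d * (kernel_size - 1) // 2),
                   nn.ReLU()]
    return nn.Sequential(*layers)

x = torch.randn(2, 64, 30)                      # (batch, hidden, seq_len)
units = multi_level_dilated_conv()(x)
print(units.shape)                              # torch.Size([2, 64, 30]): resolution preserved
# Receptive field after dilations (1, 2, 4) with kernel 3: 1 + 2*(1 + 2 + 4) = 15 tokens.
```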
Text summarization and sentiment classification both aim to capture the main ideas of the text but at different levels. Text summarization describes the text within a few sentences, while sentiment classification can be regarded as a special type of summarization which ``summarizes'' the text in an even more abstract fashion, i.e., a sentiment class. Based on this idea, we propose a hierarchical end-to-end model for joint learning of text summarization and sentiment classification, where the sentiment classification label is treated as a further ``summarization'' of the text summarization output. Hence, the sentiment classification layer is put upon the text summarization layer, and a hierarchical structure is derived. Experimental results on Amazon online...
Multi-label classification (MLC) aims to predict a set of labels for a given instance. Based on a pre-defined label order, the sequence-to-sequence (Seq2Seq) model trained via the maximum likelihood estimation method has been successfully applied to the MLC task and shows a powerful ability to capture high-order correlations between labels. However, the output labels are essentially an unordered set rather than an ordered sequence. This inconsistency tends to result in some intractable problems, e.g., sensitivity to the label order. To remedy...
Zewen Chi, Li Dong, Shuming Ma, Shaohan Huang, Saksham Singhal, Xian-Ling Mao, Heyan Huang, Xia Song, Furu Wei. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.
While pretrained encoders have achieved success in various natural language understanding (NLU) tasks, there is a gap between these pretrained encoders and natural language generation (NLG). NLG tasks are often based on the encoder-decoder framework, where the pretrained encoders can only benefit part of it. To reduce this gap, we introduce DeltaLM, a pretrained multilingual encoder-decoder model that regards the decoder as the task layer of off-the-shelf pretrained encoders. Specifically, we augment the pretrained multilingual encoder with a decoder and pre-train it in a self-supervised way. To take advantage of both the large-scale monolingual data and bilingual...
Current Chinese social media text summarization models are based on an encoder-decoder framework. Although the generated summaries are literally similar to the source texts, they often have low semantic relevance. In this work, our goal is to improve the semantic relevance between source texts and summaries for Chinese social media summarization. We introduce a Semantic Relevance Based neural model to encourage high semantic similarity between texts and summaries. In our model, the source text is represented by a gated attention encoder, while the summary representation is produced by the decoder. Besides, the similarity score between the representations...
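The semantic-relevance idea can be sketched as a similarity score between the source and summary representations (cosine similarity here) added to the training objective; the encoder/decoder representations and the loss weighting below are stand-ins, not the paper's exact formulation.

```python
# Rough sketch: penalize low similarity between source and summary representations.
import torch
import torch.nn.functional as F

def semantic_relevance_loss(source_repr, summary_repr):
    """Both inputs: (batch, hidden). Higher cosine similarity -> lower loss."""
    return 1.0 - F.cosine_similarity(source_repr, summary_repr, dim=-1).mean()

src = torch.randn(4, 128)      # stand-in for the gated-attention encoder state
summ = torch.randn(4, 128)     # stand-in for the decoder's summary representation
ce_loss = torch.tensor(2.3)    # stand-in for the usual generation cross-entropy
total_loss = ce_loss + 0.5 * semantic_relevance_loss(src, summ)
print(total_loss)
```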