- Mathematics, Computing, and Information Processing
- Natural Language Processing Techniques
- Topic Modeling
- Advanced Database Systems and Queries
- Open Education and E-Learning
- Algorithms and Data Compression
- Digital Humanities and Scholarship
- Scientific Computing and Data Management
- Advanced Text Analysis Techniques
- Semantic Web and Ontologies
- Handwritten Text Recognition Techniques
- Computational Physics and Python Applications
- Educational Technology and Assessment
- Speech Recognition and Synthesis
- Biomedical Text Mining and Ontologies
- Wikis in Education and Collaboration
- Computational and Text Analysis Methods
- Educational Assessment and Pedagogy
- Research Data Management Practices
- Distributed and Parallel Computing Systems
- Data Quality and Management
- Academic Integrity and Plagiarism
- Intelligent Tutoring Systems and Adaptive Learning
- Advanced Data Storage Technologies
University of Göttingen
2023-2024
Stanford University
2023
University of Wuppertal
2019-2022
National Institute of Informatics
2020
University of Konstanz
2018-2019
Technische Universität Berlin
2017
Word embedding, which represents individual words with semantically fixed-length vectors, has made it possible to successfully apply deep learning to natural language processing tasks such as semantic role-modeling, question answering, and machine translation. As math text consists of natural text, as well as math expressions that similarly exhibit linear correlation and contextual characteristics, word embedding techniques can also be applied to math documents. However, while mathematics is a precise and accurate...
Mathematical formulae represent complex semantic information in a concise form. Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial to communicate information, e.g., in scientific papers, and to perform computations using computer algebra systems. Enabling computers to access the information encoded in mathematical formulae requires machine-readable formats that can represent both the presentation and the content, i.e., the semantics, of formulae. Exchanging such information between systems additionally requires conversion methods for representation...
Mathematical notation, i.e., the writing system used to communicate concepts in mathematics, encodes valuable information for a variety of information search and retrieval systems. Yet, mathematical notations remain mostly unutilized by today's systems. In this paper, we present the first in-depth study on the distributions of mathematical notation in two large scientific corpora: the open access arXiv (2.5B mathematical objects) and the mathematical reviewing service for pure and applied mathematics zbMATH (61M mathematical objects). Our study lays a foundation for future research projects on mathematical information retrieval for large scientific corpora. Further,...
Summarization for scientific text has shown significant benefits both for the research community and human society. Given the fact that the nature of scientific text is distinctive and the input of the multi-document summarization task is substantially long, the task requires sufficient embedding generation and text truncation without losing important information. To tackle these issues, in this paper, we propose SKT5SciSumm - a hybrid framework for multi-document scientific summarization (MDSS). We leverage the Sentence-Transformer version of Scientific Paper Embeddings using Citation-Informed...
Purpose: Modern mathematicians and scientists of math-related disciplines often use Document Preparation Systems (DPS) to write and Computer Algebra Systems (CAS) to calculate mathematical expressions. Usually, they translate the expressions manually between DPS and CAS. This process is time-consuming and error-prone. Our goal is to automate this translation. This paper uses Maple and Mathematica as the CAS, and LaTeX as our DPS. Design/methodology/approach: Bruce Miller at the National Institute of Standards and Technology (NIST) developed a...
This demo paper presents the first tool to annotate the reuse of text, images, and mathematical formulae in a document pair: TEIMMA. Annotating content reuse is particularly useful to develop plagiarism detection algorithms. Real-world content reuse is often obfuscated, which makes it challenging to identify such cases. TEIMMA allows entering the obfuscation type to enable novel classifications for confirmed cases of plagiarism. It enables recording different reuse types for text, images, and mathematical formulae in HTML and supports users by visualizing the content reuse in a document pair using similarity detection methods for text and math.
Wikipedia combines the power of AI solutions and human reviewers to safeguard article quality. Quality control objectives include detecting malicious edits, fixing typos, and spotting inconsistent formatting. However, no automated quality control mechanisms currently exist for mathematical formulae. Spell checkers are widely used to highlight textual errors, yet no equivalent tool exists to detect algebraically incorrect formulae. Our paper addresses this shortcoming by making mathematical formulae computable. We present a method that...
We tackle the problem of neural machine translation of mathematical formulae between ambiguous presentation languages and unambiguous content languages. Compared to neural machine translation on natural language, mathematical formulae have a much smaller vocabulary and much longer sequences of symbols, while their translation requires extreme precision to satisfy mathematical information needs. In this work, we perform the tasks of translating from LaTeX to Mathematica as well as from LaTeX to semantic LaTeX. While recurrent, recursive, and transformer networks struggle with preserving all contained information,...
This poster summarizes our contributions to Wikimedia's processing pipeline for mathematical formulae. We describe how we have supported the transition from rendering formulae as coarse-grained PNG images in 2001 to providing modern semantically enriched language-independent MathML formulae in 2020. Additionally, we describe our plans to improve the accessibility and discoverability of mathematical knowledge in Wikimedia projects further.
High annotation costs from hiring or crowdsourcing complicate the creation of large, high-quality datasets needed for training reliable text classifiers. Recent research suggests using Large Language Models (LLMs) to automate the annotation process, reducing these costs while maintaining data quality. LLMs have shown promising results in annotating downstream tasks like hate speech detection and political framing. Building on the success in these areas, this study investigates whether LLMs are viable for annotating the complex task of media bias detection and whether a...
Plagiarism is a pressing concern, even more so with the availability of large language models. Existing plagiarism detection systems reliably find copied and moderately reworded text but fail for idea plagiarism, especially in mathematical science, which heavily uses formal mathematical notation. We make two contributions. First, we establish a taxonomy of mathematical content reuse by annotating potentially plagiarised 122 scientific document pairs. Second, we analyze the best-performing approaches to detect plagiarism and mathematical content similarity on the newly...
Media bias detection poses a complex, multifaceted problem traditionally tackled using single-task models and small in-domain datasets, consequently lacking generalizability. To address this, we introduce MAGPIE, the first large-scale multi-task pre-training approach explicitly tailored for media bias detection. To enable pre-training at scale, we present Large Bias Mixture (LBM), a compilation of 59 bias-related tasks. MAGPIE outperforms previous approaches in media bias detection on the Bias Annotation By Experts (BABE) dataset, with a relative...
Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performances that surpass those of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopted a two-step approach for investigating the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their...
Nowadays, Machine Learning (ML) is seen as the universal solution to improve the effectiveness of information retrieval (IR) methods. However, while mathematics is a precise and accurate science, it is usually expressed by less accurate and imprecise descriptions, contributing to the relative dearth of machine learning applications for IR in this domain. Generally, mathematical documents communicate their knowledge with an ambiguous, context-dependent, and non-formal language. Given recent advances in ML, it seems canonical to apply ML...