Moritz Schubotz

ORCID: 0000-0001-7141-4997
Research Areas
  • Mathematics, Computing, and Information Processing
  • Natural Language Processing Techniques
  • Topic Modeling
  • Semantic Web and Ontologies
  • Scientific Computing and Data Management
  • Open Education and E-Learning
  • Advanced Database Systems and Queries
  • Research Data Management Practices
  • Academic Integrity and Plagiarism
  • Advanced Text Analysis Techniques
  • Wikis in Education and Collaboration
  • Digital Humanities and Scholarship
  • Algorithms and Data Compression
  • Advanced Data Storage Technologies
  • Peer-to-Peer Network Technologies
  • Blockchain Technology Applications and Security
  • Educational Technology and Assessment
  • Distributed and Parallel Computing Systems
  • Caching and Content Delivery
  • Machine Learning and Data Classification
  • Intelligent Tutoring Systems and Adaptive Learning
  • Data Mining Algorithms and Applications
  • Big Data and Business Intelligence
  • Misinformation and Its Impacts
  • Data Quality and Management

FIZ Karlsruhe – Leibniz Institute for Information Infrastructure
2019-2024

University of Wuppertal
2018-2023

University of Göttingen
2021-2023

Stanford University
2023

University of Konstanz
2017-2022

Technische Informationsbibliothek (TIB)
2021

University of Michigan
2021

National Institute of Informatics
2018

Technische Universität Berlin
2011-2016

Moritz Klinik
2014

Recent years have witnessed growing consolidation of web operations. For example, the majority of web traffic now originates from a few organizations, and even micro-websites often choose to host on large pre-existing cloud infrastructures. In response to this, the "Decentralized Web" attempts to distribute ownership and operation of web services more evenly. This paper describes the design and implementation of the largest and most widely used Decentralized Web platform, the InterPlanetary File System (IPFS), an open-source,...

10.1145/3544216.3544232 preprint EN 2022-08-11

Mathematical formulae are essential in science, but face challenges of ambiguity, due to the use of a small number of identifiers to represent an immense number of concepts. Corresponding to word sense disambiguation in Natural Language Processing, we disambiguate mathematical identifiers. By regarding formulae and natural text as one monolithic information source, we are able to extract the semantics of identifiers in a process we term Mathematical Language Processing (MLP). As scientific communities tend to establish standard (identifier) notations, we use the document domain to infer the actual...

10.1145/2911451.2911503 article EN Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval 2016-07-07

Current plagiarism detection systems reliably find instances of copied and moderately altered text, but often fail to detect strong paraphrases, translations, and the reuse of non-textual content and ideas. To improve upon the detection capabilities for such concealed content reuse in academic publications, we make four contributions: i) We present the first plagiarism detection approach that combines the analysis of mathematical expressions, images, citations, and text. ii) We describe the implementation of this hybrid detection approach in the research prototype HyPlag. iii) We present novel visualization...

10.1145/3209978.3210177 article EN 2018-06-27

Word embedding, which represents individual words with semantically fixed-length vectors, has made it possible to successfully apply deep learning to natural language processing tasks such as semantic role-modeling, question answering, and machine translation. As math text consists of natural text, as well as math expressions that similarly exhibit linear correlation and contextual characteristics, word embedding techniques can also be applied to math documents. However, while mathematics is a precise and accurate...

10.1007/s11192-020-03502-9 article EN cc-by Scientometrics 2020-06-09

Literature recommender systems support users in filtering the vast and increasing number of documents in digital libraries and on the Web. For academic literature, research has proven the ability of citation-based document similarity measures, such as Co-Citation (CoCit) or Co-Citation Proximity Analysis (CPA), to improve recommendation quality. In this paper, we report on the first large-scale investigation of the performance of the CPA approach in generating literature recommendations for Wikipedia, which is fundamentally different from...

10.1145/2910896.2910908 article EN 2016-06-10

Mathematical formulae represent complex semantic information in a concise form. Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial to communicate information, e.g., in scientific papers, and to perform computations using computer algebra systems. Enabling computers to access the information encoded in mathematical formulae requires machine-readable formats that can represent both the presentation and content, i.e., the semantics, of formulae. Exchanging such information between systems additionally requires conversion methods for mathematical representation...

10.1145/3197026.3197058 preprint EN 2018-05-23

Identifying academic plagiarism is a pressing task for educational and research institutions, publishers, and funding agencies. Current plagiarism detection systems reliably find instances of copied and moderately reworded text. However, detecting concealed plagiarism, such as strong paraphrases, translations, and the reuse of nontextual content and ideas, is an open problem. In this paper, we extend our prior research on analyzing mathematical content and academic citations. Both are promising approaches for improving plagiarism detection, primarily in Science, Technology,...

10.1109/jcdl.2019.00026 preprint EN 2019-06-01

Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what relationship makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including...

10.1145/3383583.3398525 article EN 2020-08-01

This paper presents, to our knowledge, the first study on analyzing mathematical expressions to detect academic plagiarism. We make the following contributions. First, we investigate confirmed cases of plagiarism to categorize the similarities of mathematical content commonly found in plagiarized publications. From this investigation, we derive possible feature selection and feature comparison strategies for developing math-based detection approaches and a ground truth for our experiments. Second, we create a test collection by embedding into...

10.1145/3132847.3133144 article EN 2017-11-06

Mathematical Information Retrieval concerns retrieving information related to a particular mathematical concept. The NTCIR-11 Math Task develops an evaluation test collection for document sections retrieval of scientific articles based on human generated topics. Those topics involve a combination of formula patterns and keywords. In addition, the optional Wikipedia Task provides a test collection for retrieval of individual mathematical formulae from Wikipedia based on search topics that contain exactly one formula pattern. We developed a framework for automatic query generation and immediate...

10.1145/2766462.2767787 article EN Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval 2015-08-04

Documents from science, technology, engineering, and mathematics (STEM) often contain a large number of mathematical formulae alongside text. Semantic search, recommender, and question answering systems require the occurring formula constants and variables (identifiers) to be disambiguated. We present a first implementation of a recommender system that enables and accelerates formula annotation by displaying the most likely candidates for formula and identifier names from four different sources (arXiv, Wikipedia, Wikidata, or the surrounding...

10.1145/3298689.3347042 article EN 2019-09-10

Mathematical notation, i.e., the writing system used to communicate concepts in mathematics, encodes valuable information for a variety of information search and retrieval systems. Yet, mathematical notations remain mostly unutilized by today's systems. In this paper, we present the first in-depth study on the distributions of mathematical notation in two large scientific corpora: the open access arXiv (2.5B mathematical objects) and the mathematical reviewing service for pure and applied mathematics zbMATH (61M mathematical objects). Our study lays a foundation for future research projects on mathematical information retrieval for large scientific corpora. Further,...

10.1145/3366423.3380218 preprint EN 2020-04-20

In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labeled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings of text and formulae and evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies up to 82.8% and cluster purities up to 69.4% (number of clusters equals...

10.1145/3383583.3398529 preprint EN 2020-08-01

In natural language, words and phrases themselves imply the semantics. In contrast, the meaning of identifiers in mathematical formulae is undefined. Thus scientists must study the context to decode the meaning. The Mathematical Language Processing (MLP) project aims to support that process. In this paper, we compare two approaches to discover identifier-definition tuples. First, we use a simple pattern matching approach. Second, we present the MLP approach that uses part-of-speech tag based distances as well as sentence positions...

10.48550/arxiv.1407.0167 preprint EN other-oa arXiv (Cornell University) 2014-01-01

The identification and extraction of the events that news articles report on is a commonly performed task in the analysis workflow of various projects that analyze news articles. However, due to the lack of universally usable and publicly available methods for news articles, many researchers must redundantly implement methods for event extraction to be used within their projects. Answers to the journalistic five W and one H questions (5W1H) describe the main event of a news story, i.e., who did what, when, where, why, and how. We propose Giveme5W1H, an open-source system that uses...

10.1145/3197026.3203899 article EN 2018-05-23

Detecting academic plagiarism is a pressing problem, e.g., for educational and research institutions, funding agencies, and academic publishers. Existing plagiarism detection systems reliably identify copied text, or near copies of text, but often fail to detect disguised forms of academic plagiarism, such as paraphrases, translations, and idea plagiarism. We present Semantic Concept Pattern Analysis, an approach that performs an integrated analysis of semantic text relatedness and structural text similarity. Using 25 officially retracted academic plagiarism cases, we...

10.1145/3127526.3127535 article EN 2017-12-15