- Mathematics, Computing, and Information Processing
- Natural Language Processing Techniques
- Topic Modeling
- Semantic Web and Ontologies
- Scientific Computing and Data Management
- Open Education and E-Learning
- Advanced Database Systems and Queries
- Research Data Management Practices
- Academic Integrity and Plagiarism
- Advanced Text Analysis Techniques
- Wikis in Education and Collaboration
- Digital Humanities and Scholarship
- Algorithms and Data Compression
- Advanced Data Storage Technologies
- Peer-to-Peer Network Technologies
- Blockchain Technology Applications and Security
- Educational Technology and Assessment
- Distributed and Parallel Computing Systems
- Caching and Content Delivery
- Machine Learning and Data Classification
- Intelligent Tutoring Systems and Adaptive Learning
- Data Mining Algorithms and Applications
- Big Data and Business Intelligence
- Misinformation and Its Impacts
- Data Quality and Management
FIZ Karlsruhe – Leibniz Institute for Information Infrastructure
2019-2024
University of Wuppertal
2018-2023
University of Göttingen
2021-2023
Stanford University
2023
University of Konstanz
2017-2022
Technische Informationsbibliothek (TIB)
2021
University of Michigan
2021
National Institute of Informatics
2018
Technische Universität Berlin
2011-2016
Moritz Klinik
2014
Recent years have witnessed growing consolidation of web operations. For example, the majority of web traffic now originates from a few organizations, and even micro-websites often choose to host on large pre-existing cloud infrastructures. In response to this, the "Decentralized Web" attempts to distribute ownership and operation of web services more evenly. This paper describes the design and implementation of the largest and most widely used Decentralized Web platform, the InterPlanetary File System (IPFS), an open-source,...
Mathematical formulae are essential in science, but face challenges of ambiguity, due to the use of a small number of identifiers to represent an immense number of concepts. Corresponding to word sense disambiguation in Natural Language Processing, we disambiguate mathematical identifiers. By regarding formulae and natural text as one monolithic information source, we are able to extract the semantics of identifiers in a process we term Mathematical Language Processing (MLP). As scientific communities tend to establish standard (identifier) notations, we use the document domain to infer the actual...
Current plagiarism detection systems reliably find instances of copied and moderately altered text, but often fail to detect strong paraphrases, translations, and the reuse of non-textual content and ideas. To improve upon the detection capabilities for such concealed content reuse in academic publications, we make four contributions: i) We present the first plagiarism detection approach that combines the analysis of mathematical expressions, images, citations, and text. ii) We describe the implementation of this hybrid detection approach in the research prototype HyPlag. iii) We present novel visualization...
Word embedding, which represents individual words with semantically fixed-length vectors, has made it possible to successfully apply deep learning to natural language processing tasks such as semantic role-modeling, question answering, and machine translation. As math text consists of natural text, as well as math expressions that similarly exhibit linear correlation and contextual characteristics, word embedding techniques can also be applied to math documents. However, while mathematics is a precise and accurate...
Literature recommender systems support users in filtering the vast and increasing number of documents in digital libraries and on the Web. For academic literature, research has proven the ability of citation-based document similarity measures, such as Co-Citation (CoCit) or Co-Citation Proximity Analysis (CPA), to improve recommendation quality. In this paper, we report on the first large-scale investigation of the performance of the CPA approach in generating literature recommendations for Wikipedia, which is fundamentally different from...
Mathematical formulae represent complex semantic information in a concise form. Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial to communicate information, e.g., in scientific papers, and to perform computations using computer algebra systems. Enabling computers to access the information encoded in mathematical formulae requires machine-readable formats that can represent both the presentation and content, i.e., the semantics, of formulae. Exchanging such information between systems additionally requires conversion methods for representation...
Identifying academic plagiarism is a pressing task for educational and research institutions, publishers, and funding agencies. Current plagiarism detection systems reliably find instances of copied and moderately reworded text. However, detecting concealed plagiarism, such as strong paraphrases, translations, and the reuse of nontextual content and ideas, is an open research problem. In this paper, we extend our prior research on analyzing mathematical content and academic citations. Both are promising approaches for improving the detection of concealed academic plagiarism primarily in Science, Technology,...
Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what the relationship is that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including...
This paper presents, to our knowledge, the first study on analyzing mathematical expressions to detect academic plagiarism. We make the following contributions. First, we investigate confirmed cases of plagiarism to categorize the similarities of mathematical content commonly found in plagiarized publications. From this investigation, we derive possible feature selection and feature comparison strategies for developing math-based detection approaches and a ground truth for our experiments. Second, we create a test collection by embedding confirmed cases of plagiarism into...
Mathematical Information Retrieval concerns retrieving information related to a particular mathematical concept. The NTCIR-11 Math Task develops an evaluation test collection for document sections retrieval of scientific articles based on human generated topics. Those topics involve a combination of formula patterns and keywords. In addition, the optional Wikipedia Task provides a test collection for retrieval of individual mathematical formulae from Wikipedia based on search topics that contain exactly one formula pattern. We developed a framework for automatic query generation and immediate...
Documents from science, technology, engineering and mathematics (STEM) often contain a large number of mathematical formulae alongside text. Semantic search, recommender, and question answering systems require the occurring formula constants and variables (identifiers) to be disambiguated. We present the first implementation of a recommender system that enables and accelerates the annotation by displaying the most likely candidates for identifier names from four different sources (arXiv, Wikipedia, Wikidata, or the surrounding...
Mathematical notation, i.e., the writing system used to communicate concepts in mathematics, encodes valuable information for a variety of information search and retrieval systems. Yet, mathematical notations remain mostly unutilized by today's systems. In this paper, we present the first in-depth study on the distributions of mathematical notation in two large scientific corpora: the open access arXiv (2.5B mathematical objects) and the mathematical reviewing service for pure and applied mathematics zbMATH (61M mathematical objects). Our study lays a foundation for future research projects on mathematical information retrieval for large scientific corpora. Further,...
In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labeled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings of text and formulae and evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies up to 82.8% and cluster purities up to 69.4% (number of clusters equals...
In natural language, words and phrases themselves imply the semantics. In contrast, the meaning of identifiers in mathematical formulae is undefined. Thus scientists must study the context to decode the meaning. The Mathematical Language Processing (MLP) project aims to support that process. In this paper, we compare two approaches to discover identifier-definition tuples. First, we use a simple pattern matching approach. Second, we present the MLP approach that uses part-of-speech tag based distances as well as sentence positions...
The identification and extraction of the events that news articles report on is a commonly performed task in the analysis workflow of various projects that analyze news articles. However, due to the lack of universally usable and publicly available methods for news articles, many researchers must redundantly implement methods for event extraction to be used within their projects. Answers to the journalistic five W and one H questions (5W1H) describe the main event of a news story, i.e., who did what, when, where, why, and how. We propose Giveme5W1H, an open-source system that uses...
Detecting academic plagiarism is a pressing problem, e.g., for educational and research institutions, funding agencies, and academic publishers. Existing plagiarism detection systems reliably identify copied text, or near copies of text, but often fail to detect disguised forms of academic plagiarism, such as paraphrases, translations, and idea plagiarism. We present Semantic Concept Pattern Analysis, an approach that performs an integrated analysis of semantic text relatedness and structural text similarity. Using 25 officially retracted academic plagiarism cases, we...