- Scientometrics and bibliometrics research
- Natural Language Processing Techniques
- Topic Modeling
- Biomedical Text Mining and Ontologies
- Semantic Web and Ontologies
- Wikis in Education and Collaboration
- Complex Network Analysis Techniques
- Digital Humanities and Scholarship
- Advanced Text Analysis Techniques
- Digital and Traditional Archives Management
- Research Data Management Practices
- Data Quality and Management
- Misinformation and Its Impacts
- Historical Economic and Social Studies
- Handwritten Text Recognition Techniques
- Library Science and Information Systems
- Art History and Market Analysis
- Computational and Text Analysis Methods
- Web visibility and informetrics
- Open Source Software Innovations
- COVID-19 diagnosis using AI
- Mathematics, Computing, and Information Processing
- Aesthetic Perception and Analysis
- Evolutionary Game Theory and Cooperation
- Academic Publishing and Open Access
University of Amsterdam
2019-2024
University of Bologna
2023-2024
University of Copenhagen
2024
The Alan Turing Institute
2018-2023
Universidade Nova de Lisboa
2023
University of Lisbon
2023
Berlin State Library
2023
Europeana Foundation
2023
University College Dublin
2023
Efforts to make research results open and reproducible are increasingly reflected by journal policies encouraging or mandating authors to provide data availability statements. As a consequence of this, there has been a strong uptake of data availability statements in recent literature. Nevertheless, it is still unclear what proportion of these statements actually contain well-formed links to data, for example via a URL or permanent identifier, and if there is an added value in providing such links. We consider 531,889 articles published by PLOS and BMC,...
Crypto art is limited-edition digital art, cryptographically registered with a token on a blockchain. Tokens represent a transparent, auditable origin and provenance for a piece of digital art. Blockchain technology allows tokens to be held and securely traded without the involvement of third parties. Crypto art draws its origins from conceptual art, sharing the immaterial and distributive nature of artworks, the tight blending of artworks with currency, and the rejection of conventional art markets and institutions. The authors propose a collection...
A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks: sentence segmentation, named entity recognition, dependency parsing,...
The digital transformation is turning archives, both old and new, into data. As a consequence, automation in the form of artificial intelligence techniques is increasingly applied both to scale traditional recordkeeping activities and to experiment with novel ways to capture, organise, and access records. We survey recent developments at the intersection of Artificial Intelligence and archival thinking and practice. Our overview of this growing body of literature is organised through the lenses of the Records Continuum model. We find four broad...
As the COVID-19 pandemic unfolds, researchers from all disciplines are coming together and contributing their expertise. CORD-19, a dataset of coronavirus publications, has been made available alongside calls to help mine the information it contains and to create tools to search it more effectively. We analyse the delineation of the publications included in CORD-19 from a scientometric perspective. Based on a comparison to the Web of Science database, we find that CORD-19 provides an almost complete coverage of research on coronaviruses. CORD-19 not only...
Wikipedia is one of the main sources of free knowledge on the Web. During the first few months of the pandemic, over 5,200 new Wikipedia pages on COVID-19 were created, accumulating 400 million page views by mid-June 2020. At the same time, an unprecedented amount of scientific articles on COVID-19 and the ongoing pandemic have been published online. Wikipedia’s content is based on reliable sources, such as scientific literature. Given its public function, it is crucial for Wikipedia to rely on representative and reliable scientific results, especially in a time of crisis. We assess the coverage...
There are many different relatedness measures, based for instance on citation relations or textual similarity, that can be used to cluster scientific publications. We propose a principled methodology for evaluating the accuracy of clustering solutions obtained using these relatedness measures. We formally show that the proposed methodology has an important consistency property. The empirical analyses that we present are based on publications in the fields of cell biology, condensed matter physics, and economics. Using the BM25 text-based relatedness measure as the evaluation...
Wikipedia is one of the most visited sites on the Web and a common source of information for many users. As an encyclopedia, Wikipedia was not conceived as a source of original information, but as a gateway to secondary sources: according to Wikipedia's guidelines, facts must be backed up by reliable sources that reflect the full spectrum of views on the topic. Although citations lie at the heart of Wikipedia, little is known about how users interact with them. To close this gap, we built client-side instrumentation for logging all interactions with links...
Wikipedia’s content is based on reliable and published sources. To this date, relatively little is known about what sources Wikipedia relies on, in part because extracting citations and identifying cited sources is challenging. To close this gap, we release Wikipedia Citations, a comprehensive data set of citations extracted from Wikipedia. We extracted 29.3 million citations from 6.1 million English Wikipedia articles as of May 2020, and classified them as being books, journal articles, or Web content. We were thus able to extract 4.0 million citations to scholarly publications with...
Calls to make scientific research more open have gained traction with a range of societal stakeholders. Open Science practices include but are not limited to the early sharing of results via preprints and openly sharing outputs such as data and code to make research more reproducible and extensible. Existing evidence shows that adopting Open Science practices has effects in several domains. In this study, we investigate whether adopting one or more Open Science practices leads to significantly higher citations for an associated publication, which is one form of academic impact. We use a novel dataset...
Data sharing is fundamental to scientific progress, enhancing transparency, reproducibility, and innovation across disciplines. Despite its growing significance, the variability of data-sharing practices across research fields remains insufficiently understood, limiting the development of effective policies and infrastructure. This study investigates the evolving landscape of data-sharing practices, specifically focusing on the intentions behind data release, reuse, and referencing. Leveraging the PubMed open dataset, we developed a model...
Sparked by issues of quality and lack of proper documentation for datasets, the machine learning community has begun developing standardised processes for establishing datasheets, with the intent to provide context and information on provenance, purposes, composition, the collection process, recommended uses or societal biases reflected in training datasets. This approach fits well with practices and procedures established in GLAM institutions, such as collections' descriptions. However, digital cultural heritage datasets...
We investigated the similarities of pairs of articles that are cocited at the different cocitation levels of the journal, article, section, paragraph, sentence, and bracket. Our results indicate that textual similarity, intellectual overlap (shared references), author overlap (shared authors), and proximity in publication time all rise monotonically as the cocitation level gets lower (from journal to bracket). While the main gain in similarity happens when moving from journal to article cocitation, all level changes entail an increase in similarity, especially section to paragraph...
Purpose: Wikipedia's inclusive editorial policy permits unrestricted participation, enabling individuals to contribute and disseminate their expertise while drawing upon a multitude of external sources. News media outlets constitute nearly one-third of all citations within Wikipedia. However, embracing such a radically open approach also poses the challenge of the potential introduction of biased content or viewpoints into Wikipedia. The authors conduct an investigation into the integrity of knowledge within Wikipedia, focusing on...
Purpose: This paper aims to expand the scope and mitigate the biases of extant archival indexes. Design/methodology/approach: The authors use automatic entity recognition on the archives of the Dutch East India Company to extract mentions of underrepresented people. Findings: The authors release an annotated corpus and baselines for a shared task and show that the proposed goal is feasible. Originality/value: Colonial archives are increasingly a focus of attention for historians and the public, and broadening access to them is a pressing need for archives.
We consider the task of reference mining: the detection, extraction and classification of references within the full text of scholarly publications. Reference mining brings forward specific challenges, such as the need to capture the morphology of highly abbreviated words and the dependence among the elements of a reference, both following codified reference styles. This task is particularly difficult, and little explored, with respect to the literature in the arts and humanities, where references are mostly given in footnotes. We apply a deep learning architecture for reference mining from the full text of scholarly publications. We explore...
This study presents the results of an experiment we performed to measure the coverage of Digital Humanities (DH) publications in mainstream open and proprietary bibliographic data sources, by further highlighting the relations among DH and other disciplines. We created a list of DH journals based on manual curation and bibliometric data. We used that list to identify DH publications in the bibliographic data sources under consideration. We used the ERIH-PLUS list of journals to identify Social Sciences and Humanities (SSH) publications. We analysed the citation links they included to understand the relationship between DH publications and SSH...