- Topic Modeling
- Computational and Text Analysis Methods
- Natural Language Processing Techniques
- Web Data Mining and Analysis
- Advanced Text Analysis Techniques
- Hate Speech and Cyberbullying Detection
- Sentiment Analysis and Opinion Mining
- Genomics and Phylogenetic Studies
- Language and cultural evolution
- Digital Marketing and Social Media
- Pain Management and Placebo Effect
- Bioinformatics and Genomic Networks
- Microbial Natural Products and Biosynthesis
- Digital Humanities and Scholarship
- Algorithms and Data Compression
- Expert finding and Q&A systems
- Recommender Systems and Techniques
- RNA and protein synthesis mechanisms
- Protist diversity and phylogeny
- Biomedical Text Mining and Ontologies
- Machine Learning in Bioinformatics
- Speech Recognition and Synthesis
- Advanced Graph Neural Networks
- Anxiety, Depression, Psychometrics, Treatment, Cognitive Processes
- Data Quality and Management
University of Helsinki
2019-2023
Utrecht University
2023
Jožef Stefan Institute
2022
La Rochelle Université
2022
Tieto (Finland)
2021
Queen Mary University of London
2021
Centre National de la Recherche Scientifique
2020
Université Paris-Saclay
2020
Université Paris-Sud
2020
Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
2020
Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results: Here, we report on the results of the third CAFA challenge, CAFA3, which featured an expanded analysis over the previous rounds, both in terms of the volume of data analyzed and the types of analysis performed. In a novel and major new development, predictions and assessment goals drove some of the experimental assays, resulting in functional annotations for...
The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Here we report on the results of the third CAFA challenge, CAFA3, which featured an expanded analysis over the previous rounds, both in terms of the volume of data analyzed and the types of analysis performed. In a novel and major new development, predictions and assessment goals drove some of the experimental assays, resulting in functional annotations for more than 1000...
The way words are used evolves through time, mirroring the cultural or technological evolution of society. Semantic change detection is the task of detecting and analysing word evolution in textual data, even over short periods of time. In this paper we focus on a new set of methods relying on contextualised embeddings, a type of semantic modelling that recently revolutionised the NLP field. We leverage the ability of the transformer-based BERT model to generate contextualised embeddings capable of detecting semantic change of words across time. Several approaches are compared in a common setting in order...
This paper describes the approaches used by the Discovery Team to solve SemEval-2020 Task 1 - Unsupervised Lexical Semantic Change Detection. The proposed method is based on clustering of BERT contextual embeddings, followed by a comparison of cluster distributions across time. The best results were obtained by an ensemble of this method and static Word2Vec embeddings. According to the official results, our approach proved the best for Latin in Subtask 2.
Dynamic topic models (DTMs) capture the evolution of topics and trends in time series data. Current DTMs are applicable only to monolingual datasets. In this paper we present the multilingual dynamic topic model (ML-DTM), a novel topic model that combines DTM with an existing multilingual topic modeling method to capture crosslingual topics that evolve across time. We present results on a parallel German-English corpus of news articles and a comparable corpus of Finnish and Swedish news articles. We demonstrate the capability of ML-DTM to track significant events related to a topic and show that it finds distinct topics and performs as...
Words with the suffix -ism are reductionist terms that help us navigate complex social issues by using a simple one-word label for them. On the one hand they are often associated with political ideologies, but on the other they are present in many other domains of language, especially culture, science, and religion. This has not always been the case. This paper studies isms in a historical record of digitized newspapers from 1820 to 1917 published in Finland to find out how the language of isms developed historically. We use diachronic word embeddings...
This paper presents the results of SHROOM, a shared task focused on detecting hallucinations: outputs from natural language generation (NLG) systems that are fluent, yet inaccurate. Such cases of overgeneration put in jeopardy many NLG applications, where correctness is often mission-critical. The shared task was conducted with a newly constructed dataset of 4000 model outputs labeled by 5 annotators each, spanning 3 NLP tasks: machine translation, paraphrase generation and definition modeling. The shared task was tackled by a total of 58 different users...
This paper addresses methodological issues in diachronic data analysis for historical research. We apply two families of topic models (LDA and DTM) on a relatively large set of historical newspapers, with the aim of capturing and understanding discourse dynamics. Our case study focuses on newspapers and periodicals published in Finland between 1854 and 1917, but our method can easily be transposed to any diachronic data. Our main contributions are a) a combined sampling, training and inference procedure for applying topic models to huge and imbalanced diachronic text collections;...
In this paper, we present the participation of the EMBEDDIA team in SemEval-2022 Task 8 (Multilingual News Article Similarity). We cover several techniques and propose different methods for finding multilingual news article similarity by exploring the dataset in its entirety. We take advantage of the textual content of the articles, the provided metadata (e.g., titles, keywords, topics), the translated articles, the images (those that were available), and knowledge graph-based representations of the entities and relations in the articles. We, then, compute...
This paper is part of a collaboration between computer scientists and historians aimed at the development of novel methods for historical newspaper analysis. We present a case study of ideological terms ending with the -ism suffix in nineteenth-century Finnish newspapers. We propose a two-step procedure to trace differences in word usages over time: training diachronic embeddings on several time slices and then clustering selected words together with their neighbours to obtain historical context. The obtained clusters turn out to be useful...
This paper presents M3L-Contrast -- a novel multimodal multilingual (M3L) neural topic model for comparable data that maps texts from multiple languages and images into a shared topic space. Our model is trained jointly on texts and images and takes advantage of pretrained document and image embeddings to abstract the complexities between different languages and modalities. As a multilingual topic model, it produces aligned language-specific topics, and as a multimodal model, it infers textual representations of semantic concepts in images. We demonstrate that our model is competitive with a zero-shot...
Moderation of reader comments is a significant problem for online news platforms. Here, we experiment with models for automatic moderation, using a dataset of comments from a popular Croatian newspaper. Our analysis shows that while comments that violate the moderation rules mostly share common linguistic and thematic features, their content varies across the different sections of the newspaper. We therefore make our models topic-aware, incorporating semantic features from a topic model into the classification decision. Our results show that topic information improves...
Grounding has been argued to be a crucial component towards the development of more complete and truly semantically competent artificial intelligence systems. The literature has divided into two camps: while some argue that grounding allows for qualitatively different generalizations, others believe it can be compensated for by mono-modal data quantity. Limited empirical evidence has emerged for or against either position, which we argue is due to the methodological challenges that come with studying grounding and its effects on NLP systems. In this...