- Natural Language Processing Techniques
- Topic Modeling
- Advanced Text Analysis Techniques
- Biomedical Text Mining and Ontologies
- Text Readability and Simplification
- Semantic Web and Ontologies
- Authorship Attribution and Profiling
- Sentiment Analysis and Opinion Mining
- Hate Speech and Cyberbullying Detection
- Lexicography and Language Studies
- Spam and Phishing Detection
- Information Retrieval and Search Behavior
- Misinformation and Its Impacts
- Digital Communication and Language
- Data Visualization and Analytics
- Text and Document Classification Technologies
- Linguistics and Language Evolution
- Emotion and Mood Recognition
- Linguistics and Terminology Studies
- Social Media and Politics
- Scientific Computing and Data Management
- Speech Recognition and Synthesis
- Religious, Philosophical, and Educational Studies
- Stock Market Forecasting Methods
- Speech and Dialogue Systems
Jožef Stefan Institute
2015-2025
University of Edinburgh
2018-2019
Jožef Stefan International Postgraduate School
2015-2017
University of Antwerp
2017
University of Ljubljana
2010-2012
Increasing amounts of freely available data, both in textual and relational form, offer exploration of richer document representations, potentially improving model performance and robustness. An emerging problem in the modern era is fake news detection: many easily available pieces of information are not necessarily factually correct, and can lead to wrong conclusions or be used for manipulation. In this work we explore how different document representations, ranging from simple symbolic bag-of-words to contextual, neural language model-based ones...
We present a set of novel neural supervised and unsupervised approaches for determining the readability of documents. In the unsupervised setting, we leverage neural language models, whereas in the supervised setting, three different neural classification architectures are tested. We show that the proposed neural unsupervised approach is robust, transferable across languages, and allows adaptation to a specific readability task and data set. By systematic comparison of several neural architectures on a number of benchmark and new labeled readability data sets in two languages, this study also offers a comprehensive analysis of different neural approaches to readability classification. We expose their...
The coronavirus disease 2019 (COVID-19) pandemic has been severely impacting global society since December 2019. The related findings, such as vaccine and drug development, have been reported in the biomedical literature at a rate of about 10 000 articles on COVID-19 per month. Such rapid growth significantly challenges manual curation and interpretation. For instance, LitCovid is a literature database of COVID-19-related articles in PubMed, which has accumulated more than 200 000 articles with millions of accesses each month by users...
With growing amounts of available textual data, development of algorithms capable of automatic analysis, categorization, and summarization of these data has become a necessity. In this research, we present a novel algorithm for keyword identification, that is, an extraction of one or multiword phrases representing key aspects of a given document, called Transformer-Based Neural Tagger for Keyword IDentification (TNT-KID). By adapting the transformer architecture for a specific task at hand and leveraging language...
This article presents an interdisciplinary study combining advanced natural language processing techniques, using contextual embeddings, and manual thematic analysis. We analyse Slovenian news articles on LGBT+ topics, focusing on the differences in the connotation of the word deep, whose usage differs the most between the mainstream and conservative media groups, according to the system for automatic measuring of usage changes based on contextual embeddings. At the content level, the analysis shows that, in the mainstream media, deep is predominantly used in a conventional...
We propose a new method that leverages contextual embeddings for the task of diachronic semantic shift detection by generating time-specific word representations from BERT embeddings. The results of our experiments in the domain-specific LiverpoolFC corpus suggest that the proposed method has performance comparable to the current state-of-the-art without requiring any time-consuming domain adaptation on large corpora. The results on the newly created Brexit news corpus suggest that the method can be successfully used for the detection of a short-term yearly semantic shift. And lastly, the model also shows promising...
In this paper, we address the task of zero-shot cross-lingual news sentiment classification. Given the annotated dataset of positive, neutral, and negative news in Slovene, the aim is to develop a news classification system that assigns the sentiment category not only to Slovene news, but to news in another language without any training data required. Our system is based on the multilingual BERT model, while we test different approaches for handling long documents and propose a novel technique for sentiment enrichment of the BERT model as an intermediate training step. With the proposed approach,...
Background: Advances in machine learning (ML) technology have opened new avenues for detection and monitoring of cognitive decline. In this study, a multimodal approach to Alzheimer's dementia detection based on the patient's spontaneous speech is presented. This approach was tested on a standard, publicly available Alzheimer's speech dataset for comparability. The data comprise voice samples from 156 participants (1:1 ratio of Alzheimer's to control), matched by age and gender. Materials and Methods: A recently developed Active Data Representation (ADR)...
Automatic term extraction (ATE) is a natural language processing task that eases the effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. In this paper, we treat ATE as a sequence-labeling task and explore the efficacy of XLMR in evaluating cross-lingual and multilingual learning against monolingual learning in the cross-domain ATE context. Additionally, we introduce NOBI, a novel annotation mechanism enabling the labeling of single-word nested terms. Our experiments are conducted on the ACTER...
Text mining aims at constructing classification models and finding interesting patterns in large text collections. This paper investigates the utility of applying these techniques to media analysis, more specifically to support discourse analysis of news reports about the 2007 Kenyan elections and post-election crisis in local (Kenyan) and Western (British and US) newspapers. It illustrates how text mining methods can assist discourse analysis by finding contrast patterns which provide evidence for ideological differences between local and international press coverage. Our...
This paper presents the Graded Word Similarity in Context (GWSC) task, which asked participants to predict the effects of context on human perception of similarity in English, Croatian, Slovene and Finnish. We received 15 submissions and 11 system description papers. A new dataset (CoSimLex) was created for evaluation in this task: it contains pairs of words, each annotated within two different contexts. Systems beat the baselines by significant margins, but few did well in more than one language or subtask. Almost...
The use of background knowledge is largely unexploited in text classification tasks. This paper explores word taxonomies as means for constructing new semantic features, which may improve the performance and robustness of the learned classifiers. We propose tax2vec, a parallel algorithm for constructing taxonomy-based features, and demonstrate its use on six short text classification problems: prediction of gender, personality type, age, news topics, drug side effects and drug effectiveness. The constructed semantic features, in combination with fast linear classifiers, tested against...
Platforms that feature user-generated content (social media, online forums, newspaper comment sections etc.) have to detect and filter offensive speech within large, fast-changing datasets. While many automatic methods have been proposed and achieve good accuracies, most of these focus on the English language, and are hard to apply directly to languages in which few labeled datasets exist. Recent work has therefore investigated the use of cross-lingual transfer learning to solve this problem, training a model...
Learning from texts has been widely adopted throughout industry and science. While state-of-the-art neural language models have shown very promising results for text classification, they are expensive to (pre-)train and require large amounts of data and tuning of hundreds of millions or more parameters. This paper explores how automatically evolved text representations can serve as a basis for an explainable, low-resource branch of models with competitive performance that are subject to automated hyperparameter tuning. We...
Deep neural networks are becoming ubiquitous in text mining and natural language processing, but semantic resources, such as taxonomies and ontologies, are yet to be fully exploited in a deep learning setting. This paper presents an efficient semantic text mining approach, which converts semantic information related to a given set of documents into a set of novel features that are used for learning. The proposed Semantics-aware Recurrent Neural Architecture (SRNA) enables the system to learn simultaneously from the semantic vectors and from the raw text documents. We test...
Depression is a mental illness that negatively affects a person’s well-being and can, if left untreated, lead to serious consequences such as suicide. Therefore, it is important to recognize the signs of depression early. In the last decade, social media has become one of the most common places to express one’s feelings. Hence, there is a possibility of text processing and applying machine learning techniques to detect possible signs of depression. In this paper, we present our approaches to solving the shared task titled Detecting Signs of Depression from...
State-of-the-art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists. Standard tasks and datasets for intrinsic evaluation of embeddings are based on judgements of similarity, but ignore context; standard tasks for word sense disambiguation take account of context but do not provide continuous measures of meaning similarity. This paper describes an effort to build a new dataset, CoSimLex, intended to fill this gap. Building on the standard pairwise...