- Topic Modeling
- Natural Language Processing Techniques
- Hate Speech and Cyberbullying Detection
- Sentiment Analysis and Opinion Mining
- Interpreting and Communication in Healthcare
- Advanced Text Analysis Techniques
- Speech and dialogue systems
- Wikis in Education and Collaboration
- Authorship Attribution and Profiling
- Social Media and Politics
- Computational and Text Analysis Methods
- Cancer-related gene regulation
- Open Source Software Innovations
- Complex Network Analysis Techniques
- Language, Discourse, Communication Strategies
- Service-Oriented Architecture and Web Services
- Digital Communication and Language
- Vehicle Routing Optimization Methods
- ICT Impact and Policies
- Information Retrieval and Search Behavior
- Spam and Phishing Detection
- Bullying, Victimization, and Aggression
- Web Data Mining and Analysis
- Optimization and Packing Problems
- Digital Platforms and Economics
Brock University
2024
University of Glasgow
2024
Queen Mary University of London
2021-2024
University of St Andrews
2024
Fujian Normal University
2024
University of Essex
2024
Jožef Stefan Institute
2024
Laurentian University
2024
Hong Kong University of Science and Technology
2024
University of Hong Kong
2024
We investigate to what extent models trained to detect general abusive language generalize between different datasets labeled with different abusive language types. To this end, we compare the cross-domain performance of simple classification models on nine datasets, finding that the models fail to generalize to out-domain datasets and that having at least some in-domain data is important. We also show that using frustratingly simple domain adaptation (Daume III, 2007) in most cases improves results over in-domain training, especially when used to augment a smaller dataset with a larger one.
Martin Tutek, Ivan Sekulić, Paula Gombar, Ivan Paljak, Filip Čulinović, Filip Boltužić, Mladen Karan, Domagoj Alagić, Jan Šnajder. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016.
We present XHate-999, a multi-domain and multilingual evaluation data set for abusive language detection. By aligning test instances across six typologically diverse languages, XHate-999 for the first time allows for disentanglement of the domain transfer and language transfer effects in abusive language detection. We conduct a series of domain- and language-transfer experiments with state-of-the-art monolingual and multilingual transformer models, setting strong baseline results and profiling XHate-999 as a comprehensive evaluation resource for abusive language detection. Finally, we show that domain- and language-adaption, via intermediate masked...
Personality and demographics are important variables in social sciences and computational sociolinguistics. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first dataset of Reddit comments of 10k users partially labeled with three personality models and demographics (age, gender, and location), including 1.6k users labeled with the well-established Big 5 personality model. We showcase the usefulness of this dataset on three experiments, where we leverage the more readily available data from other personality models to predict the Big 5 traits, analyze gender...
Personality and demographics are important variables in social sciences, while in NLP they can aid in interpretability and removal of societal biases. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first large-scale dataset of Reddit comments labeled with three personality models (including the well-established Big 5 model) and demographics (age, gender, and location) for more than 10k users. We showcase the usefulness of this dataset on three experiments, where we leverage the more readily available data from other...
We address the task of automatically detecting toxic content in user generated texts. We focus on exploring the potential for preemptive moderation, i.e., predicting whether a particular conversation thread will, in the future, incite a toxic comment. Moreover, we perform a preliminary investigation of whether a model that jointly considers all comments in a conversation thread outperforms a model that considers only individual comments. Using an existing dataset of conversations among Wikipedia contributors as a starting point, we compile a new large-scale dataset for this task consisting of labeled...
Protocol standards, defined by the Internet Engineering Task Force (IETF), are crucial to the successful operation of the Internet. This paper presents a large-scale empirical study of IETF activities, with a focus on understanding collaborative activities and how these underpin the publication of standards documents (RFCs). Using a unique dataset of 2.4 million emails, 8,711 RFCs and 4,512 authors, we examine the shifts and trends within the standards development process, showing how protocol complexity and the time to produce standards have increased. With these observations in...
Policy agenda research is concerned with measuring the policymaker activities. Topic classification has proven a valuable tool for policy agenda research. However, manual topic coding is extremely costly and time-consuming. Supervised topic classification offers a cost-effective and reliable alternative, yet it introduces new challenges, the most significant of which are the training set coding, classifier design, and the accuracy-efficiency trade-off. In this work, we address these challenges in the context of the recently launched Croatian Policy Agendas...
Effective projection-based cross-lingual word embedding (CLWE) induction critically relies on the iterative self-learning procedure. It gradually expands the initial small seed dictionary to learn improved cross-lingual mappings. In this work, we present ClassyMap, a classification-based approach to self-learning, yielding more robust and more effective induction of projection-based CLWEs. Unlike prior self-learning methods, our approach allows for integration of diverse features into the iterative process. We show the benefits of ClassyMap for bilingual lexicon induction: we report consistent...
The Internet Engineering Task Force (IETF) has developed many of the technical standards that underpin the Internet. The standards development process followed by the IETF is open and consensus-driven, but is inherently both a social and political activity, and latent influential structures might exist within the community. Exploring and understanding these structures is essential to ensuring the IETF’s resilience and openness. We use network analysis to explore the social graph of IETF participants, based on public email discussions and co-author relationships, and the influence of key...
An important concept in organisational behaviour is how hierarchy affects the voice of individuals, whereby members of a given organisation exhibit differing power relations based on their hierarchical position. Although there have been prior studies of the relationship between hierarchy and voice, they tend to focus on more qualitative and small-scale methods and do not account for structural aspects of the organisation. This paper develops large-scale computational techniques utilising temporal network analysis to measure the effect...
When tweeting on a topic, Twitter users often post messages that convey the same or similar meaning. We describe TweetingJay, a system for detecting paraphrases and semantic similarity of tweets, with which we participated in Task 1 of SemEval 2015. TweetingJay uses a supervised model that combines semantic overlap and word alignment features, previously shown to be effective for detecting semantic textual similarity. TweetingJay reaches 65.9% F1-score and ranked fourth among the 18 participating systems. We additionally provide an analysis of the dataset and point to some...
In this paper we present the TakeLab-QA entry to SemEval 2017 task 3, which is a question-comment re-ranking problem. We present a classification based approach, including two supervised learning models – Support Vector Machines (SVM) and Convolutional Neural Networks (CNN). We use features based on different semantic similarity models (e.g., Latent Dirichlet Allocation), as well as several types of pre-trained word embeddings. Moreover, we also use some hand-crafted task-specific features. For training, our system uses no...
Nowadays it is becoming more important than ever to find new ways of extracting useful information from the ever-growing amount of user-generated data available online. In this paper, we describe the creation of a data set that contains news articles and corresponding comments from the Croatian news outlet 24 sata. Our annotation scheme is specifically tailored for the task of detecting stances and sentiment of user comments as well as assessing if commentator claims are verifiable. Through this data, we hope to get a better understanding of the public's viewpoint on...